
Need Help Deciphering this Trap

Posted: Sat Jun 30, 2012 8:09 pm
by MasterFrmMO88
I have a MOD5270. Here's what it does: On startup, it uses the DIP switch on the DEV board to set its own IP. Then it starts a TCP client that requests a connection from the server every second. Once connected, it receives commands from the server that trigger GPIO pins, and then it sends a ready signal back to the server. Say it receives a "1" from the server: it sets GPIO pin 15 high, then low again, then sends a "1" back to the server to signal that it is ready for the next command. I have been running longevity tests that send a command to the NB every second. The test keeps sending commands and records the time at which a failure occurs (the server does not receive the ready signal). The times have been sporadic and disappointing to say the least: anywhere from 38 seconds to 2.7 hours. So I hooked it up to MTTY, and this is the trap that occurs, causing these failures.

Code:

Waiting 2sec to start 'A' to abort
Configured IP = 192.168.100.3
Configured Mask = 255.255.255.0
MAC Address= 00:03:f4:05:a9:a5
The netburner id is 1
Static IP address of192.168.100.3
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connected!

Read 1 bytes: 1
.
.
.
Read 1 bytes: 1

Trap occured
 Vector=77FMT =00 SR =2004 FS =00
 Faulted PC = 02004F32
D0:00000000 00012000 0000002E 200027C4 000000D4 000000D5 000000D6 000000D7
A0:20000664 20000538 02009C40 02008588 02008588 000000A5 200027FC 200027A4
Waiting 2sec to start 'A' to abort
Configured IP = 192.168.100.3
Configured Mask = 255.255.255.0
MAC Address= 00:03:f4:05:a9:a5
The netburner id is 17
Static IP address of192.168.100.19
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001
Connecting to: 192.168.100.2 : 43001

Can anyone tell me what might be going on to cause this trap? Would it be hardware or code related?

EDIT: There does seem to be a potential correlation in the failure times. The shortest times always occur in the first several tests of a group, and the times get progressively longer with each test that is run. (Shorter = quick failure, e.g. 38 seconds; longer = successful for a while before failure, e.g. 2.7 hours.)

Re: Need Help Deciphering this Trap

Posted: Sat Jun 30, 2012 9:48 pm
by Ridgeglider
Try using the WinAddr2Line tool in PC Tools (described in NetBurnerPcTools.pdf) with the faulted PC address. If you have a bad pointer, though, this won't help. You could also single-step through your code with the debugger to see where things fail.

Re: Need Help Deciphering this Trap

Posted: Sun Jul 01, 2012 11:27 am
by pbreed
Turn on smart traps in your init code...

#include <smarttrap.h>

#ifndef _DEBUG
EnableSmartTraps();
#endif


Use WinAddr2Line as described earlier.

Some questions:

Are you overflowing your stack?
This is one of the more common reasons for this sort of failure...
Realize that any local classes, structs or buffers are put on the task stack.
The default task stack is about 8K.
So a single char mybuffer[10000]; would blow the stack....
If the variable is only used in one task, i.e. the function is only called from one line of logic, use static variables.
I.e. static char myBuffer[10000]; does not go on the stack.
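
To make the stack-versus-static point concrete, here is a minimal sketch in portable C++. The function name and buffer size are hypothetical (the size just mirrors the 10000-byte example above); it is not taken from the poster's application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical size, mirroring the 10000-byte example in the post.
constexpr std::size_t kBufSize = 10000;

// 'static' gives the array static storage duration: it is reserved once,
// outside any task stack, instead of being carved out of the caller's
// stack on every call. Trade-off: the function is not reentrant, so it
// should only ever be called from one task.
std::size_t CopyIntoStaticBuffer(const char* msg) {
    assert(msg != nullptr);
    static char buffer[kBufSize];
    std::strncpy(buffer, msg, kBufSize - 1);
    buffer[kBufSize - 1] = '\0';  // strncpy does not terminate on truncation
    return std::strlen(buffer);
}
```

Dropping the static keyword here would put all 10000 bytes on the calling task's stack, which is exactly the overflow case described above for an 8K stack.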



Do you have any interrupt routines you have written yourself?
Inside an interrupt routine you may not call any functions that block, so no printf, OSTimeDly, or
any OS function other than the "post" functions.
The right way to structure this is to have the interrupt tickle a semaphore with an OSSemPost, then have a task wake up and
do the actual dirty work.
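
The post-then-wake structure can be sketched on a desktop compiler as follows. This is a host-side simulation under stated assumptions, not NetBurner code: EventCounter, FakeTimerIsr and WorkerTaskDrain are hypothetical stand-ins, and on the real board the ISR side would be the OSSemPost call mentioned above, with the task blocking on the matching pend call from the RTOS library.

```cpp
#include <atomic>

// Hypothetical stand-in for an RTOS counting semaphore (illustration only).
struct EventCounter {
    std::atomic<int> pending{0};

    // ISR side: record one event and return immediately. Never blocks.
    void Post() { pending.fetch_add(1); }

    // Task side: consume one event if any is pending.
    bool TryPend() {
        int n = pending.load();
        while (n > 0) {
            // On failure, n is reloaded with the current value and we retry.
            if (pending.compare_exchange_weak(n, n - 1)) return true;
        }
        return false;
    }
};

EventCounter g_timerEvent;
int g_workDone = 0;

// "ISR": acknowledge the hardware (omitted here), post, and get out.
// No printing, no delays, no blocking calls.
void FakeTimerIsr() { g_timerEvent.Post(); }

// Task body: drain posted events and do the slow work in task context.
int WorkerTaskDrain() {
    int handled = 0;
    while (g_timerEvent.TryPend()) {
        ++g_workDone;  // the "dirty work" (printing, I/O, ...) belongs here
        ++handled;
    }
    return handled;
}
```

Because the semaphore counts, events posted while the task is busy are not lost; the task simply handles them on its next pass.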

Are you doing anything else that might be considered strange?

Re: Need Help Deciphering this Trap

Posted: Sun Jul 01, 2012 7:36 pm
by MasterFrmMO88
I do have smart traps enabled, but I call EnableSmartTraps() outside the #ifndef _DEBUG block. Will this cause it not to work properly?

The way the program works is, on startup, it starts up the TCP task and inside the TCP task, before it hits the request connection loop, starts a DMA timer. After starting the TCP task the main just enters an empty while(1) loop. The TCP task blocks until it receives a command which sets a variable to zero and switches the corresponding GPIO pin to high. While the TCP task is running, the DMA timer triggers an interrupt every ~0.002 seconds which checks the value of the variable. If the variable has reached 7, it sets the GPIO pin back to low. If it is less than 7, it increments it. There are 32 of these individual variables. With the current test, only one is in use at a time (being incremented or checked). I ended up using the DMA interrupt because I had trouble figuring out how to use the semaphore. (All of my work up to this point has been taking examples and altering them to fit my needs, I'm still kind of learning about how individual things work)
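
The counter scheme described above can be sketched in portable C++ like this. It is a host-side model with assumptions: the kIdle sentinel, the g_pinHigh array standing in for the GPIO pins, and all function names are hypothetical, and the counter is made to stop once the pulse ends so it can never wrap.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kChannels = 32;           // one counter per output, as in the post
constexpr std::uint8_t kTicksHigh = 7;  // threshold from the post
constexpr std::uint8_t kIdle = 8;       // hypothetical sentinel: no pulse active

// Shared between task and "ISR", hence volatile.
volatile std::uint8_t g_ticks[kChannels];
volatile bool g_pinHigh[kChannels];     // stand-in for the real GPIO pins

void InitPulses() {
    for (int ch = 0; ch < kChannels; ++ch) {
        g_ticks[ch] = kIdle;
        g_pinHigh[ch] = false;
    }
}

// Task side: a command arrived, so start a pulse on this channel.
void StartPulse(int ch) {
    assert(ch >= 0 && ch < kChannels);
    g_ticks[ch] = 0;
    g_pinHigh[ch] = true;
}

// ISR side: called once per timer tick (about every 2 ms in the post).
void TimerTick() {
    for (int ch = 0; ch < kChannels; ++ch) {
        const std::uint8_t t = g_ticks[ch];
        if (t == kIdle) continue;        // idle channels are left alone
        if (t >= kTicksHigh) {
            g_pinHigh[ch] = false;       // threshold reached: end the pulse
            g_ticks[ch] = kIdle;
        } else {
            g_ticks[ch] = static_cast<std::uint8_t>(t + 1);
        }
    }
}

// Convenience helper for experiments: run n consecutive ticks.
void RunTicks(int n) {
    for (int i = 0; i < n; ++i) TimerTick();
}
```

In this model the pin goes low on the eighth tick after the command, so with a 2 ms tick the pulse lasts between 14 and 16 ms depending on where in the tick period the command lands, which is consistent with the roughly 15 ms pulses mentioned in this thread.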

I don't believe a stack overflow is occurring, but I could be wrong. The TCP task stack size is 4096 and the buffer for sending and receiving depends on the data size, which never exceeds a byte or two.

WinAddr2Line using the faulted PC address points to the main, where the call is made to start the TCP task.

To give the code in a nutshell we basically have this:

Here's the code for the TCP task after the DMA startup and a connection has been made (Set_01 is just J2[15].set();):

Code:

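do {
	// Read a message from the server and execute the matching command.
	n = read( fdnet, RXBuffer, RX_BUFSIZE-1 );
	if ( n <= 0 ) break;	// closed or error: RXBuffer holds no valid data
	RXBuffer[n] = '\0';	// terminate before printing as a string
	com = RXBuffer;
	switch (*com) {
	case '1':
		Set_01();
		numbers[1] = 0;
		writestring( fdnet, "\r1\n" );
		break;
	// (remaining cases omitted from this excerpt)
	}
	iprintf( "Read %d bytes: %s\n", n, RXBuffer );
} while ( n > 0 );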
Here's the interrupt:

Code:

INTERRUPT( func_isr, 0x2100 )
{
	sim.timer[0].ter |= 0x02;	// Clear the DTIM0 reference event flag
	numbers[1]++;			// Count one timer tick for pin 15
	if (numbers[1] > 7)
		J2[15].clr();		// Pulse time elapsed: drive the pin low
}

Re: Need Help Deciphering this Trap

Posted: Mon Jul 02, 2012 5:29 am
by pbreed
Do you turn the DMA timer interrupt on and off?

If so you probably have a race condition in that code and are triggering a spurious interrupt.

Another thing to look at, are your pulses the length you expect?
It's possible that the DMA timer interrupt is stuck on, i.e. you're not clearing it correctly...
The ISR happens and then returns, executes one assembly instruction from the main code, then happens again....

I always put a counter in my interrupt routines so I know how often they are happening....


Lastly make sure that all of the variables shared between normal code and interrupt are declared volatile.

Is the behavior different between release and debug builds?
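
A rough sketch of the counter idea, written as portable C++ so it can be tried on a desktop. The names g_isrCount, TimerIsrBody and IsrRateHz are hypothetical; on the board the increment would sit inside the real interrupt routine and the rate would be printed from a task.

```cpp
#include <cstdint>

// Written by the interrupt, read by a task, so it must be volatile.
volatile std::uint32_t g_isrCount = 0;

// Body of the (simulated) interrupt routine: count every invocation.
void TimerIsrBody() {
    g_isrCount = g_isrCount + 1;
    // ... clear the timer event flag and do the real ISR work here ...
}

// Task side: turn two samples of the counter into an interrupt rate.
// Unsigned subtraction stays correct even if the counter wrapped in between.
double IsrRateHz(std::uint32_t before, std::uint32_t after, double elapsedSeconds) {
    const std::uint32_t delta = static_cast<std::uint32_t>(after - before);
    return delta / elapsedSeconds;
}
```

For an interrupt that fires roughly every 2 ms, the reported rate should sit near 500 Hz. A figure far above that would suggest the interrupt is retriggering immediately because its flag is not being cleared.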

Re: Need Help Deciphering this Trap

Posted: Mon Jul 02, 2012 5:41 am
by MasterFrmMO88
The behavior isn't any different between release and debug. The DMA timer gets turned on and off only when the server tells the module to do so, which happens once on startup to test the condition of each GPIO. Pulses appear to be the length I expect, but it's difficult for me to accurately time ~15 milliseconds. The pulses all appear to be even; none of them seem drastically short or anything. One thing I am trying: I have gotten rid of all extraneous code like printf's. None of them were in the interrupt; the TCP program is built off the example one, so it would print any bytes it received to MTTY. Maybe it's getting hung up trying to execute a line while trying to trigger a pin. I don't know if that would affect anything, but at this point it's worth a shot, and I'm going to run it again.

One thing that's probably worth noting is I dropped an old app into the module and got similar results. The old one didn't use a DMA timer, it used OSTimeDly's instead. A lot slower because the code was single task and linear but it still gave me the same traps at random times.

Re: Need Help Deciphering this Trap

Posted: Mon Jul 02, 2012 4:03 pm
by MasterFrmMO88
Taking away the printf's didn't change the results. Would anyone be willing to take a look at my code, if I send you a zip file with it, and possibly help shed some light on this? I'm just stumped and there is a good chance it is just because I could have coded it a better way but I have no idea how.

Thanks.

Re: Need Help Deciphering this Trap

Posted: Mon Jul 02, 2012 5:34 pm
by tod
I think you can help yourself more by following Paul's advice and turning on Smart Traps. The call should be wrapped in the #ifndef _DEBUG as Paul showed because you can't have smart traps and the network debugger at the same time. In case it's not obvious, you should make a RELEASE build of your app (not a Debug build) and run it until it traps. It will work even if it's not wrapped, but the MTTY output you showed indicates that smart traps were not enabled. You should get the standard output for smart traps as shown in the NetBurnerPcTools manual in the section on using WinAddr2Line (p 24). It might not help, but if it does it's certainly going to be a lot faster than trying to comb through your code.

I'm not sure what your level of expertise is, but using global variables (I'm assuming that's what the numbers[] array is) from multiple tasks without any type of mutual exclusion is not a good idea. While I don't see how what you are doing with numbers[] can cause a crash, if you are using other global memory data (especially pointers) that would be the first place I'd look. There are also ways to check your remaining stack space to make sure that's not the problem. If you have the memory, try doubling or quadrupling your stack size; if the problem goes away, that was it. If it only delays the problem, you're probably slowly leaking stack memory.
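
One generic way to check remaining stack space is "stack painting": fill the stack with a known pattern before the task runs, then later count how many words still hold the pattern. The sketch below models that on a plain array in portable C++. Everything in it is an assumption for illustration (the pattern value, the function names, and a stack that grows downward from the high end of the array); it is not the NetBurner API, so check the RTOS documentation for the built-in equivalent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kPaint = 0xDEADBEEFu;  // arbitrary fill pattern

// Fill the whole stack area with the pattern before the task starts.
void PaintStack(std::uint32_t* stack, std::size_t words) {
    for (std::size_t i = 0; i < words; ++i) stack[i] = kPaint;
}

// Count untouched words from the low end. With a downward-growing stack,
// that is the headroom the task has never used (its high-water mark).
std::size_t UnusedWords(const std::uint32_t* stack, std::size_t words) {
    std::size_t n = 0;
    while (n < words && stack[n] == kPaint) ++n;
    return n;
}

// Test helper: pretend the task pushed 'used' words from the top down.
void SimulateUse(std::uint32_t* stack, std::size_t words, std::size_t used) {
    assert(used <= words);
    for (std::size_t i = words - used; i < words; ++i) stack[i] = 0;
}
```

If the unused count ever reaches zero, or keeps shrinking the longer the application runs, the stack is the likely culprit and should be enlarged or relieved of large local buffers.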

I saw you removed printf's and since they and their brethren are not type safe they are a common source of traps. However, that type of trap usually happens pretty consistently every time you hit the printf.

A common debugging technique is to use cout (OK, you can use iprintf because you won't be passing any parameters) to put out messages solely for the purpose of finding the trapping code. Take a divide-and-conquer binary approach. Put in a couple to narrow down the possibilities. Then add a few more; rinse and repeat, and eventually you'll narrow in on the code that's trapping. As Paul mentioned, you can't use this technique in interrupts. Most of us keep very little code in interrupts anyway. In your nburn\docs folder is a manual, ucOSLibrary.pdf, with the syntax for setting up a semaphore. A more complete description and approach is available in Jean Labrosse's MicroC/OS-II book, but there are differences in syntax between the two. The world's shortest (yet still helpful) tutorial on using semaphores in interrupts is available in the response posted by Paul in this thread.
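
A minimal sketch of the breadcrumb technique in portable C++, using only fixed strings so no format arguments are involved. Checkpoint and g_lastCheckpoint are hypothetical names; on the board the output call would be whatever serial print routine the application already uses.

```cpp
#include <cstdio>

// Most recent breadcrumb. Also handy to inspect from a debugger.
const char* volatile g_lastCheckpoint = nullptr;

// Print a fixed label and flush, so the text is out before any crash.
// Only for task-level code; never call this from an interrupt routine.
void Checkpoint(const char* label) {
    g_lastCheckpoint = label;
    std::puts(label);
    std::fflush(stdout);
}
```

Place a few calls such as Checkpoint("before read") and Checkpoint("after switch") around the suspect region. The last label printed before the trap brackets the fault; move the checkpoints inward and repeat.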

Re: Need Help Deciphering this Trap

Posted: Mon Jul 02, 2012 8:10 pm
by MasterFrmMO88
Okay. Thanks for the thorough answer. Unfortunately, my expertise is rather limited. Everything I know about it now I learned from taking example programs, tweaking them, and seeing what effect my tweaks had on the programs. The good news is I do understand enough about it to get where you are going with your answer. One thing that might be worth mentioning: I halved the interrupt frequency (the interrupt code trips half as often) and halved the threshold on numbers[] at which it turns the pins off, so they are, effectively, left high for close to the same amount of time. Doing that got me a test out to 2 hours, and only 2 hours because I had to stop the test to move from one computer to another. My question is: if turning the interrupt frequency down gave a much more successful test, could this be an indicator of what is/was causing the issue?

As for your suggestions, I will start working on narrowing down the area where the traps are occurring. I am having an issue with the debug build of the app: it won't turn the pins back off; it turns them on and then leaves them on, so I'll need to figure that out. On that note, I have been running the release app for the previous tests. But I'll try to sort that out and get smart traps running and see if that can give a little more information than I have at the moment.