How do I troubleshoot traps without a line number?

mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Re: How do I troubleshoot traps without a line number?

Post by mx270a »

I haven't created any interrupt routines.

The most active task is QUADSERIAL, which reads data from three serial ports, parses that data, does some math, then sends some data back out one of those serial ports. The messages are coming in at 10Hz on two of the serial ports, so it doesn't sit idle for very long.

The HTTP task allows the user to see the status of the system, which is the data that QUADSERIAL is working with. A couple of days ago I added UCOS_ENTER_CRITICAL(); and UCOS_EXIT_CRITICAL(); around the code in the HTTP task that accesses variables modified by the QUADSERIAL task, hoping that would prevent them from being modified while the HTTP task is reading them. I'm still getting traps, though.

The TELNET task allows a console connection via TELNET, and the FTP task allows read/write access to the SD card. Neither of these tasks should be doing anything right now, as I'm not making connections to the device on those ports.

Thanks,
Lance
mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Post by mx270a »

Resurrecting an old thread for an update. 10 months in and I'm finally making progress on this.

I was 99% certain that the trap was occurring in a certain block of code where I was generating some HTML for an AJAX request. Since I couldn't get a line number, I decided to use a serial port for diagnostics and log one character at various points in that block of code. Then when a trap occurred, I would be able to see how far through the code block it got. The result was inconsistent: it would trap at various spots.
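For anyone wanting to try the same one-character logging, here is a minimal sketch of the idea in plain C++. It is not the code from my project: `g_diagPort`, `Crumb` and `BuildAjaxReply` are hypothetical names, and a `std::string` stands in for the diagnostic serial port so the sketch runs on a desktop. On the real board the marker would be written straight to the spare UART.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the diagnostic serial port. On the target this
// would be a one-byte write to the spare UART instead of a string append.
static std::string g_diagPort;

// Emit a single marker character at a checkpoint.
inline void Crumb(char c)
{
    assert(c != '\0');        // markers must be visible in the log
    g_diagPort.push_back(c);
}

// Hypothetical block under suspicion, instrumented at each step.
// `failAt` simulates a trap by leaving early at that step (0 = no trap).
void BuildAjaxReply(int failAt)
{
    Crumb('a');               // entered the block
    if (failAt == 1) return;
    Crumb('b');               // first field appended
    if (failAt == 2) return;
    Crumb('c');               // second field appended
    if (failAt == 3) return;
    Crumb('\n');              // block finished cleanly
}
```

A complete pass leaves a full line in the log; after a trap, the last partial line shows the final checkpoint that was reached.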

Someone suggested that I may be overflowing the available memory in the stack, so at the beginning of this code block I added some code to check the lengths of the string variables that will be going into the HTML string and output the result to my diagnostic serial port. Traps would hopefully occur after this point, so I could see if the strings were larger than usual on the iteration when the trap occurs. The result is that I found the strings were all the expected lengths.

Next I decided to write some of those strings to the serial port so I could actually "see" the data before it goes into the HTML. Below is an excerpt of the output of two strings that should always contain "SOL_COMPUTED" and "NARROW_INT":

SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOLROW_INTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_CO_INTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMINTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL.9673TED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,»ÌOW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_ÿÿÿÿ"00,,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT

As you can tell, something isn't right. If I measured the length of the string, it would always be correct, but actually reading the string would sometimes result in garbage.

So I have two tasks running here. One is a serial port task that reads data from the serial ports, parses it, and writes values to some variables. The second task (with lower priority) services HTTP requests, reads some of the variables that the serial port task writes to, and builds a string of HTML output.

I'm pretty sure what was happening was that my HTTP task, while reading the shared strings, was sometimes being preempted so the higher priority serial port task could parse incoming data. That task would update the variables. String variables have an unknown length, so I suspect that when they are written, they can get put in a different place in memory. Meanwhile, their old location is now free, and other data can be written to that location. When the serial task is done, the HTTP task picks up where it left off, presumably reading data from the original memory address, which now contains unknown data. Small variables like INT can be read in a single access, so they should be immune to this interruption (a DOUBLE is 64 bits, so on a 32-bit processor it may not be), but reading strings requires many clock cycles.

The really odd part is that I have used UCOS_ENTER_CRITICAL and UCOS_EXIT_CRITICAL around this code block and still experienced the trap. Those commands should prevent task switching, and thus prevent my HTTP task from being interrupted mid-read. Upon reading the user manual section on shared variables, I opted to try semaphores as a better form of locking. Looks like semaphores are a better solution anyway as they don't block all tasks.
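Roughly, the locking pattern looks like the sketch below. This is an illustration in standard C++, not my actual NetBurner code: `std::mutex` stands in for the RTOS semaphore, and `GpsStatus`, `PublishStatus`, `SnapshotStatus` and `FormatRow` are hypothetical names. The writer swaps new strings in under the lock, and the reader copies everything it needs under the same lock and then builds its output from the private copy.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical shared state: written by the serial task, read by the HTTP task.
struct GpsStatus {
    std::string solStatus;   // e.g. "SOL_COMPUTED"
    std::string posType;     // e.g. "NARROW_INT"
};

static GpsStatus  g_status;
static std::mutex g_statusLock;   // stand-in for the RTOS semaphore

// Serial task side: publish a freshly parsed pair of fields.
void PublishStatus(const std::string& sol, const std::string& pos)
{
    GpsStatus next{sol, pos};     // allocate the new strings before locking
    {
        std::lock_guard<std::mutex> hold(g_statusLock);
        g_status.solStatus.swap(next.solStatus);
        g_status.posType.swap(next.posType);
    }                             // lock released here
}                                 // old buffers are freed after the unlock

// HTTP task side: take a private copy of both fields in one locked step.
GpsStatus SnapshotStatus()
{
    std::lock_guard<std::mutex> hold(g_statusLock);
    return g_status;
}

// Build output from the private copy only, never from g_status directly.
std::string FormatRow(const GpsStatus& s)
{
    return s.solStatus + "," + s.posType;
}
```

Because the reader works from its own copy, a later update by the writer cannot change or free the strings it is formatting.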

WinAddr2Line has always said the trap was being caused by std::string::_Rep::_S_create. I believe that is correct and that _S_create has a fatal problem when it encounters some specific character that cannot be converted to ASCII. Obviously it shouldn't see non-ASCII characters in a string because they shouldn't be in the string in the first place. In my case, the underlying data was being changed mid-read because I didn't understand how to work with shared variables.

I currently have 10 hours of uptime on new code using semaphores and no corrupted data or traps yet. Fingers crossed that this was it.

-Lance
pbreed
Posts: 1087
Joined: Thu Apr 24, 2008 3:58 pm

Post by pbreed »

Should be USER_ENTER_CRITICAL() and USER_EXIT_CRITICAL(), not the UCOS versions...
UCOS_EXIT_CRITICAL has some side effects... and should probably not be exposed in the header files...
It's intended for use by the RTOS itself in some very limited cases...

Also, the std:: library stuff calls new for allocation, which calls malloc, which relies on a protecting critical section...

If you are in a USER_ENTER_CRITICAL, OSLock, or UCOS_ENTER_CRITICAL situation when you call anything that
will call malloc, then the critical section around the malloc is ignored, because any other action would cause an immediate deadlock...
So calling ANYTHING in the way of I/O or std:: stuff inside those locks is a bad idea...

Much better to set up an OS_CRIT and use that to protect the specific variables...

In C++ there is even a nice OS_CRIT management object, so you can forget about the release: going out of scope frees it...


Something like

OS_CRIT The_CritGuardingMyStuff;
// One time, someplace, initialize this (OSCritInit takes a pointer)...
OSCritInit(&The_CritGuardingMyStuff);


// Then when you need to access the objects or variables you are guarding...

{
    // This constructor will obtain ownership of the critical section before returning...
    OSCriticalSectionObj LockingObject(The_CritGuardingMyStuff);

    // ... read or write the guarded variables here ...

    // So no matter how you leave this scope (return, break, flow of control, etc.),
    // when LockingObject goes out of scope the destructor will release the hold on the critical section.
    // You don't have to remember to do so.
} // End of scope of LockingObject; the critical section protection is not valid beyond this }
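
For readers without the NetBurner headers, the same scope-based idea can be sketched in standard C++. This is only an analogue, not the NetBurner implementation: `std::mutex` stands in for OS_CRIT, and `ScopedCrit`, `g_holders` and `UpdateShared` are hypothetical names. The counter exists only so the release can be checked from a test.

```cpp
#include <cassert>
#include <mutex>

static std::mutex g_crit;          // stand-in for an OS_CRIT
static int g_holders = 0;          // debug counter: guards currently alive
static int g_sharedValue = 0;      // the variable being protected

// Minimal scope guard: acquire in the constructor, release in the destructor.
class ScopedCrit {
public:
    explicit ScopedCrit(std::mutex& m) : m_(m) { m_.lock(); ++g_holders; }
    ~ScopedCrit() { --g_holders; m_.unlock(); }
    ScopedCrit(const ScopedCrit&) = delete;
    ScopedCrit& operator=(const ScopedCrit&) = delete;
private:
    std::mutex& m_;
};

// Rejects negative input with an early return. The guard's destructor runs
// on both exit paths, so the lock is released either way.
bool UpdateShared(int v)
{
    ScopedCrit guard(g_crit);
    assert(g_holders == 1);        // we hold the lock inside this scope
    if (v < 0) return false;       // early exit, still releases
    g_sharedValue = v;
    return true;
}
```

The early `return false` is the case that is easy to get wrong with a manual enter/leave pair; with the guard object there is no release call to forget.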








Paul
mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Post by mx270a »

New problem: now the NetBurner will lock up randomly. The Ethernet link light still reflects whether the cable is plugged in, but I lose the ability to ping the unit and all tasks appear to be hung.

I have used both OS_CRIT and Semaphores to control access to the variables that are shared between my serial and http tasks. Both methods do appear to resolve the issue with strings being written by one task when another task is reading them. Without this, the corrupted strings will cause a Trap. When I add either control method in, no Traps, but now random lockups.

Now what do I do?
rnixon
Posts: 833
Joined: Thu Apr 24, 2008 3:59 pm


Post by rnixon »

Just a guess, but it sounds like you are still contending with the same memory problem. By adding the critical sections you may have a priority inversion that locks up the task switching. As difficult as it may be, if it was me I would remove the critical sections and take a look at my code architecture to determine why the memory corruption is occurring. Otherwise the bug may continue to manifest itself in different ways.
mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Post by mx270a »

I've taken the verbose logging approach to trying to figure this out. I have two tasks running, one for the serial ports, the other for HTTP requests. I am now logging a single character at various spots throughout each task, so that I can figure out where in the task it is dying. My serial task logs numbers, my HTTP task logs letters. The output from each looks like "1234567890" and "abcdefghijklmn" respectively. A sample of the output is below:

1234567890
1234567890
1234567890
1234567890
abcdefghijklmn
1234567890
1234567890
1234567890
1234567890
abcdefghijklmn
1234567890
1234567890
abcdefghijklmn
1234567890
Running that code, with a Semaphore to do blocking for shared variables, below is the last output I get when it locks up.

abcdefghijklmn
1234567890
1234567890
1234567890
abcdef
12gh34567890ijklmn
I can see it starts running the http task by the "abcdef", but is interrupted by the serial task which has higher priority. The serial task runs some code which logs "12", but then runs into a semaphore block so it goes back to the http task to run "gh" at which point the semaphore is released. At that point it goes back to the higher priority serial port task to finish, then goes back to the http task to finish. Then the box locks up.

So the logging indicates that it isn't locking up inside one of my suspected tasks, but somewhere else.
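
Reading interleaved output like this by eye gets tedious, so here is a small sketch of how the log could be untangled on a PC. It assumes, as in my logging, that the serial task only emits digits and the HTTP task only emits letters. `SplitTrace` and `EndsOnCompletePass` are hypothetical helper names, not part of any NetBurner tool.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Marker characters separated back out by the task that logged them.
struct TaskTrails {
    std::string serial;   // digit markers from the serial task
    std::string http;     // letter markers from the HTTP task
};

// Split a raw interleaved log into one trail per task.
TaskTrails SplitTrace(const std::string& raw)
{
    TaskTrails t;
    for (unsigned char ch : raw) {
        if (std::isdigit(ch))      t.serial.push_back(static_cast<char>(ch));
        else if (std::isalpha(ch)) t.http.push_back(static_cast<char>(ch));
        // newlines and any other bytes are ignored
    }
    return t;
}

// True if `trail` is made of whole repetitions of `pass`, meaning the task
// finished its last pass instead of stopping partway through.
bool EndsOnCompletePass(const std::string& trail, const std::string& pass)
{
    assert(!pass.empty());
    if (trail.size() % pass.size() != 0) return false;
    for (std::size_t i = 0; i < trail.size(); ++i) {
        if (trail[i] != pass[i % pass.size()]) return false;
    }
    return true;
}
```

Feeding in the last two lines before the lockup gives a complete "1234567890" and a complete "abcdefghijklmn", which matches the reading above: both tasks finished their pass before the box hung.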
tod
Posts: 587
Joined: Sat Apr 26, 2008 8:27 am
Location: Southern California

Post by tod »

You might want to cast the net even wider. Are these the only two tasks running? Do you have a main event loop running? Do you know if the NetBurner has crashed, or are you stuck in some race condition? I would at least add a single character output to any other loops (e.g. "." or "*"); that way you'll see if the NB is still up and churning away and it's only these two tasks that are locked up. If you have lint, you could try linting your code to see if you get any useful warnings.
rnixon
Posts: 833
Joined: Thu Apr 24, 2008 3:59 pm


Post by rnixon »

This is a really long thread, so I don't know if it's been asked before, but can you run task scan? Ping the device?
mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Post by mx270a »

rnixon wrote:This is a really long thread, so I don't know if it's been asked before, but can you run task scan? Ping the device?
No, not when using a semaphore or OS_CRIT to lock the shared variables that the two tasks use. All activity stops, I cannot ping it, and I have to power cycle it to bring it back online. The only indication of life I see is the link and activity lights on the Ethernet jack.
mx270a
Posts: 80
Joined: Tue Jan 19, 2010 6:55 pm

Post by mx270a »

tod wrote:You might want to cast the net even wider. Are these the only two tasks running? Do you have a main event loop running? Do you know if the NetBurner has crashed, or are you stuck in some race condition? I would at least add a single character output to any other loops (e.g. "." or "*"); that way you'll see if the NB is still up and churning away and it's only these two tasks that are locked up. If you have lint, you could try linting your code to see if you get any useful warnings.
I just removed two other tasks, one for FTP and one for Telnet, and put the main task at a higher priority than the serial and web tasks. I'll be curious to see what it does now.

Previously, the main task would lock up too. The main task blinks the two LEDs on the side of the PK70, and when it locks up, those LEDs quit blinking too.