Re: How do I troubleshoot traps without a line number?

Posted: Thu May 30, 2013 10:10 am
by mx270a
I haven't created any interrupt routines.

The most active task is QUADSERIAL, which reads data from three serial ports, parses that data, does some math, then sends some data back out one of those serial ports. The messages are coming in at 10Hz on two of the serial ports, so it doesn't sit idle for very long.

The HTTP task allows the user to see the status of the system, which is the data that QUADSERIAL is working with. A couple of days ago I added UCOS_ENTER_CRITICAL(); and UCOS_EXIT_CRITICAL(); around the code in the HTTP task that reads variables modified by the QUADSERIAL task, hoping that would prevent those variables from being modified while the HTTP task is reading them. I'm still getting traps, though.

The TELNET task allows a console connection via Telnet, and the FTP task allows read/write access to the SD card. Neither of these tasks should be doing anything right now, as I'm not making connections to the device on those ports.

Thanks,
Lance

Re: How do I troubleshoot traps without a line number?

Posted: Mon Jan 13, 2014 9:03 pm
by mx270a
Resurrecting an old thread for an update. 10 months in and I'm finally making progress on this.

I was 99% certain that the trap was occurring in a certain block of code where I was generating some HTML for an AJAX request. Since I couldn't get a line number, I decided to use a serial port for diagnostics and log one character at various points in that block of code. Then, when a trap occurred, I would be able to see how far through the code block it got. The result was inconsistent: it would trap at various spots.

Someone suggested that I may be overflowing the available memory in the stack, so at the beginning of this code block I added some code to check the lengths of the string variables that will be going into the HTML string and output the result to my diagnostic serial port. Traps would hopefully occur after this point, so I could see if the strings were larger than usual on the iteration when the trap occurs. The result is that I found the strings were all the expected lengths.

Next I decided to write some of those strings to the serial port so I could actually "see" the data before it goes into the HTML. Below is an excerpt of the output of two strings that should always contain "SOL_COMPUTED" and "NARROW_INT":

SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOLROW_INTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_CO_INTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMINTD,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL.9673TED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,»ÌOW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_ÿÿÿÿ"00,,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT
SOL_COMPUTED,NARROW_INT

As you can tell, something isn't right. If I measured the length of the string, it would always be correct, but actually reading the string would sometimes result in garbage.
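
Since the length check passed while the contents were garbage, a content check is the more useful diagnostic. A rough sketch, assuming the two fields should only ever hold the values named above; the function names are hypothetical, and the character set (uppercase letters, digits, underscore) is inferred from those two values only:

```cpp
#include <cctype>
#include <string>

// True if every character is an uppercase letter, a digit or an underscore.
// Catches binary garbage, but not a mix of two valid strings.
inline bool LooksLikeStatusToken(const std::string &s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!(std::isupper(c) || std::isdigit(c) || c == '_')) return false;
    }
    return true;
}

// Stricter check against the values these fields are expected to hold.
// The list is an assumption based on the two values seen in the log.
inline bool IsKnownStatus(const std::string &s) {
    return s == "SOL_COMPUTED" || s == "NARROW_INT";
}
```

A line like SOLROW_INTD passes the character check but fails the known-value check, which is why the stricter comparison is worth logging as well.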

So I have two tasks running here. One is a serial port task that reads data from the serial ports, parses it, and writes values to some variables. The second task (with lower priority) services HTTP requests, reads some of the variables that the serial port task writes to, and builds a string of HTML output. I'm pretty sure what was happening was that my HTTP task reading the shared strings was sometimes being interrupted so the higher priority serial port task could parse incoming data. That task would update the variables. String variables have an unknown length, so I suspect every time they are written, they get put in a different place in memory. Meanwhile, their old location is now free, and other data can be written to that location. When the serial process is done, the HTTP task picks up where it left off, presumably reading data from the original memory address which now contains unknown data. Small variables like INT and DOUBLE can be read in one clock cycle, so they're immune to this interruption, but reading strings requires many clock cycles.

The really odd part is that I have used UCOS_ENTER_CRITICAL and UCOS_EXIT_CRITICAL around this code block and still experienced the trap. Those commands should prevent task switching, and thus prevent my HTTP task from being interrupted mid-read. Upon reading the user manual section on shared variables, I opted to try semaphores as a better form of locking. Looks like semaphores are a better solution anyway as they don't block all tasks.
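
The general idea of locking the shared strings can be sketched in portable C++. This is not NetBurner code: `std::mutex` stands in for the RTOS semaphore, and `SharedStatus`, `UpdateStatus` and `RenderStatus` are hypothetical names. The reader copies the shared strings while holding the lock and only builds its output from the private copies afterwards, so the lock is held briefly:

```cpp
#include <mutex>
#include <string>

// Data written by the serial task and read by the HTTP task.
struct SharedStatus {
    std::mutex lock;         // stand-in for the RTOS semaphore
    std::string solution;    // e.g. "SOL_COMPUTED"
    std::string position;    // e.g. "NARROW_INT"
};

// Writer side: replace both strings as one unit.
inline void UpdateStatus(SharedStatus &st, const std::string &sol,
                         const std::string &pos) {
    std::lock_guard<std::mutex> guard(st.lock);
    st.solution = sol;
    st.position = pos;
}

// Reader side: take a private snapshot under the lock, then format it
// outside the lock.
inline std::string RenderStatus(SharedStatus &st) {
    std::string sol;
    std::string pos;
    {
        std::lock_guard<std::mutex> guard(st.lock);
        sol = st.solution;
        pos = st.position;
    }
    return sol + "," + pos;
}
```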

WinAddr2Line has always said the trap was being caused by std::string::_Rep::_S_create. I believe that is correct and that _S_create has a fatal problem when it encounters some specific character that cannot be converted to ASCII. Obviously it shouldn't see non-ASCII characters in a string because they shouldn't be in the string in the first place. In my case, the underlying data was being changed mid-read because I didn't understand how to work with shared variables.

I currently have 10 hours of uptime on new code using semaphores and no corrupted data or traps yet. Fingers crossed that this was it.

-Lance

Re: How do I troubleshoot traps without a line number?

Posted: Thu Jan 16, 2014 7:57 am
by pbreed
It should be USER_ENTER_CRITICAL() and USER_EXIT_CRITICAL(), not the UCOS versions...
UCOS_EXIT_CRITICAL has some side effects and should probably not be exposed in the header files...
It's intended for use by the RTOS itself in some very limited cases...

Also, the std:: library stuff calls new for allocation, which calls malloc, which relies on a protecting critical section...

If you are in a USER_ENTER_CRITICAL, OSLock or UCOS_ENTER_CRITICAL situation when you call anything that
will call malloc, then the critical section around the malloc is ignored, since any other action would cause an immediate deadlock...
So calling ANYTHING in the way of I/O or std:: stuff inside those locks is a bad idea...

Much better to set up an OS_CRIT and use that to protect the specific variables...

In C++ there is even a nice OS_CRIT management object, so you can forget about the release: going out of scope frees it...


Something like

OS_CRIT The_CritGuardingMyStuff;
// One time, someplace, initialize this...
OSCritInit(&The_CritGuardingMyStuff);


//then when you need to access the objects or variables you are guarding...

{
    // This constructor obtains ownership of the critical section before returning...
    OSCriticalSectionObj LockingObject(The_CritGuardingMyStuff);

    // ... access the guarded objects or variables here ...

    // So no matter how you leave this scope (return, break, flow of control, etc.),
    // when LockingObject goes out of scope the destructor releases the hold on the critical section.
    // You don't have to remember to do so.
} // End of scope of LockingObject; the critical section protection is not valid beyond this }
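
For anyone unfamiliar with the scope-based pattern, here is a self-contained illustration in plain C++. It does not use the NetBurner API: `FakeCrit` is a hypothetical stand-in for OS_CRIT that only counts holders, and `ScopedCrit` plays the role that OSCriticalSectionObj plays above.

```cpp
// Hypothetical stand-in for OS_CRIT: it only counts how many holders exist.
struct FakeCrit {
    int depth = 0;
    void enter() { ++depth; }
    void leave() { --depth; }
};

// Scope guard: acquires in the constructor, releases in the destructor.
class ScopedCrit {
public:
    explicit ScopedCrit(FakeCrit &c) : crit_(c) { crit_.enter(); }
    ~ScopedCrit() { crit_.leave(); }
    ScopedCrit(const ScopedCrit &) = delete;             // a guard must not be copied
    ScopedCrit &operator=(const ScopedCrit &) = delete;

private:
    FakeCrit &crit_;
};

// Reads a guarded value. Both return paths release the critical section,
// because the guard's destructor runs whenever the scope is left.
inline int ReadGuarded(FakeCrit &crit, const int *value) {
    ScopedCrit guard(crit);
    if (value == nullptr) return -1;  // early return: still released
    return *value;
}
```

After either call path returns, `depth` is back to 0, which is the property that makes the guard object safer than paired enter/leave calls.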

Paul

Re: How do I troubleshoot traps without a line number?

Posted: Fri Feb 21, 2014 8:33 am
by mx270a
New problem: now the NetBurner locks up randomly. The Ethernet link light still shows whether the cable is plugged in, but I lose the ability to ping the unit and all tasks appear to be hung.

I have used both OS_CRIT and Semaphores to control access to the variables that are shared between my serial and http tasks. Both methods do appear to resolve the issue with strings being written by one task when another task is reading them. Without this, the corrupted strings will cause a Trap. When I add either control method in, no Traps, but now random lockups.

Now what do I do?

Re: How do I troubleshoot traps without a line number?

Posted: Fri Feb 21, 2014 10:54 am
by rnixon
Just a guess, but it sounds like you are still contending with the same memory problem. By adding the critical sections you may have a priority inversion that locks up the task switching. As difficult as it may be, if it was me I would remove the critical sections and take a look at my code architecture to determine why the memory corruption is occurring. Otherwise the bug may continue to manifest itself in different ways.

Re: How do I troubleshoot traps without a line number?

Posted: Tue Feb 25, 2014 9:36 am
by mx270a
I've taken the verbose logging approach to trying to figure this out. I have two tasks running, one for the serial ports, the other for http requests. I am now logging a single character at various spots throughout each task, so that I can figure out where in the task it is dying. My serial task logs numbers and my http task logs letters. The output from each looks like "1234567890" and "abcdefghijklmn" respectively. A sample of the output is below:

1234567890
1234567890
1234567890
1234567890
abcdefghijklmn
1234567890
1234567890
1234567890
1234567890
abcdefghijklmn
1234567890
1234567890
abcdefghijklmn
1234567890

Running that code with a semaphore guarding the shared variables, below is the last output I get when it locks up:

abcdefghijklmn
1234567890
1234567890
1234567890
abcdef
12gh34567890ijklmn

I can see the http task start with "abcdef", but it is then interrupted by the serial task, which has higher priority. The serial task runs some code that logs "12", then blocks on the semaphore, so control goes back to the http task, which logs "gh" and releases the semaphore. At that point the higher-priority serial task resumes and finishes, then the http task finishes. Then the box locks up.

So the logging indicates that it isn't locking up inside one of my suspected tasks, but somewhere else.
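
When the two traces interleave like this, it can help to split them back out per task and check that each sequence is complete. A small sketch, assuming (as above) that the serial task logs only digits and the http task logs only letters; `SplitTrace` and `SplitByTask` are hypothetical names:

```cpp
#include <cctype>
#include <string>

// Per-task view of an interleaved trace.
struct SplitTrace {
    std::string serial;  // digits, logged by the serial task
    std::string http;    // letters, logged by the http task
};

// Separate an interleaved trace into the characters each task logged.
// Anything that is neither a digit nor a letter (e.g. newlines) is ignored.
inline SplitTrace SplitByTask(const std::string &trace) {
    SplitTrace out;
    for (unsigned char c : trace) {
        if (std::isdigit(c)) {
            out.serial.push_back(static_cast<char>(c));
        } else if (std::isalpha(c)) {
            out.http.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Feeding it the last two lines of the lockup trace gives "1234567890" for the serial task and "abcdefghijklmn" for the http task, i.e. both tasks completed a full pass before the hang.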

Re: How do I troubleshoot traps without a line number?

Posted: Tue Feb 25, 2014 11:13 am
by tod
You might want to cast the net even wider. Are these the only two tasks running? Do you have a main event loop running? Do you know if the NetBurner has crashed, or are you stuck in some race condition? I would at least add a single character output (e.g. "." or "*") to any other loops. That way you'll see if the NB is still up and churning away and it's only these two tasks that are locked up. If you have lint, you could try linting your code to see if you get any useful warnings.

Re: How do I troubleshoot traps without a line number?

Posted: Tue Feb 25, 2014 12:37 pm
by rnixon
This is a really long thread, so I don't know if it's been asked before, but can you run task scan? Ping the device?

Re: How do I troubleshoot traps without a line number?

Posted: Tue Feb 25, 2014 1:00 pm
by mx270a
rnixon wrote: This is a really long thread, so I don't know if it's been asked before, but can you run task scan? Ping the device?

No, not when using a semaphore or OS_CRIT to lock the shared variables that the two tasks use. All activity stops, I cannot ping it, and I have to power cycle it to bring it back online. The only signs of life I see are the link and activity lights on the Ethernet jack.

Re: How do I troubleshoot traps without a line number?

Posted: Tue Feb 25, 2014 1:03 pm
by mx270a
tod wrote: You might want to cast the net even wider. Are these the only two tasks running? Do you have a main event loop running? Do you know if the NetBurner has crashed, or are you stuck in some race condition? I would at least add a single character output (e.g. "." or "*") to any other loops. That way you'll see if the NB is still up and churning away and it's only these two tasks that are locked up. If you have lint, you could try linting your code to see if you get any useful warnings.

I had two other tasks, one for FTP and one for Telnet. I just removed them and put the main task at a higher priority than the serial and web tasks. I'll be curious to see what it does now.

Previously, the main task would lock up too. The main task blinks the two LEDs on the side of the PK70, and when it locks up, those LEDs quit blinking too.