How do I troubleshoot traps without a line number?

rnixon · Post by **rnixon** » Wed Feb 26, 2014 8:23 am

What are the actual priority numbers of your tasks in the current configuration?

mx270a · Post by **mx270a** » Wed Feb 26, 2014 8:49 am

rnixon wrote:What are the actual priority numbers of your tasks in the current configuration?

Yesterday afternoon I stripped the app down to these three tasks and moved Main up to a higher priority than the other two. I've seen the lockup twice since then.

Main = 41
QuadSerial = 42
HTTP = 45

rnixon · Post by **rnixon** » Wed Feb 26, 2014 10:17 am

You have a potential lockup situation here on HTTP if your main and quad serial tasks do not block. For example, if your main had a while loop in it that ran all the time with no blocking calls, then HTTP and quad serial would never run at all. That is an unusual priority architecture; main usually should not be a higher priority than anything else. I know you are just doing a test, but I think the test is showing you have a blocking and architecture problem in your code caused by adding the critical sections. Going back further, the critical sections seem to have been added to try to stop a memory corruption problem. So there is a possibility that you are far down the wrong path and should be removing the critical sections and determining what the memory corruption problem is. If it's shared variables, maybe use an RTOS semaphore instead of critical sections. How experienced are you with a real-time preemptive OS?

dpursell · Post by **dpursell** » Wed Feb 26, 2014 10:55 am

mx270a wrote: Running that code, with a Semaphore to do blocking for shared variables, below is the last output I get when it locks up.
Code:
abcdefghijklmn
1234567890
1234567890
1234567890
abcdef
12gh34567890ijklmn
I can see it starts running the http task by the "abcdef", but is interrupted by the serial task which has higher priority. The serial task runs some code which logs "12", but then runs into a semaphore block so it goes back to the http task to run "gh" at which point the semaphore is released. At that point it goes back to the higher priority serial port task to finish, then goes back to the http task to finish. Then the box locks up.

The fact that the system locks up directly after the HTTP task gets preempted seems to indicate that this is what's causing the problem. You may want to look into forcing this situation to occur more often so that you don't have to wait for hours to test any potential fixes.

A few ideas off the top of my head:
1. Try increasing the QuadSerial task rate to faster than 10Hz, even if you have to just make up fake data yourself instead of actually reading the serial lines. Look into NetBurner's HiResTimer library to get rates faster than the default OS 20Hz.
2. Artificially slow down the HTTP task. A simple way would be to wrap the entire function in a loop that repeats the HTML generation routine several times before actually writing it to the socket. (If you do it this way, verify the details of resetting your ostringstream; as I recall, that is slightly tricky to get right.)
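
A minimal sketch of the second idea, assuming the page is built with a std::ostringstream. RenderPage and BuildPageSlowly are made-up names standing in for the real HTML routine, and the page content is a placeholder. The part that is easy to get wrong is the reset: str("") empties the buffer, while clear() only resets the stream's error flags, so both are called.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for the real HTML generation routine.
void RenderPage(std::ostringstream& out, int reading) {
    out << "<html><body>reading=" << reading << "</body></html>";
}

// Builds the page `repeats` times to widen the window in which the task
// can be preempted, and returns only the final copy.
std::string BuildPageSlowly(int reading, int repeats) {
    assert(repeats >= 1);
    std::ostringstream out;
    for (int i = 0; i < repeats; ++i) {
        out.str("");   // discard the text from the previous pass
        out.clear();   // reset any error/eof flags on the stream
        RenderPage(out, reading);
    }
    return out.str();
}
```

Without the str("") call each pass would append to the previous one, and the browser would receive the page many times over.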

-David

mx270a · Post by **mx270a** » Wed Feb 26, 2014 11:05 am

All three tasks will sleep until there is something to do. I had Main at a priority of 50 until yesterday, moved to a higher priority to confirm that it wasn't waiting on one of the other tasks.

The memory corruption issue stemmed from the QuadSerial task writing data to a string variable that is read by the HTTP task. The problem was that reading that variable takes more than one CPU cycle, and sometimes the HTTP task would be interrupted so that the higher priority QuadSerial task could process new data. When a new string is written, it apparently goes to a new memory location, and when the HTTP task returns, the old memory location may now contain characters that crash the string reader function. So I added Semaphores to control access to the variables. This way the string cannot be written while in the middle of being read by the HTTP task.

My experience with real-time OS stuff is essentially zero. Just what I've learned by working with the NetBurner platform so far. It's highly likely that I'm doing something wrong, but dang, it is hard to troubleshoot an issue that gives no error message and only occurs hours after boot-up.

mx270a · Post by **mx270a** » Wed Feb 26, 2014 11:52 am

dpursell wrote:The fact that the system locks up directly after the HTTP task gets preempted seems to indicate that this is what's causing the problem. You may want to look into forcing this situation to occur more often so that you don't have to wait for hours to test any potential fixes

Clever idea. I added a for() loop to the active code in my HTTP task to run it 20x before returning data to the web browser. I'm now seeing the HTTP task preempted by the serial task 2 or 3 times each time the loop is called. 24 minutes in and I have a lockup. The last line indicates that the HTTP task was preempted in one of the sections protected by the semaphore, a consistent trend I've been seeing.

Code:

abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefg
12h34567890ijklmn
abcdefg
12h34567890ijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdef
12gh34567890ijklmn
abcdef
12gh34567890ijklmn

dpursell · Post by **dpursell** » Wed Feb 26, 2014 12:48 pm

mx270a wrote:The last line indicates that the HTTP task was preempted in one of the sections protected by the semaphore, a consistent trend I've been seeing.
Code:
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefg
12h34567890ijklmn
abcdefg
12h34567890ijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdefghijklmn
abcdef
12gh34567890ijklmn
abcdef
12gh34567890ijklmn

Can you provide any more code or details on what you mean by "protected by a semaphore"? I usually see semaphores as indicators that data is ready for other tasks that are waiting on it, which would be useful for example if you had a task that only wanted to do some processing every time new QuadSerial data was available.

This situation seems different in that the website doesn't care whether the data is new or has been reported many times already, it just needs the most recent snapshot of data without that data changing out from under it, which is a use case that lends itself better to locks (implemented by OS_CRIT for the NetBurner).

Just to make absolutely sure we're on the same page, what you'll want to do is:

1. Make sure both QuadSerial and HTTP tasks have access to the same lock (making it an extern global should be fine)
2. Initialize the lock using OSCritInit()
3. In both the HTTP and QuadSerial tasks:
- Before the first access to any QuadSerial global data in every function, lock the data using OSCritEnter()
- After the last access to any QuadSerial global data in every function, unlock it using OSCritLeave()

This is a very coarse locking scheme that will lead to relatively long delays while the tasks wait on each other, but the refresh rate seems very low so it probably won't even be noticeable. It should essentially prevent the task preemption we're seeing, and hopefully solve the problem. Apologies if you've already tried this exact thing!
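
The locking steps above can be sketched in portable C++ as follows. This is a desktop stand-in, not NetBurner code: std::mutex plays the role of the shared OS_CRIT object, std::lock_guard plays the role of the OSCritEnter()/OSCritLeave() pair, and the struct, field, and function names are invented for illustration.

```cpp
#include <mutex>
#include <string>

// Hypothetical shape of the data shared between the two tasks.
struct QuadSerialData {
    std::string position;
    std::string heading;
};

namespace {
std::mutex g_dataLock;   // the one lock both tasks can reach
QuadSerialData g_data;   // only touched while g_dataLock is held
}

// Called from the serial task whenever a new message has been parsed.
void PublishReading(const QuadSerialData& fresh) {
    std::lock_guard<std::mutex> hold(g_dataLock);  // "enter"
    g_data = fresh;
}                                                  // "leave" when hold goes out of scope

// Called from the HTTP task. The copy is made while the lock is held,
// so the page is built from a snapshot that cannot change underneath it.
QuadSerialData SnapshotReading() {
    std::lock_guard<std::mutex> hold(g_dataLock);
    return g_data;
}
```

Taking a snapshot and building the page from the copy keeps the lock held only for the duration of the copy, which limits how long the serial task can be kept waiting.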

-David

tod · Post by **tod** » Wed Feb 26, 2014 3:28 pm

In the AJAX V3 sample code I posted here on the forums, I do something I think might be similar to what you need. The TcpServer::ProcessMessage() method reads in data and then passes it off to another task by posting to a mailbox. After it posts its internal buffer, it pends on a semaphore, waiting until that buffer is released by the other task. I've put the relevant snippets of code on a gist so you don't have to download the AJAX sample if you don't want to. I'll reference the gist by line number, but if you want to see the full code you'll need to download the sample.

In TcpServer::ProcessMessage() you can see that once the task has data it will post the data in _rxBuffer to a mailbox (line 9). It checks for errors and then pends on a semaphore (line 12).

In Startup::MainEventLoop() you can see the code pends on a mailbox waiting for data from the TCP Task (line 30). Once it has the data it calls ParseIncomingMsg() passing along that buffer. Now ParseIncomingMessage is pretty simple and just sends the data with cout, but the important part is that once it's done with the buffer it posts to the semaphore being waited upon by the TCP task (line 51).

This completes the loop and makes for a safe (albeit synchronous) way to share a buffer between two tasks. You could do something similar with two semaphores, although you would need a more intrusive way of getting to the underlying buffer.
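
For readers without the gist to hand, the same post-then-wait handoff can be sketched in portable C++. This is not the sample's code: BufferHandoff and RunHandoffDemo are made-up names, and a std::mutex with a std::condition_variable stands in for the OS mailbox and semaphore.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// One-slot "mailbox" plus a "buffer released" flag.
class BufferHandoff {
public:
    // Producer: hand over the buffer, then block until the consumer is done.
    void PostAndWait(const std::string* buffer) {
        std::unique_lock<std::mutex> lock(mutex_);
        slot_ = buffer;
        released_ = false;
        changed_.notify_all();
        changed_.wait(lock, [this] { return released_; });
    }
    // Consumer: block until a buffer has been posted.
    const std::string* Pend() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return slot_ != nullptr; });
        return slot_;
    }
    // Consumer: tell the producer the buffer may be reused.
    void Release() {
        std::lock_guard<std::mutex> lock(mutex_);
        slot_ = nullptr;
        released_ = true;
        changed_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable changed_;
    const std::string* slot_ = nullptr;
    bool released_ = false;
};

// Sends every message through one reused buffer and returns what the
// consumer thread saw.
std::vector<std::string> RunHandoffDemo(const std::vector<std::string>& messages) {
    BufferHandoff handoff;
    std::vector<std::string> seen;
    std::thread consumer([&] {
        for (std::size_t i = 0; i < messages.size(); ++i) {
            const std::string* buffer = handoff.Pend();
            seen.push_back(*buffer);  // safe: the producer is blocked until Release()
            handoff.Release();
        }
    });
    std::string rxBuffer;  // single buffer, overwritten for each message
    for (const std::string& message : messages) {
        rxBuffer = message;
        handoff.PostAndWait(&rxBuffer);
    }
    consumer.join();
    return seen;
}
```

The producer never overwrites rxBuffer while the consumer is reading it, because it stays blocked inside PostAndWait() until Release() is called.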

rnixon · Post by **rnixon** » Thu Feb 27, 2014 9:30 am

This is turning into a great example of how to debug shared system resources. A key thing to consider is how long the writes take; you don't want to add a crit section that adds a lot of system latency. What type of data are you writing, how much data, and how long do you think it takes? If it's a continuous serial stream, is it a circular buffer?
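
If the serial data does end up in a circular buffer, a minimal fixed-capacity version might look like the sketch below. ByteRing is a made-up name, and the class is not thread-safe on its own: pushes and pops made from different tasks would still need a lock around them.

```cpp
#include <array>
#include <cstddef>

// Fixed-capacity FIFO of bytes that wraps around its storage.
template <std::size_t Capacity>
class ByteRing {
    static_assert(Capacity > 0, "ByteRing needs room for at least one byte");
public:
    // Returns false when full; the caller decides whether to drop the byte.
    bool Push(char c) {
        if (count_ == Capacity) return false;
        data_[(head_ + count_) % Capacity] = c;
        ++count_;
        return true;
    }
    // Returns false when empty; otherwise writes the oldest byte to `out`.
    bool Pop(char& out) {
        if (count_ == 0) return false;
        out = data_[head_];
        head_ = (head_ + 1) % Capacity;
        --count_;
        return true;
    }
    std::size_t Size() const { return count_; }
private:
    std::array<char, Capacity> data_{};
    std::size_t head_ = 0;   // index of the oldest byte
    std::size_t count_ = 0;  // number of bytes currently stored
};
```

Each push or pop touches a single byte and two counters, so a lock held around one call is held only briefly.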

mx270a · Post by **mx270a** » Thu Feb 27, 2014 11:06 am

@ dpursell: I have tested this with OSCrit as well. Same issue. The Semaphore code looks like this:

Code:

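OS_SEM MySemaphore;
//Initial count of 1, so the first task to pend gets through immediately.
OSSemInit(&MySemaphore, 1);

//This goes around code that accesses the shared variables.
//A timeout of 0 means wait forever.
OSSemPend(&MySemaphore, 0);
//Do stuff with shared vars
OSSemPost(&MySemaphore);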
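
One thing worth ruling out with a hand-written pend/post pair is a code path that leaves the protected section without posting, since the next pend would then wait forever. The sketch below shows a scope guard that makes the post automatic. It is portable C++ rather than NetBurner code: CountingSemaphore is a small stand-in built on std::mutex, and SemaphoreGuard and UpdateShared are made-up names.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>

// Desktop stand-in for an RTOS counting semaphore.
class CountingSemaphore {
public:
    explicit CountingSemaphore(int initial) : count_(initial) {}
    void Pend() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    // Non-blocking variant, handy for checking state in tests.
    bool TryPend() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }
    void Post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        available_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable available_;
    int count_;
};

// Pends on construction and posts on destruction, whatever the exit path.
class SemaphoreGuard {
public:
    explicit SemaphoreGuard(CountingSemaphore& sem) : sem_(sem) { sem_.Pend(); }
    ~SemaphoreGuard() { sem_.Post(); }
    SemaphoreGuard(const SemaphoreGuard&) = delete;
    SemaphoreGuard& operator=(const SemaphoreGuard&) = delete;
private:
    CountingSemaphore& sem_;
};

// Example of a protected section with an early return.
bool UpdateShared(CountingSemaphore& sem, std::string& shared, const std::string& value) {
    SemaphoreGuard guard(sem);
    if (value.empty()) return false;  // early exit still posts the semaphore
    shared = value;
    return true;
}
```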

@ tod: I do want to implement mailboxes to separate the serial port reading from the data processing. I'll have a read through your example. I'd like to get this current issue sorted out before I add more complexity to the situation.

@ rnixon: Heh, it will be a good example when I figure out a resolution. The data being read from the serial port is a string of comma-delimited text. The serial task splits those into separate strings and writes them to variables that are readable by the HTTP task. 20 messages per second, each with 12 strings. Latency doesn't seem to be an issue right now, random odd behavior is.
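
For reference, the splitting step described above can be sketched like this. SplitFields is a made-up name, and the sketch assumes plain comma delimiters with no quoting or escaping.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one comma-delimited line into its fields. Empty fields are kept,
// so "a,,b" yields three strings.
std::vector<std::string> SplitFields(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));  // last field on the line
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}
```

Because the result is a local vector, the parsing can happen outside any lock, and only the final copy into the shared variables needs to be protected.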

My current plan of attack is to remove all the unrelated code that I can until I get down to something that works properly.

NetBurner Community Forum
