SOMRT1061 system crashes

ephogy
Posts: 40
Joined: Fri Aug 29, 2008 12:53 pm

SOMRT1061 system crashes

Post by ephogy »

I've made a breakout board for the SOMRT1061 so that I can use it as a drop-in replacement for systems using the MOD5441X board.

I'm using the same code base to compile for the MOD5441X board under the 2.9.7 kernel and compiling for the SOMRT1061 under the 3.5.7 kernel.

Both systems run, but the build on the SOMRT1061 hits a trap at random intervals: sometimes within a few minutes, sometimes after an hour or more.

More often than not, I get a trap very similar to this:


-------------------Trap information-----------------------------

Trap Vector        =  (04)

MMFSR               = 82
MMFAR               = A9000008
FPCAR              = 602BEF80
xPSR               = 60000004
PriMask            = 01
FaultMask          = 00
BasePri            = 00
Faulted PC         = 00009082

-------------------Register information-------------------------
R0     =A9000000 R1     =20001CEC R2     =00000000 R3     =00000001
R4     =00000009 R5     =20001CEC R6     =20001D0C R7     =20001D2C
R8     =60340BEC R9     =00000004 R10    =20001E80 R11    =202018C4
IP[R12]=00000000 SP[R13]=2000BFD0 LR[R14]=6006684B PC[R15]=00009082
XPSR   =81000200
-------------------RTOS information-----------------------------
Priority masking indicates trap from within ISR or CRITICAL RTOS section

Current task prio  = 00000026
Current task TCB   = 20002050
This looks like a valid TCB
The current running task is: Enet#26
-------------------Task information-----------------------------
Task    | State    |Wait| Call Stack
SPI#21|Mailbox   |0001|00008CF6,000095BE,000090D8,6002A7A8,000112BC
FIFO#22|Mailbox   |0008|00008CF6,000095BE,000090D8,600115CA,000112BC
TIMER#24|Mailbox   |0014|00008CF6,000095BE,000090D8,6005A4B6,000112BC
Enet#26|Running   |    |00009082,00000000,2000FF40
HTTP#27|Semaphore |0014|00008CF6,000095BE,00009010,60066B80,60066C14,60066C44,600657FA,000112BC
Config Server#28|Semaphore |0077|00008CF6,000095BE,00009010,60066B80,60066BDC,6006DBA2,60060F22,000112BC
MODB#2C|Timer     |0001|00008CF6,000095BE,00008D68,6004A918,000112BC
LCD#31|Mailbox   |0001|00008CF6,000095BE,000090D8,60012534,000112BC
Main#32|Mailbox   |04AA|00008CF6,000095BE,000090D8,60008DC0,6004A5B2,000112BC
SSDP#3D|Fifo      |000E|00008CF6,000095BE,0000930C,0000BCA0,6004C2C8,000112BC
BKG#3E|Queue     |000E|00008CF6,000095BE,000091E4,6000B74C,000112BC
Idle#3F|Ready     |    |6007A142,000112BC

-------------------Process Stack Dump----------------------------
2000BFB0: A9000000 20001CEC 00000000 00000001 00000000 6006684B 00009082 81000200 
2000BFD0: 6032C698 00000001 6006684B 2000FF40 00000000 2000C048 00000083 202B8036 
2000BFF0: 6034078C 6006687B 00000000 000040FF 20003B74 20002050 6034078C 00009601 
2000C010: 6034078C 0000953B 2000C048 000020A5 00000001 202B7FC0 00000036 202B8000 
2000C030: 00000014 202029A8 20001E84 00000000 202018C4 00004899 202B7FC0 202B8022 
2000C050: 202B800E 202B7F00 202B8036 202B0000 202B7FC0 0000B081 20200C40 00000001 
2000C070: 202018B0 20001E80 202029A8 00007273 00000000 00000000 00000E7D 00000E7E 
2000C090: 00000000 00000004 00000005 00000006 00000007 00000008 00000009 0000000A 
2000C0B0: 0000000B 000112BD 60365941 00000000 00000000 00000000 00000000 00000000 
2000C0D0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C0F0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C110: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C130: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C150: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C170: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C190: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C1B0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C1D0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C1F0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
2000C210: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 

-------------------Stack dump------------------------------------
20001CA0: 00000003 00000000 00000003 00000000 00000003 00000000 60360B00 0000000F 
20001CC0: 0000000F 0000000F 00000000 00000000 00000000 00000000 00000000 00000000 
20001CE0: 00000000 00000000 2000FF40 00000009 00000000 00000000 00000000 00000088 
20001D00: 00000000 00000000 00000000 00000008 00000000 00000000 00000000 00000080 

The trap always occurs with PriMask = 1.

The Ethernet task is typically the one where the trap occurs. Looking at the memory map, the Ethernet send stack ends at 2000C0C8.

Since the stack here appears full, I figured compiling with NBRTOS_STACKCHECK, NBRTOS_STACKOVERFLOW, and NBRTOS_STACKUNDERFLOW would immediately throw exceptions, but it makes no difference.

As a last-ditch effort, I doubled the Ethernet stack size, and now the crashes have stopped (at least for the last 4 hours, which is longer than I've ever had it go).

I don't have any of my Tasks or variables running in fast memory. Everything is statically allocated, but pointers are used.

I do have an ISR (running at a priority of 2) which posts to a mailbox, and I'm using one of the Timers, which also only posts to a mailbox. I don't think these are the culprits.

Just looking for some thoughts on where else I should be looking -- the MOD5441X has no crash/reboot issues, I've had this code running out in the field for years...

Thanks!
ephogy
Posts: 40
Joined: Fri Aug 29, 2008 12:53 pm

Re: SOMRT1061 system crashes

Post by ephogy »

Increasing the Ethernet stack simply increased the time it took to cause an error.

With TaskScan, I noticed one of my tasks was reporting only 800 bytes free of 16kB on the SOMRT1061 vs the MOD5441X unit reporting 15436 bytes free.

The function has a switch statement in it with some scoped variables -- I'd noticed these switch statements causing issues in another task specifically on the SOMRT1061. I've since removed the switch statement, and I've crossed my fingers.

I'm just trying to understand why a switch statement could cause these issues. I understand that gcc can sometimes be a little brain dead with scoped variables here, but since 2.9.7 and 3.5.7 both use the same gcc version, I'm not sure why this wouldn't be an issue on both architectures.
TomNB
Posts: 622
Joined: Tue May 10, 2016 8:22 am

Re: SOMRT1061 system crashes

Post by TomNB »

Looks like you're checking all the right things. The switch statement itself won't be a problem, but the code in it might be. Stack corruption is the first thing I would check too. I haven't actually seen a trap in the Ethernet task in all my time here. The NBRTOS is not the same underneath, so it's not as simple as just going from 2.x, but I don't recall anything similar to this. The fact that making the Ethernet stack bigger makes it take longer is a clue that it's memory related. It may not be the Ethernet task itself overflowing; something else in the program could be writing where it should not, and any change in the memory map then changes the behavior. If all else fails, I hunker down and start commenting things out until I see a difference, to try to locate the issue.
TomNB
Posts: 622
Joined: Tue May 10, 2016 8:22 am

Re: SOMRT1061 system crashes

Post by TomNB »

You could also run the 3.x code directly on the MOD5441X you have to see if there is any difference.
TomNB
Posts: 622
Joined: Tue May 10, 2016 8:22 am

Re: SOMRT1061 system crashes

Post by TomNB »

Everything here is a wild guess since I don't know anything about your app, but maybe it will help.

Couple of things:
- Have you always used the default stack sizes specified by the NetBurner system, or have you specified your own?
- The switch statement is where your stack frame must be getting the biggest. What is going on in the switch?
- What is in your ISR? Hopefully not floating point?

Trap Vector 04 on Cortex-M is the MemManage fault.
MMFSR = 0x82 = MMARVALID (0x80) + DACCVIOL (0x02): a data access to an illegal address, and MMFAR is valid.
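
If it helps when reading future dumps, here is a minimal sketch of that decode in C. The bit positions are the architectural MMFSR flags on Cortex-M; the helper name `mmfsr_to_string` is made up for illustration and is not part of the NetBurner SDK.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* MemManage Fault Status Register flags (low byte of the Cortex-M CFSR). */
#define MMFSR_IACCVIOL  0x01u  /* instruction fetch from a protected/illegal address */
#define MMFSR_DACCVIOL  0x02u  /* data access to a protected/illegal address */
#define MMFSR_MUNSTKERR 0x08u  /* fault while unstacking on exception return */
#define MMFSR_MSTKERR   0x10u  /* fault while stacking on exception entry */
#define MMFSR_MLSPERR   0x20u  /* fault during lazy floating-point state save */
#define MMFSR_MMARVALID 0x80u  /* MMFAR holds the faulting address */

/* Hypothetical helper: writes the names of the set flags into buf,
 * separated by spaces, and returns buf. */
static const char *mmfsr_to_string(uint8_t mmfsr, char *buf, size_t len)
{
    static const struct { uint8_t mask; const char *name; } flags[] = {
        { MMFSR_IACCVIOL,  "IACCVIOL"  },
        { MMFSR_DACCVIOL,  "DACCVIOL"  },
        { MMFSR_MUNSTKERR, "MUNSTKERR" },
        { MMFSR_MSTKERR,   "MSTKERR"   },
        { MMFSR_MLSPERR,   "MLSPERR"   },
        { MMFSR_MMARVALID, "MMARVALID" },
    };
    size_t used = 0;

    if (len == 0) {
        return buf;
    }
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (mmfsr & flags[i].mask) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? " " : "", flags[i].name);
            if (n < 0 || (size_t)n >= len - used) {
                break;  /* out of room: stop appending */
            }
            used += (size_t)n;
        }
    }
    return buf;
}
```

Calling `mmfsr_to_string(0x82, buf, sizeof buf)` yields the two flags named above. A dump showing MSTKERR or MLSPERR instead would point at a fault during exception stacking rather than an ordinary load or store.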

MMFAR = 0xA9000008, and R0 = 0xA9000000. So the faulting code did effectively "load [R0 + 8]" with R0 holding a garbage pointer. 0xA9000000 is not a valid region on the RT1061 (RAM is 0x2000xxxx, ITCM code is the low 0x0000xxxx addresses, flash is 0x60xxxxxx). That 0xA9 looks like a byte smeared into the top of a pointer, which is a classic corruption signature.
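
A rough sketch of that sanity check, if you want to run suspicious pointers from a dump through it. The region bounds below are approximations I am assuming from the ranges described above and from addresses visible in your dump; the real bounds come from your linker script and memory map. `classify_address` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;
    uint32_t end;      /* inclusive */
    const char *name;
} mem_region_t;

/* ASSUMED, approximate regions; verify against the linker script. */
static const mem_region_t regions[] = {
    { 0x00000000u, 0x0007FFFFu, "ITCM (fast code)" },
    { 0x20000000u, 0x2007FFFFu, "DTCM (fast RAM)"  },
    { 0x20200000u, 0x202FFFFFu, "OCRAM"            },
    { 0x60000000u, 0x6FFFFFFFu, "FlexSPI flash"    },
};

/* Returns the name of the region containing addr, or NULL if the
 * address falls outside every region in the table. */
static const char *classify_address(uint32_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++) {
        if (addr >= regions[i].start && addr <= regions[i].end) {
            return regions[i].name;
        }
    }
    return NULL;
}
```

With this table, 0xA9000008 (R0 + 8) matches nothing, while the faulting PC 0x00009082 and the stack pointer 0x2000BFD0 both land in known regions, which is consistent with valid code dereferencing a bad pointer.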

The faulting PC (0x9082) and the repeated 0x8CF6 / 0x95BE / 0x90D8 return addresses are ITCM-resident system/RTOS code (copied to fast RAM). So a system/driver routine was handed a corrupted pointer that something clobbered earlier.

PriMask = 1 ("trap from within ISR or CRITICAL section") does not implicate your ISR. It just means that by the time the bad pointer was dereferenced, execution was inside a protected region. The corruption happened earlier; this is just where it detonated. That is also why it keeps landing on whatever task happens to be running (often Enet) rather than the task that actually overflowed.

Why the RT1061 burns so much more stack than the 5441X:

Same GCC front end, completely different back end. "Same compiler version" does not mean "same code generation." The target machine description, ABI, and FPU presence all differ:

FPU register spilling. The Cortex-M7 has a hardware FPU; the 5441X (ColdFire V4) does not. Any float/double work, including a stray %f in a print or doubles in your switch cases, pulls in FP registers and 8-byte doubles that get spilled to the stack. The result is different, larger frames.

FPU lazy interrupt stacking. This is the big one. On the M7, every interrupt that preempts a task pushes an exception frame onto that task's stack. With the FPU active (your FPCAR is populated, so it is), that is the extended frame: up to 26 words / 104 bytes, lazily stacking S0-S15 plus FPSCR. You have an ISR at priority 2 and a timer ISR. So each task's true worst case is deepest call nesting plus one or more nested interrupt frames, each up to ~104 bytes. ColdFire's interrupt frame is tiny by comparison and there is no FP lazy stack. A task that lived comfortably for years on the 5441X can sit right at the edge on the RT, and an interrupt arriving at the deepest moment is what tips it over, corrupting the neighbor below it.

8-byte stack alignment (AAPCS) adds padding the ColdFire ABI does not.
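
To put numbers on it, here is a back-of-the-envelope budget in C. The frame sizes are the architectural Cortex-M values (8 words basic, 26 words with FP state). How many frames actually land on a given task's stack depends on how the RTOS uses the process and main stacks and on what the context switch saves, which I am not asserting here, so that count is a parameter. The function name is hypothetical.

```c
#include <stdint.h>

enum {
    BASIC_FRAME_BYTES = 8 * 4,   /* R0-R3, R12, LR, PC, xPSR */
    EXT_FRAME_BYTES   = 26 * 4,  /* basic frame + S0-S15 + FPSCR + 1 reserved word */
    ALIGN_PAD_BYTES   = 4        /* possible pad to keep SP 8-byte aligned */
};

/* Hypothetical estimate: deepest call nesting plus the exception frames
 * that can be pushed onto the same stack at that moment.
 * frames_on_stack is an ASSUMPTION you supply for your system. */
static uint32_t worst_case_stack_bytes(uint32_t call_depth_bytes,
                                       uint32_t frames_on_stack,
                                       int fpu_active)
{
    uint32_t frame = (fpu_active ? EXT_FRAME_BYTES : BASIC_FRAME_BYTES)
                     + ALIGN_PAD_BYTES;
    return call_depth_bytes + frames_on_stack * frame;
}
```

For example, a task whose deepest call path uses 15000 bytes, with two frames assumed and the FPU active, needs 15000 + 2 * 108 = 15216 bytes. The same inputs without FP state come to 15072.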

The switch / scoped-variable issue you noticed. GCC does not always overlap stack slots for variables in mutually exclusive case blocks. At low optimization levels (GCC shares fewer slots at -O0, and the -fstack-reuse option also affects this) the frame can be sized for the sum of the branches, not the max. Large per-case locals (arrays, big structs) stack up. This behavior is optimization-level and target dependent, so it can absolutely show on the RT build and not the ColdFire build if the SDK's default flags or -O level differ between the 2.9.7 and 3.5.7 toolchains. Are you doing debug or release builds?
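
One workaround that does not depend on the optimizer is to move each case body into its own non-inlined function, so a big buffer only occupies stack while that case is actually running. A minimal sketch follows; the message kinds, buffer sizes, and function names are all made up for illustration and are not from your code.

```c
#include <stddef.h>
#include <stdio.h>

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

enum msg_kind { MSG_STATUS, MSG_CONFIG };  /* hypothetical message kinds */

/* Each handler owns its large local. The buffer exists only for the
 * duration of the call, so the dispatcher's own frame stays small. */
static NOINLINE size_t format_status(void)
{
    char buf[512];
    int n = snprintf(buf, sizeof buf, "status=ok");
    return n > 0 ? (size_t)n : 0;
}

static NOINLINE size_t format_config(int version)
{
    char buf[1024];
    int n = snprintf(buf, sizeof buf, "config version=%d", version);
    return n > 0 ? (size_t)n : 0;
}

/* The dispatcher holds no large locals, whether or not the compiler
 * would have overlapped them. */
static size_t handle_message(enum msg_kind kind)
{
    switch (kind) {
    case MSG_STATUS:
        return format_status();
    case MSG_CONFIG:
        return format_config(2);
    }
    return 0;
}
```

To see what the compiler really did, GCC's -fstack-usage option writes a per-function .su file with frame sizes, and -Wstack-usage=BYTES warns when a function exceeds a limit. Comparing those numbers between the two builds should show whether the switch function's frame is the one that grew.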

Why NBRTOS_STACKCHECK / OVERFLOW / UNDERFLOW might not fire:

Those checks validate the guard pattern at context-switch time. Your failure mode either (a) corrupts a neighbor and then dereferences the bad pointer within the same time slice, or (b) faults during hardware exception stacking. Both happen before the scheduler ever re-runs the check, and the fatal access is inside a critical section (PriMask=1) where the scheduler can't preempt. So passing the stack checks does NOT prove you aren't overflowing. They catch slow leaks, not a single deep-nesting plus interrupt spike.
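
For reference, this is roughly how pattern-based stack checking works in general. It is a sketch, not the NBRTOS implementation: the fill value and function names are hypothetical. It also shows the difference between a guard-word check and a high-water-mark scan like the one TaskScan reports.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu  /* hypothetical fill pattern */

/* Paint the whole stack region before the task first runs. */
static void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* High-water mark: the stack grows down toward base[0], so count the
 * untouched words from the low end. This reports the deepest use so far. */
static size_t stack_free_bytes(const uint32_t *base, size_t words)
{
    size_t i = 0;
    while (i < words && base[i] == STACK_FILL) {
        i++;
    }
    return i * sizeof(uint32_t);
}

/* Guard check: looks only at the lowest word, and only when called. */
static int guard_intact(const uint32_t *base)
{
    return base[0] == STACK_FILL;
}
```

Both checks only report what has already happened, at the moment they run. A low `stack_free_bytes` result, like your 800 bytes of 16 kB, is the early warning worth acting on, since a passing guard check says nothing about how close the next spike will come.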