Omega OM-2SP in the field boot failures

handg

Situation
My company relies upon the Onion Omega model OM-2SP in order to take responsibility for Ethernet communications between different types of our equipment. The Omega is housed on a custom PCB. Our application runs under Python3.

During our manufacturing process we make use of the sysupgrade facility to add to the standard Onion version of openWRT 18.06.4 some additional OS facilities such as DNSSD/zeroconf, mosquitto MQTT etc. and our Python code, together with some customer-specific configuration files. All equipment is soak-tested before dispatch.

We have many hundreds of units in the field, and most of them have been operating correctly for some time. But recently we have received a distressing number of units which our customers and technical support personnel have found not to be completing their boot sequence as expected.

All of these units are power-cycled several times a day; they are installed on public transport vehicles in more than one country, and these vehicles experience a wide range of operating practices and environmental conditions. It occurred to us that this may be different to a more standard sort of router application, where power is not cycled as frequently.

So we ran tests with a handful of units, both "virgin" and repaired, power cycling many thousands of times under many different timing regimes and power supply fluctuations.
Unfortunately, this has not produced any failures. In almost all cases, except for those retained for forensic examination, it was possible to restore normal operation by using the failsafe boot mode, factory reset and reprogramming with sysupgrade. However, this is definitely not an option for in-the-field repair because of the difficulty of gaining physical access to installed equipment - no boot, no Ethernet access, so a plain old serial connection has to be used in order to recover.

Forensics on failed units
All units appear to fail in the same way: during the very late stages of boot, the jffs2 system fails to reconstruct the overlay file system at some point. Using debug level 4, I obtained the following info (truncated here):

Board: Onion Omega2 APSoC DRAM:  128 MB
relocate_code Pointer at: 87f60000
flash manufacture id: ef, device id 40 19
find flash: W25Q256FV
*** Warning - bad CRC, using default environment

============================================ 
Onion Omega2 UBoot Version: 4.3.0.3
-------------------------------------------- 
ASIC 7628_MP (Port5<->None)
DRAM component: 1024 Mbits DDR, width 16
DRAM bus: 16 bit
Total memory: 128 MBytes
Flash component: SPI Flash
Date:Mar  1 2019  Time:11:20:35
============================================ 
icache: sets:512, ways:4, linesz:32 ,total:65536
dcache: sets:256, ways:4, linesz:32 ,total:32768 
CPU freq = 575 MHZ
Estimated memory size = 128 Mbytes
Resetting MT7628 PHY.
Initializing MT7688 GPIO system.
wifi mac address = 40A36BC38CAB.


**************************************
* Hold Reset button for more options *
**************************************



Boot Linux from Flash NO RESET PRESSED.
## Booting image at bc050000 ...
   Image Name:   MIPS OpenWrt Linux-4.14.97
   Image Type:   MIPS Linux Kernel Image (lzma compressed)
   Data Size:    1465911 Bytes =  1.4 MB
   Load Address: 80000000
   Entry Point:  80000000
   Verifying Checksum ... OK
   Uncompressing Kernel Image ... OK
No initrd
## Transferring control to Linux (at address 80000000) ...
## Giving linux memsize in MB, 128
...
[    4.753428] jffs2: notice: (362) jffs2_build_xattr_subsystem: complete building xattr subsystem, 421 of xdatum (2 unchecked, 419 orphan) and 424 of xref (419 dead, 0 orphan) found.
[    4.791477] mount_root: switching to jffs2 overlay
[    4.814203] overlayfs: upper fs does not support tmpfile.
...
[    6.706185] procd: Message:
[    6.709095] procd:  ACTION=add
[    6.712336] procd:  DEVPATH=/devices/platform/10000000.palmbus/10000b00.spi/spi_master/spi0/spi0.0/mtd/mtd6/mtdblock6
[    6.723321] procd:  SUBSYSTEM=block
[    6.726942] procd:  SYNTH_UUID=0
[    6.730362] procd:  MAJOR=31
[    6.733361] procd:  MINOR=6
[    6.736268] procd:  DEVNAME=mtdblock6
[    6.740121] procd:  DEVTYPE=disk
[    6.743471] procd:  SEQNUM=341
[    7.247465] procd: Coldplug complete
[    7.251255] procd: Change state 1 -> 2
[    7.255133] procd: - watchdog -
[    7.258393] procd: Set watchdog timeout: 30s
[    7.262954] procd: Ping
[    7.265539] procd: Opened watchdog with timeout 30s
[    7.270819] procd: - ubus -
[    7.273806] procd: Create service ubus
[    7.277749] procd: Start instance ubus::instance1
[    7.283266] procd: Started instance ubus::instance1[426]
[    7.288872] procd: running /etc/init.d/ubus running
[    7.311790] procd: glob failed on /etc/init.d/ubus
[    7.471054] random: ubusd: uninitialized urandom read (4 bytes read)
[    7.555354] random: ubusd: uninitialized urandom read (4 bytes read)
[    7.562425] random: ubusd: uninitialized urandom read (4 bytes read)
[    7.569935] procd: Connected to ubus, id=0f43e641
[    7.574841] procd: Change state 2 -> 3
[    7.578715] procd: - init -
[    7.582663] procd: Parsing inittab - ::sysinit:/etc/init.d/rcS S boot
[    7.589382] procd: Parsing inittab - ::shutdown:/etc/init.d/rcS K shutdown
[    7.596623] procd: Parsing inittab - ::askconsole:/usr/libexec/login.sh
[    7.604412] procd: Launched new askconsole action, pid=427
[    7.659470] procd: running /etc/rc.d/S* boot
Please press Enter to activate this console.

Failed to executPlease press Enter to activate this console.

Whenever ENTER is pressed, the final message re-appears, as in the last line of the above.

Forensics on working vs failed units
I observed that the jffs2 message is different on working and non-working units.
On newly programmed units, the message is of the form:

[    6.643306] jffs2: notice: (364) jffs2_build_xattr_subsystem: complete building xattr subsystem,
0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.

On each following reboot (s/w or power cycle) the count of orphan & dead references increments by 1. This surprised me.
For most failed units, the message contains much higher counts and looks like this:

289 of xdatum (2 unchecked, 287 orphan) and 291 of xref (287 dead, 0 orphan)
421 of xdatum (2 unchecked, 419 orphan) and 424 of xref (419 dead, 0 orphan)
228 of xdatum (2 unchecked, 226 orphan) and 233 of xref (226 dead, 0 orphan)

However, some failed units gave this, a lower and less consistent count:

26 of xdatum (2 unchecked, 24 orphan) and 29 of xref (24 dead, 0 orphan)
24 of xdatum (2 unchecked, 22 orphan) and 78 of xref (73 dead, 0 orphan)

The only common feature I can see between all the failed units is that they report a number of "unchecked" nodes. None of the working units appear to report this.

Final Questions
(If you've read this far, thank you!)

Can anyone explain why the orphan & dead references increment on every boot, even when I have disabled all application-level activity? @Lazar-Demin I believe you have also noted this anomaly in another thread, but because it relates to xrefs, did not believe it should cause problems?
Does anyone know the significance of the "unchecked" items in the failed units' messages? Would you expect this to cause problems?
Has anyone met a similar problem and found a robust solution?
(If you've read this far, thanks again!)

crispyoz

@handg I don't have a solution for you however I recall reading about an issue with jffs2 on some kernel version. I don't have time to look for it right now but maybe you can look into that.

uname -r to see your kernel version.

handg

@crispyoz Thanks for responding. Our OS on faulty units is reported as:
Linux version 4.14.97 (amery@72f72a0014fa) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 r0+7414-86a74e73e4)) #0 Fri Feb 1 16:26:17 2019
I will search the community for this, if possible.

handg

@crispyoz My search for problems with jffs2 and particular versions of OS did not reveal anything useful, I'm afraid.
Can anyone throw any light on this topic?

Julian1

I was searching for a similar data, had a go at doing likewise yet it didn't worked!