Onions failing to boot

Cooper Chan

Hello,

I have scoured the internet, several forums for some guidance or solution but have come up empty handed. The current issue at hand is multiple units that are failing to boot and we are unable to find a cause or solution.

We are using the Omega2S+, which is running vanilla OpenWRT 18.06. We build the sysupgrade image ourselves which contain our application (based on Python3). This is placed on our own carrier board that contains its own step down power supply capable of supplying 3v3 at 1.5A (OKI-78SR-3.3/1.5). Only the serial, Ethernet pins and necessary pins are routed. SPI is left floating.

We have shipped thousands of units with Onions integrated into them worldwide and we are seeing the same problem in all those locations. They leave our factory perfectly functioning and then after a while they will be reported as broken. Please see enclosed bootlog:


   ____       _             ____
  / __ \___  (_)__  ___    / __ \__ _  ___ ___ ____ _
 / /_/ / _ \/ / _ \/ _ \  / /_/ /  ' \/ -_) _ `/ _ `/
 \____/_//_/_/\___/_//_/  \____/_/_/_/\__/\_, /\_,_/
 W H A T  W I L L  Y O U  I N V E N T ? /___/"

Board: Onion Omega2 APSoC DRAM:  128 MB
relocate_code Pointer at: 87f60000
flash manufacture id: ef, device id 40 19
find flash: W25Q256FV
*** Warning - bad CRC, using default environment

============================================
Onion Omega2 UBoot Version: 4.3.0.3
--------------------------------------------
ASIC 7628_MP (Port5<->None)
DRAM component: 1024 Mbits DDR, width 16
DRAM bus: 16 bit
Total memory: 128 MBytes
Flash component: SPI Flash
Date:Mar  1 2019  Time:11:20:35
============================================
icache: sets:512, ways:4, linesz:32 ,total:65536
dcache: sets:256, ways:4, linesz:32 ,total:32768
CPU freq = 575 MHZ
Estimated memory size = 128 Mbytes
Resetting MT7628 PHY.
Initializing MT7688 GPIO system.
wifi mac address = 40A36BC383AE.


**************************************
* Hold Reset button for more options *
**************************************



Boot Linux from Flash NO RESET PRESSED.
## Booting image at bc050000 ...
   Image Name:   MIPS OpenWrt Linux-4.14.97
   Image Type:   MIPS Linux Kernel Image (lzma compressed)
   Data Size:    1465911 Bytes =  1.4 MB
   Load Address: 80000000
   Entry Point:  80000000
   Verifying Checksum ... OK
   Uncompressing Kernel Image ... OK
No initrd
## Transferring control to Linux (at address 80000000) ...
## Giving linux memsize in MB, 128

Starting kernel ...

[    0.000000] Linux version 4.14.97 (amery@72f72a0014fa) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 r0+7414-86a74e73e4)) #0 Fri Feb 1 16:26:17 2019
[    0.000000] Board has DDR2
[    0.000000] Analog PMU set to hw control
[    0.000000] Digital PMU set to hw control
[    0.000000] SoC Type: MediaTek MT7688 ver:1 eco:2
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 00019655 (MIPS 24KEc)
[    0.000000] MIPS: machine is Hanover Displays 71020 Onion
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 08000000 @ 00000000 (usable)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] random: get_random_bytes called from start_kernel+0x8c/0x474 with crng_init=0
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32512
[    0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
[    0.000000] PID hash table entries: 512 (order: -1, 2048 bytes)
[    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
[    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Writing ErrCtl register=00000009
[    0.000000] Readback ErrCtl register=00000009
[    0.000000] Memory: 125096K/131072K available (3385K kernel code, 171K rwdata, 816K rodata, 164K init, 200K bss, 5976K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 256
[    0.000000] intc: using register map from devicetree
[    0.000000] CPU Clock: 580MHz
[    0.000000] timer_probe: no matching timers found
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6590553264 ns
[    0.000011] sched_clock: 32 bits at 290MHz, resolution 3ns, wraps every 7405115902ns
[    0.007537] Calibrating delay loop... 385.84 BogoMIPS (lpj=1929216)
[    0.073497] pid_max: default: 32768 minimum: 301
[    0.078197] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.084546] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.097259] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.106750] futex hash table entries: 256 (order: -1, 3072 bytes)
[    0.112699] pinctrl core: initialized pinctrl subsystem
[    0.118270] NET: Registered protocol family 16
[    0.148932] mt7621_gpio 10000600.gpio: registering 32 gpios
[    0.154539] mt7621_gpio 10000600.gpio: registering 32 gpios
[    0.160170] mt7621_gpio 10000600.gpio: registering 32 gpios
[    0.169842] clocksource: Switched to clocksource MIPS
[    0.175966] NET: Registered protocol family 2
[    0.181050] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[    0.187742] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    0.193929] TCP: Hash tables configured (established 1024 bind 1024)
[    0.200178] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.205777] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.212126] NET: Registered protocol family 1
[    0.219688] Crashlog allocated RAM at address 0x3f00000
[    0.226291] workingset: timestamp_bits=30 max_order=15 bucket_order=0
[    0.238433] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.244055] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.263393] io scheduler noop registered
[    0.267127] io scheduler deadline registered (default)
[    0.273059] Serial: 8250/16550 driver, 3 ports, IRQ sharing disabled
[    0.280442] console [ttyS0] disabled
[    0.283876] 10000c00.uartlite: ttyS0 at MMIO 0x10000c00 (irq = 28, base_baud = 2500000) is a 16550A
[    0.292645] console [ttyS0] enabled
[    0.292645] console [ttyS0] enabled
[    0.299626] bootconsole [early0] disabled
[    0.299626] bootconsole [early0] disabled
[    0.308386] 10000d00.uart1: ttyS1 at MMIO 0x10000d00 (irq = 29, base_baud = 2500000) is a 16550A
[    0.317993] 10000e00.uart2: ttyS2 at MMIO 0x10000e00 (irq = 30, base_baud = 2500000) is a 16550A
[    0.327482] cacheinfo: Failed to find cpu0 device node
[    0.332734] cacheinfo: Unable to detect cache hierarchy for CPU 0
[    0.339691] spi-mt7621 10000b00.spi: sys_freq: 193333333
[    0.360060] m25p80 spi0.0: w25q256 (32768 Kbytes)
[    0.364887] 4 fixed-partitions partitions found on MTD device spi0.0
[    0.371343] Creating 4 MTD partitions on "spi0.0":
[    0.376207] 0x000000000000-0x000000030000 : "u-boot"
[    0.382208] 0x000000030000-0x000000040000 : "u-boot-env"
[    0.388459] 0x000000040000-0x000000050000 : "factory"
[    0.394565] 0x000000050000-0x000002000000 : "firmware"
[    0.495310] 2 uimage-fw partitions found on MTD device firmware
[    0.501358] 0x000000050000-0x0000001b5e77 : "kernel"
[    0.507307] 0x0000001b5e77-0x000002000000 : "rootfs"
[    0.513262] mtd: device 5 (rootfs) set to be root filesystem
[    0.520545] 1 squashfs-split partitions found on MTD device rootfs
[    0.526836] 0x000000a10000-0x000002000000 : "rootfs_data"
[    0.534310] libphy: Fixed MDIO Bus: probed
[    0.547196] rt3050-esw 10110000.esw: link changed 0x00
[    0.556789] mtk_soc_eth 10100000.ethernet eth0: mediatek frame engine at 0xb0100000, irq 5
[    0.567072] NET: Registered protocol family 10
[    0.575373] Segment Routing with IPv6
[    0.579207] NET: Registered protocol family 17
[    0.583838] 8021q: 802.1Q VLAN Support v1.8
[    0.597458] VFS: Mounted root (squashfs filesystem) readonly on device 31:5.
[    0.605579] Freeing unused kernel memory: 164K
[    0.610118] This architecture does not have kernel memory protection.
[    1.287648] init: Console is alive
[    1.291446] init: - watchdog -
[    1.623130] kmodloader: loading kernel modules from /etc/modules-boot.d/*
[    1.772568] usbcore: registered new interface driver usbfs
[    1.778231] usbcore: registered new interface driver hub
[    1.783817] usbcore: registered new device driver usb
[    1.794796] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    1.802778] ohci-platform: OHCI generic platform driver
[    1.818419] phy phy-10120000.usbphy.0: remote usb device wakeup disabled
[    1.825241] phy phy-10120000.usbphy.0: UTMI 16bit 30MHz
[    1.830566] ohci-platform 101c1000.ohci: Generic Platform OHCI controller
[    1.837485] ohci-platform 101c1000.ohci: new USB bus registered, assigned bus number 1
[    1.845643] ohci-platform 101c1000.ohci: irq 26, io mem 0x101c1000
[    1.924866] hub 1-0:1.0: USB hub found
[    1.929049] hub 1-0:1.0: 1 port detected
[    1.945759] MTK MSDC device init.
[    2.000076] mtk-sd: MediaTek MT6575 MSDC Driver
[    2.011481] mtk-sd 10130000.sdhci: no support for card's volts
[    2.017431] mmc0: error -22 whilst initialising SDIO card
[    2.023153] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
[    2.032141] random: fast init done
[    2.035986] init: - preinit -
[    2.039984] mtk-sd 10130000.sdhci: no support for card's volts
[    2.045916] mmc0: error -22 whilst initialising MMC card
[    2.192298] mtk-sd 10130000.sdhci: no support for card's volts
[    2.198232] mmc0: error -22 whilst initialising SDIO card
[    2.218053] mtk-sd 10130000.sdhci: no support for card's volts
[    2.224004] mmc0: error -22 whilst initialising MMC card
[    2.305391] mtk-sd 10130000.sdhci: no support for card's volts
[    2.311344] mmc0: error -22 whilst initialising SDIO card
[    2.338911] mtk-sd 10130000.sdhci: no support for card's volts
[    2.344876] mmc0: error -22 whilst initialising MMC card
[    3.296354] rt3050-esw 10110000.esw: link changed 0x00
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
[    3.481109] random: procd: uninitialized urandom read (4 bytes read)
[    6.856494] jffs2: notice: (364) jffs2_build_xattr_subsystem: complete building xattr subsystem, 23 of xdatum (2 unchecked, 21 orphan) and 77 of xref (72 dead, 0 orphan) found.
[    6.969440] mount_root: switching to jffs2 overlay
[    6.991789] overlayfs: upper fs does not support tmpfile.
[    7.002686] urandom-seed: Seeding with /etc/urandom.seed
[    7.213809] procd: - early -
[    7.216848] procd: - watchdog -
[    7.820755] procd: - watchdog -
[    7.824249] procd: - ubus -
[    8.005464] random: ubusd: uninitialized urandom read (4 bytes read)
[    8.089616] random: ubusd: uninitialized urandom read (4 bytes read)
[    8.096690] random: ubusd: uninitialized urandom read (4 bytes read)
[    8.104218] procd: - init -
Please press Enter to activate this console.

Failed to executPlease press Enter to activate this console.

It appears to stop the boot process at this point, pressing Enter to activate the console always returns "Failed to execut".

We had assumed that this was caused by unexpected power cuts as we do not gracefully shutdown the system (We do not know when the power will be cut) and rely on the filesystem being robust enough to handle unexpected power cuts. Tests were carried out with multiple units where the power was cut at random intervals and rebooted. After several thousand cycles the units were still functioning as expected. So we ruled that out.

Environmental conditionals were considered as well, so there were put into environmental test cycles where we heated and froze the units several times over but still the function.

Most of the time, it is possible to go into safemode and do a firstboot. Whilst this fixes the Onion, the better solution would be to prevent it from needing a firstboot in the first place.

Does anyone have any insight as to why it is failing to boot? Or what is causing it. We reckon it has something to do with the overlay failing to mount but other than that I am not an Linux expect whatsoever. Is there a way to gracefully shutdown the Onion? Although this cannot always be guaranteed due to the nature of their installation.

Any pointers or suggestions are hugely appreciated. Thanks for taking the time to read this.

luz

@Cooper-Chan I'm using an app based on 19.07.7 myself on >1000 units in the field, so I can feel with you that this is an unpleasant problem to have...

My first guess would be exhausted memory caused by something happening early in the boot process.

My rationale for this would be the following: From your log this happens quickly after procd begins processing the init scripts in /etc/init.d/.

And the truncated message Failed to execut is almost certainly the output of askfirst which is trying to execute /usr/libexec/login.sh after you press return. Unfortunately, askfirst immediately terminates itself after printing the message, so that's why it gets truncated (askfirst is a 43 line C program, easy enough to understand, see build_dir/target-mipsel_24kc_musl/procd-xxxxxxxx/utils/askfirst.c). But askfirst is never called with anything else than /usr/libexec/login.sh, so that's almost certainly the thing that fails to execute. And unless your rootfs is severely damaged, I can't imagine anything except simply not enough memory as the cause.

From here, I would wildly guess that you start your python app early in the boot process (via a /etc/init.d/xx script), and depending on its own persisted state from previous runs it simply consumes too much memory too quickly, and brings everything else to a halt.

While this is pure speculation regarding your (mostly unknown to me) setup&app, I once had a socket leak in my own main app (written in C++), that gobbled up all memory not immediately, but over time. The result was very similar to what you describe.

I might be totally wrong, but ruling out exhausted memory might help in your search for the real cause.

BTW: From my experience, ruining the jffs2 overlay just by powering off is not something that happens. The worst outcome I'd expect would be an inconsistency in files being written right at the moment the power goes down, but no general corruption beyond that.

Lazar Demin

@Cooper-Chan The memory scenario suggested by @luz is definitely worth investigating.

On the hardware side, I would suggest looking for any stray sources of electromagnetic interference.

There was a post earlier this year about a situation similar to yours - deployed Omega2 units would stop booting after some time because of what looked like corrupted firmware. It turned out a cable connected to the cellular modem in the device was emitting electromagnetic interference that would corrupt the filesystem.

See the original post for more details:
https://community.onion.io/topic/4448/onion-omega-2-jffs2-newly-erased-block-contained-word-0x0-at-offset-0x01180000-warning/

Let us know if that helps!

Cooper Chan

@luz many many thanks for your insightful reply. You are indeed correct that the Python script is invoked via an /etc/init.d script:

#!/bin/sh /etc/rc.common

START=90
STOP=90

USE_PROCD=1

respawn_threshold=0
respawn_timeout=0
respawn_retry=0

start_service() {
	procd_open_instance
	procd_set_param command /usr/bin/onionApp --conf "/etc/hanip" --data "/usr/share"

	procd_set_param respawn ${respawn_threshold:-3600} ${respawn_timeout:-5} ${respawn_retry:-5}
	# forward stdout and stderr to logd
	procd_set_param stdout 1
	procd_set_param stderr 1

	procd_close_instance
}

This was all generated by the Linux engineer that used to work for us. For reference, onionApp is a glorified IP to serial converter. As far as I can tell, nothing really persists except a few file the application generates. The application will go through the motions of setting itseful up for operation at every bootup. I wonder though, perhaps due to the inefficiencies of Python if it is doing other things when its called, or perhaps the modules used are inefficient.

If this is started too early and thus consuming too much memory (Are we talking flash memory or RAM?) then would delaying the script alleviate the issue? Otherwise I am thinking whether a process before the application is called to "clean up" remnants of the previous iteration. Whether relevant or not I have enclosed the details for top and df

Regarding the jffs2 overlay, I'm glad that in your experience power loss isn't the cause of the issues I am having. I can rule that out as a potential cause. It is usually the goto cause due to problems we've had in the past with irregular power supplies.

@Lazar-Demin , many thanks for your reply as well. From what i know, the devices have been professionally EMC tested and have passed emissions and immunity. We are not using anything on the same devices that have any radios. Communications to the outside world is all via Ethernet and internally is via a serial port. The power supplies we have upstream have noise filtering and TVS protection.

luz

@Cooper-Chan said in Onions failing to boot:

For reference, onionApp is a glorified IP to serial converter.

That makes me curious to know what kind of glorification is possible for that usecase… I might get ideas for my serialfwd tool (just kidding)

If this is started too early and thus consuming too much memory (Are we talking flash memory or RAM?)

RAM. OpenWrt itself is configured to not write to flash except for config changes and package installations. All logging goes to /tmp which is in RAM.

However, your app seems to have data in /usr/share, which is in jffs2.

If the app does write there frequently (even if not much in terms of bytes written), that could lead to flash wear. But unless the app can really get into a crazy writing loop, this still seems an unlikely explanation to me.

[…] then would delaying the script alleviate the issue?

If you delay enough so you always have a time window after boot to get into the console before it starts, that would help debugging the issue, e.g. by firing up top and maybe see the system running into the problem, live. Or then, completely rule out that memory theory, in case you can‘t log in even when you know the python app has not yet started.

[…] Whether relevant or not I have enclosed the details for top and df

These look fine and with a lot of margin for both RAM and flash.

But I assume these are from a correctly working unit, right? Unless I misunderstood, you have no way to login into a crashed one so far?

On the hardware side I‘d also check what @Lazar-Demin said. Especially if your setup is (designed) all wired, it might be important that the WiFi hardware is actually turned off.

Cooper Chan

@luz,

Unfortunately my application is proprietary and completely unrelated to yours, if you are curious here is a link: ITxPT. It converts that stuff back to our legacy serial comms.

The data in /usr/share is stuff that must persist after a poweroff, namely customer related files, this isnt written to very often, if at all. Only when the customer decides they want to update their files. The second directory /etc/hanip there is a file generated at every boot up of a few kb.

Yes I obtained those from a working unit. There is no way to "talk to" a failed unit except going into safe mode and selecting the firstboot option. When that is selected then it goes back into the state as if it just left our factory. On the rare occasion it is completely unrecoverable but I suspect that's a hardware issue than a software one. I'll speak to our linux devs to delay its start up (its above my paygrade )

I think the WiFI hardware isnt even enabled as we have not put the WiFi drivers in our image, unless of course it is always on and needs to be explicit disabled.

crispyoz

@Cooper-Chan Have you tried starting in failsafe mode, mount_root and see if you can examine the file system?

Cooper Chan

@crispyoz , thanks for your input. Yes I have tried that before, in some cases it fails to mount, in other causes it mounts, but all directories appear to be empty

carfarmer

@Cooper-Chan I know this topic is almost a year old, but figure I'd ask - did you make any progress on identifying the root cause here? We are seeing exactly the same behavior you describe (devices work great, then start randomly failing in the field after a period of time). Boot log is exactly what you describe, failsafe behavior the same, etc. Just curious if you have any additional detail you can share as we are just starting the troubleshooting process on our side.

crispyoz

@carfarmer did you see this post:

https://community.onion.io/topic/4448/onion-omega-2-jffs2-newly-erased-block-contained-word-0x0-at-offset-0x01180000-warning/