Wifi driver(?) seems to crash when wifi network unavailable.



  • Hi guys,

    We started field testing with our custom PCB containing the Omega2s+.
    Firmware is stock 0.3.2 b233 with custom software on top.

    Our use scenario is "use standalone unless surrounding wifi network present".
    By standalone i mean a tablet or phone directly connected to the omega2s+ AP SSID itself.

    Okay..
    all goes well as long as the connected surrounding wifi is present.
    However when we walk out of reach of the surrounding wifi network, on the "direct connection" from phone to omega the response becomes sluggish too.

    This behaviour can also be simulated by changing the "option ssid" in wifi-iface and wifi-config to a non-existent ssid like 'MyNetwork_CANNOTBEFOUND'
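
    For reference, a minimal sketch of that simulation using uci. It assumes the stock Omega2 layout, where the client interface is the named section 'sta' and the wifi-config section is anonymous; check the output of "uci show wireless" on your own firmware first. The UCI variable is a hypothetical hook so the commands can be previewed without touching the config.

    ```shell
    # Assumption: stock Omega2 /etc/config/wireless with a wifi-iface named
    # 'sta' and an anonymous wifi-config section. Verify: uci show wireless
    UCI="${UCI:-uci}"

    # Point the station side at an SSID that does not exist.
    set_sta_ssid() {
        $UCI set "wireless.sta.ssid=$1"
        $UCI set "wireless.@wifi-config[0].ssid=$1"
        $UCI commit wireless
    }

    # Usage on the device (restart wifi or reboot afterwards):
    #   set_sta_ssid 'MyNetwork_CANNOTBEFOUND'
    # Preview only:
    #   UCI="echo uci"; set_sta_ssid 'MyNetwork_CANNOTBEFOUND'
    ```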

    After a while the "direct connection" (the omega AP side), breaks completely and the omega SSID even disappears from the air. It does not fix itself.
    Serial console is still available.

    dmesg shows a crash (or oom?) related to the wifi warp-core?, see attached screenshot
    (edit: i'm not sure if this line actually relates to the warp core, but it is definitely triggered by the wifi setting 'apsta' and does not occur when just 'ap')

    IMAGE 2022-02-11 17:36:13.jpg

    Our application sends a constant stream of data so it's easy to see when the connection starts to decay. It misses beats and gets very sluggish.

    • the problem disappears when i change the omega from 'apsta' to 'ap' but this would break our favorite use scenario.

    • the crash takes longer when i do echo "1000" > /proc/sys/vm/min_free_kbytes but then also connection response on the direct connection is still terrible.
      (https://github.com/dongqifan/lede-mt7628/issues/1), thanks @pparent76

    • free memory falls to a minimum before things go wrong. this low mem does not happen when in "just AP mode"

    • edit: i just found out the "low mem" problem also seems to exist when the surrounding wifi network IS available and connected. It just does not end that catastrophically.
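
    To track available memory over time without having to interpret the columns of the free command, one option is to read it straight from /proc/meminfo. A minimal sketch, assuming the kernel exposes the MemAvailable field (Linux 3.14 and later); the function name is made up for illustration.

    ```shell
    # Print MemAvailable (in kB) from a meminfo-style file.
    # Defaults to /proc/meminfo; another path can be passed for testing.
    mem_available_kb() {
        awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
    }

    # Usage on the device, one reading every 5 seconds:
    #   while true; do mem_available_kb; sleep 5; done
    ```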

    i can not reproduce this behaviour every single time, but 9 out of 10.
    i'm obviously still looking for the reason why it sometimes does NOT break. 🙂

    Does anybody know if this is a known problem specifically related to the warp-core? Or maybe i'm looking at the wrong suspect?
    Is there a known fix maybe?

    Thanks a lot, Tony

    wifi config.jpg

    IMAGE 2022-02-13 11:08:25.jpg



  • Hi guys,

    Sorry, i had to reassess some of the statements i made before
    (and i'm not allowed to edit the original post anymore)

    The problem definitely still exists.
    in 'apsta' mode:
    the tablet-to-omega direct connection breaks after a couple of minutes when omega can't find a surrounding wifi network.

    However.

    • I have not been able to recreate a definite OOM in dmesg again.

    Also i have been reading the output of the free command wrong, because i'm used to the procps-ng free command and not the busybox free, so:

    • There is no low mem situation, at least not every time.

    • Also, in some cases the direct wifi connection can be re-established by switching the wifi on an android tablet off and on, and only then.
      Does not occur in just 'ap' mode.

    • the kernel message "MtPsWatchDog(): HIT PsmPktCount/Dur: 0/1 on Wcid 2 (SW_PSM:[0|0])" seems related but might be just a side effect.
      But it does NOT occur when just on 'ap' mode, so without the problem present.
      It also doesn't show while the omega is searching for a surrounding network if there's no device "directly" connected.
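
    To compare how often that message shows up in 'ap' versus 'apsta' mode, the hits in the kernel log can simply be counted. A small sketch; the function name is hypothetical and the match string is taken from the message quoted above.

    ```shell
    # Count MtPsWatchDog hits in kernel log text read from stdin.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    count_pswatchdog() {
        grep -cF 'MtPsWatchDog(): HIT' || true
    }

    # Usage on the device:
    #   dmesg | count_pswatchdog
    ```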

    We will be testing next couple of days with different types of tablets/phones to hopefully learn more.

    Now used:
    Galaxy Tab A, android 11, chrome
    Apple iPad mini 5th gen, iOS 15.2.1, both chrome and safari

    Meanwhile of course i'm open to any suggestions, if only to find out where to look.

    thanks
    Tony



  • @tony-ter-neuzen I'm not aware of any issue similar to your report, however I do have some thoughts.

    The MtPsWatchDog does come from the Warp Core driver but perhaps because it utilises some of the MT WiFi driver code. Since the Warp Core driver is closed source that is about as much as I can deduce.

    I can't help but wonder if there is a combination of issues here. When the Omega2+ is searching for WiFi it does consume more resources; if at the same time your app is also consuming more resources due to the failing connection, perhaps this exacerbates the issue. Did you try the same test without your software running, and perhaps just run a script to utilise the connection?

    You can also use the standard MT WiFi driver which allows you to turn off ap or sta separately.



  • @crispyoz thanks for your thoughts.

    I'm also afraid that this is a pile-up of different issues.

    Our app is of course using resources, but it -should- only send data through a connection that is really established. So if you only ask for that data in the direct ap connection, there will be no data through the sta connection.

    Also without the app present it's hard to notice the sluggishness of data transfer without building another way to test (and maybe fall into the same pitfall ;).
    I have read left and right about applications where people noticed the same latency problems, for instance in a camera application.

    It seems to simply surface more if you really start using the 'ap' connection next to the 'sta' connection

    That's why we didn't notice before we started shipping.
    (yup, i agree, failing test protocols, but still)

    warp core is probably not designed to be used this way, it makes more sense that a user wants to use it with both connections in place at the same time, or with them putting 'ap' in the config to disable that functionality.

    You say that the standard MT driver allows for turning off ap or sta separately.

    Do you mean that this setting does not work in the omega version at all?
    I did notice that changing things like "option disable" under config wifi-iface 'ap' or config wifi-iface 'sta' doesn't give me the results i would expect, but it seems that changing option device_mode 'apsta' to just 'ap' under "config wifi-device 'radio0' " actually makes a difference.

    Our problem also disappears when doing just that.
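
    That switch can be scripted with uci. This is only a sketch: the section name radio0 and the values 'ap' and 'apsta' come from this thread, while the function name and the UCI variable (a hook for previewing the commands) are hypothetical.

    ```shell
    UCI="${UCI:-uci}"

    # Set option device_mode under config wifi-device 'radio0'.
    # Only the two values discussed in this thread are accepted.
    set_device_mode() {
        case "$1" in
            ap|apsta) ;;
            *) echo "unsupported mode: $1" >&2; return 1 ;;
        esac
        $UCI set "wireless.radio0.device_mode=$1"
        $UCI commit wireless
    }

    # Usage on the device (restart wifi or reboot afterwards):
    #   set_device_mode ap
    ```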

    Anyway, i'll give your suggestions a try, thanks

    Do you have any idea why Onion chose not to use the standard MT driver?

    cheers
    Tony

    PS i love your case for the dash, thanks for that too 😉



  • @tony-ter-neuzen I like to use the top command to monitor my devices when troubleshooting, also for some tricky stuff I like to use a script that dumps resource usage to an SD Card every x seconds.
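
    A rough sketch of such a logging script. The log format, the function name and the PROC_DIR override are assumptions made for illustration, and the log path has to be replaced with wherever the SD card is mounted.

    ```shell
    # Directory to read kernel statistics from (overridable for testing).
    PROC_DIR="${PROC_DIR:-/proc}"

    # Append one sample to the given file:
    # epoch seconds, 1-minute load average, MemAvailable in kB.
    log_sample() {
        out="$1"
        load="$(cut -d' ' -f1 "$PROC_DIR/loadavg")"
        mem="$(awk '$1 == "MemAvailable:" { print $2 }' "$PROC_DIR/meminfo")"
        echo "$(date +%s) $load $mem" >> "$out"
    }

    # Usage on the device (path and interval are placeholders):
    #   while true; do log_sample /path/to/sdcard/usage.log; sleep 10; done
    ```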

    The Warp Core driver can't disable individual WiFi interfaces, but the MT driver can. Onion built the Warp Core driver because the MT driver was well known to be flaky, but in recent times the MT driver seems to be pretty reliable. You don't get to use the nice Onion tools to configure your WiFi, but you can of course configure it manually, or use the LUA interface, which is simpler. I'm using the MT driver on my OpenWrt 20 based devices and it seems to work reliably.



  • @crispyoz

    our specific problem involves latency and packetloss, so i'm currently making graphs with continuous ping times while sending a steady stream of randomness...
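
    A sketch of how the ping output could be reduced to something plottable. It assumes the usual reply format containing "seq=N" and "time=X ms" (both busybox and iputils ping print these); the function name is made up. Gaps in the sequence numbers then show up as packet loss.

    ```shell
    # Read ping output on stdin and print "<sequence> <rtt in ms>" per reply.
    # Lines without a reply (header, summary, timeouts) are dropped.
    extract_seq_rtt() {
        sed -n 's/.*seq=\([0-9]*\).*time=\([0-9.]*\) *ms.*/\1 \2/p'
    }

    # Usage (address is a placeholder for the omega):
    #   ping <omega-ip> | extract_seq_rtt > rtt.log
    ```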

    top does not really show any weird behaviour as far as i can see, but maybe i just don't know where to look

    thanks a lot for the info on the choice of driver, it sounds like a sensible consideration for onion to make. i was wondering about this for some time now.

    in our application we don't use the nice gui that onion built, and if you think the MT driver has become more mature it might be a path we could examine further.

    It would however mean that I have to educate myself first on how to swap those drivers around.... i don't think it's in my skill set yet how to do that, unless it is simply handled by opkg...
    someone i know always says: "hope is just delayed disappointment"

    manually configuring is not a problem, unless you mean the /etc/config/ methods disappear. most of the time i use the uci get/set/commit trickery anyway

    i guess that this will become the first time i actually need to learn how to crosscompile or build a custom firmware, but there's no reason to postpone the inevitable 🙂

    thanks for your input, the search continues

    greetings, tony



  • I can only second what @crispyoz says: at the time Onion introduced WarpCore, the open-source mt76 driver was not really usable. But thanks to a company that needed a certified wifi and thus invested some money into mt76 development about 3 years ago, it has become perfectly stable. mt76 had already improved a lot before (I found it usable for my needs since 2017), but that development boost made it really production reliable. As it is based on standard nl80211/cfg80211, there is more generic openwrt documentation about the config details, and it has independent ap and sta modes. But the Onion UI is tailored to WarpCore, so it does not work for cfg80211.



  • @tony-ter-neuzen yes, replacing the wifi driver is beyond opkg. However, once you have an openwrt build environment, enabling it is just selecting it from the `make menuconfig` menu. `mt76`, being the official openwrt driver for mt7688 targets, fully supports uci and /etc/config.

    In general, I can only recommend having an openwrt build environment for any project going beyond a singular installation. It's simply super relaxing to know a single `make` can (re)build the complete firmware of a device in a totally predictable way 😃



  • @tony-ter-neuzen @luz said it perfectly. If you are releasing a product you need to have a clear understanding of the mechanism by which it functions. When Windoze breaks you're at the mercy of "Toilet Maker In Chief", but with Linux/OpenWrt/OnionOS you have access to the nuts and bolts that keep the wheels turning reliably.

    Put on your floaties and set up a build environment, then you can build the firmware the way you want it and when things go wrong you have a much clearer understanding of where the issues may be and where the solutions may lie.



  • thanks @luz and @crispyoz for your input.

    I fully agree with you, corners have been cut and this issue sets the record straight 🙂

    you've helped me a lot by pointing out the possibility to adopt the mt76 driver as a replacement for the warp core, i will give that a try.

    And it's not like this is completely voodoo wizardry to me, in this case it has been really cutting corners because of lack of time...
    Very bad reason i know, but at least i'm honest about it 🙂

    I'm going for that "simply super relaxing" feeling that @luz describes but that simply asks for some investments too...

    So... Putting on the floaties 😉

    Thanks guys for the pointers...

    cheers tony



  • @luz and @crispyoz

    thank you guys, that was the best advice i ever followed 🙂
    i just needed to be dragged over that line...

    it has been a lot of trial and error, and i haven't answered all questions yet, but at this point the build environment spits out a complete firmware, including our own app with all custom files and dependencies...

    i end up with a mainline openwrt with mt76 that gives me a console in 12 seconds, and is completely up and connected at the sta-end in 55...
    i2c, gpio, alsa, everything in working order.

    hopefully this also solves the problem i started this topic about... but that i have to test yet... i call this a big win anyway even without the problem fixed...

    i did just realize that i'm building SNAPSHOT at the moment, but hopefully i'll find an easy way to step back to current stable without hours of tinkering on menuconfig again 😉

    greetz, tony



  • @tony-ter-neuzen Congratulations on getting up to speed with building your first custom firmware!! You should now feel "super relaxed".

    It may not fix your immediate issue, only time will tell, but it will allow you to tinker and experiment with all of what the Omega2+ and OpenWrt have to offer, then build the best solution for your devices. One of the things I love about the Omega2+ is that it gets you started quickly with a clearly defined set of hardware and software and once you are ready to take off the floaties you can step up to the next level.

    Now you need to learn how to build your own repository so you can deploy your custom updates to your devices. Lots of fun!



  • @tony-ter-neuzen happy to hear you enjoy the freedom of making your own firmware (as I hoped you will) 😉

    > i did just realize that i'm building SNAPSHOT at the moment, but hopefully i'll find an easy way to step back to current stable without hours of tinkering on menuconfig again

    Specifically for that, OpenWrt has a very useful mechanism called diffconfig (of course, it originates, like menuconfig, from the Linux kernel's build system). It lets you capture the changes in your .config (the one and only file make menuconfig touches) relative to the target's default configuration:

    ./scripts/diffconfig.sh >/my/safe/place/my_diffconfig
    

    Now you can checkout another version of OpenWrt, and then:

    cp /my/safe/place/my_diffconfig .config
    make defconfig
    

    The make defconfig expands your specific config changes in the context of the default config of the current openwrt into a full .config. The result is a config with all the subtleties that might have changed between OpenWrt versions, with your changes on top. Of course, there might be conflicts, but these are much easier to spot and solve this way.
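
    Put together, moving a saved diffconfig onto another OpenWrt version might look like the sketch below. The run helper, the DRY_RUN variable and the function name are hypothetical additions for previewing the steps; the path and the tag are placeholders. It is meant to be run from the root of the OpenWrt tree.

    ```shell
    # Run a command, or only print it when DRY_RUN is set (hypothetical helper).
    run() {
        if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
    }

    # $1 = saved diffconfig file, $2 = git tag or branch to switch to.
    # Stops at the first step that fails.
    restore_diffconfig() {
        run git checkout "$2" &&
        run cp "$1" .config &&
        run make defconfig
    }

    # Usage:
    #   ./scripts/diffconfig.sh >/my/safe/place/my_diffconfig
    #   restore_diffconfig /my/safe/place/my_diffconfig <tag-or-branch>
    ```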

    Inspired by this mechanism (and way of thinking it implies), in my various OpenWrt based Omega2 projects, I nowadays strictly refrain from forking the OpenWrt tree, but keep all changes I need as a "diff" (consisting of diffconfig, but also a set of patches to OpenWrt I might need for one reason or another) separately.

    After having done this manually for a while, I built a bash script named p44build that automates most of it. With p44build, lifting a project from an older OpenWrt version to a newer one is usually a minor task - IMHO a very important thing for OpenWrt based firmware to avoid sticking with an outdated OpenWrt too long just because upgrading a fork is a pain. p44build also allows switching between targets and projects on the same tree easily.



  • hi @luz

    thanks for handing out that next step in the puzzle...

    the learning curve feels quite steep to me (as a just starting out newbie) but i think i'm getting the hang of it.

    first i had to break my current snapshot build system hopelessly, but that was a ritual offer that apparently had to be made...

    as soon as i did a new pull instead of trying to switch over my current snapshot environment things were improving fast.

    so, using that diff you mentioned it was easy to adapt my working config to 21.02, and things actually built without one warning or error... that was an immediate improvement already.

    i added your plan44 feed too, that ledchain looked too good to pass up 🙂

    i will be looking into p44build also, i'm already getting close to that "super relaxed" feel @crispyoz mentioned and who knows it will even get better...

    Circling back to the wifi topic this started with, i noticed that with the mt76 i had to take a different approach than i used to.
    i kind of got used to the fact that warp-core could (at least temporarily, before it broke in my case) keep up the 'ap' end when 'sta' couldn't connect

    mt76 on the other hand seems to just take down the 'ap' end if 'sta' can't connect...
    so... that was also not really what i was looking for either.

    i did find however a nice plugin called travelmate that seems to offer some kind of solution.
    it tries to connect the sta end, and if it doesn't succeed it stops trying and keeps the ap end up.

    we'll see where that takes us the next couple of days....
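
    For anyone wanting to try the same idea by hand, here is a sketch of just the "is the sta end associated?" check. This is not how travelmate itself works; it only assumes the output of "iw dev <ifname> link", which starts with "Connected to" when associated and prints "Not connected." otherwise. The function name, the interface name and the uci section in the usage comment are placeholders.

    ```shell
    # Read the output of "iw dev <ifname> link" on stdin.
    # Succeeds (exit 0) only if the station interface is associated.
    sta_connected() {
        grep -q '^Connected to '
    }

    # Usage on the device (interface and section names are placeholders):
    #   if ! iw dev wlan0 link | sta_connected; then
    #       uci set wireless.<sta-section>.disabled=1
    #       uci commit wireless
    #       wifi reload
    #   fi
    ```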

    anyway, cheers guys, thanks for all the fish

    tony



  • @tony-ter-neuzen can you share the content of your network and wireless files? I've not experienced the ap going down if sta can't connect.



  • Hi @crispyoz,

    Sorry for the late reply, i have been trying to roll-back to the version where i observed this problem, but that seems a bit harder than i thought.

    We've been testing with travelmate as a plugin these last couple of days, but if you say you haven't experienced the problem i described it gives me hope again to solve this without extra plugins...

    I'll try to get back to you as soon as i achieve breaking it again...

    bye tony


