Request for improvement on driver for ttyS_ (or help to do it myself)
-
Update:
I've written a stateful read implementation that does not use poll at all (posted at the bottom).
From this implementation, I've been able to determine that I'm losing exactly 16 bytes. So the first load of the 16 byte HW buffer seems to be getting discarded. If the buffer is already partially full, and read is blocking on new bytes, the next packet that arrives will be read successfully.
So with this implementation, I lose every other packet. This is not workable, and I will need to find a fix, but nonetheless it is a step forward in diagnosing and solving the problem. Clearly, it's either the driver or some HW issue. I'll need to scope the lines next.
Here's an output of the program ("otter") being run in interactive mode. The second packet is correctly received. On the first packet, the first 15 or 16 bytes are being discarded.
First packet: the first 15/16 bytes are discarded
DEBUG: read(): 00
DEBUG: read(): 0A
DEBUG: read(): 00
DEBUG: read(): 00
DEBUG: read(): 11
DEBUG: read(): 11
DEBUG: read(): 04
DEBUG: read(): 36
DEBUG: read(): 00
DEBUG: read(): 00
DEBUG: read(): 00
DEBUG: read(): 02Second packet: received correctly from partially full buffer.
DEBUG: read():FF
DEBUG: read():FF
DEBUG: read():55
DEBUG: read(): 3D 8E 00 13 02 00
DEBUG: read(): C0 0F 01 35 00 00 00
DEBUG: read(): 00 0A 00 00 11 11 04 36 00 00 00 02
DEBUG: pkt : C0 0F 01 35 00 00 00 00 0A 00 00 11 11 04 36 00 00 00 02
DEBUG: 25 Bytes Queued
3D 8E 00 13 02 00 C0 0F 01 35 00 00 0000 0A 00 00 11 11 04 36 00 00 00 02In case anyone is curious, here is the total hack I used to get around poll() incurring too much delay. The problem may be HW, in which case if I fix the HW (i.e. the other device sending data), the polling code might work again. Nonetheless, for using a single interface, this code below is probably slightly more performant -- albeit sloppy.
{ int unused_bytes; struct timespec ref; struct timespec test; mpipe_reader_INIT: rbuf_cursor = rbuf; unused_bytes = 0; payload_left = 1; while (1) { /// Check size of bytes in the hopper, and compare with payload_left (requested bytes) /// Only do read() if we need more bytes // Read from interface, and time how long it takes to return if (clock_gettime(CLOCK_MONOTONIC, &ref) != 0) { errcode = -1; goto mpipe_reader_ERR; } new_bytes = (int)read(fds[0].fd, rbuf_cursor, payload_left); if (new_bytes <= 0) { errcode = 5 - (new_bytes == 0); goto mpipe_reader_ERR; } if (clock_gettime(CLOCK_MONOTONIC, &test) != 0) { errcode = -1; goto mpipe_reader_ERR; } // If read took longer than 50ms (should be configurable ms), // we need to restart the reception state machine. if (timespec_diffms(ref, test) > 50) { errcode = 0; for (; (new_bytes>0)&&(*rbuf_cursor!=0xFF); new_bytes--, rbuf_cursor++); if (new_bytes == 0) { goto mpipe_reader_INIT; } unused_bytes = new_bytes-1; new_bytes = 1; } else { unused_bytes = 0; } mpipe_reader_STATEHANDLER: syncinput = *rbuf_cursor; rbuf_cursor += new_bytes; unused_bytes -= new_bytes; payload_left -= new_bytes; switch (errcode) { case 0: payload_left = 1; if (syncinput != 0xFF) { break; } errcode = 1; break; case 1: payload_left = 1; if (syncinput == 0xFF) { break; } if (syncinput != 0x55) { errcode = 0; break; } // Sync-Found: realign rbuf if necessary for (int j=0; j<unused_bytes; j++) { rbuf[j] = rbuf_cursor[j]; } rbuf_cursor = rbuf; errcode = 2; payload_left = 6; break; // Header (6 bytes) case 2: if (payload_left > 0) { break; } errcode = 3; payload_length = rbuf[2] * 256; payload_length += rbuf[3]; payload_left = payload_length; header_length = 6 + 0; if ((payload_length == 0) || (payload_length > (1024-header_length))) { errcode = 2; goto mpipe_reader_ERR; } break; // Payload (N bytes) case 3: if (payload_left <= 0) { goto mpipe_reader_READDONE; } break; default: rbuf_cursor = rbuf; payload_left = 1; errcode = 0; unused_bytes = 0; break; } if (unused_bytes > 0) { new_bytes = (payload_left < unused_bytes) ? payload_left : unused_bytes; goto mpipe_reader_STATEHANDLER; } } }
-
I've not done any work on the TTY device but what you are describing looks like a handshaking issue, does the TTY do handshaking?
-
@crispyoz
No, it's just direct TX/RX without usage of tcflow()
-
Update:
I found the problem. tcflush() does not appear to work properly on this platform. I will investigate further, submit a bug report, and perhaps fix it. Right now I'm 2 weeks behind schedule, though, so it will need to wait.
I worked-around the problem by removing all calls to tcflush(). This is suboptimal, but there won't be any packet loss unless the other end is sending packets with incorrect framing. For this project, that's something I don't really need to worry about.
So, I finally got around to scoping the TX & RX signals. You can see the TX and RX packets. If I transmit a packet after calling tcflush(fd, TCIFLUSH), the tty driver will:
1: discard the first 16 bytes of the received response -- this is one frame of the HW FIFO.
2: At ~3.5ms after the request transmission, Onion will transmit a giant, erroneous packet that contains some clobbered data -- the data, here, is in part the request (tx) and in part the response (rx).This smells so much like a driver bug.
-
@JP-Norair thx for posting your findings! That's very useful to know.
Regarding your questions above: yes, I've been using poll(), yes I've transmitted binary packets with all values 0..0xFF. And the baudrate was 115200bd.
But, obviously, I haven't used TCIFLUSH.
I agree this looks like a driver bug.
But still I am surprised such a significant bug would be in Linux' 8250 standard driver without somebody having noticed and fixed it loooong ago...
Maybe I'm missing something and it's not the standard 8250 driver that is included in OnionOS builds?
-
@luz there are so many flavors of Linux (onion uses musl libraries) that it’s very easy for bugs to slip through the cracks for a specific platform. I’ve run the original code on other platforms using the same kernel version and ostensibly same serial driver, and it works great.
Anyway, I will plod on.
-
Just a note that the issues I have are re-occurring. The removal of tcflush() calls oddly worked the day of testing, but it no longer works. The same error has returned. My conclusion for the moment is that Onion Omega 2 is buggy and not suitable for usage with binary serial port.
Notes:
- Using poll() and/or tcflush() results in nearly 100% failure.
- Using the read loop without poll results in less failure, but still far too much failure.
-
@JP-Norair @luz We took a look and confirmed that the default Omega2 firmware uses the standard Linux 8250 serial driver included with kernel 4.14, with the DTS file setting
NS16550A
for the port type for all 3 UARTs.Questions:
- You mentioned the polling code works well on other platforms. Which platforms did you try out? Did they have NS16550A serial ports and were they running Linux kernel 4.14?
- Can you try using a USB-to-Serial adapter (connected to the Omega2's USB Host port) to check if the same problem appears?
-
@Lazar-Demin
Here is the list of issues I've encountered with the Omega2+ serial port. I'm a big fan of the Onion project, and I'd like to help if possible. I can send you hardware and software to replicate this, if you'd like.Current workarounds:
- Do not use poll() or tcflush()!
- Use read(fd, buf, 1) only. That is, size param = 1.
-
Receiver on UART0 seems to have a lot of capacitance on it. This can be overcome with a strong pull-up, but I thought I'd mention this in any case.
-
I have a half-duplex binary packet protocol. It's much like SLIP or PPP. If I send a Request TX packet to the endpoint, and it generates a Response RX packet, problems can occur during the RX.
A. If poll() is used during reception of the Response RX Packet, the serial port will spew a bunch of data on the TX line (see behavior in logic analyzer capture).
B. If tcflush(..., TCIFLUSH) is used after the packet is fully received and read (via read()), similar behavior as in (2A) will ensue.
C. Usage of tcflush(), in any way, anywhere in the program seems to cause issues with the fd it is used-on. -
There is another issue with reception. When I'm receiving a binary packet, the 16th byte is often dropped.
A. When I scope the lines, I can verify that the 16th byte is always on the line.
B. Here's an example of a correct packet. Bytes 0:1, sync. Bytes 2:7, header. Bytes 8:..., payload. This packet has 0x39 payload size (57 bytes).
[FF 55] [F9 F4 00 39 08 00] [C0 35 01 35 01 00 00 00 30 00 00 36 00 4A 30 06 ...]
C. Here's the same packet, with byte 16 missing. The last byte of the payload will be read as FF, because this is the idle state of the serial lines.
[FF 55] [F9 F4 00 39 08 00] [C0 35 01 35 01 00 00 30 00 00 36 00 4A 30 06 ... FF]
D. If I change the way read() is done (i.e. changing the size boundary to smaller values), this often causes more bytes to be dropped. HOWEVER, if I use size=1, and only size=1, reception generally works without dropping bytes. This is one workaround I'm using. -
tcdrain(...) does not appear to cause any problems, although I'm skeptical if it actually blocks program flow as it should.
-
tcdrain() and tcflush() both return 0 (no errors)
Final notes.
I've run the original SW over USB-serial (ttyUSBx) and USB-CDCACM (ttyACMx) interface without issue.
I've run the original SW on ttySx, with a 4.x kernel (might have been 4.4.xxx) on an OrangePi without issue. Not sure what HW they have, but definitely it is a 32 bit ARM Cortex A7, and definitely it's NOT running musl. So the libc is different. I expect the problems are probably not so much in the driver but in musl, but that's just my guess.
-
@JP-Norair I'm still wondering what could be the difference between your Omega2 setup and mine.
Because in my current project (means: I heavily using the code every day), I am using
poll()
all the time. I also found that the library I'm using even occasionaly usestcflush(TCIFLUSH)
and tcflush(TCIOFLUSH) - I thought it didn't, in my previous post.The read code is simple, when
poll()
returns a POLLIN, a singleread()
reads everything currently available.tcflush()
is used only after detecting receiving errors (unexpected message lengths, crc errors).What kernel, musl and openwrt version are you using?
My versions are linux 4.14.95, musl 1.1.19 and openwrt 18.06.2
All openwrt v18.06.x use musl 1.1.19. LEDE v17.01 was using musl 1.1.16, upcoming openwrt v19.07 will use musl 1.1.23, but I found no commit in musl since 1.1.19 I could relate to anything fixed regarding serial ports.
From the fact that @Lazar-Demin confirms the use of the standard 8250 driver, and the MIPS platform is not an obscure corner case for linux, I would agree that the problem is more likely a musl thing.
-
@luz said in Request for improvement on driver for ttyS_ (or help to do it myself):
My versions are linux 4.14.95, musl 1.1.19 and openwrt 18.06.2
I compiled the buildroot in June, but I will try to rebuild it with whatever is the latest commit from GitHub. I'm wondering now if I was using the master branch, which would be the wrong branch to use by the looks of it. I added some of libs from the OpenWRT universe (libfts, libtalloc), but I don't think these would impact anything.
I was following the instructions here. There isn't an instruction about changing branches, so unless I built HEAD, it probably was master. I think Onion team should update the docs, if so.
https://docs.onion.io/omega2-docs/cross-compiling.htmlUpdate:
No, the buildroot is from the 18.06 branch.@luz are you cross-compiling or building on the onion? Maybe you could share your tty configuration. I've attached mine at the bottom of this post.
Another crazy thing I noticed on Onion is that signal(SIGINT, &handler_fn) will return non-zero if the process is launched from within a shell script. If process is launched from tty, it's fine. I ported my code to use SIGTERM instead, and to ignore the SIGINT handler. I'm curious if, somehow, this may lead to the issues.
static int sub_opentty(mpipe_intf_t* ttyintf) { mpipe_tty_t* ttyparams = (mpipe_tty_t*)ttyintf->params; struct termios tio; int i_par; int rc = 0; // first open with O_NONBLOCK ttyintf->fd.in = open(ttyparams->path, O_RDWR | /*O_NDELAY*/ O_NONBLOCK | O_EXCL); if (ttyintf->fd.in < 0 ) { rc = -1; goto sub_opentty_EXIT; } if ( !isatty(ttyintf->fd.in) ) { rc = -2; goto sub_opentty_ERR; } // TTY device has output and input on the same file descriptor ttyintf->fd.out = ttyintf->fd.in; // then reset O_NDELAY if ( fcntl(ttyintf->fd.in, F_SETFL, O_RDWR) ) { rc = -3; goto sub_opentty_ERR; } // clear the datastruct just in case bzero(&tio, sizeof(struct termios)); // Framing parameters are derived in mpipe_opentty() ///@todo Currently ignores parity on RX (IGNPAR) i_par = IGNPAR; tio.c_cflag = ttyparams->data_bits | ttyparams->stop_bits | ttyparams->parity | CREAD | CLOCAL | ttyparams->flowctl; tio.c_iflag = IGNBRK | i_par; tio.c_oflag = CR0 | TAB0 | BS0 | VT0 | FF0; tio.c_lflag = 0; tio.c_cc[VMIN] = 1; // smallest read is one character tio.c_cc[VTIME] = 0; // Inter-character timeout (after VMIN) is 0.1sec tcflush( ttyintf->fd.in, TCIOFLUSH ); cfsetospeed(&tio, ttyparams->baud); cfsetispeed(&tio, ttyparams->baud); // Using TCSANOW will do [something] if (tcsetattr(ttyintf->fd.in, TCSANOW, &tio) != 0) { rc = -4; goto sub_opentty_ERR; } // RTS/CTS are not available at this moment (may never be) //if( flowctrl != FLOW_HW ) { // if (rts) tiocmbis(table->intf[id].fd.in,TIOCM_RTS); // else tiocmbic(table->intf[id].fd.in,TIOCM_RTS); //} // DTR is not available at this moment (may never be) //if (dtr) tiocmbis(table->intf[id].fd.in,TIOCM_DTR); //else tiocmbic(table->intf[id].fd.in,TIOCM_DTR); sub_opentty_EXIT: return rc; sub_opentty_ERR: close(ttyintf->fd.in); return rc; }
-
@JP-Norair What does your hardware setup look like? Is using a USB-to-Serial chip with the Omega an option?
-
No: for size, cost, and power reasons we need to use the built-in UART(s).
I'm interested that @luz is having no trouble doing basically the same thing. It would be nice to know what he/she is doing that is different from what I am doing.
I will rebuild buildroot and see if this makes a difference.