High retry/fault rate on low duty cycle radios (timeout issue in PX4?)

Hi,

I’m testing some different telemetry radios and see some results I have difficulty understanding. I have a feeling something might be wrong in the timeout behavior of PX4.
Could someone help me, or point me in the right direction?

TLDR:
I have a feeling there is something odd going on in PX4 (and possibly QGC), which results in an increasing number of requests when handshaking messages are exchanged. This is particularly catastrophic with radios that are set for low speed or low duty cycle.
ArduPilot doesn’t seem to be affected by this.

The case is as follows:
I have 1 radio that has a significantly lower duty cycle (10% instead of 100%) due to EU regulations; other than that, all radios I use have a similar setup.

To run a sort of “benchmark” I’ve created a flightplan with 100 waypoints, which I uploaded and downloaded a couple of times in the same setup and environment. While doing so I looked (in QGC) at the number of messages (mission_item/mission_item_request) required to finish a download/upload and timed how long it took. Additionally I looked in detail at the MAVLink messaging logs (with a self-made tool).

The results show that, for the radios with a duty cycle of 100%, both uploading and downloading a flightplan take about 25-30 seconds (where uploading always takes a few seconds longer). The interesting thing is that the upload has an average fail rate of 17% (based on the number of mission_item_request messages received), while the download has an average fail rate of more than 80% (based on the number of mission_item messages received)! For 100 waypoints in less than 30 seconds, it is requesting mission items more than 500 times!

For the radio that has a duty cycle of 10% it is far worse (probably as expected)!
Downloading the flightplan takes over 2 minutes, with an average fail rate of 81%.
Weirdly enough, uploading takes half the time (~62s), but with a higher retry rate of more than 90%.

By itself it is already interesting to see that the retry rate goes up when the duty cycle is lowered. What I don’t understand, however, is why the download with the lower duty cycle takes double the time of the upload, while with the higher duty cycle the download is on average quicker than the upload!
Additionally, when looking at the numbers it almost looks as if PX4 expects a MAVLink reply within 100ms (timeout) or even less, because the total number of mission requests sent by PX4 within 60s is about 1000, which means on average one every 60ms (this number also applies to the 100% duty cycle radio)!
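
As a rough sanity check of the numbers above, here is a minimal Python sketch. It assumes the fail/retry rate is counted as the share of messages that were repeats, i.e. (total - unique) / total; that definition is my assumption, but it matches the quoted figures (about 500 messages for 100 waypoints gives 80%). The function names are made up for illustration.

```python
def retry_rate(total_messages: int, unique_items: int) -> float:
    """Share of messages that were repeats rather than first-time items.

    Assumed definition: (total - unique) / total.
    """
    return (total_messages - unique_items) / total_messages


def mean_interval_ms(duration_s: float, total_messages: int) -> float:
    """Average spacing between messages, in milliseconds."""
    return duration_s * 1000.0 / total_messages


# 100% duty cycle download: roughly 500 messages for 100 waypoints
print(f"{retry_rate(500, 100):.0%}")            # 80%
# 10% duty cycle upload: roughly 1000 requests in about 60 s
print(f"{retry_rate(1000, 100):.0%}")           # 90%
print(f"{mean_interval_ms(60, 1000):.0f} ms")   # 60 ms
```

An average of 60ms between requests is well below any timeout in the 100ms+ range, which is why the numbers look suspicious to me.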

At this point I created a simple tool that logs all MAVLink messages received. It confirms, for example, that PX4 is requesting the same mission item a high number of times (sometimes more than 15 requests within 1 second, some only a few milliseconds apart), all with an incremented MAVLink sequence number (which implies deliberate requests)!
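
To give an idea of the kind of analysis I mean, here is a minimal sketch (not my actual tool). It assumes the log has already been reduced to hypothetical `(timestamp_s, mission_seq)` tuples, one per mission request seen, and counts how often each item was requested and the worst burst of requests for one item within a 1 second window.

```python
from collections import Counter, defaultdict


def summarize_requests(log):
    """Summarize repeated mission requests.

    log: iterable of (timestamp_s, mission_seq) tuples (hypothetical format).
    Returns (requests per mission item, largest number of requests for a
    single item that fall within any window shorter than 1 second).
    """
    per_item = Counter()
    times = defaultdict(list)
    for t, seq in log:
        per_item[seq] += 1
        times[seq].append(t)

    worst_burst = 0
    for stamps in times.values():
        stamps.sort()
        start = 0
        for end in range(len(stamps)):
            # Move the window start forward until the window spans < 1 s.
            while stamps[end] - stamps[start] >= 1.0:
                start += 1
            worst_burst = max(worst_burst, end - start + 1)
    return per_item, worst_burst


# Made-up sample: item 3 is requested three times within 10 ms.
sample = [(0.000, 3), (0.004, 3), (0.010, 3), (1.500, 3), (1.600, 4)]
per_item, worst_burst = summarize_requests(sample)
print(per_item[3], per_item[4], worst_burst)  # 4 1 3
```

The sample timestamps are invented for illustration; they are not taken from my logs.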

As a comparison (suggested by the radio vendor) I also tried ArduPilot, which had an interesting result! With ArduPilot the downloads were a little bit slower and had a slightly higher retry rate for the 100% duty cycle (30-50s and a failure rate of 40-50%); the 10% duty cycle performed exactly the same for download.
However, for the upload it was a completely different story!
Both the 100% and 10% duty cycle performed the same! An upload took on average 25-30s with a <1% retry rate (most of the time it was 100 out of 100 messages)!

With this data my assumption is that it most likely fails/times out on receiving long messages (mission_item), and that QGC seems to use the same or a similar method as PX4 (possibly with a longer timeout), which, based on the data, seems to be significantly different from what ArduPilot is using.

One theory I have for this behavior is that PX4 might not be dealing correctly with “broken” UART data streams (duty cycle 10%) when parsing them into MAVLink messages. If a complete message is not received as one “solid” bytestream, it would reset the input buffer and throw away all “old” bytes, so that incomplete messages get parsed (and fail), resulting in timeouts on the request and triggering a resend.
However, I haven’t been able to prove this theory, in particular because the interval between messages (sometimes less than 10ms) seems to contradict it (timeout values in PX4 seem to be set by default to at least 250ms).
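
To illustrate what I mean by this theory, here is a toy sketch. It is not PX4 code and not real MAVLink framing: the frame layout (start byte, length, payload, 1-byte sum checksum) is a simplification I made up, and the `reset_between_reads` switch only models the behavior I suspect. It shows how a parser that clears its buffer on every read loses any message that is split across two reads.

```python
STX = 0xFE  # start-of-frame marker for this toy format


def make_frame(payload: bytes) -> bytes:
    """Toy frame: STX, length, payload, 1-byte sum checksum."""
    return bytes([STX, len(payload)]) + payload + bytes([sum(payload) & 0xFF])


class ToyParser:
    def __init__(self, reset_between_reads: bool):
        self.reset_between_reads = reset_between_reads
        self.buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add one read's worth of bytes and return the payloads completed."""
        if self.reset_between_reads:
            self.buf.clear()  # suspected faulty behavior: old bytes are dropped
        self.buf.extend(chunk)
        frames = []
        while True:
            idx = self.buf.find(bytes([STX]))
            if idx < 0:
                self.buf.clear()  # no start marker anywhere: nothing to keep
                break
            del self.buf[:idx]  # discard bytes before the start marker
            if len(self.buf) < 2:
                break
            need = 2 + self.buf[1] + 1  # header + payload + checksum
            if len(self.buf) < need:
                break  # incomplete frame: wait for more bytes
            payload = bytes(self.buf[2:need - 1])
            if (sum(payload) & 0xFF) == self.buf[need - 1]:
                frames.append(payload)
                del self.buf[:need]
            else:
                del self.buf[:1]  # bad checksum: skip this marker and resync
        return frames


def run(parser: ToyParser, stream: bytes, chunk_size: int) -> list:
    """Feed the stream in fixed-size chunks, as if the radio sent it in bursts."""
    out = []
    for i in range(0, len(stream), chunk_size):
        out.extend(parser.feed(stream[i:i + chunk_size]))
    return out


stream = b"".join(make_frame(p) for p in (b"item0", b"item1", b"item2"))
print(run(ToyParser(reset_between_reads=False), stream, 12))
# [b'item0', b'item1', b'item2']
print(run(ToyParser(reset_between_reads=True), stream, 12))
# [b'item0', b'item2']
```

In the second run, item1 straddles the two chunks and never arrives, which on a real link would look like a timeout followed by a re-request.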

Another theory is that somewhere in PX4 (and possibly in QGC as well) some timeout or failure occurs and is triggering resends in a cascading effect.

A side effect that I also noticed when digging into the MAVLink messages is that (only with the 10% duty cycle radio) very soon after starting an upload the GCS receives a mission_ack with a “MAV_MISSION_ERROR”, but the vehicle still continues requesting all mission items (and finishes successfully with another, successful ACK!?)…

I’m a bit lost here. Has anyone else seen these problems?

Any help would be appreciated.

I’m replying to my own thread, for anyone else having similar issues.

After some investigation, we noticed that enabling hardware flow control (if your radio supports it) can help a lot!

However, PX4 does not make it easy to configure.
By default PX4 sets flow control to “Auto”, which means it tries to detect whether the device you are using supports hardware flow control. But as soon as a timeout in MAVLink communication occurs (which can already happen if you boot your AP before you start QGC, for example), it drops hardware flow control and does not re-enable it until the AP is rebooted.

It is relatively easy to force-enable flow control when you know where to look. There is a module.yaml file in “src/modules/mavlink”; add a ‘-z’ argument to the start command there to force hardware flow control.