How to reset PX4 SITL state

Hi,

When you are using PX4 SITL with a simulator like AirSim and you want to execute a series of training runs (e.g., as part of training a control policy via reinforcement learning), you need a way to reset the PX4 SITL state without restarting the PX4 process for every run.

Is there a way to do this currently with PX4 SITL?

The MAVLink docs for HIL_SENSOR (https://mavlink.io/en/messages/common.html#HIL_SENSOR) seem to suggest that you can do this by sending a HIL_SENSOR message with a specific `fields_updated` value:

> `fields_updated` (uint32_t): Bitmap for fields that have updated since last message. bit 0 = xacc, bit 12 = temperature, bit 31 = full reset of attitude/position/velocities/etc was performed in sim.
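For concreteness, here is a minimal pymavlink sketch of what I have in mind; the connection URL and sensor values are placeholders for my AirSim setup:

```python
# Sketch: send HIL_SENSOR with bit 31 ("full reset") set in fields_updated.
# Placeholder setup: PX4 SITL connects to the simulator side on TCP port 4560,
# so we listen there and act as the simulator. One-shot send for illustration;
# a real sim streams these continuously.
from pymavlink import mavutil

HIL_SENSOR_RESET = 1 << 31  # bit 31 per the MAVLink docs

conn = mavutil.mavlink_connection('tcpin:0.0.0.0:4560')
conn.wait_heartbeat()

conn.mav.hil_sensor_send(
    time_usec=0,                      # sim time [us] (placeholder)
    xacc=0.0, yacc=0.0, zacc=-9.81,   # accelerometer [m/s^2]
    xgyro=0.0, ygyro=0.0, zgyro=0.0,  # gyro [rad/s]
    xmag=0.2, ymag=0.0, zmag=0.4,     # magnetometer [gauss] (placeholder)
    abs_pressure=1013.25,             # [hPa]
    diff_pressure=0.0,                # [hPa]
    pressure_alt=0.0,                 # [m]
    temperature=20.0,                 # [degC]
    fields_updated=HIL_SENSOR_RESET,
)
```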

Would this work with PX4 SITL?

@bmalhi By “resetting PX4”, do you mean simply restarting the process, or the capability of starting the firmware from a certain state?

If it is the former, why not simply kill and restart the process? Why do you need a MAVLink interface for this?

If it is the latter, it might be a bit more complicated: PX4 has a lot of internal state, and you would need to be able to save and restore all of it.

The `fields_updated` field in the HIL_SENSOR message is used to update the sensor states asynchronously, meaning you can send IMU values without necessarily including, e.g., a diff_pressure update. That part is currently working in PX4 SITL. So this field is about partial sensor updates; it is not related to restarting the PX4 SITL process.
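To illustrate the partial-update case (a sketch; the bit positions are taken from the HIL_SENSOR docs quoted above):

```python
# Sketch: an accel+gyro-only update sets just bits 0-5 of fields_updated
# (bit 0 = xacc ... bit 5 = zgyro per the MAVLink docs); the other sensor
# fields in the message are then treated as not updated.
XACC, YACC, ZACC, XGYRO, YGYRO, ZGYRO = (1 << i for i in range(6))
imu_only = XACC | YACC | ZACC | XGYRO | YGYRO | ZGYRO  # == 0x3F

# Pass imu_only as the fields_updated argument of hil_sensor_send(),
# e.g. when your sim produces IMU samples faster than barometer samples.
```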

@Jaeyoung-Lim - in ML training scenarios, one would run thousands of episodes (at least), with each episode starting from a specific state. Restarting the process is possible, but it adds significant extra time to training. Hence, I am looking for a way to reset the state without restarting the process.

The documentation for the HIL_SENSOR message's “fields_updated” field seems to indicate that setting bit 31 should reset a bunch of state:

> bit 31: full reset of attitude/position/velocities/etc was performed in sim.

How does PX4 react to bit 31 being set?

Thanks!

@bmalhi I don’t think PX4 handles that bit in that message yet.

But even if it did, I don’t understand why you need to run SITL as part of your ML training. What do you mean by “resetting the state”? Which states are you considering?

If you put SITL in the loop as part of your training, you are basically extending your hidden state into thousands of dimensions, since your model plus the PX4 firmware has far more state than just the dynamic state (e.g., the PX4 commander has its own state machine, the navigator has its own state machine, ecl has its own states, etc.). This results in behavior that is hard for the network to account for unless you feed those states in as part of its input. If this is intentional, why is it useful?

Why not just train your network without SITL in the loop, using high-level commands (e.g., velocity commands, body-rate commands) computed from the dynamic states? At deployment you can then run the trained network against the state estimates from EKF2 in PX4 and feed its commands back in. In my experience, you get much shorter training times this way since you have fewer hidden states.
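As a sketch of what I mean (the dynamics, reward, and constants below are placeholders, not anything PX4-specific), note how cheap a reset becomes once the firmware is out of the loop:

```python
import numpy as np

class VelocityCommandEnv:
    """Sketch: RL environment driven by velocity commands, with no
    firmware in the loop. Dynamics and reward are placeholders."""

    def __init__(self, dt=0.02):
        self.dt = dt  # step size [s]
        self.reset()

    def reset(self):
        # Resetting is just reinitializing the dynamic state; there is no
        # commander/navigator/estimator state to worry about.
        self.pos = np.zeros(3)
        self.vel = np.zeros(3)
        return np.concatenate([self.pos, self.vel])

    def step(self, vel_cmd):
        # Placeholder dynamics: perfect velocity tracking. A response model
        # of the real velocity controller could be substituted here.
        self.vel = np.asarray(vel_cmd, dtype=float)
        self.pos += self.vel * self.dt
        obs = np.concatenate([self.pos, self.vel])
        reward = -np.linalg.norm(self.pos)  # placeholder objective
        return obs, reward
```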

@Jaeyoung-Lim - By “resetting the state” I mean taking PX4 back to the initial state it is in right after the process starts and the sim connects to it.

To clarify the scenario: we want to train an RL policy to control a drone. The RL policy takes camera images (plus IMU, etc.) as its input state and issues velocity commands as its output. Those velocity commands are then sent to PX4. In deployment, the policy would run on the drone's onboard computer.

To train this policy in simulation, we would like to replicate the same flow: the policy sends velocity commands to PX4; PX4 sends motor outputs to the sim; and the sim state (camera images, IMU, etc.) is sent back to the policy.
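One environment step would look roughly like this (a pymavlink sketch; `policy`, `get_sim_observation`, and the port are placeholders for our setup, and the vehicle is assumed to already be armed and in offboard mode):

```python
import time
from pymavlink import mavutil

conn = mavutil.mavlink_connection('udpin:0.0.0.0:14540')  # SITL offboard port
conn.wait_heartbeat()

# type_mask 0x0DC7: ignore position, acceleration, yaw, and yaw rate;
# only the vx, vy, vz fields of the setpoint are used.
VEL_ONLY_MASK = 0x0DC7

def env_step(obs):
    vx, vy, vz = policy(obs)  # placeholder: camera/IMU in, velocity command out
    # PX4 expects these setpoints streamed regularly while in offboard mode.
    conn.mav.set_position_target_local_ned_send(
        int(time.monotonic() * 1000) & 0xFFFFFFFF,  # time_boot_ms
        conn.target_system, conn.target_component,
        mavutil.mavlink.MAV_FRAME_LOCAL_NED,
        VEL_ONLY_MASK,
        0, 0, 0,      # x, y, z position (ignored)
        vx, vy, vz,   # velocity setpoint [m/s]
        0, 0, 0,      # acceleration (ignored)
        0, 0)         # yaw, yaw_rate (ignored)
    return get_sim_observation()  # placeholder: next camera image + IMU state
```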

To keep this policy as close as possible to the eventual deployment setup, it makes a lot of sense to train it with PX4 in the training loop. Eventually, the RL policy does the end-to-end control. It does NOT need to learn the dimensions of the PX4 state, but it makes a lot of sense to use the same components that would run on the real drone, so that the training data is closer to the real world (especially in the RL case, where the policy is not just reacting to the current state but learning how to achieve an eventual goal).

It is not clear to me why we would want to exclude PX4 from the training loop. It seems you are indicating that PX4 should be used only for final verification and not for training?

> It seems you are indicating that PX4 should be used only for final verification and not for training?

Yes.

> To keep this policy as close as possible to the eventual deployment setup, it makes a lot of sense to train it with PX4 in the training loop. Eventually, the RL policy does the end-to-end control. It does NOT need to learn the dimensions of the PX4 state, but it makes a lot of sense to use the same components that would run on the real drone, so that the training data is closer to the real world (especially in the RL case, where the policy is not just reacting to the current state but learning how to achieve an eventual goal).

I think your assumption is that PX4 is transparent, meaning that if you command velocity inputs from a given state you always get the same result. However, this is not true. PX4 has a lot of behaviors built in so that it can operate safely in the field. I would argue that if you want to capture the velocity controller dynamics in your training, you are better off modeling them separately in your simulation. That way you decouple your network from all the built-in safety logic, so you are not building something on top of those safety features.
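For example, the velocity tracking behavior could be approximated with a simple response model inside the simulation (a sketch; the first-order lag and its time constant are placeholders):

```python
def velocity_response(vel, vel_cmd, dt, tau=0.3):
    """Sketch: stand-in for PX4's velocity tracking, modeled as a
    first-order lag. tau [s] is a placeholder time constant that
    would be identified from real flight or SITL data."""
    alpha = dt / (tau + dt)
    return vel + alpha * (vel_cmd - vel)
```

This keeps the tracking dynamics in the training loop while leaving the commander/navigator/failsafe state machines out of it.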

If you integrate the whole firmware into the simulation, it will also give you huge headaches when you transfer to a real platform, since the parameters on your real vehicle will differ from those in SITL. From what you describe, it sounds like you are not planning to randomize the vehicle dynamics or the firmware parameters. Since the behavior of the firmware will then differ from what you trained your model on, the distribution the network has learned is most likely no longer relevant.