PX4 SITL and Gazebo gym reinforcement learning

Hey guys.
I’ve created a gym environment using MAVROS and PX4 SITL that publishes actions to the “/mavros/setpoint_raw/attitude” topic and reads observations from “/mavros/local_position/pose”.
Unlike other gym environments, there is no actual simulation “step”; instead, the simulation runs freely and I enforce a 33.3 Hz rate.
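
For readers who want to picture the rate enforcement, here is a minimal sketch in plain Python. It is not the code from the repository; the class name `FixedRate` and its injectable `clock`/`sleeper` arguments are made up for illustration. Inside a ROS node, `rospy.Rate(hz).sleep()` is the usual way to get the same behaviour.

```python
import time


class FixedRate:
    """Pace a loop so that successive sleep() calls are about 1/hz seconds apart.

    Hypothetical helper for illustration only.
    """

    def __init__(self, hz, clock=time.monotonic, sleeper=time.sleep):
        self.period = 1.0 / hz
        self.clock = clock
        self.sleeper = sleeper
        self.next_tick = clock() + self.period

    def sleep(self):
        """Sleep until the next tick; return the slack (negative if we overran)."""
        remaining = self.next_tick - self.clock()
        if remaining > 0:
            self.sleeper(remaining)
        # Schedule from the previous tick, not from "now", to avoid drift.
        self.next_tick += self.period
        now = self.clock()
        if self.next_tick < now:
            # We fell behind by more than one period: resynchronise.
            self.next_tick = now + self.period
        return remaining
```

In an environment like this one, `step()` would publish the action, call `sleep()` once, and then read the latest pose. A negative return value means the agent plus environment took longer than one period, which is worth logging because the agent then sees an irregular time step.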

For the moment I’m trying to teach the agent to reach (or maintain) a certain height, so the observation space is [ z_pos - desired_z_pos , z_vel ] and the action space is [ Thrust ].

The problem is that I’m having trouble getting RL agents to converge on this environment. I’ve tried multiple continuous-action-space agents (DDPG, PPO1, TRPO, SAC) from multiple libraries (baselines, stable-baselines, spinning-up, keras-rl), and none converge after more than 5e5 steps. My self-written DDPG and PPO agents do not converge either.

Using the provided PD controller (found in the git repository), the drone achieves height control perfectly, so it makes me wonder: is it an environment issue, or am I missing something?
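
To make the comparison concrete, a PD height controller over the same observation ([ z_pos - desired_z_pos , z_vel ]) and action ([ Thrust ]) can be sketched as below. This is not the controller from the repository: the gains, the hover thrust of 0.5, and the point-mass model used to check it are all assumptions chosen for illustration.

```python
def pd_thrust(z_err, z_vel, kp=0.3, kd=0.25, hover=0.5):
    """Normalised thrust in [0, 1] from height error (z - z_des) and vertical speed.

    kp, kd and hover are illustrative values, not tuned for any real vehicle.
    """
    u = hover - kp * z_err - kd * z_vel
    return min(1.0, max(0.0, u))


def simulate(z_des=2.0, dt=0.03, steps=600, g=9.81, hover=0.5):
    """Run the controller on an idealised point mass (assumed model).

    A thrust command equal to `hover` exactly cancels gravity.
    """
    z, vz = 0.0, 0.0
    for _ in range(steps):
        thrust = pd_thrust(z - z_des, vz, hover=hover)
        acc = g * (thrust / hover - 1.0)  # net vertical acceleration
        vz += acc * dt
        z += vz * dt
    return z, vz
```

A two-parameter linear law like this solves the task, so a learned policy has a very simple target to match. That makes it a handy baseline: rolling it out in the gym environment and logging the reward shows what return a converged agent should reach.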

I have uploaded the code to GitHub so that anyone can take a look and perhaps give some input.

I’ve been studying RL for only a couple of months, so I’m afraid I missed something here.


@Benykoz Not sure if this comment will be helpful for you, but running SITL directly to train an RL agent is not a good idea, since the state space of SITL includes not just the output states but also the internal states of the software (meaning that with the same position and velocity you can have many different states inside the PX4 firmware).

This means that you are trying to learn over a much bigger state space, which makes it harder for your agent to learn. If you are not feeding the internal software states to the agent as inputs, you are also making your problem partially observable, which makes it even more challenging to learn.

Therefore, it is a better approach to learn your policy in a stripped-down environment with much simpler states and dynamics, and then adapt this policy to fly the vehicle.
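
As a rough idea of what "stripped down" can mean for the height task, here is a sketch of a 1-D point-mass environment with a gym-style `reset()`/`step()` interface. Everything in it is an assumption made for illustration (the class name `AltitudeEnv`, the hover thrust of 0.5, the reward `-|z - z_des|`, the episode length); it is not reinmav-gym and not the environment from the original post.

```python
import random


class AltitudeEnv:
    """Hypothetical 1-D altitude task: observation [z - z_des, vz], action thrust in [0, 1]."""

    def __init__(self, z_des=2.0, dt=0.03, hover=0.5, g=9.81, max_steps=300, seed=None):
        self.z_des, self.dt, self.hover, self.g = z_des, dt, hover, g
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.z, self.vz, self.t = 0.0, 0.0, 0

    def _obs(self):
        return [self.z - self.z_des, self.vz]

    def reset(self):
        # Random start height so the policy sees errors of both signs.
        self.z = self.rng.uniform(0.0, 2.0 * self.z_des)
        self.vz = 0.0
        self.t = 0
        return self._obs()

    def step(self, thrust):
        thrust = min(1.0, max(0.0, float(thrust)))
        # Thrust equal to `hover` cancels gravity (assumed actuator model).
        acc = self.g * (thrust / self.hover - 1.0)
        self.vz += acc * self.dt
        self.z += self.vz * self.dt
        self.t += 1
        reward = -abs(self.z - self.z_des)  # assumed reward shape
        done = self.t >= self.max_steps
        return self._obs(), reward, done, {}
```

Here the full state is exactly what the agent observes, every step is deterministic, and hundreds of thousands of steps run in seconds. If an agent fails to converge even on this, the problem is in the agent or the reward rather than in SITL.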

Thanks for the input Jaeyoung-Lim. I was afraid that was the case.
As a follow-up, I see two options before me. What would you recommend?

  1. Training a policy on a simpler environment (I have reinmav-gym in mind) and then adapting it to the PX4 SITL environment. Alternatively, do you know of another stripped-down Gazebo-MAVROS/MAVSDK environment I can easily transfer to?
  2. Maybe feeding the agent more states or raw data (such as IMU data or IMU std)?

Thanks again, been following your work (and repository - great stuff, very helpful)


  1. reinmav-gym was specifically designed so that you can run large batches of simulation steps inexpensively. One of the crucial things you need to do in order to learn a robust policy is to randomize the dynamics. This is only tractable if you have a deterministic dynamics model and inject noise in various places so that the policy learns to cope with random disturbances.
    From reinmav-gym, you can easily transfer the policy to MAVROS / MAVSDK since it takes body rates as command inputs. This minimizes the problem of sim-to-real domain transfer, since the feedback loop inside the firmware behaves similarly in simulation and in the real world.

  2. In my opinion, this is not feasible. It is not just the IMU states, but the states of the whole software. There are various states inside the firmware: not only failsafes and flight modes, but also internal boolean states that would be impossible to keep track of. Also, since all the sensor states are simulated, it is hard to randomize the dynamics of the simulation, which prevents you from learning more robust policies.
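
The dynamics randomization mentioned in point 1 can be sketched roughly as follows: the model itself is deterministic, and all randomness comes from a seeded generator that perturbs the physical parameters once per episode and adds a disturbance force at each step. This is an illustrative 1-D example, not how reinmav-gym implements it; the `Dynamics` fields, nominal values, and noise levels are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Dynamics:
    mass: float        # kg
    max_thrust: float  # N produced at thrust command 1.0
    drag: float        # linear drag coefficient, N*s/m


NOMINAL = Dynamics(mass=1.5, max_thrust=30.0, drag=0.1)  # assumed values


def sample_dynamics(rng, nominal=NOMINAL, spread=0.2):
    """Draw per-episode parameters within +/- spread of the nominal ones."""
    def scaled(x):
        return x * rng.uniform(1.0 - spread, 1.0 + spread)
    return Dynamics(scaled(nominal.mass), scaled(nominal.max_thrust), scaled(nominal.drag))


def step(z, vz, thrust_cmd, dyn, rng, dt=0.03, g=9.81, force_noise=0.5):
    """Advance the 1-D model one step with a Gaussian disturbance force (std in N)."""
    thrust_cmd = min(1.0, max(0.0, thrust_cmd))
    force = (thrust_cmd * dyn.max_thrust
             - dyn.mass * g
             - dyn.drag * vz
             + rng.gauss(0.0, force_noise))
    vz += force / dyn.mass * dt
    z += vz * dt
    return z, vz
```

Calling `sample_dynamics(rng)` at every `reset()` forces the policy to work across a family of vehicles instead of overfitting to one. Because the only randomness comes from `rng`, a fixed seed reproduces an episode exactly, which is hard to achieve when the firmware and simulated sensors run asynchronously.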