Local position doesn't converge to vision pose data as expected

Hi there,
We are flying indoors without GNSS or compass, using a visual tag for positioning. The data coming from the visual tag is mostly accurate and arrives at around 18 Hz.

When I compare the local position with the vision position, I see that the local position doesn't converge to the vision data as expected. The attached plot shows what I mean.
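To put a number on the mismatch instead of only eyeballing the plot, here is a minimal sketch in plain Python. It assumes both signals have been exported from the log as lists of timestamps (seconds) and positions along one axis. The function names `interp`, `rms_error` and `best_lag` are my own, not part of any autopilot or logging API. It resamples the vision data onto the local position timestamps, computes the RMS difference, and tries a few candidate delays to see whether a constant time offset explains most of the gap.

```python
import bisect
import math

def interp(t_src, x_src, t):
    """Linearly interpolate the series (t_src, x_src) at time t.
    Assumes t_src is sorted and t_src[0] <= t <= t_src[-1]."""
    i = bisect.bisect_left(t_src, t)
    if i == 0:
        return x_src[0]
    w = (t - t_src[i - 1]) / (t_src[i] - t_src[i - 1])
    return x_src[i - 1] + w * (x_src[i] - x_src[i - 1])

def rms_error(t_local, x_local, t_vis, x_vis, lag=0.0):
    """RMS of local(t) - vision(t - lag), using only the samples
    where the shifted time falls inside the vision log."""
    errs = [
        x - interp(t_vis, x_vis, t - lag)
        for t, x in zip(t_local, x_local)
        if t_vis[0] <= t - lag <= t_vis[-1]
    ]
    if not errs:
        raise ValueError("no overlapping samples for this lag")
    return math.sqrt(sum(e * e for e in errs) / len(errs))

def best_lag(t_local, x_local, t_vis, x_vis, candidates):
    """Return the candidate lag (seconds) with the smallest RMS error."""
    return min(
        candidates,
        key=lambda lag: rms_error(t_local, x_local, t_vis, x_vis, lag),
    )
```

If the RMS error drops sharply at some non-zero lag, the difference is probably mostly a time offset between the two signals. If it stays large for every lag, the difference looks more like a bias or scale problem.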

Could vibration on the drone be causing the IMU to measure inaccurately and mislead the EKF?

Could the 18 Hz rate of the vision data be too low?
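The 18 Hz figure is an average, so dropouts or jitter could be hiding behind it. A small sketch for checking this from the logged vision timestamps is below. It assumes the timestamps are available as a sorted list in seconds, and `rate_stats` is a hypothetical helper name, not an existing tool.

```python
def rate_stats(stamps):
    """Summarise message timing from a sorted list of timestamps (seconds)."""
    if len(stamps) < 2:
        raise ValueError("need at least two timestamps")
    # Gaps between consecutive messages
    dts = [b - a for a, b in zip(stamps, stamps[1:])]
    mean_dt = sum(dts) / len(dts)
    return {
        "mean_hz": 1.0 / mean_dt,  # average message rate
        "min_gap_s": min(dts),     # shortest gap between messages
        "max_gap_s": max(dts),     # longest gap, reveals dropouts
    }
```

A `max_gap_s` much larger than 1/18 s would point to dropped or delayed vision messages rather than a steady stream that is simply too slow.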

Or what else could be the reason for this?