Skip to content

Conversation

karl-schulz
Copy link

@karl-schulz karl-schulz commented Jun 2, 2025

When the NavsatTransform node is starting with wait_for_datum, the is the possibility of a race condition. We had nodes crashing because of this. My interpretation of the race condition is this:

  1. The robot publishes a fix before the SetDatum service is called
  2. The NSTF tries to transform it to the map frame, but the map frame (still assumed at lat=0, lon=0) is far away from the received fix.
  3. As a result, the NSTF publishes a odometry/gps with very high x/y values (in the range of 40km).
  4. The EKF node (from this package) receives this message and sends the robot far away from the map frame, publishing a odometry/filtered/global an map->base_footprint TF message with high x/y values.
  5. The NavsatTransform then receives odometry/filtered/global with it's X/Y far outside the UTM origin/range, crashing.

This is the error log:

localization  | [navsat_transform_node-6] [INFO] [1745838634.424235191] [navsat_transform]: Datum set.
localization  | [navsat_transform_node-6] terminate called after throwing an instance of 'GeographicLib::GeographicErr'
localization  | [navsat_transform_node-6]   what():  Latitude 48.1346d more than 20d from N pole
localization  | [ERROR] [navsat_transform_node-6]: process has died [...]

I think this is because of theses issues:

  • The NavsatTransform should not publish odometry/gps before one of these conditions is met:
    • wait_for_datum parameter is False
    • The SetDatum is called with the initial datum
  • The NavsatTransform should not do any kind of processing of incoming messages before the datum is set

I fixed it in this PR and tested locally in a regression test that isolates the race condition. I also added some more debugging logs to track issues better - I can also leave that part out if you want.

I'd be very happy about feedback if I got the issue right and if somebody else is having similar issues.

@ayrton04
Copy link
Collaborator

Might be a stupid question, but why can't we just add this right inside the gpsFixCallback?

if (use_manual_datum_ && !(has_transform_gps_ && has_transform_imu_))
{
  return;
}

If we have wait_for_datum (use_manual_datum_) enabled, the only way for has_transform_gps_ and has_transform_imu_ to become true is for the datum parameter to be set or the service to be called with the datum parameter. So if we get any GPS poses before then, we can ignore them.

@ayrton04
Copy link
Collaborator

Actually, I take it back. This is the right way to do this. It also solves #927.

@ayrton04
Copy link
Collaborator

@karl-schulz would you mind rebasing this on jazzy-devel and making sure it passes the build farm checks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants