Data driven road graph enrichment

Data driven road graph enrichment

Problem description

As we know, the geographical map can be conveniently represented as a graph with nodes and egdes (Picture 1). Each node has a GPS location with latitude and longitude. Open source service Open Street Map uses this representation under the hood and all road graphs can be downloaded.


Many devices can send the information about their location so it is visible, how the users of these devices moved in time. Sometimes we need to juxtapose these GPS tracks with the road graph and to calculate the 'projection' of the track to the graph, i.e. the ordered sequence of nodes or edges.

This problem in computer science is called Map Matching (see the link for detailed description). There exist some powerful algorithms, but considering specifics of our problem, we decided to design and implement our own.


The applications for presented approach can be divided in two groups:

  1. Road graph examination. Open Street Maps are created by real users and there might be errors like missing edges, incorrect annotation etc. Also the graph can be incomplete on purpose, e.g. it can contain only highways or primary links. It means, there will be missing edges, if in reality there was a secondary link. More about OSM data structures you can read here.

  2. Road graph enrichment. If the user data containes additional information, for example time in traffic jams, the road graph can be dynamically enriched to provide the aggregation to other users.

Our use case was a challenging combination of both situations. We obtained the road graph from OSM for the whole Europe, but only for highways and primary links, as the full graph takes a lot of RAM to process. The Eurowag company provided us with freight vehicles telemetry, where there were vehicle locations, timestamps, fuel levels and a lot of other attributes. The task was to annotate the graph edges with average speed and average fuel consumption, given an incomplete graph.

Our approach


For shortness and convenience we will further call the road graph G(N, E, D), where N is a set of nodes and E is the set of edges, D is the lengths of the edges, the recorded track T and the obtained edge sequence with annotation <E', A> and node sequence P.

Proposed Algorithm

The proposed algorithm consists of the following steps (the picture below illustrates them):

  1. Assign the first node of P. It is the most time consuming and intuitevely misleading part (subpicture 2 and 3).

    1.1. Filter the track T. Some of the starting and ending points are too far from any node in G (up to kilometers) so there is no need to keep them.

    1.2. Using kd-tree structure we obtain 10-30 closest nodes (set K) around the first point. Then we cut a subtrack of 10-20 first points Ts from the T.

    1.3. We iterate over the set K and run the main procedure project_dijkstra() (the description is below) on Ts and try to 'fit' it (to calculate a projection).

    1.4. We chose the subprojection with smallest area between subtrack and subprojection.

    1.5. If the main procedure project_dijkstra() failed for all first node candidates, cut the track and perform a jump.

    1.6. Otherwise go to step 2.

  2. Continue slicing the T into subtracks of length 50-80 points, getting projections with project_dijkstra() and connecting the result with the previous iteration (subpicture 4). It is also good idea not to connect previous result with the next one directly, but with some overlap.

    2.1. But if project_dijkstra() fails, we need to perform a jump, by deleting next 5-10 track points (subpicture 5).

    2.2. If the jump happened, the algorithm performs a search for the new starting node from the trimmed track and repeats the step 1 (subpictures 6, 7).

  3. Finish, if there are no more track points left to slice (subpicture 8).

algorithm iteration

The core procedure is called project_dijkstra() and it implements the well-known Dijkstra shortest path in algorithm (if you are curious how it works, there is a good explaination here). The major difference is that we run it on a transformed graph G'(N, E, D'), which is created from G with the same sets of nodes and edges, but the distances are actually the areas between the subpath and the track T. The algorithm consists of following steps:

  1. There is a subtrack T' and a starting node Ns from the graph G. The area between them equals 0. Ns is added to the queue PQ (subpicture 1).

  2. The node n with the smallest area is popped from the PQ.

  3. We obtain the set of nodes connected with edges with the popped node from the graph G .

  4. Iterating through the set of nodes n_new, calculating the area between the sequence (..., n, n_new) and the subtrack T' and inserting into the PQ, if it is not there yet. The data annotation also happens here (subpicture 2).

  5. Repeat the step 2 (subpicture 3) until PQ is empty or the area nearby the last point is reached (subpicture 4).

algorithm dijkstra


The algorithm was designed with an assumption, that vehicles travel in highways and primary links most of their route. But there are cars travelling mostly within the same city and they have to be filtered out.


Proposed algorithm has following advantages:

  1. It is robust to switching from highways to secondary links and then back to highways, because it can perform a jump.

  2. Because of partitioning of the track for small pieces, the algorithm is robust to turns, loops and multiple passes through the same route.

  3. Initial filtering of the track reduces the number of points outside of the highway. Points are also filtered by the speed, as we do not need the clutter around warehouses and parkings.

  4. Works offline. The output is only dependent on history.

And disadvantages:

  1. Does not work online. The track should be recorded fully before processing.

  2. Relatively slow. If the procedure has to perform the jump relatively often, it has to check many starting points.

About the Author

Margarita Argirova

Margarita is data scientist in LibertyAces

You Might Be Interested in Reading These Articles

BitSwan Features v2003

We've just launched new features: MySQL AsyncLookup, Pipeline link/unlink, DirectSource, etc.

Continue reading ...

new release features anomaly async pipeline kafka

Published on March 16, 2020

BitSwan Features v2001

We've just launched new features: MySQL AsyncLookup, Pipeline link/unlink, DirectSource, etc.

Continue reading ...

new release features mysql async pipeline direct source

Published on February 11, 2020

BitSwan Features v1906

We've just launched new features: KafkaKeyFilter filtering over multiple keys, PostgreSQL sink and connection, Mongo query in LatchProcessor, Idle time in File sources, TimeDriftFilter etc.

Continue reading ...

new release features file encoding mongo asynchronous kafka latch time drift

Published on June 30, 2019