Data driven road graph enrichment
Problem description
As we know, the geographical map can be conveniently represented as a graph with nodes and egdes (Picture 1). Each node has a GPS location with latitude and longitude. Open source service Open Street Map uses this representation under the hood and all road graphs can be downloaded.
Many devices can send the information about their location so it is visible, how the users of these devices moved in time. Sometimes we need to juxtapose these GPS tracks with the road graph and to calculate the 'projection' of the track to the graph, i.e. the ordered sequence of nodes or edges.
This problem in computer science is called Map Matching (see the link for detailed description). There exist some powerful algorithms, but considering specifics of our problem, we decided to design and implement our own.
Applications
The applications for presented approach can be divided in two groups:

Road graph examination. Open Street Maps are created by real users and there might be errors like missing edges, incorrect annotation etc. Also the graph can be incomplete on purpose, e.g. it can contain only highways or primary links. It means, there will be missing edges, if in reality there was a secondary link. More about OSM data structures you can read here.

Road graph enrichment. If the user data containes additional information, for example time in traffic jams, the road graph can be dynamically enriched to provide the aggregation to other users.
Our use case was a challenging combination of both situations. We obtained the road graph from OSM for the whole Europe, but only for highways and primary links, as the full graph takes a lot of RAM to process. The Eurowag company provided us with freight vehicles telemetry, where there were vehicle locations, timestamps, fuel levels and a lot of other attributes. The task was to annotate the graph edges with average speed and average fuel consumption, given an incomplete graph.
Our approach
Notation
For shortness and convenience we will further call the road graph G(N, E, D)
, where N
is a set
of nodes and E
is the set of edges, D
is the lengths of the edges,
the recorded track T
and the obtained edge sequence with annotation <E', A>
and node sequence P
.
Proposed Algorithm
The proposed algorithm consists of the following steps (the picture below illustrates them):

Assign the first node of
P
. It is the most time consuming and intuitevely misleading part (subpicture 2 and 3).1.1. Filter the track
T
. Some of the starting and ending points are too far from any node inG
(up to kilometers) so there is no need to keep them.1.2. Using kdtree structure we obtain 1030 closest nodes (set
K
) around the first point. Then we cut a subtrack of 1020 first pointsTs
from theT
.1.3. We iterate over the set
K
and run the main procedureproject_dijkstra()
(the description is below) onTs
and try to 'fit' it (to calculate a projection).1.4. We chose the subprojection with smallest area between subtrack and subprojection.
1.5. If the main procedure
project_dijkstra()
failed for all first node candidates, cut the track and perform a jump.1.6. Otherwise go to step 2.

Continue slicing the
T
into subtracks of length 5080 points, getting projections withproject_dijkstra()
and connecting the result with the previous iteration (subpicture 4). It is also good idea not to connect previous result with the next one directly, but with some overlap.2.1. But if
project_dijkstra()
fails, we need to perform a jump, by deleting next 510 track points (subpicture 5).2.2. If the jump happened, the algorithm performs a search for the new starting node from the trimmed track and repeats the step 1 (subpictures 6, 7).

Finish, if there are no more track points left to slice (subpicture 8).
The core procedure is called project_dijkstra()
and it implements the wellknown Dijkstra shortest path in
algorithm (if you are curious how it works, there is a good explaination here).
The major difference is that we run it on a transformed graph G'(N, E, D')
, which is created
from G
with the same sets of nodes and edges, but the distances are actually the areas between
the subpath and the track T
. The algorithm consists of following steps:

There is a subtrack
T'
and a starting nodeNs
from the graphG
. The area between them equals 0.Ns
is added to the queuePQ
(subpicture 1). 
The node
n
with the smallest area is popped from thePQ
. 
We obtain the set of nodes connected with edges with the popped node from the graph
G
. 
Iterating through the set of nodes
n_new
, calculating the area between the sequence(..., n, n_new)
and the subtrackT'
and inserting into thePQ
, if it is not there yet. The data annotation also happens here (subpicture 2). 
Repeat the step 2 (subpicture 3) until
PQ
is empty or the area nearby the last point is reached (subpicture 4).
Results
The algorithm was designed with an assumption, that vehicles travel in highways and primary links most of their route. But there are cars travelling mostly within the same city and they have to be filtered out.
Proposed algorithm has following advantages:

It is robust to switching from highways to secondary links and then back to highways, because it can perform a jump.

Because of partitioning of the track for small pieces, the algorithm is robust to turns, loops and multiple passes through the same route.

Initial filtering of the track reduces the number of points outside of the highway. Points are also filtered by the speed, as we do not need the clutter around warehouses and parkings.

Works offline. The output is only dependent on history.
And disadvantages:

Does not work online. The track should be recorded fully before processing.

Relatively slow. If the procedure has to perform the jump relatively often, it has to check many starting points.