Data driven road graph enrichment
As we know, the geographical map can be conveniently represented as a graph with nodes and egdes (Picture 1). Each node has a GPS location with latitude and longitude. Open source service Open Street Map uses this representation under the hood and all road graphs can be downloaded.
Many devices can send the information about their location so it is visible, how the users of these devices moved in time. Sometimes we need to juxtapose these GPS tracks with the road graph and to calculate the 'projection' of the track to the graph, i.e. the ordered sequence of nodes or edges.
This problem in computer science is called Map Matching (see the link for detailed description). There exist some powerful algorithms, but considering specifics of our problem, we decided to design and implement our own.
The applications for presented approach can be divided in two groups:
Road graph examination. Open Street Maps are created by real users and there might be errors like missing edges, incorrect annotation etc. Also the graph can be incomplete on purpose, e.g. it can contain only highways or primary links. It means, there will be missing edges, if in reality there was a secondary link. More about OSM data structures you can read here.
Road graph enrichment. If the user data containes additional information, for example time in traffic jams, the road graph can be dynamically enriched to provide the aggregation to other users.
Our use case was a challenging combination of both situations. We obtained the road graph from OSM for the whole Europe, but only for highways and primary links, as the full graph takes a lot of RAM to process. The Eurowag company provided us with freight vehicles telemetry, where there were vehicle locations, timestamps, fuel levels and a lot of other attributes. The task was to annotate the graph edges with average speed and average fuel consumption, given an incomplete graph.
For shortness and convenience we will further call the road graph
G(N, E, D), where
N is a set
of nodes and
E is the set of edges,
D is the lengths of the edges,
the recorded track
T and the obtained edge sequence with annotation
<E', A> and node sequence
The proposed algorithm consists of the following steps (the picture below illustrates them):
Assign the first node of
P. It is the most time consuming and intuitevely misleading part (subpicture 2 and 3).
1.1. Filter the track
T. Some of the starting and ending points are too far from any node in
G(up to kilometers) so there is no need to keep them.
1.2. Using kd-tree structure we obtain 10-30 closest nodes (set
K) around the first point. Then we cut a subtrack of 10-20 first points
1.3. We iterate over the set
Kand run the main procedure
project_dijkstra()(the description is below) on
Tsand try to 'fit' it (to calculate a projection).
1.4. We chose the subprojection with smallest area between subtrack and subprojection.
1.5. If the main procedure
project_dijkstra()failed for all first node candidates, cut the track and perform a jump.
1.6. Otherwise go to step 2.
Continue slicing the
Tinto subtracks of length 50-80 points, getting projections with
project_dijkstra()and connecting the result with the previous iteration (subpicture 4). It is also good idea not to connect previous result with the next one directly, but with some overlap.
2.1. But if
project_dijkstra()fails, we need to perform a jump, by deleting next 5-10 track points (subpicture 5).
2.2. If the jump happened, the algorithm performs a search for the new starting node from the trimmed track and repeats the step 1 (subpictures 6, 7).
Finish, if there are no more track points left to slice (subpicture 8).
The core procedure is called
project_dijkstra() and it implements the well-known Dijkstra shortest path in
algorithm (if you are curious how it works, there is a good explaination here).
The major difference is that we run it on a transformed graph
G'(N, E, D'), which is created
G with the same sets of nodes and edges, but the distances are actually the areas between
the subpath and the track
T. The algorithm consists of following steps:
There is a subtrack
T'and a starting node
Nsfrom the graph
G. The area between them equals 0.
Nsis added to the queue
nwith the smallest area is popped from the
We obtain the set of nodes connected with edges with the popped node from the graph
Iterating through the set of nodes
n_new, calculating the area between the sequence
(..., n, n_new)and the subtrack
T'and inserting into the
PQ, if it is not there yet. The data annotation also happens here (subpicture 2).
Repeat the step 2 (subpicture 3) until
PQis empty or the area nearby the last point is reached (subpicture 4).
The algorithm was designed with an assumption, that vehicles travel in highways and primary links most of their route. But there are cars travelling mostly within the same city and they have to be filtered out.
Proposed algorithm has following advantages:
It is robust to switching from highways to secondary links and then back to highways, because it can perform a jump.
Because of partitioning of the track for small pieces, the algorithm is robust to turns, loops and multiple passes through the same route.
Initial filtering of the track reduces the number of points outside of the highway. Points are also filtered by the speed, as we do not need the clutter around warehouses and parkings.
Works offline. The output is only dependent on history.
Does not work online. The track should be recorded fully before processing.
Relatively slow. If the procedure has to perform the jump relatively often, it has to check many starting points.