imagico.de

Assessing the OpenStreetMap coastline data quality

One aspect of OpenStreetMap data everyone using it should know, and everyone using it on a larger scale will sooner or later realize, is that data quality varies strongly. The coastline data i have been working quite a lot with recently is somewhat special in this regard since it is the only type of data on OpenStreetMap that is routinely processed as a whole by the OpenStreetMap project itself, and as a result there is some assurance of consistency in the data (which is an ongoing struggle however – see Jochen Topf's recent status report). This does not say anything about the accuracy of the data, which i will try to assess here.

I have written previously that the OpenStreetMap coastline data has reached a fairly good overall level of quality – a claim i have not backed up with any verifiable information though. The major difficulty of evaluating the accuracy of any geographic data set is that you would need much better data of the same features as reference for an accurate assessment. Specifically, if you have a coastline data set you would need to know the shape of the actual coastline significantly more precisely than the data you wish to evaluate to actually say something about its accuracy.

Now while there are certainly local data sets with coastline information more detailed than OpenStreetMap data, most of them are not so much more detailed that differences between these data sets and OpenStreetMap data could comfortably be attributed exclusively to inaccuracies in the latter. All global data sets (some conveniently listed on the OpenStreetMap wiki) are at least in parts less detailed than OpenStreetMap data, so they cannot be used as a reference either.

How to assess data quality without a reference

So in the end we have to think about how to evaluate the data without a reference. The first idea that comes to mind is to measure the density of data points. The coastline data is represented as line strings – a set of points on the earth surface connected by straight lines (what exactly straight means here will be left to discuss another day). The denser the data points the better the actual coastline can be represented. But there are two problems with this approach: First a smooth coastline can be fairly accurately represented by very few nodes with a lot of distance between them. Second – if the nodes are not at the actual coastline a very detailed coastline representation in the data can still be inaccurate.


Illustrations: (1) rough coastline with low detail approximation; (2) smooth coastline with low detail approximation; (3) coastline with detailed but inaccurate approximation

The latter possibility is something that cannot be solved without accurate reference data. In the simplest case of a constant offset as shown in the above illustration the data might look exactly like the reference except for the offset and still be inaccurate. We can however do something about the former. The major difference in the form of the red lines in the first and second image above is the angle between the line segments at the nodes. Calculating that gives us a second measure to characterize the data in addition to the node distance.

Before i now analyze the coastline data i would like to take a step back and see what we are actually looking at here. This requires understanding how the data is actually acquired. In case of OpenStreetMap this varies somewhat depending on the data sources but data is nearly always based on aerial or satellite images and processing essentially can be characterized by the following drawings:


Illustrations: (1) coastline data acquisition based on image classification; (2) coastline data acquisition by manual tracing of images

Most other coastline data sets, as well as a large part of the OpenStreetMap data which has been imported from other sources (most notably PGS), have been produced from satellite images by land/water classification and subsequent vectorization of the classification mask. Those parts of the OSM coastline produced by community mapping on the other hand are manually traced on aerial or satellite images by the mappers.

Now as i said, the lack of a reference to compare the data to restricts our ability to look at this. Since we only have the final data to analyze, we can essentially only look at the steps in between through the filter of the processing steps coming afterwards. As a result we have a clear view of the vectorization/tracing step producing the line string data we analyze. But saying something about the quality of the data requires a deeper look into the process, which we will only be able to do indirectly and only by making assumptions about the vectorization/tracing step.

Data analysis

I will start with an analysis of the average distance between coastline points across the earth surface. The color of each pixel in the following image represents the average node distance of the coastlines within the area of this pixel.

It can be seen that the average point distance varies quite significantly, and this variation is to a large extent neither random nor clearly attributable to differences in the smoothness of the actual coastline. Note the upper limit for the color scale of 850m has no particular significance. There are some (smaller) parts where the average distance is even larger.

It can be clearly seen that those parts of the earth otherwise densely mapped, in particular Europe and Japan feature a high coastline node density – as to be expected. What's more interesting are abrupt changes in node density, like at the North American west coast at the border between Canada and Alaska or in Greenland the difference between east and west coast.
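The per-pixel average node distance discussed above could be computed along the following lines. This is only a minimal sketch under stated assumptions, not the method actually used for the map: the coastline is assumed to be a list of (lon, lat) tuples in degrees, distances use the haversine formula on a spherical earth with a mean radius of 6371 km, each segment is assigned to the grid cell containing its midpoint, and the function names and the 1 degree cell size are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371000.0  # assumption: spherical earth, mean radius

def haversine_m(p, q):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def average_node_distance(line, cell_deg=1.0):
    """Average segment length in meters per grid cell, keyed by (column, row)."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [total length, segment count]
    for p, q in zip(line, line[1:]):
        # Assign the segment to the cell containing its midpoint.
        # Note: this simple midpoint is wrong for segments crossing the 180 degree line.
        mid_lon = (p[0] + q[0]) / 2
        mid_lat = (p[1] + q[1]) / 2
        cell = (math.floor(mid_lon / cell_deg), math.floor(mid_lat / cell_deg))
        sums[cell][0] += haversine_m(p, q)
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}
```

Averaging per segment rather than per node is a design choice of this sketch; for long line strings the difference is negligible.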

Next i have a similar image of the average derivation angle at the coastline nodes. Zero degrees means a node is right in the middle between the previous and the next node in the line and 90 degrees means the line is making a right angle turn to either side at this node.
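The angle defined here can be sketched as the difference in direction between the incoming and the outgoing segment at each node. The following is a minimal sketch assuming nodes are (lon, lat) tuples in degrees and using a local planar approximation (longitude differences scaled by the cosine of the latitude), which should be adequate for typical node distances but not for very long segments or near the poles. The function names are hypothetical.

```python
import math

def deviation_angle(prev, node, nxt):
    """Turning angle in degrees (0..180) at `node`; 0 means the line continues straight."""
    # Shrink longitude differences according to the latitude of the node.
    scale = math.cos(math.radians(node[1]))
    ax, ay = (node[0] - prev[0]) * scale, node[1] - prev[1]
    bx, by = (nxt[0] - node[0]) * scale, nxt[1] - node[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return abs(math.degrees(math.atan2(cross, dot)))

def average_deviation_angle(line):
    """Mean turning angle over all interior nodes of a line string."""
    angles = [deviation_angle(a, b, c) for a, b, c in zip(line, line[1:], line[2:])]
    return sum(angles) / len(angles) if angles else 0.0
```

Using `atan2` on the cross and dot product avoids the numerical problems of `acos` for nearly straight lines.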

The first remarkable thing is that the average derivation angle is quite high (the global average is 63 degrees). On the scale we are looking at here (between a few and a few hundred meters) clearly defined sharp corners are not particularly common on the earth's coastline. So a large average angle indicates the coastline data is not a particularly good representation of the imagery it is derived from, or in other words: by using more detailed vectorization or manual tracing (see the diagrams above) a significantly more accurate coastline could probably have been produced from the same image data. Why has this not been done? The reason is probably frugality with respect to data size. Often during data production the goal is not maximum accuracy but compliance with data volume constraints.

Most of the areas with universally high average angle (the red/orange parts in the above map) are regions where data has been imported from PGS. This data has a quite unique signature both with respect to average point distance and average derivation angle. This can be observed most clearly in southern Alaska, southern Chile and western Greenland. This signature is probably a result of the vectorization technique used to produce this data, but note that not all coastline data produced with automated techniques will be similarly uniform. In fact efficient vectorization techniques will vary the node density depending on the smoothness of the line. This can be seen at the coast of western Poland for example (although here it is probably manually mapped) where the average node distance is high (among the highest in central Europe) while the average angle is low. The actual coastline is very smooth here, so only few nodes are required for an accurate representation and they still show a low derivation angle.

Since there are very few actual corners in the real coastline at the scale we talk about here, it can be reasonably said that there is an optimum derivation angle of about 10-20 degrees. How close the data is to this optimum does not say anything about the accuracy of the data but about how well the line strings represent the underlying data source. If the angle is much larger, like in case of the PGS data, information that is in the source data is lost in the too coarse vector representation. If the angle is much lower, the data is actually more detailed than necessary.

Using these observations we can now try to combine these two measurements – the average node distance and the average derivation angle – into a combined quality value:

log(dist_avg*max(10, angle_avg))

Put into words this formula means: a low distance between nodes results in a low value (=high quality of the data) even if the average angle is high. A low average angle equally results in a low combined value even if the node distance is high but angles lower than 10 degrees are not considered to improve quality. The logarithm is just for scaling. The resulting combined quality value is shown in the following image:
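As a minimal sketch, the formula can be written as a small function. The text does not state the base of the logarithm or the units, so the natural logarithm, distances in meters and angles in degrees are assumed here; the function name is hypothetical.

```python
import math

def combined_quality(dist_avg_m, angle_avg_deg):
    """log(dist_avg * max(10, angle_avg)); lower values suggest better data."""
    return math.log(dist_avg_m * max(10.0, angle_avg_deg))

# Global averages quoted in this article: 66 m node distance, 63 degrees angle.
print(round(combined_quality(66, 63), 2))  # 8.33

# Angles below 10 degrees do not improve the value any further.
print(combined_quality(100, 5) == combined_quality(100, 10))  # True
```

Because of the logarithm, halving the node distance lowers the value by the same amount as halving the average angle (as long as the angle stays above 10 degrees).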

Note this measure of 'quality' is to be interpreted with care. While red areas are most likely areas with fairly low accuracy it does not really work the other way round: Not all blue sections are really good quality. This assessment particularly fails if there has been a significant amount of smoothing applied to the data.

This map indicates three larger areas where improvements in data quality are most seriously needed:

Indonesia and surrounding (in particular Malaysia, Thailand and Burma)
The Caspian Sea
Eastern Greenland

There are various other smaller areas not very well represented, especially some of the arctic and subantarctic islands but the above are the largest areas.

It should be noted that at least Indonesia is much better represented in the PGS data – apparently this has never been imported there (or in an older version that was less accurate). However since there has been a lot of manual mapping in the region meanwhile large scale imports of coastline data will be difficult.

A few highlights and numbers

Finally i would like to point out a few highlights and examples from the data as well as some overall statistics:

I already mentioned the global average derivation angle (63 degrees) – the average distance between nodes is 66 meters resulting in a total coastline length of 1.99 million kilometers. The longest segment that is not artificially closing the coastline at the 180 degree line (those were not considered in the statistics here) is 23.7 kilometers long but this is not actually part of a coastline but runs across the mouth of a river (there are several cases of river mouths with the coastline ending very far towards the sea leading to long straight segments).

The regions that would be most important to improve have already been mentioned above. I will point out two specific examples of particularly bad data quality here:

While the former of these two is prominently visible in deep red in the last map above the latter is not - although whole fjords of more than 50 km length are missing there. This is because the data is smoothed significantly. I don't know the exact source but it could be from fairly old maps, possibly based on inaccurate/incomplete surveys of the region.

On the positive side of regions where the OSM coastline is particularly detailed i would like to point out Estonia where the coastline is not only very detailed but the average derivation angle is right in the optimal range described above. There are probably various urban harbor coastlines that have been mapped in even more detail down to sub meter accuracy but i am going to skip those in this global analysis.

Christoph Hormann, April 2013

Visitor comments:

by Florin Badita from Romania posted on Thu Sep 24 2015 16:39:38

How do you calculate the average angle between the nodes ?

by chris from Germany posted on Thu Sep 24 2015 17:20:22

Computing angles at the nodes of a way requires - in the general case - a bit of spherical trigonometry, as described here:

https://en.wikipedia.org/wiki/Spherical_law_of_cosines

but for the usual node distances of ways in OSM you can also simply use the planar formula (i.e. the difference in planar direction between the two segments surrounding the node).

by verdy_p from France posted on Thu Apr 25 2013 11:13:52

How long is the coastline of Brittany in France? This is a well-known problem, because there's no answer to it. Brittany (there are similar examples elsewhere in the world) is well known for its fractal type of coastline, and however much you refine the coastline, it's impossible to reach a point where the deviation angles will be lower than 10 degrees on MOST of its length.

If you really continue to do that, you'll fall below the level of accuracy of the data, because of the very noticeable effects of tide.

So the result that could be reached in Estonia (and could be reached easily for "flat" countries like Belgium or the Netherlands, or for regions of France like Aquitaine) cannot be reached everywhere, but only on coastlines bordered mostly by long beaches (e.g. most coasts of Spain on the Mediterranean Sea, but not around Brittany, or the coastlines of Portugal and Galicia in Northwest Spain).

**All** attempts made by national geographic institutes have failed to reach a situation where you fall below a maximum deviation angle.

So your formula should take this into account more closely: it makes no sense to increase the details around a region whose coastline is now very complex to handle, even when it has been split into several parts using administrative borders as limits (the split by country was not enough, the split by admin region at level 4 was not enough, the split by admin department at level 6 was not enough, the split by admin arrondissement does not give a more meaningful limit, and the next level is the split by municipality at level 8). Still this is not enough!

So now we split the coastline where we find enough river mouths or beaches, because these segments can be smoothed to reach the level of accuracy. But for everything else, on rocky coastlines, the only level we can reach is to be below a reasonable length. Given the precision of orthophotography, and the effects of tide, it makes no sense going below a limit of 3 meters, even if deviation angles are much higher than 20 degrees (in harbours, we frequently find very sharp angles, and even if you probably should "round" them a little to be below 25 degrees, creating "octagonal" shapes at the ends of dikes, this will be enough, even if segments can reach about 50 meters).

So your formula should be something like:

max(avg_length/(90°/deviation_angle), 3 meters)

You can take the log of it, for scaling the result, i.e.

log(max(avg_length/(90°/deviation_angle), 3m))

You can transform it into:

max(log(avg_length) + log(dev_angle) - log(90°),
 log(3m))

And then the two constants can be simplified:

max(log(avg_length) + log(dev_angle),
 log(3m) + log(90°))

Then to avoid the errors of logs with zeros or small values:

max(log(max(avg_length, 3m)) +
 log(max(dev_angle, 10°)),
 log(3m) + log(90°))

This can be computed incrementally, angle node by angle node (avg_length is the average of the two neighbouring segments); it should return 0 almost everywhere, or a positive value; then sum the result and divide by the total number of nodes until you've crossed a grid cell for producing your final graphic.

This only depends on two parameters:
- the target minimum distance of 3m.
- the target minimum angle of 10 degrees.
You may reduce these parameters (notably the second one if you want), but should not set any one of them to zero.

But using the max() only on the deviation angle without considering the length is wrong (this is equivalent to setting the target minimum distance to 0, and this is not reasonable and will be wrong because there's no such data with this accuracy, even if orthophotography may reach decimetric precision now for details, but true geolocalisation and orthophotography have their own inaccuracies).

Experiment with the values of these parameters by running them on some complex regions like the coasts of Brittany, Galicia, Norway, Scotland, Greece, California, Japan, Indonesia, or Chile, where the sea border has very noticeable elevation on the land side and is fragmented as complex fractals (maybe the elevation model of nearby lands and of average tides could help tweaking these parameters locally).

Otherwise you'll instruct people to detail each small rocklet that could move around, producing too much data that won't make any sense.

by chris from Germany posted on Thu Apr 25 2013 12:22:29

Hello verdy_p,

you raise a lot of valid points but i'd like to point out a few things concerning those i consider important:

- the nature of the OpenStreetMap project forbids claiming a certain minimum distance or maximum accuracy to exist and assuming everything beyond that is 'too much data'. Everyone should be allowed to map in as much detail as he/she wants to.

- The only argument that would actually speak against increasing the detail would be if this detail does not contain factual information. The tidal argument does not really apply here since the definition of the coastline is always independent of the tide. The other argument that could be used is that the coastline changes so rapidly that the small detail is outdated already when entered into the database. Especially in the case of rocky coastlines this does not apply either - erosion rates of rocky shores are usually in the range of millimeters-centimeters per year so even a sub-meter accuracy measurement would often be valid for decades.

- my use of the derivation angle is in no way meant as an absolute criterion - as you point out it is useless for this purpose since the derivation angle does not generally converge to a minimum. I use it purely as a measure how well the vector data represents the data source it is derived from. As such it is quite suitable and the minimum angle of 10 degrees seems reasonable.

- your ideas for alternative quality scores are interesting although introducing a lower bound for the distance of 3m would not actually change much since node distances below this are rare.

by Harry Wood from United Kingdom posted on Sat Apr 13 2013 04:31:33

That's neat. I wrote the PGS wiki page you've linked there, and I've added a link back to this.

Whenever I need to make a test edit of some OpenStreetMap data, I do a bit of PGS fixup. I go to somewhere random in Siberia or Alaska and tidy things up. At the moment it's never difficult to find somewhere that needs work, but maybe if more people do the same, we'll start to turn things blue on your colour scale.

...Or maybe not. Loading in a bit of imagery in these remote spots is quite an awe inspiring demonstration of the vastness of our planet.

What your map shows me though, is that the tropical Asian coastline needs work too. I'll go edit there next time for a bit of variety!

