San Diego, California.
I spent the most memorable years in San Diego. It’s considered as my second hometown. The city is beautiful and diverse in all aspects, excluding the weather, of course.
https://www.openstreetmap.org/export#map=11/32.8433/-117.0607
All the cleaning is about minutia. Most of them can be treated individually.
Swapped entries of zipcode and street name:
{‘value’: ‘10006’, ‘type’: ‘addr’, ‘id’: ‘132979649’, ‘key’: ‘housenumber’}
{‘value’: ‘Scripps Ranch Blvd.’, ‘type’: ‘addr’, ‘id’: ‘132979649’, ‘key’: ‘postcode’}
{‘value’: ‘92131’, ‘type’: ‘addr’, ‘id’: ‘132979649’, ‘key’: ‘street’}
{‘value’: ‘yes’, ‘type’: ‘regular’, ‘id’: ‘132979649’, ‘key’: ‘building’}
After auditing the street types of the OSM file, I added several street types which are pretty common in this area, such as: “Circle”, “Point”, “Row”, “Terrace”, “Walk”, “Way”, to the “expected” list. Also, the mapping dictionary is updated. All the typos and abbreviation of street types will be treated together.
Then I test running the update to see if the transformation/normalization works as expected.
Since this error/typo is quite case specific, it’s easier to look up the issue (they share the same ways.id) in the csv file and then make updates directly.
sandiego.osm - 100.6MB
nodes.csv - 29.6MB
nodes_tags.csv - 18.7MB
ways.csv - 2.3MB
ways_nodes.csv - 8MB
ways_tags.csv - 6MB
sandiego.db - 59MB
SELECT COUNT(*) FROM nodes;
360582
SELECT COUNT(*) FROM ways;
38477
SELECT COUNT(DISTINCT uid) FROM
(SELECT uid FROM nodes UNION SELECT uid FROM ways) AS subq;
628
SELECT COUNT(*) FROM nodes_tags WHERE key = ‘amenity’ and value = ‘atm’;
31
Following the previous statistics, I’m curious that which banks are most likely to install atm machines.
SELECT aa.value AS Operator, count(aa.id) AS num
FROM (SELECT nodes_tags.id AS tt FROM nodes_tags WHERE key = ‘amenity’ AND value = ‘atm’) AS sub
JOIN nodes_tags AS aa ON sub.tt = aa.id
WHERE aa.key = ‘operator’
GROUP BY aa.value
ORDER BY num DESC
LIMIT 3;
Operator | num |
---|---|
“Wells Fargo” | 6 |
“Bank of America” | 4 |
“Nautilus Hyosung” | 4 |
The numbers are much less than I expected. I highly suspect the data is not complete.
SELECT aa.value AS Cuisine, COUNT(aa.id) AS num FROM (SELECT nodes_tags.id AS tt FROM nodes_tags WHERE key = ‘amenity’ AND value = ‘restaurant’) AS sub
JOIN nodes_tags AS aa ON sub.tt = aa.id
WHERE aa.key = ‘cuisine’
GROUP BY aa.value
ORDER BY num DESC
LIMIT 10;
Cuisine | num |
---|---|
pizza | 21 |
mexican | 14 |
chinese | 13 |
american | 12 |
japanese | 10 |
thai | 10 |
sushi | 9 |
italian | 8 |
burger | 6 |
indian | 5 |
Everyone loves pizzas, and of course mexico is right next door!
SELECT nodes.id, nodes_tags.key, nodes_tags.value, nodes.version FROM nodes LEFT JOIN nodes_tags
ON nodes.id = nodes_tags.id
ORDER BY version DESC
LIMIT 10;
nodes.id|nodes_tags.key|nodes_tags.value|nodes.version
—|—|—|—
60632894|||17
60533001|||16
60632893|||16
145588850|||16
288449058|||16
310978181|||16
288449057|||15
319438613|highway|stop|14
49751676|highway|motorway_junction|13
49751676|ref|21|13
The most modified nodes were edited 17 times. 2 nodes here are highway stops/junctions. The top 7 don’t have any tags associated. So another JOIN is performed to see if they’re nodes along ways.
SELECT nodes.version, nodes.id, ways_nodes.position, ways_nodes.id, ways_tags.value
FROM nodes LEFT JOIN ways_nodes ON nodes.id = ways_nodes.node_id JOIN ways_tags ON ways_nodes.id = ways_tags.id
WHERE ways_tags.key = ‘name’
ORDER BY version DESC
LIMIT 10;
nodes.version | nodes.id | ways_nodes.position | ways_nodes.id | ways_tags.value |
---|---|---|---|---|
17 | 60632894 | 20 | 313201101 | “San Diego Northern Railway” |
16 | 60533001 | 0 | 102698586 | Green |
16 | 60533001 | 27 | 159338263 | Green |
16 | 60632893 | 17 | 313201101 | “San Diego Northern Railway” |
16 | 288449058 | 24 | 313201101 | “San Diego Northern Railway” |
16 | 310978181 | 11 | 358587259 | “San Diego Northern Railway” |
15 | 288449057 | 27 | 313201101 | “San Diego Northern Railway” |
14 | 319438613 | 0 | 29043282 | “Justice Lane” |
14 | 319438613 | 6 | 331955334 | “Voigt Drive” |
13 | 49751676 | 2 | 25863005 | “San Diego Freeway” |
Since one nodes may have multiple tags and appear on multiple routes, so duplicate id’s still make sense. These nodes are revised most often probably because of the on-going downtown infrastructure construction, leading to detours frequently. The “Green” actually means the Green Line Trolley in downtown San Diego area, which connects several sightseeing spots, highly recommend!
Noticeably there’re a bunch of “FIXME” and “fixme” keys. First I update all the “fixme” to be “FIXME” by
UPDATE nodes_tags SET key = ‘FIXME’ WHERE key = ‘fixme’;
UPDATE ways_tags SET key = ‘FIXME’ WHERE key = ‘fixme’
I should have done that before the data transportaion. But this happens in real life that some flaws don’t show up until it’s accidentally encountered. So I decided keep going.
Then,
SELECT * FROM nodes_tags WHERE key = ‘FIXME’;
Some Results are as follows,
1423657933,FIXME,”is this a playground?”
1423658080,FIXME,”looks like this is a skatepark. confirm”
1474749189,FIXME,”find name”
1637062320,FIXME,”don’t know exact name ofthis place”
2120326491,FIXME,”tiger had this road here but no idea where it really is”
2129886368,FIXME,”where is this road? tiger not right based on bing maps”
These entries are spots that need verification from locals or some research from users. I’m thinking of two ways to solve this problem.
As mentioned at the beginning, some addresses are labeled as “street”, but actually contain the full addresses and even the crossing road. While another label “full” flags several full addresses. Whether the “full” label is necessary or not is unconfirmed, communications with OpenstreetMap may be required to make such decisions.
The data is not complete, even with a lot contributors. Some cleaning need further confirmation, although it doesn’t affect the queries performed in this exercise. We can still extract plentiful information from the existing data.
https://docs.python.org/2/library/re.html
https://docs.python.org/2/library/xml.etree.elementtree.html
https://wiki.openstreetmap.org/wiki/Key:amenity#Sustenance
https://gist.github.com/carlward/54ec1c91b62a5f911c42#file-sample_project-md
https://www.w3schools.com/sql/default.asp
and the Udacity Data Wrangling course materials.