ad04145791b59754c6a78f4dac94efdf92ede89d galt Wed Jun 19 15:29:48 2019 -0700 Updated geoIp mapping for our 3 official mirrors - RR, euro, asia with much better quality data from MaxMind. diff --git src/hg/geoIp/README src/hg/geoIp/README deleted file mode 100644 index 7bc30b8..0000000 --- src/hg/geoIp/README +++ /dev/null @@ -1,209 +0,0 @@ -NEW genome-asia! - -Our previous update was from 2011 when genome-euro was first made, -so we are going to update everything with fresh data. - -RESTARTING update from ground zero, -all tables backed up and cleared out on hgFixed - -wget 'http://dev.maxmind.com/static/csv/codes/iso3166.csv' -O country.csv - -To convert csv to tab-separated (removing quotes): -cat country.csv | gawk -F "," '{print $1 "\t" $2}' | sed 's/"//g' > country.tab - - -wget 'http://dev.maxmind.com/static/csv/codes/country_continent.csv' -O countryToContinent.csv -To convert csv to tab-separated: -cat countryToContinent.csv | gawk -F "," '{print $1 "\t" $2}' > countryToContinent.tab - -https://gist.github.com/nobuti/3816985 -Manually made continent.csv -It seems odd that nobody has an easy to download cvs or tab-delimited continents list. -To convert csv to tab-separated: -cat continent.csv | gawk -F "," '{print $1 "\t" $2}' > continent.tab - -I have my data, so now I load the tables: - -This is a little script that runs autoSql on each .as to get the .sql files -However these .sql files are not checked in to git since they are -so easily generated. - -./make-sql - -took the old gbNode.tab with US and euro, -and added the japan record: -vi gbNode.tab -| 3 | genome-asia.ucsc.edu | JP | Asian node of UCSC Genome Browser at the RIKEN institute of Japan | Asian Server | -changed "\" to tab char. - -made a script to create and load the .as tables -some of them actually do not have any data yet, -since the data is created by joins of other tables - -./load-tables - -note that because the script is so simple -and some of the tables do not have .tab files, -they will give errors. We will fill the empty -tables later by inserting from a join of other tables. - -gbNode -geoIpNode -geoIpCountry -country -continent -countryToContinent - -Check the integrity of the continent/country tables: - -forgot to strip off the 1st row of the csv file. -Cleanup left-over header row. -delete from countryToContinent where continentId = '"continent code"'; - -DONE maybe I need to delete A1 and A2 which are weird special categories -that are not real countries and do not have an assigned continent which is given the value "--". -| A1 | Anonymous Proxy | -| A2 | Satellite Provider | -These turn out not to be a problem. - - -mysql> select * from countryToContinent where countryId not in (select id from country); -+-----------+-------------+ -| countryId | continentId | -+-----------+-------------+ -| AN | NA | -| FX | EU | -+-----------+-------------+ - -AN Netherlands Antilles - -insert into country values ('AN', 'Netherlands Antilles'); - -insert into country values ('FX', 'France, Metropolitan'); - -mysql> select * from countryToContinent where continentId not in (select id from continent); -+-----------+-------------+ -| countryId | continentId | -+-----------+-------------+ -| A1 | -- | -| A2 | -- | -| O1 | -- | -+-----------+-------------+ - -insert into continent values ('--','not applicable'); - -mysql> select count(*) from geoIpCountry where countryId not in (select id from country); -+----------+ -| count(*) | -+----------+ -| 0 | -+----------+ - --- - -start off with crude continent-level assignments: - -make everything 1 by default - -insert into continentToNode select id, 1 from continent; - -Europe -update continentToNode set node=2 where continentId = "EU"; - -Asia -update continentToNode set node=3 where continentId = "AS"; - -Now we just can make a join - -insert into countryToNode (select countryId, node from countryToContinent cc, continentToNode cn where cc.continentId = cn.continentId); - -# fix ones requested by Bob to be included in euro instead of defaulting to america. -# these are mostly around the middle east. -update countryToNode set node=2 where countryId in ('AE','AM','BH','CY','GE','IL','IQ','JO','KW','LB','OM','PS','QA','SA','SY','YE'); - -I compared the before and after updating with public_html/geoMap/test.csh at - http://hgwdev.soe.ucsc.edu/~galt/geoMap/countryToNodeMap.html -And it looked identical. So we did not lose anything. - -mysql> select distinct countryId from geoIpCountry where countryId not in (select distinct countryId from countryToNode); -+-----------+ -| countryId | -+-----------+ -| SS | -| CW | -| SX | -| BQ | -+-----------+ - -mysql> select * from country where id in ('SS','CW','SX','BQ'); -+----+--------------+ -| id | name | -+----+--------------+ -| BQ | Bonaire | -| CW | Curacao | -| SS | South Sudan | -| SX | Sint Maarten | -+----+--------------+ - - -So, what happened to these countries, and why are they not in countryToNode? -Shall we add them? - -insert into countryToContinent values ('SS', 'AF'); -insert into countryToContinent values ('SX', 'NA'); -insert into countryToContinent values ('CW', 'NA'); -insert into countryToContinent values ('BQ', 'NA'); - -insert into countryToNode values ('SS', '1'); -insert into countryToNode values ('SX', '1'); -insert into countryToNode values ('CW', '1'); -insert into countryToNode values ('BQ', '1'); - - -mysql> select distinct countryId from geoIpCountry where countryId not in (select distinct countryId from countryToNode); -Empty set (0.13 sec) - -Re-ran test.csh and checked again: -http://hgwdev.soe.ucsc.edu/~galt/geoMap/countryToNodeMap.html - -This means we should have full integrity. - -insert into geoIpNode (select ipStart, ipEnd, node from geoIpCountry gic, countryToNode ctn where gic.countryId=ctn.countryId); - -Query OK, 164390 rows affected (1.61 sec) -Records: 164390 Duplicates: 0 Warnings: 0 - -NOW IT IS READY! - -Note that in the future we may be putting other contries with other servers, -like we did with the middle-east countries, overriding their default continent mappings -in countryToNode. - - -Not sure yet if I am going to do this again. -It was what I did last time. -The idea was to keep a backup of these important tables. -That way perhaps hgFixed could be used when generating -other sets of testing tables for redirect. - -TODO at the end: -rename tables to real: -gbNodeReal -geoIpNodeReal - - -Make an extra backup of just the git untracked files just in case: - -mkdir /hive/data/outside/geoIp/ -gitu | xargs -I X cp X /hive/data/outside/geoIp/ - - -Make a backup of the data too: - -mkdir /hive/data/outside/geoIp/geoIpTableDumps -mkdir /hive/data/outside/geoIp/geoIpTableDumps/2016-06-23 -# sadly we need to temporarily make it write-able by all so mysqld can write dump there. -chmod 777 /hive/data/outside/geoIp/geoIpTableDumps/2016-06-23 -hgsqldump hgFixed -T 2016-06-23 continent continentToNode country countryToContinent countryToNode gbNode geoIpCountry geoIpNode -chmod 775 /hive/data/outside/geoIp/geoIpTableDumps/2016-06-23 -