30 January 2014

Mapping the UCSC/Web-Sequences to a world map.

The people at UCSC recently released a new track for the Genome Browser:


"We're pleased to announce the release of the Web Sequences track on the UCSC Genome Browser. This track, produced in collaboration with Microsoft Research, contains the results of a 30-day scan for DNA sequences from over 40 billion different webpages. The sequences were then mapped with Blat to the human genome (...) The data were extracted from a variety of sources including patents, online textbooks, help forums, and any other webpages that contain DNA sequence. In essence, this track displays the Blat alignments of nearly every DNA sequence on the internet!"

I've mapped each genomic location from this track to a country and generated the following (unreadable) picture:

How this picture was generated

  • I downloaded the data from UCSC with the Table Browser. The data look like this:
    #bin    chrom   chromStart      chromEnd        name    score   strand  thickStart      thickEnd        reserved        blockCount      blockSizes      chromStarts     tSeqTypes     seqIds  seqRanges       publisher       pmid    doi     issn    journal title   firstAuthor     year    impact  classes locus
    585     chr1    14789   15004   3500336380      75              14789   15004   8421504 2       40,35   0,180   g       350033638000000000      0-75                          Tophat, Cufflinks and replicates - Page 2 - SEQanswers  seqanswers.com  0       0               WASH2P,WASH7P
    585     chr1    15017   15590   3500327042      381             15017   15590   8421504 2       326,55  0,518   g       350032704200000008      0-747                        Research Technologies at Indiana University      biomedapp.iu.edu        0       0               WASH7P
    585     chr1    68858   68895   3500020489      37              68858   68895   8421504 1       37      0       g       350002048900000000,350002048900000001   0-36,0-36    Genome mapability - Musings from a PhD candidate davetang.org    0       0               OR4F5
    585     chr1    69170   69479   3500359797      142             69170   69479   8421504 2       76,66   0,243   c       350035979700000000,350035979700000002   0-76,10-76    CRAM compression and TLEN SAM's field - SEQanswers      seqanswers.com  0       0               OR4F5
    585     chr1    70013   70230   3500427570      150             70013   70230   8421504 2       75,75   0,142   g       350042757000000000,350042757000000001   0-75,0-75     Inconsistency with SAM flag output? - SEQanswers        seqanswers.com  0       0               OR4F5
    585     chr1    98860   98888   3500207083      26              98860   98888   8421504 3       5,7,14  0,6,14  g       350020708300000108,350020708300000060,350020708300000239      0-24,0-21,0-21                                          Method For The Simultaneous Determination Of Blood Group And Platelet Antigen Genotypes         .freshpatents.com     0       0               OR4F5
    586     chr1    137603  138008  3500170315      405             137603  138008  8421504 1       405     0       p       350017031500015076,350017031500015074   0-135,0-270  Balding D. (2007) Handbook of Statistical Genetics       www.scribd.com  0       0               OR4F5
    586     chr1    139485  143008  3500419332      1794            139485  143008  8421504 2       65,1729 0,1794  g       350041933200000004,350041933200000000,350041933200000001,350041933200000002,350041933200000003        0-1263,0-1859,0-1852,0-1860,0-576                                               PPT  Evolution by Genome Duplication PowerPoint presentation | free to view   www.powershow.com       0       0               OR4F5
    586     chr1    141535  143008  3500270480      1372            141535  143008  8421504 24      57,60,58,59,61,59,60,59,59,62,61,58,60,58,16,59,59,59,59,57,57,59,58,58 0,61,125,187,250,314,377,441,503,566,631,695,756,819,881,919,981,1044,1107,1170,1230,1291,1353,1415   g       350027048000000003,350027048000000002   0-902,0-525                  Chen-Kung Chou 3-22-2004 www.dls.ym.edu.tw       0       0               OR4F5
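  • I want to generate a BED file with four columns: chrom, start, end, country. The 23rd column (under the 'firstAuthor' header) contains the host name of the page where the sequence was found. I use the top-level domain of that host to try to guess the country; hosts with a generic domain (.com, .org, .net...) are skipped. The following awk script was used to generate the file: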
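    BEGIN   {
            # the UCSC table is tab-delimited
            FS="[\t]";
            }
    
            {
            # column 23 holds the host name of the page
            country=$23;
            # keep only the text before the first slash (drop any path)
            for(;;)
                    {
                    slash=index(country,"/");
                    if(slash==0) break;
                    country=substr(country,1,slash-1);
                    }
            # keep only the text before the first colon (drop any port)
            for(;;)
                    {
                    colon=index(country,":");
                    if(colon==0) break;
                    country=substr(country,1,colon-1);
                    }
            # skip hosts ending with a dot, generic domains and
            # numeric endings (e.g. IP addresses)
            if( country ~ /\.$/ ) next;
            if( country ~ /\.com$/ ) next;
            if( country ~ /\.org$/ ) next;
            if( country ~ /\.cat$/ ) next;
            if( country ~ /\.net$/ ) next;
            if( country ~ /\.gov$/ ) next;
            if( country ~ /\.edu$/ ) next;
            if( country ~ /\.name$/ ) next;
            if( country ~ /\.info$/ ) next;
            if( country ~ /\.biz$/ ) next;
            if( country ~ /\.[0-9]+$/ ) next;
            # skip values without any dot or containing a space
            if( index(country,".")==0) next;
            if( index(country," ")!=0) next;
            # keep the text after the last dot: the top-level domain
            for(;;)
                    {
                    dot=index(country,".");
                    if(dot==0) break;
                    country=substr(country,dot+1);
                    }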
    
                    if( country== "af") {country="afghanistan";}
                    else if( country== "ax") {country="Ålandislands";}
                    else if( country== "al") {country="albania";}
                    else if( country== "dz") {country="algeria";}
                    else if( country== "as") {country="americansamoa";}
                    else if( country== "ad") {country="andorra";}
                    else if( country== "ao") {country="angola";}
                    else if( country== "ai") {country="anguilla";}
                    else if( country== "aq") {country="antarctica";}
                    else if( country== "ag") {country="antiguaandbarbuda";}
                    else if( country== "ar") {country="argentina";}
                    else if( country== "am") {country="armenia";}
                    else if( country== "aw") {country="aruba";}
                    else if( country== "au") {country="australia";}
                    else if( country== "at") {country="austria";}
                    else if( country== "az") {country="azerbaijan";}
                    else if( country== "bs") {country="bahamas";}
                    else if( country== "bh") {country="bahrain";}
                    else if( country== "bd") {country="bangladesh";}
                    else if( country== "bb") {country="barbados";}
                    else if( country== "by") {country="belarus";}
                    else if( country== "be") {country="belgium";}
                    else if( country== "bz") {country="belize";}
                    else if( country== "bj") {country="benin";}
                    else if( country== "bm") {country="bermuda";}
                    else if( country== "bt") {country="bhutan";}
                    else if( country== "bo") {country="bolivia,plurinationalstateof";}
                    else if( country== "bq") {country="bonaire,sinteustatiusandsaba";}
                    else if( country== "ba") {country="bosniaandherzegovina";}
                    else if( country== "bw") {country="botswana";}
                    else if( country== "bv") {country="bouvetisland";}
                    else if( country== "br") {country="brazil";}
                    else if( country== "io") {country="britishindianoceanterritory";}
                    else if( country== "bn") {country="bruneidarussalam";}
                    else if( country== "bg") {country="bulgaria";}
                    else if( country== "bf") {country="burkinafaso";}
                    else if( country== "bi") {country="burundi";}
                    else if( country== "kh") {country="cambodia";}
                    else if( country== "cm") {country="cameroon";}
                    else if( country== "ca") {country="canada";}
                    else if( country== "cv") {country="capeverde";}
                    else if( country== "ky") {country="caymanislands";}
                    else if( country== "cf") {country="centralafricanrepublic";}
                    else if( country== "td") {country="chad";}
                    else if( country== "cl") {country="chile";}
                    else if( country== "cn") {country="china";}
                    else if( country== "cx") {country="christmasisland";}
                    else if( country== "cc") {country="cocos(keeling)islands";}
                    else if( country== "co") {country="colombia";}
                    else if( country== "km") {country="comoros";}
                    else if( country== "cg") {country="congo";}
                    else if( country== "cd") {country="congo,thedemocraticrepublicofthe";}
                    else if( country== "ck") {country="cookislands";}
                    else if( country== "cr") {country="costarica";}
                    else if( country== "ci") {country="cÔted'ivoire";}
                    else if( country== "hr") {country="croatia";}
                    else if( country== "cu") {country="cuba";}
                    else if( country== "cw") {country="curaÇao";}
                    else if( country== "cy") {country="cyprus";}
                    else if( country== "cz") {country="czechrepublic";}
                    else if( country== "dk") {country="denmark";}
                    else if( country== "dj") {country="djibouti";}
                    else if( country== "dm") {country="dominica";}
                    else if( country== "do") {country="dominicanrepublic";}
                    else if( country== "ec") {country="ecuador";}
                    else if( country== "eg") {country="egypt";}
                    else if( country== "sv") {country="elsalvador";}
                    else if( country== "gq") {country="equatorialguinea";}
                    else if( country== "er") {country="eritrea";}
                    else if( country== "ee") {country="estonia";}
                    else if( country== "et") {country="ethiopia";}
                    else if( country== "fk") {country="falklandislands(malvinas)";}
                    else if( country== "fo") {country="faroeislands";}
                    else if( country== "fj") {country="fiji";}
                    else if( country== "fi") {country="finland";}
                    else if( country== "fr") {country="france";}
                    else if( country== "gf") {country="frenchguiana";}
                    else if( country== "pf") {country="frenchpolynesia";}
                    else if( country== "tf") {country="frenchsouthernterritories";}
                    else if( country== "ga") {country="gabon";}
                    else if( country== "gm") {country="gambia";}
                    else if( country== "ge") {country="georgia";}
                    else if( country== "de") {country="germany";}
                    else if( country== "gh") {country="ghana";}
                    else if( country== "gi") {country="gibraltar";}
                    else if( country== "gr") {country="greece";}
                    else if( country== "gl") {country="greenland";}
                    else if( country== "gd") {country="grenada";}
                    else if( country== "gp") {country="guadeloupe";}
                    else if( country== "gu") {country="guam";}
                    else if( country== "gt") {country="guatemala";}
                    else if( country== "gg") {country="guernsey";}
                    else if( country== "gn") {country="guinea";}
                    else if( country== "gw") {country="guinea-bissau";}
                    else if( country== "gy") {country="guyana";}
                    else if( country== "ht") {country="haiti";}
                    else if( country== "hm") {country="heardislandandmcdonaldislands";}
                    else if( country== "va") {country="holysee(vaticancitystate)";}
                    else if( country== "hn") {country="honduras";}
                    else if( country== "hk") {country="china";}
                    else if( country== "hu") {country="hungary";}
                    else if( country== "is") {country="iceland";}
                    else if( country== "in") {country="india";}
                    else if( country== "id") {country="indonesia";}
                    else if( country== "ir") {country="iran";}
                    else if( country== "iq") {country="iraq";}
                    else if( country== "ie") {country="ireland";}
                    else if( country== "im") {country="isleofman";}
                    else if( country== "il") {country="israel";}
                    else if( country== "it") {country="italy";}
                    else if( country== "jm") {country="jamaica";}
                    else if( country== "jp") {country="japan";}
                    else if( country== "je") {country="jersey";}
                    else if( country== "jo") {country="jordan";}
                    else if( country== "kz") {country="kazakhstan";}
                    else if( country== "ke") {country="kenya";}
                    else if( country== "ki") {country="kiribati";}
                    else if( country== "kp") {country="northkorea";}
                    else if( country== "kr") {country="southkorea";}
                    else if( country== "kw") {country="kuwait";}
                    else if( country== "kg") {country="kyrgyzstan";}
                    else if( country== "la") {country="laopeople'sdemocraticrepublic";}
                    else if( country== "lv") {country="latvia";}
                    else if( country== "lb") {country="lebanon";}
                    else if( country== "ls") {country="lesotho";}
                    else if( country== "lr") {country="liberia";}
                    else if( country== "ly") {country="libya";}
                    else if( country== "li") {country="liechtenstein";}
                    else if( country== "lt") {country="lithuania";}
                    else if( country== "lu") {country="luxembourg";}
                    else if( country== "mo") {country="macao";}
                    else if( country== "mk") {country="macedonia,theformeryugoslavrepublicof";}
                    else if( country== "mg") {country="madagascar";}
                    else if( country== "mw") {country="malawi";}
                    else if( country== "my") {country="malaysia";}
                    else if( country== "mv") {country="maldives";}
                    else if( country== "ml") {country="mali";}
                    else if( country== "mt") {country="malta";}
                    else if( country== "mh") {country="marshallislands";}
                    else if( country== "mq") {country="martinique";}
                    else if( country== "mr") {country="mauritania";}
                    else if( country== "mu") {country="mauritius";}
                    else if( country== "yt") {country="mayotte";}
                    else if( country== "mx") {country="mexico";}
                    else if( country== "fm") {country="micronesia,federatedstatesof";}
                    else if( country== "md") {country="moldova,republicof";}
                    else if( country== "mc") {country="monaco";}
                    else if( country== "mn") {country="mongolia";}
                    else if( country== "me") {country="montenegro";}
                    else if( country== "ms") {country="montserrat";}
                    else if( country== "ma") {country="morocco";}
                    else if( country== "mz") {country="mozambique";}
                    else if( country== "mm") {country="myanmar";}
                    else if( country== "na") {country="namibia";}
                    else if( country== "nr") {country="nauru";}
                    else if( country== "np") {country="nepal";}
                    else if( country== "nl") {country="netherlands";}
                    else if( country== "nc") {country="newcaledonia";}
                    else if( country== "nz") {country="newzealand";}
                    else if( country== "ni") {country="nicaragua";}
                    else if( country== "ne") {country="niger";}
                    else if( country== "ng") {country="nigeria";}
                    else if( country== "nu") {country="niue";}
                    else if( country== "nf") {country="norfolkisland";}
                    else if( country== "mp") {country="northernmarianaislands";}
                    else if( country== "no") {country="norway";}
                    else if( country== "om") {country="oman";}
                    else if( country== "pk") {country="pakistan";}
                    else if( country== "pw") {country="palau";}
                    else if( country== "ps") {country="palestine,stateof";}
                    else if( country== "pa") {country="panama";}
                    else if( country== "pg") {country="papuanewguinea";}
                    else if( country== "py") {country="paraguay";}
                    else if( country== "pe") {country="peru";}
                    else if( country== "ph") {country="philippines";}
                    else if( country== "pn") {country="pitcairn";}
                    else if( country== "pl") {country="poland";}
                    else if( country== "pt") {country="portugal";}
                    else if( country== "pr") {country="puertorico";}
                    else if( country== "qa") {country="qatar";}
                    else if( country== "re") {country="france";}
                    else if( country== "ro") {country="romania";}
                    else if( country== "ru") {country="russia";}
                    else if( country== "rw") {country="rwanda";}
                    else if( country== "bl") {country="saintbarthÉlemy";}
                    else if( country== "sh") {country="sainthelena,ascensionandtristandacunha";}
                    else if( country== "kn") {country="saintkittsandnevis";}
                    else if( country== "lc") {country="saintlucia";}
                    else if( country== "mf") {country="saintmartin(frenchpart)";}
                    else if( country== "pm") {country="saintpierreandmiquelon";}
                    else if( country== "vc") {country="saintvincentandthegrenadines";}
                    else if( country== "ws") {country="samoa";}
                    else if( country== "sm") {country="sanmarino";}
                    else if( country== "st") {country="saotomeandprincipe";}
                    else if( country== "sa") {country="saudiarabia";}
                    else if( country== "sn") {country="senegal";}
                    else if( country== "rs") {country="serbia";}
                    else if( country== "sc") {country="seychelles";}
                    else if( country== "sl") {country="sierraleone";}
                    else if( country== "sg") {country="singapore";}
                    else if( country== "sx") {country="sintmaarten(dutchpart)";}
                    else if( country== "sk") {country="slovakia";}
                    else if( country== "si") {country="slovenia";}
                    else if( country== "sb") {country="solomonislands";}
                    else if( country== "so") {country="somalia";}
                    else if( country== "za") {country="southafrica";}
                    else if( country== "gs") {country="southgeorgiaandthesouthsandwichislands";}
                    else if( country== "ss") {country="southsudan";}
                    else if( country== "es") {country="spain";}
                    else if( country== "lk") {country="srilanka";}
                    else if( country== "sd") {country="sudan";}
                    else if( country== "sr") {country="suriname";}
                    else if( country== "sj") {country="svalbardandjanmayen";}
                    else if( country== "sz") {country="swaziland";}
                    else if( country== "se") {country="sweden";}
                    else if( country== "ch") {country="switzerland";}
                    else if( country== "sy") {country="syrianarabrepublic";}
                    else if( country== "tw") {country="taiwan";}
                    else if( country== "tj") {country="tajikistan";}
                    else if( country== "tz") {country="tanzania";}
                    else if( country== "th") {country="thailand";}
                    else if( country== "tl") {country="timor-leste";}
                    else if( country== "tg") {country="togo";}
                    else if( country== "tk") {country="tokelau";}
                    else if( country== "to") {country="tonga";}
                    else if( country== "tt") {country="trinidadandtobago";}
                    else if( country== "tn") {country="tunisia";}
                    else if( country== "tr") {country="turkey";}
                    else if( country== "tm") {country="turkmenistan";}
                    else if( country== "tc") {country="turksandcaicosislands";}
                    else if( country== "tv") {country="tuvalu";}
                    else if( country== "ug") {country="uganda";}
                    else if( country== "ua") {country="ukraine";}
                    else if( country== "ae") {country="unitedarabemirates";}
                    else if( country== "gb") {country="unitedkingdom";}
                    else if( country== "uk") {country="unitedkingdom";}
                    else if( country== "us") {country="USA";}
                    else if( country== "um") {country="unitedstatesminoroutlyingislands";}
                    else if( country== "uy") {country="uruguay";}
                    else if( country== "uz") {country="uzbekistan";}
                    else if( country== "vu") {country="vanuatu";}
                    else if( country== "ve") {country="venezuela";}
                    else if( country== "vn") {country="vietnam";}
                    else if( country== "vg") {country="virginislands,british";}
                    else if( country== "vi") {country="virginislands,u.s.";}
                    else if( country== "wf") {country="wallisandfutuna";}
                    else if( country== "eh") {country="westernsahara";}
                    else if( country== "ye") {country="yemen";}
                    else if( country== "zm") {country="zambia";}
                    else if( country== "zw") {country="zimbabwe";}
                    else { next;}
    
            printf("%s\t%s\t%s\t%s\n",$2,$3,$4,country);
            }
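
The same filtering can also be written with a lookup table instead of the long if/else chain. The following is only a sketch, not the script used for the picture: the function name host2country is made up for illustration, and the table holds a handful of top-level domains where the script above lists them all.

```shell
# Sketch: table-driven version of the host -> country step.
# Reads the UCSC table on stdin, writes chrom/start/end/country on stdout.
host2country() {
    awk -F '\t' '
    BEGIN {
        # "tld=name" pairs: a short, illustrative excerpt of the full table
        n = split("fr=france de=germany uk=unitedkingdom gb=unitedkingdom us=USA jp=japan tw=taiwan", pairs, " ")
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, "=")
            name[kv[1]] = kv[2]
        }
    }
    {
        host = $23
        sub("[/:].*$", "", host)          # drop any path or port
        if (index(host, " ") != 0) next   # not a host name
        n = split(host, parts, "[.]")
        if (n < 2) next                   # no dot: no top-level domain
        tld = parts[n]
        if (!(tld in name)) next          # generic, numeric or unlisted TLD
        printf("%s\t%s\t%s\t%s\n", $2, $3, $4, name[tld])
    }'
}
```

Assuming the Table Browser output was saved as websequences.tsv (a hypothetical file name), `host2country < websequences.tsv > map.bed` would write the BED file. Generic domains need no explicit test here because they are simply absent from the table.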
    
  • For the world map, I used an SVG vector map from Wikimedia Commons: https://commons.wikimedia.org/wiki/File:World_V2.0.svg

    The coordinates of the boundaries of each country are defined in a SVG 'path' element:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN" "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/
    <svg xmlns="http://www.w3.org/2000/svg" width="8.88889in" height="4.44444in" viewBox="0 0 800 400">
      <path id="Taiwan" fill="none" stroke="black" stroke-width="1" d="M 668.85,151.22  C 668.98,150.71 ...
      <path id="Estonia" fill="none" stroke="black" stroke-width="1" d="M 460.75,68.26  C 459.95,68.11 4 ...
      <path id="Latvia" fill="none" stroke="black" stroke-width="1" d="M 461.23,72.27  C 460.75,72.20 46 ...
      <path id="Lithuania" fill="none" stroke="black" stroke-width="1" d="M 452.39,79.42  C 452.67,79.72 ...
      <path id="Byelarus" fill="none" stroke="black" stroke-width="1" d="M 453.57,81.92  C 453.87,82.37  ...
      <path id="Ukraine" fill="none" stroke="black" stroke-width="1" d="M 453.09,85.95  C 453.43,86.30 4 ...
      <path id="Moldova" fill="none" stroke="black" stroke-width="1" d="M 460.57,93.00  C 461.00,93.70 4 ...
      <path id="Syria" fill="none" stroke="black" stroke-width="1" d="M 480.33,127.61  C 481.03,127.90 4 ...
      <path id="Turkey" fill="none" stroke="black" stroke-width="1" d="M 499.47,116.91  C 499.31,116.32  ...
      <path id="Kuwait" fill="none" stroke="black" stroke-width="1" d="M 505.26,133.84  C 504.97,134.56  ...
      <path id="Saudi Arabia" fill="none" stroke="black" stroke-width="1" d="M 495.83,163.75  C 496.31,1 ...
      <path id="United Arab Emirates" fill="none" stroke="black" stroke-width="1" d="M 516.03,150.12  C  ...
      <path id="Yemen" fill="none" stroke="black" stroke-width="1" d="M 517.68,161.79  C 517.01,160.50 5 ...
      <path id="Slovenia" fill="none" stroke="black" stroke-width="1" d="M 430.55,97.07  C 429.62,97.36  ...
      <path id="Croatia" fill="none" stroke="black" stroke-width="1" d="M 439.09,103.97  C 439.01,103.46 ...
      <path id="Bosnia and Herzegovina" fill="none" stroke="black" stroke-width="1" d="M 440.44,105.02   ...
    (...)
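
The ids in the SVG ('Saudi Arabia', 'Byelarus') are not spelled like the names written by the awk script ('saudiarabia', 'belarus'). How WorldMapGenome reconciles the two is not shown in this post, so the sketch below simply assumes a case-insensitive comparison with spaces removed, and lists the BED names left without a matching path. It also assumes one 'path' element per line with 'id' as its first attribute, as in the excerpt above. Both function names are made up for illustration.

```shell
# Sketch: print the ids of the SVG path elements, lower-cased and without spaces.
# Assumes each <path> sits on one line and starts with its id attribute.
svg_country_ids() {   # usage: svg_country_ids World_V2.0.svg
    sed -n 's/.*<path id="\([^"]*\)".*/\1/p' "$1" |
        tr '[:upper:]' '[:lower:]' |
        tr -d ' ' |
        sort -u
}

# Sketch: print the country names of a BED file (column 4) that have
# no matching id in the SVG, using the same normalisation.
unmatched_countries() {   # usage: unmatched_countries map.bed World_V2.0.svg
    ids=$(mktemp)
    svg_country_ids "$2" > "$ids"
    cut -f4 "$1" | tr '[:upper:]' '[:lower:]' | sort -u | comm -23 - "$ids"
    rm -f "$ids"
}
```

With the excerpt above, a BED line tagged 'belarus' would be reported because the map spells the id 'Byelarus'.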
  • I joined the data using a custom Java program (available on GitHub at: https://github.com/lindenb/jvarkit/wiki/WorldMapGenome ). The program converts each 'path' element into a Java GeneralPath shape. It was run as follows:
    $  cat map.bed |\
         java -jar dist/worldmapgenome.jar \
         -u World_V2.0.svg \
         -w 2000 -o ~/output.jpg \
         -R hg19.fasta
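
Before drawing, a quick count of regions per country gives a rough idea of what the picture should show. This is a small sketch, assuming map.bed is the four-column file produced by the awk script; the function name country_counts is hypothetical.

```shell
# Sketch: count the BED records per country (column 4),
# most frequent first, ties broken by name.
country_counts() {   # usage: country_counts map.bed
    tab=$(printf '\t')
    awk -F '\t' '
        { n[$4]++ }
        END { for (c in n) printf("%s\t%d\n", c, n[c]) }
    ' "$1" | sort -t "$tab" -k2,2nr -k1,1
}
```

Running `country_counts map.bed | head` would list the countries contributing the most regions.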
That's it,
Pierre
