src/hg/makeDb/trackDb/t2g.html 1.2
1.2 2010/05/21 23:18:25 hiram
initial text description for text2Genome
Index: src/hg/makeDb/trackDb/t2g.html
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/makeDb/trackDb/t2g.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -b -B -U 4 -r1.1 -r1.2
--- src/hg/makeDb/trackDb/t2g.html 20 May 2010 18:43:21 -0000 1.1
+++ src/hg/makeDb/trackDb/t2g.html 21 May 2010 23:18:25 -0000 1.2
@@ -1,8 +1,61 @@
<H2>Description</H2>
<P>
-This track indicates the location of sequences discovered in publications
-mapped back to the genome.
-</P>
+This track indicates the location of sequences in publications
+mapped back to the genome. It is based on data from more than 22,000 articles with DNA
+sequences from the <A HREF="http://www.pubmedcentral.com/"
+TARGET=_blank>PubMed Central</A> <A HREF="http://www.ncbi.nlm.nih.gov/pmc/about/openftlist.html"
+TARGET=_blank>Open-Access archive</A>, which consists of ~130,000 free
+research articles (Feb 2010).</P>
+
<H2>Methods</H2>
+<P>
+Articles were downloaded from PubMed Central. Depending on availability, XML,
+raw ASCII or text extracted from PDFs was used. The results were then
+processed by the <A HREF="http://sourceforge.net/projects/text2genome/"
+TARGET=_blank>text2genome.org</A> software. It searches for stretches of
+separated nucleotide-like letters that are longer than 19bp or words that
+contain more than 40% nucleotide-like letters. The resulting DNA sequences
+were mapped with BLAST to all genomes that are part of
+Ensembl/EnsemblGenomes version 56 and filtered with the text2genome pipeline:
+
+<UL>
+<LI>Hits to NCBI UniVec are completely removed</LI>
+<LI>Only matches on the most plausible genome are kept. This is the
+genome with the most matching sequences that is either mentioned in the
+text (as recognized by <A HREF="http://www.sf.net/projects/linnaeus/"
+TARGET=_blank>LINNAEUS</A>) or is a well-known model organism.</LI>
+<LI>Hits from the same paper that are closer than 50kbp are
+chained (shown as exon-blocks on the browser)</LI>
+<LI>Non-unique hits are only kept in the chain with the most members</LI>
+</UL>
+
+</P>
+
<H2>Credits</H2>
+<P>
+Data was processed by Maximilian Haeussler, Martin Gerner and Casey Bergman.
+Import into UCSC by Hiram Clawson. For questions or feedback on this data
+track, please send an email to
+<A HREF="mailto:text2genome@manchester.
+ac.
+uk">
+text2genome@manchester.
+ac.
+uk</A>.
+</P>
+
<H2>References</H2>
+
+<P>
+Haeussler M, Bergman CM. Annotating genes and genomes with sequences
+extracted from biomedical articles. <em>In prep.</em> See also
+<A HREF="http://text2genome.org/" TARGET=_blank>www.text2genome.org</A>.
+</P>
+
+<P>
+Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJM,
+Montgomery SB, Bergman CM, The Open Regulatory Annotation Consortium.
+<A HREF="http://www.ncbi.nlm.nih.gov/pubmed/18271954"
+TARGET=_blank>Text-mining assisted regulatory annotation</A>.
+Genome Biol. 2008;9(2):R31.
+</P>
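
Note on the chaining step described under Methods: the sketch below is one possible reading of "hits from the same paper that are closer than 50kbp are chained", not the text2genome implementation. The `Hit` record, its field names, the half-open coordinates, the restriction to a single chromosome, and the choice to measure distance as the gap between the end of the chain so far and the start of the next hit are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_GAP = 50_000  # "closer than 50kbp", taken from the Methods text


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit; not the pipeline's own type."""
    paper_id: str  # assumed article identifier
    chrom: str     # assumed: chaining only within one chromosome
    start: int     # assumed 0-based, half-open coordinates
    end: int


def chain_hits(hits, max_gap=MAX_GAP):
    """Group hits from the same paper and chromosome into chains.

    A hit joins the current chain when the gap between the end of the
    chain so far and the start of the hit is smaller than max_gap.
    """
    by_key = defaultdict(list)
    for hit in hits:
        by_key[(hit.paper_id, hit.chrom)].append(hit)

    chains = []
    for key in sorted(by_key):
        ordered = sorted(by_key[key], key=lambda h: (h.start, h.end))
        current = [ordered[0]]
        current_end = ordered[0].end
        for hit in ordered[1:]:
            if hit.start - current_end < max_gap:
                # Close enough: extend the current chain.
                current.append(hit)
                current_end = max(current_end, hit.end)
            else:
                # Too far away: close the chain and start a new one.
                chains.append(current)
                current = [hit]
                current_end = hit.end
        chains.append(current)
    return chains
```

Each returned chain would correspond to one item drawn with exon-like blocks in the browser. The later filter ("non-unique hits are only kept in the chain with the most members") is not covered by this sketch.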