6b1648806fc1de0da4b16bb23f51e0fd52065ef7 gperez2 Wed Apr 13 13:12:33 2022 -0700 making ReMap hub into native track, refs #28960 diff --git src/hg/makeDb/trackDb/human/reMap2022.html src/hg/makeDb/trackDb/human/reMap2022.html new file mode 100644 index 0000000..8f2e448 --- /dev/null +++ src/hg/makeDb/trackDb/human/reMap2022.html @@ -0,0 +1,190 @@ +<h2>Description</h2> +<p> +This track represents the ReMap Atlas of regulatory regions which consists of a +large scale integrative analysis of all Public ChIP-seq data for transcriptional +regulators from GEO, ArrayExpress and ENCODE. +</p> + +<p> +Below is a schematic diagram of the types of regulatory regions: +<ul> +<li>ReMap 2022 Atlas (all peaks for each analyzed datasets)</li> +<li>ReMap 2022 Non redundant peaks (merged similar target)</li> +<li>ReMap 2022 Cis Regulatory Modules</li> +</ul> +</p> + +<img src="http://remap.univ-amu.fr/storage/public/hubReMap2022/img/schema_datatype_remap.png" +alt="Schematic diagram data types" style="display: block; margin-left: left; +margin-right: auto; max-width:800px"> +<br> + +<h2> Display Conventions and Configuration </h2> +<ul> +<li> +Each transcription factor follows a specific RGB color. +</li> +<li> +ChIP-seq peak summits are represented by vertical bars. +</li> +<li> +Hsap : A data set is defined as a ChIP/Exo-seq experiment in a given +GEO/ArrayExpress/ENCODE series (e.g. GSE41561), for a given TF (e.g.: ESR1), in +a particular biological condition (e.g. MCF-7). +<br>Data sets are labelled with the concatenation of these three pieces of +information (e.g. GSE41561.ESR1.MCF-7). +</li> +<li> +Atha : The "dataset" is defined as a ChIP-seq experiment in a given series +(e.g. GSE94486), for a given target (e.g. ARR1), in a particular biological +condition (i.e. ecotype, tissue type, experimental conditions ; e.g. +Col-0_seedling_3d-6BA-4h). +<br>Data sets are labelled with the concatenation of these three pieces of +information (e.g. GSE94486.ARR1.Col-0_seedling_3d-6BA-4h). +</li> +</ul> + +<h2>Methods</h2> +<p> +This 4th release of ReMap (2022) present the analysis of a total of 8,103 +quality controlled ChIP-seq (n=7,895) and ChIP-exo (n=208) datasets from public +sources (GEO, ArrayExpress, ENCODE). The ChIP-seq/exo datasets have been mapped +to the GRCh38/hg38 human assembly. The "dataset" is defined as a ChIP-seq +experiment in a given series (e.g. GSE46237), for a given TF (e.g. NR2C2), in a +particular biological condition (i.e. cell line, tissue type, disease state or +experimental conditions ; e.g. HELA). Datasets were labeled by concatenating +these three pieces of information such as GSE46237.NR2C2.HELA. +<p>Those merged analyses cover a total of 1,211 DNA-binding protein +(transcriptional regulators) such as a variety of transcription factors (TFs), +transcription co-activators (TCFs) and chromatin-remodeling factors (CRFs) for +182 million peaks. +</p> + +<img src="http://remap.univ-amu.fr/storage/public/hubReMap2022/img/fig2_hsap_2022-20.png" +alt="Schematic diagram" style="display: block; margin-left: left; margin-right: +auto; max-width:800px"> +<br> + +<h4>GEO & ArrayExpress</h4> +Public ChIP-seq data sets were extracted from Gene Expression Omnibus (GEO) and +ArrayExpress (AE) databases. For GEO, the query '('chip seq' OR 'chipseq' OR +'chip sequencing') AND 'Genome binding/occupancy profiling by high throughput +sequencing' AND 'homo sapiens'[organism] AND NOT 'ENCODE'[project]' was used to +return a list of all potential data sets to analyze, which were then manually +assessed for further analyses. Data sets involving polymerases (i.e. Pol2 and +Pol3), and some mutated or fused TFs (e.g. KAP1 N/C terminal mutation, GSE27929) +were excluded. + +<h4>ENCODE Human</h4> +Available ENCODE ChIP-seq data sets for transcriptional regulators from +www.encodeproject.org portal were processed with the standardized ReMap pipeline. +The list of ENCODE data was retrieved as FASTQ files from the ENCODE portal +(https://www.encodeproject.org/) using the following filters: Assay: "ChIP-seq", +Organism: "Homo sapiens", Target of assay: "transcription factor", Available data: +"fastq" on 2016 June 21st. Metadata information in JSON format and FASTQ files +were retrieved using the Python requests module. + + +<h4>ChIP-seq processing</h4> +Both Public and ENCODE data were processed similarly. Bowtie 2 (<a href= +"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/" TARGET = _BLANK>PMC3322381 +</a>) (version 2.2.9) with options -end-to-end -sensitive was used to align all +reads on the human genome (GRCh38/hg38 assembly). Biological and technical +replicates for each unique combination of GSE/TF/Cell type or Biological condition +were used for peak calling. TFBS were identified using MACS2 peak-calling tool +(<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/" TARGET = _BLANK> +PMC3120977</a>) (version 2.1.1.2) in order to follow ENCODE ChIP-seq guidelines, +with stringent thresholds (MACS2 default thresholds, p-value: 1e-5). An input data +set was used when available. + + +<h4>Quality assessment</h4> +To assess the quality of public data sets, a score was computed based on the +cross-correlation and the FRiP (fraction of reads in peaks) metrics developed by +the ENCODE Consortium (<a href="http://genome.ucsc.edu/ENCODE/qualityMetrics.html" +TARGET = _BLANK>http://genome.ucsc.edu/ENCODE/qualityMetrics.html</a>). Two +thresholds were defined for each of the two cross-correlation ratios (NSC, +normalized strand coefficient: 1.05 and 1.10; RSC, relative strand coefficient: +0.8 and 1.0). Detailed descriptions of the ENCODE quality coefficients can be +found at http://genome.ucsc.edu/ENCODE/qualityMetrics.html. The phantompeak +tools suite was used (<a href="https://code.google.com/p/phantompeakqualtools/" +TARGET = _BLANK>https://code.google.com/p/phantompeakqualtools/</a>) to compute +RSC and NSC. +<p> +Please refer to the ReMap 2022, 2020, and 2018 publications for more details +(citation below). +</p> + +<!-- +<p> +<img src="http://pedagogix-tagc.univ-mrs.fr/remap2/hubDirectory/trackhub/img/remap2_figure3_web.png" alt="Detailled view of FOXA1" align="middle"> +</p> +This is a detailled view of the data increase in ReMap v2 with FOXA1 peaks at a specific location. +<br> +--> + +<h2>Data Access</h2> +<p> +ReMap Atlas of regulatory regions data can be explored interactively with the +<a href="../cgi-bin/hgTables">Table Browser</a> and cross-referenced with the +<a href="../cgi-bin/hgIntegrator">Data Integrator</a>. For programmatic access, +the track can be accessed using the Genome Browser's +<a href="../../goldenPath/help/api.html">REST API</a>. +ReMap annotations can be downloaded from the +<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/reMap`">Genome Browser's download server</a> +as a bigBed file. This compressed binary format can be remotely queried through +command line utilities. Please note that some of the download files can be quite large.</p> + +<p> +Individual BED files for specific TFs, or Cells/Biotypes or Datasets can be +found and downloaded on the ReMap website <a href="http://remap.univ-amu.fr/" +>http://remap.univ-amu.fr/</a> or at <a href="http://remap2022.univ-amu.fr/" +>http://remap2022.univ-amu.fr/</a>. +</p> + +The ReMap BED files for all version [2022, 2020, 2018, 2015] are available for +download at the ReMap website <a href="http://remap.univ-amu.fr/" +>http://remap.univ-amu.fr/</a> in the download tab. + + + +<h2>References</h2> + +<p> +Chèneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. +<a href="https://www.ncbi.nlm.nih.gov/pubmed/29126285" target="_blank"> +ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP- +seq experiments</a>. +<em>Nucleic Acids Res</em>. 2018 Jan 4;46(D1):D267-D275. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/29126285" target="_blank">29126285</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753247/" target="_blank">PMC5753247</a> +</p> +<p> +Chèneby J, Ménétrier Z, Mestdagh M, Rosnet T, Douida A, Rhalloussi W, Bergon A, Lopez +F, Ballester B. +<a href="https://www.ncbi.nlm.nih.gov/pubmed/31665499" target="_blank"> +ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis +DNA-binding sequencing experiments</a>. +<em>Nucleic Acids Res</em>. 2020 Jan 8;48(D1):D180-D188. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/31665499" target="_blank">31665499</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145625/" target="_blank">PMC7145625</a> +</p> +<p> +Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. +<a href="https://www.ncbi.nlm.nih.gov/pubmed/25477382" target="_blank"> +Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory +landscape</a>. +<em>Nucleic Acids Res</em>. 2015 Feb 27;43(4):e27. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/25477382" target="_blank">25477382</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4344487/" target="_blank">PMC4344487</a> +</p> +<p> +Hammal F, de Langen P, Bergon A, Lopez F, Ballester B. +<a href="https://www.ncbi.nlm.nih.gov/pubmed/34751401" target="_blank"> +ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an +integrative analysis of DNA-binding sequencing experiments</a>. +<em>Nucleic Acids Res</em>. 2022 Jan 7;50(D1):D316-D325. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/34751401" target="_blank">34751401</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728178/" target="_blank">PMC8728178</a> +</p> +