98a9f12dcb9aac0ebf3aff6baa7351f0132bc60e brianlee Wed Aug 31 16:42:50 2022 -0700 Adding Cloud help page for NAR paper (can be revamped as Data Access page) refs #28386 diff --git src/hg/htdocs/goldenPath/help/cloud.html src/hg/htdocs/goldenPath/help/cloud.html new file mode 100755 index 0000000..fa20f64 --- /dev/null +++ src/hg/htdocs/goldenPath/help/cloud.html @@ -0,0 +1,430 @@ +<!DOCTYPE html> +<!--#set var="TITLE" value="Cloud Data and Software Resources" --> +<!--#set var="ROOT" value="../.." --> + +<!-- Relative paths to support mirror sites with non-standard GB docs install --> +<!--#include virtual="$ROOT/inc/gbPageStart.html" --> + +<h1>Cloud Data and Software Resources</h1> + +<!-- ========== Introduction ============================== --> +<a id="Intro"></a> +<p> +This page exists to help people operating in the cloud. +It covers the topics of data access on the cloud, including details about our +<a href="#REST_API">API</a> and Amazon <b>s3://genome-browser</b> <a href="#S3">bucket</a>, +our software installations for <a href="#GBiC">cloud computing</a>, and references to helpful +tools, such as our pre-compiled <a href="#bigBed">bigBedToBed</a> utility. +</p> +<h2>Contents</h2> + +<h3>Cloud Data</h3> +<div class="container"> + <div class="row"> + <div class="col-sm-6"> + <h6>Amazon S3 </h6> + <ul class=gbsNoBullet> + <li><a href="#S3">What is the Amazon S3 genome-browser bucket?</a></li> + <li><a href="#S3_contents">What specific files are in the s3://genome-browser bucket?</a></li> + <li><a href="#S3_downloads">How can one get data from the s3://genome-browser bucket?</a></li> + </ul> + </div> + <div class="col-sm-6"> + <h6>REST API</h6> + <ul class=gbsNoBullet> + <li><a href="#REST_API">What is the REST API?</a></li> + <li><a href="#REST_API_data">What kind of data can you get from the REST API?</a></li> + </ul> + </div> + </div> +</div> +<div class="container"> + <div class="row"> + <div class="col-sm-6"> + <h6>Download Server</h6> + <ul class=gbsNoBullet> + <li><a href="#download">What is the download server and how does one use it?</a></li> + </ul> + </div> + <div class="col-sm-6"> + <h6>MySQL Server</h6> + <ul class=gbsNoBullet> + <li><a href="#MySQL">What is the MySQL server and how does one use it?</a></li> + </ul> + </div> + </div> +</div> + +<h3>Cloud Software</h3> +<div class="container"> + <div class="row"> + <div class="col-sm-6"> + <h6>GBiB/GBiC</h6> + <ul class=gbsNoBullet> + <li><a href="#GBiC">What are GBiB and GBiC? + (Genome Browser in a Box/in the Cloud)</a></li> + </ul> + </div> + <div class="col-sm-6"> + <h6>Docker</h6> + <ul class=gbsNoBullet> + <li><a href="#Docker">Do you support Docker?</a></li> + </ul> + </div> + </div> +</div> + +<h3>Helpful Tools</h3> +<div class="container"> + <div class="row"> + <div class="col-sm-6"> + <h6>Browser Utilities</h6> + <ul class=gbsNoBullet> + <li><a href="#bigBed">How do I extract data from the bigBed/2bit data formats?</a></li> + </ul> + </div> + <div class="col-sm-6"> + <h6>Amazon Ecosystem</h6> + <ul class=gbsNoBullet> + <li><a href="#tag_name">Where can I learn more about Amazon Tools?</a></li> + </ul> + </div> + </div> +</div> + +<!-- ====Cloud Data Section============================== --> + +<a id="S3"></a> +<h2>What is the Amazon s3://genome-browser bucket?</h2> +<p> +S3 stands for Simple Storage Service, and it is the name for cloud storage in Amazon Web Services +(AWS). The data available through S3 is essentially stored in a folder called a bucket, and +files are called objects. The s3://genome-browser bucket is a copy of the main data available on our +UCSC Genome Browser Download website: <a href="https://hgdownload.soe.ucsc.edu/downloads.html" +target="_blank">https://hgdownload.soe.ucsc.edu/downloads.html</a></p> +<p> +By placing our Download server files in an S3 bucket, developers +working in the cloud can more easily integrate with UCSC data. +You can learn more about how S3-object-based storage works, and its advantages of being +accessible anywhere across the world with low latency and high durability by reviewing +<a href="https://docs.aws.amazon.com/s3/" target="_blank">Amazon's S3 documentation</a>.</p> + +<a id="S3_contents"></a> +<h2>What specific files are in the s3://genome-browser bucket?</h2> +<p> +The data mirrors our <a href="https://hgdownload.soe.ucsc.edu/downloads.html" +target="_blank">UCSC Genome Browser Download</a> website's main rsync directories: +<pre> +UCSC Human Golden Path Downloads s3://genome-browser/goldenPath +UCSC Human Genome Browser Gbdb Data Files s3://genome-browser/gbdb +UCSC Human Genome Raw Mysql Tables s3://genome-browser/mysql +UCSC Human Genome Web Site CGI Binaries s3://genome-browser/cgi-bin +UCSC Human Genome Web Site Htdocs s3://genome-browser/htdocs +</pre></p> +<p> +<ul> +<li>The <b>goldenPath</b> directory is organized by assembly name, and represents the file +structure on our Download server, which includes README.txt files. For +instance, the sequence data for the human hg38 assembly would be found in this location with an +instructive README.txt: <code>goldenPath/hg38/bigZips/README.txt</code>. The README.txt, also +available on the <a href="https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/README.txt" +target="_blank">Download website</a>, +informs that the most recent patch-inclusive sequence is found in +<code>goldenPath/hg38/bigZips/latest/</code>.</li> +<li>The <b>gbdb</b> directory, also organized by assembly name, provides access to genome +browser database files in binary format used by the browser software. +For instance, the underlying binary indexed sequence data for the hg38 databases used in the +display in the UCSC Genome Browser would be located in the following location, +<code>gbdb/hg38/hg38.2bit</code>, matching the file in the +<code>goldenPath/hg38/bigZips/latest/</code> +directory, reflecting how these files are operated on by the UCSC Genome Browser software +in order to display assembly sequence when browsing.</li> +<li>The <b>mysql</b> directory, also organized by assembly name, provides access +to MySQL database tableName.MYD files, and their related +tableName.MYI index and tableName.frm format files, providing a copy of the tables +used by the main Browser site.</li> +<li>The <b>cgi-bin</b> directory is a copy of the software +run on the main browser site.</li> +<li>The <b>htdocs</b> directory is a copy of the html pages used on the +main browser site, such as <code>htdocs/goldenPath/pubs.html</code> which lists our publications.</li> +</ul></p> + +<a id="S3_downloads"></a> +<h2>How can one get data from the s3://genome-browser bucket?</h2> +<p> +Amazon provides an <a target="_blank" +href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html">AWS +Command Line Interface (AWS CLI)</a> which includes options such as <a target="_blank" +href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html">sync</a>. +Here is an example to download an AWS bucket with CLI: <code>aws s3 sync s3://bucket-name .</code></p> +<p> +The data is also available via http at <b>genome-browser.s3-website-us-east-1.amazonaws.com</b> +where files can be accessed.</p> +<h3>Examples</h3> +<p> +<ul> +<li>For instance, here is an example of accessing a README file +in the <code>goldenPath/</code>Downloads directory:<br> +<a href="http://genome-browser.s3-website-us-east-1.amazonaws.com/goldenPath/hg38/bigZips/README.txt" +target="_blank">http://genome-browser.s3-website-us-east-1.amazonaws.com/goldenPath/hg38/bigZips/README.txt</a></li> +<li>And here is an example link that would access the <code>gbdb/</code> binary data directory +for the human hg38 assembly 2bit file:<br> +<a href="http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit" +target="_blank">http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit</a></li> +<li>And here is an example link that would access our +publications html page from the bucket's <code>htdocs/</code> hypertext document directory:</br> +<a href="http://genome-browser.s3-website-us-east-1.amazonaws.com/htdocs/goldenPath/pubs.html" +target="_blank">http://genome-browser.s3-website-us-east-1.amazonaws.com/htdocs/goldenPath/pubs.html</a></li> +</ul></p> + +<!-- ========== What is the REST API? ============================== --> +<a id="REST_API"></a> +<h2>What is the REST API?</h2> +<p> +The UCSC Genome Browser has a REST API for the programmatic extraction of data. +REST is an acronym for REpresentational State Transfer and API stands for +Application Programming Interface, read more on the help page: <a target="_blank" +href="http://genome.ucsc.edu/goldenPath/help/api.html"> http://genome.ucsc.edu/goldenPath/help/api.html</a></p> +<p> +The REST API returns data in JavaScript Object Notation (<a +href="https://www.w3schools.com/js/js_json_intro.asp" target="_blank">JSON</a>) format, which +can easily be sent between computers, and used by many different programming languages.</p> +<p> +Data can be accessed with this URL: <b>https://api.genome.ucsc.edu/</b> By adding +different endpoint functions such as <code>/list/</code> or <code>/getData/</code> +specific results can be obtained.</p> +<h3>Examples</h3> +<p> +<pre> + wget -O- 'https://api.genome.ucsc.edu/list/publicHubs' + wget -O- 'https://api.genome.ucsc.edu/getData/sequence?genome=hg38;chrom=chrM;start=4321;end=5678' +</pre></p> + +<!-- ========== What kind of data can you get from the REST API? ============================== --> +<a id="REST_API_data"></a> +<h2>What kind of data can you get from the REST API?</h2> +<p> +With different endpoint functions such as <code>/list/</code> or +<code>/getData/</code> URLs can be constructed to pull specific results. +<table> +<tr><th>Endpoint function</th><th>Required</th><th>Optional</th></tr> +<tr><th>/list/publicHubs</th><td>(none)</td><td>(none)</td></tr> +<tr><th>/list/ucscGenomes</th><td>(none)</td><td>(none)</td></tr> +<tr><th>/list/hubGenomes</th><td>hubUrl</td><td>(none)</td></tr> +<tr><th>/list/tracks</th><td>genome or (hubUrl and genome)</td><td>trackLeavesOnly=1</td></tr> +<tr><th>/list/chromosomes</th><td>genome or (hubUrl and genome)</td><td>track</td></tr> +<tr><th>/list/schema</th><td>(genome or (hubUrl and genome)) and track</td><td>(none)</td></tr> +<tr><th>/getData/sequence</th><td>(genome or (hubUrl and genome)) and chrom</td><td>start and +end</td></tr> +<tr><th>/getData/track</th><td>(genome or (hubUrl and genome)) and track</td><td>chrom, +(start and end), maxItemsOutput, jsonOutputArrays</td></tr> +</table></p> +<p> +By reviewing <a href="http://genome.ucsc.edu/goldenPath/help/api.html#list_examples" +target="_blank">example data access URLs</a> demonstrating of list and getData functions +and further <a href="http://genome.ucsc.edu/goldenPath/help/api.html#Practical_examples" +target="_blank">practical examples URLs</a> of extracting specific track data items +you can learn more about the ways of using the API to extract data.</p> + +<!-- ========== What is the Download Server and does one use it? ============================== --> +<a id="download"></a> +<h2>What is the Download server and how does one use it?</h2> +<p> +The UCSC Genome Browser Download website, <a href="https://hgdownload.soe.ucsc.edu/downloads.html" +target="_blank">hgdownload.soe.ucsc.edu</a>, is the source of the data +hosted in the Amazon s3://genome-browser bucket. It can be viewed in a web browser to access +specific download files, or the data can be copied with rysnc commands.</p> +<h3>Examples</h3> +<p> +For instance, the following <a href="https://en.wikipedia.org/wiki/Rsync" target="_blank">rsync</a> +command will show you the various rysnc directories available on our Download server: +<pre> +$ rsync -a -P rsync://hgdownload.soe.ucsc.edu/ + +genome UCSC Human Genome Downloads +sars UCSC Human Genome SARS Downloads +htdocs UCSC Human Genome Web Site Htdocs +goldenPath UCSC Human Golden Path Downloads +cgi-bin UCSC Human Genome Web Site CGI Binaries x86_64 +cgi-bin-i386 UCSC Human Genome Web Site CGI Binaries i386 +gbdb UCSC Human Genome Browser Gbdb Config Files +archives UCSC Human Genome Browser Archived Config Files +mysql UCSC Human Genome Raw Mysql Tables +gbib UCSC Genome Browser in a Box +hubs UCSC Genome Browser Public Hubs +</pre> +<ul> +<li>For instance, here is an example of accessing a README file +in the <code>goldenPath/</code>Downloads directory:<br> +<code>rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/README.txt ./</code></li> +<li>And here is an example link that would access the <code>gbdb/</code> binary data directory +for the human hg38 assembly 2bit file:<br> +<code>rsync -a -P rsync://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./</code></li> +<li>And here is an example link that would access our +publications html page from the bucket's <code>htdocs/</code> hypertext document directory:</br> +<code>rsync -a -P rsync://hgdownload.soe.ucsc.edu/htdocs/goldenPath/pubs.html ./</code></li> +</ul></p> +<p> +Many of these rsync directories exist to support the Genome Browser in a Cloud (<a target="_blank" +href="gbic.html">GBiC</a>) and the Genome Browser in a Box (<a target="_blank" +href="gbib.html">GBiB</a>) software products discussed below. +Also note that there is a mirror of the download server available in Europe so the above rysnc +commands can also be pointed to the <code>hgdownload-euro</code> locations. +<ul><li>For instance here is a command to access data from the Europe location:<br> +<code>rsync -a -P rsync://hgdownload-euro.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./</code></li> +</ul> + +<!-- ========== What is the MySQL server and how does one use it? ============================== --> +<a id="MySQL"></a> +<h2>What is the MySQL server and how does one use it?</h2> +<p> +The UCSC Genome Browser uses <a href="https://en.wikipedia.org/wiki/MariaDB" +target="_blank">MariaDB</a> (fork of MySQL) as the backend database server and maintains +a public server at <b>genome-mysql.soe.ucsc.edu</b> to allow direct queries.</p> +<h3>Examples</h3> +<p> +<ul> +<li>For instance, here is an example of accessing the hg38 human assembly database and +selecting from the table <code>trackDb</code> all the entries in the group (grp) "genes" and +ordering those entries by tableName:<br> +<code> +mysql -h genome-mysql.soe.ucsc.edu -u genome -NBe 'select tableName from trackDb where grp = "genes" order by tableName' hg38 +</code></li> +<li>And here is an example of accessing a specific Transcription Factor Binding Site (TFBS) table +<code>wgEncodeRegTfbsClusteredV3</code> on the human hg19 assembly +and selecting entries from a 500 base pair region on chr1:<br> +<code> +mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -Ne 'select chrom,chromStart,chromEnd,name,score +from wgEncodeRegTfbsClusteredV3 where chrom = "chr1" and chromStart > 10000 and chromEnd < 10500;' hg19 +</code></li> +<li>And here is an example query that will pull all the long non-coding entries (lncRNA) from the +<code>wgEncodeGencodeBasicV39</code> table on the hg38 genome:<br> +<code> +mysql -u genome -h genome-mysql.soe.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, +wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");' +</code></li> +</ul></p> +<p> +See the <a target="_blank" href="mysql.html">Downloading Data using MariaDB (MySQL)</a> +for more information. Also, there is a mirror of the MariaDb server available +in Europe so commands can also be pointed to the <code>genome-euro-mysql</code> location. +<ul><li>For instance here is a command to access hg38 data from the Europe location:<br> +<code>mysql -h genome-mysql-euro.soe.ucsc.edu -u genome -NBe 'show tables' hg38</code></li> +</ul> +</p> + + +<!-- ====Cloud Software Section============================== --> + +<!-- ========== Genome Browser in a Box/in the Cloud (GBiB/GBiC) ============================== --> +<a id="GBiC"></a> +<h2>What are GBiB and GBIC? (Genome Browser in a Box/in the Cloud)</h2> +<p> +To replicate, or mirror, the software of the UCSC Genome Browser in +another location we offer the Genome Browser in a Cloud (GBiC) +and the Genome Browser in a Box (GBiB) software products.</p> +<p> +The GBiC is an installation script that automates the setup of a UCSC Genome Browser +mirror including setting up MariaDB and Apache servers. The program +downloads and configures MySQL and Apache, and then downloads +the UCSC Genome Browser software to /usr/local/apache to make a local +instance of the Browser.</p> +<p> +The GBiB is a small virtual machine version of the UCSC Genome Browser +that can be run on a laptop or desktop computer. It requires an installation +of a compatible version of the VirtualBox Software, and will then access +annotation data on demand through the Internet from UCSC as used, or +selective data can be downloaded for faster access.</p> +<p> +The GBiB and GBiC software tools resource the +<a href="#download">Download server</a> to <code>rsync</code> +data, as well as in certain circumstances the +<a href="#MySQL">MySQL server</a> to extract +coordinate-specific table data.</p> +<p> +See the individual support pages for the <a target="_blank" +href="gbic.html">GBiC</a> and the <a target="_blank" href="gbib.html">GBiB</a> +for detailed information about how to install and operate both. +You can get either the GBiC or the GBiB from the <a target="_blank" +href='https://genome-store.ucsc.edu/' title=''>UCSC Genome Browser store</a> +free for non-commercial use.</p> + +<!-- ========== Do you support Docker? ============================== --> +<a id="Docker"></a> +<h2>Do you support Docker?</h2> +<p> +We do support a Dockerfile, that in essence points to the GBiC installation +script. While we recommend our <a href="#GBiC">GBiC</a> script, we understand many +people are more familiar with working through Docker and provide +<a href="mirror.html#docker-installation-instructions" +target="_blank">Docker installation instructions</a>.</p> +<p> +Please note, similar to how our GBiB and GBiC are available in the +<a href="https://genome-store.ucsc.edu/" target="_blank">UCSC Genome Browser store</a>, +where usage of our mirror software is <b>free for non-commercial use</b>. +Any commercial usage, including through the Docker image, involves +a <a href="/license/index.html" target="_blank">license</a>.</p> + +<!-- ========== How do I extract data from the bigBed/2bit data formats? ============================== --> +<a id="bigBed"></a> +<h2>How do I extract data from the bigBed/2bit data formats?</h2> +<p> +A lot of our data is stored in a binary indexed version called bigBed. This format +saves space and also allows the extraction of information based on the first three fields +(chrom, chromStart, chromEnd), which define annotation coordinate location.</p> +<p> +To pull information out of bigBed files there is a tool called <code>bigBedToBed</code>. +By running the command by itself you can see the command options. +<pre> +bigBedToBed v1 - Convert from bigBed to ascii bed format. +usage: + bigBedToBed input.bb output.bed +options: + -chrom=chr1 - if set restrict output to given chromosome + -start=N - if set, restrict output to only that over start + -end=N - if set, restict output to only that under end + -maxItems=N - if set, restrict output to first N items + -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs + -header - output a autoSql-style header (starts with '#'). +</pre></p> +<p> +Another similar tool is available to extract data from the binary indexed 2bit sequence +storage format. The tool <code>twoBitToFa</code> can be given coordinate ranges and the +DNA can be extracted from the file.</p> +<h3>Examples</h3> +<p> +<ul> +<li>For instance, here is an example of accessing the hg38 2bit human assembly sequence file +hosted at the s3 Amazon bucket and extracting a small coordinate range: +<pre> +twoBitToFa -seq=chr1 -start=1234500 -end=1234600 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit stdout +>chr1:1234500-1234600 +GCGTCCCTAGGTCAGGCCGTTGAGTTCGAGCTCCGATGGGCCACCTTGAA +TCCAGGACTGACCGCCCGTGTGTGCACAGTTTGTTCTTGGACGAGGACTC +</pre></li> +<li>And here is an example of accessing the ENCODE Candidate Cis-Regulatory Elements (cCREs) bigBed +file hosted on the Amazon s3 bucket and extracting enhancers in a defined region. +<pre> +bigBedToBed -chrom=chr1 -start=190000 -end=200000 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb stdout | head +chr1 190865 191071 EH38E1310154 179 . 190865 191071 255,205,0 dELS,CTCF-boundd ELS 1.79282201562 enhDE1310154 EH38E1310154 distal enhancer-like signature +</pre></li> +</ul></p> + + +<!-- ========== Where can I learn more about Amazon Tools? ============================== --> +<a id="tag_name"></a> +<h2>Where can I learn more about Amazon Tools?</h2> +<p> +The Amazon Ecosystem comes integrated with a collection of systems such as +<a target="_blank" href="https://aws.amazon.com/cloudfront/">CloudFront</a>, +<a target="_blank" href="https://aws.amazon.com/cloudwatch/">CloudWatch</a>, +<a target="_blank" href="https://aws.amazon.com/rds/">Relational Database Service (RDS)</a>, +<a target="_blank" href="https://aws.amazon.com/ebs/">Elastic Block Store (EBS)</a>, +<a target="_blank" href="https://aws.amazon.com/lambda/">Lambda</a>, +and <a target="_blank" href="https://aws.amazon.com/aurora/">Aurora</a>. +Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. +The UCSC Genome Browser's tableName.MYD and tableName.MYI files can be used with Aurora, +instead of installing MariaDb, however, there may be some services costs in Amazon for +using Aurora.</p> + +<!--#include virtual="$ROOT/inc/gbPageEnd.html" -->