98a9f12dcb9aac0ebf3aff6baa7351f0132bc60e brianlee Wed Aug 31 16:42:50 2022 -0700 Adding Cloud help page for NAR paper (can be revamped as Data Access page) refs #28386 diff --git src/hg/htdocs/goldenPath/help/cloud.html src/hg/htdocs/goldenPath/help/cloud.html new file mode 100755 index 0000000..fa20f64 --- /dev/null +++ src/hg/htdocs/goldenPath/help/cloud.html @@ -0,0 +1,430 @@ + + + + + + + +
+This page exists to help people operating in the cloud. +It covers the topics of data access on the cloud, including details about our +API and Amazon s3://genome-browser bucket, +our software installations for cloud computing, and references to helpful +tools, such as our pre-compiled bigBedToBed utility. +
++S3 stands for Simple Storage Service, and it is the name for cloud storage in Amazon Web Services +(AWS). The data available through S3 is essentially stored in a folder called a bucket, and +files are called objects. The s3://genome-browser bucket is a copy of the main data available on our +UCSC Genome Browser Download website: https://hgdownload.soe.ucsc.edu/downloads.html
++By placing our Download server files in an S3 bucket, developers +working in the cloud can more easily integrate with UCSC data. +You can learn more about how S3-object-based storage works, and its advantages of being +accessible anywhere across the world with low latency and high durability by reviewing +Amazon's S3 documentation.
+ + ++The data mirrors our UCSC Genome Browser Download website's main rsync directories: +
+UCSC Human Golden Path Downloads s3://genome-browser/goldenPath +UCSC Human Genome Browser Gbdb Data Files s3://genome-browser/gbdb +UCSC Human Genome Raw Mysql Tables s3://genome-browser/mysql +UCSC Human Genome Web Site CGI Binaries s3://genome-browser/cgi-bin +UCSC Human Genome Web Site Htdocs s3://genome-browser/htdocs ++
+
goldenPath/hg38/bigZips/README.txt
. The README.txt, also
+available on the Download website,
+notes that the most recent patch-inclusive sequence is found in
+goldenPath/hg38/bigZips/latest/
.gbdb/hg38/hg38.2bit
, matching the file in the
+goldenPath/hg38/bigZips/latest/
+directory, reflecting how these files are operated on by the UCSC Genome Browser software
+in order to display assembly sequence when browsing.htdocs/goldenPath/pubs.html
which lists our publications.
+Amazon provides an AWS
+Command Line Interface (AWS CLI) which includes options such as sync.
+Here is an example that downloads the contents of a bucket with the AWS CLI: aws s3 sync s3://bucket-name .
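+As a minimal sketch of working with single objects rather than a whole bucket, the helpers below build an s3:// URI inside the genome-browser bucket and copy one object with the AWS CLI.
+The function names gb_s3_uri and gb_s3_fetch are hypothetical, and the use of --no-sign-request assumes the bucket permits anonymous reads; drop that flag if you use your own AWS credentials.

```shell
# Hypothetical helper: build an s3:// URI in the genome-browser bucket
# from a path relative to the bucket root (a leading "/" is tolerated).
gb_s3_uri() {
    printf 's3://genome-browser/%s\n' "${1#/}"
}

# Hypothetical helper: copy one object to a local directory (default: ".").
# Assumption: the bucket allows anonymous reads, hence --no-sign-request.
gb_s3_fetch() {
    aws s3 cp --no-sign-request "$(gb_s3_uri "$1")" "${2:-.}"
}
```

+For example, gb_s3_fetch goldenPath/hg38/bigZips/README.txt would copy that README into the current directory, assuming the AWS CLI is installed.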
+The data is also available via http at genome-browser.s3-website-us-east-1.amazonaws.com +where files can be accessed.
++
goldenPath/
Downloads directory:gbdb/
binary data directory
+for the human hg38 assembly 2bit file:htdocs/
hypertext document directory:
+http://genome-browser.s3-website-us-east-1.amazonaws.com/htdocs/goldenPath/pubs.html +The UCSC Genome Browser has a REST API for the programmatic extraction of data. +REST is an acronym for REpresentational State Transfer and API stands for +Application Programming Interface. Read more on the help page: http://genome.ucsc.edu/goldenPath/help/api.html
++The REST API returns data in JavaScript Object Notation (JSON) format, which +can easily be sent between computers, and used by many different programming languages.
+
+Data can be accessed with this URL: https://api.genome.ucsc.edu/. By adding
+different endpoint functions such as /list/
or /getData/
+specific results can be obtained.
+
+ wget -O- 'https://api.genome.ucsc.edu/list/publicHubs' + wget -O- 'https://api.genome.ucsc.edu/getData/sequence?genome=hg38;chrom=chrM;start=4321;end=5678' ++ + + +
+With different endpoint functions such as /list/
or
+/getData/
URLs can be constructed to pull specific results.
+
Endpoint function | Required | Optional |
---|---|---|
/list/publicHubs | (none) | (none) |
/list/ucscGenomes | (none) | (none) |
/list/hubGenomes | hubUrl | (none) |
/list/tracks | genome or (hubUrl and genome) | trackLeavesOnly=1 |
/list/chromosomes | genome or (hubUrl and genome) | track |
/list/schema | (genome or (hubUrl and genome)) and track | (none) |
/getData/sequence | (genome or (hubUrl and genome)) and chrom | start and +end |
/getData/track | (genome or (hubUrl and genome)) and track | chrom, +(start and end), maxItemsOutput, jsonOutputArrays |
+You can learn more about using the API to extract data by reviewing the example data access URLs that demonstrate the list and getData functions +and the further practical example URLs for extracting specific track data items.
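+The endpoint table above can be turned into request URLs mechanically. The sketch below assembles a URL from an endpoint function and a list of key=value parameters, joined with semicolons as in the wget examples on this page.
+The function name api_url is hypothetical, and the helper does not check that the required parameters for an endpoint are present.

```shell
# Hypothetical helper: build a UCSC REST API URL.
# $1 is the endpoint function (e.g. /getData/sequence);
# remaining arguments are key=value parameters.
api_url() {
    endpoint=$1
    shift
    url="https://api.genome.ucsc.edu/${endpoint#/}"
    sep='?'
    for kv in "$@"; do
        # First parameter follows "?", later ones follow ";"
        url="${url}${sep}${kv}"
        sep=';'
    done
    printf '%s\n' "$url"
}
```

+The result can be passed straight to wget, for example: wget -O- "$(api_url /list/tracks genome=hg38 trackLeavesOnly=1)"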
+ + + ++The UCSC Genome Browser Download website, hgdownload.soe.ucsc.edu, is the source of the data +hosted in the Amazon s3://genome-browser bucket. It can be viewed in a web browser to access +specific download files, or the data can be copied with rsync commands.
++For instance, the following rsync +command will show you the various rsync directories available on our Download server: +
+$ rsync -a -P rsync://hgdownload.soe.ucsc.edu/ + +genome UCSC Human Genome Downloads +sars UCSC Human Genome SARS Downloads +htdocs UCSC Human Genome Web Site Htdocs +goldenPath UCSC Human Golden Path Downloads +cgi-bin UCSC Human Genome Web Site CGI Binaries x86_64 +cgi-bin-i386 UCSC Human Genome Web Site CGI Binaries i386 +gbdb UCSC Human Genome Browser Gbdb Config Files +archives UCSC Human Genome Browser Archived Config Files +mysql UCSC Human Genome Raw Mysql Tables +gbib UCSC Genome Browser in a Box +hubs UCSC Genome Browser Public Hubs ++
goldenPath/
Downloads directory:rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/README.txt ./
gbdb/
binary data directory
+for the human hg38 assembly 2bit file:rsync -a -P rsync://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./
htdocs/
hypertext document directory:
+rsync -a -P rsync://hgdownload.soe.ucsc.edu/htdocs/goldenPath/pubs.html ./
+Many of these rsync directories exist to support the Genome Browser in a Cloud (GBiC) and the Genome Browser in a Box (GBiB) software products discussed below.
+Also note that there is a mirror of the download server available in Europe so the above rsync
+commands can also be pointed to the hgdownload-euro
locations.
+
rsync -a -P rsync://hgdownload-euro.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./
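+Because the two servers differ only in host name, switching between them can be reduced to one argument. The following is a small sketch under that assumption; the function name gb_rsync_url and the labels "us" and "euro" are hypothetical.

```shell
# Hypothetical helper: build an rsync:// URL on either download server.
# $1 is "us" or "euro"; $2 is a path such as gbdb/hg38/hg38.2bit.
gb_rsync_url() {
    case "$1" in
        us)   host="hgdownload.soe.ucsc.edu" ;;
        euro) host="hgdownload-euro.soe.ucsc.edu" ;;
        *)    echo "usage: gb_rsync_url us|euro path" >&2
              return 1 ;;
    esac
    printf 'rsync://%s/%s\n' "$host" "${2#/}"
}
```

+A dry run such as rsync -a -P -n "$(gb_rsync_url euro gbdb/hg38/hg38.2bit)" ./ lists what would be transferred without copying anything.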
+The UCSC Genome Browser uses MariaDB (a fork of MySQL) as the backend database server and maintains +a public server at genome-mysql.soe.ucsc.edu to allow direct queries.
++
trackDb
all the entries in the group (grp) "genes" and
+ordering those entries by tableName:
+mysql -h genome-mysql.soe.ucsc.edu -u genome -NBe 'select tableName from trackDb where grp = "genes" order by tableName' hg38
+
wgEncodeRegTfbsClusteredV3
on the human hg19 assembly
+and selecting entries from a 500 base pair region on chr1:
+mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -Ne 'select chrom,chromStart,chromEnd,name,score
+from wgEncodeRegTfbsClusteredV3 where chrom = "chr1" and chromStart > 10000 and chromEnd < 10500;' hg19
+
wgEncodeGencodeBasicV39
table on the hg38 genome:
+mysql -u genome -h genome-mysql.soe.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g,
+wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");'
+
+See the Downloading Data using MariaDB (MySQL)
+for more information. Also, there is a mirror of the MariaDB server available
+in Europe so commands can also be pointed to the genome-euro-mysql
location.
+
mysql -h genome-euro-mysql.soe.ucsc.edu -u genome -NBe 'show tables' hg38
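+The region query shown earlier for wgEncodeRegTfbsClusteredV3 follows a pattern that works for any table with chrom, chromStart, chromEnd, name and score columns. As a sketch, the helper below generates that SQL for a given table and region; the function name region_sql is hypothetical, and it assumes the table has those five columns.

```shell
# Hypothetical helper: print SQL selecting items that lie entirely
# within chrom:start-end, mirroring the hg19 example on this page.
# $1 table, $2 chrom, $3 start, $4 end.
# Assumption: the table has chrom, chromStart, chromEnd, name, score.
region_sql() {
    printf 'select chrom,chromStart,chromEnd,name,score from %s where chrom = "%s" and chromStart > %d and chromEnd < %d;\n' \
        "$1" "$2" "$3" "$4"
}
```

+The output can be handed to the public server, for example: mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -Ne "$(region_sql wgEncodeRegTfbsClusteredV3 chr1 10000 10500)" hg19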
+To replicate, or mirror, the software of the UCSC Genome Browser in +another location we offer the Genome Browser in a Cloud (GBiC) +and the Genome Browser in a Box (GBiB) software products.
++The GBiC is an installation script that automates the setup of a UCSC Genome Browser +mirror. The program downloads and configures the MariaDB (MySQL) and Apache servers, +and then downloads the UCSC Genome Browser software to /usr/local/apache to make a local +instance of the Browser.
++The GBiB is a small virtual machine version of the UCSC Genome Browser +that can be run on a laptop or desktop computer. It requires a compatible +version of the VirtualBox software. Once installed, it fetches +annotation data from UCSC over the Internet on demand, or +selected data can be downloaded for faster access.
+
+The GBiB and GBiC software tools use the
+Download server to rsync
+data, as well as in certain circumstances the
+MySQL server to extract
+coordinate-specific table data.
+See the individual support pages for the GBiC and the GBiB +for detailed information about how to install and operate both. +You can get either the GBiC or the GBiB from the UCSC Genome Browser store +free for non-commercial use.
+ + + ++We also support a Dockerfile that, in essence, points to the GBiC installation +script. While we recommend our GBiC script, we understand many +people are more familiar with working through Docker, so we provide +Docker installation instructions.
++Please note that, as with the GBiB and GBiC available in the +UCSC Genome Browser store, +usage of our mirror software is free for non-commercial use. +Any commercial usage, including through the Docker image, requires +a license.
+ + + ++Much of our data is stored in a binary indexed format called bigBed. This format +saves space and also allows the extraction of records based on the first three fields +(chrom, chromStart, chromEnd), which define the coordinates of each annotation.
+
+To pull information out of bigBed files there is a tool called bigBedToBed
.
+Running the command with no arguments prints its options.
+
+bigBedToBed v1 - Convert from bigBed to ascii bed format. +usage: + bigBedToBed input.bb output.bed +options: + -chrom=chr1 - if set restrict output to given chromosome + -start=N - if set, restrict output to only that over start + -end=N - if set, restict output to only that under end + -maxItems=N - if set, restrict output to first N items + -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs + -header - output a autoSql-style header (starts with '#'). ++
+Another similar tool is available to extract data from the binary indexed 2bit sequence
+storage format. The tool twoBitToFa
can be given coordinate ranges and the
+DNA can be extracted from the file.
+
+twoBitToFa -seq=chr1 -start=1234500 -end=1234600 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit stdout +>chr1:1234500-1234600 +GCGTCCCTAGGTCAGGCCGTTGAGTTCGAGCTCCGATGGGCCACCTTGAA +TCCAGGACTGACCGCCCGTGTGTGCACAGTTTGTTCTTGGACGAGGACTC +
+bigBedToBed -chrom=chr1 -start=190000 -end=200000 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb stdout | head +chr1 190865 191071 EH38E1310154 179 . 190865 191071 255,205,0 dELS,CTCF-boundd ELS 1.79282201562 enhDE1310154 EH38E1310154 distal enhancer-like signature +
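+Both commands above read a file over http from the same bucket, so the long URL can be factored out. The sketch below does that and wraps the bigBedToBed call for a coordinate range; the names GB_HTTP_BASE, gb_http_url and bb_region are hypothetical, and bb_region assumes bigBedToBed is already on your PATH.

```shell
# Base URL of the bucket's http endpoint, as used on this page.
GB_HTTP_BASE="http://genome-browser.s3-website-us-east-1.amazonaws.com"

# Hypothetical helper: build the http URL for a path in the bucket.
gb_http_url() {
    printf '%s/%s\n' "$GB_HTTP_BASE" "${1#/}"
}

# Hypothetical helper: print BED rows from a remote bigBed for one region.
# $1 path in the bucket, $2 chrom, $3 start, $4 end.
# Assumption: bigBedToBed is installed and on PATH.
bb_region() {
    bigBedToBed -chrom="$2" -start="$3" -end="$4" "$(gb_http_url "$1")" stdout
}
```

+For example, bb_region gbdb/hg38/encode3/ccre/encodeCcreCombined.bb chr1 190000 200000 | head repeats the bigBedToBed query shown above.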
+The Amazon Ecosystem comes integrated with a collection of systems such as +CloudFront, +CloudWatch, +Relational Database Service (RDS), +Elastic Block Store (EBS), +Lambda, +and Aurora. +Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. +The UCSC Genome Browser's tableName.MYD and tableName.MYI files can be used with Aurora +instead of installing MariaDB. However, Amazon may charge service costs for +using Aurora.
+ +