98a9f12dcb9aac0ebf3aff6baa7351f0132bc60e brianlee Wed Aug 31 16:42:50 2022 -0700 Adding Cloud help page for NAR paper (can be revamped as Data Access page) refs #28386 diff --git src/hg/htdocs/goldenPath/help/cloud.html src/hg/htdocs/goldenPath/help/cloud.html new file mode 100755 index 0000000..fa20f64 --- /dev/null +++ src/hg/htdocs/goldenPath/help/cloud.html @@ -0,0 +1,430 @@ + + + + + + + +

Cloud Data and Software Resources

+ + + +

+This page is for people working in the cloud. +It covers data access in the cloud, including details about our +API and Amazon s3://genome-browser bucket, +our software installations for cloud computing, and references to helpful +tools, such as our pre-compiled bigBedToBed utility. +

+

Contents

+ +

Cloud Data

+
+
+ + +
+
+
+
+ + +
+
+ +

Cloud Software

+
+
+ +
+
Docker
+ +
+
+
+ +

Helpful Tools

+
+
+ +
+
Amazon Ecosystem
+ +
+
+
+ + + + +

What is the Amazon s3://genome-browser bucket?

+

+S3 stands for Simple Storage Service, and it is the name for cloud storage in Amazon Web Services +(AWS). The data available through S3 is essentially stored in a folder called a bucket, and +files are called objects. The s3://genome-browser bucket is a copy of the main data available on our +UCSC Genome Browser Download website: https://hgdownload.soe.ucsc.edu/downloads.html

+

+By placing our Download server files in an S3 bucket, developers +working in the cloud can more easily integrate with UCSC data. +You can learn more about how S3 object-based storage works, and about its advantages of +worldwide accessibility, low latency, and high durability, by reviewing +Amazon's S3 documentation.

+ + +

What specific files are in the s3://genome-browser bucket?

+

+The data mirrors our UCSC Genome Browser Download website's main rsync directories: +

+UCSC Human Golden Path Downloads             s3://genome-browser/goldenPath
+UCSC Human Genome Browser Gbdb Data Files    s3://genome-browser/gbdb
+UCSC Human Genome Raw Mysql Tables           s3://genome-browser/mysql
+UCSC Human Genome Web Site CGI Binaries      s3://genome-browser/cgi-bin
+UCSC Human Genome Web Site Htdocs            s3://genome-browser/htdocs
+

+

+

+ + +

How can one get data from the s3://genome-browser bucket?

+

+Amazon provides an AWS +Command Line Interface (AWS CLI), which includes commands such as sync. +Here is an example that downloads a bucket with the AWS CLI: aws s3 sync s3://bucket-name .

+

+The files can also be accessed over HTTP at genome-browser.s3-website-us-east-1.amazonaws.com.
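+As a rough sketch of the two access routes described above, the Python helpers below build the aws s3 sync command and translate an s3://genome-browser path into a URL on the HTTP endpoint. The one-to-one mapping between object keys and URL paths is an assumption, and the function names are made up for illustration. Nothing here contacts the network.

```python
from typing import List
from urllib.parse import quote

# Bucket name and website host are quoted from the text above.
BUCKET = "genome-browser"
WEBSITE_HOST = "genome-browser.s3-website-us-east-1.amazonaws.com"


def sync_command(s3_uri: str, dest: str = ".") -> List[str]:
    """Build the argument list for `aws s3 sync <s3_uri> <dest>`."""
    return ["aws", "s3", "sync", s3_uri, dest]


def s3_uri_to_http_url(s3_uri: str) -> str:
    """Translate s3://genome-browser/<key> into an http:// URL.

    Assumption: object keys map one-to-one onto URL paths on the endpoint.
    """
    prefix = f"s3://{BUCKET}/"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}, got {s3_uri!r}")
    key = s3_uri[len(prefix):]
    # Percent-encode unusual characters but keep the "/" path separators.
    return f"http://{WEBSITE_HOST}/{quote(key)}"
```

+The list returned by sync_command can be handed to subprocess.run once the AWS CLI is installed and configured.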

+

Examples

+

+

+ + + +

What is the REST API?

+

+The UCSC Genome Browser has a REST API for the programmatic extraction of data. +REST is an acronym for REpresentational State Transfer and API stands for +Application Programming Interface. Read more on the help page: http://genome.ucsc.edu/goldenPath/help/api.html

+

+The REST API returns data in JavaScript Object Notation (JSON) format, which +can easily be sent between computers, and used by many different programming languages.

+

+Data can be accessed at https://api.genome.ucsc.edu/ and, by adding +different endpoint functions such as /list/ or /getData/, +specific results can be obtained.

+

Examples

+

+

+    wget -O- 'https://api.genome.ucsc.edu/list/publicHubs'
+    wget -O- 'https://api.genome.ucsc.edu/getData/sequence?genome=hg38;chrom=chrM;start=4321;end=5678'
+
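+The same two requests can be assembled in Python. This is a minimal sketch: api_url and fetch_json are hypothetical helper names, and the ";" separator simply mirrors the wget examples above. Parameter values are inserted as given, without escaping.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def api_url(endpoint: str, **params) -> str:
    """Join an endpoint such as '/getData/sequence' with ';'-separated parameters."""
    query = ";".join(f"{name}={value}" for name, value in params.items())
    url = API_BASE + "/" + endpoint.strip("/")
    return f"{url}?{query}" if query else url


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download a URL and decode its JSON body (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; call fetch_json(url) to actually retrieve data.
    print(api_url("/list/publicHubs"))
    print(api_url("/getData/sequence", genome="hg38", chrom="chrM", start=4321, end=5678))
```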

+ + + +

What kind of data can you get from the REST API?

+

+With different endpoint functions such as /list/ or +/getData/, URLs can be constructed to pull specific results. + + + + + + + + + + +
+Endpoint function    Required                                     Optional
+/list/publicHubs     (none)                                       (none)
+/list/ucscGenomes    (none)                                       (none)
+/list/hubGenomes     hubUrl                                       (none)
+/list/tracks         genome or (hubUrl and genome)                trackLeavesOnly=1
+/list/chromosomes    genome or (hubUrl and genome)                track
+/list/schema         (genome or (hubUrl and genome)) and track    (none)
+/getData/sequence    (genome or (hubUrl and genome)) and chrom    start and end
+/getData/track       (genome or (hubUrl and genome)) and track    chrom, (start and end), maxItemsOutput, jsonOutputArrays

+
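+The Required column of the table above can be encoded as a small lookup so that a request is checked before it is sent. This is only a sketch: the names REQUIRED and missing_parameters are hypothetical, and "genome or (hubUrl and genome)" is reduced to genome because genome appears in both alternatives.

```python
from typing import Dict, FrozenSet, Iterable, List

# Required parameters transcribed from the table above.
# "genome or (hubUrl and genome)" needs genome either way, so only genome is listed.
REQUIRED: Dict[str, FrozenSet[str]] = {
    "/list/publicHubs": frozenset(),
    "/list/ucscGenomes": frozenset(),
    "/list/hubGenomes": frozenset({"hubUrl"}),
    "/list/tracks": frozenset({"genome"}),
    "/list/chromosomes": frozenset({"genome"}),
    "/list/schema": frozenset({"genome", "track"}),
    "/getData/sequence": frozenset({"genome", "chrom"}),
    "/getData/track": frozenset({"genome", "track"}),
}


def missing_parameters(endpoint: str, supplied: Iterable[str]) -> List[str]:
    """Return the required parameter names absent from `supplied`, sorted."""
    return sorted(REQUIRED[endpoint] - set(supplied))
```

+For example, missing_parameters("/getData/sequence", ["genome"]) reports that chrom still has to be supplied.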

+To learn more about using the API to extract data, review the example data access URLs +that demonstrate the list and getData functions, and the further practical example URLs +that extract specific track data items.

+ + + +

What is the Download server and how does one use it?

+

+The UCSC Genome Browser Download website, hgdownload.soe.ucsc.edu, is the source of the data +hosted in the Amazon s3://genome-browser bucket. It can be viewed in a web browser to access +specific download files, or the data can be copied with rsync commands.

+

Examples

+

+For instance, the following rsync +command will show you the various rsync directories available on our Download server: +

+$ rsync -a -P rsync://hgdownload.soe.ucsc.edu/ 
+
+genome         UCSC Human Genome Downloads
+sars           UCSC Human Genome SARS Downloads
+htdocs         UCSC Human Genome Web Site Htdocs
+goldenPath     UCSC Human Golden Path Downloads
+cgi-bin        UCSC Human Genome Web Site CGI Binaries x86_64
+cgi-bin-i386   UCSC Human Genome Web Site CGI Binaries i386
+gbdb           UCSC Human Genome Browser Gbdb Config Files
+archives       UCSC Human Genome Browser Archived Config Files
+mysql          UCSC Human Genome Raw Mysql Tables
+gbib           UCSC Genome Browser in a Box
+hubs           UCSC Genome Browser Public Hubs
+
+

+

+Many of these rsync directories exist to support the Genome Browser in the Cloud (GBiC) and the Genome Browser in a Box (GBiB) software products discussed below. +Also note that there is a mirror of the download server available in Europe, so the above rsync +commands can also be pointed to the hgdownload-euro locations. +
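+A small Python helper can assemble the rsync invocation shown above for any of the listed directories. This is a sketch under assumptions: rsync_command is a hypothetical name, and the sub-path used in the usage note is only an illustration. The host argument is there so the same call can target the European mirror once its full hostname is supplied.

```python
from typing import List

DEFAULT_HOST = "hgdownload.soe.ucsc.edu"


def rsync_command(module: str = "", path: str = "", dest: str = "",
                  host: str = DEFAULT_HOST) -> List[str]:
    """Build an rsync argument list.

    With no module, the command lists the rsync directories on the server,
    matching the example above.
    """
    source = f"rsync://{host}/"
    if module:
        source += module.strip("/") + "/" + path.lstrip("/")
    command = ["rsync", "-a", "-P", source]
    if dest:
        command.append(dest)
    return command
```

+Calling rsync_command("goldenPath", "hg38/bigZips/", "./bigZips/") yields a command that would copy that (illustrative) sub-directory into ./bigZips/ when passed to subprocess.run.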

+ + + +

What is the MySQL server and how does one use it?

+

+The UCSC Genome Browser uses MariaDB (a fork of MySQL) as the backend database server and maintains +a public server at genome-mysql.soe.ucsc.edu to allow direct queries.

+

Examples

+

+

+

+See the Downloading Data using MariaDB (MySQL) page +for more information. Also, there is a mirror of the MariaDB server available +in Europe, so commands can also be pointed to the genome-euro-mysql location. +
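+A direct query against the public server might be scripted along these lines. This is a sketch with several assumptions that are not stated on this page: the account name genome, the use of hg38 as a database name, and the helper name mysql_command. Only the host name comes from the text above. The block builds the command without running it.

```python
from typing import List

MYSQL_HOST = "genome-mysql.soe.ucsc.edu"


def mysql_command(database: str, query: str, host: str = MYSQL_HOST,
                  user: str = "genome") -> List[str]:
    """Build a mysql client invocation that runs one query and exits."""
    return [
        "mysql",
        f"--host={host}",
        f"--user={user}",      # assumed public read-only account name
        "--no-auto-rehash",    # skip table-name completion for a faster start
        f"--execute={query}",
        database,
    ]
```

+The host argument can be replaced with the full hostname of the European mirror where that is closer.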

+

+ + + + + + +

What are GBiB and GBiC? (Genome Browser in a Box/in the Cloud)

+

+To replicate, or mirror, the software of the UCSC Genome Browser in +another location, we offer the Genome Browser in the Cloud (GBiC) +and the Genome Browser in a Box (GBiB) software products.

+

+The GBiC is an installation script that automates the setup of a UCSC Genome Browser +mirror. The program +downloads and configures the MariaDB (MySQL) and Apache servers, and then downloads +the UCSC Genome Browser software to /usr/local/apache to make a local +instance of the Browser.

+

+The GBiB is a small virtual machine version of the UCSC Genome Browser +that can be run on a laptop or desktop computer. It requires an installation +of a compatible version of the VirtualBox software, and then accesses +annotation data from UCSC over the Internet on demand; +selected data can also be downloaded for faster access.

+

+The GBiB and GBiC software tools use the +Download server to rsync +data and, in certain circumstances, the +MySQL server to extract +coordinate-specific table data.

+

+See the individual support pages for the GBiC and the GBiB +for detailed information about how to install and operate both. +You can get either the GBiC or the GBiB from the UCSC Genome Browser store +free for non-commercial use.

+ + + +

Do you support Docker?

+

+We do support a Dockerfile that, in essence, points to the GBiC installation +script. While we recommend our GBiC script, we understand many +people are more familiar with working through Docker, so we provide +Docker installation instructions.

+

+Please note that, as with the GBiB and GBiC available in the +UCSC Genome Browser store, +usage of our mirror software is free for non-commercial use. +Any commercial usage, including through the Docker image, requires +a license.

+ + + +

How do I extract data from the bigBed/2bit data formats?

+

+Much of our data is stored in a binary indexed format called bigBed. This format +saves space and also allows the extraction of information based on the first three fields +(chrom, chromStart, chromEnd), which define the coordinate location of an annotation.

+

+To pull information out of bigBed files there is a tool called bigBedToBed. +Running the command by itself prints its options. +

+bigBedToBed v1 - Convert from bigBed to ascii bed format.
+usage:
+   bigBedToBed input.bb output.bed
+options:
+   -chrom=chr1 - if set restrict output to given chromosome
+   -start=N - if set, restrict output to only that over start
+   -end=N - if set, restict output to only that under end
+   -maxItems=N - if set, restrict output to first N items
+   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
+   -header - output a autoSql-style header (starts with '#').
+
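+The options in the usage message above can be combined into a single call. The Python sketch below only assembles the argument list; big_bed_to_bed_command is a hypothetical helper name and the file names in the usage note are placeholders. The flags themselves are copied from the usage message.

```python
from typing import List, Optional


def big_bed_to_bed_command(input_bb: str, output_bed: str,
                           chrom: Optional[str] = None,
                           start: Optional[int] = None,
                           end: Optional[int] = None,
                           max_items: Optional[int] = None,
                           header: bool = False) -> List[str]:
    """Build a bigBedToBed argument list from the documented options."""
    command = ["bigBedToBed"]
    if chrom is not None:
        command.append(f"-chrom={chrom}")
    if start is not None:
        command.append(f"-start={start}")
    if end is not None:
        command.append(f"-end={end}")
    if max_items is not None:
        command.append(f"-maxItems={max_items}")
    if header:
        command.append("-header")
    # Input and output files come last.
    command += [input_bb, output_bed]
    return command
```

+For instance, big_bed_to_bed_command("input.bb", "output.bed", chrom="chr1", start=1000, end=2000) restricts the output to one region of chr1.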

+

+A similar tool, twoBitToFa, extracts data from the binary indexed 2bit sequence +storage format. Given coordinate ranges, it extracts the corresponding +DNA from the file.
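+A coordinate-range extraction with twoBitToFa could be assembled as follows. This is a sketch that assumes the tool's -seq, -start and -end options, which are not listed on this page, so check them against the tool's own usage message. The helper name and the file names are placeholders.

```python
from typing import List, Optional


def two_bit_to_fa_command(input_2bit: str, output_fa: str,
                          seq: Optional[str] = None,
                          start: Optional[int] = None,
                          end: Optional[int] = None) -> List[str]:
    """Build a twoBitToFa argument list for an optional sequence and range."""
    command = ["twoBitToFa"]
    if seq is not None:
        command.append(f"-seq={seq}")      # assumed option: sequence name
    if start is not None:
        command.append(f"-start={start}")  # assumed option: range start
    if end is not None:
        command.append(f"-end={end}")      # assumed option: range end
    command += [input_2bit, output_fa]
    return command
```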

+

Examples

+

+

+ + + + +

Where can I learn more about Amazon Tools?

+

+The Amazon Ecosystem comes integrated with a collection of systems such as +CloudFront, +CloudWatch, +Relational Database Service (RDS), +Elastic Block Store (EBS), +Lambda, +and Aurora. +Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud. +The UCSC Genome Browser's tableName.MYD and tableName.MYI files can be used with Aurora +instead of installing MariaDB; however, Amazon may charge service costs for +using Aurora.

+ +