bbabbd5d2566d47d923d51dbe350634783455999
mspeir
Sun Oct 26 12:14:52 2025 -0700
change soe to gi, refs #35031
diff --git src/hg/htdocs/FAQ/FAQdownloads.html src/hg/htdocs/FAQ/FAQdownloads.html
index 054468a690e..1abecf9933f 100755
--- src/hg/htdocs/FAQ/FAQdownloads.html
+++ src/hg/htdocs/FAQ/FAQdownloads.html
@@ -62,68 +62,68 @@
Return to FAQ Table of Contents
Sequence and annotation data downloads are usually made available within the first week of the release of a new assembly. The download directories are automatically updated nightly to incorporate additions and modifications to the data.
You can download sequence and annotation data using our FTP server, but we recommend using rsync, which resumes where it left off when run again after a failure. Please see the previous link for examples.
You can also download data from our
-Downloads page or our
+Downloads page or our
DAS server. To download a specific subset of the data or to configure the output format of the data, use the Table Browser. For information on extracting a large set of sequences from an assembly, see Extracting sequence in batch from an assembly.
For more information on using the UCSC DAS server, see Downloading data from the UCSC DAS server.
Another option for querying sequence and annotation data is the REST API. This interface allows for extraction of sequence and annotations from both UCSC assemblies and from hubs.
To quickly download large volumes of data you can use UDR (UDT Enabled Rsync): UDR provides users much faster download rates. Here is an example using UDR, once installed, to download all the mouse mm9 ENCODE information that amounts to several terabytes:
-$ udr rsync -avP hgdownload.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
+$ udr rsync -avP hgdownload.gi.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
$ udr rsync -avP hgdownload-euro.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
Optional: download from our secondary download server.
$ udr rsync -avP hgdownload2.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
Please read more about the new UDR method here.
As of June 2016, the metadata tables that support the GenBank and RefSeq tracks (RefSeq, Other RefSeq, mRNA, EST, etc.) have been moved from the directories of individual assemblies to one global database, hgFixed.
The tables below (previously found per assembly) can now be downloaded from the
-hgFixed database:
+hgFixed database:
There are two ways to extract genomic sequence in batch from an assembly:
A. Download the appropriate fasta files from our
-ftp server and extract sequence data using
+ftp server and extract sequence data using
your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found in
-ftp://hgdownload.soe.ucsc.edu/goldenPath/hg17/chromosomes/.
+ftp://hgdownload.gi.ucsc.edu/goldenPath/hg17/chromosomes/.
You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute it without arguments.
B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.
Note: It is not recommended to use LiftOver to convert SNPs between assemblies; more information about how to convert SNPs between assemblies can be found in the following FAQ entry.
If you wish to update a large number of coordinates to a different assembly and have access to a Linux platform, you may find it useful to try the command-line version of the LiftOver tool. The executable file for this utility can be downloaded here. LiftOver requires a pre-generated over.chain file as input, available for selected assemblies from the
-Downloads page. If the desired
+Downloads page. If the desired
file is not available, send a request to the genome mailing list and we may be able to provide you with one.
Here is an example of how to set up and run LiftOver from the command line:
chmod +x liftOver
./liftOver
liftOver - Move annotations from one assembly to another
usage:
liftOver oldFile map.chain newFile unMapped
...
-wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToMm10.over.chain.gz
+wget http://hgdownload.gi.ucsc.edu/goldenPath/hg38/liftOver/hg38ToMm10.over.chain.gz
chr1 11166587 11191615 MTOR
chr9 136130562 136150630 ABO
chr12 25358179 25403854 KRAS
chrX 151335633 151619831 GABRA3
./liftOver preLift.bed hg19ToHg38.over.chain.gz conversions.bed unMapped
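As a minimal sketch, the input file for the command above could be created as follows. The file name preLift.bed and the four records are taken from the example; writing them tab-separated with printf is an assumption made here for illustration, and the liftOver call is left as a comment because it needs the executable and a chain file to be present locally.

```shell
# Write the four example records to preLift.bed as tab-separated BED4 lines.
printf '%s\t%s\t%s\t%s\n' \
    chr1  11166587  11191615  MTOR \
    chr9  136130562 136150630 ABO \
    chr12 25358179  25403854  KRAS \
    chrX  151335633 151619831 GABRA3 > preLift.bed

# Show how many records were written.
wc -l < preLift.bed

# With liftOver and the chain file in the current directory, the conversion
# from the example would then be run as:
#   ./liftOver preLift.bed hg19ToHg38.over.chain.gz conversions.bed unMapped
```

Regions that convert are written to conversions.bed, and those that do not are written to unMapped.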
If you are looking at the RefSeq Genes, the refFlat table contains both the gene name (usually a HUGO Gene Nomenclature Committee ID) and its accession number. For the Known Genes, use the kgAlias table.
To obtain a complete copy of the entire Known Genes data set for an organism, open the Genome
-Browser Downloads page, jump to the
+Browser Downloads page, jump to the
section specific to the organism, click the Annotation database link in that section, then click the link for the knownGene.txt.gz table.
Data for a specific region or chromosome may be obtained from the Table Browser by selecting the "Genes and Gene Prediction Tracks" group, the "UCSC Genes" track and the "knownGene" table. Set the position to the region of interest, then click the "get output" button.
UCSC uses the latest versions of RepeatMasker and repeat libraries available on the date when the assembly data is processed. RepeatMasker version information can usually be found in the README text
-for the assembly's bigZips downloads
+for the assembly's bigZips downloads
directory.
Masking is done using the RepeatMasker -s flag. For mouse repeats, we also use -m. In addition to RepeatMasker, we use the Tandem Repeat Finder (trf) program, masking out repeats of period 12 or less. The repeats are just "soft" masked. Alignments are allowed to extend through repeats, but not initiate in them.
Yes, you can obtain the repeat-masked files via the Table Browser or from the organism's annotation database downloads directory. The RepeatMasker annotation tables are named chrN_rmsk (where N represents the chromosome number) and the Tandem Repeat Finder (TRF) tables are named simpleRepeat.
@@ -817,31 +817,31 @@
UCSC occasionally uses updated versions of the RepeatMasker software and repeat libraries that are not yet available on the RepeatMasker website (see Repeat-masking data for more information).
The UCSC Genome Browser offers several ways to obtain this information, depending on your requirements.
-The Genome Browser downloads site
+The Genome Browser downloads site
provides prepackaged downloads of 1000 bp, 2000 bp, and 5000 bp upstream sequence for RefSeq genes that have a coding portion and annotated 5' and 3' UTRs. You can obtain these from the bigZips downloads directory for the assembly of interest.
To fetch the upstream sequence for a specific gene, use the Table Browser. Enter the genome, assembly, and select the knownGene table. Paste the gene name or accession number in the identifier field. Choose sequence for the output format type, then click the get output button. On the next page, select genomic. On the final page, you will have the opportunity to configure the amount of upstream promoter sequence to fetch, along with several other options. Click Get Sequence when you've finished configuring the output.
You can also use the Genome Browser to obtain sequence for a specific gene. Open the Genome Browser window to display the gene in which you're interested. Click the entry for the gene in the RefSeq or Known Genes track, then click the Genomic Sequence link. Alternatively, you can click the DNA link in the top menu bar of the Genome Browser tracks window to access options for displaying the
@@ -898,31 +898,31 @@
contains the physical position of all STS markers, including those on the deCODE map. This file also contains information about the position on the genome-wide maps, including the deCODE map. A second file, stsInfo2, contains additional information about each marker, including aliases, primer sequence information, etc. This table is related to the first table by an ID (the identNo field in both files).
Yes. See our documentation on Downloading Data using MariaDB (MySQL).
Connect to the US MariaDB server using the command:
-mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A
+mysql --user=genome --host=genome-mysql.gi.ucsc.edu -A
Or to the European MariaDB server using the command:
mysql --user=genome --host=genome-euro-mysql.soe.ucsc.edu -A
The fourth column of the BED output contains a lot of information separated by underscores. For example:
uc009vjk.2_cds_1_0_chr1_324343_f
This information is represented as follows:
ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand
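The layout above can be split into its parts with the shell's own field splitting. This is a sketch only: the variable names are made up for illustration, and it assumes that neither the UCSC ID nor the chromosome name contains an underscore (names such as chr1_random would need different handling).

```shell
# Example name field from the fourth BED column.
name="uc009vjk.2_cds_1_0_chr1_324343_f"

# Split on underscores into the seven documented components.
IFS=_ read -r ucscId seqType seqTypeNum basesAdded chrom firstBase strand <<EOF
$name
EOF

echo "id=$ucscId type=$seqType number=$seqTypeNum added=$basesAdded"
echo "chrom=$chrom firstBase=$firstBase strand=$strand"
```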
The raw data underlying a track can be explored interactively with the Table Browser, Data Integrator, or Variant Annotation Integrator. For automated analysis, the genome annotation can be downloaded from the
-downloads server, one of our two
+downloads server, one of our two
public MariaDB servers, or using our REST API.
bigBed data: For bigBed files, individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found
-here. The tool can
+here. The tool can
also be used to obtain only features within a given range using one of the hgdownload servers, example:
- bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/path/to/file/bigBedfile.bb -chrom=chr21 -start=0 -end=1000000 stdout
+ bigBedToBed http://hgdownload.gi.ucsc.edu/gbdb/path/to/file/bigBedfile.bb -chrom=chr21 -start=0 -end=1000000 stdout
bigBedToBed http://hgdownload-euro.soe.ucsc.edu/gbdb/path/to/file/bigBedfile.bb -chrom=chr21 -start=0 -end=1000000 stdout
Read more in our blog about Accessing the Genome Browser Programmatically to acquire data.
For versions dbSNP153 and above, the data is formatted in bigBed files. Previous versions are MySQL tables. For help with versions before dbSNP153, see accessing MySQL data. This FAQ entry pertains to versions dbSNP153 and above.
Since dbSNP has grown to include over 700 million variants, the size of the All dbSNP (153+) subtrack can cause the Table Browser and Data Integrator to time out, leading to a blank page or truncated output, unless queries are restricted to a chromosomal region or to a specific set of rs# IDs (which can be pasted/uploaded into the Table Browser), or to one of the subset tracks such as Common or ClinVar.
For automated analysis, the track data files can be downloaded from the downloads server for
-hg19 and
-hg38. Below
+hg19 and
+hg38. Below
are specific examples for dbSNP153, however, the same methods and directories will work by substituting a more recent dbSNP release.
| file | | | format | subtrack |
|---|---|---|---|---|
| dbSnp153.bb | -hg19 | -hg38 | bigDbSnp (bigBed4+13) | All dbSNP (153) |
| dbSnp153ClinVar.bb | -hg19 | -hg38 | bigDbSnp (bigBed4+13) | ClinVar dbSNP (153) |
| dbSnp153Common.bb | -hg19 | -hg38 | bigDbSnp (bigBed4+13) | Common dbSNP (153) |
| dbSnp153Mult.bb | -hg19 | -hg38 | bigDbSnp (bigBed4+13) | Mult. dbSNP (153) |
| dbSnp153BadCoords.bb | -hg19 | -hg38 | bigBed4 | Map Err (153) |
| -dbSnp153Details.tab.gz | | | gzip-compressed tab-separated text | Detailed variant properties, independent of genome assembly version |
Several utilities for working with bigBed-formatted binary files can be downloaded -here. Run a utility with no arguments in order to see a brief description of the utility and its options.
Example: retrieve all variants in the region chr1:200001-200400
-bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb -chrom=chr1 -start=200000 -end=200400 stdout
+bigBedToBed http://hgdownload.gi.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb -chrom=chr1 -start=200000 -end=200400 stdout
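Note that the region chr1:200001-200400 is requested with -start=200000: the start passed to bigBedToBed is one less than the 1-based start of the region, while the end is unchanged. A small sketch of that conversion follows; the helper name region_to_args is hypothetical and not part of any UCSC utility.

```shell
# Turn a 1-based "chrom:start-end" region into bigBedToBed range arguments,
# which use a 0-based start and an unchanged end.
region_to_args() {
    chrom=${1%%:*}       # text before the colon
    range=${1#*:}        # text after the colon, e.g. 200001-200400
    start=${range%-*}    # 1-based start
    end=${range#*-}      # end position
    echo "-chrom=$chrom -start=$((start - 1)) -end=$end"
}

region_to_args chr1:200001-200400
```

The printed arguments can be placed between the bigBed URL and the output file name, as in the example command above.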
Example: retrieve variant rs6657048
bigBedNamedItems dbSnp153.bb rs6657048 stdout
Example: retrieve all variants with rs# IDs in file myIds.txt
bigBedNamedItems -nameFile dbSnp153.bb myIds.txt dbSnp153.myIds.bed
The columns in the bigDbSnp/bigBed files and dbSnp153Details.tab.gz file are described in bigDbSnp.as and dbSnpDetails.as respectively.
UCSC has an API
@@ -1149,51 +1149,51 @@
Currently, the Table Browser option to return data in
GTF format is limited as explained below.
To convert custom GenePred format data into GTF, the best method is to use the
command-line format conversion utility, genePredToGtf. This can optionally be set up
to automatically connect to the UCSC public SQL database and return GTF files in a few minutes using
this short guide.
For simplicity, GTF files have been generated using the genePredToGtf method
described above and are available on our download server for the main gene transcript sets.
These can be found at the following download server address:
-http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/
+http://hgdownload.gi.ucsc.edu/goldenPath/$db/bigZips/genes/
where $db is the assembly of interest. For example, the hg38 GTF files.
Summary of Table Browser limitations:
GenePred (short for Gene Predictions) is a table
format commonly used for gene tracks in the UCSC Genome Browser where each transcript has a single
row. Tables are not stored in GTF as it would require many rows to describe a single transcript
since each gene feature (i.e., exon) requires a separate line. The genePredToGtf command-line
utility can be used to convert genePred to GTF. Download the genePredToGtf operating
system-specific command-line utility from the
-utilities directory.
Please see the Genes in GTF
or GFF Format wiki page for examples and various methods for conversion. The genePredToGtf
utility can convert files from several sources, such as Table Browser output from a genePred table,
a local downloaded gene set table like refGene.txt, or from querying
public MariaDB tables.
Most of our tables have a special first column called "bin" that helps with quickly displaying data on the Genome Browser. This (chrom,bin) index causes query results to be ordered first by bin, then by chromStart. This allows us to query and return results more quickly than if they were sorted by chromStart.
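To illustrate how a bin value relates to a feature's coordinates, here is a sketch of the bin calculation. It assumes the standard five-level scheme (smallest bins of 128 kb, each level eight times larger, covering positions up to 512 Mb) with level offsets 585, 73, 9, 1 and 0; these constants are not stated in this FAQ and should be checked against the UCSC source before being relied on. The function name bin_from_range is used here for illustration only.

```shell
# Compute the bin for a 0-based, half-open range [start, end).
# Assumed constants: first shift 17 (128 kb bins), next shift 3 (x8 per level).
bin_from_range() {
    start_bin=$(( $1 >> 17 ))
    end_bin=$(( ($2 - 1) >> 17 ))
    for offset in 585 73 9 1 0; do
        # The feature belongs to the smallest bin that fully contains it.
        if [ "$start_bin" -eq "$end_bin" ]; then
            echo $(( offset + start_bin ))
            return 0
        fi
        start_bin=$(( start_bin >> 3 ))
        end_bin=$(( end_bin >> 3 ))
    done
    echo "range out of bounds for this scheme" >&2
    return 1
}

bin_from_range 0 1000
bin_from_range 11166587 11191615
```

Small features that fall inside one 128 kb window get a bin at the finest level, while features that span a window boundary are assigned to a larger bin one or more levels up.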
@@ -1244,58 +1244,58 @@
Though not all analysis sets contain the same information, features include:
For more information on analysis sets, see the NCBI FAQ. Information on what is contained in each specific assembly analysis set can be found in the README by clicking the Genome sequence files link for the assembly of interest in our
-Downloads page.
+Downloads page.
For 2000+ GenArk genomes, we visualize them in assembly hubs instead of native assemblies like hg38 and mm39. These Genome Browsers can be accessed from our Genomes page by searching common name or GCA/GCF number. You can also access the browsers for these species directly with links in the following format:
https://genome.ucsc.edu/h/GCF_000951035.1
The downloads data for these assemblies is stored in a different location than our goldenPath, SQL, or gbdb file directories. There are two ways to access this data for download. First, you can go to the
-GenArk page
+GenArk page
and select your clade (primates, mammals, birds, etc.) and then you will be brought to a page with a table of species and GCA/GCF assembly identifiers. Find your genome and click on the third column, labeled "Scientific name and data download", which will take you to the download directory for that species.
Alternatively, you can enter your GCA/GCF identifier in the URL in groups of three characters, separated by slashes. For example, the identifier "GCA_004027835.1" has data in the following directory:
-https://hgdownload.soe.ucsc.edu/hubs/GCA/004/027/835/
+https://hgdownload.gi.ucsc.edu/hubs/GCA/004/027/835/
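The grouping can be scripted. The following is a minimal sketch, assuming the identifier always has the form prefix, underscore, nine digits, dot, version; the function name genark_url is hypothetical.

```shell
# Build the hubs directory URL for a GCA/GCF identifier by splitting its
# nine digits into three groups of three.
genark_url() {
    acc=$1                    # e.g. GCA_004027835.1
    prefix=${acc%%_*}         # GCA or GCF
    digits=${acc#*_}          # 004027835.1
    digits=${digits%%.*}      # 004027835 (version suffix removed)
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    echo "https://hgdownload.gi.ucsc.edu/hubs/$prefix/$p1/$p2/$p3/"
}

genark_url GCA_004027835.1
```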
The difference in the conservation scores, for both PhastCons and PhyloP, is that the wiggle database format (from which the details page and Table Browser scores are extracted) uses lossy compression that keeps enough resolution to display the pixelated scores in the browser graphic display but does not reconstruct the true original scores. This is why we make the original score files available for download.