bbabbd5d2566d47d923d51dbe350634783455999
mspeir
Sun Oct 26 12:14:52 2025 -0700
change soe to gi, refs #35031

diff --git src/hg/htdocs/FAQ/FAQdownloads.html src/hg/htdocs/FAQ/FAQdownloads.html
index 054468a690e..1abecf9933f 100755
--- src/hg/htdocs/FAQ/FAQdownloads.html
+++ src/hg/htdocs/FAQ/FAQdownloads.html
@@ -62,68 +62,68 @@
Return to FAQ Table of Contents

Downloading sequence and annotation data

How do I obtain the sequence and/or annotation data for a release?

Sequence and annotation data downloads are usually made available within the first week of the release of a new assembly. The download directories are automatically updated nightly to incorporate additions and modifications to the data.

You can download sequence and annotation data using our FTP server, but we recommend using rsync, which can resume an interrupted transfer where it left off when run again. Please see the previous link for examples.

You can also download data from our -Downloads page or our +Downloads page or our DAS server. To download a specific subset of the data or to configure the output format of the data, use the Table Browser. For information on extracting a large set of sequences from an assembly, see Extracting sequence in batch from an assembly.

For more information on using the UCSC DAS server, see Downloading data from the UCSC DAS server.

Another option for querying sequence and annotation data is the REST API. This interface allows for extraction of sequence and annotations from both UCSC assemblies and from hubs.

To quickly download large volumes of data, you can use UDR (UDT Enabled Rsync), which provides much faster download rates. Here is an example using UDR, once installed, to download all the mouse mm9 ENCODE data, which amounts to several terabytes:

-
$ udr rsync -avP hgdownload.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
+
$ udr rsync -avP hgdownload.gi.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
$ udr rsync -avP hgdownload-euro.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/

Optional: download from our secondary download server.

$ udr rsync -avP hgdownload2.soe.ucsc.edu::goldenPath/mm9/encodeDCC/ /my/local/mm9/
Please read more about the new UDR method here.

Metadata tables for GenBank and RefSeq moved to hgFixed database

I can no longer find metadata tables like gbCdnaInfo for an assembly.

As of June 2016, the metadata tables that support the GenBank and RefSeq tracks (RefSeq, Other RefSeq, mRNA, EST, etc.) have been moved from the directories of individual assemblies to one global database, hgFixed.

The tables below (previously found per assembly) can now be downloaded from the -hgFixed database:

+hgFixed database:

@@ -164,36 +164,36 @@ The Table Browser, as connected tables and joined fields described when clicking the "data format description " button
  • One of our two public access MariaDB servers in the US and Europe
  • Extracting sequence in batch from an assembly

    I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?

    There are two ways to extract genomic sequence in batch from an assembly:

    A. Download the appropriate fasta files from our -ftp server and extract sequence data using +ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found in -ftp://hgdownload.soe.ucsc.edu/goldenPath/hg17/chromosomes/. +ftp://hgdownload.gi.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information for most programs, run the program without arguments.

    B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.

    1. Create a custom track of the genomic coordinates in BED format and upload into the Genome Browser.
    2. Select the custom track in the Table browser, then select the "sequence" output format to retrieve data. We recommend that you save the file locally as gzip.
    3. @@ -650,85 +650,85 @@ reverse, and cross-species conversions, but does not accept batch input. The LiftOver tool, accessed via the Tools link on the Genome Browser home page, also supports forward, reverse, and cross-species conversions, as well as batch conversions.

      Note: Using LiftOver to convert SNPs between assemblies is not recommended. More information about how to convert SNPs between assemblies can be found in the following FAQ entry.

      If you wish to update a large number of coordinates to a different assembly and have access to a Linux platform, you may find it useful to try the command-line version of the LiftOver tool. The executable file for this utility can be downloaded here. LiftOver requires a pre-generated over.chain file as input, available for selected assemblies from the -Downloads page. If the desired +Downloads page. If the desired file is not available, send a request to the genome mailing list and we may be able to provide you with one.

      Using liftOver

      Here is an example of how to set up and run LiftOver from the command line:

      1. Download the LiftOver program for your computer's operating system here
      2. Change permissions on that file so that it can be executed
         chmod +x liftOver
      3. Run the program with no arguments to see the usage statement
         ./liftOver
         liftOver - Move annotations from one assembly to another
         usage:
            liftOver oldFile map.chain newFile unMapped
         ...
      4. Download your genome conversion chain file from the - downloads directory. + downloads directory. For example, the human to mouse conversion (hg38ToMm10) can be downloaded like so: -
        wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToMm10.over.chain.gz
        +
        wget http://hgdownload.gi.ucsc.edu/goldenPath/hg38/liftOver/hg38ToMm10.over.chain.gz
      5. Prepare your BED file input. Here are a few lines from a BED file that you can copy into a text file and save as "preLift.bed".
         chr1	11166587	11191615	MTOR
         chr9	136130562	136150630	ABO
         chr12	25358179	25403854	KRAS
         chrX	151335633	151619831	GABRA3
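      6. You can now run LiftOver with the following command, which converts the annotations in "preLift.bed" and writes successful conversions to "conversions.bed" and unsuccessful conversions to "unMapped". The chain file must match the assembly of the input coordinates; this command uses hg19ToHg38.over.chain.gz, which is available from the hg19 liftOver downloads directory.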
         ./liftOver preLift.bed hg19ToHg38.over.chain.gz conversions.bed unMapped
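Steps 5 and 6 can be sketched as a small script. This is a minimal sketch, not an official UCSC script: it assumes the liftOver binary and the hg19ToHg38.over.chain.gz chain file sit in the current directory, and it skips the conversion if either is missing. It also assumes that comment lines in "unMapped" start with "#".

```shell
#!/bin/sh
# Write the four example BED records with real tab separators.
printf 'chr1\t11166587\t11191615\tMTOR\n'   >  preLift.bed
printf 'chr9\t136130562\t136150630\tABO\n'  >> preLift.bed
printf 'chr12\t25358179\t25403854\tKRAS\n'  >> preLift.bed
printf 'chrX\t151335633\t151619831\tGABRA3\n' >> preLift.bed

# Assumed file locations; adjust to wherever you saved the downloads.
chain=hg19ToHg38.over.chain.gz

if [ -x ./liftOver ] && [ -f "$chain" ]; then
    ./liftOver preLift.bed "$chain" conversions.bed unMapped
    # Count converted records and records that failed to convert.
    echo "converted: $(wc -l < conversions.bed)"
    echo "unmapped:  $(grep -vc '^#' unMapped)"
else
    echo "liftOver or $chain not found in this directory; skipping conversion"
fi
```

Writing the file with printf avoids the common problem of an editor replacing the tabs with spaces.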

      Linking gene name with accession number

      I have the accession number for a gene and would like to link it to the gene name. Is there a table that shows both pieces of information?

      If you are looking at the RefSeq Genes, the refFlat table contains both the gene name (usually a HUGO Gene Nomenclature Committee ID) and its accession number. For the Known Genes, use the kgAlias table.

      Obtaining a list of Known Genes

      How can I obtain a complete list of all the genes in the UCSC Known Genes table for a particular organism?

      To obtain a complete copy of the entire Known Genes data set for an organism, open the Genome -Browser Downloads page, jump to the +Browser Downloads page, jump to the section specific to the organism, click the Annotation database link in that section, then click the link for the knownGene.txt.gz table.

      Data for a specific region or chromosome may be obtained from the Table Browser by selecting the "Genes and Gene Prediction Tracks" group, the "UCSC Genes" track and the "knownGene" table. Set the position to the region of interest, then click the "get output" button.

      Filtering for a transcription factor in the JASPAR database

      How do I display only one transcription factor?
      @@ -784,31 +784,31 @@
      JASPAR Table Browser

    Repeat-masking data

    What version of RepeatMasker do you use on your data? Which flags do you use?

    UCSC uses the latest versions of RepeatMasker and repeat libraries available on the date when the assembly data is processed. RepeatMasker version information can usually be found in the README text -for the assembly's bigZips downloads +for the assembly's bigZips downloads directory.

    Masking is done using the RepeatMasker -s flag. For mouse repeats, we also use -m. In addition to RepeatMasker, we use the Tandem Repeat Finder (trf) program, masking out repeats of period 12 or less. The repeats are just "soft" masked. Alignments are allowed to extend through repeats, but not initiate in them.

    Availability of repeat-masked data

    Are the repeat annotation files available for every chromosome?

    Yes, you can obtain the repeat-masked files via the Table Browser or from the organism's annotation database downloads directory. The RepeatMasker annotation tables are named chrN_rmsk (where N represents the chromosome number) and the Tandem Repeat Finder (TRF) tables are named simpleRepeat.

    @@ -817,31 +817,31 @@

    RepeatMasker version differences - UCSC vs. RepeatMasker website

    When I run RepeatMasker independently from the RepeatMasker web server, my results vary from those of UCSC. What's the cause?

    UCSC occasionally uses updated versions of the RepeatMasker software and repeat libraries that are not yet available on the RepeatMasker website (see Repeat-masking data for more information).

    Obtaining promoter sequence

    How can I fetch promoter sequence upstream of a gene?

    The UCSC Genome Browser offers several ways to obtain this information, depending on your requirements.

    -The Genome Browser downloads site +The Genome Browser downloads site provides prepackaged downloads of 1000 bp, 2000 bp, and 5000 bp upstream sequence for RefSeq genes that have a coding portion and annotated 5' and 3' UTRs. You can obtain these from the bigZips downloads directory for the assembly of interest.

    To fetch the upstream sequence for a specific gene, use the Table Browser. Enter the genome, assembly, and select the knownGene table. Paste the gene name or accession number in the identifier field. Choose sequence for the output format type, then click the get output button. On the next page, select genomic. On the final page, you will have the opportunity to configure the amount of upstream promoter sequence to fetch, along with several other options. Click Get Sequence when you've finished configuring the output.

    You can also use the Genome Browser to obtain sequence for a specific gene. Open the Genome Browser window to display the gene in which you're interested. Click the entry for the gene in the RefSeq or Known Genes track, then click the Genomic Sequence link. Alternatively, you can click the DNA link in the top menu bar of the Genome Browser tracks window to access options for displaying the @@ -898,31 +898,31 @@ contains the physical position of all STS markers, including those on the deCODE map. This file also contains information about the position on the genome-wide maps, including the deCODE map. A second file, stsInfo2, contains additional information about each marker, including aliases, primer sequence information, etc. This table is related to the first table by an ID (the identNo field in both files).

    Direct MariaDB (MySQL) access to data

    Is it possible to run SQL queries directly on the database rather than using the Table Browser interface?

    Yes. See our documentation on Downloading Data using MariaDB (MySQL).

    Connect to the US MariaDB server using the command:

    -
    mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A 
    +
    mysql --user=genome --host=genome-mysql.gi.ucsc.edu -A 

    Or to the European MariaDB server using the command:

    mysql --user=genome --host=genome-euro-mysql.soe.ucsc.edu -A 

    Name of fourth column in BED output

    When using the Table Browser to extract exons from a Gene track, what does the "Name" column (fourth BED column) refer to?

    The fourth column of the BED output contains a lot of information separated by underscores. For example:

    uc009vjk.2_cds_1_0_chr1_324343_f 

    This information is represented as follows:

    ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand
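The name can be split back into its parts on the command line. The following is a minimal sketch: parse_bed_name is a hypothetical helper, not a UCSC utility, and it assumes the ID itself contains no underscore. Because chromosome names may contain underscores, it reads the first four fields from the left and the last two from the right, and joins whatever remains as the chromosome.

```shell
#!/bin/sh
# Split a Table Browser exon name into tab-separated fields:
# id, sequenceType, sequenceTypeNumber, basesAdded, chromosome, position, strand.
# Assumption: the ID (first field) has no underscore in it.
parse_bed_name() {
    printf '%s\n' "$1" | awk -F_ '
        NF < 7 { print "too few fields: " $0 > "/dev/stderr"; exit 1 }
        {
            # Rejoin the chromosome in case its name contains underscores.
            chrom = $5
            for (i = 6; i <= NF - 2; i++) chrom = chrom "_" $i
            printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, chrom, $(NF-1), $NF
        }'
}

parse_bed_name uc009vjk.2_cds_1_0_chr1_324343_f
```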

    Track Data Access

    How do I access the data underlying a track?

    The raw data underlying a track can be explored interactively with the Table Browser, Data Integrator, or Variant Annotation Integrator. For automated analysis, the genome annotation can be downloaded from the -downloads server, one of our two +downloads server, one of our two public MariaDB servers, or using our REST API.

    bigBed data: For bigBed files, individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found -here. The tool can +here. The tool can also be used to obtain only features within a given range using one of the hgdownload servers, example:

    Read more in our blog about Accessing the Genome Browser Programmatically to acquire data.

    How do I download dbSNP data?

    For versions dbSNP153 and above, the data is formatted in bigBed files. Previous versions are MySQL tables. For help with versions before dbSNP153, see accessing MySQL data. This FAQ entry pertains to versions dbSNP153 and above.

    Since dbSNP has grown to include over 700 million variants, the size of the All dbSNP (153+) subtrack can cause the Table Browser and Data Integrator to time out, leading to a blank page or truncated output, unless queries are restricted to a chromosomal region or to a specific set of rs# IDs (which can be pasted/uploaded into the Table Browser), or to one of the subset tracks such as Common or ClinVar.

    For automated analysis, the track data files can be downloaded from the downloads server for -hg19 and -hg38. Below +hg19 and +hg38. Below are specific examples for dbSNP153; the same methods and directories will work when substituting a more recent dbSNP release.
    file                               format                              subtrack
    dbSnp153.bb (hg19, hg38)           bigDbSnp (bigBed4+13)               All dbSNP (153)
    dbSnp153ClinVar.bb (hg19, hg38)    bigDbSnp (bigBed4+13)               ClinVar dbSNP (153)
    dbSnp153Common.bb (hg19, hg38)     bigDbSnp (bigBed4+13)               Common dbSNP (153)
    dbSnp153Mult.bb (hg19, hg38)       bigDbSnp (bigBed4+13)               Mult. dbSNP (153)
    dbSnp153BadCoords.bb (hg19, hg38)  bigBed4                             Map Err (153)
    dbSnp153Details.tab.gz             gzip-compressed tab-separated text  Detailed variant properties, independent of genome assembly version

    Several utilities for working with bigBed-formatted binary files can be downloaded -here. Run a utility with no arguments in order to see a brief description of the utility and its options.

    Example: retrieve all variants in the region chr1:200001-200400

    -
    bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb -chrom=chr1 -start=200000 -end=200400 stdout
    +
    bigBedToBed http://hgdownload.gi.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb -chrom=chr1 -start=200000 -end=200400 stdout
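Note that the command above passes -start=200000 for a region that begins at base 200001: the start is zero-based, as in BED, while the end is unchanged. The conversion can be sketched as follows. region_to_args is a hypothetical helper name, and it assumes a position written as chrom:start-end without thousands separators.

```shell
#!/bin/sh
# Turn a 1-based browser position such as chr1:200001-200400 into the
# -chrom/-start/-end options used in the bigBedToBed example.
region_to_args() {
    region=$1
    case $region in
        *:*-*) ;;
        *) echo "expected chrom:start-end, got: $region" >&2; return 1 ;;
    esac
    chrom=${region%%:*}     # text before the colon
    range=${region#*:}      # start-end
    start=${range%%-*}
    end=${range#*-}
    # Subtract 1 from the start only; the end stays as written.
    printf '%s\n' "-chrom=$chrom -start=$((start - 1)) -end=$end"
}

region_to_args chr1:200001-200400

# Possible use, with word splitting of the helper output intended:
#   bigBedToBed http://hgdownload.gi.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb \
#       $(region_to_args chr1:200001-200400) stdout
```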

    Example: retrieve variant rs6657048

    bigBedNamedItems dbSnp153.bb rs6657048 stdout

    Example: retrieve all variants with rs# IDs in file myIds.txt

    bigBedNamedItems -nameFile dbSnp153.bb myIds.txt dbSnp153.myIds.bed

    The columns in the bigDbSnp/bigBed files and dbSnp153Details.tab.gz file are described in bigDbSnp.as and dbSnpDetails.as respectively.

    UCSC has an API @@ -1149,51 +1149,51 @@

    Obtaining GTF (Gene Transfer Format)

    What is the best method for obtaining GTF output?

    Currently, the Table Browser option to return data in GTF format is limited, as explained below. To convert custom GenePred format data into GTF, the best method is to use the command-line format conversion utility, genePredToGtf. This can optionally be set up to automatically connect to the UCSC public SQL database and return GTF files in a few minutes using this short guide.

    For simplicity, GTF files have been generated using the genePredToGtf method described above and are available on our download server for the main gene transcript sets. These can be found at the following download server address: -http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/ +http://hgdownload.gi.ucsc.edu/goldenPath/$db/bigZips/genes/ where $db is the assembly of interest. For example, the hg38 GTF files.

    +href="http://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files.

    Summary of Table Browser limitations:

    GenePred (short for Gene Predictions) is a table format commonly used for gene tracks in the UCSC Genome Browser where each transcript has a single row. Tables are not stored in GTF because each gene feature (e.g., an exon) requires a separate line, so a single transcript would take many rows. The genePredToGtf command-line utility can be used to convert genePred to GTF. Download the genePredToGtf command-line utility for your operating system from the -utilities directory.

    +utilities directory.

    Please see the Genes in GTF or GFF Format wiki page for examples and various methods for conversion. The genePredToGtf utility can convert files from several sources, such as Table Browser output from a genePred table, a local downloaded gene set table like refGene.txt, or from querying public MariaDB tables.

    Table Browser output file order

    My table browser output file is not ordered by position, how is it ordered?

    Most of our tables have a special first column called "bin" that helps with quickly displaying data on the Genome Browser. This (chrom,bin) index causes query results to be ordered first by bin, then by chromStart. This allows us to query and return results more quickly than if they were sorted by chromStart.
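If you need the output in position order, one option is to sort the saved file yourself. The sketch below assumes BED-style output with the chromosome in column 1 and chromStart in column 2; if your output begins with the "bin" column or uses another layout, adjust the key columns accordingly. sort_by_position is a hypothetical helper name.

```shell
#!/bin/sh
# Sort tab-separated output by chromosome name, then numerically by start.
# LC_ALL=C makes the ordering the same on every system.
sort_by_position() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$@"
}

# Example with four records in arbitrary order.
printf 'chr1\t5000\t6000\tb\nchr1\t100\t200\ta\nchr10\t50\t60\td\nchr2\t10\t20\tc\n' |
    sort_by_position
```

Chromosome names are compared as text here, so chr10 sorts before chr2.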

    @@ -1244,58 +1244,58 @@

    Though not all analysis sets contain the same information, features include:

    For more information on analysis sets, see the NCBI FAQ. Information on what is contained in each specific assembly analysis set can be found in the README by clicking the Genome sequence files link for the assembly of interest in our -Downloads page. +Downloads page.

    GenArk Downloads

    How do I download GenArk assembly hub data for my species?

    We display the 2000+ GenArk genomes as assembly hubs rather than as native assemblies like hg38 and mm39. These Genome Browsers can be accessed from our Genomes page by searching for the common name or GCA/GCF number. You can also access the browsers for these species directly with links in the following format:

    https://genome.ucsc.edu/h/GCF_000951035.1

    The downloads data for these assemblies is stored in a different location than our goldenPath, SQL, or gbdb file directories. There are two ways to access this data for download. First, you can go to the -GenArk page +GenArk page and select your clade (primates, mammals, birds, etc.) and then you will be brought to a page with a table of species and GCA/GCF assembly identifiers. Find your genome and click on the third column, labeled "Scientific name and data download", which will take you to the download directory for that species.

    Alternatively, you can enter the nine digits of your GCA/GCF identifier in the URL in groups of three, separated by slashes and omitting the version suffix. For example, the identifier "GCA_004027835.1" has data in the following directory: -

    https://hgdownload.soe.ucsc.edu/hubs/GCA/004/027/835/
    +
    https://hgdownload.gi.ucsc.edu/hubs/GCA/004/027/835/
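The directory name can be derived from the identifier with a few lines of shell. This is a minimal sketch that follows the pattern in the example above; genark_url is a hypothetical helper name, and it assumes identifiers of the form GCA_ or GCF_ followed by nine digits and a version suffix.

```shell
#!/bin/sh
# Build the GenArk download directory URL for a GCA/GCF identifier by
# splitting its nine digits into three groups of three.
genark_url() {
    acc=$1                    # e.g. GCA_004027835.1
    prefix=${acc%%_*}         # GCA or GCF
    digits=${acc#*_}          # 004027835.1
    digits=${digits%%.*}      # 004027835 (version suffix dropped)
    case $prefix in
        GCA|GCF) ;;
        *) echo "expected a GCA_ or GCF_ identifier: $acc" >&2; return 1 ;;
    esac
    case $digits in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected nine digits in: $acc" >&2; return 1 ;;
    esac
    d1=$(printf '%s' "$digits" | cut -c1-3)
    d2=$(printf '%s' "$digits" | cut -c4-6)
    d3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://hgdownload.gi.ucsc.edu/hubs/%s/%s/%s/%s/\n' "$prefix" "$d1" "$d2" "$d3"
}

genark_url GCA_004027835.1
```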

    Conservation scores downloads

    Why are the conservation scores on the UCSC Genome Browser site different from the ones in the download file?

    The difference in the conservation scores, for both PhastCons and PhyloP, is that the wiggle database format (from which the details page and Table Browser scores are extracted) uses lossy compression that keeps enough resolution to display the pixelated scores in the browser graphic display but does not reconstruct the true original scores. This is why we make the original score files available for download.