5295d58d808a9ef30d0514558d943b7dbf4bd4f7 gperez2 Tue Sep 2 16:03:33 2025 -0700 Updating the GB page for the Assembly Hub Wiki, refs #34740 diff --git src/hg/htdocs/goldenPath/help/assemblyHubHelp.html src/hg/htdocs/goldenPath/help/assemblyHubHelp.html new file mode 100755 index 00000000000..ce79fca0e8c --- /dev/null +++ src/hg/htdocs/goldenPath/help/assemblyHubHelp.html @@ -0,0 +1,848 @@ + + + + + + + +

Assembly Hub User Guide

+ + +

Overview

+

+An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence +to be used for a browser instance, as well as all the data files that define the annotation for +that sequence. Assembly Data Hubs allow researchers to use the UCSC Genome Browser to view their +own sequences with associated annotation, without the requirement that UCSC support a browser on that sequence. +

+ +

+Note: if you are working with a genome that has already been submitted +to the NCBI Assembly system, it may +already be available in the UCSC Genome Browser. +Please check the GenArk Assembly Hub collection +to see if your genome of interest is already available. If it is not listed there, you can use the +UCSC Assembly Request page to request that the genome assembly be +added.

+ + +

Contents

+
Web Server
+
Assembly Hub Components
+ +
Linking to Your Assembly Hub
+
Building Tracks
+ +
Assembly Hub Resources
+ + + +
Adding BLAT Servers
+ + + + + +

Web Server

+

+To display a novel genome sequence in the UCSC Genome Browser, you need a web server, either one hosted by your +institution or a free service such as CyVerse. +For environments operating behind a firewall, hub files can also be loaded locally +through GBiB to provide access to the UCSC Genome +Browser. Hosting hub files over HTTP is strongly recommended, as it is +significantly more efficient than FTP. A hierarchical directory structure must then be +established to organize the files associated with the genome sequence. For example: +

+ +
+myHub/ - directory to organize your files on this hub
+    hub.txt - primary reference text file to define the hub, refers to:
+    genomes.txt - definitions for each genome assembly on this hub
+        newOrg1/ - directory of files for this specific genome assembly
+            newOrg1.2bit - '2bit' file constructed from your fasta sequence
+            description.html - information about this assembly for users
+            trackDb.txt - definitions for tracks on this genome assembly
+            groups.txt - definitions for track groups on this assembly
+            bigWig and bigBed files - data for tracks on this assembly
+            external track hub data tracks
+
+

+The hub can be referenced by a URL such as: http://yourLab.yourInstitution.edu/myHub/hub.txt
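The layout above can be sketched as a short shell script. This is only an illustration: the `HUB_PARENT` variable is a hypothetical name introduced here, the files are created as empty placeholders, and the `.2bit` and big* data files still have to be built separately.

```shell
#!/bin/sh
# Sketch: create the example hub layout in a scratch directory.
set -eu

# HUB_PARENT is a hypothetical variable; it defaults to a fresh temp directory.
hubParent="${HUB_PARENT:-$(mktemp -d)}"
hubRoot="$hubParent/myHub"

# One subdirectory per genome assembly.
mkdir -p "$hubRoot/newOrg1"

# Top-level definition files (empty placeholders to be filled in later).
touch "$hubRoot/hub.txt" "$hubRoot/genomes.txt"

# Per-assembly files; the .2bit and bigWig/bigBed files are built separately.
touch "$hubRoot/newOrg1/description.html" \
      "$hubRoot/newOrg1/trackDb.txt" \
      "$hubRoot/newOrg1/groups.txt"

# List the resulting tree relative to the parent directory.
(cd "$hubParent" && find myHub | sort)
```

Copying the resulting `myHub` directory under the web server's document root would make `hub.txt` reachable at a URL of the form shown above.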

+ +

Assembly Hub Components

+ + + +

hub.txt

+

+The initial file, hub.txt, is the primary URL reference for the assembly hub:

+

Format of the file:

+
+hub hubName
+shortLabel genome
+longLabel Comment describing this hub contents
+genomesFile genomes.txt
+email contactEmail@institution.edu
+descriptionUrl aboutHub.html
+
+

+shortLabel is the name that will appear in the genome pull-down menu at the +UCSC gateway page.

+

+genomesFile is a reference to the next definition file in this chain that will +describe the assemblies and tracks available at this hub. Typically, genomes.txt is at +the same directory level as this hub.txt; however, it can also be a relative path +reference to a different directory level.

+

+email provides users with a contact point for questions related to this assembly hub.

+

+descriptionUrl specifies a relative path or URL link to a webpage describing the hub.

+

+You can view a working example at hub.txt

+ + +

genomes.txt

+

The genomes.txt file provides references to the genome assemblies and tracks available in +the assembly hub.

+
+genome ricCom1
+trackDb ricCom1/trackDb.txt
+groups ricCom1/groups.txt
+description July 2011 Castor bean
+twoBitPath ricCom1/ricCom1.2bit
+organism Ricinus communis
+defaultPos E09R7372:1000000-2000000
+orderKey 4800
+scientificName Ricinus communis
+htmlPath ricCom1/description.html
+transBlat yourLab.yourInstitution.edu 17777
+blat yourLab.yourInstitution.edu 17779
+isPcr yourLab.yourInstitution.edu 17779
+
+

+Multiple assembly definitions can be included in a single file, separated by blank lines. The file +references are relative paths. In this example, the subdirectory ricCom1 contains +the files for this specific assembly.

+ +

Note: it is strongly recommended that each genome stanza include defaultPos, +scientificName, organism, and description, so that the hub loads with +meaningful defaults and can be more easily searched from the Gateway page.
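A quick way to check for the recommended settings is to scan each stanza with awk. The sketch below is not a UCSC tool; it writes a small sample genomes.txt (the `newOrg1` stanza and its values are made up for illustration) and reports which of the four recommended settings each stanza lacks.

```shell
#!/bin/sh
# Sketch: report recommended genomes.txt settings that a stanza is missing.
set -eu

workDir="$(mktemp -d)"
genomesFile="$workDir/genomes.txt"

# Sample input: the first stanza is complete, the second is deliberately sparse.
cat > "$genomesFile" <<'EOF'
genome ricCom1
trackDb ricCom1/trackDb.txt
twoBitPath ricCom1/ricCom1.2bit
description July 2011 Castor bean
organism Ricinus communis
defaultPos E09R7372:1000000-2000000
scientificName Ricinus communis

genome newOrg1
trackDb newOrg1/trackDb.txt
twoBitPath newOrg1/newOrg1.2bit
organism New organism
EOF

# RS="" puts awk in paragraph mode: each blank-line-separated stanza is one
# record, and with FS="\n" each line of the stanza is one field.
missing="$(awk 'BEGIN {
    RS = ""; FS = "\n"
    n = split("defaultPos scientificName organism description", want, " ")
}
{
    name = ""
    split("", seen)                      # clear the per-stanza lookup table
    for (i = 1; i <= NF; i++) {
        split($i, kv, " ")               # kv[1] is the setting name
        seen[kv[1]] = 1
        if (kv[1] == "genome") name = kv[2]
    }
    for (j = 1; j <= n; j++)
        if (!(want[j] in seen)) print name ": missing " want[j]
}' "$genomesFile")"

printf '%s\n' "$missing"
```

Pointing `genomesFile` at a real genomes.txt instead of the sample runs the same check on an existing hub.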

+ + +

2bit File

+

+The .2bit file is constructed from the FASTA sequence for the assembly using the +faToTwoBit kent program (available from the +downloads page).

+

Example:

+
+faToTwoBit ricCom1.fa ricCom1.2bit
+
+

+Use twoBitInfo to verify sequences and create a chrom.sizes file, +which is not used in the hub itself but is helpful for constructing big* files: +

+
+twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes
+
+
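Before building big* files, it can help to sanity-check the chrom.sizes file. The sketch below assumes the two-column, tab-separated name and size layout that twoBitInfo writes; the sequence names and sizes are made up for illustration.

```shell
#!/bin/sh
# Sketch: summarize a chrom.sizes file and flag obvious problems.
set -eu

workDir="$(mktemp -d)"
sizes="$workDir/example.chrom.sizes"

# Made-up sequence names and sizes, already sorted largest first.
printf 'chr1\t3000\nchr2\t2000\nchrCp\t500\n' > "$sizes"

summary="$(awk -F '\t' '
    NF != 2 || $2 !~ /^[0-9]+$/ { bad++ }        # wrong column count or non-numeric size
    NR > 1 && $2 + 0 > prev     { unsorted++ }   # larger than the line before it
    { total += $2; prev = $2 + 0 }
    END {
        printf "sequences=%d total=%d malformed=%d outOfOrder=%d\n", NR, total, bad, unsorted
    }
' "$sizes")"

echo "$summary"
```

A nonzero `malformed` count usually points to stray whitespace or a header line, and a nonzero `outOfOrder` count means the file is not sorted by descending size.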

+The .2bit file can also be hosted at a URL:

+
+twoBitInfo -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
+
+

+To extract sequences from a .2bit file: +

+
+twoBitToFa -seq=chrCp -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
+
+ + +

groups.txt

+

The groups.txt file defines the grouping of track controls under the Genome Browser graphic +display.

+

Example:

+
+name map
+label Mapping
+priority 2
+defaultIsClosed 0
+
+ + +

Refer to the Adding Groups to a Track hub section of the Track Hubs help page for more +details.

+ + +

Single-File Track Hub (useOneFile on)

+

+Traditionally, an assembly hub required multiple configuration files (hub.txt, +genomes.txt, trackDb.txt, and optionally groups.txt), along +with a .2bit file for the sequence. The useOneFile on option simplifies +this by consolidating everything into a single configuration file. Note: The single-file +format supports one genome assembly per file. For multiple assemblies, use the traditional +multi-file setup.

+

Example configuration:

+
+hub mySingleFileHub
+shortLabel My Single-File Hub
+longLabel An example of a single-file UCSC track hub
+useOneFile on
+email myEmail@example.com
+
+genome hg19
+
+track exampleBigWig
+shortLabel BigWig Coverage
+longLabel Coverage data over hg19
+type bigWig
+visibility full
+bigDataUrl http://myServer.com/data/example.bigWig
+
+track exampleVCF
+shortLabel VCF Variants
+longLabel Variant calls over hg19 region
+type vcfTabix
+visibility pack
+bigDataUrl http://myServer.com/data/example.vcf.gz
+
+ + + +

+If your hub requires a reference genome sequence, you can still provide a .2bit file +with twoBitPath. Grouping (previously in +groups.txt) can also be integrated here if needed. +

+ +

+Once hosted on a server, the single configuration file (and associated data files such as +.bigWig, .vcf.gz, .2bit) can be loaded into the UCSC Genome +Browser via the My Hubs page.

+ + +

Building Tracks

+

Tracks are defined in the trackDb.txt file, where each stanza specifies how +tracks are displayed (shortLabel, longLabel, color, visibility), along with other information such +as the group the track belongs to (referencing groups.txt) and whether +additional HTML should be displayed when a user clicks into the track or a track item:

+
+track gap_
+longLabel Gap
+shortLabel Gap
+priority 11
+visibility dense
+color 0,0,0
+bigDataUrl bbi/ricCom1.gap.bb
+type bigBed 4
+group map
+html ../trackDescriptions/gap
+
+

+For more information about the syntax of the trackDb.txt file, refer to the +Track Database Definition page. +

+

Processing genomes to construct tracks often requires a cluster or supercomputer. Small +genomes can be processed on single computers with multiple cores. The process for each track is +unique. For details, refer to the + + Browser Track Construction page, which discusses constructing tracks for assembly +hubs.

+ + +

Cytoband Track

+

+Assembly hubs can include a Cytoband track, which allows quicker navigation of chromosomes and +displays banding pattern information, if known.

+

+A simple version of the track can be built using the existing chrom.sizes file for your assembly. +Banding options include: gneg, gpos25, + gpos50, gpos75, gpos100, acen, gvar, or stalk.

+

Example:

+
+cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
+
+
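The same idea can be written so that the output is explicitly tab-separated and the column count is checked before conversion. This is a sketch only: the file name `example.chrom.sizes` and the sizes in it are illustrative stand-ins, not taken from the assembly above.

```shell
#!/bin/sh
# Sketch: build a single-band-per-sequence cytoBandIdeo.bed and verify its shape.
set -eu

workDir="$(mktemp -d)"

# Illustrative chrom.sizes content (name <tab> size), deliberately unsorted.
printf 'chr2\t19698289\nchr1\t30427671\nchrC\t154478\n' > "$workDir/example.chrom.sizes"

# Sort by sequence name in the C locale, then emit five tab-separated columns:
# chrom, start (0), end (sequence length), band name, stain type.
LC_ALL=C sort -k1,1 "$workDir/example.chrom.sizes" |
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2, $1, "gneg" }' \
    > "$workDir/cytoBandIdeo.bed"

cat "$workDir/cytoBandIdeo.bed"

# Every line should have exactly five fields.
fieldCounts="$(awk -F '\t' '{ print NF }' "$workDir/cytoBandIdeo.bed" | sort -u)"
echo "distinct field counts: $fieldCounts"
```

Each sequence becomes one `gneg` band spanning its full length, which is enough for the ideogram to draw when no real banding data exists.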

+The resulting BED file can be converted into a BigBed file and associated with an .as +definition file (see +example) +to inform the browser that this is not a standard BED:

+
+bedToBigBed -type=bed4+1 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed
+
+

+In trackDb.txt, if the track is named cytoBandIdeo (e.g., +track cytoBandIdeo), it will automatically load into the assembly +hub.

+ + +

Linking to Your Assembly Hub

+

+Direct links to the genome(s) within the assembly hub can then be constructed.

+ + + + + +

Assembly Hub Resources

+

+Resources for automatically building assembly hubs include G-OnRamp and MakeHub.

+ + +

G-OnRamp

+

+G-OnRamp is a Galaxy workflow that turns a genome assembly and RNA-Seq data into a Genome Browser +with multiple evidence tracks. Since G-OnRamp is based on the Galaxy platform, becoming familiar +with Galaxy concepts and functionalities is recommended. See their +instruction page +for an overview. +

+ + +

MakeHub

+

+MakeHub is a command-line tool for fully automatic generation of track data hubs for visualizing +genomes with the UCSC Genome Browser. More information is available on their +GitHub page.

+ + +

Example NCBI assembly hubs

+

+There is a collection of example NCBI assembly hubs that can be used directly or copied as +templates. A large collection of script-generated assembly hubs can be browsed on the development server, with +links defaulting to the genome-test site. To load these hubs on the public UCSC site, copy +the hub.txt link and replace the test server domain with the public domain.

+

+The following table provides links to launch various assembly hubs grouped by species subsets. By +scrolling down each page, you can access rows for individual assemblies (or groups of assemblies, +e.g., bacteria). Clicking the "common name" hyperlink (e.g., "African bush +elephant" on the Vertebrate Mammalian page) loads the selected hub.

+
+ + +

These assemblies use NCBI accession naming patterns. Prototype gene tracks from NCBI gene +predictions are available for a few assemblies. No BLAT servers are provided. Users can copy the +skeleton structure of a hub to run their own BLAT server locally. Brief instructions are available +on each assembly gateway page under "Download files for this assembly hub." + + +

Example: Loading the African bush elephant assembly hub and reviewing the related genomes.txt + and trackDb.txt

+

+Here are some quick steps to load an example hub from this collection, along with an explanation +of how to view the files behind the hub.

+
    +
  1. Click the + Vertebrate Mammalian assembly hub link above.
    +
  2. Scroll down to the common name column and click the hyperlink for + "African bush elephant".
    +
  3. You will arrive at a gateway page titled "African bush elephant Genome Browser - + GCA_000001905.1_Loxafr3.0 assembly". This page includes a section, + Data file downloads, where you can access the underlying + files.
    +
  4. Click Go (or use the top Genome Browser blue bar menu) to view this assembly hub. + (Note: this will open on our genome-test site.)
    +
  5. To load this hub on our public site, copy the hyperlink for + African bush elephant and paste it into your browser. + Then, change the beginning of the URL from
    +
    +https://genome-test.gi.ucsc.edu/...
    +
    + to +
    +https://genome.ucsc.edu/...
    +
    +
+
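The domain swap in the last step can also be done on the command line. In this sketch the URL path and the `example.org` hub address are placeholders; only the two domain names come from the steps above.

```shell
#!/bin/sh
# Sketch: rewrite a genome-test link so that it opens on the public site.
set -eu

# Placeholder link; substitute the hyperlink copied from the hub page.
testUrl='https://genome-test.gi.ucsc.edu/cgi-bin/hgGateway?hubUrl=https://example.org/myHub/hub.txt'

# Replace only the leading scheme and host, leaving the rest of the URL intact.
publicUrl="$(printf '%s\n' "$testUrl" |
    sed 's#^https://genome-test\.gi\.ucsc\.edu/#https://genome.ucsc.edu/#')"

echo "$publicUrl"
```

Anchoring the pattern at the start of the string keeps a hubUrl parameter later in the link from being rewritten by accident.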

Exploring the files behind the hub

+

+To better understand how the hub works, you can review the associated files:

+
    +
  1. Go to the GCA_000001905.1_Loxafr3.0 directory + link.
    +
  2. Locate the file GCA_000001905.1_Loxafr3.0.ncbi.2bit. This binary indexed file allows + the Browser to display the genome sequence.
    +
  3. Open GCA_000001905.1_Loxafr3.0.genomes.ncbi.txt. This genomes.txt file + defines each assembly in the hub. It points to the genome's .2bit file + (twoBitPath) and specifies the trackDb file that contains the + track definitions. (In the case of this large hub with 204 assemblies, the main + genomes.txt file is one directory up, and this stanza is included there.)
    +
  4. Review GCA_000001905.1_Loxafr3.0.trackDb.ncbi.txt. This trackDb.txt + file defines the tracks displayed in the hub. It contains bigDataUrl lines + that tell the Browser where to retrieve data for each track, along with optional + settings such as:
    + +
+ + +

Adding BLAT servers

+

BLAT servers (gfServer) can be configured as either dedicated or +dynamic:

+ + + + +

Configuring assembly hubs to use a dedicated gfServer

+

+When running a local BLAT server, assembly hubs can be configured to support BLAT searches by +adding entries to the + genomes.txt file.

+

+Installation and configuration details for gfServer are provided in the +Running your own gfServer +page.

+

+In the genomes.txt stanza for the target assembly, include the following lines (note +the capital B in transBlat):

+
+transBlat yourServer.yourInstitution.edu 17777
+blat yourServer.yourInstitution.edu 17779
+isPcr yourServer.yourInstitution.edu 17779
+
+

With this configuration, BLAT and PCR searches become available for the assembly. +For example:

+
+http://genome.ucsc.edu/cgi-bin/hgBlat?hubUrl=http://yourServer.yourInstitution.edu/myHub/hub.txt
+
+

+This URL opens the BLAT interface, where the assembly will appear in the Genome drop-down menu. +The isPcr line enables the use of a different gfServer instance for PCR queries if +desired.
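Because the nucleotide and translated servers are separate gfServer instances, the blat and transBlat lines of a stanza should not name the same host and port. The awk sketch below flags stanzas where they do; the stanza names `goodOrg` and `badOrg` are made up for the sample input.

```shell
#!/bin/sh
# Sketch: flag genomes.txt stanzas whose blat and transBlat lines share host:port.
set -eu

workDir="$(mktemp -d)"

cat > "$workDir/genomes.txt" <<'EOF'
genome goodOrg
transBlat yourServer.yourInstitution.edu 17777
blat yourServer.yourInstitution.edu 17779

genome badOrg
transBlat yourServer.yourInstitution.edu 17777
blat yourServer.yourInstitution.edu 17777
EOF

# Paragraph mode: one record per stanza, one field per line.
conflicts="$(awk 'BEGIN { RS = ""; FS = "\n" }
{
    name = ""; nuc = ""; trans = ""
    for (i = 1; i <= NF; i++) {
        split($i, w, " ")
        if (w[1] == "genome")    name  = w[2]
        if (w[1] == "blat")      nuc   = w[2] ":" w[3]
        if (w[1] == "transBlat") trans = w[2] ":" w[3]
    }
    if (nuc != "" && nuc == trans)
        print name " uses " nuc " for both blat and transBlat"
}' "$workDir/genomes.txt")"

printf '%s\n' "$conflicts"
```

This check only catches a shared port. It cannot tell whether two distinct ports have been swapped, which requires asking each server for its type.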

+

Firewall note: Some institutions block repeated BLAT server queries. In such cases, +administrators must whitelist the following IP ranges:

+ +

+Further details on gfServer options are available from the +Source Downloads page +(pre-compiled binaries are located in the blat/ directory) and the +blat documentation.

+

+gfServers may also be set up within +GBiB +for local operation; see the +GBiB assembly BLAT setup +guide for detailed instructions. + +

To terminate a gfServer instance, run:

+
gfServer stop localhost 17860
+ + +

Troubleshooting BLAT servers

+

+Errors may occur if the port numbers on the transBlat (translated) and blat (nucleotide) lines are reversed. A typical +message in this case is:

+
Expecting 6 words from server got 2
+

If a gfServer instance is started from the same directory as the .2bit file, for example:

+
+gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit &
+

an attempt to run a DNA sequence query through the web-based BLAT tool may return:

+
+Error in TCP non-blocking connect() 111 - Connection refused
+Operation now in progress
+Sorry, the BLAT/iPCR server seems to be down. Please try again later.
+
+ + +
    +
  1. Process check
    + Confirm that a gfServer process is running:
    +
    ps aux | grep gfServer
    +
  2. Verify path and filename
    + In genomes.txt, the twoBitPath filename must match the .2bit file + used when starting gfServer. The address of the gfServer host can + be found by running the following on the machine where gfServer was launched: +
    hostname -i
    + This will return an IP address, for example: + 132.249.245.79
    + Test the connection with telnet: +
    telnet yourIP yourPort
    + For example: +
    telnet 132.249.245.79 17777
    + A successful connection shows: +
    Connected to 132.249.245.79
    + If Connection refused appears, gfServer may not be running, or the + IP/port configuration is incorrect.
    + Also check the genomes.txt file to confirm that the blat + line uses the correct IP and port. For example: +
    blat 132.249.245.79 17777
    + Instead of: +
    blat localhost 17777
    +
  3. Check gfServer status
    + Request status directly from gfServer: +
    gfServer status yourLocation yourPort
    + For example: +
    gfServer status 132.249.245.79 17777
    + Sample output might look like:
    +
    +version 36x2
    +type nucleotide
    +host localhost
    +port 17777
    +tileSize 11
    +stepSize 5
    +minMatch 2
    +pcr requests 0
    +blat requests 0
    +bases 0
    +misses 0
    +noSig 1
    +trimmed 0
    +warnings 0
    +
    +
  4. Test with gfClient
    + A reliable troubleshooting method is to bypass the web interface and use the + command-line utility gfClient. If gfClient successfully + connects to gfServer, the IP/port configuration is correct and any remaining + problem lies in the hub or browser configuration. From the directory containing the hub's .2bit file, the + command can be executed as follows: +
    gfClient yourLocation yourPort pathTo2bitDirectory yourFastaQuery.fa output.psl
    + For example: +
    gfClient localhost 17777 . query.fa gfOutput.psl
    + Note the . after the port, which tells gfClient to use + the .2bit file in the current directory. Check gfOutput.psl for BLAT results.
    + + Ensure that the yourAssembly.2bit file is present on the test machine. +
+ + +
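One way to catch reversed ports is to read the type line from the gfServer status output and compare it with the genomes.txt line that names that port. The sketch below parses a copy of the sample status text shown above rather than contacting a server; the function name `gfStatusType` is a hypothetical helper, not part of the kent tools.

```shell
#!/bin/sh
# Sketch: decide which genomes.txt line a gfServer port belongs on,
# based on the "type" line of its status output.
set -eu

# Print the server type ("nucleotide" or "translated") from status text on stdin.
gfStatusType() {
    awk '$1 == "type" { print $2; exit }'
}

# Sample status text copied from the example output; in practice, pipe in
# the output of: gfServer status yourLocation yourPort
sampleStatus='version 36x2
type nucleotide
host localhost
port 17777'

serverType="$(printf '%s\n' "$sampleStatus" | gfStatusType)"

# A nucleotide server belongs on a blat line, a translated server on transBlat.
case "$serverType" in
    nucleotide) expectedKey="blat" ;;
    translated) expectedKey="transBlat" ;;
    *)          expectedKey="unknown" ;;
esac

echo "server reports type=$serverType, so its port belongs on a $expectedKey line"
```

Running the same check against each configured port shows quickly whether the two entries have been swapped.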

Configuring assembly hubs to use a dynamic gfServer

+

A dynamic BLAT server is specified with the "dynamic" argument to the +blat, transBlat, and isPcr definitions in the hub +genomes.txt file, followed by the gfServer root-relative path of the +directory containing the .2bit and .gfidx files.

+

For example:

+
+blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+
+

The genome and gfServer indexes would be:

+
+$rootdir/yourAssembly/yourAssembly.2bit
+$rootdir/yourAssembly/yourAssembly.untrans.gfidx
+$rootdir/yourAssembly/yourAssembly.trans.gfidx
+
+

Refer to the +Building gfServer indexes section for detailed instructions on building + the index.

+

For large hubs, it is possible to have more deeply nested directories. For instance, following the +NCBI convention:

+
+blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+
+

These entries reference the following genome files and indexes:

+
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx
+
+ + + +
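The nested path in this example can be derived from the accession itself. The sketch below assumes the pattern visible above (accession prefix, then the nine digits split into groups of three, then the full accession); `accessionToPath` is a hypothetical helper name, and `$rootdir` is printed literally as in the listing.

```shell
#!/bin/sh
# Sketch: derive the nested directory and index file names for an accession.
set -eu

# Turn e.g. GCF_000181335.3 into GCF/000/181/335/GCF_000181335.3
accessionToPath() {
    printf '%s\n' "$1" | awk '{
        split($0, parts, "_")        # parts[1] = "GCF", parts[2] = "000181335.3"
        digits = parts[2]
        sub(/\..*$/, "", digits)     # drop the version suffix, leaving "000181335"
        printf "%s/%s/%s/%s/%s\n", parts[1],
            substr(digits, 1, 3), substr(digits, 4, 3), substr(digits, 7, 3), $0
    }'
}

relPath="$(accessionToPath GCF_000181335.3)"
name="${relPath##*/}"                # last path component, the accession itself

# The .2bit file plus the untranslated and translated gfServer indexes.
for suffix in 2bit untrans.gfidx trans.gfidx; do
    printf '$rootdir/%s/%s.%s\n' "$relPath" "$name" "$suffix"
done
```

The value of `relPath` is what follows the `dynamic` keyword on the blat, transBlat, and isPcr lines.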

Checking gfServer status for dynamic servers

+

A status query that does not specify a genome acts as an "I am alive" check: +

+% gfServer status myserver 4040
+version 37x1
+serverType dynamic
+
+

Specifying -genome checks that the genome is valid and reports how its index was +built:

+
+% gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040
+version 37x1
+serverType dynamic
+type nucleotide
+tileSize 11
+stepSize 5
+minMatch 2
+

Using -trans checks the translated index:

+
+% gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040
+version 37x1
+serverType dynamic
+type translated
+tileSize 4
+stepSize 4
+minMatch 3
+
+ +