9f28c7f69e700fa77ab91e4f4c3f3fe2432e4b8f lrnassar Wed Jun 25 12:58:23 2025 -0700
Beginning to phase our GBiB mentions for docker, refs #35611

diff --git src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html
index 7c5476c5f67..18521bfd542 100755
--- src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html
+++ src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html
@@ -1,435 +1,441 @@

Assembly

Please note, if you are working with a genome that has already been submitted to the NCBI Assembly system, it may already be available in the UCSC Genome Browser.

Please examine the GenArk Assembly Hub collection to see if your genome of interest is already available. In the case it cannot be found there, you can use the UCSC Assembly Request page to request a genome assembly be added to the UCSC Genome Browser.

Contents

Overview
Web Server
Linking to Your Assembly Hub
Building Tracks
Assembly Hub Resources
Adding BLAT Servers

Overview

The Assembly Hub function allows you to display your novel genome sequence using the UCSC Genome Browser.

Web Server

-

To display your novel genome sequence, use a web server at your institution (or free services like Cyverse), for usage behind a firewall you can also load them locally through GBiB to supply your files to the UCSC Genome Browser. Note that hosting hub files on HTTP is highly recommended and much more efficient than FTP. You then establish a hierarchy of directories and files to host your novel genome sequence. For example:

+

To display your novel genome sequence, use a web server at your institution (or free services like Cyverse), for usage behind a firewall you can also load them locally through docker to supply your files to the UCSC Genome Browser. Note that hosting hub files on HTTP is highly recommended and much more efficient than FTP. You then establish a hierarchy of directories and files to host your novel genome sequence. For example:

 myHub/ - directory to organize your files on this hub
     hub.txt - primary reference text file to define the hub, refers to:
     genomes.txt - definitions for each genome assembly on this hub
         newOrg1/ - directory of files for this specific genome assembly
             newOrg1.2bit - ‘2bit’ file constructed from your fasta sequence
             description.html - information about this assembly for users
             trackDb.txt - definitions for tracks on this genome assembly
             groups.txt - definitions for track groups on this assembly
             bigWig and bigBed files - data for tracks on this assembly
             external track hub data tracks
 

The URL to reference this hub would be: http://yourLab.yourInstitution.edu/myHub/hub.txt
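The directory layout above can be scaffolded before any data is built. The following is a minimal sketch, not part of the UCSC tools: the function name make_hub_skeleton and its two arguments are hypothetical, and it only creates empty placeholder files that you then fill in as described in the sections below.

```shell
# Sketch: create the hub directory layout described above.
# make_hub_skeleton is a hypothetical helper name used only for this example.
make_hub_skeleton() {
    hub_dir=$1      # top-level hub directory, e.g. myHub
    genome=$2       # assembly name, e.g. newOrg1

    mkdir -p "$hub_dir/$genome" || return 1

    # Hub-level definition files (touch does not overwrite existing content)
    touch "$hub_dir/hub.txt" "$hub_dir/genomes.txt"

    # Per-assembly files; the .2bit file itself is built later with faToTwoBit
    touch "$hub_dir/$genome/description.html" \
          "$hub_dir/$genome/trackDb.txt" \
          "$hub_dir/$genome/groups.txt"
}
```

For example, `make_hub_skeleton myHub newOrg1` creates myHub/hub.txt, myHub/genomes.txt and the three text files under myHub/newOrg1/.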

Note: there is now a useOneFile on hub setting that allows the hub properties to be specified in a single file. More information about this setting can be found on the Genome Browser User Guide.

You can view a working example hierarchy of files at: Plants

A smaller slice of this hub is represented in a Quick Start Guide to Assembly Hubs.

Linking to Your Assembly Hub

You can build direct links to the genome(s) in your assembly hub:

hub.txt

The initial file hub.txt is the primary URL reference for your assembly hub. The format of the file:

 hub hubName
 shortLabel genome
 longLabel Comment describing this hub contents
 genomesFile genomes.txt
 email contactEmail@institution.edu
 descriptionUrl aboutHub.html
 

shortLabel is the name that will appear in the genome pull-down menu at the UCSC gateway page. Example: Plants.

genomesFile is a reference to the next definition file in this chain that will describe the assemblies and tracks available at this hub. Typically genomes.txt is at the same directory level as this hub.txt, however it can also be a relative path reference to a different directory level.

The email address provides users a contact point for queries related to this assembly hub.

The descriptionUrl provides a relative path or URL link to a webpage describing the overall hub.
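A quick way to catch a missing line before attaching the hub is to look for each setting shown in the example hub.txt above. This is a rough sketch under stated assumptions: check_hub_txt is a hypothetical helper, it only tests that each setting name is present with some value, and it does not validate the values themselves.

```shell
# Sketch: report settings from the example hub.txt that are absent from a file.
# check_hub_txt is a hypothetical helper name, not a UCSC utility.
check_hub_txt() {
    hub_file=$1
    missing=0
    for key in hub shortLabel longLabel genomesFile email descriptionUrl; do
        # A setting is a line that starts with the key followed by a value
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]+[^[:space:]]" "$hub_file"; then
            echo "missing setting: $key"
            missing=1
        fi
    done
    return $missing
}
```

Running `check_hub_txt myHub/hub.txt` prints nothing and returns 0 when all six settings are found.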

genomes.txt

The genomes.txt file provides the references to the genome assemblies and tracks available at this assembly hub. The example file indicates the typical contents:

 genome ricCom1
 trackDb ricCom1/trackDb.txt
 groups ricCom1/groups.txt
 description July 2011 Castor bean
 twoBitPath ricCom1/ricCom1.2bit
 organism Ricinus communis
 defaultPos E09R7372:1000000-2000000
 orderKey 4800
 scientificName Ricinus communis
 htmlPath ricCom1/description.html
 transBlat yourLab.yourInstitution.edu 17777
 blat yourLab.yourInstitution.edu 17777
 isPcr yourLab.yourInstitution.edu 17779
 

There can be multiple assembly definitions in this single file. Separate these stanzas with blank lines. The references to other files are relative path references. In this example there is a sub-directory here called ricCom1 which contains the files for this specific assembly.

Note that it is strongly encouraged to give each of your genome stanzas a line for defaultPos, scientificName, organism, and description (along with the other settings above), so that when your hub is attached it will load a specified default location and have text that is more easily searched from the Gateway page.
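The recommendation above can be checked mechanically, since stanzas are separated by blank lines. The following is a sketch only: check_genomes_txt is a hypothetical helper, and it assumes the separating lines are truly empty (no spaces or tabs).

```shell
# Sketch: list genome stanzas that lack one of the recommended settings.
# check_genomes_txt is a hypothetical helper name, not a UCSC utility.
check_genomes_txt() {
    # RS='' puts awk in paragraph mode: each blank-line separated stanza is
    # one record, and -F'\n' makes each line of the stanza one field.
    awk -v RS='' -F'\n' '
    {
        split("", seen)                       # clear settings of previous stanza
        name = "(unnamed stanza)"
        for (i = 1; i <= NF; i++) {
            line = $i
            sub(/^[ \t]+/, "", line)          # tolerate indented lines
            split(line, words, /[ \t]+/)
            seen[words[1]] = 1
            if (words[1] == "genome") name = words[2]
        }
        n = split("defaultPos scientificName organism description", want, " ")
        for (k = 1; k <= n; k++)
            if (!(want[k] in seen)) { print name ": missing " want[k]; bad = 1 }
    }
    END { exit bad + 0 }
    ' "$1"
}
```

The function returns a non-zero status when any stanza is missing one of the four settings, so it can be used in a script that prepares the hub for upload.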

2bit File

The .2bit file is constructed from the fasta sequence for the assembly. The kent source program faToTwoBit is used to construct this file. Download the program from the downloads section of the Browser. For example:

 faToTwoBit ricCom1.fa ricCom1.2bit
 

Use twoBitInfo to verify the sequences in this assembly and to create a chrom.sizes file. The hub itself does not use this file, but it is useful in later processing to construct the big* files:

 twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes
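Once the chrom.sizes file exists, a short summary helps confirm that the sequence count and total size look as expected. This is a sketch under the assumption that each line holds a sequence name and a size separated by whitespace, as written by the twoBitInfo command above; summarize_chrom_sizes is a hypothetical helper name.

```shell
# Sketch: count sequences and total bases in a chrom.sizes file.
# summarize_chrom_sizes is a hypothetical helper name, not a UCSC utility.
summarize_chrom_sizes() {
    awk '
        # Flag any line that is not "name size" with a numeric size
        NF != 2 || $2 !~ /^[0-9]+$/ { print "bad line " NR ": " $0; bad = 1; next }
        { n++; total += $2 }
        # %.0f avoids integer overflow in some awk versions on large genomes
        END { printf "%d sequences, %.0f bases\n", n, total; exit bad + 0 }
    ' "$1"
}
```

For example, `summarize_chrom_sizes ricCom1.chrom.sizes` prints one summary line and returns a non-zero status if any line is malformed.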
 

The .2bit commands can function with the .2bit file at a URL:

 twoBitInfo -udcDir=. http://genome-test.gi.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
 

Sequence can be extracted from the .2bit file with the twoBitToFa command, for example:

 twoBitToFa -seq=chrCp -udcDir=. http://genome-test.gi.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
 

groups.txt

The groups.txt file defines the grouping of track controls under the primary genome browser image display. The example referenced here has the usual definitions as found in the UCSC Genome Browser.

Each group is defined, for example the Mapping group:

 name map
 label Mapping
 priority 2
 defaultIsClosed 0
 

Building Tracks

Tracks are defined in the trackDb.txt where each stanza describes how tracks are displayed (shortLabel/longLabel/color/visibility) and other information such as what group the track should belong to (referencing the groups.txt) and if any additional html should display when one clicks into the track or a track item:

 track gap_
 longLabel Gap
 shortLabel Gap
 priority 11
 visibility dense
 color 0,0,0
 bigDataUrl bbi/ricCom1.gap.bb
 type bigBed 4
 group map
 html ../trackDescriptions/gap
 

For more information about the syntax of the trackDb.txt file, use UCSC's Hub Track Database Definition page. It helps to have a compute cluster to process the genomes to construct tracks. For small genomes, it can be done on a single computer that has multiple cores. The process for each track is unique. Please note the continuing document: Browser Track Construction for a discussion of constructing tracks for your assembly hub.
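When a trackDb.txt grows to many stanzas, it can help to list each track with its type and data file to spot missing bigDataUrl lines. The following is a minimal sketch: list_hub_tracks is a hypothetical helper, and it assumes stanzas are separated by truly empty lines and that each setting sits on one line as in the example stanza above.

```shell
# Sketch: print "track <TAB> type <TAB> bigDataUrl" for each trackDb stanza.
# list_hub_tracks is a hypothetical helper name, not a UCSC utility.
list_hub_tracks() {
    # Paragraph mode: one stanza per record, one line per field
    awk -v RS='' -F'\n' '
    {
        track = type = url = "-"              # "-" marks a setting not found
        for (i = 1; i <= NF; i++) {
            line = $i
            sub(/^[ \t]+/, "", line)          # tolerate indented stanzas
            key = line; sub(/[ \t].*$/, "", key)
            val = line; sub(/^[^ \t]+[ \t]+/, "", val)
            if (key == "track") track = val
            else if (key == "type") type = val
            else if (key == "bigDataUrl") url = val
        }
        print track "\t" type "\t" url
    }' "$1"
}
```

A "-" in the last column points to a stanza without a bigDataUrl line, which is expected for some container tracks but worth checking for data tracks.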

Cytoband Track

Assembly hubs can have a Cytoband track that can allow for quicker navigation of individual chromosomes and display banding pattern information if known.

A quick version of the track can be built using the existing chrom.sizes files for your assembly (the banding options include gneg, gpos25, gpos50, gpos75, gpos100, acen, gvar, or stalk):

 cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
 

The resulting bed file can be turned into a bigBed file and given a .as file (example here) to inform the browser it is not a normal bed.

 bedToBigBed -type=bed4 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed
 

In the trackDb, as long as the track is named cytoBandIdeo (track cytoBandIdeo example) it will load in the assembly hub.
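The quick cytoband build above can also be wrapped in a small function that writes tab-separated columns, the conventional BED delimiter. This is a sketch only: make_cytoband_bed is a hypothetical helper, and the optional second argument (defaulting to gneg) is an illustration of choosing one of the banding options listed above for every sequence.

```shell
# Sketch: build a one-band-per-sequence cytoBandIdeo bed from chrom.sizes.
# make_cytoband_bed is a hypothetical helper name, not a UCSC utility.
make_cytoband_bed() {
    sizes_file=$1
    stain=${2:-gneg}    # banding value written in the last column
    sort -k1,1 -k2,2n "$sizes_file" |
        awk -v stain="$stain" 'BEGIN { OFS = "\t" }
            # name, start 0, end = sequence size, band name, stain
            { print $1, 0, $2, $1, stain }'
}
```

For example, `make_cytoband_bed araTha1.chrom.sizes > cytoBandIdeo.bed` produces the input for the bedToBigBed step shown above.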

Assembly Hub Resources

There are resources for automatically building assembly hubs available from G-OnRamp and MakeHub.

There is also a collection of Example NCBI assembly hubs that are already working and can either be used or copied as a template to build further hubs.

G-OnRamp

G-OnRamp is a Galaxy workflow that turns a genome assembly and RNA-Seq data into a Genome Browser with multiple evidence tracks. Because G-OnRamp is based on the Galaxy platform, developing some familiarity with the key concepts and functionalities of Galaxy would be beneficial prior to using G-OnRamp. Here is a link to their instruction page that gives an overview of their process.

MakeHub

MakeHub is a command line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC genome browser. More information can be found on their GitHub page.

Example loading African bush elephant assembly hub and looking at the related genomes.txt and trackDb.txt

Here are some quick steps to load an example hub from this collection, and an attempt to explain how to look at the files behind the hub.

  1. Click the above Vertebrate Mammalian assembly hub link.
  2. Scroll down and find the "common name" column and click the hyperlink for "African bush elephant" after looking at the other information on that row.
  3. Note that you have arrived at a gateway page that has "African bush elephant Genome Browser - GCA_000001905.1_Loxafr3.0" displayed, where you can see a "Download files for this assembly hub:" section with a link, in case you want to access these specific files.
  4. Click "Go" or the top "Genome Browser" blue bar menu to arrive at viewing this assembly hub (note this is on our genome-test site).
  5. To load this hub on our public site, at the earlier step you can copy the hyperlink for "African bush elephant" and paste it in a browser and change the very first "http://genome-test.gi.ucsc.edu/gbdb/..." to "http://genome.ucsc.edu/cgi-bin/..." instead.

Now to investigate the files behind the hub to understand the process involved:

  1. Click the link found in the "Download files for this assembly hub:" section on a loaded assembly hub's gateway page.
  2. Note the "GCA_000001905.1_Loxafr3.0.ncbi2bit" file, this is the binary indexed remote file that is allowing the Browser to display this genome.
  3. Find the "GCA_000001905.1_Loxafr3.0.genomes.ncbi.txt" file and click the link to look at it.
  4. Review this genomes.txt file, which defines each genome in a hub: the "twoBitPath" line shows where to find the above 2bit file, and the "trackDb" line shows where to find the track database used to display data on this genome (the real genomes.txt for this massive hub is up one directory, as this hub has 204 assemblies, and there you will find this stanza included).
  5. From the earlier link to all the files, click the GCA_000001905.1_Loxafr3.0.trackDb.ncbi.txt link.
  6. Review this trackDb.txt file, which defines the tracks to display on this hub. It has "bigDataUrl" lines to tell the Browser where to find the data to display for each track. Some tracks also have "searchIndex" and "searchTrix" lines to help support finding data in the hub, "url" and "urlLabel" lines to help create links out from items in the hub to other external resources, and "html" lines pointing to a file with information to display about the data for users who click into tracks.

Adding BLAT servers

BLAT servers (gfServer) are configured as either dedicated or dynamic servers. Dedicated BLAT servers index a genome when started and remain running in memory to respond quickly to requests. Dynamic BLAT servers pre-index genomes to files and are run on demand to handle a BLAT request and then exit.

A dedicated gfServer is easier to configure and faster to respond. However, the server continually uses memory. A dynamic gfServer is more appropriate with multiple assemblies and infrequent use. Its response time is usually acceptable; however, it varies with the speed of the disk containing the index. With repeated access, the operating system will cache the indexes in memory, improving response time.

Configuring assembly hubs to use a dedicated gfServer

By running your own BLAT server, you can add lines to the genomes.txt file of your assembly hub to enable the browser to access the server and activate blat searches.

Please see Running your own gfServer for details on installing and configuring both dedicated and dynamic gfServers.

 transBlat yourServer.yourInstitution.edu 17777
 blat yourServer.yourInstitution.edu 17779
 isPcr yourServer.yourInstitution.edu 17779
 

Please see more about configuring your blat gfServer to replicate the UCSC Browser's settings, which will also have information about optimizing PCR results. The Source Downloads page offers access to utilities with pre-compiled binaries such as gfserver found in a blat/ directory for your machine type here and further blat documentation here, and the gfServer usage statement for further options.

-

Please also know you can set up gfservers on a GBiB and run it locally. Please see this GBiB assembly blat step-by-step set up page for details.

+

Please also know you can set up gfservers on docker and run it locally.

Note: You can stop your instance of gfServer with a command. For example:

 gfServer stop localhost 17860
 

Troubleshooting BLAT servers

You can see this error if you have the translatedBlat / nucleotideBlat port numbers the wrong way around:

 Expecting 6 words from server got 2
 

Another error can appear when running a DNA sequence query through the web-based BLAT tool after loading a hub and starting a gfServer instance (from the same directory as the 2bit file). For example, a command to start an instance of gfServer:

 gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit &
 

Example of a possible error message, from web-based BLAT after attempting a web-based BLAT query:

 Error in TCP non-blocking connect() 111 - Connection refused
 Operation now in progress
 Sorry, the BLAT/iPCR server seems to be down. Please try again later.
 

Check the following:

1.) Process check

First, make sure your gfServer instance is running.
Type the following command to check for your running gfServer process:

ps aux | grep gfServer

2.) Check for correct path/filename

In your genomes.txt file, does your twoBitPath/filename match what you specified in your command to start gfServer?
In your genomes.txt file, is the location of your gfServer instance correct?
To check this, you can cd into the directory where you started your gfServer, then type the command:

hostname -i
Your result should be an IP address, for example, '132.249.245.79'.

Now you can test the connection to your port that you specified, with a simple telnet command.
Type in the following command: telnet yourIP yourPort. For example:

telnet 132.249.245.79 17777

The results should read, "Connected to 132.249.245.79".
Otherwise, if gfServer isn't running or if you typed the wrong location in your telnet command, telnet will say, "Connection refused."
In this example, check your genomes.txt file, and make sure your blat line reads, "blat 132.249.245.79 17777".
You may need to change your genomes.txt file from, for example, "blat localhost 17777" to "blat 132.249.245.79 17777" (use your specific IP/host name where gfServer is running).

3.) "gfServer status" check

To request status from the gfServer process, run: gfServer status yourLocation yourPort.
For example:

$ gfServer status 132.249.245.79 17777

You should see output like this:

 version 36x2
 type nucleotide
 host localhost
 port 17777
 tileSize 11
 stepSize 5
 minMatch 2
 pcr requests 0
 blat requests 0
 bases 0
 misses 0
 noSig 1
 trimmed 0
 warnings 0
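
The type line in this output shows whether the server on a given port is a nucleotide or a translated server, which helps when the blat and transBlat ports may have been swapped. The following is a rough sketch: check_gfserver_type is a hypothetical helper that reads status text on standard input, and it assumes gfServer status writes the lines shown above to standard output.

```shell
# Sketch: compare the "type" line of gfServer status output with what the
# genomes.txt line expects. check_gfserver_type is a hypothetical helper name.
check_gfserver_type() {
    expected=$1     # "nucleotide" for a blat line, "translated" for transBlat
    # Read status text from standard input and take the first "type" value
    actual=$(awk '$1 == "type" { print $2; exit }')
    if [ "$actual" = "$expected" ]; then
        echo "ok: server type is $actual"
    else
        echo "mismatch: expected $expected, got ${actual:-none}"
        return 1
    fi
}
```

For example, `gfServer status 132.249.245.79 17777 | check_gfserver_type nucleotide` would confirm that the port used on the blat line answers as a nucleotide server.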
 

4.) Testing with gfClient

The best troubleshooting test is to take the webpage out of the equation, and use the command line utility, gfClient, to run the query on your instance of gfServer. If you can successfully connect gfClient to gfServer, you will know that your location and port specification are correct.

From the directory that holds your hub's .2bit file (should be the same directory where your instance of gfServer was launched), perform a query using gfClient:

You can type "gfClient" on your command line to see the usage statement.

Use the following command: gfClient yourLocation yourPort pathOf2bitFile yourFastaQuery.fa nameOfOutputFile.psl

FYI: For testing with gfClient, you only need the gfServer binary on your server, not blat.

For example:

gfClient localhost 17777 . query.fa gfOutput.psl

Note the "." after the port, to specify that the query will use the .2bit file in the current directory. After running this command, take a look at the gfOutput.psl file. If successful, you will see BLAT results.

Another example:

Note: In the example below, "yourServer.yourInstitution.edu" is the name of the machine where you run the gfServer command.

From the test machine: Test the DNA alignment, where test.fa is some sequence to find:

gfClient yourServer.yourInstitution.edu 17779 `pwd` test.fa dnaTestOut.psl

From the test machine: Test the protein alignment, where proteinSequence.fa is the sequence to find:

gfClient -t=dnaX -q=prot yourServer.yourInstitution.edu 17779 `pwd` proteinSequence.fa proteinOut.psl

Configuring assembly hubs to use a dynamic gfServer

A dynamic BLAT server is specified with the "dynamic" argument to the blat, transBlat, isPcr definitions in the hub genomes.txt file, followed by the gfServer root-relative path of the directory containing the 2bit and gfidx files.

For example:

 blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
 transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
 isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly
 

The genome and gfServer indexes would be:

 $rootdir/yourAssembly/yourAssembly.2bit
 $rootdir/yourAssembly/yourAssembly.untrans.gfidx
 $rootdir/yourAssembly/yourAssembly.trans.gfidx
 

See Building gfServer indexes for instructions in building the index.

For large hubs, it is possible to have more deeply nested directories, for instance, the following NCBI convention:

 blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
 transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
 isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
 

Which will reference these genome files and indexes:

 $rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit
 $rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx
 $rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx
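
In both examples the file names are taken from the last component of the path given after "dynamic". A small sketch can print the files a dynamic gfServer would look for, so they can be checked on disk. The function name dynamic_gfserver_files and its arguments are hypothetical, and the naming rule is inferred from the two examples above.

```shell
# Sketch: list the genome and index files expected for a dynamic gfServer.
# dynamic_gfserver_files is a hypothetical helper name, not a UCSC utility.
dynamic_gfserver_files() {
    root_dir=$1         # gfServer root directory ($rootdir above)
    genome_dir=$2       # path given after "dynamic" in genomes.txt
    # File names use the last path component, as in the examples above
    name=$(basename "$genome_dir")
    for suffix in 2bit untrans.gfidx trans.gfidx; do
        echo "$root_dir/$genome_dir/$name.$suffix"
    done
}
```

For example, `dynamic_gfserver_files "$rootdir" yourAssembly | xargs ls -l` shows at a glance whether all three files are in place.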
 

Check gfServer status for dynamic servers

A query without specifying a genome is an "I am alive" check:

 % gfServer status myserver 4040
 version 37x1
 serverType dynamic
 

Specifying a genome checks that it is valid and gives information on how the index was built:

 % gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040
 version 37x1
 serverType dynamic
 type nucleotide
 tileSize 11
 stepSize 5
 minMatch 2
 

Using -trans checks the translated index:

 % gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040
 version 37x1
 serverType dynamic
 type translated
 tileSize 4
 stepSize 4
 minMatch 3