5295d58d808a9ef30d0514558d943b7dbf4bd4f7 gperez2 Tue Sep 2 16:03:33 2025 -0700 Updating the GB page for the Assembly Hub Wiki, refs #34740 diff --git src/hg/htdocs/goldenPath/help/assemblyHubHelp.html src/hg/htdocs/goldenPath/help/assemblyHubHelp.html new file mode 100755 index 00000000000..ce79fca0e8c --- /dev/null +++ src/hg/htdocs/goldenPath/help/assemblyHubHelp.html @@ -0,0 +1,848 @@ + + + + + + + +

Assembly Hub User Guide

+ + +

Overview

+An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence +to be used for a browser instance, as well as all the data files that define the annotation for +that sequence. Assembly Data Hubs allow researchers to use the UCSC Genome Browser to view their +own sequences with associated annotation, without the requirement that UCSC support a browser on that sequence. +

+ +

+Note: if you are working with a genome that has already been submitted +to the NCBI Assembly system, it may +already be available in the UCSC Genome Browser. +Please check the GenArk Assembly Hub collection +to see if your genome of interest is already available. If it is not listed there, you can use the +UCSC Assembly Request page to request that the genome assembly be +added.

+ + +

Web Server

Linking to Your Assembly Hub

Building Tracks

Cytoband Track

Assembly Hub Resources

G-OnRamp
MakeHub
Example NCBI Assembly Hubs

+ + +

Adding BLAT Servers

Configuring Assembly Hubs to Use a Dedicated gfServer
Troubleshooting BLAT Servers
Configuring Assembly Hubs to Use a Dynamic gfServer
Check gfServer Status for Dynamic Servers

+ + + + +

Web Server

+To display a novel genome sequence in the UCSC Genome Browser, a web server hosted by the +institution (or a free service such as Cyverse) +can be used. For environments operating behind a firewall, hub files can also be loaded locally +through GBiB to provide access to the UCSC Genome +Browser. Hosting hub files over HTTP is strongly recommended, as it is +significantly more efficient than FTP. A hierarchical directory structure must then be +established to organize the files associated with the genome sequence. For example: +

+ +

+myHub/ - directory to organize your files on this hub
+    hub.txt - primary reference text file to define the hub, refers to:
+    genomes.txt - definitions for each genome assembly on this hub
+        newOrg1/ - directory of files for this specific genome assembly
+            newOrg1.2bit - '2bit' file constructed from your fasta sequence
+            description.html - information about this assembly for users
+            trackDb.txt - definitions for tracks on this genome assembly
+            groups.txt - definitions for track groups on this assembly
+            bigWig and bigBed files - data for tracks on this assembly
+            external track hub data tracks
+

+The hub can be referenced by a URL such as: http://yourLab.yourInstitution.edu/myHub/hub.txt
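The layout above can be sketched with a few shell commands. This is a minimal sketch, assuming the placeholder names myHub and newOrg1 from the listing; the files it creates are empty placeholders to be filled in as described in the sections that follow, and the .2bit file is built separately.

```shell
#!/bin/sh
# Create the example hub layout. myHub and newOrg1 are the placeholder
# names from the listing above, not required names.
hubDir=myHub
asmDir="$hubDir/newOrg1"

mkdir -p "$asmDir"

# Top-level definition files (empty placeholders).
touch "$hubDir/hub.txt" "$hubDir/genomes.txt"

# Per-assembly files (empty placeholders).
touch "$asmDir/description.html" "$asmDir/trackDb.txt" "$asmDir/groups.txt"

# List what was created.
find "$hubDir" | sort
```

The directory can then be copied to the web server so that hub.txt is reachable at a URL like the one above.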

+ +

Assembly Hub Components

+ + + +

hub.txt

+The initial file, hub.txt, is the primary URL reference for the assembly hub.

Format of the file:

+hub hubName
+shortLabel genome
+longLabel Comment describing this hub contents
+genomesFile genomes.txt
+email contactEmail@institution.edu
+descriptionUrl aboutHub.html
+

+shortLabel is the name that will appear in the genome pull-down menu at the +UCSC gateway page.

+genomesFile is a reference to the next definition file in this chain that will +describe the assemblies and tracks available at this hub. Typically, genomes.txt is at +the same directory level as this hub.txt; however, it can also be a relative path +reference to a different directory level.

+email provides users with a contact point for questions related to this assembly hub.

+descriptionUrl specifies a relative path or URL link to a webpage describing the hub.

+You can view a working example at hub.txt

+ + +

genomes.txt

The genomes.txt file provides references to the genome assemblies and tracks available in +the assembly hub.

+genome ricCom1
+trackDb ricCom1/trackDb.txt
+groups ricCom1/groups.txt
+description July 2011 Castor bean
+twoBitPath ricCom1/ricCom1.2bit
+organism Ricinus communis
+defaultPos E09R7372:1000000-2000000
+orderKey 4800
+scientificName Ricinus communis
+htmlPath ricCom1/description.html
+transBlat yourLab.yourInstitution.edu 17777
+blat yourLab.yourInstitution.edu 17779
+isPcr yourLab.yourInstitution.edu 17779
+

+Multiple assembly definitions can be included in a single file, separated by blank lines. The file +references are relative paths. In this example, the subdirectory ricCom1 contains +the files for this specific assembly.

genome is equivalent to the UCSC database name. This name appears on title + pages in the Genome Browser.
trackDb points to the file that defines the tracks for this genome + assembly (see the + Track Hub + help documentation for details).
groups points to the file defining track groups, which are collections of + related tracks displayed together under the main Genome Browser image.
description is displayed on the Gateway page and title pages for this + assembly. It also appears in the assembly pull-down menu.
twoBitPath points to the .2bit sequence file for the assembly. + This file is typically generated from FASTA files using the faToTwoBit + kent program. The path can also point to a URL.
organism is displayed alongside the description on title pages. It also + appears in the assembly pull-down menu.
defaultPos defines the initial view in the Genome Browser, usually + highlighting a popular gene or region of interest.
orderKey controls the ordering of assemblies in the pull-down menu.
htmlPath points to the HTML file with assembly information. The HTML file + is displayed on the Gateway page.
transBlat, blat, and isPcr configure the gfServer instances used for translated (amino acid) searches, nucleotide BLAT alignments, and PCR. See the Adding BLAT Servers section for details.

Note: it is strongly recommended that each genome stanza include defaultPos, scientificName, organism, and description, so that the hub loads with meaningful defaults and can be more easily searched from the Gateway page.
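The recommendation above can be checked mechanically. The following is a minimal sketch, not a UCSC tool: check_genomes is a hypothetical helper that reads a genomes.txt file, treats blank-line-separated blocks as stanzas, and reports any stanza that lacks one of the four recommended settings.

```shell
#!/bin/sh
# check_genomes FILE (hypothetical helper): report genome stanzas that lack
# any of the recommended settings. Stanzas are separated by blank lines.
check_genomes() {
    awk '
    BEGIN {
        RS = ""; FS = "\n"
        n = split("defaultPos scientificName organism description", want, " ")
    }
    {
        split("", seen)              # clear settings seen in the last stanza
        name = "(unnamed)"
        for (i = 1; i <= NF; i++) {
            split($i, words, " ")    # first word of each line is the setting
            seen[words[1]] = 1
            if (words[1] == "genome") name = words[2]
        }
        for (j = 1; j <= n; j++)
            if (!(want[j] in seen)) { print name ": missing " want[j]; bad = 1 }
    }
    END { exit bad }
    ' "$1"
}
```

Running check_genomes genomes.txt prints one line per missing setting and exits with a non-zero status if anything is missing, so it can be used in a script before publishing the hub.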

+ + +

2bit File

+The .2bit file is constructed from the FASTA sequence for the assembly using the +faToTwoBit kent program (available from the +downloads page).

Example:

+faToTwoBit ricCom1.fa ricCom1.2bit
+

+Use twoBitInfo to verify sequences and create a chrom.sizes file, +which is not used in the hub itself but is helpful for constructing big* files: +

+twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes
+

+twoBitInfo can also read a .2bit file hosted at a URL; the -udcDir option sets the local directory used to cache the remote data:

+twoBitInfo -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
+

+To extract sequences from a .2bit file: +

+twoBitToFa -seq=chrCp -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
+

+ + +

groups.txt

The groups.txt file defines the grouping of track controls under the Genome Browser graphic +display.

Example:

+name map
+label Mapping
+priority 2
+defaultIsClosed 0
+

+ +

The name setting is used in the trackDb.txt file to associate specific tracks with a + group.
The label setting specifies the title of the group in the genome browser. By default, + groups are sorted alphabetically based on the label.
The priority setting dictates the display order of the track groups, with lower + numbers shown first.
The defaultIsClosed setting controls whether the group is initially expanded or + collapsed (0 for expanded, 1 for collapsed).

Refer to the Adding Groups to a Track hub section of the Track Hubs help page for more +details.
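Because the name setting in groups.txt is what trackDb.txt refers to, a mismatch between the two files is an easy mistake to make. The following is a minimal sketch of a cross-check, assuming both files use the simple "setting value" line format shown in this guide; check_groups is a hypothetical helper, not a UCSC tool.

```shell
#!/bin/sh
# check_groups GROUPS_FILE TRACKDB_FILE (hypothetical helper): report any
# "group" value in trackDb.txt that has no matching "name" in groups.txt.
check_groups() {
    awk '
    NR == FNR { if ($1 == "name") defined[$2] = 1; next }   # first file
    $1 == "group" && !($2 in defined) {                      # second file
        print "undefined group: " $2
        bad = 1
    }
    END { exit bad }
    ' "$1" "$2"
}
```

For example, check_groups ricCom1/groups.txt ricCom1/trackDb.txt prints nothing and exits with status 0 when every track's group is defined.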

+ + +

Single-File Track Hub (useOneFile on)

+Traditionally, an assembly hub required multiple configuration files (hub.txt, +genomes.txt, trackDb.txt, and optionally groups.txt), along +with a .2bit file for the sequence. The useOneFile on option simplifies +this by consolidating everything into a single configuration file. Note: The single-file +format supports one genome assembly per file. For multiple assemblies, use the traditional +multi-file setup.

Example configuration:

+hub mySingleFileHub
+shortLabel My Single-File Hub
+longLabel An example of a single-file UCSC track hub
+useOneFile on
+email myEmail@example.com
+
+genome hg19
+
+track exampleBigWig
+shortLabel BigWig Coverage
+longLabel Coverage data over hg19
+type bigWig
+visibility full
+bigDataUrl http://myServer.com/data/example.bigWig
+
+track exampleVCF
+shortLabel VCF Variants
+longLabel Variant calls over hg19 region
+type vcfTabix
+visibility pack
+bigDataUrl http://myServer.com/data/example.vcf.gz
+

+ +

The hub stanza with the useOneFile on setting replaces hub.txt.
The genome line replaces genomes.txt.
The track stanzas replace trackDb.txt.

+ +

+If your hub requires a reference genome sequence, you can still provide a .2bit file with twoBitPath. Grouping (previously in groups.txt) can also be integrated here if needed.

+ +

+Once hosted on a server, the single configuration file (and associated data files such as +.bigWig, .vcf.gz, .2bit) can be loaded into the UCSC Genome +Browser via the My Hubs page.

+ + +

Building Tracks

Tracks are defined in the trackDb.txt file, where each stanza specifies how +tracks are displayed (shortLabel, longLabel, color, visibility), along with other information such +as the group the track belongs to (referencing groups.txt) and whether +additional HTML should be displayed when a user clicks into the track or a track item:

+track gap_
+longLabel Gap
+shortLabel Gap
+priority 11
+visibility dense
+color 0,0,0
+bigDataUrl bbi/ricCom1.gap.bb
+type bigBed 4
+group map
+html ../trackDescriptions/gap
+

+For more information about the syntax of the trackDb.txt file, refer to the +Track Database Definition page. +

Processing genomes to construct tracks often requires a cluster or supercomputer. Small +genomes can be processed on single computers with multiple cores. The process for each track is +unique. For details, refer to the + + Browser Track Construction page, which discusses constructing tracks for assembly +hubs.

+ + +

Cytoband Track

+Assembly hubs can include a Cytoband track, which allows quicker navigation of chromosomes and +displays banding pattern information, if known.

+A simple version of the track can be built using the existing chrom.sizes file for your assembly. Banding options include gneg, gpos25, gpos50, gpos75, gpos100, acen, gvar, and stalk.

Example:

+cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
+

+The resulting BED file can be converted into a bigBed file and associated with an .as definition file (see example) to inform the browser that this is not a standard BED:

+bedToBigBed -type=bed4 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed
+

+In trackDb.txt, if the track is named cytoBandIdeo (e.g., +track cytoBandIdeo), it will automatically load into the assembly +hub.
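If band values are edited by hand, the BED file can be sanity-checked before it is converted. The following is a minimal sketch, assuming the five-column layout produced by the awk command above; check_stains is a hypothetical helper that flags any fifth-column value outside the banding options listed earlier.

```shell
#!/bin/sh
# check_stains BED (hypothetical helper): print each non-empty line whose
# fifth column is not a listed banding value; exit non-zero if any is found.
check_stains() {
    awk '
    BEGIN {
        n = split("gneg gpos25 gpos50 gpos75 gpos100 acen gvar stalk", s, " ")
        for (i = 1; i <= n; i++) ok[s[i]] = 1
    }
    NF > 0 && !($5 in ok) {
        print "line " NR ": unexpected band value \"" $5 "\""
        bad = 1
    }
    END { exit bad }
    ' "$1"
}
```

Running check_stains cytoBandIdeo.bed on the file generated above should print nothing, since every row uses gneg.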

+ + +

Linking to Your Assembly Hub

+Direct links to the genome(s) within the assembly hub can then be constructed.

+ The hub connect page: +
+ + http://genome.ucsc.edu/cgi-bin/hgHubConnect?hgHub_do_redirect=on&hgHubConnect.remakeTrackHub=on&hgHub_do_firstDb=1&hubUrl=http://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubAssembly/plantAraTha1/hub.txt + +
+ The genome gateway page: +
+ + http://genome.ucsc.edu/cgi-bin/hgGateway?genome=araTha1&hubUrl=http://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubAssembly/plantAraTha1/hub.txt + +
+ Directly to the genome browser: +
+ + http://genome.ucsc.edu/cgi-bin/hgTracks?genome=araTha1&hubUrl=http://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubAssembly/plantAraTha1/hub.txt + +
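The gateway and browser links above follow the same pattern, so they can be generated for any genome in a hub. The following is a minimal sketch; hub_link is a hypothetical helper, and it assumes the hub URL contains no characters (such as & or ?) that would need URL encoding.

```shell
#!/bin/sh
# hub_link CGI GENOME HUB_URL (hypothetical helper): print a Genome Browser
# link. CGI is hgGateway or hgTracks, as in the examples above.
hub_link() {
    printf 'http://genome.ucsc.edu/cgi-bin/%s?genome=%s&hubUrl=%s\n' "$1" "$2" "$3"
}

# Reproduces the "Directly to the genome browser" link shown above.
hub_link hgTracks araTha1 \
    http://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubAssembly/plantAraTha1/hub.txt
```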

+ + + + +

Assembly Hub Resources

+Resources for automatically building assembly hubs include G-OnRamp and MakeHub.

+ + +

G-OnRamp

+G-OnRamp is a Galaxy workflow that turns a genome assembly and RNA-Seq data into a Genome Browser +with multiple evidence tracks. Since G-OnRamp is based on the Galaxy platform, becoming familiar +with Galaxy concepts and functionalities is recommended. See their +instruction page +for an overview. +

+ + +

MakeHub

+MakeHub is a command-line tool for fully automatic generation of track data hubs for visualizing +genomes with the UCSC Genome Browser. More information is available on their +GitHub page.

+ + +

Example NCBI assembly hubs

+There is a collection of example NCBI assembly hubs that can be used directly or copied as +templates. A large collection of script-generated assembly hubs can be browsed on the development server, with +links defaulting to the genome-test site. To load these hubs on the public UCSC site, copy +the hub.txt link and replace the test server domain with the public domain.

+The following table provides links to launch various assembly hubs grouped by species subsets. By +scrolling down each page, you can access rows for individual assemblies (or groups of assemblies, +e.g., bacteria). Clicking the "common name" hyperlink (e.g., "African bush +elephant" on the Vertebrate Mammalian page) loads the selected hub.

+ + +

These assemblies use NCBI accession naming patterns. Prototype gene tracks from NCBI gene +predictions are available for a few assemblies. No BLAT servers are provided. Users can copy the +skeleton structure of a hub to run their own BLAT server locally. Brief instructions are available +on each assembly gateway page under "Download files for this assembly hub." + + +

Example: Loading the African bush elephant assembly hub and reviewing the related genomes.txt + and trackDb.txt

+Here are some quick steps to load an example hub from this collection, along with an explanation +of how to view the files behind the hub.

Click the + Vertebrate Mammalian assembly hub link above.
Scroll down to the common name column and click the hyperlink for + "African bush elephant".
You will arrive at a gateway page titled "African bush elephant Genome Browser - + GCA_000001905.1_Loxafr3.0 assembly". This page includes a section, + Data file downloads, where you can access the underlying + files.
Click Go (or use the top Genome Browser blue bar menu) to view this assembly hub. (Note: this will open on our genome-test site.)
To load this hub on our public site, copy the hyperlink for + African bush elephant and paste it into your browser. + Then, change the beginning of the URL from

+https://genome-test.gi.ucsc.edu/...
+

to:

+https://genome.ucsc.edu/...
+

Exploring the files behind the hub

+To better understand how the hub works, you can review the associated files:

Go to the GCA_000001905.1_Loxafr3.0 directory + link.
Locate the file GCA_000001905.1_Loxafr3.0.ncbi.2bit. This binary indexed file allows + the Browser to display the genome sequence.
Open GCA_000001905.1_Loxafr3.0.genomes.ncbi.txt. This genomes.txt file + defines each assembly in the hub. It points to the genome's .2bit file + (twoBitPath) and specifies the trackDb file that contains the + track definitions. (In the case of this large hub with 204 assemblies, the main + genomes.txt file is one directory up, and this stanza is included there.)
Review GCA_000001905.1_Loxafr3.0.trackDb.ncbi.txt. This trackDb.txt + file defines the tracks displayed in the hub. It contains bigDataUrl lines + that tell the Browser where to retrieve data for each track, along with optional + settings such as:

searchIndex + and searchTrix: support data searches within the hub
url and + urlLabel: create outbound links to external + resources
html: links to a file with descriptive information + displayed when users click into a track

+ + +

Adding BLAT servers

BLAT servers (gfServer) can be configured as either dedicated or +dynamic:

Dedicated BLAT servers index a genome at startup and remain running in memory, allowing + fast responses. The drawback is that they continuously consume memory.
Dynamic BLAT servers pre-index genomes into files and start on demand to handle a + request, exiting afterward. They are more memory-efficient and work well for hubs + with many assemblies or infrequent use. Their response time depends on disk speed + but improves with repeated access due to operating system caching.

+ + + +

Configuring assembly hubs to use a dedicated gfServer

+When running a local BLAT server, assembly hubs can be configured to support BLAT searches by +adding entries to the + genomes.txt file.

+Installation and configuration details for gfServer are provided in the +Running your own gfServer +page.

+In the genomes.txt stanza for the target assembly, include the following lines (note +the capital B in transBlat):

+transBlat yourServer.yourInstitution.edu 17777
+blat yourServer.yourInstitution.edu 17779
+isPcr yourServer.yourInstitution.edu 17779
+

With this configuration, BLAT and PCR searches become available for the assembly. +For example:

+http://genome.ucsc.edu/cgi-bin/hgBlat?hubUrl=http://yourServer.yourInstitution.edu/myHub/hub.txt
+

+This URL opens the BLAT interface, where the assembly will appear in the Genome drop-down menu. +The isPcr line enables the use of a different gfServer instance for PCR queries if +desired.

Firewall note: Some institutions block repeated BLAT server queries. In such cases, +administrators must whitelist the following IP ranges:

128.114.119.* (U.S. site: genome.ucsc.edu)
129.70.40.120 (European mirror: genome-euro.ucsc.edu) +

+Further details on gfServer options are available from the +Source Downloads page +(pre-compiled binaries are located in the blat/ directory) and the +blat documentation.

+gfServers may also be set up within +GBiB +for local operation; see the +GBiB assembly BLAT setup +guide for detailed instructions. + +

To terminate a gfServer instance, run:

gfServer stop localhost 17860

+ + +

Troubleshooting BLAT servers

+Errors may occur if the port numbers for the translated (transBlat) and nucleotide (blat) servers are swapped in genomes.txt. A typical message in this case is:

Expecting 6 words from server got 2

If a gfServer instance is started from the same directory as the .2bit file, for example:

+gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit &

an attempt to run a DNA sequence query through the web-based BLAT tool may return:

+Error in TCP non-blocking connect() 111 - Connection refused
+Operation now in progress
+Sorry, the BLAT/iPCR server seems to be down. Please try again later.
+

+ + +

Process check
+ Confirm that a gfServer process is running:

ps aux | grep gfServer

Verify path and filename
+ In the genomes.txt, the twoBitPath/filename must match the .2bit file + used when starting gfServer. The location of the gfServer instance can + be verified by changing into the directory where gfServer was launched and running + the appropriate hostname command. +
```
hostname -i
```
+ This will return an IP address, for example: + 132.249.245.79
+ Test the connection with telnet:
```
telnet yourIP yourPort
```
+ For example: +
```
telnet 132.249.245.79 17777
```
+ A successful connection shows: +
```
Connected to 132.249.245.79
```
+ If Connection refused appears, gfServer may not be running, or the + IP/port configuration is incorrect.
+ The genomes.txt file should also be checked to confirm that the BLAT + line matches the correct IP and port. For example: +
```
blat 132.249.245.79 17777
```
+ Instead of: +
```
blat localhost 17777
```
Check gfServer status
+ Request status directly from gfServer: +
```
gfServer status yourLocation yourPort
```
+ For example: +
```
gfServer status 132.249.245.79 17777
```
+ Sample output might look like:

+version 36x2
+type nucleotide
+host localhost
+port 17777
+tileSize 11
+stepSize 5
+minMatch 2
+pcr requests 0
+blat requests 0
+bases 0
+misses 0
+noSig 1
+trimmed 0
+warnings 0
+

Test with gfClient
+ A reliable troubleshooting method is to bypass the web interface and use the command-line utility gfClient. If gfClient successfully connects to gfServer, the IP/port configuration is correct, independently of the browser interface. From the directory containing the hub's .2bit file, the command can be executed as follows:
```
gfClient yourLocation yourPort pathTo2bitFile yourFastaQuery.fa output.psl
```
+ For example: +
```
gfClient localhost 17777 . query.fa gfOutput.psl
```
+ Note the . after the port, which tells gfClient to use + the .2bit file in the current directory. Check gfOutput.psl for BLAT results.
+
+ Ensure that the yourAssembly.2bit file is present on the test machine. +
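The genomes.txt check described above (an IP address or hostname rather than localhost) can be automated. The following is a minimal sketch, on the assumption that a remote Genome Browser would resolve localhost to itself rather than to your server; find_local_blat is a hypothetical helper, not a UCSC tool.

```shell
#!/bin/sh
# find_local_blat FILE (hypothetical helper): list blat, transBlat, and isPcr
# lines in genomes.txt whose host is localhost or 127.0.0.1.
find_local_blat() {
    awk '($1 == "blat" || $1 == "transBlat" || $1 == "isPcr") &&
         ($2 == "localhost" || $2 == "127.0.0.1") { print NR ": " $0 }' "$1"
}
```

Running find_local_blat genomes.txt prints the line number and text of each suspect line, and prints nothing when all server lines use a routable host.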

+ + +

Configuring assembly hubs to use a dynamic gfServer

A dynamic BLAT server is specified with the "dynamic" argument to the +blat, transBlat, and isPcr definitions in the hub +genomes.txt file, followed by the gfServer root-relative path of the +directory containing the .2bit and .gfidx files.

For example:

+blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+

The genome and gfServer indexes would be:

+$rootdir/yourAssembly/yourAssembly.2bit
+$rootdir/yourAssembly/yourAssembly.untrans.gfidx
+$rootdir/yourAssembly/yourAssembly.trans.gfidx
+

Refer to the Building gfServer indexes section for detailed instructions on building the index.

For large hubs, it is possible to have more deeply nested directories. For instance, the +following NCBI convention:

+blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+

These entries reference the following genome files and indexes:

+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx
+
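In both examples, the file names are derived from the last component of the path given after "dynamic". The following is a minimal sketch of that mapping, assuming the naming pattern shown in the listings above; dynamic_paths is a hypothetical helper and /data/gfRoot is a made-up stand-in for the gfServer root directory.

```shell
#!/bin/sh
# dynamic_paths ROOTDIR GENOME_DATA_DIR (hypothetical helper): print the
# .2bit and index paths that the listings above pair with a "dynamic" entry.
dynamic_paths() {
    name=$(basename "$2")    # last path component names the files
    for ext in 2bit untrans.gfidx trans.gfidx; do
        printf '%s/%s/%s.%s\n' "$1" "$2" "$name" "$ext"
    done
}

# /data/gfRoot is an assumed root directory, used only for illustration.
dynamic_paths /data/gfRoot GCF/000/181/335/GCF_000181335.3
```

Checking that each printed path exists on the server is a quick way to confirm that a genomes.txt entry and the on-disk layout agree.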

+ + + +

Checking gfServer status for dynamic servers

A status query that does not specify a genome acts as an "I am alive" check:

+% gfServer status myserver 4040
+version 37x1
+serverType dynamic
+

Specifying -genome checks that the genome is valid and reports how the index was built:

+% gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040
+version 37x1
+serverType dynamic
+type nucleotide
+tileSize 11
+stepSize 5
+minMatch 2
+

Using -trans checks the translated index:

+% gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040
+version 37x1
+serverType dynamic
+type translated
+tileSize 4
+stepSize 4
+minMatch 3
+

+ +
