5295d58d808a9ef30d0514558d943b7dbf4bd4f7 gperez2 Tue Sep 2 16:03:33 2025 -0700
Updating the GB page for the Assembly Hub Wiki, refs #34740
diff --git src/hg/htdocs/goldenPath/help/assemblyHubHelp.html src/hg/htdocs/goldenPath/help/assemblyHubHelp.html
new file mode 100755
index 00000000000..ce79fca0e8c
--- /dev/null
+++ src/hg/htdocs/goldenPath/help/assemblyHubHelp.html
@@ -0,0 +1,848 @@
+
+
+
+
+
+
+
+An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence +to be used for a browser instance, as well as all the data files that define the annotation for +that sequence. Assembly Data Hubs allow researchers to use the UCSC Genome Browser to view their +own sequences with associated annotation, without the requirement that UCSC support a browser on that sequence. +
+ ++Note: if you are working with a genome that has already been submitted +to the NCBI Assembly system, it may +already be available in the UCSC Genome Browser. +Please check the GenArk Assembly Hub collection +to see if your genome of interest is already available. If it is not listed there, you can use the +UCSC Assembly Request page to request that the genome assembly be +added.
+ + ++To display a novel genome sequence in the UCSC Genome Browser, a web server hosted by the +institution (or a free service such as Cyverse) +can be used. For environments operating behind a firewall, hub files can also be loaded locally +through GBiB to provide access to the UCSC Genome +Browser. Hosting hub files over HTTP is strongly recommended, as it is +significantly more efficient than FTP. A hierarchical directory structure must then be +established to organize the files associated with the genome sequence. For example: +
+
+
+myHub/ - directory to organize your files on this hub
+  hub.txt - primary reference text file to define the hub, refers to:
+  genomes.txt - definitions for each genome assembly on this hub
+  newOrg1/ - directory of files for this specific genome assembly
+    newOrg1.2bit - '2bit' file constructed from your fasta sequence
+    description.html - information about this assembly for users
+    trackDb.txt - definitions for tracks on this genome assembly
+    groups.txt - definitions for track groups on this assembly
+    bigWig and bigBed files - data for tracks on this assembly
+    external track hub data tracks
+
+
+The hub can be referenced by a URL such as: http://yourLab.yourInstitution.edu/myHub/hub.txt
+ ++The initial file, hub.txt, is the primary URL reference for the assembly hub:
+Format of the file:
+
+hub hubName
+shortLabel genome
+longLabel Comment describing this hub contents
+genomesFile genomes.txt
+email contactEmail@institution.edu
+descriptionUrl aboutHub.html
+
+
+shortLabel is the name that will appear in the genome pull-down menu at the +UCSC gateway page.
++genomesFile is a reference to the next definition file in this chain that will +describe the assemblies and tracks available at this hub. Typically, genomes.txt is at +the same directory level as this hub.txt; however, it can also be a relative path +reference to a different directory level.
++email provides users with a contact point for questions related to this assembly hub.
++descriptionUrl specifies a relative path or URL link to a webpage describing the hub.
++You can view a working example at hub.txt
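As a minimal sketch, the hub.txt settings described above can be written out from a handful of values. The helper name `render_hub_txt` and the sample values are illustrative assumptions and not part of any UCSC tool; only the setting names come from the format shown above.

```python
def render_hub_txt(hub, short_label, long_label, email,
                   genomes_file="genomes.txt", description_url=None):
    """Return hub.txt content as 'setting value' lines (hypothetical helper)."""
    settings = [
        ("hub", hub),
        ("shortLabel", short_label),
        ("longLabel", long_label),
        ("genomesFile", genomes_file),
        ("email", email),
    ]
    # descriptionUrl is only written when a value is supplied.
    if description_url:
        settings.append(("descriptionUrl", description_url))
    return "\n".join(f"{key} {value}" for key, value in settings) + "\n"


if __name__ == "__main__":
    print(render_hub_txt("hubName", "genome",
                         "Comment describing this hub contents",
                         "contactEmail@institution.edu",
                         description_url="aboutHub.html"), end="")
```

Generating the file this way keeps the setting order stable, which makes hubs easier to compare when several are maintained side by side.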
+ + +The genomes.txt file provides references to the genome assemblies and tracks available in +the assembly hub.
+
+genome ricCom1
+trackDb ricCom1/trackDb.txt
+groups ricCom1/groups.txt
+description July 2011 Castor bean
+twoBitPath ricCom1/ricCom1.2bit
+organism Ricinus communis
+defaultPos E09R7372:1000000-2000000
+orderKey 4800
+scientificName Ricinus communis
+htmlPath ricCom1/description.html
+transBlat yourLab.yourInstitution.edu 17777
+blat yourLab.yourInstitution.edu 17779
+isPcr yourLab.yourInstitution.edu 17779
+
+
+Multiple assembly definitions can be included in a single file, separated by blank lines. The file +references are relative paths. In this example, the subdirectory ricCom1 contains +the files for this specific assembly.
+Note: it is strongly recommended that each genome stanza include defaultPos,
+scientificName, organism, and description, so that the hub loads with
+meaningful defaults and can be found more easily from the Gateway page.
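The recommendation above can be checked mechanically. The following is a rough sketch that splits genomes.txt text into blank-line-separated stanzas and lists which of the four recommended settings are missing from each. The function names are hypothetical, and the parser assumes simple "setting value" lines only.

```python
RECOMMENDED = ("defaultPos", "scientificName", "organism", "description")


def parse_stanzas(text):
    """Split 'setting value' text into one dict per blank-line-separated stanza."""
    stanzas, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current stanza, if any.
            if current:
                stanzas.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def missing_recommended(stanza):
    """Return the recommended settings that are absent or empty in a stanza."""
    return [key for key in RECOMMENDED if not stanza.get(key)]


if __name__ == "__main__":
    sample = "genome ricCom1\ntrackDb ricCom1/trackDb.txt\norganism Ricinus communis\n"
    for stanza in parse_stanzas(sample):
        print(stanza["genome"], "is missing:", missing_recommended(stanza))
```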
+ + ++The .2bit file is constructed from the FASTA sequence for the assembly using the +faToTwoBit kent program (available from the +downloads page).
+Example:
++faToTwoBit ricCom1.fa ricCom1.2bit ++
+Use twoBitInfo to verify sequences and create a chrom.sizes file, +which is not used in the hub itself but is helpful for constructing big* files: +
++twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes ++
+The .2bit file can also be hosted at a URL:
+
+twoBitInfo -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
+
+
+To extract sequences from a .2bit file: +
+
+twoBitToFa -seq=chrCp -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
+
+
+
+
The groups.txt file defines the grouping of track controls under the Genome Browser graphic +display.
+Example:
+
+name map
+label Mapping
+priority 2
+defaultIsClosed 0
+
+
+
Refer to the Adding Groups to a Track hub section of the Track Hubs help page for more +details.
+ + +
+Traditionally, an assembly hub required multiple configuration files (hub.txt,
+genomes.txt, trackDb.txt, and optionally groups.txt), along
+with a .2bit file for the sequence. The useOneFile on option simplifies
+this by consolidating everything into a single configuration file. Note: The single-file
+format supports one genome assembly per file. For multiple assemblies, use the traditional
+multi-file setup.
Example configuration:
+
+hub mySingleFileHub
+shortLabel My Single-File Hub
+longLabel An example of a single-file UCSC track hub
+useOneFile on
+email myEmail@example.com
+
+genome hg19
+
+track exampleBigWig
+shortLabel BigWig Coverage
+longLabel Coverage data over hg19
+type bigWig
+visibility full
+bigDataUrl http://myServer.com/data/example.bigWig
+
+track exampleVCF
+shortLabel VCF Variants
+longLabel Variant calls over hg19 region
+type vcfTabix
+visibility pack
+bigDataUrl http://myServer.com/data/example.vcf.gz
+
+
+
hub.txt.genomes.txt.trackDb.txt.
+If your hub requires a reference genome sequence, you can still provide a .2bit file
+with twoBitPath. Grouping (previously in
+groups.txt) can also be integrated here if needed.
+
+Once hosted on a server, the single configuration file (and associated data files such as
+.bigWig, .vcf.gz, .2bit) can be loaded into the UCSC Genome
+Browser via the My Hubs page.
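Two properties of the single-file format described above are easy to verify before uploading: the file declares useOneFile on, and it contains exactly one genome line. The sketch below checks only those two points. The function name is hypothetical and this is not a substitute for a full hub validation tool.

```python
def check_single_file_hub(text):
    """Return a list of problems found in a useOneFile hub file (empty if none)."""
    # Collapse runs of whitespace so 'useOneFile   on' is still recognised.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    problems = []
    if "useOneFile on" not in lines:
        problems.append("missing 'useOneFile on'")
    # 'genome ' with a trailing space does not match 'genomesFile'.
    genome_lines = [line for line in lines if line.startswith("genome ")]
    if len(genome_lines) != 1:
        problems.append(
            f"expected exactly one genome line, found {len(genome_lines)}")
    return problems


if __name__ == "__main__":
    example = "hub mySingleFileHub\nuseOneFile on\n\ngenome hg19\n\ntrack exampleBigWig\n"
    print(check_single_file_hub(example) or "no problems found")
```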
Tracks are defined in the trackDb.txt file, where each stanza specifies how +tracks are displayed (shortLabel, longLabel, color, visibility), along with other information such +as the group the track belongs to (referencing groups.txt) and whether +additional HTML should be displayed when a user clicks into the track or a track item:
+
+track gap_
+longLabel Gap
+shortLabel Gap
+priority 11
+visibility dense
+color 0,0,0
+bigDataUrl bbi/ricCom1.gap.bb
+type bigBed 4
+group map
+html ../trackDescriptions/gap
+
+
+For more information about the syntax of the trackDb.txt file, refer to the +Track Database Definition page. +
+Processing genomes to construct tracks often requires a cluster or supercomputer. Small +genomes can be processed on single computers with multiple cores. The process for each track is +unique. For details, refer to the + + Browser Track Construction page, which discusses constructing tracks for assembly +hubs.
+ + ++Assembly hubs can include a Cytoband track, which allows quicker navigation of chromosomes and +displays banding pattern information, if known.
+
+A simple version of the track can be built using the existing chrom.sizes file for your assembly.
+Banding options include: gneg, gpos25,
+ gpos50, gpos75, gpos100, acen, gvar, or stalk.
Example:
+
+cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
+
+
+The resulting BED file can be converted into a bigBed file and associated with an .as
+definition file (see
+example) to
+inform the browser that this is not a standard BED:
+bedToBigBed -type=bed4 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed ++
+In trackDb.txt, if the track is named cytoBandIdeo (e.g., +track cytoBandIdeo), it will automatically load into the assembly +hub.
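The chrom.sizes to cytoBandIdeo.bed step can also be sketched in Python. This version writes one tab-separated gneg band spanning each sequence, mirroring the awk one-liner above. The function names are hypothetical, and sorting is by plain string order, which can differ from a locale-aware shell sort.

```python
def cytoband_rows(chrom_sizes_text, stain="gneg"):
    """Build one (chrom, start, end, name, stain) row per chrom.sizes line."""
    rows = []
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue
        name, size = line.split()[:2]
        # A single band covering the whole sequence, named after the sequence.
        rows.append((name, 0, int(size), name, stain))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def to_bed(rows):
    """Render rows as tab-separated BED text."""
    return "".join("\t".join(str(field) for field in row) + "\n" for row in rows)


if __name__ == "__main__":
    sizes = "chr2\t19698289\nchr1\t30427671\n"
    print(to_bed(cytoband_rows(sizes)), end="")
```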
+ + ++Direct links to the genome(s) within the assembly hub can then be constructed.
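A direct link can be assembled with standard URL encoding, as in the sketch below. The hubUrl parameter appears in this page's hgBlat example; the hgTracks endpoint and the genome and position parameter names are assumptions based on common Genome Browser URL usage, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed browser endpoint; adjust for a mirror or a different CGI.
BROWSER = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def hub_link(hub_url, genome, position=None, base=BROWSER):
    """Return a browser URL that loads the hub and opens the given genome."""
    params = {"hubUrl": hub_url, "genome": genome}
    if position:
        params["position"] = position
    # urlencode percent-encodes ':' and '/' so the hub URL survives as one value.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(hub_link("http://yourLab.yourInstitution.edu/myHub/hub.txt",
                   "ricCom1", "E09R7372:1000000-2000000"))
```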
++Resources for automatically building assembly hubs include G-OnRamp and MakeHub.
+ + ++G-OnRamp is a Galaxy workflow that turns a genome assembly and RNA-Seq data into a Genome Browser +with multiple evidence tracks. Since G-OnRamp is based on the Galaxy platform, becoming familiar +with Galaxy concepts and functionalities is recommended. See their +instruction page +for an overview. +
+ + ++MakeHub is a command-line tool for fully automatic generation of track data hubs for visualizing +genomes with the UCSC Genome Browser. More information is available on their +GitHub page.
+ + ++There is a collection of example NCBI assembly hubs that can be used directly or copied as +templates. A large collection of script-generated assembly hubs can be browsed on the development server, with +links defaulting to the genome-test site. To load these hubs on the public UCSC site, copy +the hub.txt link and replace the test server domain with the public domain.
++The following table provides links to launch various assembly hubs grouped by species subsets. By +scrolling down each page, you can access rows for individual assemblies (or groups of assemblies, +e.g., bacteria). Clicking the "common name" hyperlink (e.g., "African bush +elephant" on the Vertebrate Mammalian page) loads the selected hub.
+ + + +These assemblies use NCBI accession naming patterns. Prototype gene tracks from NCBI gene +predictions are available for a few assemblies. No BLAT servers are provided. Users can copy the +skeleton structure of a hub to run their own BLAT server locally. Brief instructions are available +on each assembly gateway page under "Download files for this assembly hub." + + +
+Here are some quick steps to load an example hub from this collection, along with an explanation +of how to view the files behind the hub.
++https://genome-test.gi.ucsc.edu/... ++ to +
+https://genome.ucsc.edu/... ++
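The domain swap shown above can be scripted when many links need converting. This sketch replaces only the host portion of a link and leaves the path and query untouched. The function name is hypothetical, and the two host names are taken from the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_public(url, test_host="genome-test.gi.ucsc.edu",
              public_host="genome.ucsc.edu"):
    """Swap the test server host for the public host; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.netloc != test_host:
        return url
    return urlunsplit(parts._replace(netloc=public_host))


if __name__ == "__main__":
    print(to_public("https://genome-test.gi.ucsc.edu/cgi-bin/hgGateway?genome=example"))
```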
+To better understand how the hub works, you can review the associated files:
+The genomes.txt file
+ defines each assembly in the hub. It points to the genome's .2bit file
+ (twoBitPath) and specifies the trackDb file that contains the
+ track definitions. (In the case of this large hub with 204 assemblies, the main
+ genomes.txt file is one directory up, and this stanza is included there.)
+The trackDb.txt
+ file defines the tracks displayed in the hub. It contains bigDataUrl lines
+ that tell the Browser where to retrieve data for each track, along with optional
+ settings.
+BLAT servers (gfServer) can be configured as either dedicated or
+dynamic. The dedicated configuration is described first, followed by the dynamic one.
+When running a local BLAT server, assembly hubs can be configured to support BLAT searches by +adding entries to the + genomes.txt file.
++Installation and configuration details for gfServer are provided in the +Running your own gfServer +page.
+
+In the genomes.txt stanza for the target assembly, include the following lines (note
+the capital B in transBlat):
+transBlat yourServer.yourInstitution.edu 17777
+blat yourServer.yourInstitution.edu 17779
+isPcr yourServer.yourInstitution.edu 17779
+
+
With this configuration, BLAT and PCR searches become available for the assembly. +For example:
++http://genome.ucsc.edu/cgi-bin/hgBlat?hubUrl=http://yourServer.yourInstitution.edu/myHub/hub.txt ++
+This URL opens the BLAT interface, where the assembly will appear in the Genome drop-down menu.
+The isPcr line enables the use of a different gfServer instance for PCR queries if
+desired.
Firewall note: Some institutions block repeated BLAT server queries. In such cases, +administrators must whitelist the following IP ranges:
+128.114.119.* (U.S. site: genome.ucsc.edu)129.70.40.120 (European mirror: genome-euro.ucsc.edu)
++Further details on gfServer options are available from the +Source Downloads page +(pre-compiled binaries are located in the blat/ directory) and the +blat documentation.
++gfServers may also be set up within +GBiB +for local operation; see the +GBiB assembly BLAT setup +guide for detailed instructions. + +
To terminate a gfServer instance, run:
+gfServer stop localhost 17860+ + +
+Errors may occur if the transBlat (translated) and blat (nucleotide) port numbers are reversed. A typical
+message in this case is:
+Expecting 6 words from server got 2+
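Reversed ports cannot be detected from genomes.txt alone, but one related mistake can: a dedicated blat and transBlat entry pointing at the same host and port. The sketch below flags that case and skips dynamic entries, which legitimately share an endpoint. The function names are hypothetical.

```python
def blat_endpoints(stanza_text):
    """Map blat/transBlat/isPcr to (host, port, is_dynamic) from a genomes.txt stanza."""
    endpoints = {}
    for line in stanza_text.splitlines():
        words = line.split()
        if len(words) >= 3 and words[0] in ("blat", "transBlat", "isPcr"):
            is_dynamic = len(words) >= 4 and words[3] == "dynamic"
            endpoints[words[0]] = (words[1], int(words[2]), is_dynamic)
    return endpoints


def shares_dedicated_endpoint(endpoints):
    """True if dedicated blat and transBlat entries use the same host and port."""
    blat = endpoints.get("blat")
    trans = endpoints.get("transBlat")
    if blat is None or trans is None:
        return False
    if blat[2] or trans[2]:
        # Dynamic servers share one endpoint by design.
        return False
    return blat[:2] == trans[:2]


if __name__ == "__main__":
    stanza = "transBlat yourServer.yourInstitution.edu 17777\nblat yourServer.yourInstitution.edu 17777\n"
    print("shared endpoint:", shares_dedicated_endpoint(blat_endpoints(stanza)))
```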
If a gfServer instance is started from the same directory as the .2bit file, for example:
++gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit &+
an attempt to run a DNA sequence query through the web-based BLAT tool may return:
+
+Error in TCP non-blocking connect() 111 - Connection refused
+Operation now in progress
+Sorry, the BLAT/iPCR server seems to be down. Please try again later.
+
+
+
+
Confirm that a gfServer process is running:
+ps aux | grep gfServer
+In genomes.txt, the twoBitPath/filename must match the .2bit file
+ used when starting gfServer. The location of the gfServer instance can
+ be verified by changing into the directory where gfServer was launched and running
+ the appropriate hostname command.
+ hostname -i
+ This will return an IP address, for example:
+ 132.249.245.79
+Test the connection with telnet:
+ telnet yourIP yourPort
+ For example:
+ telnet 132.249.245.79 17777
+ A successful connection shows:
+ Connected to 132.249.245.79
+ If Connection refused appears, gfServer may not be running, or the
+ IP/port configuration is incorrect.
+The genomes.txt file should also be checked to confirm that the BLAT
+ line matches the correct IP and port. For example:
+ blat 132.249.245.79 17777
+ Instead of:
+ blat localhost 17777
+Check the status of gfServer:
+ gfServer status yourLocation yourPort
+ For example:
+ gfServer status 132.249.245.79 17777
+ Sample output might look like:
+
+version 36x2
+type nucleotide
+host localhost
+port 17777
+tileSize 11
+stepSize 5
+minMatch 2
+pcr requests 0
+blat requests 0
+bases 0
+misses 0
+noSig 1
+trimmed 0
+warnings 0
+
+
+Test with gfClient. If gfClient successfully
+ connects to gfServer, the IP/port configuration is correct. Running
+ gfClient directly verifies connectivity independently of the browser
+ interface. From the directory containing the hub's .2bit file, the
+ command can be executed as follows:
+ gfClient yourLocation yourPort pathTo2bitFile yourFastaQuery.fa output.psl
+ For example:
+ gfClient localhost 17777 . query.fa gfOutput.psl
+ Note the . after the port, which tells gfClient to use
+ the .2bit file in the current directory. Check gfOutput.psl for BLAT results.
+DNA test (against the nucleotide blat port):
+ gfClient yourServer.yourInstitution.edu 17779 `pwd` test.fa dnaTestOut.psl
+ Protein test (against the translated transBlat port):
+ gfClient -t=dnaX -q=prot yourServer.yourInstitution.edu 17777 `pwd` proteinSequence.fa proteinOut.psl
+
+ Ensure that the yourAssembly.2bit file is present on the test machine.
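The telnet step in the checklist above only tests whether a TCP connection can be opened. Where telnet is not installed, the same check can be sketched with Python's standard socket module. The function name is hypothetical, and the host and port in the usage example are the placeholder values from the steps above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host, port = "localhost", 17777
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

A successful connection shows only that something is listening on the port. Running gfServer status or gfClient, as described above, is still needed to confirm that the listener is the expected gfServer.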
+A dynamic BLAT server is specified with the "dynamic" argument to the
+blat, transBlat, and isPcr definitions in the hub
+genomes.txt file, followed by the gfServer root-relative path of the
+directory containing the .2bit and .gfidx files.
For example:
+
+blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly
+
+
The genome and gfServer indexes would be:
+
+$rootdir/yourAssembly/yourAssembly.2bit
+$rootdir/yourAssembly/yourAssembly.untrans.gfidx
+$rootdir/yourAssembly/yourAssembly.trans.gfidx
+
+
Refer to the
+Building gfServer indexes section for detailed instructions on building
+ the index.
+For large hubs, it is possible to have more deeply nested directories. For instance, the +following NCBI convention:
+
+blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
+
+
These entries reference the following genome files and indexes:
+
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx
+$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx
+
+
+
+
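The mapping from a dynamic path to the files gfServer expects follows a fixed pattern in both examples above: the last path component is reused as the file base name. The sketch below reproduces that pattern and also derives the nested NCBI-style directory from an accession. Both function names are hypothetical, and the accession split assumes a nine-digit number as in the example.

```python
import posixpath


def dynamic_files(root_dir, genome_path):
    """Return the .2bit and index paths expected for a dynamic genome path."""
    name = posixpath.basename(genome_path)
    base = posixpath.join(root_dir, genome_path, name)
    return [base + ".2bit", base + ".untrans.gfidx", base + ".trans.gfidx"]


def ncbi_path(accession):
    """Turn GCF_000181335.3 into GCF/000/181/335/GCF_000181335.3."""
    prefix, rest = accession.split("_", 1)
    digits = rest.split(".", 1)[0]
    # Assumes nine digits, split into three groups of three.
    return "/".join([prefix, digits[0:3], digits[3:6], digits[6:9], accession])


if __name__ == "__main__":
    for path in dynamic_files("/rootdir", ncbi_path("GCF_000181335.3")):
        print(path)
```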
A status query that does not specify a genome acts as an "I am alive" check:
+
+% gfServer status myserver 4040
+version 37x1
+serverType dynamic
+
+
Specifying a -genome checks that it is valid and provides information on how the index was
+built:
+% gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040
+version 37x1
+serverType dynamic
+type nucleotide
+tileSize 11
+stepSize 5
+minMatch 2
+
Using -trans checks the translated index:
+% gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040
+version 37x1
+serverType dynamic
+type translated
+tileSize 4
+stepSize 4
+minMatch 3
+
+
+
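When these checks are scripted, the status output can be read into a dictionary and compared against the expected index type. The sketch below assumes the "name value" layout shown in the examples above, treats the last word on each line as the value, and skips the echoed command line. The function name is hypothetical.

```python
def parse_status(output):
    """Parse gfServer status text into a dict of setting name to value string."""
    info = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the echoed shell prompt line.
        if not line or line.startswith("%"):
            continue
        key, sep, value = line.rpartition(" ")
        if sep:
            info[key] = value
    return info


if __name__ == "__main__":
    sample = "version 37x1\nserverType dynamic\ntype translated\ntileSize 4\n"
    status = parse_status(sample)
    print(status["type"], status["tileSize"])
```

Splitting on the last space keeps multi-word names such as "pcr requests" intact, which matters for the dedicated-server output shown earlier on this page.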