f4670e11533d6f2a8e2f4f575bb294a2fb0214cb gperez2 Thu Mar 26 12:44:09 2026 -0700 Removing the assemblyHubGuidelines.html since it is an early version draft of the assemblyHubHelp.html, refs #37285 diff --git src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html deleted file mode 100755 index 91c47cca13e..00000000000 --- src/hg/htdocs/goldenPath/help/assemblyHubGuidelines.html +++ /dev/null @@ -1,750 +0,0 @@ - - - - - - - -
Please note, if you are working with a genome that has already -been submitted to the -NCBI -Assembly system, it may already be available in the -UCSC Genome Browser.
-Please examine the -GenArk Assembly -Hub collection to see if your genome of interest is already -available. In the case it cannot be found there, you can use the -UCSC Assembly -Request page to request a genome assembly be added to the -UCSC Genome Browser. -
- -- The Assembly Hub function allows you to display your novel - genome sequence using the UCSC Genome Browser. -
- - -To display your novel genome sequence, use a web server at -your institution (or a free service like -Cyverse) -to supply your files to the UCSC Genome Browser. For usage -behind a firewall, you can also load the files locally -through docker. Note that hosting hub files on HTTP -is highly recommended and much more efficient than FTP. You then -establish a hierarchy of directories and files to host your -novel genome sequence. For example:
- --myHub/ - directory to organize your files on this hub - hub.txt - primary reference text file to define the hub, refers to: - genomes.txt - definitions for each genome assembly on this hub - newOrg1/ - directory of files for this specific genome assembly - newOrg1.2bit - '2bit' file constructed from your fasta sequence - description.html - information about this assembly for users - trackDb.txt - definitions for tracks on this genome assembly - groups.txt - definitions for track groups on this assembly - bigWig and bigBed files - data for tracks on this assembly - external track hub data tracks --
The URL to reference this hub would be: -http://yourLab.yourInstitution.edu/myHub/hub.txt
-Note: there is now a useOneFile on hub
-setting that allows the hub properties to be specified in a
-single file. More information about this setting can be found
-on the
-Genome
-Browser User Guide.
You can view a working example hierarchy of files at: -Plants
-A smaller slice of this hub is represented in a -Quick -Start Guide to Assembly Hubs.
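As a starting point, the example hierarchy can be laid out with a few shell commands. This is only a minimal sketch: it assumes the placeholder names myHub and newOrg1 from the listing above and creates empty files that still need to be filled in. The .2bit, bigWig and bigBed files are not created here.

```shell
#!/bin/sh
# Sketch: create the example hub layout with empty placeholder files.
# "myHub" and "newOrg1" are the placeholder names from the example hierarchy.
set -eu

hub_dir="myHub"
genome="newOrg1"

mkdir -p "$hub_dir/$genome"

# Top-level definition files; touch does not overwrite existing content.
touch "$hub_dir/hub.txt" "$hub_dir/genomes.txt"

# Per-assembly files; the .2bit file itself is built separately from fasta.
for f in description.html trackDb.txt groups.txt; do
    touch "$hub_dir/$genome/$f"
done

# List what was created.
find "$hub_dir" -type f | sort
```

Running the script from the document root of your web server would give the hub.txt URL the same shape as the example URL above.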
- - -You can build direct links to the genome(s) in your assembly -hub:
-- The initial file - hub.txt - is the primary URL reference for your assembly hub. The - format of the file: -
--hub hubName -shortLabel genome -longLabel Comment describing this hub contents -genomesFile genomes.txt -email contactEmail@institution.edu -descriptionUrl aboutHub.html --
- shortLabel is the name that will appear in - the genome pull-down menu at the UCSC gateway page. Example: - Plants. -
-- genomesFile is a reference to the next - definition file in this chain that will describe the - assemblies and tracks available at this hub. Typically - genomes.txt is at the same directory level as this - hub.txt, however it can also be a relative path - reference to a different directory level. -
-- The email address provides users a contact - point for queries related to this assembly hub. -
-- The descriptionUrl provides a relative path - or URL link to a webpage describing the overall hub. -
- - -The -genomes.txt -file provides the references to the genome assemblies and tracks -available at this assembly hub. The example file indicates the -typical contents:
--genome ricCom1 -trackDb ricCom1/trackDb.txt -groups ricCom1/groups.txt -description July 2011 Castor bean -twoBitPath ricCom1/ricCom1.2bit -organism Ricinus communis -defaultPos E09R7372:1000000-2000000 -orderKey 4800 -scientificName Ricinus communis -htmlPath ricCom1/description.html -transBlat yourLab.yourInstitution.edu 17777 -blat yourLab.yourInstitution.edu 17779 -isPcr yourLab.yourInstitution.edu 17779 --
There can be multiple assembly definitions in this single -file. Separate these stanzas with blank lines. The references -to other files are relative path references. In this example -there is a sub-directory here called ricCom1 which contains the -files for this specific assembly.
-Note that it is strongly encouraged to give each of your -genome stanzas a line for defaultPos, scientificName, -organism, and description (along with the other settings above) so that -when your hub is attached it will load a specified default -location and have text that is more easily searched from the -Gateway page.
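The recommendation above can be checked mechanically. The following is a rough sketch, not a UCSC tool: check_genomes is a hypothetical helper name, and it assumes stanzas are separated by blank lines and each setting starts its own line, as in the example file. It prints one line for every recommended setting that a stanza lacks.

```shell
#!/bin/sh
# check_genomes FILE: hypothetical helper that reports which of the
# recommended settings are missing from each blank-line separated stanza.
check_genomes() {
    awk '
    function report(   i, n, want) {
        if (genome == "") return
        n = split("defaultPos scientificName organism description", want, " ")
        for (i = 1; i <= n; i++)
            if (!(want[i] in seen))
                print genome ": missing " want[i]
    }
    function reset(   k) {
        for (k in seen) delete seen[k]
        genome = ""
    }
    NF == 0 { report(); reset(); next }   # blank line ends a stanza
    $1 == "genome" { genome = $2 }
    { seen[$1] = 1 }                      # remember every setting name
    END { report() }
    ' "$1"
}

# Demo file: the second stanza lacks defaultPos and organism.
cat > demo_genomes.txt <<'EOF'
genome ricCom1
trackDb ricCom1/trackDb.txt
description July 2011 Castor bean
organism Ricinus communis
defaultPos E09R7372:1000000-2000000
scientificName Ricinus communis

genome newOrg1
trackDb newOrg1/trackDb.txt
description Placeholder assembly
scientificName Placeholder species
EOF

check_genomes demo_genomes.txt
```

For the demo file this reports the two missing settings of newOrg1 and nothing for ricCom1.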
- - -- The .2bit file is constructed from the fasta - sequence for the assembly. The kent source program - faToTwoBit is used to construct this file. - Download the program from the - downloads - section of the Browser. For example: -
--faToTwoBit ricCom1.fa ricCom1.2bit --
- Use the twoBitInfo command to verify the sequences - in this assembly and create a chrom.sizes - file, which is not used in the hub but is useful in later - processing to construct the big* files: -
--twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes --
- The .2bit commands can function with the - .2bit file at a URL: -
--twoBitInfo -udcDir=. http://genome-test.gi.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes --
- Sequence can be extracted from the .2bit file with - the twoBitToFa command, for example: -
--twoBitToFa -seq=chrCp -udcDir=. http://genome-test.gi.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa -- -
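If you want an independent cross-check of the chrom.sizes file, sequence lengths can also be counted directly from the fasta with plain awk. This is a swapped-in technique, not part of the kent tools: fa_sizes is a hypothetical helper, and it assumes an uncompressed fasta with Unix line endings where the sequence name is the first word after the ">".

```shell
#!/bin/sh
# fa_sizes FILE: hypothetical helper that prints "name<TAB>length" for each
# fasta record, largest first, for comparison with twoBitInfo output.
fa_sizes() {
    awk '
    /^>/ {
        if (name != "") print name "\t" len
        name = substr($1, 2)   # first word of the header, without ">"
        len = 0
        next
    }
    { len += length($0) }      # sequence lines only
    END { if (name != "") print name "\t" len }
    ' "$1" | sort -k2,2nr
}

# Tiny demo fasta with two records.
printf '>chr1 demo record\nACGT\nACG\n>chrCp\nAC\n' > demo.fa
fa_sizes demo.fa
```

The result can be compared against ricCom1.chrom.sizes with diff. A mismatch would suggest the fasta and the .2bit file are not from the same build.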
The -groups.txt -file defines the grouping of track controls under the primary -genome browser image display. The example referenced here has -the usual definitions as found in the UCSC Genome Browser.
-Each group is defined, for example the Mapping group:
--name map -label Mapping -priority 2 -defaultIsClosed 0 --
Tracks are defined in the trackDb.txt where -each stanza describes how tracks are displayed -(shortLabel/longLabel/color/visibility) and other information -such as what group the track should belong to (referencing the -groups.txt) and if any additional html should -display when one clicks into the track or a track item:
--track gap_ -longLabel Gap -shortLabel Gap -priority 11 -visibility dense -color 0,0,0 -bigDataUrl bbi/ricCom1.gap.bb -type bigBed 4 -group map -html ../trackDescriptions/gap --
For more information about the syntax of the -trackDb.txt file, see -UCSC's -Hub Track Database Definition page. It helps to have a -cluster super computer to process the genomes to construct -tracks. It can be done for small genomes on single computers -that have multiple cores. The process for each track is unique. -Please note the continuing document: -Browser -Track Construction for a discussion of constructing tracks -for your assembly hub.
- -Assembly hubs can have a Cytoband track that can allow for -quicker navigation of individual chromosomes and display banding -pattern information if known.
-A quick version of the track can be built using the existing -chrom.sizes files for your assembly (the banding options include -gneg, gpos25, gpos50, gpos75, gpos100, acen, gvar, or -stalk):
-
-cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
-
-The resulting bed file can be turned into a big bed and given -a .as file -(example -here) to inform the browser it is not a normal bed.
--bedToBigBed -type=bed4 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed --
In the trackDb, as long as the track is named cytoBandIdeo -(track -cytoBandIdeo -example) it will load in the assembly hub.
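The awk command shown earlier writes space-separated columns. If a downstream tool expects tab-separated BED, a variant along the following lines may help. It is a sketch under assumptions: make_ideo_bed is a hypothetical helper name, every chromosome gets a single gneg band spanning its full length, and the byte-order sort is a precaution for tools that expect it.

```shell
#!/bin/sh
# make_ideo_bed CHROM_SIZES: hypothetical helper that writes one full-length
# "gneg" band per sequence as tab-separated BED lines on stdout.
make_ideo_bed() {
    LC_ALL=C sort -k1,1 "$1" |
        awk 'BEGIN { OFS = "\t" } { print $1, 0, $2, $1, "gneg" }'
}

# Demo chrom.sizes (name, size), deliberately unsorted.
printf 'chr2\t200\nchr1\t100\n' > demo.chrom.sizes
make_ideo_bed demo.chrom.sizes > demo.cytoBandIdeo.bed
cat demo.cytoBandIdeo.bed
```

The columns are chromosome, start, end, band name and stain, in the same order as the original command.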
- - -There are resources for automatically building assembly hubs -available from -G-OnRamp -and -MakeHub.
-There is also a collection of Example NCBI assembly hubs that -are already working and can either be used or copied as a -template to build further hubs.
-- G-OnRamp is a Galaxy workflow that turns a genome assembly - and RNA-Seq data into a Genome Browser with multiple evidence - tracks. Because G-OnRamp is based on the Galaxy platform, - developing some familiarity with the key concepts and - functionalities of Galaxy would be beneficial prior to using - G-OnRamp. Visit the - G-OnRamp - website for an overview of their process. -
- -- MakeHub is a command line tool for the fully automatic - generation of track data hubs for visualizing genomes with - the UCSC genome browser. More information can be found on - their - GitHub - page. -
-Here are some quick steps to load an example hub from this -collection, and an attempt to explain how to look at the files -behind the hub.
-Now to investigate the files behind the hub to understand the -process involved:
-BLAT servers (gfServer) are configured as either dedicated or -dynamic servers. Dedicated BLAT servers index a genome when -started and remain running in memory to quickly respond to -requests. Dynamic BLAT servers pre-index genomes to files and are -run on demand to handle a BLAT request and then exit.
-Dedicated gfServers are easier to configure and faster to -respond. However, the server continually uses memory. A dynamic -gfServer is more appropriate with multiple assemblies and -infrequent use. Its response time is usually acceptable; -however, it varies with the speed of the disk containing the -index. With repeated access, the operating system will cache the -indexes in memory, improving response time.
- -By running your own BLAT server, you can add lines to the -genomes.txt file of your assembly hub to enable the browser to -access the server and activate blat searches.
-Please see -Running -your own gfServer for details on installing and configuring -both dedicated and dynamic gfServers.
--transBlat yourServer.yourInstitution.edu 17777 -blat yourServer.yourInstitution.edu 17779 -isPcr yourServer.yourInstitution.edu 17779 --
Please see more about -configuring -your blat gfServer to replicate the UCSC Browser's settings, -which will also have information about optimizing PCR results. -The -Source -Downloads page offers access to utilities with pre-compiled -binaries such as gfserver found in a blat/ directory for your -machine type -here -and further blat documentation -here, -and the gfServer usage statement for further options.
-Note that you can also set up gfServers in -docker and run them locally.
- -Note: You can stop your instance of gfServer with a command. -For example:
--gfServer stop localhost 17860 -- -
You can see this error if you have the translatedBlat / -nucleotideBlat port numbers the wrong way around:
--Expecting 6 words from server got 2 --
The following is an example of an error message seen when -running a DNA sequence query via the web-based BLAT -tool after loading a hub and starting a gfServer instance -(from the same directory as the 2bit file). For example, a command to -start an instance of gfServer:
--gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit & --
Example of a possible error message, from web-based BLAT -after attempting a web-based BLAT query:
--Error in TCP non-blocking connect() 111 - Connection refused -Operation now in progress -Sorry, the BLAT/iPCR server seems to be down. Please try again later. --
Check the following:
-First, make sure your gfServer instance is running.
-Type the following command to check for your running gfServer
-process:
ps aux | grep gfServer- -
In your genomes.txt file, does your twoBitPath/filename match
-what you specified in your command to start gfServer?
-In your genomes.txt file, is the location of your
-gfServer instance correct?
-To check this, you can cd into the directory where you started
-your gfServer, then type the command:
hostname -i-
Your result should be an IP address, for example, '132.249.245.79'.- -
Now you can test the connection to your port that you
-specified, with a simple telnet command.
-Type in the following command:
-telnet yourIP yourPort. For example:
telnet 132.249.245.79 17777-
The results should read, "Connected to 132.249.245.79".
-Otherwise, if gfServer isn't running or if you typed the wrong
-location in your telnet command, telnet will say, "Connection
-refused."
-In this example, check your genomes.txt file, and make sure
-your blat line reads, "blat 132.249.245.79 17777".
-You may need to change your genomes.txt file from, for example,
-"blat localhost 17777" to "blat 132.249.245.79 17777" (use your
-specific IP/host name where gfServer is running).
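To see at a glance which host and port each genome stanza points at, the blat, transBlat and isPcr lines can be pulled out of genomes.txt. This is a minimal sketch: list_blat_servers and the CHECK_HOST marker are hypothetical names, and it only flags localhost entries for review as described above. It does not contact the server.

```shell
#!/bin/sh
# list_blat_servers FILE: hypothetical helper that prints
# "genome setting host port" for each BLAT-related line and appends
# CHECK_HOST when the host is localhost.
list_blat_servers() {
    awk '
    $1 == "genome" { genome = $2 }
    $1 == "blat" || $1 == "transBlat" || $1 == "isPcr" {
        if ($2 == "localhost")
            print genome, $1, $2, $3, "CHECK_HOST"
        else
            print genome, $1, $2, $3
    }
    ' "$1"
}

# Demo stanza mixing a localhost entry with the example IP address.
cat > demo_blat_genomes.txt <<'EOF'
genome demo
twoBitPath demo/demo.2bit
blat localhost 17777
isPcr 132.249.245.79 17779
EOF

list_blat_servers demo_blat_genomes.txt
```

Each reported host and port pair can then be tested with the telnet command shown above.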
To request status from the gfServer process, run:
-gfServer status yourLocation yourPort.
-For example:
$ gfServer status 132.249.245.79 17777-
You should see output like this:
--version 36x2 -type nucleotide -host localhost -port 17777 -tileSize 11 -stepSize 5 -minMatch 2 -pcr requests 0 -blat requests 0 -bases 0 -misses 0 -noSig 1 -trimmed 0 -warnings 0 -- -
The best troubleshooting test is to take the webpage out of -the equation, and use the command line utility, -gfClient, to run the query on your instance of -gfServer. If you can successfully connect gfClient to gfServer, -you will know that your location and port specification are -correct.
-From the directory that holds your hub's .2bit file (should -be the same directory where your instance of gfServer was -launched), perform a query using gfClient:
-You can type "gfClient" on your command line to see the -usage statement.
-Use the following command: gfClient yourLocation yourPort -pathOf2bitFile yourFastaQuery.fa -nameOfOutputFile.psl
-FYI: For testing with gfClient, you only need the gfServer -binary on your server, not blat.
- -For example:
-gfClient localhost 17777 . query.fa gfOutput.psl-
Note the "." after the port, to specify that the query will -use the .2bit file in the current directory. After running this -command, take a look at the gfOutput.psl file. If successful, -you will see BLAT results.
- -Another example:
-Note: In the example below, -"yourServer.yourInstitution.edu" is the name of the machine -where you run the gfServer command.
-From the test machine: Test the DNA alignment, -where test.fa is some sequence to find:
-gfClient yourServer.yourInstitution.edu 17779 `pwd` test.fa dnaTestOut.psl-
From the test machine: Test the protein alignment, -where proteinSequence.fa is the sequence to find:
-gfClient -t=dnaX -q=prot yourServer.yourInstitution.edu 17779 `pwd` proteinSequence.fa proteinOut.psl-
pwd says to find the
- yourAssembly.2bit file in this directory. A dynamic BLAT server is specified with the "dynamic"
-argument to the blat, transBlat, isPcr definitions in the hub
-genomes.txt file, followed by the gfServer
-root-relative path of the directory containing the 2bit and
-gfidx files.
For example:
--blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly -transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly -isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly --
The genome and gfServer indexes would be:
--$rootdir/yourAssembly/yourAssembly.2bit -$rootdir/yourAssembly/yourAssembly.untrans.gfidx -$rootdir/yourAssembly/yourAssembly.trans.gfidx --
See -Building -gfServer indexes for instructions on building the -index.
-For large hubs, it is possible to have more deeply nested -directories, for instance, following the NCBI convention:
--blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3 -transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3 -isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3 --
Which will reference these genome files and indexes:
--$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit -$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx -$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx -- -
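The mapping from the path given after "dynamic" to the files gfServer is expected to find can be written out as a small shell function. This is an illustrative sketch only: dynamic_paths is a hypothetical helper and /data/gfServerRoot is a made-up root directory. It follows the pattern in the two examples above, where the file name prefix is the last component of the relative path.

```shell
#!/bin/sh
# dynamic_paths ROOTDIR RELPATH: hypothetical helper that prints the 2bit and
# index paths implied by a "dynamic" entry, following the examples above.
dynamic_paths() {
    rootdir=$1
    rel=$2
    name=$(basename "$rel")   # last path component names the files
    for suffix in 2bit untrans.gfidx trans.gfidx; do
        printf '%s/%s/%s.%s\n' "$rootdir" "$rel" "$name" "$suffix"
    done
}

# Demo with a made-up root directory and the NCBI-style path from above.
dynamic_paths /data/gfServerRoot GCF/000/181/335/GCF_000181335.3
```

Piping the output through a loop that tests each path with `[ -e ]` would be one way to confirm the files are in place before editing genomes.txt.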
A query without specifying a genome is an "I am alive" -check:
--% gfServer status myserver 4040 -version 37x1 -serverType dynamic --
Specifying a genome checks that it is valid and gives -information on how the index was built:
--% gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040 -version 37x1 -serverType dynamic -type nucleotide -tileSize 11 -stepSize 5 -minMatch 2 --
Using -trans checks the translated index:
--% gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040 -version 37x1 -serverType dynamic -type translated -tileSize 4 -stepSize 4 -minMatch 3 --