7a173a092486bb744c3f42d694aab108ecae34d5 dschmelt Wed Apr 10 12:21:42 2019 -0700 Adding GTF to bigGenePred Example4 #21582 diff --git src/hg/htdocs/goldenPath/help/bigGenePred.html src/hg/htdocs/goldenPath/help/bigGenePred.html index 8e1c758..5cf8820 100755 --- src/hg/htdocs/goldenPath/help/bigGenePred.html +++ src/hg/htdocs/goldenPath/help/bigGenePred.html @@ -1,280 +1,279 @@

bigGenePred Track Format

The bigGenePred format stores positional annotation items for collections of exons in a compressed format, similar to how BED files can be compressed into bigBeds. The bigGenePred format includes 8 additional fields that contain details about coding frames, annotation status, and other gene-specific information. This format is commonly used in the Browser to display start codons, stop codons, and amino acid translations.

Before compression, bigGenePred files can be described as bed12+8 files. bigGenePred files can be created using the program bedToBigBed, run with the -as option to pull in a special autoSql (.as) file that defines the extra fields of the bigGenePred.

Much like bigBed, bigGenePred files are in an indexed binary format. The advantage of using a binary format is that only the portions of the file needed to display a particular region are read by the Genome Browser server. Because of this, indexed binary files have much faster display performance than regular BED format files when working with large data sets. The bigGenePred file remains on the user's web-accessible server (http, https or ftp) and only the portion needed to display the current genome position is cached as a "sparse file". If you want more information on finding a web-accessible server or need hosting space for your bigGenePred files, please see the Hosting section of the Track Hub Help documentation.

bigGenePred file definition

The following autoSql definition specifies bigGenePred gene prediction files. This definition, contained in the file bigGenePred.as, is pulled in when the bedToBigBed utility is run with the -as=bigGenePred.as option.

table bigGenePred
 "bigGenePred gene models"
     (
     string chrom;       	"Reference sequence chromosome or scaffold"
     uint   chromStart;  	"Start position in chromosome" 
     uint   chromEnd;    	"End position in chromosome"
     string name;        	"Name or ID of item, ideally both human-readable and unique"
     uint score;         	"Score (0-1000)"
     char[1] strand;     	"+ or - for strand"
     uint thickStart;    	"Start of where display should be thick (start codon)"
     uint thickEnd;      	"End of where display should be thick (stop codon)"
     uint reserved;       	"RGB value (use R,G,B string in input file)"
     int blockCount;     	"Number of blocks"
     int[blockCount] blockSizes; "Comma separated list of block sizes"
     int[blockCount] chromStarts;"Start positions relative to chromStart"
     string name2;       	"Alternative/human readable name"
     string cdsStartStat; 	"Status of CDS start annotation (none, unknown, incomplete, or complete)"
     string cdsEndStat;   	"Status of CDS end annotation (none, unknown, incomplete, or complete)"
     int[blockCount] exonFrames; "Exon frame {0,1,2}, or -1 if no frame for exon"
     string type;        	"Transcript type"
     string geneName;    	"Primary identifier for gene"
     string geneName2;   	"Alternative/human-readable gene name"
     string geneType;    	"Gene type"
     )

The following bed12+8 is an example of a bigGenePred input file.

Creating a bigGenePred track

To create a bigGenePred track, follow these steps:

Step 1. Format your bigGenePred file. The first 12 fields of the bigGenePred bed12+8 format are described by the basic BED file format. You can also read about the genePred format. Your bigGenePred file must also contain the 8 extra fields described in the autoSql file definition shown above: name2, cdsStartStat, cdsEndStat, exonFrames, type, geneName, geneName2, geneType. For reference, you can use this example bed12+8 input file, bigGenePred.txt. Your bigGenePred file must be sorted first on the chrom field, and secondarily on the chromStart field. You can use the UNIX sort command to do this:

sort -k1,1 -k2,2n unsorted.bed > input.bed
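Before running the conversion, it can help to confirm that the sorted file really is in the expected order and that every row has 20 tab-separated fields (12 BED fields plus the 8 extra ones). The following is a minimal sketch of such a check, not part of the UCSC tooling. The two input rows are made-up placeholders generated by a hypothetical helper named row, and the temporary directory is only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Hypothetical helper: prints one 20-field, tab-separated placeholder row
# for the given chrom, chromStart and chromEnd (values are not real annotations).
row() {
  printf '%s\t%s\t%s\titem\t0\t+\t%s\t%s\t0\t1\t100,\t0,\tname2\tunknown\tunknown\t0,\ttype\tgene\tgene2\tgtype\n' \
    "$1" "$2" "$3" "$2" "$3"
}
{ row chr1 500 600; row chr1 100 200; } > "$tmp/unsorted.bed"

# Same sort keys as in Step 1: chrom first, then numeric chromStart.
sort -k1,1 -k2,2n "$tmp/unsorted.bed" > "$tmp/input.bed"

# sort -c only checks the order and exits non-zero if the file is out of order.
if sort -c -k1,1 -k2,2n "$tmp/input.bed"; then sorted=yes; else sorted=no; fi

# Count rows whose tab-separated field count is not 20.
bad=$(awk -F'\t' 'NF != 20' "$tmp/input.bed" | wc -l | tr -d ' ')

echo "sorted=$sorted bad_lines=$bad"
```

A non-zero bad_lines count usually points to missing extra fields or to spaces used in place of tabs.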

Step 2. Download the bedToBigBed program from the binary utilities directory.

Step 3. Download the chrom.sizes file for your genome assembly using the fetchChromSizes script from the same directory. Alternatively, you can download the chrom.sizes file for any assembly hosted at UCSC from our downloads page (click on "Full data set" for any assembly). For example, the hg38.chrom.sizes file for the hg38 database is located at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.

Step 4. Create the bigGenePred file from your sorted input file using the bedToBigBed utility command:

bedToBigBed -as=bigGenePred.as -type=bed12+8 bigGenePred.txt chrom.sizes myBigGenePred.bb

Step 5. Move the newly created bigGenePred file (myBigGenePred.bb) to a web-accessible http, https, or ftp location. See the hosting section if necessary.

Step 6. Construct a custom track using a single track line. Note that any of the track attributes listed here are applicable to tracks of type bigBed. The basic version of the track line will look something like this:

track type=bigGenePred name="My Big GenePred" description="A Gene Set Built from Data from My Lab" bigDataUrl=http://myorg.edu/mylab/myBigGenePred.bb

Step 7. Paste this custom track line into the text box on the custom track management page.

The bedToBigBed program can be run with several additional options. For a full list of the available options, type bedToBigBed (with no arguments) on the command line to display the usage message.

Examples

Example #1

In this example, you will create a bigGenePred custom track using a bigGenePred file located on the UCSC Genome Browser http server, bigGenePred.bb. This file contains data for the hg38 assembly.

To create a custom track using this bigGenePred file:

Construct a track line that references the hosted file:

track type=bigGenePred name="bigGenePred Example One" description="A bigGenePred file" bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb

Paste the track line into the custom track management page for the human assembly hg38 (Dec. 2013).
Click the Submit button.

Custom tracks can also be loaded via one URL line. The link below loads the same bigGenePred track and sets additional parameters in the URL:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgct_customText=track%20type=bigGenePred%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb
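A link of this form can be assembled from a plain track line by replacing each space with %20. The sketch below rebuilds the URL shown above. The variable names db, track_line, encoded and url are illustrative only, and the encoding step handles spaces alone, which is all this particular track line needs.

```shell
# Assembly and track line taken from the example above.
db="hg38"
track_line='track type=bigGenePred bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb'

# Replace spaces with %20; other characters are left as they are.
encoded=$(printf '%s' "$track_line" | sed 's/ /%20/g')

url="http://genome.ucsc.edu/cgi-bin/hgTracks?db=${db}&hgct_customText=${encoded}"
echo "$url"
```

Track lines containing quoted names or descriptions may need additional characters encoded, which this sketch does not attempt.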

After this example bigGenePred track is loaded in the Genome Browser, click on a gene in the browser's track display to view the details page for that gene. Note that the page offers links to several sequence types, including translated protein, predicted mRNA, and genomic sequence.

Example #2

In this example, you will configure the bigGenePred track loaded in Example #1 to display codons and amino acid numbering:

At the bottom of the gene details page, click the "Go to ... track controls" link.
Change the "Color track by codons:" option from "OFF" to "genomic codons" and check that the display mode is set to "full". Then click Submit.
On the Genome Browser tracks display, zoom to a track region where amino acids display, such as chr9:133,255,650-133,255,700, and note that the track now displays codons.
Return to the track controls page and click the box next to "Show codon numbering", then click Submit.
The browser tracks display will now show amino acid numbering.

You can also add a parameter to the custom track line, baseColorDefault=genomicCodons, to display codons by default:

browser position chr10:67,884,600-67,884,900
 track type=bigGenePred baseColorDefault=genomicCodons name="bigGenePred Example Two" description="A bigGenePred file" visibility=pack bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb

Paste the above into the hg38 custom track management page to view an example of bigGenePred amino acid display at the beginning of the SIRT1 gene on chromosome 10.

An image of a track with codons colored

Example #3

In this example, you will create your own bigGenePred file from an existing bigGenePred input file.

Save the example bed12+8 input file to your computer, bigGenePred.txt.
Download the bedToBigBed utility (Step 2, in the Creating a bigGenePred track section above).
Save the hg38.chrom.sizes text file to your computer. This file contains the chrom.sizes for the human hg38 assembly (Step 3, above).
Save the autoSql file bigGenePred.as to your computer.

Run the bedToBigBed utility to create the bigGenePred output file (Step 4, above):

bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as bigGenePred.txt hg38.chrom.sizes bigGenePred.bb

Place the newly created bigGenePred file (bigGenePred.bb) on a web-accessible server (Step 5, above).
Construct a track line that points to the bigGenePred file (Step 6, above).
Create the custom track on the human assembly hg38 (Dec. 2013), and view it in the Genome Browser (Step 7, above).

Example #4

In this example, you will convert a genePred file to bigGenePred using command line utilities. +

In this example, you will convert a GTF file to bigGenePred using command line utilities. You can download utilities from the utilities directory.

- Obtain a genePred extended file. In this example, we are downloading the Comprehensive Gencode V28 gene data. -
```
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/wgEncodeGencodeCompV28.txt.gz
```

wget http://genome.ucsc.edu/goldenPath/help/examples/bigGenePredExample4.gtf

- Uncompress the file. -
```
gunzip wgEncodeGencodeCompV28.txt.gz 
```

gtfToGenePred -genePredExt bigGenePredExample4.gtf example4.genePred

-Isolate columns 2 till the end, removing the bin column, and saving as wgCompV28Cut.txt. -
```
cut -f 2- wgEncodeGencodeCompV28.txt > wgCompV28Cut.txt 
```

genePredToBigGenePred example4.genePred ex4BigGenePred.txt

- Convert the genePred extended file to a bigGenePred text file, reordering and adding columns. -
```
genePredToBigGenePred wgCompV28Cut.txt wgEncodeGencodeCompV28BigGP.txt
```
- Obtain input files for the binary conversion. + Obtain helper files for the conversion from pre-bigGenePred to binary bigGenePred.
```
fetchChromSizes hg38 > hg38.chrom.sizes
-wget https://hgwdev.gi.ucsc.edu/goldenPath/help/examples/bigGenePred.as
```

Convert your text bigGenePred to a binary indexed format. -

bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as wgEncodeGencodeCompV28BigGP.txt hg38.chrom.sizes wgEncodeGencodeCompV28.bgp

bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as ex4BigGenePred.txt hg38.chrom.sizes ex4BigGenePred.bb

- Put your binary indexed file in a web-accessible location. See the hosting section for more information.


- View your dataset in the Browser by entering your data URL in the bigDataUrl field of the URL. -
```
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgct_customText=track%20type=bigGenePred%20bigDataUrl=https://hgwdev.gi.ucsc.edu/~dschmelt/wgEncodeGencodeCompV28.bgp
```
+ View your dataset in the Browser by entering your hosted data URL in the bigDataUrl field of the + URL. For example, you can paste this link into your web browser. +
```
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgct_customText=track%20type=bigGenePred%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/ex4bigGenePred.bb
```
You can also add your data in the custom track management page. This allows you to set position, configuration options, and write a more complete -description. If you want to see codons, you will have to right click to configure codon view or -set this option using the baseColorDefault=genomicCodons code as is done below. +description. If you want to see codons, you can right click, then click configure codon view or +set these options using the baseColorDefault=genomicCodons code as is done below.
```
browser position chr10:67,884,600-67,884,900 
-track type=bigGenePred baseColorDefault=genomicCodons name="bigGenePred Example Four" description="BGP Made from genePred" visibility=pack bigDataUrl=https://hgwdev.gi.ucsc.edu/~dschmelt/wgEncodeGencodeCompV28.bgp
```

Sharing your data with others

If you would like to share your bigGenePred data track with a colleague, learn how to create a URL link to your data by looking at Example #6.

Extracting data from bigBed format

Because the bigGenePred files are an extension of bigBed files, which are indexed binary files, it can be difficult to extract data from them. UCSC has developed the following programs to assist in working with bigBed formats, available from the binary utilities directory.

bigBedToBed — converts a bigBed file to ASCII BED format.
bigBedSummary — extracts summary information from a bigBed file.
bigBedInfo — prints out information about a bigBed file.

As with all UCSC Genome Browser programs, simply type the program name (with no parameters) at the command line to view the usage statement.

Troubleshooting

If you encounter an error when you run the bedToBigBed program, check your input file for data coordinates that extend past the end of the chromosome. If these are present, run the bedClip program (available here) to remove the problematic row(s) before running the bedToBigBed program.
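To see which rows are affected before clipping, the input can be compared against the chrom.sizes file. The following is a rough sketch using awk. The sizes and coordinates are toy values invented for the illustration, and the check only looks at chromEnd (column 3) and at chromosome names missing from chrom.sizes.

```shell
# Scratch directory with toy inputs (not real assembly sizes).
tmp=$(mktemp -d)
printf 'chr1\t1000\nchr2\t500\n' > "$tmp/chrom.sizes"
# The second row ends at 600, past the toy chr2 length of 500.
printf 'chr1\t100\t200\nchr2\t400\t600\n' > "$tmp/input.bed"

# First file: remember each chromosome size.
# Second file: print rows with an unknown chrom or a chromEnd beyond that size.
offending=$(awk -F'\t' '
  NR == FNR { size[$1] = $2; next }
  !($1 in size) || $3 + 0 > size[$1] + 0
' "$tmp/chrom.sizes" "$tmp/input.bed")

echo "$offending"

# With the UCSC utilities installed, the rows can then be removed with bedClip,
# whose usage message lists the arguments as: bedClip input.bed chrom.sizes output.bed
```

Listing the rows first makes it easier to decide whether they reflect a wrong assembly or chrom.sizes file rather than a few stray coordinates.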
