c7ff827180db5c246eb48e0b52cec414c44370cb jnavarr5 Thu Oct 31 15:51:46 2024 -0700
Adding examples of how to extract positions using rsIDs or extract rsIDs using a list of positions. Max says it is a commonly asked question so updating our FAQ documentation, refs #34701

diff --git src/hg/htdocs/FAQ/FAQreleases.html src/hg/htdocs/FAQ/FAQreleases.html
index 3e2180c..947044a 100755
--- src/hg/htdocs/FAQ/FAQreleases.html
+++ src/hg/htdocs/FAQ/FAQreleases.html
@@ -10,30 +10,35 @@

Topics

List of UCSC genome releases
Initial assembly release dates
Patch sequences for human and mouse
UCSC assemblies
Comparison of UCSC and NCBI human assemblies
Looking for a genome assembly not shown in the tree?
Differences between UCSC and NCBI mouse assemblies
Accessing older assembly versions
Frequency of GenBank data updates
Coordinate changes between assemblies
Converting positions between assembly versions
Converting SNPs between assembly versions

How can I convert SNP annotation coordinates between assembly versions?
How can I convert a large set of SNP annotations?
How can I extract a list of rsIDs using chrom:start-end or vice versa?

Missing annotation tracks
What next with the human genome?
Mouse strain used for mouse genome sequence

Return to FAQ Table of Contents

List of UCSC genome releases

How do UCSC's release numbers correspond to those of other organizations, such as NCBI?

The first release of an assembly is given a name using the first three characters of the organism's genus and species classification in the format gggSss#, with subsequent assemblies incrementing
@@ -564,53 +569,88 @@
Browser using the dbSNP track for human assemblies (i.e. hg19 or hg38) or the EVA SNP track on mouse assemblies (i.e. mm10 or mm39) to perform the conversion.

To summarize the steps:

Create a file of all rsIDs
Use the Table Browser to map the file of rsIDs to the other assembly's coordinates
Create another file containing any rsIDs that were not mapped by the Table Browser
Using the file from the previous step, use the Table Browser to create a BED4 file for the rsIDs that were not mapped by the Table Browser
Run LiftOver on the BED4 file to get the new coordinates in the other assembly
Use the Data Integrator to map the LiftOver results to new rsIDs where possible
Combine the Table Browser rsID-mapped BED4 with the LiftOver/Data Integrator-mapped BED4. Beware of duplicates, which will cause downstream problems. You will need to decide whether to remove duplicates as unreliable or to resolve them.
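
The duplicate check in the last step can be sketched in shell. This is only a minimal sketch and not part of the UCSC tools: the function name list_duplicate_names and the file names in the usage comment are hypothetical, and it assumes tab-separated BED4 rows with the rsID in column 4.

```shell
# List rsIDs (column 4) that appear on more than one row across the given
# BED4 files. Assumes tab-separated input; the function name is hypothetical.
list_duplicate_names() {
    cut -f4 "$@" | sort | uniq -d
}

# Hypothetical usage, with placeholder file names:
#   list_duplicate_names tableBrowser.bed4 liftOver.bed4 > duplicateIds.txt
```

An rsID is reported even when its rows are identical in both files, so the output is a list of candidates to inspect rather than a list of conflicts.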


How can I convert a large set of SNP annotations?

For bulk conversions, the Table Browser can be used to extract the coordinates for the rsIDs on the target assembly. More information about performing batch queries on the Table Browser can be found on the following Table Browser help page. An example of using the Table Browser to convert SNPs between assemblies can be found in a previously answered question available on the mailing list archive.

If you are using dbSNP version 153 or later, the data are formatted as bigBed files instead of being stored as a MariaDB table. For very large queries, this may cause the Table Browser to time out before the query finishes, as dbSNP has grown to include over 700 million variants. If your Table Browser query times out for your list of rsIDs, you can use the bigBedNamedItems command-line tool to extract the rsID coordinates directly from the bigBed file instead of using the Table Browser.

More information and examples using the bigBedNamedItems utility can be found on the following FAQ entry. As a reminder, you can run any Kent command-line tool without arguments to get the usage statement.


How can I extract a list of rsIDs using chrom:start-end or vice versa?

Several utilities for working with bigBed-formatted binary files can be downloaded here. Run a utility with no arguments to see a brief description of the utility and its options.

bigBedInfo provides summary statistics about a bigBed file, including the number of items in the file. With the -as option, the output includes an autoSql definition of data columns, useful for interpreting the column values.
bigBedToBed converts the binary bigBed data to tab-separated text. Output can be restricted to a particular region by using the -chrom, -start and -end options.
bigBedNamedItems extracts rows for one or more rs# IDs.


Examples:

Retrieve all variants in the region chr1:200001-200400

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb -chrom=chr1 -start=200000 -end=200400 stdout
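
The command above passes -start=200000 for a region that begins at base 200001 because BED-style start coordinates are 0-based, while browser positions are 1-based. A minimal shell sketch of that conversion follows. The helper name region_to_args is hypothetical, and it assumes a position string of the form chrom:start-end with no thousands separators.

```shell
# Convert a 1-based browser position (chrom:start-end) into bigBedToBed
# region options with a 0-based start. The helper name is hypothetical.
region_to_args() {
    pos=$1
    chrom=${pos%%:*}     # text before the first colon
    range=${pos#*:}      # text after the first colon
    start=${range%%-*}   # 1-based start
    end=${range#*-}      # end coordinate, unchanged
    printf '%s\n' "-chrom=$chrom -start=$((start - 1)) -end=$end"
}

# Possible usage with the same file as above:
#   bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb \
#       $(region_to_args chr1:200001-200400) stdout
```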

Retrieve variant rs6657048

bigBedNamedItems dbSnp155.bb rs6657048 stdout

Retrieve all variants with rs# IDs in file myIds.txt

bigBedNamedItems -nameFile dbSnp155.bb myIds.txt dbSnp155.myIds.bed
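
To turn the extracted rows into a list of rsIDs with 1-based chrom:start-end positions, the tab-separated output can be post-processed. This is a minimal sketch, assuming the first four columns are chrom, start, end and name in standard BED order; the helper name bed_to_positions is hypothetical.

```shell
# Read BED rows on stdin and print "name<TAB>chrom:start-end", converting
# the 0-based BED start to a 1-based position. Helper name is hypothetical.
bed_to_positions() {
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $4, $1 ":" ($2 + 1) "-" $3 }'
}

# Possible usage with the file names from the example above:
#   bigBedNamedItems -nameFile dbSnp155.bb myIds.txt stdout | bed_to_positions
```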

Missing annotation tracks

Why is my favorite annotation track missing from your latest release?

The initial release of a new genome assembly typically contains a small subset of core annotation tracks. New tracks are added as they are generated. In many cases, our annotation tracks are contributed by scientists not affiliated with UCSC who must first obtain the sequence, repeatmasked data, etc. before they can produce their tracks. If you have need of an annotation that has not appeared on an assembly within a month or so of its release, feel free to send an inquiry to genome@soe.ucsc.edu. Messages sent to this address will be posted to the moderated genome mailing list, which is archived on a SEARCHABLE, PUBLIC Google Groups forum.