01e223245a85e1ae6d1f2e914c64367cfd006ea9
ccpowell
Tue Jul 9 16:00:29 2019 -0700
Changing MySQL to MariaDB in documentation, refs #23597

diff --git src/hg/htdocs/FAQ/FAQdownloads.html src/hg/htdocs/FAQ/FAQdownloads.html
index d22d10d..621592b 100755
--- src/hg/htdocs/FAQ/FAQdownloads.html
+++ src/hg/htdocs/FAQ/FAQdownloads.html
@@ -30,31 +30,31 @@

N characters at beginning of human chr22

Erroneous duplicated chrY_random region on Mouse Build 34 (mm6)

Mapping chimp chromosome numbers to human chromosomes numbers

Converting genome coordinates between assemblies

Linking gene name with accession number

Obtaining a list of Known Genes

Repeat-masking data

Availability of repeat-masked data

RepeatMasker version differences - UCSC vs. Repeatmasker website

Obtaining promoter sequence

Data from Evolutionary Conservation Score tracks

Minus strand coordinates - axtNet files

Mapping UCSC STS marker IDS to those of other groups

deCODE map data

-Direct MySQL access to data

+Direct MariaDB (MySQL) access to data

Name of fourth column in BED output

Track data access

Known issues with Table Browser GTF output

Table Browser output file not ordered

'Permission denied' error when trying to use command-line utilities

Restricted Track Data

Return to FAQ Table of Contents

Downloading sequence and annotation data

How do I obtain the sequence and/or annotation data for a release?

@@ -141,31 +141,31 @@

sex

source

tissue

These tables are also accessible from:

The Table Browser, as connected tables and joined fields described when clicking the "describe table schema " button
-One of our two public access MySQL servers
+One of our two public access MariaDB servers in the US and Europe

Extracting sequence in batch from an assembly

I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?

There are two ways to extract genomic sequence in batch from an assembly:

A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example,

@@ -247,34 +247,34 @@

Microsoft Word or any program that can handle large text files will do. Some of the chromosomes begin with long blocks of Ns. You may want to search for an A to get past them.

Unless you have a particular need to view or use the raw data files, you might find it more interesting to look at the data using the Genome Browser. Type the name of a gene in which you're interested into the position box (or use the default position), then click the submit button. In the resulting Genome Browser display, click the DNA link on the menu bar at the top of the page. Select the Extended case/color options button at the bottom of the next page. Now you can color the DNA sequence to display which portions are repeats, known genes, genetic markers, etc.

Data differences between downloaded data and browser display

-I downloaded the genome annotations from your MySQL database tables, but the mRNA locations
+I downloaded the genome annotations from your MariaDB database tables, but the mRNA locations
didn't match what was showing in the Genome Browser. Shouldn't they be in synch?

-Yes. The Genome Browser and Table Browser are both driven by the same underlying MySQL database.
+Yes. The Genome Browser and Table Browser are both driven by the same underlying MariaDB database.
Check that your downloaded tables are from the same assembly version as the one you are viewing in the Genome Browser. If the assembly dates don't match, the coordinates of the data within the tables may differ. In a very rare instance, you could also be affected by the brief lag time between the update of the live databases underlying the Genome Browser and the time it takes for text dumps of these databases to become available in the downloads directory.

Strange characters in FASTA file

I noticed several characters other than A, C, G, T, and N in my fasta file, for example y, k, s, etc. Is the file corrupted or are these characters valid?

The characters most commonly seen in sequence are A, C, G, T, and N, but there are several other valid characters that are used in clones to indicate ambiguity about the identity of certain bases in the sequence. It's not uncommon to see these

@@ -774,40 +774,40 @@

this ID to look it up in the stsMap table where the marker is located. For example, D10S249 has UCSC ID 2880 and is located at chr10:240791-241019.

deCODE map data

Where can I get more information about the deCODE map?

You can obtain this information from the combination of a couple of tables. The stsMap table contains the physical position of all STS markers, including those on the deCODE map. This file also contains information about the position on the genome-wide maps, including the deCODE map. A second file, stsInfo2, contains additional information about each marker, including aliases, primer sequence information, etc. This table is related to the first table by an ID (the identNo field in both files).

-Direct MySQL access to data

+Direct MariaDB (MySQL) access to data

Is it possible to run SQL queries directly on the database rather than using the Table Browser interface?

Yes. See our documentation on Downloading Data using
-MySQL.

+MariaDB.

-Connect to the US MySQL server using the command:

+Connect to the US MariaDB server using the command:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A

-Or to the European MySQL server using the command:

+Or to the European MariaDB server using the command:

mysql --user=genome --host=genome-euro-mysql.soe.ucsc.edu -A

Name of fourth column in BED output

When using the Table Browser to extract exons from a Gene track, what does the "Name" column (fourth BED column) refer to?

The fourth column of the BED output contains a lot of information separated by underscores. For example:

uc009vjk.2_cds_1_0_chr1_324343_f

This information is represented as follows:

ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand

@@ -834,53 +834,53 @@

listed in this section of the 4th column is actually 1 based. It will be the exact coordinate the feature starts on as displayed in the browser.
Strand: forward(f) or reverse(-) strand.
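
As a rough illustration of the field layout described above, the example name can be split on underscores in the shell. This is only a sketch: the variable names are hypothetical, and it assumes the chromosome name itself contains no underscore (a name such as chr1_random would need different handling).

```shell
# Example name taken from the FAQ text above.
name='uc009vjk.2_cds_1_0_chr1_324343_f'

# Split on "_" into the seven documented parts.
# Variable names are illustrative only; assumes no "_" inside the chromosome name.
IFS=_ read -r ucsc_id seq_type seq_num bases_added chrom first_base strand <<EOF
$name
EOF

# Report the feature location and strand.
echo "$ucsc_id $seq_type $chrom:$first_base ($strand)"
```

The heredoc keeps the `read` in the current shell, so the variables remain available to later commands.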

Track Data Access

How do I access the data underlying a track?

The raw data underlying a track can be explored interactively with the Table Browser, Data Integrator, or Variant Annotation Integrator. For automated analysis, the genome annotation can be downloaded from the downloads server, one of our two
-public MySQL servers, or
+public MariaDB servers, or
using our JSON API.

bigBed data: For bigBed files, individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range using one of the hgdownload servers, example:

North American server:

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/path/to/file/bigBedfile.bb -chrom=chr21 -start=0 -end=1000000 stdout

European server:

bigBedToBed http://hgdownload-euro.soe.ucsc.edu/gbdb/path/to/file/bigBedfile.bb -chrom=chr21 -start=0 -end=1000000 stdout

-SNP data: If queries against the SNP table on one of our public MySQL servers or on your
-own MySQL installation are slow, then they can be sped up by using the "bin" field; you
+SNP data: If queries against the SNP table on one of our public MariaDB servers or on your
+own MariaDB installation are slow, then they can be sped up by using the "bin" field; you
can contact us for more information.

Read more in our blog about Accessing the Genome Browser Programmatically to acquire data.

Obtaining GTF (Gene Transfer Format)

What is the best method for obtaining GTF output?

Currently, the Table Browser does not have an option to return data as GTF files. Currently, the best method to obtain GTF files is to use the command-line format conversion utility, genePredToGtf. This can be set up

@@ -897,31 +897,31 @@

includes proper start and stop codons.

Some tables in older genome assemblies are not supported.

GenePred (short for Gene Predictions) is a table format commonly used for gene tracks in the UCSC Genome Browser where each transcript has a single row. Tables are not stored in GTF as it would require many rows to describe a single transcript since each gene feature (i.e., exon) requires a separate line. The genePredToGtf command-line utility can be used to convert genePred to GTF. Download the genePredToGtf operating system-specific command-line utility from the utilities directory.

Please see the Genes in GTF or GFF Format wiki page for examples and various methods for conversion. The genePredToGtf utility can convert files from several sources, such as Table Browser output from a genePred table, a local downloaded gene set table like refGene.txt, or from querying
-public MySQL tables.

+public MariaDB tables.

Table Browser output file order

My table browser output file is not ordered by position, how is it ordered?

Most of our tables have a special first column called "bin" that helps with quickly displaying data on the Genome Browser. This (chrom,bin) index causes query results to be ordered first by bin, then by chromStart. This allows us to query and return results more quickly than if they were sorted by chromStart.

A quick way to sort an output BED file by position is to use the following UNIX command on our Table Browser output BED file:

sort -k1,1 -k2n,2n example.bed > example.sorted.bed
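
To see the effect of that command, here is a small sketch run on a made-up three-line BED file. The coordinates are invented for illustration and are not real Table Browser output.

```shell
# Hypothetical BED records, deliberately out of positional order.
printf 'chr2\t500\t600\nchr1\t2000\t2100\nchr1\t100\t200\n' > example.bed

# Sort by chromosome name (column 1), then numerically by start position (column 2).
sort -k1,1 -k2n,2n example.bed > example.sorted.bed

# Show the sorted result.
cat example.sorted.bed
```

The first key compares chromosome names as text, so chr10 sorts before chr2; use a version-aware sort if natural chromosome order matters.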