62b1428915cc2c8b9acbf46a86c1248833257efe mspeir Mon Nov 17 18:50:32 2025 -0800 changing rest of genome-source references to github, refs #34485 diff --git src/hg/utils/automation/genbank/README.txt src/hg/utils/automation/genbank/README.txt index 8b92b7fa2ca..a93819a0601 100644 --- src/hg/utils/automation/genbank/README.txt +++ src/hg/utils/automation/genbank/README.txt @@ -1,71 +1,71 @@ ############################################################################ UCSC procedure to mirror genbank/refseq assemblies and generate assembly hubs This README.txt file is from: -http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/genbank/README.txt +http://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/automation/genbank/README.txt ############################################################################ # mirror genbank/refseq hierarchy There is a cron job running once a day M-F that runs an rsync command to fetch a select set of files from genbank/refseq. The script here is: cronUpdate.sh with one argument specifying either the refseq or genbank assembly hierarchy, e.g.: cronUpdate.sh refseq The accumulating hierarchy at UCSC is in: export outside="/hive/data/outside/ncbi/genomes/refseq" Or, if genbank assemblies: export outside="/hive/data/outside/ncbi/genomes/genbank" The rsync command has a number of include statements to pick up a select set of files: cd "${outside}" rsync -avPL --include "*/" --include="chr2acc" --include="*chr2scaf" \ --include "*placement.txt" --include "*definitions.txt" \ --include "*_rm.out.gz" --include "*agp.gz" --include "*regions.txt" \ --include "*_genomic.fna.gz" --include "*gff.gz" \ --include "*_assembly_report.txt" --exclude "*" \ rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/${D}/ ./${D}/ \ || /bin/true The "${D}" directory name is one of the directories at ftp.ncbi.nlm.nih.gov/genomes/{refseq|genbank}/${D}/ Slightly different directory listings for refseq vs. genbank. For refseq assemblies: for D in archaea bacteria fungi invertebrate mitochondrion plant plasmid \ plastid protozoa vertebrate_mammalian vertebrate_other viral For genbank assemblies: for D in archaea bacteria fungi invertebrate other plant protozoa \ vertebrate_mammalian vertebrate_other The cron script also maintains file listings and logs in: ${outside}/fileListings/yyyy/mm/ ${outside}/logs/yyyy/mm/ to track changes over time and provide a record of activity ############################################################################ # processing NCBI files into assembly hub files Processing takes place in the two hierarchies: /hive/data/inside/ncbi/genomes/refseq/ /hive/data/inside/ncbi/genomes/genbank/ The scripts from this source directory: kent/src/hg/utils/automation/genbank/ are used to process the NCBI files into assembly hub files. This processing is done with a cluster computer set up since there are a large number of steps to perform. The primary driving script is: ncbiToUcsc.sh It takes two arguments: 1. genbank|refseq - to specify which hierarchy to work on 2. <pathTo>/*_genomic.fna.gz - fasta sequence file in the 'outside' path ############################################################################