99599c7130790107ff0de9f043930da6aa7fddf1
angie
  Mon Nov 16 16:35:58 2020 -0800
Scripts for automating SARS-CoV-2 Phylogeny tracks (refs #26530): fetching
sequences and metadata from several public sources, mapping GISAID IDs to
public seq IDs, downloading the latest release of the phylogenetic tree from
github.com/roblanf/sarscov2phylo/, making VCFs from GISAID and public
sequences, and using github.com/yatisht/usher to resolve ambiguous alleles,
make protobuf files for hgPhyloPlace, and add public sequences that have not
been mapped to GISAID sequences to the sarscov2phylo tree for a comprehensive
public tree+VCF.
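
For orientation, a minimal sketch of the tree+VCF step (hedged: the file names
are placeholders, and this assumes the kent faToVcf utility and UShER's
-t/-v/-o options rather than the exact commands used in the scripts below):

    #!/bin/bash
    set -beEu -o pipefail

    # Placeholder inputs: aligned public sequences and the latest
    # sarscov2phylo release tree (illustrative names, not the otto paths).
    alignedFa=public.aligned.fa
    treeNwk=sarscov2phylo.nh

    # Make a VCF from the aligned sequences (faToVcf is a kent utility).
    faToVcf $alignedFa public.vcf

    # Place/add sequences on the tree and save a mutation-annotated tree
    # protobuf for hgPhyloPlace.
    usher -t $treeNwk -v public.vcf -o public.pb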

This is not yet fully otto-mated because certain crucial inputs, such as GISAID
sequences, must still be downloaded using a web browser, but the goal is to
automate as much as possible and perhaps someday make it fully cron-driven.

There are two main top-level scripts that call other scripts, some of which in
turn call further scripts, in this hierarchy:

updateIdMapping.sh
    getCogUk.sh
    getNcbi.sh
        searchAllSarsCov2BioSample.sh
        bioSampleIdToText.sh
        bioSampleTextToTab.pl
        gbMetadataAddBioSample.pl
        fixNcbiFastaNames.pl

updateSarsCov2Phylo.sh
    getRelease.sh
    processRelease.sh
    cladeLineageColors.pl
    mapPublic.sh
    extractUnmappedPublic.sh
    addUnmappedPublic.sh

many of the above:
    util.sh
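
util.sh holds shared shell helpers used by many of the scripts above; a minimal
sketch of the sourcing pattern (illustrative only, not a claim about what
util.sh actually defines):

    #!/bin/bash
    set -beEu -o pipefail
    # Pull in shared helpers from the same directory as this script
    # (hypothetical pattern; util.sh's contents are not shown here).
    scriptDir=$(dirname "${BASH_SOURCE[0]}")
    source "$scriptDir/util.sh"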

publicCredits.sh will hopefully be folded into updateSarsCov2Phylo.sh when I
figure out how to automate fetching of author/institution metadata from NCBI
and COG-UK.

diff --git src/hg/utils/otto/sarscov2phylo/bioSampleIdToText.sh src/hg/utils/otto/sarscov2phylo/bioSampleIdToText.sh
new file mode 100755
index 0000000..68d0cfa
--- /dev/null
+++ src/hg/utils/otto/sarscov2phylo/bioSampleIdToText.sh
@@ -0,0 +1,45 @@
+#!/bin/bash
+
+set -beEu -o pipefail
+
+# stdin: series of BioSample GI# IDs (numeric IDs, *not* accessions)
+# stdout: full text record for each BioSample
+
+url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
+db="biosample"
+retmode="text"
+tool="bioSampleIdToText"
+email="$USER%40soe.ucsc.edu"
+baseParams="db=$db&retmode=$retmode&tool=$tool&email=$email"
+# Add &id=... for each id in input, request in batches...
+
+batchSize=100
+
+export TMPDIR=/dev/shm   # export so mktemp actually places the file in shm
+paramFile=$(mktemp)
+
+initBatch() {
+    count=0
+    echo -n $baseParams > $paramFile
+}
+
+sendBatch() {
+    curl -s -S -X POST -d @$paramFile "$url"
+    # Give NCBI a rest
+    sleep 1
+}
+
+initBatch
+
+while read id; do
+    echo -n "&id=$id" >> $paramFile
+    count=$((count + 1))
+    if [ $count -eq $batchSize ]; then
+        sendBatch
+        initBatch
+    fi
+done
+if [ $count -ne 0 ]; then
+    sendBatch
+fi
+rm $paramFile
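
Usage sketch for bioSampleIdToText.sh (the input file name is hypothetical):
it reads numeric BioSample IDs on stdin, POSTs them to NCBI efetch in batches
of 100, and writes the full text records to stdout:

    # one numeric BioSample GI# ID per line in the (hypothetical) input file
    ./bioSampleIdToText.sh < biosampleGiIds.txt > biosampleRecords.txt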