src/hg/mouseStuff/blastzPaper/blastzPaper.c 4898794edd81be5285ea6e544acbedeaeb31bf78

4898794edd81be5285ea6e544acbedeaeb31bf78
max
  Tue Nov 23 08:10:57 2021 -0800
Fixing pointers to README file for license in all source code files. refs #27614

diff --git src/hg/mouseStuff/blastzPaper/blastzPaper.c src/hg/mouseStuff/blastzPaper/blastzPaper.c
index 22f0d98..ae89d25 100644
--- src/hg/mouseStuff/blastzPaper/blastzPaper.c
+++ src/hg/mouseStuff/blastzPaper/blastzPaper.c
@@ -1,377 +1,377 @@
 /* Copyright (C) 2002 The Regents of the University of California 
- * See README in this or parent directory for licensing information. */
+ * See kent/LICENSE or http://genome.ucsc.edu/license/ for licensing information. */
 
 Human-mouse alignments with Blastz
 schwartz, kent, smit, zhang, baertsch, hardison, haussler, miller
 
 Introduction:
 
 One of the goals set by the Mouse Genome Analysis Consortium (cite Nature paper)
 is to study mutation and selection, the main forces shaping the mouse and
 human genomes.  Specific aims included
 (1) estimating the fraction of the human genome that is under selection
 (cite Krish et al.),
 (2) determining the degree to which genome comparisons can pinpoint the
 regions under selection (cite Elnitski et al.) and
 (3) measuring regional variation in the rate and pattern of
 neutral evolution (Hardison et al.).  Attaining these goals requires an
 alignment program with higher sensitivity than is needed for other purposes
 such as predicting novel protein-coding segments or identifying large
 genomic intervals where gene order is conserved.
 
 Ideally, our alignment program would identify orthologous regions
 of the human and mouse genomes, whether or not they are under selection.
 It would determine correspondences between genomic positions that are 
 descended from the same position in the ancestral genome,
 allowing nucleotide substitutions.
 In practice, success in reaching that goal is measured by the program's
 sensitivity (fraction of orthologous positions that it aligns) and
 specificity (fraction of the aligned positions that are indeed orthologous).
 Most of our aims could have been addressed by a program that aligned neutrally
 evolving regions with a modest degree of sensitivity.  For instance,
 regional variations (aim 3) could be assessed from a relatively small sample,
 (say, 1 out of 10 orthologous regions), provided that there were no critical
 biases in the sampling process.  Demands on specificity were higher,
 but it was acceptable for perhaps 5-10% of the aligned positions to be
 non-orthologous.
 
 To meet out needs, we enhanced the Blastz alignment program (cite PipMaker),
 and developed a new program, axtBest to separate paralogous from orthologous
 alignment.  Here we describe the programs, the hardware environment, and 
 several validation studies.  Our results indicate that we have correctly 
 aligned the majority of what can be aligned between the human and mouse 
 genomes.
 
 Results:
 
 Software Design Issues
 
 Our goal is to align an appreciable fraction of the neutrally evolving DNA in
 the human and mouse genomes.  This sensitivity requirement
 disqualified several existing programs (cite SSAHA, BLAT)
 that sacrifice sensitivity to attain very short running times.
 Preliminary studies indicated that an appropriate level of sensitivity and
 specificity was attained by a program called Blastz, which is used by the
 PipMaker (cite Schwartz) Web server to align genomic sequences.
 Blastz follows the three-step strategy used by Gapped Blast
 (Altschul et al. 1997): find short near-exact matches,
 extend each short match without allowing gaps, and extend each
 ungapped match that exceeds a certain threshold by a dynamic programming
 procedure that permits gaps.
 
 Two differences between Blastz and Gapped Blast were exploited in our
 whole-genome alignments.  First, Blastz has an option to require that 
 the matching regions that it reports must occur in the same order and 
 orientation in both sequences.  To enforce this restiction, it uses a 
 method described by Zhang et al. 1994.  Second, Blastz uses an alignment 
 scoring scheme derived and evaluated by Chiaromonte et al. 2002.  Nucleotide 
 substitutions are scored by the matrix
 
      A    C    G    T
 A   91 -114  -31 -123
 C -114  100 -125  -31
 G  -31 -125  100 -114
 T -123  -31 -114   91
 
 and a gap of length k is penalized by subtracting 400 + 30k from the score.
 To determine whether an ungapped alignment is sufficiently promising to
 warrant initiation of a dynamic programming extension step, the sum of scores
 for its columns is multiplied by a value between 0 and 1 that measures sequence
 complexity, as described in detail by Chiaromonte et al.
 This makes it harder for a region of extremely biased nucleotide content to
 trigger a gapped alignment.
 
 Two changes to Blastz significantly improved its execution speed for
 aligning entire genomes.  First, when the program realizes that many 
 regions of the mouse genome align to the same human segment, that 
 segment is dynamically masked, i.e., marked so that it will be 
 ignored in later steps of the alignment process.
 Second, we adapted a very clever idea of Wu el al. 2002 for determining the
 initial short match that may seed an alignment.  Formerly, Blastz looked for 
 identical runs of 8 consecutive nucleotides in each sequence. 
 Wu et al. propose looking for runs of 19 consecutive nucleotides in each
 sequence, within which the 12 positions indicated by a "1" in the string
 1110100110010101111
 are identical.  To increase sensitivity, we allow a transition
 (A-G, G-A, C-T or T-C) in any one of the 12 positions.
 
 Later, Blastz was further modified to utilize the increasing contiguity of
 the mouse sequence.
 For the earliest alignments, the mouse sequence was presented in unassembled
 ``reads'' of around 500 bp each.   An ungapped alignment was required to
 score at least 3000 (relative to the above substitution scores as adjusted
 by the measure of sequence complexity) before the dynamic programming step
 was initiated.
 Such a high threshold was needed to attain reasonable specificity.
 Availability of longer mouse segments suggested looking for pairs of alignments
 that connect nearby regions in both human and mouse genomes; the alignment
 procedure can be run with a lower threshold in the regions between the
 alignments.
 If the separation between the two alignments is under 50 kb in both sequences,
 then Blastz recursively searches the intervening regions for 7-mer exact
 matches and requires a threshold of 2200 for initiating dynamic programming
 (with the adjustment for sequence complexity).
 If the separation is under 10 kb, the threshold is lowered to 2000.
 In either case, the higher-sensitivity matches are required to occur with
 an order and orientation consistent with the stronger flanking matches.
 
 Although the fee for initiating a gapped alignment is steep (e.g., 3000),
 once started the alignment keeps extending as long as the average score of the
 added alignment remains positive.  This observation suggests a strategy of
 physically removing lineage specific interspersed repeats (i.e., that inserted
 after the human-mouse split).
 Then, an alignment that is initiated on one side of the insertion point can
 reach the other side, without incurring a penalty for jumping the
 insertion, where it may detect alignable nucleotides
 that may not meet the steep initiation fee on their own.
 We now remove recent repeat elements, run Blastz, then adjust
 positions in the alignment to refer to the original sequences.
 These last two improvements, i.e., recursive application of Blastz between
 adjacent alignments and excision of lineage specific repeats, increased
 alignment coverage from 33% of the human genome to 40%. [Scott, check numbers.]
 
 The modified Blastz was used to compare all the human sequence with all of the
 mouse, i.e., to produce a complete catalog of matching regions.
 more than one region of the mouse sequence aligned to the same
 region of the human sequence.  [jk new] This is a natural consequence of
 duplications in the mouse genome and the human/mouse common ancestor.
 These duplications include paralogous genes, processed and unprocessed
 psuedogenes, tandem repeats, simple repeats, etc.  For many purposes 
 one wants the single best, orthologous, match for each human region .  
 Typically when looking at a region spanning several thousand bases it is 
 clear which alignment is the ortholog and which are the paralogs.  The 
 orthologous alignment usually is longer,  and overall has a greater 
 sequence identity.  On the other hand over a small regions by chance a 
 paralog may have greater sequence identity than a paralog.  
 
 We wrote a program, axtBest, which automatically picks out an alignment
 likely to be the orthologous alignment.  The program works in two passes.
 A window of 5000 bases on either side of a human base is used to score
 that base.  The score is the blastz score at the end of the window minus
 the blastz score at the beginning of the window.  Bases less than 5000
 bases from the start or end of the alignment have to be handled specially,
 and the details of this differ between the first and the second pass.
 In the first pass the window is truncated to only cover the alignment
 itself.  This tends to make bases on the edge of an alignment or bases in
 a short alignment score less than bases in the middle of a long alignment.
 Unless an alignment is the best scoring alignment over at least one
 human base in the first pass, it is thrown out.   In the second pass
 bases on the edges of an alignment are given the same score as the base
 5000 bases into the alignment.  The portions of each alignment remaining
 after the first pass that are the best scoring over at least 10 bases of
 the human genome are then written out.  This two pass scheme with the
 first pass being done on softer-edged windows and the second pass on 
 harder-edged windows in most cases yields the same result one would
 obtain with a simpler one pass scheme.  However when applied to regions
 with multiple paralogs it results in less fragmented 'best' alignments
 than the single pass schemes we have tried.
 
 
  Software Implementation [Scott]
 
 Blastz, in common with other members of the blast family computes a set
 of local alignments in stages.
 
 The first stage reads a target sequence and builds a hash table relating
 short patterns to locations.
 
 Traditionally the patterns were k-mers, but following PatternHunter
 blastz now allows a discontinuous selection of 12 base-pairs out of 19.
 Because each bp consumes two bits, this requires 64 bit integers.
 Compilers for 32 bit machines typically provide a synthesized 64 bit
 data type (ISO C 1999 defines a standard int64_t), which is reasonably
 inexpensive in the event that we only use simple logical operations.
 At the same time, specialized bit extraction operators have been proposed
 by some architects, and would be useful in this application.
 
 We also store all possible transitions of a given sample, which
 improves sensitivity at the expense of an order of magnitude greater
 space consumption.
 
 The current implementation sizes the table such that the load factor
 will be very low, using 2^24 entries.  Space expended on the table is
 balanced by not requiring that the hash code be stored in every node,
 since every bit is used to select the current bucket.
 
 The second stage reads a query sequence, looking up each 12of19 pattern
 in the blast table, then doing the same for the reverse compliment of the
 query sequence.  Each "high scoring pair" (HSP) that is selected in
 this procedure is then extended into an ungapped alignment using a fixed
 but well tuned scoring scheme.  Profiling shows that these two steps
 typically consume the bulk of the runtime in an alignment.  In the case
 where no matches are found, the hashing and lookup operations dominate.
 In the case where matches are found, the ungapped extension dominates.
 In a few other, less common, cases the following steps consume significant
 time.
 
 During this stage, but before extension, blastz supports the option of
 chaining the HSP, reducing a cloud of hits to a single high scoring path.
 
 The third stage involves a full blown dynamic programming step,
 implementing the gapped blast algorithm.  Two innovations play a role
 at this stage.  First, after aligning each contig, blastz counts
 the number of times each base pair participated in an alignment.
 In the event that this crosses a user supplied threshold, that base is
 dynamically masked, and so will not participate in future alignments.
 Second, blastz examines the list of alignments that were just generated,
 and in the event of closely paired results, it attempts to fill the
 gap between them by realigning those regions at reduced stringency.
 The current implementation somewhat arbitrarily uses 7-mers, and chaining
 in this step.
 
 
 * 
 
 In it's originally intended role, blastz consumed either two long
 sequences, or one long sequence and a moderately sized collection of
 contigs, with an output record reflecting each of these.  Whole genome
 alignments required adjustments in a number of areas.
 
 In our earliest experiments, finished human against unassembled mouse
 reads, it was necessary to suppress output describing non-aligning
 reads, simply to limit the size of the output we generated.  In later
 experiments, when the mouse was well assembled, the situation was closer
 to the originally conceived scenario of fairly long inputs.  In this
 case, the main constraint was the 32bit address space on our CPUs, and
 the amount of RAM available to each processor (512MB).  After a number
 of trials, we decided to chop each human chromosome into 1Mbase long
 sections with 10Kbase overlap, consuming approximately 128MB of memory.
 The mouse sequence was chopped into 30Mbase sections, consuming about
 240MB.  Additional space up to the limit was allocated for dynamic
 programming and other bookkeeping.
 
 Throughout the code, emphasis was on simplicity and maintainability.
 While we sometimes expended effort to tune particular sections, by far
 the greatest performance improvements came from large scale algorithmic
 adjustments.  Thus, the organization of the program reflects our need to
 make changes, rather than to efficiently realize current tactics.  So,
 for example, the program occasionally chooses to consume extra memory
 where it simplified or improved the implementation of some procedure.
 Although we are happy with this tradeoff, profiling data revealed cases
 where optimization was required, particularly in scenario, where large
 numbers of small or medium sized contigs would be processed in each run.
 There, we went to some trouble to minimize the overhead of memory
 management.  The most significant of these changes involved the use of
 a sparse array in the dynamic programming code.  Normally one maintains
 an array to record the computation's progress along each diagonal,
 but measurements found that the cost of maintaining (initializing in
 particular) that array was a significant overhead.
 
 While traditional implementations used sophisticated linear space
 techniques in the computation of the alignment, this version uses a large,
 but fixed sized traceback buffer.  In this way, we achieved significant
 simplifications in the implementation and in memory management.
 
 * 
 
 To align a pair of genomes, we precompute a list of jobs, essentially
 the cross product of the N/1MBase intervals of the first genome, by the
 M/30Mbase intervals of the second.  The scheduling software available on
 the 1024 CPU cluster that we use doesn't support processor affinity of any
 sort; each job is entirely independent. 
 
 Blastz is able to read traditional FASTA format inputs, but for this
 application we prefer inputs in a packed, two base-pair per byte format,
 which reduces the I/O by half, and simplifies and accelerates the
 code that has to read the sequences.  We have sufficient local disk on
 each node to provide a copy of all the input that might be required.
 Only the output needs to be communicated to a shared resource, and this
 is the main I/O bottleneck in our procedure, due to the latency of using
 a network filesystem.
 
 Dynamic masking was invented to handle cases like chr19, where zinc finger
 or olfactory receptors match a huge number of times, but are not flagged
 as repeats.  Oblivious scheduling partially defeats this optimization,
 however, since each section of the target sequence will be compared
 many times to different parts of the query sequence, and not be able
 to share dynamic masking information.  Similarly, we could reduce disk
 space requirements by devoting some nodes to particular portions of the
 target sequence, at the expense of significant administrative overhead.
 
 While we expect to see some improvement from better scheduling, the
 current system works sufficiently well that we don't feel the need for
 additional tuning at the present time.
 
 *
 
 Once the principle phase has been completed, we are left with a large
 collection of output files containing blastz alignments of many sections
 of the genome.  We then perform several post-processing steps, to refine
 and improve the results.  One incidental operation is to transform
 the output, "lav" format, into "axt" format.  The former is a space
 efficient format which describes the alignments by coordinates within
 named sequences.  The axt format is a textual representation that includes
 the actual base pairs.  While this is very space inefficient, for many
 post processing steps the improved locality of reference (avoiding the
 need to retrieve parts of multi gigabyte datasets) is a clear necessity.
 
 A second, and more important post-processing step is to compute the
 single-coverage rendition of the alignment with axtBest.
 Other sorts of post-processing include the generation of Percent Identity
 Plots, Dotplots, and tracks on the Genome Browser.
 
 
 
 Program Execution [Santa Cruz configuration and performance data -- Scott]
 
 To align 2.8 Gb of human sequence vs. 2.5 Gb of mouse sequence
 took 481 hours of CPU time and a half day of wall clock time on
 a cluster of 1000 833 Mhz Pentium III CPUs.  This produced 9 Gig
 of output in a relatively dense format, .lav.  The .lav files were
 expanded to another format, .axt, which includes the mouse and
 human sequence used in the alignment.  The axt files were 20 Gig.
 The axt files after axtBest were 2.5 Gig.  Only 3.3% of the human 
 genome is covered by multiple alignments, but some of these places,
 particularly on chromosome 19, are covered to a great depth.
 
 
 Software Evaluation
 
 While Blastz's sensitivity (success at identifying orthologous,
 or at least homologous, regions) was less critical than its specificity
 (success at not aligning non-homologous regions), we were interesting in
 estimating how much of the currently unaligned regions could rightfully be
 aligned.
 One hypothesis is that unaligned regions are actually orthologous, but
 have suffered so many small mutations that Blastz no longer recognizes them as
 related.
 Another hypothesis is that the unaligned regions are not ortholgous to
 sequence in the other genome, but instead were either inserted in one
 species or deleted from the other after the human-mouse split.
 The Nature paper (ref) argues that the second hypothesis correctly explains
 most of the unaligned regions.
 More precisely, the amount of DNA aligned by Blastz, 40% of the human genome,
 is in good agreement with the amount that should align (i.e, that shares a
 common ancestor with a region of the mouse genome).
 We will not repeat that reasoning here.
 
 Knowing that we're aligning approximately the proper number of nucleotides
 in essence makes sensitivity equivalent to specificity, reasoning as follows.
 High specificity implies high sensitivity, since if few alignments are
 incorrect, then almost all of the aligned 40% is properly aligned,
 which is almost everything that should align.
 High sensitivity implies high specificity, since if almost all of the 40%
 that should align in fact does align, then this accounts for almost all of
 the alignments, leaving room for few incorrect alignments.
 Thus, we are left with the problem of bounding the number of incorrect
 alignments.
 
 For an answer, we estimated the number of incorrect alignments that Blastz
 would produce when comparing sequences of the same lengths and nucleotide
 compositions as the human and mouse genomes.
 Our strategy was to compare the human sequence with the reverse (NOT THE
 REVERSE COMPLEMENT) of the mouse sequence.
 Reversal retains the local complexity (e.g., a tandem dinucleotide repeat is
 preserved) and guarantees that all generated alignments are inappropriate.
 
 [jk new] Another test of the specificity of blastz alignments is based
 on synteny.  Human chromosome 20 is considered to be completely 
 syntenic to parts of mouse chromosome 2.  After filtering through axtBest
 only 3.3% of the human chromosome 20 bases in alignments aligned outside
 of mouse chromosome 2.   A certain degree of non-syntenic alignment is
 to be expected from processed psuedogenes.  Since only approximately
 96% of the mouse genome is sequenced, in some cases a paralog on another
 chromosome will fill in for a non-sequenced ortholog on chromosome 2 as
 well.