OCHOTERENA, HELGA* AND MARK PITKIN SIMMONS. L.H. Bailey Hortorium, Cornell University, Ithaca, NY 14853. - Justification and methods for incorporating gap characters in sequence-based phylogenetic analyses.
We discuss the theoretical arguments for, and implications of, the
different methods of treating gaps in phylogenetic analyses. We
consider alignment (making hypotheses of primary homology) and tree
searches (testing hypotheses of primary homology) to be logically
independent steps in phylogenetic analysis. Although both operations
may be incorporated into a single step, they need not be. Gaps do not
occur in organisms and therefore cannot be directly observed.
However, gaps are as much a part of the pattern of aligned sequences
as bases are. Because this aligned pattern is used to code characters
for tree searches, the informative variation from gaps should be
incorporated along with base characters into tree searches. We argue
that gaps are properly coded as separate, equally weighted,
presence/absence characters (not 5th character states for nucleotides
or 21st character states for amino acids). Although gaps are
alternative forms of aligned positions, gaps are not alternative forms
of positions that do not exist in organismal sequences. In addition
to evidence that contiguous gap positions originate as single indel
events, the parsimony criterion favors coding contiguous gap
positions as single characters because the positions co-occur as one
pattern. Following from these arguments, we propose
two methods by which gaps coded as characters can be implemented in
tree searches. Simple indel coding, in which all gaps are coded as
separate binary characters, is easy to implement but does not utilize
all available information and can cause ambiguous optimizations of gap
characters. Complex indel coding, in which non-overlapping gaps are
coded as separate binary characters and overlapping gaps that share a
common terminus (or are subsets of longer gaps) are coded as separate
multistate characters in step matrices, is harder to implement but
allows all available information to be utilized.
Key words: deletions, gap characters, indels, insertions, molecular systematics, sequence alignment
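
The simple indel coding described above can be illustrated with a minimal sketch. It is one possible reading of the method, not the authors' implementation. The function names (`gap_intervals`, `simple_indel_coding`), the use of "-" as the gap symbol, and the toy alignment are all hypothetical. Two rules are assumptions that the abstract does not state: a taxon whose longer gap completely encloses the gap being coded is scored "?" (inapplicable), and leading or trailing gaps receive no special treatment.

```python
import re


def gap_intervals(seq, gap="-"):
    """Return half-open (start, end) spans of contiguous gap runs in one aligned sequence."""
    return [m.span() for m in re.finditer(re.escape(gap) + "+", seq)]


def simple_indel_coding(alignment, gap="-"):
    """Code each distinct gap (unique start and end) as one binary character.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    Returns (characters, matrix), where characters is a sorted list of
    (start, end) spans and matrix maps each taxon to a string of scores:
      "1" = this exact gap is present,
      "0" = the gap is absent,
      "?" = assumption: a longer gap encloses this one, so it is scored inapplicable.
    """
    if len({len(seq) for seq in alignment.values()}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    per_taxon = {name: gap_intervals(seq, gap) for name, seq in alignment.items()}
    characters = sorted({span for spans in per_taxon.values() for span in spans})

    matrix = {}
    for name, spans in per_taxon.items():
        row = []
        for start, end in characters:
            if (start, end) in spans:
                row.append("1")
            elif any(s <= start and end <= e for s, e in spans):
                row.append("?")
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return characters, matrix


# Hypothetical toy alignment: the gaps in B and C overlap and share a 5' terminus.
alignment = {
    "A": "ACGTACGTAC",
    "B": "AC--ACGTAC",
    "C": "AC----GTAC",
    "D": "ACGTAC--AC",
}
characters, matrix = simple_indel_coding(alignment)
print(characters)
for name, row in matrix.items():
    print(name, row)
```

In this toy alignment the gaps in B and C share a terminus. Under complex indel coding they would instead be combined into a single multistate character with a step matrix, which this sketch does not attempt.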