Ochoterena, Helga* and Mark Pitkin Simmons.

Justification and methods for incorporating gap characters in sequence-based phylogenetic analyses.

We discuss the theoretical arguments for, and implications of, the different methods of treating gaps in phylogenetic analyses. We consider alignment (making hypotheses of primary homology) and tree searches (testing hypotheses of primary homology) to be logically independent steps in phylogenetic analysis. Although both operations may be incorporated into a single step, they need not be. Gaps do not occur in organisms and therefore cannot be directly observed. However, gaps are as much a part of the pattern of aligned sequences as bases are. Because this aligned pattern is used to code characters for tree searches, the informative variation from gaps should be incorporated along with base characters into tree searches. We argue that gaps are properly coded as separate, equally weighted, presence/absence characters (not as fifth character states for nucleotides or twenty-first character states for amino acids). Although gaps are alternative forms of aligned positions, gaps are not alternative forms of positions that do not exist in organismal sequences. In addition to evidence that contiguous gap positions originate as single indel events, the parsimony criterion favors coding contiguous gap positions as single characters because those positions co-occur. Following from these arguments, we propose two methods by which gaps coded as characters can be implemented in tree searches. Simple indel coding, in which all gaps are coded as separate binary characters, is easy to implement but does not utilize all available information and can cause ambiguous optimizations of gap characters. Complex indel coding, in which non-overlapping gaps are coded as separate binary characters and overlapping gaps that share a common terminus (or are subsets of longer gaps) are coded as separate multistate characters in step matrices, is harder to implement but allows all available information to be utilized.

Key words: deletions, gap characters, indels, insertions, molecular systematics, sequence alignment
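
As an illustration of the simple indel coding described in the abstract, the following minimal Python sketch codes each distinct gap (a contiguous run of gap symbols with a particular start and end position) as one binary presence/absence character. Several details are assumptions made for illustration and are not stated in the abstract: the function names (find_gaps, simple_indel_coding) are hypothetical, the gap symbol is assumed to be "-", a taxon whose longer gap completely spans another gap is scored "?" (inapplicable) for that character, and leading or trailing gaps are treated like any other gap rather than as missing data.

```python
from typing import Dict, List, Tuple

GAP = "-"  # assumed gap symbol in the aligned sequences


def find_gaps(seq: str) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) spans of each maximal run of gap symbols."""
    spans = []
    start = None
    for i, ch in enumerate(seq):
        if ch == GAP:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:  # gap running to the end of the sequence
        spans.append((start, len(seq) - 1))
    return spans


def simple_indel_coding(alignment: Dict[str, str]):
    """Code every distinct gap span as one binary character.

    Returns the sorted list of gap spans (one per character) and a
    dict mapping each taxon to its string of character states:
    "1" = gap present, "0" = gap absent,
    "?" = inapplicable (assumption: a longer gap spans this one).
    """
    gaps_by_taxon = {name: find_gaps(seq) for name, seq in alignment.items()}
    characters = sorted({g for gaps in gaps_by_taxon.values() for g in gaps})
    matrix = {}
    for name, gaps in gaps_by_taxon.items():
        row = []
        for (s, e) in characters:
            if (s, e) in gaps:
                row.append("1")
            elif any(gs <= s and e <= ge for gs, ge in gaps):
                row.append("?")
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return characters, matrix


# Hypothetical four-taxon alignment, for illustration only.
example = {
    "A": "ACGTACGTAC",
    "B": "AC--ACGTAC",
    "C": "AC----GTAC",
    "D": "ACGTAC--AC",
}
chars, coded = simple_indel_coding(example)
for taxon in example:
    print(taxon, coded[taxon])
```

In this toy alignment, taxon C is scored "?" for the shorter gap at positions 2 to 3 because its own longer gap covers that region, which is one way the ambiguous optimizations mentioned in the abstract can arise. The resulting binary columns would be appended to the base characters before a tree search. Complex indel coding, which needs step matrices for overlapping gaps, is not attempted here.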