An Extreme Value Theory for Sequence Matching
Arratia, Richard ; Gordon, Louis ; Waterman, Michael
Ann. Statist., Tome 14 (1986) no. 2, p. 971-993 / Harvested from Project Euclid
Consider finite sequences $X_1, X_2,\cdots, X_m$ and $Y_1, Y_2,\cdots, Y_n$ where the letters ${X_i}$ and ${Y_i}$ are chosen i.i.d. on a countable alphabet with $p=P(X_1=Y_1)\in(0,1)$ We study the distribution of the longest contiguous run of matches between the X's and Y's allowing at most k mismatches. The distribution is closely approximated by that of the maximum of (1 - p)mn i.i.d. negative binomial random variables. The latter distribution is in turn shown to behave like the integer part of an extreme value distribution. The expectation is approximately $\log(qmn)+k\log\log(qmn)+k\log(q/p)-\log(k!)+\gamma\log(e)-\frac{1}{2}$, where q = 1 - p, log denotes logarithm base 1/p, and y is the Euler constant. The variance is approximated by $(\pi\log(e))2/6+\frac{1}{2}$. The paper concludes with an example in which we compare segments taken from the DNA sequence of the bacteriophage lambda.
Publié le : 1986-09-14
Classification:  Extreme value,  matching,  Poisson process,  inclusion-exclusion,  DNA sequences,  62E20,  62P10
@article{1176350045,
     author = {Arratia, Richard and Gordon, Louis and Waterman, Michael},
     title = {An Extreme Value Theory for Sequence Matching},
     journal = {Ann. Statist.},
     volume = {14},
     number = {2},
     year = {1986},
     pages = { 971-993},
     language = {en},
     url = {http://dml.mathdoc.fr/item/1176350045}
}
Arratia, Richard; Gordon, Louis; Waterman, Michael. An Extreme Value Theory for Sequence Matching. Ann. Statist., Tome 14 (1986) no. 2, pp.  971-993. http://gdmltest.u-ga.fr/item/1176350045/