Back
Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources
We present a novel information theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, when implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm seems to be self regulated and automatically avoids over segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages where it is able to infer correctly both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method‘s ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.
@inproceedings{6588, title = {Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources}, journal = {In the proceeding of the 18th International Conference on Machine Learning (ICML 2001)}, abstract = {We present a novel information theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, when implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm seems to be self regulated and automatically avoids over segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages where it is able to infer correctly both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method‘s ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.}, pages = {513-520}, organization = {Max-Planck-Gesellschaft}, school = {Biologische Kybernetik}, year = {2001}, slug = {6588}, author = {Seldin, Y. and Bejerano, G. and Tishby, N.} }