In pursuit of Insight!: Sequence Classification

Generally, a sequence is an ordered list of events. An event can be represented as a symbolic value, a numerical real value, a vector of real values or a complex data type.

Basics:

Given an alphabet of symbols {E1, E2, E3, ..., En}, a simple symbolic sequence is an ordered list of symbols from the alphabet. For example, DNA is composed of four nucleotides, A, C, G, and T, and a DNA segment such as ACCCCCGT is a simple symbolic sequence.

A simple time series is a sequence of real values ordered by ascending timestamp. For example, <(t1, 0.1), (t2, 0.3), ..., (tn, 0.3)> is a simple time series recording data from timestamp t1 to tn.

The task of sequence classification is to learn a sequence classifier, which is a function mapping a sequence to a class. In sequence classification, each sequence is associated with only one class. For example, a time series of ECG data may come from a healthy or ill person. A DNA sequence may belong to a gene coding area or a non-coding area.

Example:

Let us see an example:

Figure 1: A stochastic automaton for generating symbol sequences.

Figure 1 shows a stochastic automaton for generating sequences. To produce a string of symbols, we begin in the start state. From there we move to state 1 and emit the symbol "a". In state 1 there are three alternatives: we can go to state 1 again and emit another "a", go to state 2 and emit a "b", or go to state 3 and emit an "e". Whenever there are alternatives, the automaton selects among them with equal probability. States 4 and 5 are final states and correspond to sequence endings. A sequence that ends in state 4 is of class 1; a sequence that ends in state 5 is of class 2.

Here are typical sequences of class 1: abcd, aaaaaaabcd, aaabccccccd.
Here is a typical sequence of class 2: aaaeccccccd.
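
To make the generating process concrete, here is a rough Python sketch of such an automaton. Only the start state and state 1 follow the description above. The transitions out of states 2 and 3 are not spelled out in the text, so the ones used here (loop on "c", then emit "d" and stop) are guesses based on the example strings, and the names TRANSITIONS, FINAL_CLASS and generate are made up for this illustration.

```python
import random

# Hypothetical transition table: state -> list of (next_state, emitted_symbol).
# "start" and state 1 follow the description in the text.
# States 2 and 3 are ASSUMED from the example strings, not taken from Figure 1.
TRANSITIONS = {
    "start": [(1, "a")],
    1: [(1, "a"), (2, "b"), (3, "e")],
    2: [(2, "c"), (4, "d")],  # assumed
    3: [(3, "c"), (5, "d")],  # assumed
}
FINAL_CLASS = {4: 1, 5: 2}  # final state -> class label


def generate(rng):
    """Walk the automaton until a final state; return (sequence, class)."""
    state, symbols = "start", []
    while state not in FINAL_CLASS:
        # Alternatives are chosen with equal probability.
        state, symbol = rng.choice(TRANSITIONS[state])
        symbols.append(symbol)
    return "".join(symbols), FINAL_CLASS[state]


rng = random.Random(0)  # seeded so that runs are repeatable
for _ in range(5):
    print(generate(rng))
```

Under this assumed table, short strings without any "c" can also appear, so the real automaton in Figure 1 may well differ in its later states.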

The goal for a (yet unspecified) sequence classifier is to read sequences, one symbol at a time, and to learn to classify them as class 1 or class 2. The classifier does not know anything about the automaton. All it sees are the successive symbols and a teacher signal at the end of each sequence, which provides the desired classification. Consider the last two example sequences above, aaabccccccd and aaaeccccccd. They illustrate that the difference between class 1 and class 2 depends only on the first letter in the sequence that is not an "a". If this letter is a "b", the sequence is of class 1. If it is an "e", the sequence is of class 2. Therefore, the classifier somehow has to learn to store the first letter that is not an "a" and to remember it for an unknown number of discrete time steps, until the end of the sequence is reached. Otherwise it will not be able to classify correctly.
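
The memory requirement can be illustrated with a tiny hand-written rule. This is not a learned classifier, only a sketch of what a learned one would have to discover: it reads one symbol at a time, keeps a single remembered symbol, and decides at the end. The function name classify_stream is made up for this example.

```python
def classify_stream(symbols):
    """Read symbols one at a time and remember the first one that is not 'a'."""
    remembered = None  # the only state carried across time steps
    for symbol in symbols:
        if remembered is None and symbol != "a":
            remembered = symbol  # stored once, kept until the sequence ends
    # Decide only after the whole sequence has been read.
    return {"b": 1, "e": 2}.get(remembered)  # None if neither "b" nor "e" came first


for seq in ["abcd", "aaaaaaabcd", "aaabccccccd", "aaaeccccccd"]:
    print(seq, "->", classify_stream(seq))
```

The number of "c" symbols between the stored letter and the end of the sequence does not matter to this rule, which is exactly the behaviour a trained classifier has to pick up from the teacher signal alone.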

Sequence classification methods can be divided into three broad categories.

The first category is feature based classification, which transforms a sequence into a feature vector and then applies conventional classification methods. Feature selection plays an important role in this kind of method.
The second category is sequence distance based classification. The distance function, which measures the similarity between sequences, largely determines the quality of the classification.
The third category is model based classification, which uses a hidden Markov model (HMM) or other statistical models to classify sequences.
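
As a small illustration of the first category, the sketch below turns a symbolic sequence into a fixed-length vector by counting its k-grams (substrings of length k). Using k-grams is only one possible choice of features and is an assumption of this example. The function name kgram_features and the default k=2 are also made up here.

```python
from collections import Counter
from itertools import product


def kgram_features(sequence, alphabet="ACGT", k=2):
    """Count every length-k substring and return the counts in a fixed order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # One vector position per possible k-gram, so every sequence maps
    # to a vector of the same length: len(alphabet) ** k.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[gram] for gram in vocabulary]


vector = kgram_features("ACCCCCGT")
print(len(vector))  # one entry per possible 2-gram
print(vector)
```

The vector length grows as the alphabet size to the power k, so even modest values of k produce a large feature space.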

Let us explore sequence distance based classification. Sequence distance based methods define a distance function to measure the similarity between a pair of sequences. Once such a distance function is available, we can use existing classification methods, such as the K nearest neighbor classifier (KNN) or a support vector machine (SVM), for sequence classification.

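For simple time series classification, Euclidean distance is a widely adopted option. For two time series s and s' of the same length L, the Euclidean distance is:

$$\mathrm{dist}(s, s') = \sqrt{\sum_{i=1}^{L} \left(s[i] - s'[i]\right)^2}$$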
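
A minimal sketch of distance based classification: the Euclidean distance between two equal-length series is the square root of the summed squared pointwise differences, and a 1-nearest-neighbor rule assigns a query the label of its closest training series. The function names and the toy "healthy"/"ill" series below are invented for illustration and are not real ECG data.

```python
import math


def euclidean_distance(s, t):
    """Euclidean distance between two time series of equal length."""
    if len(s) != len(t):
        raise ValueError("Euclidean distance needs series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def nearest_neighbor_label(query, labeled_series):
    """1-nearest-neighbor: return the label of the closest training series."""
    _, label = min(labeled_series,
                   key=lambda item: euclidean_distance(query, item[0]))
    return label


# Toy training set: (series, label) pairs with made-up values.
training = [
    ([0.1, 0.3, 0.3, 0.2], "healthy"),
    ([0.9, 1.2, 0.8, 1.1], "ill"),
]
print(nearest_neighbor_label([0.2, 0.3, 0.4, 0.2], training))
```

Euclidean distance compares the series point by point, so it only applies when the series have the same length.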

Classification techniques:

Support vector machines (SVM)
K nearest neighbor classifier (KNN)
Naive Bayes classifier
Markov Model and Hidden Markov Model
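
To give a flavour of the model based techniques in this list, here is a rough sketch of classification with plain first-order Markov chains (not a hidden Markov model): one transition model is estimated per class, and a new sequence gets the class whose model gives it the highest log-likelihood. The add-one smoothing, the function names and the toy training strings are all assumptions of this example. The strings have no biological meaning.

```python
import math
from collections import defaultdict


def train_markov_chain(sequences, alphabet):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {nxt: (counts[prev][nxt] + 1) / total for nxt in alphabet}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))


def classify(seq, models):
    """Pick the class whose Markov chain explains the sequence best."""
    return max(models, key=lambda label: log_likelihood(seq, models[label]))


alphabet = "ACGT"
models = {  # invented toy training data, one model per class
    "coding": train_markov_chain(["ACGACGACG", "CGACGACGA"], alphabet),
    "non-coding": train_markov_chain(["AAAATTTTAA", "TTTTAAAATT"], alphabet),
}
print(classify("GACGAC", models))
```

Smoothing keeps unseen transitions from getting probability zero, which would otherwise make the log-likelihood undefined.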

Challenges in sequence classification:

There are three major challenges in sequence classification.

Most classifiers, such as decision trees and neural networks, can only take input data as a vector of features. However, there are no explicit features in sequence data.
Even though various feature selection methods can transform a sequence into a set of features, feature selection is far from trivial. The dimensionality of the feature space for sequence data can be very high and the computation can be costly.
Besides accurate classification results, in some applications we may also want an interpretable classifier. Building an interpretable sequence classifier is difficult since there are no explicit features.

Applications:

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection.

Other interesting examples include classifying query log sequences to distinguish web-robots from human users and classifying transaction sequence data in a bank for the purpose of combating money laundering.

Sequence classification is very useful in computer intrusion detection. We can extract the profile of a user from their UNIX command sequences and then classify a given sequence as belonging to one of the user profiles. The user profiles can be built off-line.

Saturday, January 7, 2012
