
Sunday, January 15, 2012

Neuro Fuzzy Reasoner

A neuro-fuzzy reasoner (NFR) is a software component capable of learning a set of predefined fuzzy rules. The fuzzy approach enables approximate reasoning and is well suited to modeling human decision processes.

An example rule:

IF (TEST_RESULT IS HIGH) THEN (STUDENT_CLASS IS EXCELLENT)
This rule says that if a student has a high result on a test, the student is classified as excellent. The expression (TEST_RESULT IS HIGH) is the premise, and the expression (STUDENT_CLASS IS EXCELLENT) is the consequence of this fuzzy rule. TEST_RESULT and STUDENT_CLASS are linguistic variables, and their corresponding values are HIGH and EXCELLENT. The premise of a fuzzy rule is always a fuzzy value, but the consequence may be a fuzzy or a crisp value. In this example, HIGH is a fuzzy set, and EXCELLENT is a crisp value: the class representing the classification of the student.
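As a small illustration of how such a rule fires, here is a sketch in Python. The membership function shape (linear ramp from 60 to 90) and the 0.5 firing threshold are made-up assumptions for the example, not part of NFR:

```python
def high_membership(score):
    """Illustrative fuzzy set HIGH over test results in [0, 100]:
    membership rises linearly from 0 at score 60 to 1 at score 90."""
    if score <= 60:
        return 0.0
    if score >= 90:
        return 1.0
    return (score - 60) / 30.0

def classify(score, threshold=0.5):
    """Fire the rule (TEST_RESULT IS HIGH) => (STUDENT_CLASS IS EXCELLENT).
    The premise yields a fuzzy degree of truth; the consequence is crisp."""
    degree = high_membership(score)
    label = "EXCELLENT" if degree >= threshold else "NOT EXCELLENT"
    return label, degree
```

A score of 85 gives a membership degree of about 0.83 in HIGH, so the rule fires and the crisp class EXCELLENT is assigned.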

Application example:

Let us take an example of student modeling using a neuro-fuzzy system. The system enables classification of students based on qualitative observations of their characteristics. Since the fuzzy model is very close to a verbal model, NFR makes it easy to create a fuzzy rule system according to the expert's knowledge.

A NFR model for student classification based on test results and the time needed to complete the test is described below.

1. Input and output values

Input values:
- Test score [0..100]
- The time needed to complete the test [0..120]

Output values:
- Classes of students: {Bad, Good, Very good, Excellent}

2. Fuzzy sets
The input space is partitioned by the following fuzzy sets:
- Test score: Bad, Low, Mid, High
- The time needed to complete the test, interpreted as speed: Slow, Moderate, Fast

3. Creating and training the neural network

When the fuzzy model is defined, the construction of the corresponding NFR model is straightforward. The NFR model that corresponds to the previously defined fuzzy model is shown in fig. 4.

The network is constructed using the following principles:
1. The number of cells in the input layer L1 is equal to the number of inputs;
2. The number of cells in the fuzzification layer L2 is equal to the number of fuzzy sets;
3. The number of cells in the premise layer is equal to the number of rules;
4. The number of cells in the output layer is equal to the number of classification classes;
5. The connection pattern is the same for all NFR models and it is shown in fig. 4.
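The sizing principles above can be sketched as a small helper. Note that the rule count of 12 used below is an assumption (one rule per combination of the 4 test-score sets and 3 speed sets); the text does not state how many rules the model actually uses:

```python
def nfr_layer_sizes(n_inputs, n_fuzzy_sets, n_rules, n_classes):
    """Cell counts for the four NFR layers, following the
    construction principles listed above."""
    return {
        "L1_input": n_inputs,              # one cell per input
        "L2_fuzzification": n_fuzzy_sets,  # one cell per fuzzy set
        "L3_premise": n_rules,             # one cell per rule
        "L4_output": n_classes,            # one cell per class
    }

# Student classification: 2 inputs (score, time), 7 fuzzy sets
# (Bad/Low/Mid/High for score, Slow/Moderate/Fast for speed),
# an assumed 12 rules (4 x 3 combinations), and 4 output classes.
sizes = nfr_layer_sizes(2, 7, n_rules=12, n_classes=4)
```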

Experimental evaluation of the model above on real-world data resulted in close to 90% accuracy.

NFR offers a customized, multidimensional classification and a well-structured, semantically rich classification model. It is important to note, however, that NFR does not offer real-time evaluation. In the example above, NFR can be used to evaluate the overall test result by observing several parameters defined by the teacher.

Friday, January 13, 2012

Web Usage Mining - Part 1


Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services.

Web mining is categorized into three types:
  1. Content mining (examines the content of web pages as well as the results of web searching)
  2. Structure mining (exploits the hyperlink structure)
  3. Usage mining (analyzes user web navigation)
Web usage mining is one of the prominent research areas for the following reasons:
  • One can keep track of previously accessed pages of a user. These pages can be used to identify the typical behavior of the user and to make prediction about desired pages. Thus personalization for a user can be achieved through web usage mining. 
  • Frequent access behavior of users can be used to identify needed links and improve the overall performance of future accesses. Prefetching and caching policies can be based on frequently accessed pages to reduce latency. 
  • Common access behaviors of the users can be used to improve the actual design of web pages and for making other modifications to a Web site. 
  • Usage patterns can be used for business intelligence in order to improve sales and advertisement by providing product recommendations.
There are two classes of data mining: i) descriptive, which summarizes or characterizes general properties of the data in a repository, and ii) predictive, which performs inference on current data to make predictions based on historical data. There are various data mining techniques available, which can also be applied to web data. A few techniques are listed below.

1) Association Rule Mining: When the book Data Mining Concepts and Techniques is bought, 40% of the time the book Database System is bought together with it, and 25% of the time the book Data Warehouse is bought together with it. Such rules, discovered from the transaction database of the book store, can be used to rearrange the placement of related books, which can in turn make the rules even stronger.
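A minimal sketch of how the confidence of such a rule is computed. The baskets below are hypothetical transactions constructed so that the rule's confidence comes out to the 40% figure mentioned above:

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent: among the
    transactions containing the antecedent, the fraction that also
    contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)

# Hypothetical basket data (each basket is a set of book titles):
baskets = [
    {"Data Mining Concepts and Techniques", "Database System"},
    {"Data Mining Concepts and Techniques", "Database System"},
    {"Data Mining Concepts and Techniques", "Data Warehouse"},
    {"Data Mining Concepts and Techniques"},
    {"Data Mining Concepts and Techniques"},
]
conf = confidence(baskets, {"Data Mining Concepts and Techniques"},
                  {"Database System"})  # 2 of 5 baskets, i.e. 0.4
```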

2) Sequential Pattern Mining: Association rule mining does not take the timestamp into account; the rule can be Buy A => Buy B. If we take the timestamp into account, we can get more accurate and useful rules, such as: Buy A implies Buy B within a week, or people usually Buy A every week. With the second kind of rule, business organizations can make more accurate and useful predictions and consequently make sounder decisions. A database consisting of sequences of values or events that change with time is called a time-series database; it records the valid time of each dataset. For example, in a time-series database that records the sales transactions of a supermarket, each transaction includes an extra attribute indicating when the transaction happened. Time-series databases are widely used to store historical data in a diversity of areas such as financial data, medical data, scientific data and so on. Different mining techniques have been designed for mining time-series data; basically there are four kinds of patterns we can get from various types of time-series data: 1) trend analysis, 2) similarity search, 3) sequential patterns and 4) periodical patterns.

Sequential patterns: sequential pattern mining tries to find relationships between occurrences of sequential events, i.e., whether there exists any specific order of occurrences. We can find sequential patterns for specific individual items, and we can also find sequential patterns across different items. Sequential pattern mining is widely used in the analysis of DNA sequences. An example of a sequential pattern is: every time Microsoft stock drops 5%, IBM stock also drops at least 4% within three days.

3) Classification: Classification is to build (automatically) a model that can classify a class of objects so as to predict the classification or missing attribute value of future objects (whose class may not be known). It is a two-step process. In the first step, based on a collection of training data, a model is constructed to describe the characteristics of a set of data classes or concepts. Since the data classes or concepts are predefined, this step is also known as supervised learning (i.e., the class to which each training sample belongs is provided). In the second step, the model is used to predict the classes of future objects or data. For example, a decision tree for the class buy laptop indicates whether or not a customer is likely to purchase a laptop. Each internal node represents a decision based on the value of a corresponding attribute, and each leaf node represents a class (the value of buy laptop = Yes or No). After this buy-laptop model has been built, we can predict the likelihood of buying a laptop based on a new customer's attributes such as age, degree and profession. That information can be used to target customers of certain products or services; it is especially widely used in insurance and banking.
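A hand-built sketch of such a decision tree. The attributes, thresholds and branch structure below are illustrative assumptions, not a tree learned from real data:

```python
def buys_laptop(customer):
    """Hand-built decision tree for the buy-laptop example.
    Internal nodes test an attribute's value; each path ends
    in a leaf holding the class (Yes / No)."""
    if customer["age"] <= 30:
        # young customers: the deciding attribute is student status
        return "Yes" if customer["student"] else "No"
    if customer["age"] <= 50:
        # middle-aged customers buy in this illustrative tree
        return "Yes"
    # older customers: the deciding attribute is credit rating
    return "Yes" if customer["credit"] == "excellent" else "No"

print(buys_laptop({"age": 25, "student": True, "credit": "fair"}))  # Yes
```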

4) Clustering: Classification can be taken as a supervised learning process; clustering is another mining technique similar to classification, but it is an unsupervised learning process. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects, so that objects within the same cluster are similar to some extent and dissimilar to objects in other clusters. In classification, which record belongs to which class is predefined, while in clustering there are no predefined classes; objects are grouped together based on their similarities. Similarities between objects are defined by similarity functions; usually similarities are quantitatively specified as distance or other measures by the corresponding domain experts. For example, based on the expense, deposit and withdrawal patterns of its customers, a bank can cluster the market into different groups of people. For each market group, the bank can provide different kinds of loans for houses or cars with different budget plans. In this way the bank can provide better service and also make sure that all the loans can be reclaimed.
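A minimal k-means sketch of this grouping idea, using squared Euclidean distance as the similarity function. The 2-D points (think of a customer's (expense, deposit) pair) and the deterministic initialization are illustrative choices:

```python
def kmeans(points, k, iters=10):
    """Minimal k-means on 2-D points. Initializes centroids to the
    first k points (for determinism), then alternates between
    assigning each point to its nearest centroid and recomputing
    each centroid as the mean of its cluster."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Two visually obvious groups; k-means recovers them.
groups = kmeans([(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)], k=2)
```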

To be continued in Part 2.

Saturday, January 7, 2012

Sequence Classification

Generally, a sequence is an ordered list of events. An event can be represented as a symbolic value, a numerical real value, a vector of real values or a complex data type.


Given an alphabet of symbols {E1, E2, E3, ..., En}, a simple symbolic sequence is an ordered list of symbols from the alphabet. For example, a DNA sequence is composed of the four nucleotides A, C, G, T, and a DNA segment, such as ACCCCCGT, is a simple symbolic sequence.

A simple time series is a sequence of real values ordered in timestamp-ascending order. For example, <(t1, 0.1), (t2, 0.3), ..., (tn, 0.3)> is a simple time series recording data from timestamp t1 to tn.

The task of sequence classification is to learn a sequence classifier, which is a function mapping a sequence to a class. In sequence classification, each sequence is associated with only one class. For example, a time series of ECG data may come from a healthy or ill person. A DNA sequence may belong to a gene coding area or a non-coding area.


Let us see an example:

Figure 1: A stochastic automaton for generating symbol sequences.
Figure 1 shows a stochastic automaton for generating sequences. To produce a string of symbols, we begin in the start state. From there we move to state 1 and emit the symbol "a". In state 1 there is a choice of three alternatives: we can go to state 1 again and emit another "a"; we can go to state 2 and emit a "b"; or we can go to state 3 and emit an "e". Whenever there are alternatives, the automaton selects among them with equal probability. States 4 and 5 are final states and correspond to sequence endings. We say that a sequence ending in state 4 is of class 1, and a sequence ending in state 5 is of class 2.

Here are typical sequences of class 1: abcd, aaaaaaabcd, aaabccccccd.
Here is a typical sequence of class 2: aaaeccccccd.

The goal for a (yet unspecified) sequence classifier is to read sequences, one symbol at a time, and to learn to classify them with respect to whether they belong to class 1 or class 2. The classifier does not know anything about the automaton; all it sees are the successive symbols and a teacher signal at the end of each sequence. The teacher signal provides the information about the desired classification. Consider the last two example sequences above. They illustrate that the difference between class 1 and class 2 depends only on the first letter in the sequence that is not an "a". If this letter is a "b", the sequence is of class 1; if it is an "e", the sequence is of class 2. Therefore, the classifier somehow has to learn to store the occurrence of the first letter that is not an "a" and to memorize it for an unknown number of discrete time steps, until the end of the sequence is reached. Otherwise it will not be able to classify correctly.
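Once this regularity is known, the classification rule itself is simple to state in code; what the text describes is the hard part of learning it from examples alone. A sketch of the target rule:

```python
def classify_sequence(seq):
    """Classify a sequence generated by the automaton of Figure 1.
    The class is determined by the first symbol that is not an 'a':
    'b' -> class 1, 'e' -> class 2. The classifier must remember
    that symbol until the end of the sequence."""
    for symbol in seq:
        if symbol != "a":
            return 1 if symbol == "b" else 2
    return None  # no non-'a' symbol was seen

# classify_sequence("aaaaaaabcd") -> 1, classify_sequence("aaaeccccccd") -> 2
```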

The sequence classification methods can be divided into three large categories.
  1. The first category is feature based classification, which transforms a sequence into a feature vector and then applies conventional classification methods. Feature selection plays an important role in this kind of method.
  2. The second category is sequence distance based classification. The distance function which measures the similarity between sequences determines the quality of the classification significantly.
  3. The third category is model based classification, such as using hidden Markov models (HMMs) and other statistical models to classify sequences.
Let us explore sequence distance based classification. Sequence distance based methods define a distance function to measure the similarity between a pair of sequences. Once such a distance function is obtained, we can use existing classification methods, such as the K nearest neighbor classifier (KNN) and SVM, for sequence classification.

For simple time series classification, Euclidean distance is a widely adopted option. For two equal-length time series s and s', the Euclidean distance is:

dist(s, s') = sqrt( Σ_i (s_i - s'_i)² )
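A minimal sketch of distance based classification built on Euclidean distance. The training series and their labels below are made up for illustration:

```python
import math

def euclidean(s, t):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def knn_classify(query, labeled_series, k=1):
    """k-NN sequence classification: label the query with the majority
    class among the k labeled series nearest to it."""
    nearest = sorted(labeled_series,
                     key=lambda item: euclidean(query, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Hypothetical labeled ECG-like series (values and labels are made up):
training = [([0.1, 0.2, 0.1], "healthy"), ([0.9, 1.0, 0.8], "ill")]
```

With this training data, `knn_classify([0.15, 0.2, 0.1], training)` returns "healthy", since the query lies much closer to the first series.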

Classification techniques:

Support vector machines (SVM)
K nearest neighbor classifier (KNN)
Naive Bayes classifier
Markov Model and Hidden Markov Model

Challenges in sequence classification:

There are three major challenges in sequence classification.
  1. Most classifiers, such as decision trees and neural networks, can only take input data as a vector of features. However, there are no explicit features in sequence data.
  2. Even though various feature selection methods can transform a sequence into a set of features, feature selection is far from trivial. The dimensionality of the feature space for sequence data can be very high and the computation can be costly. 
  3. Besides accurate classification results, in some applications we may also want an interpretable classifier. Building an interpretable sequence classifier is difficult since there are no explicit features.

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection.

Other interesting examples include classifying query log sequences to distinguish web-robots from human users and classifying transaction sequence data in a bank for the purpose of combating money laundering.

Sequence classification is very useful in computer intrusion detection. We can extract the profile of a user from his or her UNIX command sequences and then classify a given sequence against one of the user profiles. The user profiles can be built off-line.

In pursuit of Insight!

This is my new blog. You can read my other blog 'In pursuit of Happiness'.

Finding patterns in data and the valuable insights they hide from the naked eye has always intrigued me. Though I do not have any formal training in data mining or machine learning, my interest in this field caught on when I was working on recommender software. Association rules and collaborative filtering got me hooked.

This blog is a by-product of my interest and research in this area. I hope engineers and analysts who are not formally trained in these disciplines will benefit from it.

Please feel free to share your thoughts.