ANGEL - Automated Network Games Enhancement Layer
Machine Learning
This sub-page provides a brief overview of the field of Machine Learning. Machine
Learning is used within the scope of the ANGEL sub-project to classify network
traffic flows. The results of the classification are then intended to be used to
reconfigure the network to improve overall network
application performance.
Machine Learning (ML) usually refers to systems performing tasks associated with
Artificial Intelligence (AI). Such tasks involve recognition, diagnosis, planning,
prediction etc. [Mitchell 1997] defines Machine Learning as follows:
"A computer program is said to learn from experiment E with respect to some
class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E"
[Witten and Frank 2000] state:
"things learn when they change their behavior in a way that makes them perform
better in the future"
Machine Learning can be viewed as a general inductive process that automatically
builds a model from the inherent structure of a dataset. Over the past decade,
Machine Learning has evolved from a field
of laboratory demonstrations to a field of significant commercial value. Machine
Learning techniques have been very successful in the areas of Data Mining, speech
and voice recognition, text recognition, face recognition etc.
A good model should be descriptive (describe the training data), predictive
(generalise well for unseen test data) and explanatory (provide a plausible
description).
Terminology
The following terms are widely used in the field of Machine Learning:
- Instance: An instance is an example in the training data, described by a number
of attributes. One attribute can be a class label.
- Attribute/Feature: An attribute is an aspect of an instance (e.g. temperature,
humidity). Attributes are often called features in Machine Learning. A special attribute is the
class label that defines the class this instance belongs to (required for supervised
learning).
- Classification: A classifier is able to categorise/classify given instances (test data)
with respect to pre-determined/learned classification rules.
- Training/Learning: A classifier learns the classification rules based upon a given set
of instances (training data). Note that some algorithms have no training phase but do
all the work when classifying an instance (so-called lazy learning algorithms).
- Clustering: The process of dividing an unlabeled data set (no class information
given) into clusters that contain similar instances.
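To make these terms concrete, a labelled instance can be represented as a simple mapping of attribute names to values plus a class label. This is an illustrative sketch only, not tied to any particular ML library; the attribute names anticipate the weather example further below.

```python
# A labelled instance: a mapping of attributes (features) to values.
instance = {
    "outlook": "sunny",   # nominal attribute
    "temperature": 85,    # numeric attribute
    "humidity": 85,
    "windy": False,
    "play": "no",         # class label (required for supervised learning)
}

# Separate the ordinary features from the class label, as most learners expect.
features = {k: v for k, v in instance.items() if k != "play"}
label = instance["play"]

print(features)  # {'outlook': 'sunny', 'temperature': 85, 'humidity': 85, 'windy': False}
print(label)     # no
```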
Supervised vs. Unsupervised Learning
The basic notion of supervised learning is that of the classifier. A teacher
helps to construct a classification model by defining classes and providing positive
and negative examples of instances belonging to these classes. The system then
determines the commonalities and differences between the various classes in order
to generate classification rules for unknown instances. The resulting rules assign
a class label to an instance based on the values of the instance's attributes.
In unsupervised learning there are no classes defined a priori. The algorithm
itself divides the instances into different classes and finds descriptions for these
classes. This process is often also referred to as clustering. The resulting rules
will be a summary of some properties of the instances in the database: which classes
are present and what differentiates them. However, this will only be what the
algorithm has found. There may be many other ways of dividing the instances into
classes and of describing each class.
Semi-supervised learning is a combination of supervised and unsupervised learning.
In this approach a user-defined amount of supervision is imposed on the algorithm.
The advantage is that only part of the data must be labelled a priori,
requiring less teaching effort, while the classification accuracy is usually higher
than for purely unsupervised approaches.
Feature Selection
Feature selection involves selecting a subset of the available feature set to
be used in learning and classification. The effectiveness of a system depends on
the number and the types of features. It is important to minimise the number of features
to increase the performance of the learning and future classifications, in terms of
processing time and memory required. It is also important to remove irrelevant
features to increase the classification accuracy, as some Machine Learning algorithms
cannot cope well with irrelevant features.
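A minimal sketch of one such filter: dropping features that take the same value for every instance, since these cannot help discriminate between classes. Real feature selectors also score how relevant each remaining feature is to the class; the data and names here are hypothetical.

```python
def select_informative(instances, feature_names):
    """Keep only features that vary across instances; a constant
    feature carries no information for classification."""
    keep = []
    for name in feature_names:
        values = {inst[name] for inst in instances}
        if len(values) > 1:
            keep.append(name)
    return keep

data = [
    {"outlook": "sunny", "region": "AU", "play": "no"},
    {"outlook": "rainy", "region": "AU", "play": "yes"},
]
# "region" is constant, so it is removed.
print(select_informative(data, ["outlook", "region"]))  # ['outlook']
```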
Example
An example that is often used to illustrate Machine Learning is to determine
whether an unspecified sport can be played based on the current weather conditions.
Consider the following data (taken from Weka
3: Data Mining Software in Java) that includes the outlook, the temperature,
the humidity, whether it is windy or not and the information whether playing is
possible (this is the class attribute).
outlook  | temperature | humidity | windy | play
---------|-------------|----------|-------|-----
sunny    | 85          | 85       | false | no
sunny    | 80          | 90       | true  | no
overcast | 83          | 86       | false | yes
rainy    | 70          | 96       | false | yes
rainy    | 68          | 80       | false | yes
rainy    | 65          | 70       | true  | no
overcast | 64          | 65       | true  | yes
sunny    | 72          | 95       | false | no
sunny    | 69          | 70       | false | yes
rainy    | 75          | 80       | false | yes
sunny    | 75          | 70       | true  | yes
overcast | 72          | 90       | true  | yes
overcast | 81          | 75       | false | yes
rainy    | 71          | 91       | true  | no
The task for the Machine Learning algorithm is to find the rules inherent in
this data set. Applying the C4.5 (named J48 in Weka) tree induction
algorithm we get the following decision tree:

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = true: no (2.0)
|   windy = false: yes (3.0)

This decision tree is a perfect classifier for our training data in the table. No
data instance is misclassified and therefore the accuracy of the classifier on the
training data is 100%.
Each node of the tree represents an attribute/feature. The branches below each
node divide the attribute value space using the test associated with each branch.
The leaves of the tree contain the class and the number of instances. Tree-based
algorithms such as C4.5 already perform some kind of feature selection. The order
in which the attributes appear in the tree (from top to bottom) depends on their
ability to divide the feature space most quickly into sets of instances that can
be accurately mapped to a single class. C4.5 uses a metric called information gain
(normalised into a gain ratio) to determine the best sequence of attributes and
the optimal splits of the attribute value space. Attributes with low information
gain appear at the bottom of the tree or are not part of the tree at all. For
instance, the temperature attribute is not present in the above tree because it
does not have a strong correlation with the class.
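The information gain calculation can be sketched on the weather data itself. The code below implements the basic (ID3-style) information gain; C4.5 proper normalises this into a gain ratio, which is omitted here for brevity. The nominal columns are taken from the table above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(instances, feature, class_attr="play"):
    """Entropy of the class minus the weighted entropy remaining
    after splitting the instances on `feature`."""
    base = entropy([i[class_attr] for i in instances])
    remainder = 0.0
    for value in {i[feature] for i in instances}:
        subset = [i[class_attr] for i in instances if i[feature] == value]
        remainder += len(subset) / len(instances) * entropy(subset)
    return base - remainder

# The nominal attributes of the 14 weather instances from the table.
data = [dict(zip(("outlook", "windy", "play"), r)) for r in [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]]
print(round(information_gain(data, "outlook"), 3))  # 0.247
print(round(information_gain(data, "windy"), 3))    # 0.048
```

The outlook attribute has the highest gain, which is why it appears at the root of the tree.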
The decision tree can be easily translated into the following set of classification
rules:
if outlook=sunny and humidity<=75 then play=yes
if outlook=sunny and humidity>75 then play=no
if outlook=overcast then play=yes
if outlook=rainy and windy=true then play=no
if outlook=rainy and windy=false then play=yes
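These rules translate directly into code. A minimal sketch of the resulting classifier (the function name is illustrative):

```python
def play(outlook, humidity, windy):
    """Classify a weather instance using the decision-tree rules above."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook}")

print(play("sunny", humidity=85, windy=False))    # no
print(play("overcast", humidity=86, windy=False)) # yes
print(play("rainy", humidity=70, windy=True))     # no
```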
Detecting Network Game Traffic Flows
Recent years have seen a dramatic increase in the traffic of highly interactive
games (such as multiplayer first person shooter games) on the Internet. The ability
to dynamically identify and classify traffic flows caused by network games is highly
beneficial for trend analyses (estimating the size and origins of capacity demand
trends) and adaptive, network-based Quality of Service (QoS) marking without direct
involvement of the end-host.
The most common traffic identification technique is based on the inspection of
`known port numbers'. Game servers often run on widely known default ports,
and the port information in network traffic flows can be used to detect games.
However, in general game servers can be configured to run on any port, and already
game servers often do not use the default port because multiple servers are running
on the same host (IP address). Given that the port number range is limited (65535
port numbers) while more and more applications are being developed, it is likely
that the port numbers of different applications will increasingly overlap, making
them unusable for application identification.
A more reliable technique involves stateful reconstruction of session and application
information from packet content. Although this technique avoids reliance on fixed
port numbers, it imposes significant complexity and processing load on the traffic
identification device. Recently, signature-based methods have been proposed as a more
lightweight yet accurate alternative to stateful reconstruction. The main problem
with both of these methods is that they require some knowledge of the protocol
semantics. However, the protocols used for network games are usually binary protocols
and their semantics are kept secret to preclude users from exploiting this knowledge
for the purpose of cheating. For instance, proxy applications that alter the traffic
from client to server could be developed to improve players' aiming skills (so-called
aimbots).
Previous research used a number of different parameters (features) to describe
network traffic including the size and duration of flows, packet length and interarrival
time distributions, flow idle times etc. We will develop a system that uses these
features and ML algorithms to automatically classify and identify network game traffic.
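A sketch of how such flow features might be computed from captured packets. The feature names and choices here are illustrative only and do not represent the ANGEL feature set; a flow is assumed to be a list of (timestamp, payload length) pairs.

```python
from statistics import mean, stdev

def flow_features(packets):
    """Compute a few candidate ML features from a traffic flow,
    given as (timestamp_seconds, payload_length_bytes) pairs."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # interarrival times
    return {
        "duration": times[-1] - times[0],
        "packet_count": len(packets),
        "mean_size": mean(sizes),
        "stdev_size": stdev(sizes) if len(sizes) > 1 else 0.0,
        "mean_interarrival": mean(gaps) if gaps else 0.0,
    }

# A hypothetical game flow: small packets at a steady ~20 ms tick rate.
flow = [(0.000, 60), (0.020, 62), (0.041, 58), (0.060, 61)]
print(flow_features(flow))
```

Feature vectors like this would then serve as the instances fed to the ML classifier.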
References
The following references provide a nice introduction to the topic of Machine Learning:
- Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and
Techniques (Second Edition)", Morgan Kaufmann, June 2005.
- Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (ISE Editions), December
1997.
- Nils J. Nilsson, "Introduction to Machine
Learning"
Also see the publications page for more references
and information about Machine Learning.