As part of a broader organisational restructure, data networking research at Swinburne University of Technology has moved from the Centre for Advanced Internet Architecture (CAIA) to the Internet For Things (I4T) Research Lab.

Although CAIA no longer exists, this website reflects CAIA's activities and outputs between March 2002 and February 2017, and is being maintained as a service to the broader data networking research community.

ANGEL - Automated Network Games Enhancement Layer

Machine Learning

This sub-page provides a brief overview of the field of Machine Learning. Machine Learning is used within the scope of the ANGEL sub-project to classify network traffic flows. The results of the classification are then intended to be used to reconfigure the network to improve overall network application performance.

Machine Learning (ML) usually refers to systems performing tasks associated with Artificial Intelligence (AI). Such tasks involve recognition, diagnosis, planning, prediction etc. [Mitchell 1997] defines Machine Learning as follows:

"A computer program is said to learn from experiment E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"

[Witten and Frank 2000] state:

"things learn when they change their behavior in a way that makes them perform better in the future"

Machine Learning can be viewed as a general inductive process that automatically builds a model by learning the inherent structure of a dataset from the characteristics of its instances. Over the past decade, Machine Learning has evolved from a field of laboratory demonstrations to a field of significant commercial value. Machine Learning techniques have been very successful in areas such as Data Mining, speech and voice recognition, text recognition and face recognition.

A good model should be descriptive (describe the training data), predictive (generalise well for unseen test data) and explanatory (provide a plausible description).

Terminology

The following terms are widely used in the field of Machine Learning:

  • Instance: An instance is an example in the training data. An instance is described by a number of attributes. One attribute can be a class label
  • Attribute/Feature: An attribute is an aspect of an instance (e.g. temperature, humidity). Attributes are often called features in Machine Learning. A special attribute is the class label that defines the class this instance belongs to (required for supervised learning)
  • Classification: A classifier is able to categorise/classify given instances (test data) with respect to pre-determined/learned classification rules
  • Training/Learning: A classifier learns the classification rules based upon a given set of instances (training data). Please note that some algorithms have no training phase but do all the work when classifying an instance (so-called lazy learning algorithms)
  • Clustering: The process of dividing an unlabeled data set (no class information given) into clusters that contain similar instances
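The terminology above can be made concrete with a minimal sketch in Python (our illustration; the attribute names are taken from the weather example further down this page):

```python
# An instance is an example in the training data, described by a number
# of attributes; the special "play" attribute here is the class label
# (required for supervised learning).
instance = {
    "outlook": "sunny",   # attribute/feature
    "temperature": 85,    # attribute/feature
    "humidity": 85,       # attribute/feature
    "windy": False,       # attribute/feature
    "play": "no",         # class label
}

# The attributes/features of the instance, excluding the class label:
features = {k: v for k, v in instance.items() if k != "play"}

print(sorted(features))  # -> ['humidity', 'outlook', 'temperature', 'windy']
print(instance["play"])  # -> no
```

A training data set is then simply a list of such labelled instances; for clustering, the class label is absent.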

Supervised vs. Unsupervised Learning

The basic notion of supervised learning is that of the classifier. A teacher helps to construct a classification model by defining classes and providing positive and negative examples of instances belonging to these classes. The system then determines the commonalities and differences between the various classes in order to generate classification rules for unknown instances. The resulting rules assign a class label to an instance based on the values of the instance's attributes.

In unsupervised learning there are no classes defined a-priori. The algorithm itself divides the instances into different classes and finds descriptions for these classes. This process is often also referred to as clustering. The resulting rules will be a summary of some properties of the instances in the database: which classes are present and what differentiates them. However, this will only be what the algorithm has found. There may be many other ways of dividing the instances into classes and of describing each class.
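As an illustration of clustering, the following sketch (our own, with made-up one-dimensional data and initial centroids) divides an unlabeled data set into two clusters using the k-means algorithm:

```python
# Minimal 1-D k-means: no class labels are given; the algorithm itself
# divides the instances into clusters of similar values.

def kmeans_1d(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its assigned points; repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else centroids[c]
                     for c, ps in clusters.items()]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]       # unlabeled instances
centroids, clusters = kmeans_1d(data, [0.0, 5.0])
print(sorted(round(c, 1) for c in centroids))  # -> [1.0, 10.1]
```

Note that the result is only one possible division of the instances: different initial centroids, a different k, or a different distance measure could produce a different clustering.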

Semi-supervised learning is a combination of supervised and unsupervised learning. In this approach a user-defined amount of supervision is imposed on the algorithm. The advantage is that only part of the data must be classified a-priori, requiring less teaching effort, while the classification accuracy is usually higher than for purely unsupervised approaches.

Feature Selection

Feature selection involves selecting a subset of the available feature set to be used in learning and classification. The effectiveness of a system depends on the number and the types of features. It is important to minimise the number of features to increase the performance of the learning and future classifications, in terms of processing time and memory required. It is also important to remove irrelevant features to increase the classification accuracy, as some Machine Learning algorithms cannot cope well with irrelevant features.

Example

An example that is often used to illustrate Machine Learning is to determine whether an unspecified sport can be played based on the current weather conditions. Consider the following data (taken from Weka 3: Data Mining Software in Java) that includes the outlook, the temperature, the humidity, whether it is windy or not, and whether playing is possible (this is the class attribute).

outlook    temperature  humidity  windy  play
sunny      85           85        false  no
sunny      80           90        true   no
overcast   83           86        false  yes
rainy      70           96        false  yes
rainy      68           80        false  yes
rainy      65           70        true   no
overcast   64           65        true   yes
sunny      72           95        false  no
sunny      69           70        false  yes
rainy      75           80        false  yes
sunny      75           70        true   yes
overcast   72           90        true   yes
overcast   81           75        false  yes
rainy      71           91        true   no

The task for the Machine Learning algorithm is to find the rules inherent in this data set. Applying the C4.5 tree induction algorithm (named J48 in Weka) we get the following decision tree. This decision tree is a perfect classifier for our training data in the table: no data instance is misclassified, and therefore the accuracy of the classifier on the training data is 100%.

[Figure: decision tree induced from the weather data]

Each node of the tree represents an attribute/feature. The branches below each node divide the attribute value space using the test associated with each branch. The leaves of the tree contain the class and the number of instances. Tree-based algorithms such as C4.5 already perform some kind of feature selection. The order in which the attributes appear in the tree (from top to bottom) depends on their ability to divide the feature space most quickly into sets of instances that can be accurately mapped to a single class. C4.5 uses a metric called information gain to determine the best sequence of attributes and the optimal splits of the attribute value space. Attributes with low information gain appear at the bottom of the tree or are not part of the tree at all. For instance, the temperature attribute is not present in the above tree because it does not correlate strongly with the class.
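To make the information gain metric concrete, the following sketch (our illustration, not Weka's implementation) computes it for the two nominal attributes of the weather data; outlook scores far higher than windy, which is consistent with outlook forming the root of the tree:

```python
from math import log2

# The weather instances, reduced to (outlook, windy, play).
data = [
    ("sunny", False, "no"),     ("sunny", True, "no"),
    ("overcast", False, "yes"), ("rainy", False, "yes"),
    ("rainy", False, "yes"),    ("rainy", True, "no"),
    ("overcast", True, "yes"),  ("sunny", False, "no"),
    ("sunny", False, "yes"),    ("rainy", False, "yes"),
    ("sunny", True, "yes"),     ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(attr_index):
    """Entropy of the class labels minus the weighted entropy
    remaining after splitting on the given attribute."""
    labels = [row[-1] for row in data]
    values = set(row[attr_index] for row in data)
    remainder = sum(
        len(subset) / len(data) * entropy(subset)
        for subset in ([row[-1] for row in data if row[attr_index] == v]
                       for v in values))
    return entropy(labels) - remainder

print(round(info_gain(0), 3))  # outlook -> 0.247
print(round(info_gain(1), 3))  # windy   -> 0.048
```

(Numeric attributes such as temperature and humidity are handled by C4.5 via threshold splits, which we omit here for brevity.)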

The decision tree can be easily translated into the following set of classification rules:

if outlook=sunny and humidity<=75 then play=yes
if outlook=sunny and humidity>75 then play=no
if outlook=overcast then play=yes
if outlook=rainy and windy=true then play=no
if outlook=rainy and windy=false then play=yes
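These rules can be transcribed directly into a small Python function and checked against the 14 training instances from the table, confirming the 100% training accuracy:

```python
# The rule set above as a classifier function.
def play(outlook, temperature, humidity, windy):
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

# (outlook, temperature, humidity, windy, play) from the table above.
training = [
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

correct = sum(play(o, t, h, w) == p for o, t, h, w, p in training)
print(f"{correct}/{len(training)} correct")  # -> 14/14 correct
```

Note that perfect accuracy on the training data says nothing about how well the rules generalise to unseen instances, which is why a good model must also be predictive, not merely descriptive.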

Detecting Network Game Traffic Flows

Recent years have seen a dramatic increase in Internet traffic from highly interactive games (such as multiplayer first-person shooter games). The ability to dynamically identify and classify traffic flows caused by network games is highly beneficial for trend analyses (estimating the size and origins of capacity demand trends) and adaptive, network-based Quality of Service (QoS) marking without direct involvement of the end-host.

The most common traffic identification technique is based on the inspection of `known port numbers'. Game servers often run on widely known default ports, and the port information in network traffic flows can be used to detect games. In general, however, game servers can be configured to run on any port, and servers already often deviate from the default port because multiple servers are running on the same host (IP address). Given that the port number range is limited (65536 possible port numbers) while the number of applications keeps growing, port numbers of different applications will increasingly overlap, making them unreliable for application identification.
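A port-based classifier is essentially a table lookup, as this sketch illustrates (the port table is a hypothetical fragment; 27015 is the widely known default port of Half-Life based game servers, but as noted above nothing forces a server to use it):

```python
# Hypothetical `known port numbers' table for illustration only.
KNOWN_PORTS = {
    27015: "Half-Life/Counter-Strike",
    53: "DNS",
    80: "HTTP",
}

def classify_by_port(dst_port):
    """Classify a flow purely by its destination port number."""
    return KNOWN_PORTS.get(dst_port, "unknown")

print(classify_by_port(27015))  # -> Half-Life/Counter-Strike
# A second game server on the same host must use a different port,
# so the lookup fails even though the traffic is the same game:
print(classify_by_port(27016))  # -> unknown
```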

A more reliable technique involves stateful reconstruction of session and application information from packet content. Although this technique avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the traffic identification device. Recently, signature-based methods have been proposed as a more lightweight yet accurate alternative to stateful reconstruction. The main problem with both these methods is that they require some knowledge about the protocol semantics. However, the protocols used for network games are usually binary protocols and their semantics are kept secret to preclude users from exploiting this knowledge for the purpose of cheating. For instance, proxy applications that alter the traffic from client to server could be developed to improve a player's aiming skills (so-called aimbots).

Previous research used a number of different parameters (features) to describe network traffic including the size and duration of flows, packet length and interarrival time distributions, flow idle times etc. We will develop a system that uses these features and ML algorithms to automatically classify and identify network game traffic.
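The kind of flow features mentioned above can be computed without any knowledge of protocol semantics. The following sketch (our illustration with a synthetic packet trace, not real game traffic) derives a few of them from a list of (timestamp, packet length) pairs:

```python
from statistics import mean, pstdev

# Synthetic packet trace for one flow: (timestamp in seconds,
# packet length in bytes). The values are illustrative only.
packets = [(0.000, 60), (0.050, 62), (0.101, 58), (0.149, 61), (0.200, 60)]

timestamps = [t for t, _ in packets]
lengths = [l for _, l in packets]
interarrival = [b - a for a, b in zip(timestamps, timestamps[1:])]

features = {
    "duration": timestamps[-1] - timestamps[0],  # flow duration
    "mean_len": mean(lengths),                   # mean packet length
    "stdev_len": pstdev(lengths),                # packet length spread
    "mean_iat": mean(interarrival),              # mean interarrival time
}
print(features)
```

A feature vector like this, computed per flow, is what would be fed to the ML algorithms described above to learn the difference between game traffic and other traffic.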

References

The following references provide a nice introduction to the topic of Machine Learning:

  • Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)", Morgan Kaufmann, June 2005.
  • Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (ISE Editions), December 1997.
  • Nils J. Nilsson, "Introduction to Machine Learning"

Also see the publications page for more references and information about Machine Learning.

Last Updated: Wednesday 16-Sep-2009 17:44:15 AEST | Maintained by: Jason But (jbut@swin.edu.au) | Authorised by: Grenville Armitage ( garmitage@swin.edu.au)