
International Journal of Advances in Engineering & Technology, May 2013.
©IJAET
ISSN: 2231-1963

EXAMINING OUTLIER DETECTION PERFORMANCE FOR
PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS
ROBUSTIFICATION METHODS
Nada Badr, Noureldien A. Noureldien
Department of Computer Science
University of Science and Technology, Omdurman, Sudan

ABSTRACT
Intrusion detection has grasped the attention of both commercial institutions and the academic research community. In this paper, PCA (Principal Components Analysis) is utilized as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of traffic. PCA is sensitive to outliers since it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The results obtained from the experiments show that PCA generates a high false alarm rate due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method undergoes.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is to say, PCA analyzes data sets that represent observations described by several dependent, inter-correlated variables. PCA is one of the best known and most widely used multivariate exploratory analysis techniques [5].
Several robust competitors to classical PCA estimators have been proposed in the literature. A natural
way to robustify PCA is to use robust location and scatter estimators instead of the PCA's sample
mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the
population covariance matrix. The minimum covariance determinant (MCD) method is a highly
robust estimator of multivariate location and scatter. Its objective is to find h observations out of n
whose covariance matrix has the lowest determinant. The MCD location estimate then is the mean of
these h points, and the estimate of scatter is their covariance matrix. Another robust method for
principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data on
a lower-dimensional space such that a robust measure of variance of the projected data will be
maximized.
In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP by applying PCA to the Abilene dataset and comparing its outlier detection performance to that of MCD and PP.
The rest of this paper is organized as follows. Section 2 is an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. Section 5 presents the experimental results; conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of studies have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced by Lakhina [13], whereby principal components analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise.
Labib [2] utilized PCA for reducing the dimension of the traffic data and for visualizing and identifying attacks. Bouzida et al. [7] presented a performance study of two machine learning algorithms, namely nearest neighbors and decision trees, when used with traffic data with or without PCA. They discovered that when PCA is applied to the KDD99 dataset to reduce the dimension of the data, the algorithms' learning speed improves while accuracy remains the same.
Terrell [9] used principal components analysis on features of aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry [10] proposed the use of singular value decomposition and wavelet transforms for detecting anomalies in self-similar network traffic data. Wang [12] proposed an anomaly intrusion detection model based on PCA for monitoring network behaviors. The model utilizes PCA to reduce the dimensions of historical data and to build the normal profile, represented by the first few principal components. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold.
Shyu [4] proposed an anomaly detection scheme based on robust principal components analysis. Two classifiers were implemented to detect anomalies: one based on the major components that capture most of the variation in the data, and the second based on the minor components, or residuals. A new observation is considered an outlier, or anomalous, when the sum of squares of the weighted principal components exceeds the threshold in either of the two classifiers.
Lakhina [6] applied principal components analysis to Origin-Destination (OD) flow traffic. The traffic is separated into normal and anomalous subspaces by projecting the data onto the resulting principal components one at a time, ordered from high to low. Principal components (PCs) are added to the normal subspace as long as a predefined threshold is not exceeded; once the threshold is exceeded, that PC and all subsequent PCs are added to the anomalous subspace. New OD flow traffic is projected onto the anomalous subspace, and an anomaly is flagged if the value of the squared prediction error, or Q-statistic, exceeds a predefined limit.
PCA is thus widely used to identify lower-dimensional structure in data and is commonly applied to high-dimensional data. PCA represents the data by a small number of components that account for the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis, or cluster analysis.
In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sensitive to outlying observations and render PCA unreliable when outliers are encountered.
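A quick numerical illustration of this sensitivity (ours, not from the paper; the data and numbers are synthetic) shows how a single gross outlier shifts both the sample mean and the sample covariance:

```python
# Minimal sketch: one extreme point moves the classical mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))   # 100 clean bivariate observations

mean_clean = X.mean(axis=0)
cov_clean = np.cov(X, rowvar=False)

X_out = np.vstack([X, [50.0, 50.0]])      # append one gross outlier
mean_out = X_out.mean(axis=0)
cov_out = np.cov(X_out, rowvar=False)

print("shift of the sample mean:", np.linalg.norm(mean_out - mean_clean))
print("det(cov) clean vs contaminated:",
      np.linalg.det(cov_clean), np.linalg.det(cov_out))
```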

III. CLASSICAL PCA MODEL

The PCA detection model detects outliers by projecting the observations of the dataset onto the newly computed axes, known as PCs. The outliers detected by the PCA method are of two types: outliers detected by the major PCs, and outliers detected by the minor PCs.
The basic goals of PCA [5] are to extract the important information from the data set, to compress the size of the data set by keeping only this important information, and to simplify the description of the data and analyze the structure of the observations and variables (finding patterns of similarity and difference).
To achieve these goals, PCA calculates new variables from the original variables, called Principal Components (PCs). The computed variables are linear combinations of the original variables (chosen to maximize the variance of the projected observations) and are uncorrelated. The first computed PCs, called the major PCs, have the largest inertia (total variance in the data set), while the later ones, called the minor PCs, carry the largest residual inertia and are orthogonal to the first principal components.
The Principal Components define orthogonal directions in the space of observations. In other words, PCA just makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.

3.1 PCA Advantages
The common advantages of PCA are:
3.1.1 Exploratory Data Analysis
PCA is mostly used for making two-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected onto factorial planes spanned by pairs of Principal Components chosen among the first (that is, the most significant) ones. From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data).
According to most research [8][11], PCA detects two types of outliers. Type 1 outliers inflate the variance and are detected by the major PCs; type 2 outliers violate the correlation structure and are detected by the minor PCs.
3.1.2 Data Reduction Technique
All multivariate techniques are prone to the bias-variance tradeoff, which states that the number of variables entering a model should be severely restricted. Data is often described by many more variables than are necessary for building the best model. PCA is better than other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.
3.1.3 Low Computational Requirement
PCA needs low computational effort since its algorithm consists of simple calculations.

3.2 PCA Disadvantages
It may be noted that PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation and that most of the information is contained in those directions where the input data variance is maximum.
As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task). From the above, the following disadvantages of PCA follow.
3.2.1 Dependence on Linear Algebra
PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than plain linear combinations of the original variables, would lead to a better description of the data.
3.2.2 Smallest Principal Components Receive Little Attention in Statistical Techniques
This lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance of the data, the smallest principal components contain only the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection.
3.2.3 High False Alarms
Principal components are sensitive to outliers, since their directions are calculated from classical estimators such as the classical mean and the classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method can be affected by outliers to the point that the PCA model cannot detect all of the actually deviating observations; this is known as the masking effect. In addition, some good data points might appear to be outliers, which is known as the swamping effect.
Masking and swamping cause PCA to generate high false alarms. To reduce these false alarms, the use of robust estimators was proposed, since outlying points are less likely to enter into the calculation of robust estimators.
The well-known PCA robustification methods are the Minimum Covariance Determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = (n - h + 1)/n$; hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized. PP is applicable when the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires the dimension of the dataset not to exceed about 50.
Principal Component Analysis (PCA) can be seen as an instance of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; but instead of using the variance as the measure of dispersion, PP uses a robust scale estimator [4].
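As a hedged sketch of the MCD idea on synthetic data (the paper does not name an implementation; scikit-learn's MinCovDet is our assumption), the support_fraction argument plays the role of h/n, and the breakdown value follows the formula above:

```python
# MCD location/scatter on contaminated data, plus the breakdown value b_n.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 4))
X[:20] += 8.0                              # contaminate 10% of the rows

h_frac = 0.75                              # h/n, with h > n/2 as required
mcd = MinCovDet(support_fraction=h_frac, random_state=0).fit(X)

n = X.shape[0]
h = int(h_frac * n)
breakdown = (n - h + 1) / n                # b_n = (n - h + 1)/n from the text
print("robust location estimate:", mcd.location_)
print("breakdown value:", round(breakdown, 3))
```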

V. EXPERIMENTS AND RESULTS

In this section we show how we test PCA and its robustification methods, MCD and PP, on a dataset. The data used consists of OD (Origin-Destination) flows collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix of 2004-03-01, which Yin Zhang built from the Abilene network. The dataset is available offline, as it is extracted from an offline traffic matrix.
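A minimal loading sketch under stated assumptions: the file name abilene_od_hour.txt and the whitespace-delimited text format are hypothetical (the paper does not specify how the extracted hour is stored); the sketch only fixes the expected 144 x 12 shape used below.

```python
# Hypothetical loader for the 144 x 12 traffic matrix of Section 5.1.
import numpy as np

X = np.loadtxt("abilene_od_hour.txt")   # hypothetical file name and format
assert X.shape == (144, 12)             # 144 observations, 12 variables
```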

5.1 PCA on Dataset
At first, the dataset (the traffic matrix) is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$X_{(144 \times 12)} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,12} \\ \vdots & \ddots & \vdots \\ x_{144,1} & \cdots & x_{144,12} \end{bmatrix}$
The following steps are considered in applying the PCA method to the dataset:
• Center the dataset to have zero mean; the mean vector is calculated from the equation
  $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$   (1)
  and the mean is subtracted off each dimension. The product of this step is a centered data matrix Y with the same size as the original dataset:
  $Y_{(n,p)} = (x_{i,j} - \mu(X))$   (2)
• The covariance matrix is calculated from the equation
  $C(X) \text{ or } \Sigma(X) = \frac{1}{n-1}(X - T(X))^{T}(X - T(X))$   (3)
• Find the eigenvectors and eigenvalues of the covariance matrix, where the eigenvalues are the diagonal elements of the resulting matrix, by using the eigendecomposition in equation (4):
  $E^{-1}\, \Sigma_{Y}\, E = \Lambda$   (4)
  where E is the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues.
• Order the eigenvalues in decreasing order and sort the eigenvectors according to the ordered eigenvalues; the sorted eigenvector matrix is the loadings matrix.
• Calculate the scores matrix (the dataset projected onto the principal components), which expresses the relations between the principal components and the observations:
  $scores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)}$   (5)



• Apply the 97.5% tolerance ellipse to the bivariate datasets (data projected on the first PCs, data projected on the minor PCs) to reveal outliers automatically. The ellipse is defined by the data points whose distance equals the square root of the chi-square 97.5% quantile with 2 degrees of freedom; the cutoff has the form
  $dist \le \sqrt{\chi^{2}_{2,\,0.975}}$   (6)
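The steps above condense into a short numpy sketch; this is our illustration of equations (1)-(6), not the authors' code, and the random matrix stands in for the Abilene data:

```python
# Classical PCA: centering, covariance, eigendecomposition, scores, and the
# 97.5% tolerance-ellipse cutoff on the first two PCs.
import numpy as np
from scipy.stats import chi2

def classical_pca(X):
    mu = X.mean(axis=0)                      # eq. (1)
    Y = X - mu                               # eq. (2), centered data
    C = np.cov(Y, rowvar=False)              # eq. (3), sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)     # eq. (4)
    order = np.argsort(eigvals)[::-1]        # decreasing eigenvalues
    loadings = eigvecs[:, order]
    scores = Y @ loadings                    # eq. (5)
    return eigvals[order], loadings, scores

X = np.random.default_rng(2).normal(size=(144, 12))   # placeholder data
eigvals, loadings, scores = classical_pca(X)

# Eq. (6): flag points whose standardized distance on the first two PCs
# exceeds the chi-square 97.5% quantile with 2 degrees of freedom.
d2 = np.sum(scores[:, :2] ** 2 / eigvals[:2], axis=1)  # squared distance
cutoff = chi2.ppf(0.975, df=2)
print("flagged observations:", np.where(d2 > cutoff)[0])
```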

The screeplot was studied: the first and second principal components account for 98% of the total variance of the dataset, so the first two principal components are retained to represent the dataset as a whole. Figure 1 shows the screeplot; the plot of the data projected onto the first two principal components, which reveals the outliers in the dataset visually, is shown in Figure 2.
Figure 1: PCA Screeplot

Figure 2: PCA Visual Outliers

Figure 3 shows the tolerance ellipse on the major PCs, and Figures 4 and 5 show, respectively, the visual recording of outliers from scatter plots of the data projected on the minor principal components, and the outliers detected by the minor principal components tuned by the tolerance ellipse.
Figure 3: PCA Tolerance Ellipse

Figure 4: PCA Type 2 Outliers

Figure 5: PCA Tuned Minor PCs

5.2 MCD on Dataset
Testing the robust MCD (Minimum Covariance Determinant) estimator yields the robust location measure $T_{mcd}$ and the robust dispersion $\Sigma_{mcd}$.
The following steps are applied to test MCD on the dataset in order to reach the robust principal components:
• The MCD robust distance measure is calculated from the formula
  $R_i = (x_i - T_{mcd}(X))^{T}\, \Sigma_{mcd}(X)^{-1}\, (x_i - T_{mcd}(X)), \quad i = 1, \dots, n$   (7)
  with robust location Tmcd or μmcd = 1.0e+006 *
• From the robust covariance matrix C(X)mcd or Σ(X)mcd = 1.0e+012 *, calculate the following:
  - find the robust eigenvalues as a diagonal matrix as in equation (4), replacing n with h;
  - find the robust eigenvectors as a loadings matrix as in equation (5).
• Calculate the robust scores matrix as
  $robust\ scores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)}$   (8)
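A sketch of the robust PCA steps above, assuming scikit-learn's MinCovDet as the MCD implementation (the paper does not name one); the robust distances of equation (7) come from the fitted estimator's mahalanobis method:

```python
# Robust PCA via MCD: eigendecompose the robust covariance, project the
# robustly centered data (eq. 8), and compute robust distances (eq. 7).
import numpy as np
from sklearn.covariance import MinCovDet

def mcd_pca(X, support_fraction=0.75):
    mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(X)
    T_mcd, S_mcd = mcd.location_, mcd.covariance_   # robust T(X), Sigma(X)
    eigvals, eigvecs = np.linalg.eigh(S_mcd)
    order = np.argsort(eigvals)[::-1]
    loadings = eigvecs[:, order]
    scores = (X - T_mcd) @ loadings                 # eq. (8)
    return eigvals[order], loadings, scores, mcd

X = np.random.default_rng(3).normal(size=(144, 12))  # placeholder data
eigvals, loadings, scores, mcd = mcd_pca(X)
R = mcd.mahalanobis(X)   # eq. (7): squared robust distances for each row
```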
The robust screeplot, which retains the first two robust principal components accounting for above 98% of the total variance, is shown in Figure 6. Figures 7 and 8 show, respectively, the visual recording of outliers from scatter plots of the data projected on the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures 9 and 10 show the corresponding plots for the robust minor principal components.
Figure 6: MCD Screeplot

Figure 7: MCD Visual Outliers

Figure 8: MCD Tolerance Ellipse

Figure 9: MCD Type 2 Outliers

Figure 10: MCD Tuned Minor PCs

5.3 Projection Pursuit on Dataset
Testing the projection pursuit method on the dataset involves the following steps:
• Center the data matrix $X_{(n,p)}$ around the L1-median to obtain the centralized data matrix $Y_{(n,p)}$:
  $Y_{(n,p)} = X_{(n,p)} - L1(X)$   (9)
  where $L1(X)$ is a highly robust estimator of multivariate data location with 50% resistance to outliers [11].
• Construct the candidate directions $P_i$ as normalized rows of the centered matrix; this process includes the following:
  $PY = (Y[i,:])'\ \ \text{for}\ i = 1, \dots, n$   (10)
  $NPY = \max(SVD(PY))$   (11)
  where SVD stands for the singular value decomposition, and
  $P_i = PY / NPY$   (12)
• Project the whole dataset on all possible directions:
  $T_i = Y \times (P_i)^{t}$   (13)
• Calculate a robust scale estimator for all the projections and find the direction that maximizes the $q_n$ estimator:
  $q = \max(q_n(T_i))$   (14)
  Here $q_n$ is a scale estimator; essentially it is the first quartile of all pairwise distances between two data points [5]. These steps yield the robust eigenvectors (PCs), and the square of the robust scale estimator gives the eigenvalues.
• Project all data on the selected direction $q$ to obtain the robust principal components:
  $T_i = Y_{(n,p)} \times P_q^{t}$   (15)
• Update the data matrix by its orthogonal complement:
  $Y = Y - (P_q \times P_q^{t})\, Y$   (16)



• Project all data on the orthogonal complement:
  $scores = Y \times P_i$   (17)
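The loop below is a compact sketch of steps (9)-(17) under two stated simplifications: the L1-median is approximated by the coordinate-wise median, and each candidate direction is normalized by its own row norm rather than by the maximal singular value of equation (11); the Qn scale is taken, as in the text, to be the first quartile of pairwise distances:

```python
# Projection-pursuit PCA sketch: pick the direction maximizing a robust
# scale of the projections, then deflate the data (eq. 16) and repeat.
import numpy as np

def qn_scale(t):
    # First quartile of all pairwise distances |t_i - t_j|, i < j.
    diffs = np.abs(t[:, None] - t[None, :])
    return np.quantile(diffs[np.triu_indices(len(t), k=1)], 0.25)

def pp_pca(X, n_components=2):
    Y = X - np.median(X, axis=0)            # eq. (9), approximate L1-median
    eigvals, loadings = [], []
    for _ in range(n_components):
        P = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # directions (10)-(12)
        scales = [qn_scale(Y @ p) for p in P]             # eqs. (13)-(14)
        best = int(np.argmax(scales))
        p_q = P[best]
        eigvals.append(scales[best] ** 2)   # squared robust scale
        loadings.append(p_q)
        Y = Y - np.outer(Y @ p_q, p_q)      # deflation, eq. (16)
    return np.array(eigvals), np.array(loadings).T

X = np.random.default_rng(4).normal(size=(144, 12))  # placeholder data
eigvals, loadings = pp_pca(X)
scores = (X - np.median(X, axis=0)) @ loadings       # robust scores, eq. (17)
```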
The plot of the data projected on the first two robust principal components to detect outliers visually is shown in Figure 11, and the tuning of the first two robust principal components by the tolerance ellipse is shown in Figure 12. Figures 13 and 14 show, respectively, the plot of the data projected on the minor robust principal components to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse.
Figure 11: PP Visual Outliers

Figure 12: PP Tolerance Ellipse

Figure 13: PP Type 2 Outliers

Figure 14: PP Tuned Minor PCs

5.4 Results
Table 1 summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The MCD and PP results reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with slight differences, since we use 12 dimensions on the dataset.
Table 1: Outliers Detection

PCA outliers          MCD outliers          PP outliers           False alarm effects
(major and minor PCs) (major and minor PCs) (major and minor PCs) Masking    Swamping
66                    66                    66                    No         No
99                    99                    99                    No         No
100                   100                   100                   No         No
116                   116                   116                   No         No
117                   117                   117                   No         No
118                   118                   118                   No         No
119                   119                   119                   No         No
120                   120                   120                   No         No
129                   129                   129                   No         No
131                   131                   131                   No         No
135                   135                   135                   No         No
Normal                Normal                69                    Yes        No
Normal                Normal                70                    Yes        No
71                    Normal                Normal                No         Yes
76                    Normal                Normal                No         Yes
81                    Normal                Normal                No         Yes
101                   Normal                Normal                No         Yes
104                   Normal                Normal                No         Yes
111                   Normal                Normal                No         Yes
144                   Normal                Normal                No         Yes
Normal                84                    Normal                Yes        No
Normal                96                    Normal                Yes        No
Normal                97                    97                    Yes        No
Normal                98                    98                    Yes        No
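As a small consistency check (ours, using only the sets reported in Table 1): observations flagged by a robust method but missed by PCA exhibit masking, while PCA alarms that both robust methods consider normal exhibit swamping:

```python
# Derive the masking/swamping columns of Table 1 from the detected sets.
pca = {66, 99, 100, 116, 117, 118, 119, 120, 129, 131, 135,
       71, 76, 81, 101, 104, 111, 144}
mcd = {66, 99, 100, 116, 117, 118, 119, 120, 129, 131, 135, 84, 96, 97, 98}
pp  = {66, 99, 100, 116, 117, 118, 119, 120, 129, 131, 135, 69, 70, 97, 98}

masked  = (mcd | pp) - pca   # missed by PCA -> {69, 70, 84, 96, 97, 98}
swamped = pca - (mcd | pp)   # PCA false alarms -> {71, 76, 81, 101, 104, 111, 144}
print("masking:", sorted(masked))
print("swamping:", sorted(swamped))
```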

VI. CONCLUSION AND FUTURE WORK

The study has examined the performance of PCA and of its robustification methods (MCD, PP) for intrusion detection by presenting the bi-plots and extracting outlying observations that are very different from the bulk of the data. The study showed that the tuned results are identical to the visualized ones. The study attributes the PCA false alarms to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with slight differences in type 2 outliers, since these are considered a source of noise. Our future work will apply the hybrid method (ROBPCA), which uses PP as a reduction technique and MCD as a robust measure, for further performance gains, and will apply a dynamic robust PCA model with regard to online intrusion detection.

REFERENCES
[1]. Abilene TMs, collected by Zhang. www.cs.utexas.edu/yzhang/research, visited on 13/07/2012.
[2]. Khalid Labib and V. Rao Vemuri, "An application of principal components analysis to the detection and visualization of computer network attacks". Annals of Telecommunications, pages 218-234, 2005.
[3]. C. Croux and A. Ruiz-Gazen, "A fast algorithm for robust principal components based on projection pursuit". COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, pp. 211-217.
[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A novel anomaly detection scheme based on principal component classifier". In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03).
[5]. J. Edward Jackson, "A User's Guide to Principal Components". Wiley-Interscience, 1st edition, 2003.
[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot, "Diagnosing network-wide traffic anomalies". In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ACM, 2004.
[7]. Yacine Bouzida, Frédéric Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault, "Efficient intrusion detection using principal component analysis". La Londe, France, June 2004.
[8]. R. Gnanadesikan, "Methods for Statistical Data Analysis of Multivariate Observations". Wiley-Interscience, New York, 2nd edition, 1997.
[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Zhu, and A. Nobel, "Multivariate SVD analysis for network anomaly detection". In Proceedings of the ACM SIGCOMM Conference, 2005.
[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati, "Network traffic analysis using singular value decomposition and multiscale transforms". Information Sciences: An International Journal, 2007.
[11]. I. T. Jolliffe, "Principal Component Analysis". Springer Series in Statistics, Springer, 2nd edition, 2007.
[12]. Wei Wang, Xiaohong Guan, and Xiangliang Zhang, "Processing of massive audit data streams for real-time anomaly intrusion detection". Computer Communications, Elsevier, 2008.
[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows". In Proceedings of SIGMETRICS, New York, NY, USA, 2004.

AUTHORS BIOGRAPHIES
Nada Badr earned her B.Sc. in Mathematical and Computer Science at the University of Gezira, Sudan. She received her M.Sc. in Computer Science at the University of Science and Technology. She is pursuing her Ph.D. in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor in Computer Science, Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his Ph.D. in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.


