I. INTRODUCTION
Traditional rule-based security solutions can hardly detect advanced attacks such as zero-day attacks and advanced persistent threats (APTs). Attackers acquire advanced skills and exploit unknown vulnerabilities to bypass these solutions. Machine learning (ML) is becoming a prevalent way of detecting advanced attacks with unexpected patterns [1]. ML is based on statistical and mathematical algorithms rather than rule-based algorithms, and ML techniques contribute to improving the performance of intrusion detection systems (IDS). Numerous studies have addressed ML-based IDS techniques since KDD CUP 99 (KDD) appeared in 1999. KDD, generated by the Defense Advanced Research Projects Agency (DARPA), is the most widely used dataset for IDS evaluation. It consists of data labeled with four types of attacks: denial of service (DoS), probe, User-to-Root (U2R), and Remote-to-Local (R2L). Hasan [2] focuses on two-class and multi-class classification using a support vector machine (SVM). Mulay [3] employs Random Forest as well as SVM for intrusion detection and preprocesses the KDD dataset through binary encoding and data rescaling. Beghdad [4] classifies normal and malicious traffic based on SVM and then detects attacks based on a Decision Tree.
Numerous studies employ deep learning (DL) for intrusion detection. Jia [5] and Yuchen [6] propose convolutional neural network (CNN) models for IDS and train them on KDD data converted into images. Le [7], Staudemeyer [8], and Kim [9] suggest IDS models based on LSTM-RNN. Later, NSL-KDD and gureKDD appeared to address the problems of KDD. The latest well-known datasets are CIC IDS 2017 (CIC-2017) [10] and CSE-CIC-IDS 2018 (CIC-2018) [11]. CIC-2017 contains recent attacks in a form similar to PCAP. It was created based on the B-Profile system [12]. After CIC-2017 was released, several studies suggested ML-based intrusion detection models using CIC-2017 [13-14]. CIC-2018 is the most up-to-date dataset including common attacks for IDS evaluation. CIC-2018 was not generated based on KDD and consists of data labeled with 7 types of attack scenarios (specifically, 16 types of attacks), including brute-force, DoS, and botnet attacks. CIC-2018 contains massive network traffic and system logs. Compared with ML-based studies on CIC-2018 [15-16], DL-based studies on CIC-2018 can hardly be found.
In this paper, we develop an intrusion detection model based on CNN, one of the DL algorithms used to train on image datasets. We first convert the CIC-2018 numerical data into images. We then develop a CNN-based intrusion detection model by organizing convolutional layers and max-pooling layers. Furthermore, we train the model on the images and evaluate its performance by comparing the experimental results with those of a recurrent neural network (RNN) model. Lastly, we discuss a way of improving the performance. CNN and RNN are fundamental deep learning models for image data and time-series data, respectively. Inception [25] and ResNet [26] are based on CNN, and Long Short-Term Memory (LSTM) [27] is an advanced model of RNN. By employing these fundamental models, we are able to identify the optimal analysis model for the characteristics of CIC-2018. Furthermore, we could improve the performance using those advanced models in the future. The remainder of this paper is organized as follows. Section 2 briefly describes existing ML-based studies on intrusion detection as well as the DL algorithms we use in this work. In Section 3, we design our CNN-based intrusion detection model along with its features. In Section 4, we evaluate the proposed model and discuss a preprocessing issue for better performance. Finally, Section 5 concludes the paper.
II. RELATED WORKS
KDD CUP 99 (KDD) was generated for IDS evaluation and includes four types of attacks: DoS, R2L, U2R, and probing. KDD consists of 41 features, including traffic features and the basic and content features of each TCP connection. KDD has been widely used for data mining and ML studies on intrusion detection. Table 1 shows existing ML/DL-based studies on intrusion detection using KDD.
Some studies employ ML techniques such as SVM, Decision Tree, and Artificial Neural Network (ANN) [6, 17]. Most DL-based studies use CNN, RNN, LSTM, and Deep Neural Network (DNN) algorithms [7-9], [17-18]. Moreover, some studies focus on preprocessing techniques for KDD [19-20]. NSL-KDD was generated to resolve some issues in KDD, especially duplicated records and the lack of patterns for several attacks. Chuanlong [21] studies an intrusion detection model based on a Recurrent Neural Network (RNN) using NSL-KDD.
The Canadian Institute for Cybersecurity (CIC) generated IDS datasets in 2012, 2017, and 2018. In 2012, ISCX IDS 2012 (CIC-2012) [22] was generated by injecting four types of attacks: infiltration attacks from inside, HTTP DoS attacks, distributed denial of service (DDoS) attacks, and brute-force attacks. Tamim [23] detects attacks in CIC-2012 based on CNN. He generates input images by converting destination payloads and classifies the images into normal and attack, whereas we classify two or more attacks in CIC-2018 using multi-class classification. CIC-2017 [10] and CIC-2018 [11] are the most up-to-date datasets for IDS evaluation. CIC-2017 contains network traffic with the most common attack families, including brute-force attacks, Heartbleed attacks, botnets, DDoS attacks, and web attacks. Faker [13] studies intrusion detection using the CIC-2017 and UNSW-NB15 datasets. This study removes socket information to prevent model overfitting. To reduce the data size, the authors remove null values and unimportant traffic information. They also convert string values into numerical values and normalize the values. To handle missing or infinite data, they make two versions of each dataset: in the first, all missing and infinite data are replaced with average values; in the second, all missing and infinite data are removed. They evaluate their model with both versions. As training algorithms, DNN, Random Forest, and Gradient Boosting Tree classification are used. X. Zhang [14] focuses on intrusion detection using Deep Forest. They preprocess the datasets using the P-ZigZag encoding method and apply an inverse discrete cosine transform (IDCT) to the preprocessed datasets.
CIC-2018 contains more recent network traffic with and without attacks. CIC-2018 was generated by collecting network traffic and system logs and provides about 80 features. Qianru [15] analyzes the CIC-2018 dataset employing ML techniques. This study preprocesses the dataset by eliminating normal data and noise data and then removing unnecessary values after the decimal point. With these preprocessing methods, the size of CIC-2018 decreased by 4 MB.
As ML techniques, they use Random Forest, Decision Tree, Gaussian Naïve Bayes classifier, Multi-Layer Perceptron (MLP), K-nearest neighbors classifier, and Quadratic discriminant analysis classifier. Table 2 and Table 3 show IDS studies using CIC-2017 and CIC-2018, respectively. DL-based IDS studies using CIC-2018 can hardly be found. In this work, we suggest an IDS model employing DL techniques.

Algorithm\Dataset | CIC-IDS 2017 (CIC-2017) | |
---|---|---|
ML: Random Forest | O | - |
ML: GNB | - | - |
ML: Decision Tree | - | - |
ML: MLP | - | - |
DL: DNN | O | - |
DL: GBT | O | - |
DL: XGBoost | - | O |
DL: CNN | - | - |
DL: RNN | - | - |
pre-processing | Convert the dataset into images | Remove socket data |
pre-processing | Data Padding | Remove white space |
pre-processing | P-ZigZag Encoding | Encode label |
pre-processing | - | Normalize data |
pre-processing | - | Replace or Remove missing/infinite data |
pre-processing | - | Remove normal traffic data |
evaluation | Binary Classification - DNN | P-ZigZag |
evaluation | Binary Classification - GBT | |
evaluation | Multiclass Classification - DNN | OHE |
evaluation | Multiclass Classification - GBT | |
reference | [13] | [14] |


Algorithm\Dataset | CSE-CIC-IDS 2018 (CIC-2018) | |
---|---|---|
ML: Random Forest | O | - |
ML: GNB | O | - |
ML: Decision Tree | O | - |
ML: MLP | O | - |
DL: DNN | - | - |
DL: GBT | - | - |
DL: XGBoost | - | - |
DL: CNN | - | O |
DL: RNN | - | O |
pre-processing | Remove normal/noise data | Remove null values and infinite values |
pre-processing | Eliminate unnecessary value after decimal | Convert numerical data into images |
pre-processing | Replace untreatable value | - |
evaluation | Classify each Zero-Day attack & benign data | Multiclass Classification - CNN |
evaluation | Classify mixed Zero-Day attack & benign data | Multiclass Classification - RNN |
reference | [15] | Our approach |
III. METHODS
CSE-CIC-IDS2018 (CIC-2018) is a dataset containing network traffic and system logs. CIC-2018 consists of 10 sub-datasets collected on different days by injecting 16 types of attacks. This dataset was generated using CICFlowMeter-V3 [24] and contains about 80 types of features. These features describe the forward and backward directions of network flows and packets. The size of CIC-2018 is more than 400 GB, which is larger than that of CIC-2017. We can develop a DL-based IDS model and evaluate its performance using CIC-2018.
Table 4 shows a list of the injected attacks and the size of each sub-dataset we use in this work. We have removed incomplete data such as null values and infinite values.
We have extracted 80 types of common features from all the sub-datasets. Using these common features, we can develop and evaluate each attack model in the same environment. We finally choose 79 features, excluding ‘Timestamp’, as shown in Table 5. The details of the features can be found on the CIC-2018 website [11].
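As a minimal sketch of this preprocessing step, the following Python snippet drops records that contain null or infinite feature values and removes the ‘Timestamp’ column while keeping ‘Label’. The helper names (`is_valid`, `clean_records`), the three sample columns, and the sample values are illustrative assumptions, not the code or data used in this work.

```python
import math

def is_valid(value):
    """Return True if a raw field parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False  # empty strings, None, or non-numeric text

def clean_records(records, drop=("Timestamp",), label="Label"):
    """Drop unwanted columns and records with null/infinite feature values.

    `records` is a list of dicts, e.g. rows read with csv.DictReader.
    """
    cleaned = []
    for row in records:
        kept = {k: v for k, v in row.items() if k not in drop}
        features = [v for k, v in kept.items() if k != label]
        if all(is_valid(v) for v in features):
            cleaned.append(kept)
    return cleaned

# Hypothetical sample rows (only a few columns, values invented for illustration).
sample = [
    {"Timestamp": "t0", "Flow Duration": "112641", "Flow Byts/s": "0.0", "Label": "Benign"},
    {"Timestamp": "t1", "Flow Duration": "3", "Flow Byts/s": "Infinity", "Label": "Benign"},
    {"Timestamp": "t2", "Flow Duration": "", "Flow Byts/s": "12.5", "Label": "FTP-BruteForce"},
]
cleaned = clean_records(sample)
```

Only the first sample row survives: the second contains an infinite value and the third a missing one.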
CNN is the most commonly used deep learning algorithm for image training. To develop a CNN-based intrusion detection model, the CIC-2018 dataset must be converted into images. We convert each labeled record into a 13x6 image because each record contains 78 features excluding the ‘Label’ feature. The ‘Label’ is used for image classification. A CNN model consists of convolutional layers, max-pooling layers, and a fully connected layer. We can find the optimal CNN model by organizing those layers along with modeling parameters such as the kernel size, the number of kernels, and the dropout ratio. Figure 1 shows our CNN model for CIC-2018.
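The conversion of one record into a 13x6 image can be sketched as a simple reshape. The row-major feature ordering and the helper name `to_image` are assumptions made for illustration; the text does not specify how the 78 values are arranged or scaled, so no scaling is applied here.

```python
def to_image(features, rows=13, cols=6):
    """Reshape a flat feature vector into a rows x cols grid (row-major)."""
    if len(features) != rows * cols:
        raise ValueError(f"expected {rows * cols} features, got {len(features)}")
    return [list(features[r * cols:(r + 1) * cols]) for r in range(rows)]

# Hypothetical record: 78 placeholder feature values (the 'Label' is kept separately).
record = [float(i) for i in range(78)]
image = to_image(record)
```

Each record then yields one single-channel 13x6 grid, and its ‘Label’ serves as the class of that image.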
We deploy two convolutional layers, each followed by a max-pooling layer. Although the max-pooling layer is not mandatory for a CNN model, we deploy it because max pooling is very unlikely to lose important features, as the converted images contain only numerical data rather than hidden signatures. In addition, we use ‘relu’ as the activation function for each convolutional layer. To reduce overfitting, dropout is applied after each max-pooling step. Finally, a fully connected layer is deployed after the last max-pooling layer.
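To illustrate how a 13x6 input shrinks through two convolution and max-pooling stages, the sketch below tracks tensor shapes layer by layer. The 3x3 kernel, the filter counts (32 and 64), the 2x2 pooling window, and the ‘same’ padding are all assumptions chosen for illustration; they are not parameters reported in this work.

```python
import math

def conv_out(size, kernel, padding="same", stride=1):
    """Output length of one spatial dimension after a convolution."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid" padding

def pool_out(size, pool=2):
    """Output length after non-overlapping max pooling (stride == pool)."""
    return size // pool

def trace_shapes(height=13, width=6, kernel=3, filters=(32, 64), padding="same"):
    """Return (layer name, shape) pairs for conv -> pool stages and flatten."""
    shapes = [("input", (height, width, 1))]
    for i, n in enumerate(filters, start=1):
        height = conv_out(height, kernel, padding)
        width = conv_out(width, kernel, padding)
        shapes.append((f"conv{i}+relu", (height, width, n)))
        height, width = pool_out(height), pool_out(width)
        shapes.append((f"maxpool{i}+dropout", (height, width, n)))
    shapes.append(("flatten", (height * width * filters[-1],)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
```

Under these assumed settings, the vector passed to the fully connected layer has 3 x 1 x 64 = 192 values. Calling `trace_shapes(padding="valid")` shows the width collapsing to zero at the second convolution, so with such a small input, 3x3 kernels would need ‘same’ padding or a smaller kernel size.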
IV. RESULTS AND DISCUSSION
We train our CNN model on each sub-dataset described in Table 4. Thirty percent of each sub-dataset is used as the testing set. We set the training parameters as shown in Table 6.
Parameters | Value |
---|---|
Optimization algorithm (learning rate) | Adam (0.001) |
Size of batch | 100 |
Number of epochs | 10 |
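
The split and the training parameters above translate into a fixed number of optimizer updates per run. The following sketch, assuming a random shuffle before splitting (the text does not state how the 30% is selected) and a hypothetical sub-dataset of 1,000 records, works through that arithmetic.

```python
import math
import random

def train_test_split(samples, test_ratio=0.3, seed=0):
    """Shuffle and split samples; `seed` is an assumption for repeatability."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def updates_per_run(n_train, batch_size=100, epochs=10):
    """Number of optimizer steps for the given batch size and epoch count."""
    return math.ceil(n_train / batch_size) * epochs

samples = list(range(1000))  # hypothetical sub-dataset of 1,000 records
train, test = train_test_split(samples)
steps = updates_per_run(len(train))
```

For this hypothetical size, 700 records are used for training and 300 for testing, which gives 7 batches per epoch and 70 Adam updates over 10 epochs.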

To evaluate the performance of our model, we also train an RNN model on the dataset and compare the experimental results. We design the RNN model based on a ‘vanilla RNN’ with 10 units. Figure 2 shows the experimental results of the CNN and RNN models. In most sub-datasets, our CNN model has a higher accuracy than the RNN model.
The accuracy is measured as the proportion of test samples that the model classifies correctly.
In particular, the accuracies of SD-2, SD-3, SD-5, and SD-9 with CNN are about 10% to 60% higher than those with RNN.
Although our experimental results show that our CNN model detects attacks in CIC-2018 with high accuracy, we still need to find a way of improving the accuracy for each attack. For instance, the accuracy of SD-3 is 0.9677 in Figure 2. According to the confusion matrix of SD-3, however, the accuracies of ‘DoS-GoldenEye’ and ‘DoS-Slowloris’ are 0.66 and 0.47, respectively, while the accuracy of ‘benign’ is 0.99, as shown in Table 7.
This means that our model performs best in classifying benign data because the CIC-2018 dataset provides much more ‘benign’ data than attack-labeled data.
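The gap between a high overall accuracy and low per-attack accuracies can be reproduced with a small confusion matrix. The sketch below assumes that the per-class values are read from a row-normalized confusion matrix (the fraction of each true class predicted correctly); the class names and counts are hypothetical and are not the SD-3 results.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of confusion matrix `cm`."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def per_class_accuracy(cm):
    """Fraction of each true class (row) that is predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical counts; rows = true class, columns = predicted class.
labels = ["benign", "attack A", "attack B"]
cm = [
    [9900, 60, 40],  # benign
    [30, 60, 10],    # attack A
    [45, 5, 50],     # attack B
]
print(round(overall_accuracy(cm), 4))
print([round(a, 2) for a in per_class_accuracy(cm)])
```

Because benign records dominate the total, the overall figure stays close to the benign value even though only about half of the attack records are classified correctly.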
Here we adjust the ratio of labeled data to improve the performance of DL, although the original dataset better represents the real-world network environment and distinguishing anomalous traffic from massive benign traffic in a real network is challenging. In ML and DL, data preprocessing is an important strategy for high performance. We preprocessed the sub-datasets with low accuracy on attack-labeled data (SD-3, SD-6, SD-7, SD-8, and SD-9) so that the amount of benign data is no more than five times the smallest amount of attack-labeled data. We then train our CNN model on these datasets. Figure 3 compares the accuracy of each attack before and after the preprocessing that considers the data ratio. The experimental results show that the accuracies of most attacks increase dramatically through the preprocessing. We can find the optimal ratio of benign and attack-labeled data through repetitive preprocessing and training.
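A minimal sketch of this ratio-based preprocessing is given below. It assumes that the benign data are reduced by random downsampling, which the text does not specify; the helper name `cap_benign`, the label string "Benign", the seed, and the record counts are likewise illustrative assumptions.

```python
import random
from collections import defaultdict

def cap_benign(records, max_ratio=5, benign="Benign", label="Label", seed=0):
    """Downsample benign records to at most max_ratio x the smallest attack class."""
    by_label = defaultdict(list)
    for row in records:
        by_label[row[label]].append(row)
    attack_sizes = [len(rows) for name, rows in by_label.items() if name != benign]
    if not attack_sizes:
        return list(records)  # no attack class to balance against
    limit = max_ratio * min(attack_sizes)
    benign_rows = by_label[benign]
    if len(benign_rows) > limit:
        benign_rows = random.Random(seed).sample(benign_rows, limit)
    attacks = [row for name, rows in by_label.items() if name != benign for row in rows]
    return benign_rows + attacks

# Hypothetical sub-dataset: 1,000 benign records and two attack types (40 and 10 records).
records = (
    [{"Label": "Benign"}] * 1000
    + [{"Label": "DoS-GoldenEye"}] * 40
    + [{"Label": "DoS-Slowloris"}] * 10
)
balanced = cap_benign(records)
```

Here the smallest attack class has 10 records, so at most 50 benign records are kept while all attack records are retained. Varying `max_ratio` and retraining is one way to search for a suitable ratio.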
V. CONCLUSION
We have employed DL techniques for intrusion detection, using CIC-2018 as the IDS dataset. We have designed a CNN model consisting of two convolutional layers and two max-pooling layers and converted the dataset into images. The proposed CNN model has been trained on these images, and the experimental results showed that our model detects benign and attack data in CIC-2018 with high accuracy. To evaluate the performance of our model, we have also trained an RNN model on the dataset. In multi-class classification, our CNN model is more accurate than the RNN model when applied to CIC-2018, the latest CIC dataset, using the image-based deep learning method introduced in Tamim’s work [23]. Furthermore, we have suggested a way of improving the performance by preprocessing the dataset considering the ratio of benign and attack-labeled data. The experimental results showed that the accuracy on attack-labeled data increased through this preprocessing method. In the future, we will train our CNN model on another IDS dataset and find the optimal model by reorganizing the convolutional layers along with the CNN parameters.