
Notes on the COMP3009 Data Mining assignment and my reflection.

Purpose and Background

For this assignment in COMP3009 – Data Mining, the goal was to build a machine learning pipeline to classify network activity patterns.
The dataset contained various network traffic features, and the task was to predict the type of activity occurring in the network.

This project focused on a classical classification problem in machine learning. Instead of deep learning, the emphasis was on:

  • understanding the dataset,
  • preparing the data properly,
  • experimenting with multiple machine learning models,
  • and evaluating the results carefully.

Although this unit was an elective, I found it genuinely interesting and enjoyable. It gave me a deeper appreciation of how practical machine learning systems are built step by step.

First Look at the Data

One of the first things I checked was the class distribution.

The plot above immediately showed that the dataset was highly imbalanced. Classes such as Generic, Normal, and Exploits made up a large portion of the data, while minority classes like Worms and Shellcode had very few samples.

This matters because a model can appear to perform well simply by predicting the majority classes most of the time. From the start, it was clear that this assignment was not only about training a classifier, but also about thinking carefully about how to handle skewed data.
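As a minimal sketch of this first check, the class balance can be inspected with pandas before any modelling. The column name and counts below are toy stand-ins, not the actual dataset:

```python
import pandas as pd

# Toy stand-in for the network-traffic DataFrame; in the real
# assignment this would be loaded from the provided data file.
df = pd.DataFrame({
    "label": ["Generic"] * 50 + ["Normal"] * 30 +
             ["Exploits"] * 15 + ["Shellcode"] * 4 + ["Worms"] * 1
})

# Absolute counts and relative frequencies per class.
counts = df["label"].value_counts()
proportions = df["label"].value_counts(normalize=True)

print(counts)
print(proportions.round(3))
```

Seeing the proportions (rather than raw counts alone) makes the imbalance obvious at a glance.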

Feature Relationships and EDA

To better understand the dataset, I also looked at the relationships between numeric features using a correlation heatmap.

The heatmap helped highlight which features were strongly related and which seemed more independent. This was useful when thinking about:

  • possible redundancy between features,
  • how different models might react to correlated variables,
  • and which parts of the dataset might need more careful preprocessing.
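The redundancy check behind the heatmap can be sketched like this, using synthetic features where one pair is deliberately correlated (the feature names and the 0.9 threshold are illustrative choices, not taken from the assignment):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy numeric features: b is strongly related to a, c is independent.
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr()
print(corr.round(2))

# With seaborn installed, the same matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")

# Flag strongly correlated (possibly redundant) feature pairs.
redundant = [
    (i, j)
    for i in corr.columns
    for j in corr.columns
    if i < j and abs(corr.loc[i, j]) > 0.9
]
print(redundant)
```

Pairs flagged this way are candidates for dropping or combining before fitting distance-based models.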

I also inspected the distributions of several numeric features.

Many numeric attributes were clearly skewed, with long tails and uneven distributions. This reinforced the need to think carefully about preprocessing, feature scaling, and how model assumptions interact with real-world data.
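A quick way to see (and partly tame) that skew is a log transform. The lognormal feature below is a synthetic stand-in for a long-tailed attribute such as a byte count:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Lognormal values mimic a long-tailed traffic feature.
feature = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

print(f"skew before: {feature.skew():.2f}")

# log1p compresses the long tail while keeping zero values valid.
transformed = np.log1p(feature)
print(f"skew after:  {transformed.skew():.2f}")
```

The transformed feature is far closer to symmetric, which tends to suit models that assume roughly well-behaved inputs.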

Data Preparation

Before building models, I performed several preprocessing steps:

  • handling missing values,
  • correcting inconsistent values,
  • removing irrelevant attributes,
  • converting categorical features when needed,
  • and standardising numerical features for distance-based models.

Exploratory data analysis (EDA) helped identify class imbalance, skewed numeric features, and correlations between variables. Proper data preparation turned out to be one of the most important parts of the entire project.
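The steps above can be combined into a single scikit-learn preprocessing pipeline. This is only a sketch on a toy frame; the column names (`duration`, `protocol`) are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a missing value, a categorical column, and a numeric one.
df = pd.DataFrame({
    "duration": [1.0, 2.0, np.nan, 4.0],
    "protocol": ["tcp", "udp", "tcp", "icmp"],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handle missing values
        ("scale", StandardScaler()),  # standardise for distance-based models
    ]), ["duration"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 4): 1 scaled numeric column + 3 one-hot columns
```

Keeping the transforms inside one pipeline means the exact same preprocessing is applied to training and test data, avoiding leakage.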

Models Explored

I experimented with several classical machine learning models:

k-Nearest Neighbors (kNN)

kNN is a distance-based classifier: a prediction is made from the labels of the closest training samples. Because of this:

  • feature scaling is essential,
  • and high-dimensional noise can significantly affect performance.

Despite its simplicity, it provided a useful baseline model.
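A minimal sketch of this baseline, with scaling wired into the pipeline so the distance metric stays meaningful (synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the network-traffic features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scaling inside the pipeline is applied consistently at fit and predict time.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(f"kNN accuracy: {knn.score(X_test, y_test):.2f}")
```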

Naïve Bayes

Naïve Bayes is a probabilistic classifier based on Bayes’ theorem.

It assumes that features are independent given the class label, which is often not strictly true in real datasets. However, it still performs surprisingly well in many classification tasks and is computationally efficient.
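For continuous features, the Gaussian variant is the usual choice. A small sketch on the same kind of synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# GaussianNB models each feature as class-conditionally normal and
# applies Bayes' theorem under the feature-independence assumption.
nb = GaussianNB()
nb.fit(X_train, y_train)

proba = nb.predict_proba(X_test)  # per-class posterior probabilities
print(f"NB accuracy: {nb.score(X_test, y_test):.2f}")
```

Because it only estimates per-class means and variances, training is essentially a single pass over the data.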

Decision Tree

Decision Trees split the data based on feature values to create interpretable classification rules.

They are less sensitive to feature scaling and can capture nonlinear relationships. However, they are prone to overfitting, so controlling tree depth and other parameters becomes important.
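The overfitting trade-off can be seen directly by comparing an unconstrained tree with a depth-limited one (synthetic data again; `max_depth=4` is an illustrative value):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# An unconstrained tree tends to memorise the training set...
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ...while limiting depth trades training fit for generalisation.
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print(f"full tree train={full.score(X_train, y_train):.2f} "
      f"test={full.score(X_test, y_test):.2f}")
print(f"depth<=4  train={pruned.score(X_train, y_train):.2f} "
      f"test={pruned.score(X_test, y_test):.2f}")
```

The near-perfect training score of the full tree alongside a lower test score is the classic overfitting signature.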

XGBoost

I also experimented with XGBoost, a gradient boosting algorithm that combines multiple decision trees.

Tree-based ensemble models are often very effective for structured tabular data, and XGBoost provided one of the stronger performances in this project.
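The idea can be sketched with scikit-learn's built-in `GradientBoostingClassifier` as a stand-in, since `xgboost` is a separate install; `xgboost.XGBClassifier` exposes the same `fit`/`predict` interface with extra regularisation and speed optimisations. The hyperparameter values here are illustrative, not the ones used in the assignment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Boosting fits each new shallow tree to the errors of the current
# ensemble, so many weak trees combine into a strong classifier.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print(f"boosting accuracy: {gbm.score(X_test, y_test):.2f}")
```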

Handling Skewed Data

Because the dataset was imbalanced, relying on accuracy alone would be misleading.

To address this, I focused on additional evaluation metrics such as:

  • Precision
  • Recall
  • F1-score
  • Macro-F1

Macro-F1 was especially helpful because it treats each class equally, rather than letting majority classes dominate the evaluation.
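A contrived example shows why. Below, a classifier that only ever predicts the majority class still scores well on accuracy and weighted F1, but macro-F1 exposes the ignored minority class (the labels and counts are made up for illustration):

```python
from sklearn.metrics import classification_report, f1_score

# Toy imbalanced predictions: the model always predicts "normal".
y_true = ["normal"] * 90 + ["worms"] * 10
y_pred = ["normal"] * 100

# Macro-F1 averages per-class F1 with equal weight per class.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"macro-F1:    {macro:.3f}")
print(f"weighted-F1: {weighted:.3f}")
print(classification_report(y_true, y_pred, zero_division=0))
```

The "worms" class gets an F1 of zero, which drags macro-F1 down even though 90% of predictions are correct.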

To deal with the skewed dataset, I explored several approaches:

  • using stratified train/validation splits so class proportions remain consistent,
  • experimenting with class weights during model training,
  • and testing SMOTE (Synthetic Minority Over-sampling Technique) to generate additional minority-class samples.
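The first two ideas can be sketched together; SMOTE is omitted here because it lives in the separate `imbalanced-learn` package. The 95/5 split is an illustrative toy ratio:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 95/5 imbalanced toy data (class 1 is the minority).
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], flip_y=0, random_state=0)

# stratify=y keeps the class proportions consistent across both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)
print(f"minority share train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")

# class_weight="balanced" reweights errors inversely to class frequency,
# so minority-class mistakes cost more during training.
clf = DecisionTreeClassifier(
    class_weight="balanced", max_depth=4, random_state=0)
clf.fit(X_train, y_train)
```

With `imbalanced-learn` installed, `SMOTE().fit_resample(X_train, y_train)` would instead synthesise additional minority-class samples before fitting.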

These techniques helped me better understand how imbalanced datasets affect model behaviour, and why model evaluation needs to reflect more than just headline accuracy.

Reflection

Overall, I really enjoyed this assignment and the Data Mining unit.

Even though it was an elective, it was very engaging to explore how machine learning models can be used to classify network activity. Understanding how to approach skewed datasets and how different models behave on the same data was one of the most interesting parts for me.

One small regret is that I could not push the performance further. During Semester 2 of my third year, I was also heavily involved in my capstone project, so time was limited. With more time, I would have liked to experiment further with feature engineering, hyperparameter tuning, and model optimisation to improve the final results.

That said, this project made me realise how interesting the data science side of computing can be. Exploring patterns in data, testing different models, and understanding how evaluation metrics reflect real performance was both challenging and rewarding.

It was a good reminder that machine learning is not just about fitting models, but about understanding the data and designing the right approach for the problem.
