UFCFFY-15-M Cyber Security Analytics
Portfolio Assignment: Worksheet 2
Conduct an investigation on a URL database to develop a DGA
classification system using machine learning techniques
For this task, the company “UWEcyberSolutions” enlists your help once more. They have identified a number of
suspicious URLs on their logging systems, which they suspect are associated with various malware, and so require your
expertise to investigate these further. Specifically, they seek a machine learning approach to identify the malware families
observed on their network.
You will need to develop a machine learning tool using Python and scikit-learn that can classify URLs produced by Domain
Generation Algorithms (DGAs), which are widely used by command and control malware to avoid static IP blocking.
You are expected to show experimental design of appropriate feature engineering to characterise the data, that will be
used to inform your machine learning classifiers. Specifically, you should show experimentation of how different schemes
of feature selection can impact the performance of the classifiers.
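As one possible starting point for feature engineering, the sketch below derives a few hand-crafted lexical statistics from a single domain name. The function name `lexical_features`, the choice of statistics, and the decision to analyse only the left-most label are assumptions made for illustration; they are not prescribed by this worksheet, and whether any of these features help is something to establish experimentally.

```python
import math
from collections import Counter


def lexical_features(domain: str) -> dict:
    """Compute a few simple character-level statistics for one domain name."""
    # Assumption: only the left-most label is analysed ("google" in "google.com").
    label = domain.lower().split(".")[0]
    length = len(label)
    counts = Counter(label)
    # Shannon entropy (bits per character) of the label's character distribution.
    entropy = 0.0
    if length:
        entropy = -sum((c / length) * math.log2(c / length) for c in counts.values())
    digits = sum(ch.isdigit() for ch in label)
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": length,
        "entropy": entropy,
        "digit_ratio": digits / length if length else 0.0,
        "vowel_ratio": vowels / length if length else 0.0,
    }


# Two domains taken from the data set preview.
benign = lexical_features("google.com")
generated = lexical_features("nlgusntqeqixnqyo.org")
print(benign)
print(generated)
```

Features like these could be compared against, or combined with, other feature schemes to see how the choice affects each classifier.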
You are also expected to compare 3 different classifiers using the scikit-learn library, and show how the model parameters
can impact the performance of the classifiers. It is suggested that you use a Logistic Regression, a Random Forest
Classifier, and a Multi-Layer Perceptron Classifier.
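A comparison of the three suggested classifiers under different parameter settings might be organised along the following lines. This is only a sketch: it uses ten domains copied from the data set preview, an assumed character-bigram feature scheme, and arbitrary parameter values, so the printed accuracies carry no meaning beyond showing the mechanics.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Ten domains copied from the data set preview; far too few for a real result.
domains = ["google.com", "facebook.com", "youtube.com", "twitter.com", "instagram.com",
           "fhyibfwhpahb.su", "nlgusntqeqixnqyo.org", "awwduqqrjxttmn.su",
           "ccxmwif.pl", "yhrryqjimvgfbqrv.pw"]
families = ["benign"] * 5 + ["locky"] * 5

# Hold out a stratified test split before any fitting takes place.
train_domains, test_domains, y_train, y_test = train_test_split(
    domains, families, test_size=0.4, stratify=families, random_state=0)

# Assumed feature scheme: character bigram counts, fitted on the training split only.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X_train = vectorizer.fit_transform(train_domains)
X_test = vectorizer.transform(test_domains)

# Two illustrative settings per model; the parameter values are arbitrary choices.
candidates = {
    "LogisticRegression C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "LogisticRegression C=10": LogisticRegression(C=10, max_iter=1000),
    "RandomForest 10 trees": RandomForestClassifier(n_estimators=10, random_state=0),
    "RandomForest 100 trees": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP 1x16": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "MLP 2x32": MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0),
}

# Fit every candidate on the same split and record its test accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name:28s} accuracy={score:.2f}")
```

Keeping the split and the feature scheme fixed while varying one model parameter at a time makes it easier to attribute any change in performance to that parameter.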
Finally, you should investigate the performance of your classifiers for completing the task, using confusion matrices, and
performance metrics (e.g., accuracy, precision, recall). In reporting your findings, you should reflect on how and why the
set of feature selection and model parameters maximises the performance of the classifier.
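The metrics named above can be computed with `sklearn.metrics`. The sketch below uses a small set of true and predicted labels invented purely for illustration (they are not results from the data set), and macro averaging is an assumed choice for summarising a multi-class problem.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels, invented only to demonstrate the calculations.
y_true = ["benign", "benign", "benign", "locky", "locky", "locky", "tinba", "tinba"]
y_pred = ["benign", "benign", "locky", "locky", "locky", "tinba", "tinba", "tinba"]
labels = ["benign", "locky", "tinba"]

# Rows are true classes and columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one score per class; average="macro" is their unweighted mean.
precision_per_class = precision_score(y_true, y_pred, labels=labels, average=None)
recall_per_class = recall_score(y_true, y_pred, labels=labels, average=None)
macro_precision = precision_score(y_true, y_pred, labels=labels, average="macro")
macro_recall = recall_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
print(f"accuracy={accuracy:.2f}")
print(f"macro precision={macro_precision:.2f}, macro recall={macro_recall:.2f}")
```

Reading the off-diagonal cells of the confusion matrix shows which families are mistaken for one another, which a single accuracy figure cannot reveal.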
Dataset: Please see the folder *“Portfolio Assignment”* under the Assignment tab on Blackboard for further detail
related to the access and download of the necessary dataset.
Hint: You should conduct research using the scikit-learn documentation and API reference based on the sample code
provided. You should also think about a suitable means of generating input features for your classifier that capture
sequential properties of text data.
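One way to capture sequential properties of text, assuming character n-grams are a suitable representation, is to configure `CountVectorizer` to count overlapping character pairs. The bigram setting and the two example domains (taken from the data set preview) are illustrative choices, not a recommended configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two domains taken from the data set preview.
domains = ["google.com", "ccxmwif.pl"]

# analyzer="char" tokenises by character; ngram_range=(2, 2) keeps bigrams only.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(domains)

# vocabulary_ maps each bigram to its column index in the sparse matrix X.
print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

Widening `ngram_range` (for example to include trigrams) grows the vocabulary quickly, so the effect on both accuracy and training time is worth measuring.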
Assessment and Marking
The completion of this worksheet is worth 20% of your portfolio assignment for the UFCFFY-15-M Cyber Security
Analytics (CSA) module.
This is an unguided task that will be graded against four core criteria:
| Criteria | 0-1 | 2-3 | 4-5 | 6-7 | 8-10 |
|---|---|---|---|---|---|
| Suitable feature engineering stages | No or very little evidence of progress | Limited attempt to address this criteria | A possible solution but with weaknesses | A good solution with some justification | An excellent solution with clear justification |
| Suitable usage of machine learning library | No or very little evidence of progress | Limited attempt to address this criteria | Some fair attempt but with weaknesses | A good solution with some justification | An excellent solution with clear justification |
| Suitable experimental approach and rationale of improvement | No or very little evidence of progress | Limited attempt to address this criteria | Some fair attempt but with weaknesses | A good solution with some justification | An excellent solution with clear justification |
| Clarity and presentation | No or very little evidence of progress | Limited attempt to address this criteria | A reasonable attempt but with some weaknesses | Good detail and presentation | Excellent detail, professional presentation |
Submission Documents
Your submission for this task should include:
1 Jupyter Notebook exported in PDFviaHTML format:
You should complete your work using the iPYNB file provided (i.e., this document). Once you have completed your work,
you should use the export function in Jupyter to save your notebook as an HTML document (“File”, “Save and Export
Notebook As”, “PDFviaHTML”). *Do not submit your ipynb file – we will not execute any code during marking.
Therefore, you must ensure that all code cell output is presented clearly in your PDF document before you make
your final submission.*
The deadline for your portfolio submission is TUESDAY 2ND MAY @ 14:00. This assignment is eligible for the 5-day late
window policy; however, module staff will not be able to assist with any queries after the deadline.
The portfolio will be submitted to Blackboard as 4 independent documents:
*STUDENT_ID-TASK1.pdf* (a PDF document exported from your Jupyter notebook)
*STUDENT_ID-TASK2.pdf* (a PDF document exported from your Jupyter notebook)
*STUDENT_ID-TASK3.pdf* (a PDF report of your research investigation)
*STUDENT_ID-TASK4.mp4* or *STUDENT_ID-TASK4.txt* (either the video file of your presentation, or a text file
that contains instructions for accessing your video online)
Contact
Questions about this assignment should be directed to your module leader ([email protected]). You should use the
online Q&A form to ask questions related to this module and this assignment, as well as utilising the on-site teaching
sessions.
Student ID: -ENTER STUDENT NUMBER
By submitting this assignment to Blackboard as part of your portfolio, I declare that the submission is my
own work.
In [1]: # Import libraries as required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 10)
from collections import Counter
from timeit import timeit
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
In [2]: # Load in the data set as required
df = pd.read_csv('./task2-dga/dga-24000.csv')
df
Out[2]:
       Domain                Family
0      google.com            benign
1      facebook.com          benign
2      youtube.com           benign
3      twitter.com           benign
4      instagram.com         benign
…      …                     …
23995  fhyibfwhpahb.su       locky
23996  nlgusntqeqixnqyo.org  locky
23997  awwduqqrjxttmn.su     locky
23998  ccxmwif.pl            locky
23999  yhrryqjimvgfbqrv.pw   locky
24000 rows × 2 columns
In [3]: # Count how many entries exist for each malware family (plus the benign class)
df.value_counts('Family')
Out[3]:
Family
banjori     1000
benign      1000
tinba       1000
symmi       1000
suppobox    1000
…
locky       1000
gameover    1000
flubot      1000
emotet      1000
virut       1000
Length: 24, dtype: int64
Start your investigation…
Carry on with the investigation based on the initial code provided above. Conclude your investigation with a summary of
your findings.
In [ ]: