Coursework - Australia Assignments

COMP1702	Big Data	Faculty Header ID	Contribution: 100% of course
Course Leader: Hai Huang	Coursework	Deadline Date: 13 April 2023 (23:30, UK time)
Feedback and grades are normally made available within 21 calendar days of the coursework deadline.
Learning Outcomes:
1. Explain the concept of Big Data and its importance in a modern economy
2. Explain the core architecture and algorithms underpinning big data processing
3. Analyse and visualize large data sets using a range of statistical and big data technologies
4. Critically evaluate, select and employ appropriate tools and technologies for the development of big data applications

Plagiarism is presenting somebody else’s work as your own. It includes
copying information directly from the Web or books without referencing the
material; submitting joint coursework as an individual effort; copying another
student’s coursework; stealing coursework from another student and
submitting it as your own work. Suspected plagiarism will be investigated and
if found to have occurred will be dealt with according to the procedures set
down by the University. Please see your student handbook for further details
of what is / isn’t plagiarism.
All material copied or amended from any source (e.g. internet, books) must be
referenced correctly according to the reference style you are using.
Your work will be submitted for plagiarism checking. Any attempt to bypass our
plagiarism detection systems will be treated as a severe Assessment Offence.
Coursework Submission Requirements
• An electronic copy of your work for this coursework must be fully uploaded by
the Deadline Date of 13 April 2023 using the link on the coursework Moodle
page for COMP1702.
• For this coursework you must submit a single report in PDF format. In
general, any text in the document must not be an image (i.e. must not be
scanned) and would normally be generated from other documents (e.g. MS
Office using “Save As .. PDF”). An exception to this is handwritten
mathematical notation, but when scanning do ensure the file size is not
excessive.
• There are limits on the file size (see the relevant course Moodle page).
• Make sure that any files you upload are virus-free and not protected by a
password or corrupted; otherwise they will be treated as null submissions.
• Your work will not be printed in colour. Please ensure that any pages with
colour are acceptable when printed in Black and White.
• You must NOT submit a paper copy of this coursework.
• All coursework must be submitted as above. Under no circumstances can
it be accepted by academic staff.
The University website has details of the current Coursework Regulations, including
details of penalties for late submission, procedures for Extenuating Circumstances,
and penalties for Assessment Offences. See http://www2.gre.ac.uk/currentstudents/regs
Detailed Specification
You are expected to work individually and complete a report that
addresses the following tasks. You need to cite all sources you rely
on using in-text citations. You may include material discussed in the
lectures or labs, but additional credit will be given for independent
research. Note: References should be in Harvard format. The word
count does NOT include references.
• Part A (25 Marks)
o Task A.1 [mark 10] Explain the main characteristics of Big Data. (Word
count: 200 words ±10%)
o Task A.2 [mark 15] Compare Hadoop and Relational Database Systems. Give
an application scenario that is well suited to Hadoop and explain your reason.
(Word count: 300 words ±10%)
• Part B (30 Marks): MapReduce Programming
Suppose that there is a computer science bibliography file stored on Hadoop. Each line
of this file contains information about a paper in the following format:
authors|title|conference|years
The different fields are separated by the “|” character, and authors (the first field) are
separated by commas (“,”). You can assume that there are no duplicate records, and
each distinct author or conference has a different name.
An example line is:
D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010
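To make the record format concrete, the following minimal Python sketch shows one way such a line could be split into its fields. The function name `parse_record` is an illustrative assumption and not part of the specification, and the sketch assumes that titles never contain the “|” character.

```python
def parse_record(line):
    """Split one bibliography line into (authors, title, conference, year)."""
    # Fields are separated by "|"; this sketch assumes exactly four fields.
    authors_field, title, conference, year = line.strip().split("|")
    # Authors are separated by commas; strip the surrounding spaces.
    authors = [a.strip() for a in authors_field.split(",") if a.strip()]
    return authors, title, conference, int(year)


example = "D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010"
authors, title, conference, year = parse_record(example)
print(authors, conference, year)
```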
Please design a MapReduce algorithm (using pseudocode or Java code) to output
the number of papers by each author in each year if the number is larger than 0.
The algorithm is expected to be as efficient as possible.
You should also explain how the input is mapped into (key, value) pairs by the map
stage, i.e., specify what is the key and what is the associated value in each pair, and,
how the key(s) and value(s) are computed. Then you should explain how the output
(key, value) pairs of the map stage are processed by the reduce stage to get the final
answer(s). You need to discuss the efficiency of your algorithm (how your design
makes your algorithm efficient). (Word count: 300 words ±10%)
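As a hedged illustration of the (key, value) flow described above, and not a model answer, the sketch below simulates one possible design locally in plain Python without any Hadoop API. It assumes the map stage emits a composite key (author, year) with the value 1, and the reduce stage sums the values per key. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and the second input record is invented for the example.

```python
from collections import defaultdict


def map_fn(line):
    """Emit ((author, year), 1) for every author of one paper record."""
    authors_field, _title, _conference, year = line.strip().split("|")
    for author in authors_field.split(","):
        author = author.strip()
        if author:
            yield (author, year), 1


def reduce_fn(key, counts):
    """Sum the partial counts for one (author, year) key."""
    return key, sum(counts)


def run_job(lines):
    """Simulate shuffle-and-sort locally: group map output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


records = [
    "D Zhang, Daniel H, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010",
    "D Zhang, J Lu|A Second Hypothetical Paper|SIGIR|2010",  # invented record
]
result = run_job(records)
print(result)
```

Only (author, year) pairs that occur in the input ever become keys, so every reported count is larger than 0 by construction. Because summation is associative and commutative, the same summing logic could also serve as a combiner to reduce the data shuffled between the map and reduce stages.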
• Part C (45 marks): Big Data Project Analysis
The CropY company is a leading provider of precision agriculture services. Precision
agriculture is the science of gathering, processing, and analysing temporal, spatial and
individual data. It combines this with other information to support management decisions
according to estimated variability for improved resource use efficiency, productivity,
quality, and profitability.
The CropY company now plans to develop a big data project to meet the following
requirements: helping users worldwide better understand the implications of the weather
and make contingency plans; buying supplies, such as fertilizer and seeds; maintaining
and monitoring the quality of yield, whether livestock or crops; knowing the
variety of cultivated plants, the conditions of their growth and their seed needs; choosing the
type of fertilizer and pesticides, and understanding their conditions of use and their
impact on climate, soil and plants; recognizing daily water needs for each kind of plant;
calculating the median and mean values of yield; studying the conditions of the natural
environment; estimating the financial revenue and managing the potential risks.
o Task C.1 [mark 10]: The volume of big data is expected to be more than 500
Petabytes. The data will come from various sensors, satellites, drones, social
media, market data, online news feeds, etc. Figure 1 below shows some
example data of the CropY company. Some IT technicians plan to build a data
warehouse to store data for further data analysis tasks, but others believe
a data lake is a better choice. Which choice do you prefer? Please justify your
choice. (Word count: 300 words ±10%)
Figure 1. Example Data of CropY Company
o Task C.2 [mark 10]: The data of the CropY company includes a large collection of
plants, crops, diseases, symptoms, pests, and relationships between them. The
CropY company needs to build an analytical data store which can facilitate queries
like: “find all diseases which are directly or indirectly caused by nitrogen
deficiency”. Please recommend a data store and justify your choice. (Word
count: 300 words ±10%)
o Task C.3 [mark 15]: Some prediction and analytics services provided by the
CropY company must respond within a few seconds of the arrival of new
data. Namely, they are real time or near real time prediction and analytics tasks.
Some IT managers suggested using a popular distributed processing framework,
MapReduce, to implement these tasks. Do you agree with that? Or do you have a
different suggestion? Please justify your choice. (Word count: 300 words ±10%)
o Task C.4 [mark 10]: The CropY company decided to move most of its applications and
services to the cloud. These applications and services need to be highly available,
scalable, and accessible worldwide. Note that some data, such as price and
customer data, are confidential. Please design a cloud hosting strategy for this big
data project and explain how your design will meet the security, scalability, and high
availability requirements. (Word count: 300 words ±10%)
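For illustration only, the example query in Task C.2 (“directly or indirectly caused by”) can be read as a reachability question over a graph of cause relationships. The Python sketch below shows that reading with a breadth-first traversal. The edge data, the placeholder disease names, and the function name `caused_by` are all invented assumptions; the sketch does not prescribe any particular data store.

```python
from collections import deque

# Hypothetical "causes" edges with placeholder names, invented for illustration.
CAUSES = {
    "nitrogen deficiency": ["disease A", "disease B"],
    "disease A": ["disease C"],
    "disease C": ["disease D"],
    "potassium deficiency": ["disease E"],
}


def caused_by(graph, start):
    """Return every node reachable from `start` via one or more edges (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


effects = caused_by(CAUSES, "nitrogen deficiency")
print(sorted(effects))
```

Here "disease A" and "disease B" are direct effects, while "disease C" and "disease D" are reached only through intermediate nodes, which is the "indirectly caused" part of the query.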
Grading Criteria
1. Part A (Mark 25)

Task A.1 (10) Explain the main characteristics of Big Data	achieved well	partially achieved	poorly/not achieved
Clarity of explanation	/8
Citations/references	/2
Task A.2 (15) Compare Hadoop and Relational Database Systems.	achieved well	partially achieved	poorly/not achieved
Comparison of Hadoop and RDBMS	/6
The scenario described	/6
Citations/references	/3

2. Part B (Mark 30) MapReduce Algorithm Design

Factors	Mark	Overall mark
Description of the functionality of the main functions (such as Map and Reduce) and of the input/output key-value pairs of those functions (15)	The description of main functions is correct, complete and in-depth	11 – 15
The description of main functions has some minor error(s) or is not clear enough	6 – 10
The description of main functions is poor with many errors	2 – 5
The description of main functions has nothing to do with the MapReduce model or is totally wrong	0 – 1
Algorithm design (15)	Algorithm is designed correctly	11 – 15
Algorithm designed is basically correct (with some minor errors)	6 – 10
Algorithm designed is NOT correct (with major errors)	0 – 5

3. Part C (Mark 45) Big Data Project Analysis
Task C.1 (10) Suggestion of storage tool for big data analysis	achieved well	partially achieved	poorly/not achieved
Correctness of choice	/4
Clarity of explanation	/4
Citations/references	/2
Task C.2 (10) Analysis of analytical data store	achieved well	partially achieved	poorly/not achieved
Correctness of choice	/4
Clarity of explanation	/4
Citations/references	/2
Task C.3 (15) Suggestions of tools for real time or near real time prediction tasks	achieved well	partially achieved	poorly/not achieved
Correctness of suggestion	/7
Clarity of explanation	/5
Citations/references	/3
Task C.4 (10) Design of cloud hosting strategy for big data project	achieved well	partially achieved	poorly/not achieved
Correctness of choice	/4
Clarity of explanation	/4
Citations/references	/2

Overall Feedback
Grade 80-100% Exceptional
Grade 70-79% Excellent
Grade 60-69% Good
Grade 50-59% Satisfactory
Grade <50% Fail