Our Journey
Module 1 > Module 2 > Module 3
■ Business Intelligence
■ Visual and Predictive Analytics
■ Social Media Analytics
■ Introduction to Big Data
■ Big Data in Business
■ Big Data Value Creation
This module introduces you to Information Systems, Data Warehousing and Cloud Technologies.
It further explores how businesses manage and harness large databases and process them to
discover dynamic and real-time information.
■ Information Systems
■ Data Warehouse and Cloud
■ Processing Platforms for Data Discovery
1 Computing
2 Data and IS
3 Mobile IS
This topic demonstrates the pervasive power of ICT
and mobile applications, along with cloud storage
integration that renders it necessary to examine large
population samples in real time.
Internet of Things (IoT) and mobile computing are
closely examined.
Topic Overview
4 IoT
You can have data without information,
but you cannot have information without
data.
Daniel Moran, programmer and author
Computing
Foundations of Computing
The invention and subsequent proliferation of computers is said to have had the single biggest impact of any invention since
the harnessing of electricity. Almost every aspect of our lives is now mediated in some way by computers (and, by extension,
ICTs). However, we cannot forget the age-old mantra, “Computers are no smarter than the humans who program them”. This is
especially pertinent for the focus of this course: the role of using this data and learning from insights.
The four main components of a typical computing device are:
Input Devices
Input devices allow us to enter information into the computer system. They include a plethora of sources such as: keyboards,
sensors, microphones, cameras.
Processing Devices
Processing components manipulate the information once it has been input into the computer. There are usually a set of common
components consisting of the central processing unit (CPU), interface components, and memory (RAM).
Storage Devices
Storage devices store all of this entered information and various programs for future use. Common storage devices include hard
disks and drives, CDs and other portable storage devices, as well as now cloud based servers.
Output Devices
Output devices are how the manipulated information is returned to us as users of a computer. They commonly include screens
and monitors, printers, speakers and other physical peripherals. They can also include sensors and haptic feedback devices, such
as vibrations and sounds.
Computing
Measuring Data
Data is measured in basic units that build up from a bit. A bit is represented by either a 1
(electricity flowing) or a 0 (no electricity flowing). This is called binary code. The code converts images,
text, and sounds into numbers in order to send information from one digital device, such as a
computer, to another.
Computers use binary numbers because they are easier for hardware to handle. In binary, the digit
places are worth 1, 2, 4, 8, and so on, rather than units, tens, and hundreds. A byte is a unit of measure
made up of 8 bits. In ordinary decimal numbers, “1001” means one thousand and one. But in binary, “1001” is one 1, no 2, no 4,
and one 8, which equals 9.
The American Standard Code for
Information Interchange (ASCII),
uses a 7-bit binary code to
represent text and other
characters within computers,
communications equipment, and
other devices. Each letter or
symbol is assigned a number from
0 to 127.
For example, lowercase “a” is
represented by 1100001 as a bit
string (which is 97 in decimal).
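These binary and ASCII conventions can be verified directly. The following Python snippet (added here purely as an illustration; it is not part of the original slides) checks the place-value arithmetic for the bit string "1001" and the ASCII code for lowercase “a”:

```python
# Place values in binary run 1, 2, 4, 8, ... from right to left,
# so the bit string "1001" means one 8, no 4, no 2, and one 1 = 9.
bits = "1001"
value = sum(int(b) << place for place, b in enumerate(reversed(bits)))
print(value)  # 9

# A byte is 8 bits, so it can represent 2**8 = 256 distinct values.
print(2 ** 8)  # 256

# ASCII assigns lowercase "a" the number 97, i.e. 1100001 in binary.
print(ord("a"))               # 97
print(format(ord("a"), "b"))  # 1100001
```

Python's built-in `int("1001", 2)` performs the same conversion in one call; the explicit sum above simply mirrors the place-value explanation in the text.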
Binary data is considered the native data/language of a
computer and it interacts with the lowest abstraction layer of
its hardware. This type of data is produced whenever a
process is performed on a computer. The application
requesting the process sends instructions in a high-level
language that is ultimately converted into binary data to be
executed or sent to the processor. All processes, regardless of
their type, are converted into binary form before execution.
Computing
Data Processing
Now that we have an understanding of exactly what data is, we need to think about how it is processed: how
do computers, and the users that operate them, collect and store data sets? The two types most relevant to our
understanding of analytics are batch and real-time (RT) processing.
Batch Processing
The collection and storage of data for
processing at a scheduled time, when a
sufficient amount of data has been
accumulated.
■ Data is collected and processed together
(multiple transactions)
■ This can be done without human
intervention
■ However, there may be a long time delay
For example: Generation of bills, credit card
transactions
Real-Time Processing
The immediate processing of data after the
transaction occurs, with the database being
updated at the time of the event.
■ Data is collected and processed instantly
■ The act of data processing here can be
repetitive
■ Because of the immediacy, it can stop double-booking events from occurring
■ Always up-to-date systems
For example: Guidance systems in transport, ticketing sales and reservations, POS (point-of-sale) terminals
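The contrast between the two modes can be sketched in a few lines of Python (a hedged illustration only; the transaction data, function names and in-memory "database" below are invented, and real batch or streaming systems are far more involved):

```python
transactions = []  # hypothetical in-memory store for the batch approach

def record_transaction(txn):
    """Batch processing: only collect now; processing happens later on a schedule."""
    transactions.append(txn)

def run_batch_billing():
    """Process everything accumulated so far in one pass (e.g. a nightly billing run)."""
    total = sum(t["amount"] for t in transactions)
    transactions.clear()
    return total

def process_realtime(txn, balances):
    """Real-time processing: update state the moment the event occurs."""
    balances[txn["account"]] = balances.get(txn["account"], 0.0) + txn["amount"]
    return balances[txn["account"]]

# Batch: there is a delay between collection and processing.
record_transaction({"account": "A", "amount": 10.0})
record_transaction({"account": "B", "amount": 5.0})
print(run_batch_billing())  # 15.0

# Real-time: the "database" (here a dict) is always up to date.
balances = {}
print(process_realtime({"account": "A", "amount": 7.0}, balances))  # 7.0
```

Note how the batch path is simple but stale between runs, while the real-time path keeps state current at the cost of doing work on every single event, exactly the trade-off the bullets above describe.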
Computing
From Data to Systems
When large amounts of data are written and processed together, information is created. When we talk
about information being used functionally (i.e. information being computed for a purpose), we talk about
systems. A system is a functional unit which involves a set of procedures/functions to produce certain
outputs by processing given data and information (as input).
More specifically, Information Systems is the study of the complementary networks of hardware and software that people and
organisations use to collect, filter, process, create and distribute data. A computer Information System (IS) is
a system composed of people and computers that processes or interprets data.
As such, information systems inter-relate with data systems on the one hand and activity systems on the
other. An information system is a form of communication system in which data represent and are processed
as a form of social memory.
An information system can also be considered a semi-formal language which supports human decision
making and action – extremely important when considering the applications of data and analytics.
Information systems are the primary focus of study for organisational informatics.
Computing
The Traditional IT Systems Approach
Traditionally, in terms of how IT systems are developed and maintained, the requirements are defined, followed by solution
design. After this, the system is built, the queries (tasks) are executed. If there are new requirements or queries then the entire
system is redesigned and rebuilt.
Simplilearn, 2014
Consider the following system
model on the left.
Define three key issues with this from
an operational or managerial
perspective.
What could go wrong with an IT
system designed in this way?
How does this conflict with our Big
Data Value Creation Process model
from the last module?
Computing
The Problem With Conventional Computing
■ Limited storage capacity: We can no longer rely on historically archived data. Such data can almost be
seen as “data debt”: it is often not fit for purpose with today’s systems and cannot be integrated and
processed, due to ineffective methods of data collection in the past.
■ Limited processing capacity: Previously we could only process data at limited speeds. Currently, we are
operating at processing abilities exponentially beyond what previous systems could manage.
■ No scalability: Data storage must be scalable. Requirements change and increase day by day. Storage must
be able to scale itself to new volumes and new data requirements.
We can see these traditional models of IT systems as well as computing processes as being in conflict with
some of the major trends in the data and informatics area today. Some of the key issues surrounding these
legacy systems include:
Computing
The Problem With Conventional Computing
■ Sequential processing: Processing data sequentially on a single machine is far too slow, yet older
systems are often designed around consolidated, internally-focused architectures rather than distributed ones.
■ Data types: Older systems can handle structured data well, yet most data generated today is unstructured. We
will cover these different types shortly.
■ Requires pre-processing of data: Ideally, data is stored in raw form (the richest form of data). Due to the
sheer size of these raw volumes, older systems require data to be pre-processed before it can be stored
efficiently, and this has sometimes not been done for legacy data sets.
Computing
The Problem With Conventional Computing
…Which brings us to the issues we
have been summarising thus far.
The Big Data problem should now
be more apparent when we look at
the finer aspects of computing and
how data is processed by these
systems.
This problem is best summarised
in the diagram.
Hadoop Summit, 2012
In its raw form, oil has little value.
Once processed and refined, it
helps power the world.
Ann Winblad, Entrepreneur
Data and IS
The Same Can Be Said About Data
As we know by now, data is everywhere. And it is very valuable. But how do we turn data into
information? Firstly, we need to think about what the difference is between the two.
■ Data are facts, events and transactions which have been recorded. They are essentially the raw
inputs which are further processed to become, finally, information.
■ When facts are filtered through one or more processes (human or system), and are ready to give
certain details, they become information (i.e. the data now has a function).
■ Information, then, is essentially processed data, presented in some useful and meaningful form.
Data                                  | Information
--------------------------------------|---------------------------------------------------
Raw facts                             | Processed facts
Dead, stored facts                    | Live, presented facts
Inactive (only exists in the backend) | Active (data being processed for a knowledge base)
Technology oriented                   | Business oriented
The Relationship Extended
Infogineering, 2009
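The data-versus-information distinction can be made concrete with a small Python sketch (the sales figures and region names below are invented for illustration): the recorded transactions are data; the aggregated, presented result is information.

```python
# Data: raw recorded facts -- not yet useful for a decision on their own.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 200.0},
]

# Processing: filter and aggregate the raw facts.
totals = {}
for sale in sales:
    totals[sale["region"]] = totals.get(sale["region"], 0.0) + sale["amount"]

# Information: processed facts presented in a useful, meaningful form.
best = max(totals, key=totals.get)
print(f"Best-performing region: {best} ({totals[best]:.2f})")
# Best-performing region: North (320.00)
```

The raw list answers no business question by itself; only after processing does it support a decision, which is exactly the "technology oriented" versus "business oriented" split in the table.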
Data and IS
Recap: How Big Data is Different to Data as we Know it
■ Data is automatically collected: Sensor use is prevalent, use of smartphones
widespread – GPS tracking, and accelerometers for fitness apps.
■ Social media and web 2.0/3.0: Huge volumes generated, posted, and shared. 2.3
million pieces of content are shared per minute on Facebook alone.
■ Big data and the internet of things: Smart TVs, Smart fridges. Collaboration and
coordination of this data also occurs across sources. Various tools exist and are
emerging for interpreting these and how they fit together in our connected
universe.
■ Continuous stream: these devices are always on, always aware, and always
computing. E-commerce sites exploit this, as do search engines in terms of
prediction analytics and real-time adapting of search results.
■ Velocity: Nowadays, data can have different types of time-sensitivity. The ability to
handle both real-time and non-real time approaches to data processing is possible.
We know some of the ways the Big Data era is different to how we have conceptualised data in the past
Data and IS
At the rate data is now
being created, the amount
will double every two years.
By 2020, the world will
generate 50 times that
amount of data, according
to a 2011 study by IDC. The
sheer volume is enormous,
and a very large contributor
to the ever-expanding
digital universe is the
“Internet of Things,” with all
kinds of devices creating
data every second.
Velocity is the speed at
which data is created,
stored, analysed, and
visualised.
In the Big Data age,
data is created in real
time or near real time.
With the availability of
Internet connected
devices, wireless or
wired, machines can
pass on data the
moment it is created
In the past, all data was
structured data that fit
neatly into columns and
rows, but those days
are over. Now, 90% of
data is unstructured.
Data now comes in
many different formats,
including structured,
semi-structured,
unstructured, and even
complex structured.
Volume Velocity Variety
Van Rijmenam, M. (2014)
Data and IS
Today we will focus specifically on the types – or variety – of data to help us understand some of
the issues around computation, processing and storage. How we manage this data is crucial, so it is
important to understand how it can be structured!
Structured Data
Data represented in a
tabular format. It is stored
in relational databases with
well-defined and organised
sets of fields.
For example: Traditional
databases
Semi-Structured Data
This data does not have a
formal data model. Tags or
other types of markers are used
to identify certain elements
within the data, but there is no
rigid structure.
For example: XML files, log
files, web-visitation, call and
emails logs
Unstructured data
Data which does not have
a pre-defined data model.
For example: Audio, video,
images
Van Rijmenam, M. (2014)
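The three types above can be illustrated with standard-library Python (a minimal sketch; the sample records, log entries and byte string are invented):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: tabular rows with well-defined columns, as in a relational database.
csv_text = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])  # Ada

# Semi-structured: tags mark up elements, but there is no rigid tabular schema.
xml_text = "<log><entry level='info'>started</entry><entry level='error'>disk full</entry></log>"
root = ET.fromstring(xml_text)
errors = [entry.text for entry in root if entry.get("level") == "error"]
print(errors)  # ['disk full']

# Unstructured: raw bytes (audio, video, images) with no pre-defined data model;
# the byte stream below only becomes meaningful once an image decoder interprets it.
raw = b"\x89PNG..."
print(len(raw))  # 7
```

Note the progression: the CSV parser knows the columns in advance, the XML parser only knows how to find tags, and the raw bytes carry no self-describing structure at all.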
The relationship between these various types of data and information extraction can sometimes be
quite complex.
■ Structured formats have some limitations with respect to handling large quantities of data
■ It is difficult to integrate data distributed across multiple systems
■ Potentially valuable data, across all types, is often left dormant or discarded
■ It is sometimes too expensive to justify the integration of large volumes of unstructured data
■ A lot of information has a short, albeit useful, lifespan
■ Context always adds meaning to the existing information
According to IBM, 80% of data
captured today is unstructured.
This includes from sensors
used to gather climate
information, posts to social
media sites, digital pictures
and videos, purchase
transaction records and cell
phone GPS signals.
Data and IS
Internal Data Sources
Internal data accounts for everything your business currently has or could access. Examples include:
■ Customer feedback
■ Sales data
■ Employee or customer survey data
■ CCTV video data
■ Transactional data
■ Customer record data
■ Stock control data
■ HR data.
Again, like structured data, internal data isn’t considered very exciting. Most attention is focused on
external data, which yields easier-to-navigate information and inferences.
Big Data, data science and business analytics work with structured and unstructured data. But true
business insights emerge when we combine existing data sets with unstructured or semi-structured data
from both internal and external sources.
Data and IS
External Data Sources
External data is the infinite array of information that exists outside your business. Examples
include:
■ Weather data
■ Government data such as census data
■ Twitter data
■ Social media profile data
■ Google Trends or Google Maps.
It is really important to understand that no type of data is inherently better or more
valuable than any other type. The key is to start with a strategy and establish your
collection approach to guide you to the best structured, unstructured, internal or external
data direction.
Data and IS
What Does This Mean for Business?
Schmarzo, B. (2013)
As commercial-minded thinkers, it is integral to consider the ongoing organisational impact
here: what is the effect on the business bottom line? On ROI? On decision making? How do these different
data types, information systems and computational processes impact our monetisation efforts?
Schmarzo’s table below summarises these key themes and some of the factors impacting the
bottom line.
Mobile
Mobile Information Systems
Let us consider the mobile, web-based
information system. Today, we have
the ability to access the Internet (and
data) via mobile phones, tablets, iPads
and a plethora of other devices. This
requires new forms of connectivity and
information exchange, known as
peer-to-peer networks. Proximity-based
ad hoc exchange (Bluetooth-enabled)
and automatic discovery of
peers are possible.
For example, transmission of sensor
data or traffic data between moving
cars.
http://www.cyberdefensemagazine.com/wp-content/uploads/2015/03/s11.png
Mobile
Mobile Information Systems
New functionality is required, such as context awareness and location-based services.
For example, a restaurant may promote its menu based on your proximity, using your mobile device’s location.
Geographical Information Systems, including maps, are another example.
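A proximity check of this kind can be sketched with the standard haversine great-circle formula (the venue data, function names and 500 m radius below are invented for illustration; a production location service would use a geospatial index rather than a linear scan):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

def nearby_offers(user_lat, user_lon, venues, radius_km=0.5):
    """Return promotions for venues within radius_km of the user's location."""
    return [v["offer"] for v in venues
            if haversine_km(user_lat, user_lon, v["lat"], v["lon"]) <= radius_km]

venues = [
    {"lat": 51.5074, "lon": -0.1278, "offer": "2-for-1 pizza"},  # central London
    {"lat": 48.8566, "lon": 2.3522, "offer": "free coffee"},     # Paris: too far
]
print(nearby_offers(51.5074, -0.1278, venues))  # ['2-for-1 pizza']
```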
Watch the video on Future of Shopping from Microsoft. https://vimeo.com/14854760
IoT
Internet of Things
Watch the video for a simple explanation of IoT: “IoT made simple”.
Let’s start with understanding a few things that are fundamental.
Broadband Internet is becoming more widely available, the cost of connecting is decreasing, more
devices are being created with Wi-Fi capabilities and built-in sensors, technology costs are
going down, and smartphone penetration is sky-rocketing. All of these things are creating a “perfect
storm” for the IoT.
IoT
So What is the IoT?
The IoT is basically the concept of connecting any device with an on/off switch to the Internet
(and/or to each other). This includes everything from cellphones, coffee makers, washing machines,
headphones, lamps and wearable devices to almost anything else you can think of. It also applies
to components of machines, for example the jet engine of an airplane or the drill of an oil rig.
If it has an on/off switch, then chances are it can be a part of the IoT. The analyst firm Gartner
says that by 2020 there will be over 26 billion connected devices (some even estimate this
number to be much higher, at over 100 billion). The IoT is a giant network of connected “things”
(which also includes people). The relationships will be between people and people, people and things,
and things and things.
And how does it affect businesses? Where are the opportunities? See the next slide.
IoT
http://www.libelium.com/wp-content/themes/libelium/images/content/applications/libelium_smart_world_infographic_big.png
Exercise OPOWER
IoT
Opower is a leading energy management company that combines a cloud-based platform, big data, and
behavioral science to help utilities around the world reduce energy consumption and improve their relationship
with customers. OPower, in partnership with 93 utilities, helps over 32 million consumer households to lower their
energy use and costs and significantly reduce carbon emissions. Working with smart meter, thermostat and other
device data from Pacific Gas and Electric, OPower gathers over 7 million data points each day and provides
analytic reports to utility companies.
These reports are included in household bills to encourage consumers to conserve
energy by comparing their household energy usage to their neighbors. OPower found that their MySQL database
infrastructure could not analyze the data quickly enough, and much of the data they were gathering was not being
fully used. In fact, while OPower had over 60 instances of MySQL, they still couldn’t run analyses across the
breadth of their data.
To address this issue, OPower created an “Energy Data Hub” by migrating their data infrastructure to distributed
systems and using advanced analytics. Their data scientists benefited from self-service, end-user tools that data
engineers and product managers could use to access and analyze data in over 200 tables.
Today, OPower product managers use analytics to answer client questions directly – without IT assistance. They
can run analyses – for example, using consumer thermostat data to understand patterns of energy usage. The
end result? OPower has dramatically lowered the time required to access data for analytics and empowered
product managers with insights they used to help clients reduce energy consumption by $500 million and CO2
output by 7 billion pounds.
The case study details
OPower’s efforts in giving
customers personalised
energy usage insights and
recommendations.
Think about the following:
1. What types of data were
being collected?
2. What was the source of
this data?
3. Do you think there was
an emphasis on internal
or external data? How
were these integrated?
4. Explain the benefits of
an IT system design
such as this in relation
to Big Data.
5. What was the impact on
monetisation here; what
were the commercial
outcomes?
Adapted from Datameer (2015)
Check it out here:
https://www.cloudera.com/more/customers/opower.html
Summary
■ Computation – We went back to the origins of data and looked at binary code and how it
operates in terms of basic computation. We thought about how all of these 0s and 1s are
processed: batch versus real-time and what this means for application and system design
and development.
■ Information Systems – We looked at the evolution of IT systems, from traditional/conventional
models which have become unsuitable for the Big Data challenges of today. We also
explored how various systems are designed and operate around data flows in an
organisation, and the beginnings of how this can impact the design of a Big Data strategy.
■ Types of Data – We thought about the key differences between data and information, and
how information reveals a different level of function and detail beyond what simple data
can provide. We also took a deeper dive into the variety of data (one of the V’s), as well
as sources of data, both internal and external. We reflected on how an understanding of
this leads to monetisation… thinking about commercial applications of clever data usage.
■ Mobile IS and IoT: We explored Mobile information systems and IoT and the opportunities
created by them.
References
Datameer (2015) “Top Five High Impact Use Cases For Big Data Analytics”, accessed online at:
http://www.datameer.com/pdf/eBook-Top-Five-High-Impact-UseCases-for-Big-Data-Analytics.pdf [January, 2016]
Hadoop Summit (2012) “Combining Hadoop RDBMS for Large-Scale Big Data Analytics”, accessed online at:
http://www.slideshare.net/Hadoop_Summit/combining-hadoop-rdbms-for-largesacale-big-dataanalytics [January, 2016]
Infogineering (2009) “The Infogineering Model”, Image: http://www.infogineering.net/wp-content/uploads/2009/08/model.jpg [January,
2016]
Minelli, M., Chambers, M., & Dhiraj, A. (2012). Big data, big analytics: emerging business intelligence and analytic trends for today’s
businesses. John Wiley & Sons.
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.
Simplilearn (2014) “What is Big Data?”, accessed online at: https://www.youtube.com/watch?v=CKLzDWMsQGM [January, 2016]
Morgan, J. (2014) “Internet of Things”, Forbes, accessed online at:
http://www.forbes.com/sites/jacobmorgan/2014/05/13/simple-explanation-internet-things-that-anyone-can-understand/#64e04f168284
Extra Resources
■ TED Talks, Fei-Fei-Li, How We Teach Computers to Understand Pictures: https://www.youtube.com/watch?v=40riCqvRoMs [January, 2016]
○ Think about this closely in terms of how unstructured data is recorded, processed and stored. Also consider the future of computing and data systems that are
so visually based.
■ TED Talks, Kevin Slavin, How Algorithms Shape Our World https://www.youtube.com/watch?v=ENWVRcMGDoU [January, 2016]
○ Another TED talk, this time on the power of algorithms. This one is a bit more esoteric, but still interesting on how computer data systems are infiltrating
other aspects of society.
Light, easy and fun
For the academically-curious
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H. A. (2013, June). BigBench: towards an industry standard benchmark for big
data analytics. In Proceedings of the 2013 ACM SIGMOD international conference on Management of data (pp. 1197-1208). ACM. Accessed online at:
http://www.msrg.org/publications/pdf_files/2013/Ghazal13-BigBench:_Towards_an_Industry_Standa.pdf [January, 2016]
Ignore the sales-y pitch in this paper regarding the use of BigBench; focus instead on how the data was mined/generated, processed and then
turned into actionable insights for web visitation and click data on an e-tailing site. Also of interest is the way unstructured, structured and
semi-structured data work together in their model on p. 1198.
An overview of BigBench can be found here for the diehards among us:
http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/