Visual and Predictive Analysis


Big Data for Business Management
Module 3: Applications
3.2 – Visual and Predictive Analysis
Our Journey

Module 1 (Milieu): Introduction to Big Data; Big Data in Business; Big Data Value Creation
Module 2 (Infrastructure): Information Systems; Data Warehouse and Cloud; Processing Platforms for Data Discovery
Module 3 (Applications): Business Intelligence; Visual and Predictive Analytics; Social Media Analytics

This module introduces you to the concept of Business Intelligence and analytics. It takes a closer look at the collection and use of data to inform business decisions.

Topic Overview

1. Data Reduction
2. Predictive Analysis
3. Data Visualisation

This topic introduces methods of data preparation, information visualisation and prediction. It introduces predictive analytics, which builds upon the data mining and modelling covered in the previous topic, and examines clustering and association methods, which draw inferences through pattern discovery in order to predict future behaviour and inform future practice.

It will also acquaint you with the different methods that can be used to accelerate collaboration between business users and the data science team, as well as the role that visualisation can play in the interpretation of data by enabling discussion and comprehension.
What is Data Reduction?

We have been considering Big Data, i.e. data that grows bigger daily. In this topic we begin our discussion with data reduction. Why? It helps you simplify the data by removing unnecessary elements and focus on certain variables, or on constructs made from those variables, that carry meaning and have predictive value. What does that mean?

In simple terms, data reduction is how empirical digital information (which is growing daily) is transformed into an orderly format. The concept is about reducing the data into meaningful parts. It typically involves sorting, collating, editing, scaling, coding and, effectively, creating categorised tables. Consider this example from astronomy (Source: Wikipedia, 2016). NASA's Kepler satellite records a 95-megapixel image once every six seconds, generating tens of megabytes of data per second: an order of magnitude more than its downlink bandwidth of 550 kilobytes per second. The on-board data reduction includes co-adding the raw frames for 30 minutes, reducing the bandwidth by a factor of 300. Furthermore, interesting targets are pre-selected and only the relevant pixels are processed, which is 6% of the total. The reduced data is then sent to Earth for further processing.

So what does this mean or imply for business managers?
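The scale of the Kepler reduction can be sanity-checked with quick back-of-envelope arithmetic. The frame size, cadence, co-adding window and 6% pixel fraction come from the example above; the 2-bytes-per-pixel figure is an assumption for illustration:

```python
# Rough arithmetic for the Kepler example (2 bytes per pixel is an assumption)
frame_pixels = 95_000_000            # one 95-megapixel image
bytes_per_pixel = 2                  # assumed 16-bit readout
raw_rate = frame_pixels * bytes_per_pixel / 6      # one frame every six seconds
print(f"raw: {raw_rate / 1e6:.1f} MB/s")           # tens of MB/s

# Co-adding frames for 30 minutes (300 frames) divides the rate by 300;
# keeping only ~6% of the pixels reduces it further
reduced_rate = raw_rate / 300 * 0.06
print(f"reduced: {reduced_rate / 1e3:.1f} kB/s")   # well under the 550 kB/s downlink
```

Even with generous assumptions, the reduced rate fits comfortably inside the downlink budget, which is the whole point of reducing at the source.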
Data Reduction
Why Data Reduction?
Data reduction can help businesses create extra capacity within the existing environment. Specifically, it helps a business manage continuously expanding data. It may not reduce storage costs directly, but it can decrease the amount of storage capacity required in a Storage Area Network (SAN) environment within businesses of any size.

A Storage Area Network provides access to consolidated, block-level data storage. SANs are mainly used to make storage devices, such as tape libraries and optical jukeboxes, accessible to servers; to the operating system (O/S), these devices appear as locally attached. A SAN has its own network of storage devices, not usually accessible through a local area network by other devices.

We live in a world of ever more BYO devices that plug into the network, and storage needs grow as data is generated continuously. Therefore, business managers need to strategise for data reduction.

There are three strategies that businesses use for data reduction.
Strategies for Data Reduction: Pros and Cons
Source: http://www.reliant-technology.com/storage_blog/pros-cons-data-reduction-strategies/
The first strategy is known as thin provisioning. It eliminates the reservation of unwritten blocks of storage, making more efficient use of existing storage capacity. With low impact on operations, businesses can save up to 30%, and vendors are available for businesses of all sizes. However, under this strategy the existing written data cannot be altered to make provision for a larger amount of space.
The second strategy is data deduplication. It identifies repeated data patterns and reduces them to a single instance, helping to save capacity in the storage environment. As it reduces repeat patterns to a single physical copy, storage savings can range from 2:1 to 10:1, depending on the industry sector and the data. However, the process may require more capacity to function accurately: deduplication needs extra space to hold the newly processed data temporarily before it can be deduplicated, which may reduce the amount of space actually saved in storage.
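The single-physical-copy idea behind deduplication can be sketched in a few lines of Python. The content-hash fingerprinting and the toy blocks below are illustrative choices, not a description of any particular vendor's product:

```python
import hashlib

def deduplicate(blocks):
    """Keep one physical copy per unique block; reference repeats by hash."""
    store = {}    # content hash -> the single physical copy
    index = []    # logical order of the data, as a list of hashes
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        index.append(digest)
    return store, index

blocks = [b"header", b"payload", b"payload", b"payload", b"footer"]
store, index = deduplicate(blocks)
print(len(index), "logical blocks stored as", len(store), "physical blocks")
# reconstruction: b"".join(store[h] for h in index) recovers the original stream
```

Five logical blocks are stored as three physical ones; the index of hashes is the extra bookkeeping the text mentions, which itself consumes some space.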
The third strategy is compression, a technique that has been tried and tested for almost 25 years. IBM tape drives used it two decades ago, finding repeat patterns of similar information that could be replaced with a streamlined data structure. Space savings vary depending on the specific types of data. However, the technique introduces latency before and after data transfer: files must be 're-hydrated' back to their original form before they can be accessed, and compression algorithms can slow down the writing process.

One important class of techniques is lossless compression: data compression algorithms that allow the original data to be reconstructed exactly from the compressed data. This has many applications. For example, you may be familiar with FLAC and ALAC audio files, which contain the same data as a WAV file. WAV takes up much more storage, while FLAC and ALAC use lossless compression to create smaller files that take less space.
Read more on this here:
http://www.howtogeek.com/142174/what-lossless-file-formats-are-why-you-shouldnt-convert-lossy-to-lossless/
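A minimal round-trip with Python's standard zlib module illustrates the lossless guarantee: the restored bytes are identical to the originals, while repetitive data shrinks dramatically:

```python
import zlib

# Repetitive data compresses well; 'lossless' means exact reconstruction
original = b"pattern " * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

assert restored == original           # nothing was lost
print(len(original), "->", len(compressed), "bytes")
```

The compress/decompress cycle is also where the latency mentioned above comes from: the data must pass through these steps before it can be written or read.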
Data Dimensionality and Reduction

A business may be interested in certain attributes of data, such as customers, products, time and stores. These are dimensions of the data. For the data warehouse user (or the person who uses data to make predictions), these dimensions are entry points to the quantitative facts, such as sales, profit and revenue, that a business wishes to monitor. For example, a business may want to know about a specific style of jeans that was sold in a Melbourne store last month.

Data dimensions can be hierarchical. For example, to aggregate, summarise and present the data with reduced dimensions, days can be grouped into months, or quarters can be grouped into calendar years.

[Source: http://data-warehouses.net/glossary/datadimensions.html]
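A hierarchy roll-up of this kind is, at its core, a grouping operation. A minimal sketch (with invented sales figures) that groups daily records up to the month level:

```python
from collections import defaultdict

# Invented daily sales records: (date string, amount)
daily_sales = [
    ("2016-01-05", 120.0), ("2016-01-19", 80.0),
    ("2016-02-02", 200.0), ("2016-02-28", 50.0),
]

monthly = defaultdict(float)
for day, amount in daily_sales:
    monthly[day[:7]] += amount        # "YYYY-MM": roll days up to months

print(dict(monthly))   # {'2016-01': 200.0, '2016-02': 250.0}
```

The same pattern extends up the hierarchy: truncating to "YYYY" would roll months up to calendar years.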
Dimensional data is usually stored in column format, so as the number of columns grows, the dimensionality of the data also rises. This becomes cumbersome, and the dimensionality therefore needs to be reduced. Please read this article to understand seven techniques for reducing data dimensionality:
https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
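The linked paper covers seven techniques; one of the best known is principal component analysis (PCA), which projects many columns onto a few directions that capture most of the variance. A sketch using NumPy on synthetic data (five columns generated from two hidden factors, so two components should suffice):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 rows, 5 columns, but most variance lies along two hidden directions
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 5))

Xc = X - X.mean(axis=0)                      # centre each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()        # variance share per component
print(np.round(explained, 3))                # the first two dominate

X_reduced = Xc @ Vt[:2].T                    # 5 columns reduced to 2 components
print(X_reduced.shape)                       # (200, 2)
```

The reduced table keeps almost all of the original variation while carrying fewer columns, which is exactly the trade-off dimensionality reduction offers.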
Typical Analysis using Data

There are many typical analyses that use data. Here, however, we are focused on predictive analytics and visualisation using the 'reduced and aggregated' data and the techniques illustrated above.

Source: https://www.citrix.com.au/articles/bytemobile-insight-big-money-from-big-data.html
What is Predictive Analysis?

The distinction between explanatory and predictive analyses is well known and rather important. Explanatory analyses focus on the 'Why?' question: a researcher aims to know why specific phenomena happen.

Predictive Analysis (PA) focuses on forecasting marketing and customer metrics. Market forecasting typically focuses on sales forecasting for new products and forecasting market share. At the customer level, this may involve forecasting responses to marketing actions, such as direct mail and email campaigns, but also forecasting churn, product returns, and lifetime value.

PA can be done rather simply, perhaps by using a kind of weighted average to predict future sales, or in a more sophisticated fashion by using forecast models, such as time-series models or choice models.
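The simple end of that spectrum, a weighted average of recent observations, can be sketched in a few lines (the sales figures and weights below are invented for illustration):

```python
def weighted_forecast(history, weights):
    """Predict the next value as a weighted average of recent observations;
    weights are given most-recent-first and must sum to 1."""
    assert len(history) >= len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    recent = history[-len(weights):][::-1]        # newest observation first
    return sum(w * x for w, x in zip(weights, recent))

monthly_sales = [100, 110, 120, 130]              # oldest -> newest
forecast = weighted_forecast(monthly_sales, [0.5, 0.3, 0.2])
print(forecast)   # 0.5*130 + 0.3*120 + 0.2*110 = 123.0
```

Giving the heaviest weight to the most recent month makes the forecast respond quickly to trend changes; time-series and choice models refine this basic idea considerably.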
We will now look at a brief introduction to two of the most common methods:
Cluster Analysis
Association Analysis
Predictive Analysis
Algorithms and Models
The following analytic algorithms start to move the data scientist beyond the data exploration stage into the more
predictive stages of the analysis process. These analytic algorithms are more actionable, allowing the data scientist to
quantify cause and effect and provide the foundation to predict what is likely to happen and recommend specific
actions to take.
Cluster Analysis

Cluster analysis, or clustering, is the exercise of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). Cluster analysis is used to uncover insights about how customers and/or products fall into natural groupings in order to drive specific actions or recommendations, such as target marketing or maintenance scheduling. Clustering can uncover potentially actionable insights across massive volumes of customer and product transactions and events, revealing groups of customers and products that share common behavioural tendencies.

Schmarzo, B. (2015)
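A minimal sketch of the clustering idea, using a hand-rolled k-means on synthetic customer data. The two segments, the features (spend and visit frequency) and all parameters are invented for illustration; production work would use a tested library implementation:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Bare-bones k-means: assign points to nearest centre, recompute centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance from every point to every centre, then nearest-centre labels
        labels = np.argmin(((X[:, None, :] - centres) ** 2).sum(axis=2), axis=1)
        # move each centre to the mean of its points (keep it if cluster is empty)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

# Two invented customer segments: (annual spend in $, store visits per month)
rng = np.random.default_rng(1)
low = rng.normal([200, 2], [30, 0.5], size=(50, 2))
high = rng.normal([900, 8], [30, 0.5], size=(50, 2))
X = np.vstack([low, high])

labels, centres = kmeans(X, k=2)
print(np.round(centres))   # one centre near (200, 2), the other near (900, 8)
```

The recovered centres describe the "natural groupings" in the text: a low-spend, infrequent segment and a high-spend, frequent one, each of which could be targeted differently.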
Association Analysis

Association analysis is a popular approach for discovering and quantifying relationships between variables in large databases. It shows customer or product events or activities that tend to happen together, which makes this type of analysis very actionable.

For example, the association rule {eggs, flour} → {butter} found in the point-of-sale data of a supermarket would indicate that if a customer buys eggs and flour together, they are likely to also buy butter.

Such information can be used as the basis for making pricing, product placement, promotion, and other marketing decisions. One very actionable data science technique is to cluster the resulting association rules into common groups or segments.

Schmarzo, B. (2015)
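The two standard measures behind such rules, support and confidence, can be computed directly from a handful of baskets (the baskets below are invented for illustration):

```python
baskets = [
    {"eggs", "flour", "butter"},
    {"eggs", "flour", "butter", "milk"},
    {"eggs", "milk"},
    {"flour", "butter"},
    {"eggs", "flour"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent, consequent = {"eggs", "flour"}, {"butter"}
conf = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, confidence={conf:.2f}")
# support=0.40, confidence=0.67
```

Here the rule {eggs, flour} → {butter} holds in two of three qualifying baskets, so a promotion on butter placed near the baking aisle has reasonable evidence behind it.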
Visualising – How to Make it Insightful?
Data Visualisation
Schmarzo, B. (2015)
Visualising data is a technique to facilitate the identification of patterns in data and to present data in a more consumable form. Techniques such as charts, graphs, and dashboards have been used for decades to synthesise data into a cohesive and comprehensible format for business analysts, managers, and executives. These techniques have been used to differentiate the contexts of the data being visualised in order to:

Describe: Attempting to explain the thing being described, for basic comprehension.
Report: Summarising findings from the past as of a point in time.

However, as we move into the next stage of visualisation, we move beyond these initial intents into the realm of:

Observation: Viewing data to identify significance or patterns which unfold over a period of time.
Discovery: Interacting with data to explore and understand relationships between data.

Visualisation also has benefits over static data presentation through enabling dialogue: the demonstration of other perspectives and relationships. In a business setting this abstraction can promote discussion, as well as create shared meaning.

However, the sheer volume of data does not always suit visualisation methods: millions of data points in a single image, for example, are not always simple to understand and comprehend. New techniques and tools are emerging that use exciting new visualisations and animations to visually depict a story about data that far exceeds the standard charts, graphs, and dashboards. Data artisans (Fast Company, 2013) can create these new and dynamic visualisations, with skills that transcend science, design and art.

Visualisation Attributes

Minelli, M., Chambers, M., & Dhiraj, A. (2012)
As visualisation becomes more commonplace, data artisans are using many different dimensions to represent and/or evaluate data. Some of these include:

Spatial/geospatial: position, direction, velocity
Temporal, periodicity: state, cycle, phase
Scale, granularity: weight, size, count
Relativity, proximity
Value, priority
Resources: energy, temperature, matter

A static visual representation can never address multiple dimensions effectively, nor can it effectively show change over time; a series of static representations can only approximate change through periodic snapshots. But looking at and observing data through visualisation, even complex animation, is not the same as interacting with it to uncover deeper meaning: it takes effort to traverse and explore the data to uncover these various dimensions. Interactive visualisations can therefore assist our ability to explore and engage for discovery, interpretation, and deeper understanding.

Visualisation refers not only to a set of graphical images but also to the iterative process of visual thinking and interaction with the images. An interactive visualisation environment, in which the user may choose to display the data in many different ways, encourages data exploration. One goal of data exploration is the recognition of patterns and the abstraction of structure and meaning from data.
Have a look at how a simple (but highly effective) visualisation of wind patterns across the United States changes how we interpret these weather patterns: http://hint.fm/wind/

Think about how the visualisation is accessible, and how its presentation enables understanding and engagement.

Visualisation Essentials

Schmarzo, B. (2015)

When Big Data is visualised, there is a common DNA that must be adhered to if data exploration and comprehension are to be encouraged. All visualisations must ensure:

They are rhetorical acts
This means that they ask deep, value-focused questions. What really matters here? What does the organisation (and its customers) really care about? How can we describe our world?

They are abstractions
This means we are looking at broad ideas rather than isolated events in our calculations. While idealist and sometimes hypothetical in analysis, they need to be firmly anchored to pragmatic and feasible outcomes.

They work on multiple dimensions
As you have seen from the previous slide: from visual to cognitive, emotional, spatial and computational evaluations.

Visualisation Examples

Dear Data: http://www.dear-data.com/all
Dear Data is a project by Stefanie Posavec and Giorgia Lupi. Each tracked everyday things over a week, such as how many times she picked up the phone, and then visualised the data on a postcard. They then mailed the postcards to each other, Lupi from New York and Posavec from London. Not commercial at all, but very nifty!

An Introduction to Machine Learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Stephanie Yee and Tony Chu clarified this tough concept with an amazing scrolling visual example.

Heat Wave: http://flowingdata.com/2015/01/19/feeling-hot-hot-hot/
Bloomberg split up the global temperatures, averaged them, added visualisations and voila!

Stereotropes: http://stereotropes.bocoup.com/
Developed by the Bocoup Data Visualisation Team, this explores the many tropes in films and the adjectives used to describe them. Some words are unique to a trope, while others span multiple tropes and genders.
WHO and GIS

A geographical information system (GIS) is a computer system for capturing, storing, checking, integrating, manipulating, analysing and displaying data related to positions on the Earth's surface. It is thus a way of linking databases with maps to display information, perform spatial analyses, or develop and apply spatial models.
Don't Forget Quality Data

This course has focused on what we do with data in order to process it and extract maximum insight. One of the most fundamental elements of this process is ensuring the product is sound: that your data is 'good'!

Data quality consists of several dimensions:
Completeness of data
Data being up to date
No mistakes in data
Verhoef, P. C., Kooge, E., & Walk, N. (2016).
Completeness refers to whether all available data are present for all customers.

Mistakes occur frequently, especially with regard to customer descriptors such as name and address. These mistakes may arise when customers or operators do not complete data entry correctly. Mistakes can potentially have strong negative reputational consequences for a firm; for example, a firm may continue sending mail to someone who has recently passed away.

Being up to date concerns whether the data are updated on a frequent basis. A database that is not up to date can easily contain mistakes in all kinds of variables, from purchase history to contact details, leading to unreliable information. Out-of-date data may also cause wrong predictions to be made about future customer value, which can lead to less than optimal strategies.
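Simple completeness and freshness checks of this kind are easy to automate. A sketch over invented customer records; the field names, the reference date and the one-year staleness cut-off are all assumptions for illustration:

```python
from datetime import date

# Invented customer records; field names are assumptions for illustration
customers = [
    {"name": "A. Lee", "email": "a.lee@example.com", "updated": date(2016, 1, 10)},
    {"name": "",       "email": "b@example.com",     "updated": date(2016, 2, 1)},
    {"name": "C. Wu",  "email": None,                "updated": date(2014, 6, 3)},
]

required = ("name", "email")
today = date(2016, 3, 1)

# completeness: every required field is present and non-empty
complete = [c for c in customers if all(c[f] for f in required)]
# freshness: records not updated within the last year are flagged as stale
stale = [c for c in customers if (today - c["updated"]).days > 365]

print(f"complete: {len(complete)}/{len(customers)}, stale: {len(stale)}")
# complete: 1/3, stale: 1
```

Running checks like these on a schedule turns the three quality dimensions from a slide bullet into a measurable, monitorable property of the database.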

The Technology Enabling Analysis

Now that we have covered some of these methods, you must be asking: how do we perform these tasks? What technology is available?

All the traditional players, such as SAS, IBM SPSS, Matlab, Tableau, Pentaho, and others, are working toward Hadoop-based Big Data analytics. There are also new commercial vendors and open-source projects evolving to address the growing appetite for Big Data analytics.

Karmasphere (https://karmasphere.com) is a native Hadoop-based tool for data exploration and visualisation. They have since been acquired by FICO (http://www.fico.com/en/acquisitions).

Datameer (http://www.datameer.com) is a spreadsheet-like presentation tool. Alpine Data Miner (http://www.alpinedatalabs.com/) has a cross-platform analytic workbench.

R (http://cran.r-project.org) is by far the most dominant analytics tool in the Big Data space. R is an open-source statistical language with constructs that make it easy for data scientists to explore and build models, and it is renowned for its plethora of available analytics libraries. There are libraries focused on industry problems (e.g. clinical trials, genetics, finance) as well as general-purpose libraries (e.g. econometrics, language processing, optimisation, time series). There are also commercial distributions via Revolution Analytics.

Some of the open-source technologies we have discussed previously, including Apache Mahout, a scalable Hadoop machine learning library (http://mahout.apache.org), are slowly being adopted for enterprise use.
Schmarzo, B. (2015)
Watch this video from RapidMiner: https://www.youtube.com/user/RapidIVideos

RapidMiner provides some end-to-end solutions: it helps you prepare data, and assists with data reduction, modelling, prediction and visual analysis. Essentially, tools such as these empower the modern business manager.

There are also some emergent Big Data apps coming to fruition due to the demands of Big Data volumes and the pace of innovation. Both horizontal Big Data apps (e.g. machine log analytics by Splunk, http://www.splunk.com) and vertical Big Data apps (e.g. telecommunications analytics by Guavus, http://www.guavus.com) are emerging. Big Data apps are designed to address specific business problems and typically incorporate deeper, more complex prescriptive analytics, while also allowing business analysts the ability to explore the data.
The Actioning Analysis Process

We have looked at the 'why' of analysis, as well as some of the methods used to perform the actual analysis. What is missing is how we make this data actionable! Schmarzo (2015) outlines a straightforward process for turning data into actionable insight that assists decision making.
Step 1: Establish a hypothesis that you want to test.
Step 2: Identify and quantify the most important metrics or scores that predict a certain business outcome.
Step 3: Employ the predictive metrics to build detailed profiles for each individual customer with respect to the hypothesis to be tested.
Step 4: Compare an individual's recent activities and current state with his or her profile in order to flag unusual behaviours and actions that may be indicative of, say, a customer retention problem.
Step 5: Continue to seek out new data sources and new metrics that may be better predictors of attrition. This is also the part of the data science process in which you continuously try to improve the accuracy and confidence levels of the metrics and scores using sensitivity analysis and simulations.
Step 6: Integrate the analytic insights, scores, and recommendations into the key operational systems (likely CRM, direct marketing, point of sale, and call centre for a customer retention initiative) to ensure that the insights uncovered by the analysis are actionable by frontline or customer-facing employees.
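Step 4 is essentially anomaly detection against a customer's own history. One simple, illustrative way to flag unusual behaviour is a z-score rule; the three-sigma threshold and the call counts below are assumptions, not part of Schmarzo's process:

```python
import statistics

def flag_unusual(profile_history, recent_value, threshold=3.0):
    """Flag a recent value that deviates sharply from the customer's
    own historical profile (a simple z-score rule)."""
    mean = statistics.mean(profile_history)
    std = statistics.stdev(profile_history)
    z = (recent_value - mean) / std
    return abs(z) > threshold, z

# Invented figures: monthly support calls for one customer, then a spike
history = [2, 3, 2, 4, 3, 2, 3]
flagged, z = flag_unusual(history, recent_value=11)
print(flagged, round(z, 1))   # True 11.0
```

A spike this far above the customer's own baseline would be routed to the retention team under Step 6; real systems would combine many such metrics rather than rely on one.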
Once this process is complete, review the findings with the leadership team and with those responsible for responding to the findings (often subject matter experts). Ask them:

Is the insight of strategic value to what you are trying to accomplish?
Is the insight actionable (i.e. is it at a level upon which I can act)?
Is the insight of material value (i.e. is the value of the insight greater than the cost of acting on it)?

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:
Use credit card transaction data and information on the account holder as attributes. (What are these attributes? What do we want to know about them?)
Label past transactions as fraud or fair; this forms the defining or 'class' attribute.
Learn a model of how these patterns have formed.
Use this model to predict future fraud by observing credit card transactions on the account. (What types of methods are used in this process?)

Key Perspective: This use case looks at applications of analysis methods examined in this course. Please read these approaches carefully and think about responding to the questions.
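One minimal way to illustrate the label-then-predict idea is a nearest-centroid rule over labelled past transactions. The features and figures below are invented, and real fraud models are far more sophisticated, but the shape of the workflow is the same: learn from labelled history, then score new transactions:

```python
# Labelled past transactions: (amount in $, distance from home address in km)
fair  = [(25, 3), (40, 5), (15, 2), (60, 8)]
fraud = [(900, 400), (750, 350), (1200, 500)]

def centroid(points):
    """Mean point of a labelled class."""
    return tuple(sum(c) / len(points) for c in zip(*points))

c_fair, c_fraud = centroid(fair), centroid(fraud)

def predict(tx):
    """Score a new transaction by its nearest class centroid."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "fraud" if dist(tx, c_fraud) < dist(tx, c_fair) else "fair"

print(predict((30, 4)))      # fair
print(predict((1000, 450)))  # fraud
```

The 'fraud or fair' labels on past transactions play exactly the role of the class attribute described above: without them there is nothing for the model to learn from.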

Market Segmentation

Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:
Collect different attributes of customers based upon their geographic and lifestyle-related information. (What sources would we use here? How could we reliably access this data?)
Find groups of similar customers.
Measure the quality of these groups by observing the buying patterns of customers in the same group versus those in other groups. (What types of methods are used in this process?)

Supermarket Shelf Management / Inventory

Goal: Identify items that are purchased together by sufficiently many customers. (Is this the only goal here? What other business and commercial benefits can you think of?)

Approach:
Process point-of-sale data collected with barcode scanners to find dependencies among items. (What can this be cross-referenced with for deeper insight?)
Develop rules and associations between these purchases. (What types of methods are used in this process?)

Question Time
Try answering these brief review questions:
1. What is a key difference between predictive analysis and descriptive analytics?
2. Explain how both cluster and association analysis work to provide future insight and enable market
predictions for business leaders. What key insights could they provide to certain industries?
3. Justify why an organisation would invest further to visualise their analysis. What value does this process add?
4. Describe two different dimensions of visualisation representation.
5. Write a sample job posting for a data artisan you are looking to recruit. What skills and requirements should
they have and why?
6. Explain the consequences of poor data quality to a business.
Summary

In this topic, we studied the concepts of data reduction and dimensionality reduction as ways to prepare data for analysis. The field of predictive analytics was introduced, which builds upon the data mining and modelling of the previous topic. We looked at clustering and association methods, which draw inferences through pattern discovery in order to predict future behaviour and inform future practice.

This topic also acquainted you with the different methods that can be used to accelerate collaboration between business users and the data science team, and extended this discussion to the role that visualisation can play in the interpretation of data by enabling discussion and comprehension.

While it is not the expectation of this module to turn you into data scientists and number-crunchers, our objective was to set a foundation that helps you, as future business leaders, to 'think like a data scientist'. The final 'actioning your analytics' process provided a pathway for this enablement, and a process by which to make the most of the data you have collected and sourced.

References
Minelli, M., Chambers, M., & Dhiraj, A. (2012). Big data, big analytics: emerging business intelligence and
analytic trends for today’s businesses. John Wiley & Sons.
Ohlhorst, F. J. (2012). Big data analytics: turning big data into big money. John Wiley & Sons.
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.
Schmarzo, B. (2015). Big Data MBA: Driving Business Strategies with Data Science. John Wiley & Sons.
Verhoef, P. C., Kooge, E., & Walk, N. (2016). Creating Value with Big Data Analytics: Making Smarter Marketing
Decisions. Routledge.

Extra Resources
Koukounas, M. (2013) “A data analytics strategy for boosting business efficiency and revenue”, accessed online at:
http://searchcio.techtarget.com/video/A-data-analytics-strategy-for-boosting-business-efficiency-and-revenue [February, 2016]
A short and simple interview with Michael Koukounas, a leading analyst at Equifax, looking at the value of analytics. It touches on many themes we have covered in this topic: asking the right questions, putting procedures in place, and some insightful use cases.
Kirk, A. (2015) “Best of Visualisations, 2015”, accessed online at:
http://www.visualisingdata.com/2015/07/best-of-the-visualisation-web-may-2015/ [Feb, 2016]
Looking for more examples of Visualisation? Perfect resource!
Techtarget (2016) “Seven data science lessons from McGraw-Hill Education analytics guru”, accessed online at:
http://searchcio.techtarget.com/tip/Seven-data-science-lessons-from-McGraw-Hill-Education-analytics-guru [Feb, 2016]
A very practical interview with Alfred Essa, VP of analytics and R&D at McGraw-Hill Education. It covers how McGraw-Hill is applying analytics, the role of the data scientist, and an emphasis on predictive methods of data exploration in the future.
Spotfire (2015) “Going from Playtime to Efficiency with Data Visualization”, accessed online at:
http://www.tibco.com/blog/2015/12/03/going-from-playtime-to-efficiency-with-data-visualization/ [Feb, 2016]
How to streamline and optimise your visualisations: a very handy checklist, with some nice notes on formatting. Light, easy and fun.