Data Warehousing and Cloud

May 19, 2023

Big Data for Business Management
Module 2: Infrastructure
2.2 – Data Warehousing and Cloud
Our Journey
Module 1 Milieu > Module 2 Infrastructure > Module 3 Applications
Business Intelligence
Visual and Predictive Analytics
Social Media Analytics
Introduction to Big Data
Big Data in Business
Big Data Value Creation
This module introduces you to Information Systems, Data Warehousing and Cloud Technologies.
It further explores how businesses manage and harness large databases and process them to
discover dynamic and real-time information.
Information Systems
Data Warehouse and Cloud
Processing Platforms for Data Discovery

1 Storage
2 Data Warehousing
3 Cloud Computing
This topic explores types of data, traditional data
stores, distributed systems, database
management, and relational databases (RDBMS).
Data warehousing methods are then explored, and the concepts of OLTP and OLAP are highlighted, indicating the need for real-time processing of data for businesses.
The topic then examines Cloud Computing and the solutions/best practices that enable real-time data discovery, which facilitates lean and efficient processes.
Topic Overview
Storage
Finding Homes for Data
We know from the 7V’s of data that Volume is a key component of any Big Data conceptualisation: the sheer amount of data can be overwhelming, not just for us but for our systems!
Data warehousing techniques have been used for Big Data storage for some time now; however, the techniques by which we store these large data sets have continued to develop.
By far the most common and popular is the relational database (RDBMS), which stores information in rows and columns. In relational databases the primary access is via the row.
Another storage technique is a columnar storage scheme. While a columnar database has rows and columns, information is predominantly accessed via the columns.
Lastly, and most importantly, there are also hybrid approaches, which allow some information to be easily accessed via rows and other data to be easily accessed via columns.
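To make the row versus column distinction concrete, here is a minimal sketch in Python. The records, field names and the `to_columns` helper are hypothetical and chosen only for illustration; real database engines organise data on disk in far more sophisticated ways.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"order_id": 1, "region": "North", "amount": 120.0},
    {"order_id": 2, "region": "South", "amount": 80.0},
    {"order_id": 3, "region": "North", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dictionaries into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented access: fetch every field of one record in a single lookup.
second_order = rows[1]

# Column-oriented access: aggregate one field without touching the others.
total_amount = sum(columns["amount"])

print(second_order)
print(total_amount)
```

Row access suits lookups of a whole record, such as one customer order, while column access suits analytics that scan a single field across many records.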

Storage
Storage Structures
The following demonstrates visually how this storage is structured:
http://www.teradata.com/uploadedImages/Company-And-Careers/News-And-Events/News/2012/Smart-Analytics-Large.png

Storage
Storage Limitations
However, storage does not come without its limitations, namely issues of physical space! Some of the most typical and apparent Big Data storage and computing limitations are:
Disk bound: There is simply not enough storage space on the disk. We may be looking at petabytes of data when only terabytes are available. As a side note, here is a nice graphic of the various sizes of data:
http://www.billpetro.com/wp-content/uploads/zettabyte.jpg
I/O bound: I/O stands for “inputs and outputs”, and an I/O-bound system is limited by the time taken to move data in and out while queries are processed. We can visualise this as a “bandwidth” of sorts: sometimes there is not enough of this “pipe” to move all the data around in time to meet business needs.
Memory bound: Memory is what we use to process these applications and queries. The bigger the data, the more of it we need. A nice analogy is a blackboard when working out a maths problem: the bigger the blackboard, the more “space” we have for our workings.
CPU bound: Similar in some ways to memory, since we use a combination of memory and CPUs to process Big Data analytics. Most analytic software today works by loading all the data needed for a calculation into memory and then using the CPU to perform the calculations.
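A quick back-of-envelope calculation can show which of these limits applies. The sketch below is only illustrative: the capacities, the transfer rate and the function names `scan_hours` and `binding_limits` are assumptions, not figures from any real system.

```python
TB = 10**12  # bytes in one terabyte (decimal convention)
PB = 10**15  # bytes in one petabyte


def scan_hours(data_bytes, bytes_per_second):
    """Hours needed to read the data once at a sustained transfer rate."""
    return data_bytes / bytes_per_second / 3600


def binding_limits(data_bytes, disk_bytes, memory_bytes):
    """Report which capacity limits the data set exceeds."""
    return {
        "disk_bound": data_bytes > disk_bytes,
        "memory_bound": data_bytes > memory_bytes,
    }


# Hypothetical scenario: 1 PB of data, 100 TB of disk, 1 TB of memory,
# and a 1 GB/s "pipe" between storage and the analytics server.
limits = binding_limits(1 * PB, 100 * TB, 1 * TB)
hours = scan_hours(1 * PB, 10**9)

print(limits)
print(round(hours, 1))
```

Under these assumed numbers the data fits neither on disk nor in memory, and a single pass over it would take roughly 278 hours, which is more than eleven days.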

Storage
Overwhelming Systems
The sheer volume and continuous nature of Big Data programs may eventually exhaust your systems if they are not designed accordingly. Appropriate storage is vital for a number of reasons, chiefly to relieve pressure on the human actors in your IT systems.
There is a chance that your analytics operators take shortcuts when processing data to compensate for memory or CPU limitations. They may break their problems apart to process smaller amounts of the data and use these partial results to reach their outcomes. They may even use samples and extrapolate.
These activities can be costly in time (human computation always takes longer!), and can also skew your analysis and limit your insights.
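The risk of sampling and extrapolating can be sketched with a small example. The sales figures and function names below are hypothetical; the point is only that a convenience sample can miss rare but large values, whereas working through the full data in chunks stays exact at the cost of more effort.

```python
def chunked_total(values, chunk_size):
    """Exact total, computed one chunk at a time to limit memory use."""
    total = 0
    for start in range(0, len(values), chunk_size):
        total += sum(values[start:start + chunk_size])
    return total


def extrapolated_total(values, sample_size):
    """Naive shortcut: scale up the average of the first sample_size values."""
    sample = values[:sample_size]
    return sum(sample) / len(sample) * len(values)


# Hypothetical sales data: mostly small orders, with a few very large ones.
sales = [10] * 95 + [1000] * 5

exact = chunked_total(sales, chunk_size=20)
estimate = extrapolated_total(sales, sample_size=20)

print(exact)     # 5950
print(estimate)  # 1000.0
```

Here the shortcut understates the true total by almost a factor of six, because the sample happens to contain none of the large orders.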
Big Data can be a Big Deal – if you are in fact
acquiring relevant data and delivering it to
specific people within their specific
timetables. Call it
data merchandising — the
act of managing, cleansing and securing
highly diverse data from a variety of sources;
delivering specific data to specific users; and
doing so 24/7 with available personnel and
resources.
Organisations need a team of
IT Analytics
Champions
that can do this.
A team that can efficiently control the data
acquisition process and assume responsibility
for quality data acquisition and management.
The properly positioned IT Analytics
Champions don’t deliver a barrage of data to
everyone at hand; they carefully screen and
cleanse that data and then provide pinpoint
delivery of that data to exactly the people
who need to see it.
Lumidata (2015)
Storage
Distributed Data Systems
The ability to design, develop, and implement a big data application is directly dependent on an awareness of
the architecture of the underlying computing platform, both from a hardware and, more importantly, from a
software perspective.
Because simplistic traditional computer systems are limited in capacity, they cannot easily accommodate
massive amounts of data.
That is why high performance platforms are composed of collections of computers in which the massive amounts
of data and requirements for processing can be distributed among a pool of resources.
More of these platforms are to come in the next topic, when we look at Hadoop systems, HDFS and
MapReduce.
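The idea of spreading data and processing across a pool of resources can be sketched as follows. This is a simplified illustration, not how any particular platform works: threads on one machine stand in for separate computers, and the `partition` and `distributed_sum` helpers are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts roughly equal slices."""
    return [items[i::n_parts] for i in range(n_parts)]


def partial_sum(chunk):
    """The work each worker performs on its own share of the data."""
    return sum(chunk)


def distributed_sum(items, n_workers=4):
    """Distribute the data, process each share, then combine the results."""
    chunks = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


readings = list(range(1, 1001))
total = distributed_sum(readings)

print(total)  # 500500
```

No single worker ever holds the whole data set; each sees only its own partition, and only the small partial results are brought back together.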

Data Warehousing
Current and Historic Data
A data warehouse (DW or DWH), also known as an
enterprise data warehouse (EDW), is a central
repository of integrated data from one or more
disparate sources. Data warehouses store current and historical
data and are used for creating analytical reports for
knowledge workers throughout the enterprise.
Examples of reports could range from annual and
quarterly comparisons and trends to detailed daily
sales analysis.
The data stored in the warehouse is uploaded from
the operational systems (such as marketing or sales).
The data may pass through an operational data store
for additional operations before it is used in the DW
for reporting.
https://upload.wikimedia.org/wikipedia/commons/4/46/Data_warehouse_overview.JPG
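As a rough sketch of this flow, the example below loads extracts from two operational systems into one integrated table and then produces a simple monthly report. It uses Python's built-in sqlite3 module with an in-memory database standing in for the warehouse; the table name, columns and figures are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_facts (source TEXT, event_date TEXT, amount REAL)"
)

# Extracts from two operational systems, each with its own record shape.
sales_rows = [("2023-01-05", 200.0), ("2023-02-10", 150.0)]
marketing_rows = [{"day": "2023-01-20", "spend": 40.0}]

# Load step: map both shapes onto the one integrated schema.
conn.executemany(
    "INSERT INTO warehouse_facts VALUES ('sales', ?, ?)", sales_rows
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES ('marketing', :day, :spend)",
    marketing_rows,
)
conn.commit()

# Analytical report: totals per source and month across all loaded data.
report = conn.execute(
    "SELECT source, substr(event_date, 1, 7) AS month, SUM(amount) "
    "FROM warehouse_facts GROUP BY source, month ORDER BY source, month"
).fetchall()

for line in report:
    print(line)
```

The operational systems keep their own formats; only the load step needs to know how to translate each one into the shared warehouse schema.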
Data Warehousing
Datamarts
A data mart is a simple form of a data
warehouse that is focused on a single subject
(or functional area); hence it draws data
from a limited number of sources such as
sales, finance or marketing.
Data marts are often built and controlled by a
single department within an organisation. The
sources could be internal operational
systems, a central data warehouse, or
external data.

Data Warehouse              Data Mart
Enterprise-wide data        Department-wide data
Multiple subject areas      Single subject area
Difficult to build          Easy to build
Takes more time to build    Less time to build
Larger memory               Limited memory

http://docs.oracle.com/html/E10312_01/dm_concepts.htm
Data Warehousing
OLAP/OLTP
Online analytical processing (OLAP) is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing. It is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management, budgeting, forecasting and financial reporting.
Online transaction processing (OLTP) is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term has also been used to refer to processing in which the system responds immediately to user requests. An automated teller machine (ATM) for a bank is an example of a commercial transaction processing application. OLTP applications are high throughput and insert- or update-intensive in database management. These applications are used concurrently by hundreds of users. The key goals of OLTP applications are availability, speed, concurrency and recoverability.
Reduced paper trails and faster, more accurate forecasts of revenues and expenses are both examples of how OLTP makes things simpler for businesses. However, like many modern online information technology solutions, some systems require offline maintenance, which further affects the cost-benefit analysis of an online transaction processing system.
Surajit Chaudhuri & Umeshwar Dayal (1997). “An overview of data warehousing and OLAP technology”. SIGMOD Rec. ACM. 26 (1): 65.
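The contrast between the two workloads can be sketched using the ATM example. The code below is a minimal illustration with Python's built-in sqlite3 module; the tables, the `withdraw` function and all the figures are hypothetical. Each withdrawal is a short OLTP-style transaction touching one account, while the final query is OLAP-style, summarising many rows across two dimensions (branch and quarter).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute(
    "CREATE TABLE withdrawals "
    "(account_id INTEGER, branch TEXT, quarter TEXT, amount REAL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 300.0)])
conn.commit()


def withdraw(account_id, branch, quarter, amount):
    """OLTP-style unit of work: a short transaction updating one account."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            balance = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()[0]
            if balance < amount:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
            conn.execute(
                "INSERT INTO withdrawals VALUES (?, ?, ?, ?)",
                (account_id, branch, quarter, amount),
            )
        return True
    except ValueError:
        return False


withdraw(1, "City", "Q1", 100.0)
withdraw(2, "Airport", "Q1", 50.0)
withdraw(1, "City", "Q2", 70.0)
rejected = withdraw(2, "City", "Q2", 1000.0)  # exceeds the balance

# OLAP-style query: aggregate all withdrawals by branch and quarter.
summary = conn.execute(
    "SELECT branch, quarter, SUM(amount) FROM withdrawals "
    "GROUP BY branch, quarter ORDER BY branch, quarter"
).fetchall()

print(summary)
```

The transaction either completes fully or leaves no trace, which reflects the recoverability goal of OLTP; the summary query reads everything but changes nothing.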
Data Warehousing
We Know That…
Technology has changed.
New technologies and open-source inventions enable different approaches that make it
easier and more affordable to store, manage, and analyse data.
The variety of data has evolved.
Unstructured data is on the rise, as are the methods for handling it and the techniques to store, process and manage these diverse data sets.
However, there are some engineering challenges.
Despite developments in technology and platforms to manage these new and unstructured data sets, we still face design challenges around how these systems are built and maintained. Moreover, we can also think about the challenges of future-proofing our platforms for the data of tomorrow.

Cloud Mapping
Cloud computing describes a relatively new development that assists in the supply, consumption and delivery of IT services based on the internet.
It typically involves the provision of dynamically scalable and often virtualised resources as a service over the internet.
It has the potential to alleviate many of the storage and system limitations we have discussed, because resources can be provisioned remotely via an internet connection. It is also dynamic in the sense that memory, CPU and storage limitations are now mediated via remote access to a server.
It can also be “on demand”: third-party applications process and store large data sets in third-party data centres when requests are made. This saves most businesses the in-house maintenance and management of IT systems, and as data needs change these requests can be scaled up or scaled down.

Cloud Mapping
In effect, cloud computing can also allow for:
Data being processed elsewhere, eliminating the need for in-house systems to accommodate high-level processing (CPU and memory capacity).
Space and storage becoming in effect “infinite”, as storage is distributed remotely.
Economies of scale benefits, as you only pay for what you use. Subscription services to third-party providers of cloud-based storage and computing platforms are generally the norm. Infrastructure costs are therefore borne by the third party.
A heavier reliance on internet access, which shifts the issues of capacity from storage to bandwidth.
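The “pay for what you use” point can be illustrated with some simple arithmetic. Every number below is invented for the sake of the example, as are the function names; real pricing varies by provider and workload, so treat this purely as a sketch of the reasoning.

```python
def fixed_capacity_cost(peak_tb, cost_per_tb_month, months):
    """In-house capacity sized for the peak and paid for every month."""
    return peak_tb * cost_per_tb_month * months


def pay_as_you_go_cost(monthly_usage_tb, cost_per_tb_month):
    """Cloud-style billing: pay only for the capacity used each month."""
    return sum(usage * cost_per_tb_month for usage in monthly_usage_tb)


# Hypothetical storage use in TB over six months, with one seasonal spike.
usage = [10, 12, 40, 15, 10, 11]

# Hypothetical prices: the cloud rate per TB is assumed to be higher.
fixed = fixed_capacity_cost(max(usage), cost_per_tb_month=20, months=len(usage))
flexible = pay_as_you_go_cost(usage, cost_per_tb_month=25)

print(fixed)     # 4800
print(flexible)  # 2450
```

Under these assumptions the pay-as-you-go option is cheaper even at a higher unit price, because the fixed system must be sized for a peak that occurs in only one month. With steadier usage the comparison could easily go the other way.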
It is important to remember that Cloud Computing is not simply networked computing. Applications and data are not confined to a single computer (they are distributed and can encompass multiple servers and involve multiple companies). It is also not traditional IT outsourcing, where subcontracting occurs.
It is also not a buzzword. This is a concept that has been in viable commercial use for the past few years.
https://lh5.googleusercontent.com/EjoEwU_-MyRawQa1FOgeKUl8xcwFOFRu8varPKgSWmejTuZy4OHa8rxlI1LG5P7SFD9G552zojSC7Ppusd1wuuEdmVRoyVnMr5_km47grUE4ZcZ5olB

Cloud Mapping
The Services of The Cloud
So what can we do with the Cloud?
Many models have emerged for the ways in which this technology can be exploited, and many are relevant for our
understanding of Big Data and Analytics practices.
XaaS/EaaS
“Everything as a Service”
This refers to the increasing number of services that are delivered over the Internet rather than provided locally or on-site.
It allows organisations to call up software components (applications) over a network for business use. This can range from email and administration operations to computing and processing applications.
It is technically a subset of cloud computing, with the most common example being Software as a Service (SaaS). However, this “as a service” terminology has been associated with many other functions, including communication, infrastructure (IaaS) and platforms (PaaS). Sometimes this is referred to as the SPI model (Software, Platform and Infrastructure). All are core components of Cloud Computing.

Cloud Mapping
Cloud Computing Services
SaaS (Software as a Service)
Applications, typically available via a browser. The first and most common service.
Examples: Google Apps (email etc), Windows/Office Live, and CRM software delivered online, such as Salesforce.com
PaaS (Platform as a Service)
A platform that enables application developers to host their services. Also provides the ability to build and deploy cloud applications.
Examples: Microsoft Azure, Amazon EC2
IaaS (Infrastructure as a Service)
Service providers offer capacity for rent, basically to host data centres and servers.
Examples: Rackspace, Amazon EC2 and S3
This is a nice flow diagram showing who uses what and how:
http://www.mechanosphere.com/Media/Images/DigitalEcosystems/Fig_small.jpg
Cloud Mapping
Try answering these brief review questions:
Choose one of the examples from the previous slide and explore what the service offering to businesses is in
terms of Cloud potential.
1. Think specifically how this is vital for Big Data applications: would this help alleviate some of the storage
limitations we have discussed previously?
2. Why would businesses be ready to adopt and deploy some of these services? List the key benefits to an
organisation from a functional or operational perspective.
Quiz
Cloud Mapping
Types of Clouds
PUBLIC
This is where Cloud services (corporate, database, storage etc) are delivered to the client via the internet from a third-party service provider.
Examples: Amazon Cloud Services offering, Google, Microsoft, and Dropbox!
PRIVATE
Where services are managed and provided within the organisation (in-house). This allows for less restriction on network bandwidth, greater security, and better handling of other legal concerns.
Examples: HP Data Centres for Enterprise
HYBRID
Simply put, a combination of services provided from both public and private clouds (in and out of house).
Examples: Sales and email services on public cloud, in-house management of Enterprise Resource Planning services
As well as the various services “on” the Cloud, there are also various types of Cloud itself, which almost resemble types of access.
You may have heard of concepts such as “Hybrid Cloud” and wondered what this means, so this will help clarify these concepts…
What are the benefits of public or private clouds?
Public cloud gives the impression of “infinite” storage resources, since everything is offered and hosted remotely. There are also fewer up-front commitments and costs to businesses with public cloud, as well as pay-as-you-go options to scale up if more storage and processing is required.

Cloud Mapping
Cloud Computing and Storage
It is important to remember that for all kinds of reasons—technical, political, social, regulatory,
and cultural—cloud computing has taken some time to gain commercial momentum.
In terms of Big Data storage, the Cloud has not been as enthusiastically embraced by big
businesses and enterprise. However, there are many who believe that some key industry
players will soon realise that there is a huge ROI opportunity if they do embrace cloud. These
benefits should be now apparent in terms of solving some of our storage issues.
In terms of the potential possibilities of the Cloud, we need to go beyond our basic
conceptualisation and think purely about how we promote its function for wider business
adoption. This includes:
A shift in focus from the virtual capability to the remote delivery of a service. Whether this is public or private is almost moot; the delivery and consumption of this service is what matters most for data and analytics applications.
Acknowledging some of the key business issues. There are real concerns over the privacy and security of data, and these need to be addressed so that the business case is solid enough to promote uptake.
Addressing technical gaps in the software being remotely provisioned. Everything from the ability to run analytics at scale in a virtual environment to ensuring the authenticity of information processing and analytics needs a solution.
In terms of organisational perspective,
think about where some key points of
resistance have emerged from:
Executive level: “A buyer-centric
approach to technology where
applications are available to purchase or
rent. Whenever and wherever!”
CFO-level: “An approach to consuming technology in a pay-as-you-go model, where consumers only pay for what is used”
CIO-level: “A comprehensive approach to
outsource IT applications remotely, where
infrastructure can be provisioned and
managed externally”.

Cloud Mapping
Cloud Computing and Storage
Case study from Computer Weekly (2014)
Retailers struggle to join up the dots and turn data analysis to advantage
The power of data is becoming clear to organisations of all shapes and sizes, but many in the retail sector
are failing to exploit the benefits of data analysis. Research from eCommera found only 23% of UK retailers
could quickly use available data to take business decisions. Nearly 50% of retailers believed their business
intelligence tools fell short of their needs, with only 16% saying they were confident that their data
analytics tools provided the organisational visibility they needed. Computer Weekly attended
Demandware’s Xchange event in Miami, and it was clear from the keynote that data is going to be a key
trend for retailers over the coming year. Demandware president and CEO Tom Ebling said retailers should
make data an important part of their business strategy. “The future is going to be so personalised, you’ll
know the customer as well as they know themselves,” he said. Tom Davis, global lead for e-commerce at
Puma, said retailers were disadvantaged by failing to gain insight into their customers. “If you can’t control
your data, you can’t move fast enough,” he said.
The following case study details some of the many ways the retail sector is facing challenges with Big Data and analytics adoption.

Exercise CROCS
Cloud Mapping
The loyalty opportunity
Footwear brand Crocs has collected data for some time in the US. But it has not yet managed to tie that data to
customer profiles on a global basis. So far, Crocs has used transactional data to drive product assorting and
recommendations. But the introduction of a new customer relationship management (CRM) programme will give the
retailer a “big data view” of customers, helping it identify customers who have a propensity to buy more easily and more
readily, as well as identifying loyal customers to create a loyalty program.
“We know the customer who shops in multiple channels has a much greater value and provides us an opportunity for
growth as we bring new lines to market and expand the brand,” said Harvey Bierman, vice-president of global
e-commerce at Crocs. Over the years, many retailers have implemented loyalty programmes, gaining insights into
customers’ habits.
You can access the full article here:
http://www.computerweekly.com/news/2240218550/Are-retailers-using-data-analytics-to-their-advantage
Exercise
Question Time
http://www.computerweekly.com/news/2240218550/Are-retailers-using-data-analytics-to-their-advantage
Think about the following:
1. What key data storage challenges existed for some of these retailers?
2. Why are retailers hesitant to adopt some of these analytics applications?
3. What was the value here of Cloud services?
4. How did consumers benefit?
Quiz
Case Study
Question Time
Try answering these brief review questions:
1. Describe the main types of storage structures. How has Big Data caused these structures to change? Summarise the key limitations of traditional storage systems.
2. Give two arguments for why organisations should focus on storage if they want to expand or implement a large-scale data analytics program. What is likely to occur if storage is neglected?
3. Explain the difference between SaaS, PaaS and IaaS.
4. Why would businesses be attracted to hybrid Cloud offerings? What are some of the limitations of a purely
public or purely private cloud network?
5. Why do you think businesses have not fully embraced Cloud computing? Explain with reference to both
business operations and Big Data programs more specifically. How can this be overcome?
Quiz
Summary
There are issues and limitations around data storage. We first looked at how storage
systems work: some of the structures and systems that are required for Big Data
requirements, with a heavy emphasis on Volume. We also looked at some of the key limitations for storage and analytics: the challenges that currently face businesses, one of the most important being the overload of current IT systems and employees.
We then explored data warehouses, data marts and the concepts of OLAP and OLTP for
storage, retrieval and processing of data. Technologies have evolved, and so we need
better solutions.
Cloud came in as our saviour to some of these storage limitations, and we discussed the
fundamental concept of what Cloud Computing really means. Due to the remote availability,
application and service potential, as well as important operational and efficiency benefits for
businesses, Cloud provides a strong case for suitability to Big Data programs.
We moved into the various types of Cloud (public, private and hybrid), as well as the
services that are offered and how they are provided by third-party suppliers. This further
assisted us in terms of promoting its function for wider business adoption (as well as some
of the hurdles).

References
Computer Weekly (2014), “Are retailers using data analytics to their advantage?”, Caroline Baldwin (Ed.), accessed online at: http://www.computerweekly.com/news/2240218550/Are-retailers-using-data-analytics-to-their-advantage [January, 2016]
Gartner (2015) “Gartner’s 2015 Hype Cycle for Emerging Technologies Identifies the Computing Innovations That Organizations Should Monitor”, accessed online at: http://www.gartner.com/newsroom/id/3114217 [January, 2016]
Lumidata (2015) “Don’t be Overwhelmed By Big Data”, accessed online at: http://www.lumidata.com/sites/lumidata.com/files/whitepapers/Don’t%20Be%20Overwhelmed%20by%20Big%20Data_2015pdf.pdf [January, 2016]
Minelli, M., Chambers, M., & Dhiraj, A. (2012). Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses. John Wiley & Sons.
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.
http://docs.oracle.com/html/E10312_01/dm_concepts.htm
Surajit Chaudhuri & Umeshwar Dayal (1997). “An overview of data warehousing and OLAP technology”. SIGMOD Rec. ACM. 26 (1): 65.
Extra Resources
Light, easy and fun
Ian Moyse (2012) Vimeo on Cloud Computing: https://vimeo.com/153755091 [January, 2016]
A lively presentation, if anything a bit outdated, but he makes some salient points towards the end that are interesting to view in retrospect regarding Cloud potential. A nice hint to strategy also.
Salesforce’s case study for Gumtree (2014) https://www.youtube.com/watch?v=YjIDbc11Z00 [January, 2016]
If you ignore the Salesforce pitch language, it provides an interesting application of Salesforce, an important and popular SaaS application.
For the academically-curious
Presentation by Bill Barney, CEO, Reliance Communications (Enterprise) & Global Cloud Xchange, Hong Kong SAR China, accessed online at: https://vimeo.com/153474971 [January, 2016]
A comprehensive presentation on the state of Cloud in 2016 and the impact of the Cloud “revolution”. Some nice critical discussion here, which is always appreciated for perspective! Gets a bit techie and industry-focused towards the end, but interesting nonetheless.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115. http://umexpert.um.edu.my/file/publication/00001293_117865.pdf [January, 2016]
The most comprehensive plain-language paper on the state of Cloud and Big Data and their complex relationship. Very interesting conclusions, and an excellent overview of some of the key terminology we have discussed in this module.