Processing Platforms for Data Discovery


Big Data for Business Management
Module 2: Infrastructure
2.3 – Processing Platforms for Data Discovery
Our Journey
Module 1: Milieu > Module 2: Infrastructure > Module 3: Applications
Business Intelligence
Visual and Predictive Analytics
Social Media Analytics
Introduction to Big Data
Big Data in Business
Big Data Value Creation
This module introduces you to Information Systems, Data Warehousing and Cloud Technologies.
It further explores how businesses manage and harness large databases and process them to
discover dynamic and real-time information.
Information Systems
Data Warehouse and Cloud
Processing Platforms for Data Discovery

Topic Overview
1 Hadoop
2 MapReduce
3 Cassandra
This topic explores how businesses manage and harness large databases and process them to discover dynamic and real-time information. The topic introduces Hadoop and HDFS, the common standard for data storage; the Hadoop ecosystem; and MapReduce. Furthermore, it looks into emerging processing technologies (such as Cassandra) and contemporary solutions to data-processing issues.
Hadoop
From Our Last Topic
Storage is a critical element for Big Data. Data, as we know, has to be stored somewhere, readily accessible
and protected. This has proved to be an expensive challenge for many organisations, since network-based
storage (not forgetting Cloud systems) can be very expensive to purchase and manage.
Storage has evolved to become one of the more ‘standard’ elements in a typical data center. Indeed, storage
technologies have started to approach a critical commodity-like status for businesses.
The last topic spoke about some of the evolving data needs that can put a strain on storage technologies.
Traditional storage technologies are not suited to deal with the terabytes (and petabytes!) of unstructured
information presented by Big Data as we have discussed previously.
Success with Big Data analytics demands something more: a new way to deal with large volumes of data
and a new storage platform approach.

Hadoop
Rise of Distributed Data Systems
The ability to design, develop, and implement a big data program is directly
dependent on an understanding of the underlying computing platform, both
from a hardware and, more importantly, from a software perspective.
Because traditional, single-machine computer systems are limited in capacity, they
cannot easily accommodate massive amounts of data.
High performance platforms are composed of collections of computers in which
the massive amounts of data and requirements for processing can be
distributed among a pool of resources, easing this burden.
The terms distributed database and distributed processing are closely related, yet have distinct meanings.
Distributed database
A set of databases in a distributed system that can appear to applications as a single data source.
Distributed processing
The operations that occur when an application distributes its tasks among different computers in a network.
For example, a database application typically distributes front-end presentation tasks to client computers and allows a
back-end database server to manage shared access to a database.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67775/ch3_eva4.gif
Exercise: Distributed Data Systems
Hadoop
Please watch the following short video which canvasses the key ways centralised
databases differ from distributed databases.
https://www.youtube.com/watch?v=tnDKJwzV6FA
Hadoop
The Hadoop Distributed File System (HDFS)
The Big Data foundation is composed of two major systems. The first stores the data and the
second processes it.
Big Data storage is often used interchangeably with the Hadoop Distributed File System (HDFS), but
traditional data warehouses can also house Big Data. HDFS is distributed data storage that
has become the de facto standard because you can store any type of data without limitations
on the type or amount of data. One of the reasons HDFS has become so popular is that you
don't have to do any setup to store the data. In traditional databases, you need to do quite a
bit of 'setup' in order to store data.
Before data can be housed in a traditional database, you have to set up the database by creating a schema. The
schema is the blueprint for how data is placed into tables with columns. The schema also
contains rules about the data being stored—for example, which column(s) will be used to find
or index the data in a particular table.
With HDFS, you don’t set up a blueprint; you simply dump the data into a file. (Think about it as
cutting and pasting all your data into a huge Microsoft Word file!) This approach makes sense
for massive quantities of data when the value of that data is unknown.
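To make the 'no setup' point concrete, here is a minimal sketch of dumping raw data straight into HDFS using the standard Hadoop FileSystem Java API. The file path and payload are purely illustrative, and the snippet assumes it runs on a machine whose Hadoop configuration points at an HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawLoad {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's default file system (HDFS, when so configured).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // No schema, no tables: the raw bytes are simply written to a file path.
        Path target = new Path("/landing/clickstream/2016-02-01.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.writeBytes("any text, JSON, CSV or binary payload goes in as-is\n");
        }
    }
}

Nothing about the data's structure has to be declared in advance; structure is only imposed later, at processing time.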
The name "Hadoop" was given by one of Doug Cutting's sons to that son's toy elephant. Doug used the name for his open-source project because it was easy to pronounce and to Google.
An interview with Doug Cutting discussing this can be found here: http://video.cnbc.com/gallery/?video=3000165622 (CNBC, 2013)
Hadoop
So Why Is It So Good?
There are many Big Data technologies making an impact on the new technology stacks for handling Big Data,
but the Hadoop Distributed File System (HDFS) seems to be the most prominent, and more and more businesses
are now starting to leverage its capabilities.
When data lands in the cluster, HDFS breaks it into pieces and distributes those pieces among the different servers
participating in the cluster. Each server stores just a small fragment of the complete data set, and each piece of data is
replicated on more than one server.
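To see that splitting and replication at work, a hedged sketch along the following lines asks the NameNode which blocks make up a file and which servers hold each replica; the path is hypothetical and the output depends entirely on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/landing/clickstream/2016-02-01.log"); // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        // One entry per block; each block lists the DataNodes holding a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", replicas on " + String.join(", ", block.getHosts()));
        }
    }
}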
Another reason HDFS is used to complement traditional data warehouses is the limitation on the type of data a traditional
database supports, and on the size of data it can store. Often, traditional databases support the data type but
make it impractical to manipulate the data once it is stored, thereby rendering it fairly useless.
Big Data processing involves the manipulation and calculations that occur on the Big Data. Traditional databases have
differing abilities to process Big Data sets effectively and efficiently.
The Hadoop platform is designed to solve problems caused by massive amounts of data, especially data that contain a
mixture of complex structured and unstructured data, which does not lend itself well to being placed in tables. Hadoop
works well in situations that require the support of analytics that are deep and computationally extensive. In some ways,
it can be said to perform similar to a supercomputer.

For the decision maker seeking to leverage Big Data, Hadoop
solves the most common problem associated with Big Data: storing
and accessing large amounts of data in an efficient fashion.
The intrinsic design of Hadoop allows it to run as a platform that is
able to work on a large number of machines that don’t share any
memory or disks. With that in mind, it becomes easy to see how
Hadoop offers additional value: Network managers can simply buy
a whole bunch of commodity servers, slap them in a rack, and run
the Hadoop software on each one.
Frank Ohlhorst, 2015
Hadoop
How Did We Get Here?
Hadoop originated from the 'Nutch' open-source project on search engines in the early 2000s. Google
revolutionised the field of Big Data and Cloud Computing in 2004 when it released a set of papers
describing how it handled these problems. In 2006, Hadoop became a subproject of Apache Lucene, drawing on
techniques from these papers, but it wasn't until 2008, when Apache made Hadoop a top-level project, that it started
to 'take off' in terms of computational ability and feasibility.
http://1.bp.blogspot.com/-Kw_rRWWaT8Q/VCmsRbsK-oI/AAAAAAAAAO4/DwNKsFWf7U4/s1600/ha.png
Who is now using Hadoop to manage their
Big Data? Hadoop is now arguably the
backbone of…
Amazon/A9
Facebook
Google
IBM
Disney
New York Times
Yahoo!
Twitter
LinkedIn

Hadoop
How Can We Use Hadoop?
Hadoop has a variety of uses beyond being a distributed storage platform. The following are some of the many ways it has revolutionised information management across a variety of sectors:
Large dataset analysis (eg: AOL data warehousing)
Social Media analysis
Text mining and pattern search (eg: recommendation systems – Facebook and Disney)
Machine Log analysis
Geo-spatial analysis
Trend Analysis (eg: search analysis – Yahoo, Amazon)
Genome Analysis
Drug Discovery
Fraud and Compliance Management
Video and Image Analysis (eg: Facebook, NY Times)
Output analysis (eg: computing carbon footprint, emissions etc)
Watch the following video to see how Hadoop is advancing video and image analysis and changing the 'face' of emotional recognition, and what it means for businesses: https://www.youtube.com/watch?v=X97XQ-bIBig

MapReduce
Processing Large Data Sets
Because Hadoop stores the entire dataset in small pieces across a collection of servers, analytical jobs can be distributed, in parallel, to each of the servers storing part of the data. Each server evaluates the question against its local fragment simultaneously and reports its results back for collation into a comprehensive answer.
MapReduce is the agent that distributes the work and collects the results. It does this by automatically dividing the processing workload into smaller workloads that are distributed.
MapReduce is a programming model for processing large data sets on distributed computing clusters. It has emerged as the de facto processing software for HDFS, and is a fault-tolerant parallel programming framework designed to harness Big Data processing capabilities.
When working with HDFS, data manipulation and calculations are programmed using the MapReduce framework in the language of choice, which is typically the Java programming language.
Java is a general-purpose computer programming language that is specifically designed to have as few implementation dependencies as possible. It is intended to let application developers 'write once, run anywhere', meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. As of 2015, Java is one of the most popular programming languages in use. (Wikipedia, 2016)
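As an illustration of what that Java code typically looks like, below is a sketch of the classic word-count mapper and reducer written against the org.apache.hadoop.mapreduce API; the class names are our own and the example is deliberately minimal rather than production-ready.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: runs on each server against its local fragment of the input,
// emitting a (word, 1) pair for every word it sees.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: after the shuffle has grouped all the counts for a given word,
// sum them to produce that word's final total.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));
    }
}

MapReduce takes care of distributing these two small functions across the cluster and moving the intermediate data between them.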

MapReduce
Parallel Computing Framework
A parallel computing framework may be a bit meaningless to most of us. The following analogy helps to explain this concept and how
MapReduce operates…
"For example, a factory with 10 assembly lines receives an order to create 500 toy trucks. One assembly line could create all 500 toy trucks or, alternatively, there could be a division of labor among the assembly lines where each assembly line produces 50 toy trucks. If each assembly line started at the same time and everything went flawlessly each assembly line would complete their production of 50 trucks simultaneously. This efficient division of labor was fairly straightforward because each toy truck could be produced independently.
However, if three of the assembly lines could only produce the engine and the remaining seven assembly lines could only produce the balance of the toy truck the division of labor becomes more complicated. In this scenario, planning has to take into account that there is a dependency between the engine production and rest of the toy truck production.
Just like in the toy truck production, some data manipulation and calculations can be performed independently. To maximize the processing throughput, MapReduce assumes that the workloads being distributed are independent tasks and the workload is equally divided just like the division of labor to the 10 assembly lines to produce 50 trucks each. However, if there are dependencies in the processing workload, the MapReduce framework is unaware of those dependencies. The programmer has to be aware of those dependencies and has to specifically divide the workload up in the program understanding that MapReduce will automatically distribute tasks. This type of programming is called parallel programming.
Just like the production planning that has to occur in dividing the workload between assembly lines that only produce engines and all the other assembly lines was more complicated, parallel programming is more complicated. One of the benefits of MapReduce and some data warehouse appliances is that the easier independent processing is automatically handled by the framework or data warehouse appliance."
Minelli, M., Chambers, M., & Dhiraj, A. (2012)
MapReduce
Map. Shuffle. Reduce
Hadoop MapReduce includes several stages, each with an
important set of operations helping to get to your goal of getting
the answers you need from big data. The process starts with a user
request to run a MapReduce program and continues until the
results are written back to the HDFS.
The basic stages include:
Mapping: The input data is divided among the servers, and each server applies the map function to its local portion, writing the intermediate output to temporary storage.
Shuffling: Data is redistributed according to the keys produced by the map function, so that all similar data is sorted and assigned to the same reduce processor.
Reduce: Processing now occurs per group of data, in parallel.
The MapReduce system then collects all the Reduce output data,
and it is sorted to produce a final outcome.
It is helpful to think about MapReduce as an ‘engine’, because that is exactly how it works. You provide input (fuel),
the engine converts the input into output quickly and efficiently, and you get the answers you require.
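Continuing the word-count sketch from the earlier slide, a small driver ties the stages together: it names the mapper and reducer, points the job at input and output paths in HDFS, and hands everything to the MapReduce 'engine'. TokenMapper and SumReducer are the illustrative classes from that sketch, so this is a hedged example rather than a complete recipe.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map stage: TokenMapper runs in parallel against each input split.
        job.setMapperClass(TokenMapper.class);
        // Optional combiner: pre-sums counts on each mapper to shrink the shuffle.
        job.setCombinerClass(SumReducer.class);
        // Reduce stage: SumReducer receives the shuffled, grouped (word, counts) pairs.
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS and the collated result is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}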

MapReduce
Another Way of Looking at Things
http://www.glennklockwood.com/data-intensive/hadoop/mapreduce-workflow.png
This highlights our Big Data journey, right across and through the Hadoop system: from storage in HDFS through to MapReduce processing.
We can see data being distributed, assigned and processed; from large, daunting sets of data to manageable, actionable groupings.
It is a process of sorting in order to compute, and it results in lightning-fast computing ability that is highly efficient in practice.

MapReduce
The Data Processing Landscape
You might see or hear the term 'Apache Hadoop' in the literature and online
readings. This is because the software is managed by an open-source foundation
called the Apache Software Foundation, an American membership-based
organisation that is decentralised in structure and oversees the many
components of the Hadoop framework.
The Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop
modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores
data on commodity machines, which we have discussed previously.
Hadoop YARN – a resource-management platform responsible for managing
computing resources in clusters and using them for scheduling of users’
applications.
Hadoop MapReduce – an implementation of the MapReduce programming
model for large scale data processing, which we have previously covered.
http://www.tutorialspoint.com/hadoop/images/hadoop_architecture.jpg

MapReduce
The Hadoop Ecosystem
The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, or a
collection of additional software packages that can be installed on top of or alongside Hadoop.
[Diagram: The Hadoop Ecosystem – a tools layer (ZooKeeper, Flume, Hive, Pig, HBase, Sqoop) sitting on top of the Hadoop layer (HDFS and MapReduce).]
Hive: A distributed data warehouse for data that is stored in HDFS.
HBase: A columnar database and an implementation of Google Bigtable, highly scalable and able to store billions of rows.
Flume: Apache Flume is a distributed, reliable and highly available service for efficiently collecting, aggregating and moving large amounts of log data.
Mahout: A library of machine learning and statistical algorithms that were implemented in MapReduce and can run natively on Hadoop.
Sqoop: Used for transferring bulk data between Hadoop and traditional structured data platforms.
Pig: A platform for the analysis of very large data sets that runs on HDFS, with an infrastructure layer consisting of a compiler that produces sequences of MapReduce programs and a language layer consisting of the query language named Pig Latin.
ZooKeeper: A simple interface to the centralised coordination of services (such as naming, configuration, and synchronisation) used by distributed applications.
http://www.dummies.com/how-to/content/the-apache-hadoop-ecosystem.html
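To show how one of these ecosystem tools changes the day-to-day experience, the following hedged sketch queries Hive from Java over its standard JDBC interface. The connection URL, credentials and the 'weblogs' table are assumptions for illustration only, and the hive-jdbc driver would need to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to a HiveServer2 instance.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {
            // 'weblogs' is a hypothetical Hive table mapped over files sitting in HDFS;
            // Hive turns this SQL into distributed jobs over that data.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS views FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}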
MapReduce
The Data Processing Landscape
http://blog.datagravity.com/wp-content/uploads/2013/11/DataAnalyticsTrends-Big-Data-Landscape.jpg
And it doesn’t stop there. Hadoop is only
one of the many providers of processing
services (albeit they are open-source).
There are cross-infrastructure platforms
provided by IBM and SAP, as well as
applications provided by Google,
Microsoft and other private analytics and
infrastructure businesses.
The image on the left perhaps outlines
the complexity of the analytics processing
marketplace!
The following video from Cloudera is an example of one such provider…
MapReduce
Different Platforms for Different Sectors
The Wall Street Journal sat down with Cloudera Chief Executive Tom Reilly to talk about the different ways in which
Hadoop is being deployed by certain businesses. These applications are revealing in the sense that Cloudera has its own
market positioning and unique value proposition to certain sector groups.
http://www.wsj.com/articles/tom-reilly-talks-about-helping-companies-deal-with-bigdata-1455103825
MapReduce
Choosing The Right Platform
Current forecasting suggests that the Big Data technology and services market will grow at beyond 40% per year,
which is over seven times the estimated growth rate for the overall IT and communications business (IDC, 2015). By
the end of this year alone, it is predicted that 75% of all data on the planet will be residing in Hadoop systems.
While Hadoop is applicable to the majority of businesses, it is not the only provider, especially when it comes to
open-source platforms. Once an organisation has decided to leverage its massive data sets, setting up the infrastructure
will not be the biggest challenge.
The biggest challenge may come from deciding to go it alone with an open source provider, or to turn to one of the
commercial implementations of Big Data technology. Providers such as Cloudera, Hortonworks, and MapR are
commercialising Big Data technologies, making them easier to deploy and manage.
Add to that the growing crop of Big Data on-demand services from cloud services providers, and the decision process
becomes that much more complex. Decision makers will have to invest in research and perform due diligence to select
the proper platform and implementation methodology to make a business plan successful.
However, most of that legwork can be done during the business plan development phase, when the pros and cons of the
various Big Data methodologies can be weighed and then measured against the overall goals of the business plan.
Which technology will get us there the fastest, with the lowest cost, and without sacrificing future capabilities?
MapReduce
Moving on from Hadoop
Hadoop and the world of big data have transitioned from being buzzwords and hype to being a reality. Businesses
have realised the value of Big Data and are beginning to understand the use cases for the technologies in the Hadoop
ecosystem.
While the penny may have dropped for some businesses, Hadoop has not yet forged a name for itself as a simple
set of technologies to use, and some remaining confusion about what exactly it is only adds to the complexity.
In 2015, Hewlett-Packard proposed a new kind of computing architecture that focuses on memory, rather than processing
power, as the centre of a computing system. Around the same time, IBM, which has championed Hadoop and put it
at the centre of its big data strategy, announced it is working on a faster data-processing engine, called Spark. Additionally,
a senior executive at Cloudera, probably the largest Hadoop company, hinted that it is prepared to see key parts of
Hadoop diminish in importance and that it is increasingly distributing Spark.
Hadoop is complex technology, but an important thing to remember is that the business doesn't necessarily need to know
how complex it is or how it works. The work done by the Apache Hadoop developer community, along with the large
vendors, has helped to mask the complexities so that businesses can instead focus on putting it to use.
Although the technology is complex, understanding that complexity is not fundamental to being able to use it. The important
elements are knowing your use cases and having best practices in place in order to make sure it benefits the business.
Nunns, 2016
Cassandra
Beyond Hadoop
Another name in the Big Data realm is the Cassandra database, a technology that can store
two million columns in a single row. Cassandra is ideal for appending more data onto existing
user accounts without knowing ahead of time how the data should be formatted. It is an
open-source system managed by Apache.
Cassandra's origins can be traced to Facebook, which needed a massive distributed
database to power the service's inbox search. Facebook wanted to use the Google Bigtable
architecture; however, Bigtable had a serious limitation in that it depended on a single node
to coordinate all read-and-write activities across all of the nodes. This meant that if the head
node went down, the whole system would be useless.
Cassandra was instead built on a distributed architecture called Dynamo. Amazon uses Dynamo to
keep track of what customers are putting in their shopping carts. Dynamo gave Cassandra an
advantage over Bigtable, as Dynamo is not dependent on any one master node. Any node
can accept data for the whole system, as well as answer queries. Data is replicated on
multiple hosts, creating resiliency and eliminating the single point of failure.
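A hedged sketch of those ideas, written against the DataStax Java driver (assuming a 3.x driver and a locally reachable node): any contact node can take the connection, the keyspace declares its own replication factor, and a single user_id partition can keep accepting appended messages. The keyspace, table and data are purely illustrative.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class InboxSketch {
    public static void main(String[] args) {
        // No master node: any reachable node can accept reads and writes.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Replicate every row onto three nodes for resilience.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

            // A 'wide row': one user_id partition accumulates an effectively
            // unbounded list of messages appended over time.
            session.execute("CREATE TABLE IF NOT EXISTS demo.inbox ("
                    + "user_id text, sent_at timestamp, body text, "
                    + "PRIMARY KEY (user_id, sent_at))");

            session.execute("INSERT INTO demo.inbox (user_id, sent_at, body) "
                    + "VALUES ('u42', '2016-02-01 10:00:00+0000', 'hello')");

            ResultSet rows = session.execute(
                    "SELECT sent_at, body FROM demo.inbox WHERE user_id = 'u42'");
            for (Row row : rows) {
                System.out.println(row.getTimestamp("sent_at") + "  " + row.getString("body"));
            }
        }
    }
}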
Google Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements. (Google Inc, 2016)
Ohlhorst, F. J. (2012).
Summary
By now you should have a deeper understanding not only of the way enterprises and large organisations
manage and harness these databases, but also of how they enable the processing and crunching of dynamic and
real-time information.
We looked at the importance of processing, and how distributed data systems work. This is the fundamental
concept behind Hadoop and its file system (HDFS), and we unpacked its various uses in terms of managing large data
sets and the development of storage processes.
From there we delved into MapReduce, a critical processing platform that makes this seemingly infinite data pool
so much more efficient through its Map/Shuffle/Reduce framework.
This module also touched upon the wider Hadoop ecosystem and the various other tools and platforms that exist in
this processing universe. While we did not go into a computation exercise specifically, what is important here is
the application of these tools and the importance of selecting the right analytics systems. Again, we looked at
some of the drivers behind Hadoop adoption in business, as well as some emerging alternatives in the
private marketplace beyond open-source offerings.
These contemporary solutions to Big Data processing problems allow us to imagine a future of further analytic
and informatics possibility – how will these developments assist us with Big Data management, storage and
processing?
Furthermore, we need to consistently evaluate and ask ourselves which technology will get us there the fastest,
with the lowest cost, and without sacrificing future capabilities.

Question Time
Try answering these review questions:
1. What are the key benefits of a distributed data system? How does it differ from a centralised system?
2. Why did Hadoop become so widely accepted? What are its benefits over traditional storage and processing
platforms?
3. Explain how an organisation like Facebook might use Hadoop in its Big Data processing and management.
4. Why is the Map/Shuffle/Reduce process of MapReduce so efficient? Why would this benefit businesses in
terms of data outcomes?
5. Describe two different tools in the Hadoop ecosystem and explain why they might have benefits for two
different industries.
Glossary
Computation and Data
Big Data as an industry is quite dynamic, and is still in a degree of flux… as you have probably ascertained, there are new terms and technologies appearing almost weekly!
This fast-paced environment is fuelled by open-source communities, emerging tech companies, and key industry players such as IBM, SAP and Oracle. The following
should help with some of the terminology!
Analytics
Using math to derive meaning from data.
Analytics Platform
A set of analytic tools and computational power used to query and process data.
Appliance
Optimised hardware and software purpose built for a specific set of activities.
Batch
A job or process that runs in the background without human interaction.
Big Data
The de facto standard definition of big data is data that goes beyond the traditional limits of data along three dimensions: volume, variety, velocity. (Don’t
forget our extra four dimensions!) The combination of these dimensions makes the data more complex
to ingest, process, and visualise.
Cloud
General term used to refer to any computing resources— software, hardware or service—that is delivered as a service over a network.
Columnar Database
The storing and optimising of data by columns. Particularly useful for some analytics processing that uses data based on a column.
Data Mining
The process of discovering patterns, trends, and relationships from data using machine learning.
Distributed Processing
The execution of a program across multiple CPUs.
Grid
Loosely coupled servers networked together to process workloads in parallel.
Minelli, M., Chambers, M., & Dhiraj, A. (2012).

Glossary
Computation and Data
Hadoop
An open-source project framework that can store large unstructured data (HDFS) and processes the unstructured data (MapReduce) in a cluster of computers
(grid).
HDFS
The Hadoop Distributed File System, which is the storage mechanism for Hadoop.
HPC
High-performance computing. Used colloquially to refer to devices designed for high-speed floating point processing, in-memory with some disk parallelisation.
MapReduce
A parallel computational batch processing framework for Hadoop where jobs are mostly written in Java. The job breaks up a larger problem into smaller pieces of work
and distributes the workload across the grid so that jobs can be worked on simultaneously (mapper). A master job (reducer) collects all the interim results and
combines them.
Real-Time
Today it is colloquially defined as immediate processing. Real-time processing originated in 1950s when multi-tasking machines provided the capability to
“interrupt” a task for a higher priority task to be executed. These types of machines powered the space program, military applications, and many commercial
control systems.
Relational Database
The storing and optimising of data by rows and columns.
Semistructured Data
Unstructured data that can be put into a structure by available format descriptions.
SQL (Structured Query Language)
The language for storing, accessing, and manipulating data in a relational database.
Structured Data
Data that has a pre-set format.
Unstructured Data
Data that has no preset structure.
Minelli, M., Chambers, M., & Dhiraj, A. (2012).
References
CNBC (2013) “Big Data Giveaway’s Big Payoff”, accessed online at:
http://video.cnbc.com/gallery/?video=3000165622 [February, 2016]
Google (2016) “BigTable” accessed online at:
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf [February, 2016]
Minelli, M., Chambers, M., & Dhiraj, A. (2012). Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses. John
Wiley & Sons.
Nunns (2016) “How and when to use Apache Hadoop to bring big data value to business”, CBR Online, accessed online at:
http://www.cbronline.com/news/big-data/software/how-and-when-to-use-apache-hadoop-to-bring-big-data-value-to-business-4807141 [February,
2016]
Ohlhorst, F. J. (2012). Big data analytics: turning big data into big money. John Wiley & Sons.
Oracle (2016) “Database Administrator’s Guide”, accessed online at:
https://docs.oracle.com/cd/B10501_01/server.920/a96521/ds_concepts.htm [February, 2016]
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.
Verhoef, P. C., Kooge, E., & Walk, N. (2016). Creating Value with Big Data Analytics: Making Smarter Marketing Decisions. Routledge.

Extra Resources
For the academically-curious
Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the
Hadoop ecosystem. Journal of Big Data, 2(1), 1-36. accessed online at: http://link.springer.com/article/10.1186/s40537-015-0032-1 [January,
2016]
A comprehensive article on the evaluative criteria of some of the Big Data tools – advantages, disadvantages of the various processing
approaches. Quite lengthy, but relevant for a deeper-dive into some of these tools.
Light, easy and fun
Muglia, B (2016) “The Year of Data Walls and Cloud Battles”, Huffington Post, accessed online at:
http://www.huffingtonpost.com/bob-muglia/the-year-of-data-walls-an_b_8962084.html [February, 2016]
A highly recent and quite accessible article on the year that was 2015 and some nice predictions on the Big Data and Hadoop space
for 2016 and beyond.
Hardy, Q (2015) “Companies Move On From Big Data Technology Hadoop”, NY Times, accessed online at:
http://bits.blogs.nytimes.com/2015/06/15/companies-are-moving-on-from-big-data-technology-hadoop/ [January, 2016]
If you are interested in future tech and directions of this space, this furthers the discussion around the current state of the
Hadoop market and is a nice overview of what some of the big key players are doing in the technology space from 2015 onwards.
Wiggins, C (2015) “Data Science At The New York Times”, video accessed online at:
https://www.youtube.com/watch?v=MNosMXFGtBE [January, 2016]
An excellent wide-ranging keynote by Chris Wiggins on how the media space has been transformed by data, specifically Hadoop
applications. I love the tone, and some of the very interesting ways in which data is mined and monetised for readership. Fascinating!