Create a Smarter Data Lake with HP Haven and Apache Hadoop
1:00 PM Eastern / 10:00 AM Pacific
The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open, standards-based platform for deep analysis and data monetization. Join this webinar and learn how you can:
1/ Leverage 100% of your data: text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven and the Hortonworks Data Platform.
2/ Democratize multi-dimensional content analysis, empowering your analysts, business users, and data scientists to search and analyze Hadoop data with ease.
3/ Extend the enterprise data warehouse to synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
4/ Dramatically reduce complexity with an enterprise-ready SQL engine.
Architecturally, Hadoop is just the combination of two technologies: the Hadoop Distributed File System (HDFS), which provides storage, and the MapReduce programming model, which provides processing.
HDFS exists to split, distribute, and manage chunks of the overall data set, which could be a single file or a
directory full of files. These chunks of data are pre-loaded onto the worker nodes, which later process them in
the MapReduce phase. By having the data local at process time, HDFS saves all of the headache and inefficiency
of shuffling data back and forth across the network.
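To make that division of labor concrete, the canonical word-count job below shows both halves working together: HDFS hands each mapper a local block of the input, and MapReduce runs the map and reduce functions over those blocks. This is a minimal sketch in the style of the standard Hadoop tutorial example; the input and output paths come from the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the worker node holding the local HDFS block,
  // emitting (word, 1) for every token in its split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the per-word counts produced by all the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the job never copies the input to the workers itself: the framework schedules each map task on a node that already holds that task's block, which is exactly the data locality described above.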
Mark your calendar for April 15-16, 2015, for Hadoop Summit Europe in Brussels, Belgium, or June 9-11, 2015, for Hadoop Summit North America in San Jose, CA. Or block out the whole week and attend the pre-conference activities. The call for abstracts for Hadoop Summit Europe will open in the next few days.
Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.
Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.
Fast and Easy Big Data Processing with Spark
At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch processing.
In addition, Spark keeps track of the data that each of its operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. This feature enables (see the sketch after this list):
- Low-latency computations, by caching the working dataset in memory and then performing computations at memory speed; and
- Efficient iterative algorithms, by having subsequent iterations share data through memory, or repeatedly access the same dataset.
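As a concrete illustration of in-memory caching, the sketch below loads a dataset once, caches it, and then makes two passes over it; the second pass reads from memory rather than from HDFS. The hdfs:///data/logs path and the ERROR/WARN filters are hypothetical, and the lambda syntax assumes Spark's Java 8 API.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CacheExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the dataset once and mark it for in-memory caching
    // (hypothetical path).
    JavaRDD<String> logs = sc.textFile("hdfs:///data/logs").cache();

    // The first action materializes the RDD in memory.
    long errors = logs.filter(line -> line.contains("ERROR")).count();

    // The second pass reuses the cached data instead of re-reading HDFS.
    long warnings = logs.filter(line -> line.contains("WARN")).count();

    System.out.println(errors + " errors, " + warnings + " warnings");
    sc.stop();
  }
}
```

Without the cache() call, each count() would re-read and re-parse the file from HDFS; with it, the first action pays the disk cost once and every subsequent pass runs at memory speed, which is precisely what makes iterative workloads cheap on Spark.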
Did you know that the Crunch API is a powerful option for doing time-series analysis? Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) By letting the Crunch API handle common tasks such as joining data sets and chaining jobs together in a pipeline, developers can spend more time focused on their use case. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)
Furthermore, Crunch is a really good option for transforming and analyzing time-series data. In this post, I will provide a simple example of bootstrapping with Crunch for that use case.
Let’s start by looking at the raw data sample on HDFS.
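The sample data itself is not reproduced here, so for illustration assume newline-delimited records of the hypothetical form seriesId,timestamp,value. A minimal Crunch pipeline sketch under that assumption (the paths, field layout, and max-per-series aggregation are all placeholders, not the original post's code) might look like this:

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class TimeSeriesCrunch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(TimeSeriesCrunch.class);

    // Hypothetical input: one "seriesId,timestamp,value" record per line.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Parse each line into a (seriesId, value) pair.
    PTable<String, Double> values = lines.parallelDo(
        new MapFn<String, Pair<String, Double>>() {
          @Override
          public Pair<String, Double> map(String line) {
            String[] fields = line.split(",");
            return Pair.of(fields[0], Double.parseDouble(fields[2]));
          }
        },
        Writables.tableOf(Writables.strings(), Writables.doubles()));

    // Group by series and keep the maximum observed value per series.
    PTable<String, Double> maxPerSeries =
        values.groupByKey().combineValues(Aggregators.MAX_DOUBLES());

    pipeline.writeTextFile(maxPerSeries, args[1]);
    pipeline.done();
  }
}
```

Because Crunch plans the parallelDo, groupByKey, and combineValues stages into MapReduce jobs for you, the parsing and aggregation logic stays in plain Java rather than hand-written Mapper and Reducer classes.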
Head of Big Data & Analytics - YMC AG
- EPAM Systems, Consultant, 2012 - present
- ThoughtCorp, Consultant, 2012
- C3I.T. Consulting, 2012
Color Me SASS - Colour library for the CSS preprocessor SASS
Colour library for SASS and LESS
Rubular: a Ruby regular expression editor and tester
Rubular is a Ruby-based regular expression editor and tester. It's a handy way to test regular expressions as you write them.
The UI Grail – jQuery UI Sliders on a Crosstab « Nic Bertino.
You'll have to stay with me and really be a sport until I can get a virtual machine going at home with a COGNOS training install on it.