Tom Klimovski
Tom's posts

Post has attachment
Learn how to machine learn

Post has attachment
Join us in 2015 – Mark your Calendar

Mark your calendar for April 15-16, 2015 for Hadoop Summit Europe in Brussels, Belgium, or June 9-11, 2015 for Hadoop Summit North America in San Jose, CA. Or block the whole week and attend the pre-conference activities. The call for abstracts for Hadoop Summit Europe will open in the next few days.

Post has attachment
Hortonworks webinars can be quite useful. Here's the next one coming up:

Create a Smarter Data Lake with HP Haven and Apache Hadoop
1:00 PM Eastern / 10:00 AM Pacific
The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization. Join this webinar and learn how you can:
1/ Leverage 100% of your data: text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven and Hortonworks Data Platform.
2/ Democratize and enable multi-dimensional content analysis, and empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease.
3/ Extend the enterprise data warehouse to synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
4/ Dramatically reduce complexity with an enterprise-ready SQL engine.

Post has attachment
Why use Apache Spark?

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.

Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.

Fast and Easy Big Data Processing with Spark

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch processing.

In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. As illustrated in the figure below, this feature enables:

Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds, and
Efficient iterative algorithms, by having subsequent iterations share data through memory or repeatedly access the same dataset
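
To make the two ideas above concrete (composing operators, then caching the working dataset for repeated passes), here is a minimal plain-Python sketch. It is an analogy only and does not use the Spark API: the `Dataset` class, its methods, and `expensive_load` are hypothetical names invented for this illustration, and everything runs in a single process.

```python
class Dataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing the records
        self._cache_enabled = False
        self._cached = None            # holds the records once they have been computed

    def map(self, fn):
        # Operators are lazy: nothing runs until collect() is called.
        return Dataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return Dataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Ask for the computed records to be kept in memory after first use.
        self._cache_enabled = True
        return self

    def collect(self):
        if not self._cache_enabled:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached


loads = {"count": 0}

def expensive_load():
    """Stands in for a costly read from disk; counts how often it runs."""
    loads["count"] += 1
    return list(range(1, 11))


# Compose operators, then keep the resulting working set in memory.
points = (
    Dataset(expensive_load)
    .map(lambda x: x * 2)
    .filter(lambda x: x % 4 == 0)
    .cache()
)

# Iterative computation: every pass reuses the cached working set.
estimate = 0.0
for _ in range(5):
    data = points.collect()
    estimate = 0.5 * estimate + 0.5 * (sum(data) / len(data))
```

With the `cache()` call removed, `expensive_load` would run once per iteration, five times in total, which is the repeated disk access the excerpt says Spark lets applications avoid.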

Post has attachment
Solid article: Learning How to Learn Hadoop:

Architecturally, Hadoop is just the combination of two technologies: the Hadoop Distributed File System (HDFS) 
that provides storage, and the MapReduce programming model, which provides processing[1] [2].
HDFS exists to split, distribute, and manage chunks of the overall data set, which could be a single file or a 
directory full of files. These chunks of data are pre-loaded onto the worker nodes, which later process them in 
the MapReduce phase. By having the data local at process time, HDFS saves all of the headache and inefficiency 
of shuffling data back and forth across the network.
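
The split-then-process flow described above can be sketched in a few lines of plain Python. This is a simplified single-process illustration, not Hadoop code: the function names and the sample lines are made up for the example, chunks are ordinary lists rather than HDFS blocks, and the shuffle step that real MapReduce performs between map and reduce is left out.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Split the data set into fixed-size chunks, as the storage layer would."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map phase: a worker counts the words in the chunk it holds locally."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce phase: merge two partial word counts into one."""
    return left + right


# Made-up sample data for the illustration.
lines = [
    "hadoop stores data in hdfs",
    "mapreduce processes data",
    "hdfs splits data into chunks",
]

chunks = split_into_chunks(lines, chunk_size=2)
partials = [map_chunk(chunk) for chunk in chunks]   # one partial result per chunk
totals = reduce(reduce_counts, partials, Counter())
```

Each call to `map_chunk` touches only its own chunk, which mirrors the point that processing happens where the data already lives.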

Post has attachment
How-to: Process Time-Series Data Using Apache Crunch, by Jeremy Beard, May

Did you know that using the Crunch API is a powerful option for doing time-series analysis? Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) Developers can spend more time focused on their use case by using the Crunch API to handle common tasks such as joining data sets and chaining jobs together in a pipeline. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)

Furthermore, Crunch is a really good option for transforming and analyzing time-series data. In this post, I will provide a simple example for bootstrapping with Crunch for that use case.

Post has attachment
In the previous article we described how to collect WiFi router logs with Flume to store in HDFS. This article will describe how we did the transformation, parsing, filtering and finally loading into Hive’s data warehouse.

Let’s start by looking at the raw data sample on HDFS.

Jean-Pierre König
Head of Big Data & Analytics - YMC AG
