Viswa February 2016

What is Apache Beam?

I was going through the Apache posts and came across a new term called Beam. Can anybody explain what exactly Apache Beam is? I tried to Google it but was unable to find a clear answer.

Answers


Frances February 2016

Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
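To give a feel for what that looks like in practice, here is a rough sketch of a tiny pipeline written with the Beam Java SDK. The class name, input file, and output prefix are just placeholders, and the org.apache.beam package names reflect the Apache SDK rather than the original com.google.cloud.dataflow one; the point is only that the pipeline itself is runner-agnostic.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        // The runner (DirectRunner, DataflowRunner, SparkRunner, FlinkRunner, ...) is
        // chosen through pipeline options, not in code: the pipeline stays the same.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))        // read text lines
         .apply("SplitWords", FlatMapElements
             .into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("[^a-zA-Z']+"))))  // one element per word
         .apply("CountWords", Count.perElement())                    // word -> occurrence count
         .apply("FormatResults", MapElements
             .into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteCounts", TextIO.write().to("wordcounts"));     // write output shards

        p.run().waitUntilFinish();
      }
    }

Switching execution engines is then roughly a matter of passing a different --runner option and putting that runner's dependency on the classpath.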

The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model” and was first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform. Others in the community began writing extensions, including a Spark Runner, Flink Runner, and Scala SDK.

In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing).

We're currently working hard to get the Beam site up and running over the next couple of weeks, but in the meantime you can learn more about the Beam Model (though still under its original name, Dataflow) in the World Beyond Batch: Streaming 101 and Streaming 102 posts on O'Reilly's Radar site.


nealmcb March 2016

Apache Beam (Batch + strEAM) is a model and set of APIs for doing both batch and streaming data processing. It was open-sourced by Google (with Cloudera and PayPal) in 2016 via an Apache incubator project.

The page Dataflow/Beam & Spark: A Programming Model Comparison - Cloud Dataflow contrasts the Beam API with Apache Spark, which has been hugely successful at bringing a modern, flexible API and set of optimization techniques for both batch and streaming to the Hadoop world and beyond.

Beam tries to take all that a step further, via a model that makes it easy to describe the various aspects of out-of-order processing that are often an issue when combining batch and streaming processing, as described in that Programming Model Comparison.

In particular, to quote from the comparison, the Dataflow model is designed to address, elegantly and in a way that is more modular, robust, and easier to maintain:

... the four critical questions all data processing practitioners must attempt to answer when building their pipelines:

  • What results are calculated? Sums, joins, histograms, machine learning models?
  • Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated in fixed windows, sessions, or a single global window?
  • When in processing time are results materialized? Does the time each event is observed within the system affect results? When are results emitted? Speculatively, as data evolve? When data arrive late and results must be revised? Some combination of these?
  • How do refinements of results relate? If additional data arrive and results change, are earlier results discarded, accumulated into the new ones, or accumulated and then retracted?
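To make those four questions concrete, here is a hedged sketch of how they surface in the Beam/Dataflow Java API. The input collection, key and value types, and the specific durations are made up for illustration; the windowing, triggering, allowed-lateness, and accumulation-mode calls are the knobs the model exposes for Where, When, and How, while the aggregation itself answers What.

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class FourQuestions {
      static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Long>> input) {
        return input
            // WHERE in event time: bucket elements into fixed one-minute windows,
            // based on when each event actually occurred.
            .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
                // WHEN in processing time: emit speculative (early) results every 30 seconds,
                // an on-time result when the watermark passes the end of the window,
                // and revised results for late data that still arrives.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30)))
                    .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
                .withAllowedLateness(Duration.standardDays(1))
                // HOW refinements relate: each new firing accumulates over the previous one
                // (the alternative, discardingFiredPanes(), emits only the delta).
                .accumulatingFiredPanes())
            // WHAT is computed: a per-key sum within each window.
            .apply(Sum.longsPerKey());
      }
    }

Classic batch processing then falls out as a degenerate case of the same model: a single global window, triggered once when the input is complete.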
