The Stream Processing Model Behind Google Cloud Dataflow | by Vu Trinh | Apr, 2024

Data processing frameworks such as MapReduce and related tools like Hadoop, Pig, Hive, and Spark let users process large amounts of batch data, while stream processing tools like MillWheel, Spark Streaming, and Storm support real-time data processing. However, these existing models do not always meet the requirements of some common scenarios.

For instance, a streaming video provider that relies on advertising revenue needs to bill advertisers based on how much of their advertising was watched. The provider also needs aggregated statistics about videos, ads, and viewer demographics, and wants to run offline experiments on historical data. The processing system behind all of this must be fast, adaptable, and capable of handling global-scale data. Here are some key observations from Google regarding data processing systems at that time:

The main drawback of existing systems is the assumption that input data will eventually be complete, which often does not hold for today’s massive, out-of-order datasets. To address diverse real-time workloads, the paper proposes a unified stream processing model that keeps the programming model simple while letting users balance correctness, latency, and cost for their specific use case. The model breaks a pipeline down along four dimensions:

– What results are being computed.
– Where in event time they are being computed.
– When in processing time they are materialized.
– How earlier results relate to later refinements.
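
The four questions can be made concrete with a small sketch. The plain-Python example below is only an illustration under stated assumptions, not the paper's implementation or the Dataflow API: the names `window_start`, `process`, `WINDOW_SIZE`, and `trigger_every` are hypothetical, the "what" is a per-window sum, the "where" is a fixed 60-second event-time window, the "when" is a trigger that fires after every N arriving elements, and the "how" is accumulation, where each new result for a window replaces the earlier one.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window width


def window_start(event_time, size=WINDOW_SIZE):
    """Where: map an event timestamp to the start of its fixed event-time window."""
    return event_time - (event_time % size)


def process(events, trigger_every=2):
    """Consume (event_time, value) pairs in arrival order (processing time).

    Yields (window_start, running_sum) results. A window may be emitted more
    than once; each later result refines the earlier one.
    """
    sums = defaultdict(int)
    dirty = set()  # windows that changed since the last emission
    for count, (event_time, value) in enumerate(events, start=1):
        window = window_start(event_time)
        sums[window] += value          # What: a sum per window
        dirty.add(window)
        if count % trigger_every == 0:  # When: fire after every N elements
            for w in sorted(dirty):
                yield (w, sums[w])      # How: accumulate, later results replace earlier ones
            dirty.clear()
    # Flush whatever is left when a bounded input ends.
    for w in sorted(dirty):
        yield (w, sums[w])


# Events arrive out of order with respect to their event time.
arrivals = [(5, 1), (70, 2), (10, 3), (65, 4), (20, 5)]
results = list(process(arrivals))
```

In this run the window starting at 0 is emitted three times, with sums 1, 4, and 9, as late-arriving events refine it. That is the situation the "how" question addresses.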

The paper discusses how Google implements this unified stream processing model, emphasizing that the model provides a framework for parallel computation without being tied to a specific execution engine like Spark or Flink.

The authors describe infinite and finite data as unbounded and bounded rather than streaming and batch, because the latter pair implies a particular execution engine. Unbounded data has no predefined boundary, while bounded data has clear start and end boundaries.
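
As a rough illustration of why the bounded/unbounded distinction describes the data rather than the engine, the sketch below applies the same logic to both kinds of input. It is a minimal example with hypothetical names (`running_totals`, `bounded`, `unbounded`): a Python list stands in for bounded data and an endless generator stands in for unbounded data.

```python
import itertools


def running_totals(source):
    """Yield the running sum of any iterable, whether finite or infinite."""
    total = 0
    for value in source:
        total += value
        yield total


# Bounded data: a finite collection with a clear start and end.
bounded = [3, 1, 4]
bounded_result = list(running_totals(bounded))

# Unbounded data: an endless source (1, 2, 3, ...). It can never be fully
# consumed, so we only look at the first five results.
unbounded = itertools.count(start=1)
first_five = list(itertools.islice(running_totals(unbounded), 5))
```

The processing logic is identical in both cases. Only the consumer changes: the bounded input can be drained completely, while the unbounded one has to be observed incrementally.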
