I am thrilled to announce a new feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that allows you to continuously load data from an Apache Kafka cluster to Amazon Simple Storage Service (Amazon S3). This feature uses Amazon Kinesis Data Firehose, an extract, transform, and load (ETL) service, to read data from a Kafka topic, transform the records, and write them to an Amazon S3 destination. You can configure Kinesis Data Firehose entirely in the console, with no code to write and no infrastructure to manage.
Kafka is commonly used for building real-time data pipelines that efficiently move large amounts of data between systems or applications. Many AWS customers have adopted Kafka to capture streaming data such as clickstream events, transactions, IoT events, and application and machine logs. These customers need real-time analytics, continuous transformations, and real-time distribution of this data to data lakes and databases.
However, deploying Kafka clusters comes with its own challenges. The first challenge is setting up, configuring, and maintaining the Kafka cluster itself. To address this, we released Amazon MSK in May 2019, which simplifies the process of setting up, scaling, and managing Apache Kafka in production. With MSK, you can focus on your data and applications while we take care of the infrastructure.
The second challenge is writing, deploying, and managing application code that consumes data from Kafka. This typically involves coding connectors using the Kafka Connect framework and then managing a scalable infrastructure to run these connectors. Additionally, you need to handle data transformation, compression, error management, and retry logic to ensure data integrity during the transfer out of Kafka.
Today, we are excited to announce a fully managed solution that enables you to deliver data from Amazon MSK to Amazon S3 using Amazon Kinesis Data Firehose. This solution is serverless, requiring no server infrastructure management or code development. You can configure data transformation and error-handling logic with just a few clicks in the console.
The architecture of this solution is simple: Amazon MSK serves as the data source, Amazon S3 acts as the data destination, and Amazon Kinesis Data Firehose manages the data transfer logic between them.
With this new capability, you no longer need to develop code to read data from Amazon MSK, transform it, and write the resulting records to Amazon S3. Kinesis Data Firehose handles the reading, transformation, compression, and write operations to Amazon S3. It also manages error handling and retries in case of any issues. Records that cannot be processed are delivered to the S3 bucket of your choice for manual inspection. The system automatically scales out and scales in to handle the volume of data without any provisioning or maintenance operations required on your end.
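As an illustration, the destination for these unprocessable records is controlled by a prefix in the delivery stream's S3 destination configuration. The fragment below is a minimal sketch with hypothetical bucket and prefix names; successfully delivered data lands under Prefix, while failed records land under ErrorOutputPrefix.

```python
# Hypothetical fragment of an S3 destination configuration for a delivery stream.
# The bucket ARN and prefixes are placeholders.
s3_destination_fragment = {
    "BucketARN": "arn:aws:s3:::example-firehose-destination",
    "Prefix": "msk-events/",  # successfully delivered records
    # Records that cannot be processed are written here, grouped by failure type
    # for manual inspection.
    "ErrorOutputPrefix": "msk-errors/!{firehose:error-output-type}/",
}
```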
Kinesis Data Firehose delivery streams support both public and private Amazon MSK provisioned or serverless clusters. They also support cross-account connections to read from an MSK cluster and write to S3 buckets in different AWS accounts. The delivery stream reads data from your MSK cluster, buffers it based on configurable thresholds, and then writes the buffered data to Amazon S3 as a single file. While MSK and Data Firehose must be in the same AWS Region, Data Firehose can deliver data to Amazon S3 buckets in other Regions.
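The buffering thresholds are exposed as buffering hints on the S3 destination. The sketch below, which assumes the boto3 Firehose client and a placeholder stream name, updates an existing delivery stream; the size and interval values are illustrative, and Firehose writes a file whenever either threshold is reached.

```python
import boto3

firehose = boto3.client("firehose")

# Illustrative buffering thresholds for an existing delivery stream.
# The version and destination IDs come from describe_delivery_stream in practice.
firehose.update_destination(
    DeliveryStreamName="msk-to-s3-stream",
    CurrentDeliveryStreamVersionId="1",
    DestinationId="destinationId-000000000001",
    ExtendedS3DestinationUpdate={
        "BufferingHints": {
            "SizeInMBs": 64,           # write a file after 64 MB is buffered...
            "IntervalInSeconds": 300,  # ...or after 5 minutes, whichever comes first
        }
    },
)
```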
Kinesis Data Firehose delivery streams offer support for data type conversions. Built-in transformations are available to convert JSON data to Apache Parquet and Apache ORC formats. These columnar data formats optimize storage space and enable faster queries on Amazon S3. For non-JSON data, you can use AWS Lambda to transform input formats such as CSV, XML, or structured text into JSON before converting the data to Apache Parquet/ORC. Additionally, you can specify data compression formats such as GZIP, ZIP, and SNAPPY before delivering the data to Amazon S3, or you can deliver the data in its raw form.
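As a sketch of the Lambda transformation path for non-JSON input, the handler below converts simple comma-separated records to JSON using the standard Firehose transformation interface (base64-encoded data in; recordId, result, and data out). The column names are hypothetical, and real input would warrant more robust parsing.

```python
import base64
import json

# Hypothetical column layout of the incoming CSV records.
COLUMNS = ["event_time", "user_id", "action"]

def lambda_handler(event, context):
    """Firehose data-transformation handler: CSV records in, JSON records out."""
    output = []
    for record in event["records"]:
        try:
            line = base64.b64decode(record["data"]).decode("utf-8").strip()
            payload = dict(zip(COLUMNS, line.split(",")))
            data_out = json.dumps(payload) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data_out.encode("utf-8")).decode("utf-8"),
            })
        except Exception:
            # Mark the record as failed so Firehose routes it to the error output.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```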
To get started, you can use an AWS account with an existing Amazon MSK cluster and applications streaming data to it. You can create and configure the data delivery stream using the AWS Management Console, AWS CLI, AWS SDKs, AWS CloudFormation, or Terraform. Simply navigate to the Amazon Kinesis Data Firehose page in the console and choose “Create delivery stream”. Select Amazon MSK as the data source, Amazon S3 as the delivery destination, and configure the required parameters. Once the delivery stream is created, you can see the data appearing in the chosen destination format in your S3 bucket.
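If you prefer the SDK to the console, the sketch below creates an equivalent delivery stream with boto3. The cluster ARN, topic name, IAM roles, and bucket are placeholders for resources that must already exist, and the roles must allow Kinesis Data Firehose to read from the MSK cluster and write to the bucket.

```python
import boto3

firehose = boto3.client("firehose")

# Minimal sketch of an MSK-to-S3 delivery stream; all ARNs and names are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="msk-to-s3-stream",
    DeliveryStreamType="MSKAsSource",
    MSKSourceConfiguration={
        "MSKClusterARN": "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/abc-123",
        "TopicName": "clickstream-events",
        "AuthenticationConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-msk-source-role",
            "Connectivity": "PRIVATE",  # or "PUBLIC" for a publicly accessible cluster
        },
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-delivery-role",
        "BucketARN": "arn:aws:s3:::example-firehose-destination",
        "Prefix": "msk-events/",
        "CompressionFormat": "GZIP",  # optional; uncompressed delivery is the default
    },
)
```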
This new capability is available in all AWS Regions where Amazon MSK and Kinesis Data Firehose are available. You are billed per GB for the volume of data delivered out of Amazon MSK, based on the exact record size with no rounding. Detailed pricing information can be found on the pricing page.
We look forward to seeing how much infrastructure and code you can retire by adopting this new capability. Start configuring your first delivery stream between Amazon MSK and Amazon S3 today.