Saturday, May 17, 2025
News PouroverAI
Building a Data Platform in 2024. How to build a modern, scalable data… | by Dave Melillo | Feb, 2024

February 10, 2024
in Data Science & ML
Reading Time: 6 mins read


How to build a modern, scalable data platform to power your analytics and data science projects (updated)


What’s changed?

Since 2021, maybe a better question is what HASN’T changed?

Stepping out of the shadow of COVID, our society has grappled with a myriad of challenges — political and social turbulence, fluctuating financial landscapes, the surge in AI advancements, and Taylor Swift emerging as the biggest star in the … *checks notes* … National Football League!?!

Over the last three years, my life has changed as well. I’ve navigated the data challenges of various industries, lending my expertise through work and consultancy at both large corporations and nimble startups.

Simultaneously, I’ve dedicated substantial effort to shaping my identity as a Data Educator, collaborating with some of the most renowned companies and prestigious universities globally.

As a result, here’s a short list of what inspired me to write an amendment to my original 2021 article:

  • Companies, big and small, are starting to reach levels of data scale previously reserved for Netflix, Uber, Spotify and other giants creating unique services with data. Simply cobbling together data pipelines and cron jobs across various applications no longer works, so there are new considerations when discussing data platforms at scale.
  • Although I briefly mentioned streaming in my 2021 article, you’ll see a renewed focus in the 2024 version. I’m a strong believer that data has to move at the speed of business, and the only way to truly accomplish this in modern times is through data streaming.
  • I mentioned modularity as a core concept of building a modern data platform in my 2021 article, but I failed to emphasize the importance of data orchestration. This time around, I have a whole section dedicated to orchestration and why it has emerged as a natural complement to a modern data stack.

The Platform

To my surprise, there is still no single vendor solution that covers the entire data landscape, although Snowflake has been trying its best through acquisition and development efforts (Snowpipe, Snowpark). Databricks has also made notable improvements to its platform, specifically in the ML/AI space.

All of the components from the 2021 article made the cut in 2024, but even the familiar entries look a little different three years later:

  • Source
  • Integration
  • Data Store
  • Transformation
  • Orchestration
  • Presentation
  • Transportation
  • Observability

Integration

The integration category gets the biggest upgrade in 2024, splitting into three logical subcategories:

  • Batch
  • The ability to process incoming data signals from various sources at a daily/hourly interval is the bread and butter of any data platform.

    Fivetran still seems like the clear leader in the managed ETL category, but it faces stiff competition from up-and-comers like Airbyte and from big cloud providers that have been strengthening their platform offerings.

    Over the past 3 years, Fivetran has improved its core offering significantly, extended its connector library and even started to branch out into light orchestration with features like their dbt integration.

    It’s also worth mentioning that many vendors, such as Fivetran, have merged the best of OSS and venture capital funding into something called product-led growth, offering free tiers that lower the barrier to entry into enterprise-grade platforms.

    Even if the problems you are solving require many custom source integrations, it makes sense to use a managed ETL provider for the bulk and custom Python code for the rest, all held together by orchestration.

  • Streaming
  • Kafka/Confluent is king when it comes to data streaming, but working with streaming data introduces a number of new considerations beyond topics, producers, consumers, and brokers, such as serialization, schema registries, stream processing/transformation and streaming analytics.

    Confluent is doing a good job of aggregating all of the components required for successful data streaming under one roof, but I’ll be pointing out streaming considerations throughout other layers of the data platform.

    The introduction of data streaming doesn’t inherently demand a complete overhaul of the data platform’s structure. In truth, the synergy between batch and streaming pipelines is essential for tackling the diverse challenges posed to your data platform at scale. The key to seamlessly addressing these challenges lies, unsurprisingly, in data orchestration.

  • Eventing
  • In many cases, the data platform itself needs to be responsible for, or at the very least inform, the generation of first party data. Many could argue that this is a job for software engineers and app developers, but I see a synergistic opportunity in allowing the people who build your data platform to also be responsible for your eventing strategy.

    I break down eventing into two categories:

    • Change Data Capture – CDC
    • The basic gist of CDC is using your database’s CRUD commands as a stream of data itself. The first CDC platform I came across was an OSS project called Debezium and there are many players, big and small, vying for space in this emerging category.

    • Click Streams – Segment/Snowplow
    • Building telemetry to capture customer activity on websites or applications is what I am referring to as click streams. Segment rode the click stream wave to a billion dollar acquisition, Amplitude built click streams into an entire analytical platform and Snowplow has been surging more recently with their OSS approach, demonstrating that this space is ripe for continued innovation and eventual standardization.

    AWS has been a leader in data streaming, offering templates to establish the outbox pattern and building data streaming products such as MSK, SQS, SNS, Lambdas, DynamoDB and more.
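
To make the CDC idea above concrete, here is a minimal Python sketch that treats a list of change events as a stream and replays it to rebuild a table's current state. The event shape (`op`, `key`, `row`) and the sample data are hypothetical simplifications for illustration, not the message format of Debezium or any other CDC tool.

```python
from typing import Any

# Hypothetical, simplified change events; real CDC tools emit richer envelopes.
CHANGE_STREAM = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2, "row": None},
]


def replay(events: list[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Apply change events in order and return the resulting table state."""
    table: dict[int, dict[str, Any]] = {}
    for event in events:
        if event["op"] in ("insert", "update"):
            # Inserts and updates both carry the full new row in this sketch.
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return table


current_state = replay(CHANGE_STREAM)
print(current_state)
```

Because every downstream consumer can replay the same ordered events, each one can maintain its own copy of the table without querying the source database directly.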

Data Store

Another significant change from 2021 to 2024 lies in the shift from “Data Warehouse” to “Data Store,” acknowledging the expanding database horizon, including the rise of Data Lakes.

Viewing Data Lakes as a strategy rather than a product emphasizes their role as a staging area for structured and unstructured data, potentially interacting with Data Warehouses. Selecting the right data store solution for each aspect of the Data Lake is crucial, but the overarching technology decision involves tying together and exploring these stores to transform raw data into downstream insights.

Distributed SQL engines like Presto and Trino, along with their managed counterparts (Pandio, Starburst), have emerged to traverse Data Lakes, enabling users to join diverse data across various physical locations with SQL.
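
As a rough analogy for what these engines offer, the sketch below uses Python's built-in `sqlite3` module to attach a second database and join across both in one statement. This is only a small stand-in: SQLite does not distribute work across a cluster or connect to external catalogs the way Presto or Trino do, and the table names and rows are made up for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two physical data stores.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# One SQL statement joins tables that live in different databases.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

The point is the user experience: the analyst writes one query and the engine worries about where each table physically lives.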

Amid the rush to keep up with generative AI and Large Language Model trends, specialized data stores like vector databases have become essential. These include open-source options like Weaviate, managed solutions like Pinecone and many more.
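
To show the core operation a vector database serves, here is a minimal brute-force similarity search in plain Python. The three-dimensional "embeddings" and document names are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and products like Weaviate or Pinecone typically rely on approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math

# Toy embeddings, made up for this sketch.
DOCUMENTS = {
    "invoice policy": [0.9, 0.1, 0.0],
    "refund process": [0.8, 0.3, 0.1],
    "office party": [0.0, 0.2, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], k: int = 2) -> list[str]:
    """Return the k document names most similar to the query vector."""
    scored = [
        (cosine_similarity(query, vector), name)
        for name, vector in DOCUMENTS.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


print(top_k([1.0, 0.2, 0.0]))
```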

Transformation

Few tools have revolutionized data engineering like dbt. Its impact has been so profound that it’s given rise to a new data role — the analytics engineer.

dbt has become the go-to choice for organizations of all sizes seeking to automate transformations across their data platform. The availability of dbt Core, the open-source version of dbt, has played a pivotal role in familiarizing data engineers and analysts with dbt, hastening its adoption, and fueling the swift development of new features.

Among these features, dbt Mesh stands out as particularly impressive. It lets multiple dbt projects link to and reference one another, so organizations can modularize their data transformation pipelines and meet the challenges of data transformations at scale.

Stream transformations represent a less mature area in comparison. Although there are established and reliable open-source projects like Flink, which has been in existence since 2011, their impact hasn’t resonated as strongly as tools dealing with “at rest” data, such as dbt. However, with the increasing accessibility of streaming data and the ongoing evolution of computing resources, there’s a growing imperative to advance the stream transformations space.

In my view, the future of widespread adoption in this domain depends on technologies like Flink SQL or emerging managed services from providers like Confluent, Decodable, Ververica, and Aiven. These solutions empower analysts to leverage a familiar language, such as SQL, and apply those concepts to real-time, streaming data.
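
For readers new to stream transformations, the sketch below shows a tumbling-window count, the kind of aggregation a SQL-based stream processor lets you express declaratively. It is a simplification under stated assumptions: the events, the 60-second window size, and the function name are hypothetical, and the code processes a finished list rather than an unbounded stream, so it ignores late-arriving events and watermarks.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size

# (event_time_in_seconds, page) pairs, made up for this sketch.
EVENTS = [(5, "home"), (42, "pricing"), (61, "home"), (75, "home"), (130, "pricing")]


def tumbling_window_counts(
    events: list[tuple[int, str]], window_seconds: int
) -> dict[tuple[int, str], int]:
    """Count events per page within fixed, non-overlapping time windows."""
    counts: dict[tuple[int, str], int] = defaultdict(int)
    for event_time, page in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, page)] += 1
    return dict(counts)


result = tumbling_window_counts(EVENTS, WINDOW_SECONDS)
for (window_start, page), count in sorted(result.items()):
    print(window_start, page, count)
```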

Orchestration

Reviewing the Integration, Data Store, and Transformation components of constructing a data platform in 2024 highlights the daunting challenge of choosing between a multitude of tools, technologies, and solutions.

From my experience, the key to finding the right iteration for your scenario is through experimentation, allowing you to swap out different components until you achieve the desired outcome.

Data orchestration has become crucial in facilitating this experimentation during the initial phases of building a data platform. It not only streamlines the process but also offers scalable options to align with the trajectory of any business.

Orchestration is commonly executed through Directed Acyclic Graphs (DAGs) or code that structures hierarchies, dependencies, and pipelines of tasks across multiple systems. Simultaneously, it manages and scales the resources utilized to run these tasks.
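
The DAG idea can be sketched with Python's standard library alone. The example below is not how Airflow or any other orchestrator is implemented; it only shows how declaring dependencies lets a scheduler derive a valid execution order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_warehouse": {"extract_orders", "extract_customers"},
    "transform_models": {"load_warehouse"},
    "refresh_dashboard": {"transform_models"},
}


def run_pipeline(dag: dict[str, set[str]]) -> list[str]:
    """Visit tasks in dependency order and return the execution order."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        # A real orchestrator would trigger the task and handle retries here.
        executed.append(task)
    return executed


order = run_pipeline(PIPELINE)
print(order)
```

Swapping a component, for example replacing one extract task with another tool, only changes a node in the graph; the dependency structure and everything downstream stay the same. If the declared dependencies contain a cycle, `graphlib` raises a `CycleError`, which is why the graph has to be acyclic.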

Airflow remains the go-to solution for data orchestration, available in various managed flavors such as MWAA,…



Copyright © 2023 PouroverAI News.