Sunday, May 18, 2025
News PouroverAI

How Much Data Do We Need? Balancing Machine Learning with Security Considerations | by Stephanie Kirmer | Dec, 2023

December 15, 2023
in Data Science & ML
Reading Time: 5 mins read



For a data scientist, there’s no such thing as too much data. But when we take a broader look at the organizational context, we have to balance our goals with other considerations.

Photo by Trnava University on Unsplash

Acquiring and keeping data is the focus of a huge amount of our mental energy as data scientists. If you ask a data scientist “Can we solve this problem?” the first question most of us will ask is “Do you have data?” followed by “How much data do you have?” We want to collect data because it is the prerequisite for most of the work we want to do to produce valuable models and beneficial results. We love to dig around in that data, learn what is really in there and what it means, find out how it was generated or collected, and draw generalizable conclusions from it.

Taking a hard look at data privacy puts our habits and choices in a different context, however. Data scientists’ instincts and desires often work in tension with the needs of data privacy and security. Anyone who’s fought to get access to a database or data warehouse in order to build a model can relate. It can feel like there are wildly over-cautious barriers being thrown up in the way of us doing our jobs. After all, isn’t the reason we have the data to learn from it and model it? Even the best of us sometimes demonize the parts of our organization whose primary goals are in the privacy and security area and conflict with our wishes to splash around in the data lake.

In reality, data scientists are not always the heroes and IT and security teams are not the villains. We are both working on important goals and can both get a little bit of tunnel vision in that pursuit. It helps to look at the perspectives of both roles to understand the tension in place and the competing interests.

The Data Science Perspective

From the data science angle, having large volumes of data is frequently necessary to meet the goals of our work. To build a generalizable model, you need many, many examples of the kinds of data that your model will need to respond to in production. Hundreds of thousands or millions of cases is not an outrageous amount to look for, by any means. However, to really make this work, data scientists must spend a lot of time and energy interrogating that data. Having a whole lot of data is great, but if you don’t know what it really represents and where it came from, doing effective data science will be an uphill battle.

The Security Angle

If we take the security-forward perspective, on the other hand, we have to admit that the larger the quantities of data we have, particularly if multiple storage systems or processes touch that data, the larger the risk of a data breach. Essentially, the more data we have, the greater the chance that some of it goes missing or gets accessed by someone inappropriately. In addition, more people having access to data means more opportunities for breach or data loss, because human beings are the biggest risk vector in the technology space. We’re the weak link in the chain.

What does all this mean? I would argue that it leads us to need a middle ground. For one thing, the more data we have lying around, the lower the likelihood that we have actually done the work to understand it deeply, or that we even could with the time and tools at our disposal. If we just hoard everything indiscriminately, we actually put ourselves in a position where we can’t even understand all the data and we are simultaneously at peak risk of breach. If we store nothing, or not enough, we make it impossible to access the incredible value data science has to offer.

So, we need to figure out where this middle ground lives. Best practices in data engineering and data retention do exist, but we have to make a lot of spur-of-the-moment decisions too. Having principles around how we think about data retention and usage is important to help guide us in these situations.

While I am on this topic of data management, I should mention—I recently started a new role! I am the first senior machine learning engineer at DataGrail, a company that provides a suite of B2B services helping companies secure and manage their customer data. This has naturally put the questions of data storage and privacy to the front of my mind, and made me think about the experiences I’ve had across my own career in companies of varying maturity levels and how they handled data.

It’s so easy for a company to become a data hoarder. You begin with a shortage of data, and you’re flying blind, collecting data about transactions, business activities, etc. as you go to help inform decisions and strategy. You may not be doing machine learning yet, but you can see the future potential, and you want to set the stage. It seems not only reasonable but vital to collect your data and store it! So, you set up data systems and start filling up those tables or topics.

This isn’t sustainable, though, at least not forever. After a few years go by you can end up with huge volumes of data. Maybe you need to scale up to a cloud provider like Snowflake or AWS to keep up and make all this data accessible at the pace you need. You’re using the data, of course! Maybe you have begun a machine learning program, or even just advanced analytics and BI, and if done well this is making a huge difference to your business’s effectiveness. But even so, you’re going to have to start thinking about the cost of the infrastructure, not to mention probably hiring data engineering staff to help manage the beast.

Unfortunately, you have also started to acquire data that you don’t have a good handle on anymore. Documentation may be falling out of date, if it ever existed at all, and the staff who helped build out the original systems years ago could be turning over. What does this table mean? What is the provenance of that column? Data that isn’t interpretable generates little value if any, because you can’t effectively learn from data you don’t understand.
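One way to make those questions concrete is to measure how much of your data nobody can explain anymore. The snippet below is only a minimal sketch in Python: the `catalog` dictionary, its table and column names, and the `undocumented_columns` helper are all hypothetical stand-ins for whatever data catalog or documentation system you actually have.

```python
# Hypothetical catalog: table -> column -> description and owner.
# In practice this would come from your own documentation or catalog tool.
catalog = {
    "orders": {
        "order_id": {"description": "Primary key for an order", "owner": "sales-eng"},
        "amt_2": {"description": "", "owner": None},
    },
    "legacy_events": {
        "col_17": {"description": "", "owner": None},
    },
}


def undocumented_columns(catalog):
    """Return sorted 'table.column' names that lack a description or an owner."""
    missing = []
    for table, columns in catalog.items():
        for column, meta in columns.items():
            if not meta.get("description") or not meta.get("owner"):
                missing.append(f"{table}.{column}")
    return sorted(missing)


gaps = undocumented_columns(catalog)
print(gaps)
```

A list like this won’t tell you what a mystery column means, but it does show where the interpretability gaps are before you decide what is worth keeping.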

At this point you have decisions to make. How are you going to strategically plan for the future of your data systems? You probably need to attend to data architecture to try to keep costs from skyrocketing, but what about data retention? Do you keep all data forever? If not, what do you cut and when? Remember, however, that retaining a pretty large volume of data is a non-negotiable requirement if your business is to have effective machine learning and/or analytics functions supporting your decision-making and products. “Throw it all out and avoid any of this nonsense” is not an option.
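To illustrate what “what do you cut and when” can look like once it is written down, here is a rough Python sketch of an age-based retention rule. Everything in it is an assumption for illustration: the `RetentionRule` class, the `web_events` table name, and the 365-day window are made up, and real retention windows depend on your business and legal requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule: how long records in one table are kept."""
    table: str
    max_age_days: int


def is_expired(record_date: date, rule: RetentionRule, today: date) -> bool:
    """Return True when a record is strictly older than the rule allows."""
    return today - record_date > timedelta(days=rule.max_age_days)


def split_by_retention(records, rule, today):
    """Partition (record_id, record_date) pairs into keep and purge lists."""
    keep, purge = [], []
    for record_id, record_date in records:
        if is_expired(record_date, rule, today):
            purge.append(record_id)
        else:
            keep.append(record_id)
    return keep, purge


# Illustrative values only.
rule = RetentionRule(table="web_events", max_age_days=365)
records = [
    ("a", date(2023, 12, 1)),
    ("b", date(2022, 6, 1)),
    ("c", date(2022, 12, 15)),
]
keep, purge = split_by_retention(records, rule, today=date(2023, 12, 15))
print(keep, purge)
```

The point of the sketch is less the code than the habit: a retention decision that exists as an explicit rule can be reviewed, argued about, and changed, which a pile of ad hoc choices cannot.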

At the same time, you need to be thinking about the regulatory and legal frameworks that apply to having all this data. What are you going to do if a customer asks you to delete all the data you have about them, as some jurisdictions allow? Many organizations don’t take this seriously until they’re already late to the party. If you’re going to be on top of it, and you didn’t start from day 1, you face the tough task of retrofitting your data architecture to handle the regulatory requirements this data is subject to.
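A deletion request is a good example of why the retrofit is hard: you can only delete what you know you have. The following is a simplified Python sketch using the standard library’s `sqlite3` module with an in-memory database. The `CUSTOMER_TABLES` registry, the table names, and the `delete_customer` function are hypothetical, and a real system would also have to deal with backups, logs, downstream copies, and trained models, none of which this covers.

```python
import sqlite3

# Hypothetical registry of every table that stores customer data,
# mapped to the column holding the customer identifier.
CUSTOMER_TABLES = {
    "orders": "customer_id",
    "support_tickets": "customer_id",
}


def delete_customer(conn: sqlite3.Connection, customer_id: int) -> dict:
    """Delete one customer's rows from every registered table.

    Returns the number of rows removed per table.
    """
    removed = {}
    with conn:  # commits on success, rolls back if an error is raised
        for table, column in CUSTOMER_TABLES.items():
            # Table and column names come from the trusted constant above,
            # never from user input; the customer id is passed as a parameter.
            cur = conn.execute(
                f"DELETE FROM {table} WHERE {column} = ?", (customer_id,)
            )
            removed[table] = cur.rowcount
    return removed


# Small in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE TABLE support_tickets (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)", [(1,), (1,), (2,)])
conn.executemany("INSERT INTO support_tickets (customer_id) VALUES (?)", [(1,), (2,)])

removed = delete_customer(conn, customer_id=1)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(removed, remaining)
```

Notice that the whole thing hinges on the registry being complete. If a table holding customer data is missing from it, the request is silently only partly honored, which is exactly the risk of not having a handle on your data.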

The growth in data security regulations in recent years has increased the challenges of the scenario I describe for businesses. In some ways, it was our own doing: numerous data breaches, lax security, and opaque consent policies by assorted companies over recent years have led to public demand for better, and governments filled the gap. It appears that brand trust and safety weren’t enough motivation on their own to get many businesses to run a tighter ship where data protection was concerned. If laws were necessary to ensure that our personal data and sensitive records are protected conscientiously, then I for one am all for them.

However, wearing my data scientist hat, I have to acknowledge the tension that I started with in this column. I want all the data, and…



Copyright © 2023 PouroverAI News.