Thursday, May 8, 2025
News PouroverAI

Nine Rules for Accessing Cloud Files from Your Rust Code | by Carl M. Kadie | Feb, 2024

February 7, 2024



Practical lessons from upgrading Bed-Reader, a bioinformatics library

Rust and Python reading DNA data directly from the cloud — Source: https://openai.com/dall-e-2/. All other figures from the author.

Would you like your Rust program to seamlessly access data from files in the cloud? When I refer to “files in the cloud,” I mean data housed on web servers or within cloud storage solutions like AWS S3, Azure Blob Storage, or Google Cloud Storage. The term “read”, here, encompasses both the sequential retrieval of file contents — be they text or binary, from beginning to end —and the capability to pinpoint and extract specific sections of the file as needed.

Upgrading your program to access cloud files can reduce annoyance and complication: the annoyance of downloading to local storage and the complication of periodically checking that a local copy is up to date.

Sadly, upgrading your program to access cloud files can also increase annoyance and complication: the annoyance of URLs and credential information, and the complication of asynchronous programming.

Bed-Reader is a Python package and Rust crate for reading PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. At a user’s request, I recently updated Bed-Reader to optionally read data directly from cloud storage. Along the way, I learned nine rules that can help you add cloud-file support to your programs. The rules are:
1. Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
2. Sequentially read text lines from cloud files via two nested loops.
3. Randomly access cloud files, even giant ones, with “range” methods, while respecting server-imposed limits.
4. Use URL strings and option strings to access HTTP, Local Files, AWS S3, Azure, and Google Cloud.
5. Test via tokio::test on http and local files.
If other programs call your program (in other words, if your program offers an API, or application program interface), four additional rules apply:
6. For maximum performance, add cloud-file support to your Rust library via an async API.
7. Alternatively, for maximum convenience, add cloud-file support to your Rust library via a traditional (“synchronous”) API.
8. Follow the rules of good API design in part by using hidden lines in your doc tests.
9. Include a runtime, but optionally.

Aside: To avoid wishy-washiness, I call these “rules”, but they are, of course, just suggestions.

The powerful object_store crate provides full content access to files stored on http, AWS S3, Azure, Google Cloud, and local files. It is part of the Apache Arrow project and has over 2.4 million downloads.

For this article, I also created a new crate called cloud-file. It simplifies the use of the object_store crate by wrapping a useful subset of object_store’s features. You can either use it directly or pull out its code for your own use.

Let’s look at an example. We’ll count the lines of a cloud file by counting the number of newline characters it contains.

```rust
use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

async fn count_lines(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut newline_count: usize = 0;

    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }

    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;
    let line_count = count_lines(&cloud_file).await?;

    println!("line_count: {}", line_count);
    Ok(())
}
```

When we run this code, it returns:
```
line_count: 500
```

Some points of interest:
– We use async (and, here, tokio). We’ll discuss this choice more in Rules 6 and 7.
– We turn a URL string and string options into a CloudFile instance with CloudFile::new_with_options(url, options)?. We use ? to catch malformed URLs.
– We create a stream of binary chunks with cloud_file.stream_chunks().await?. This is the first place that the code tries to access the cloud file. If the file doesn’t exist or we can’t open it, the ? will return an error.
– We use chunks.next().await to retrieve the file’s next binary chunk. (Note the use futures_util::StreamExt;.) The next method returns None after all chunks have been retrieved.
– What if there is a next chunk but also a problem retrieving it? We’ll catch any problem with let chunk = chunk?;.
– Finally, we use the fast bytecount crate to count newline characters.
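The control flow described in these points can be tried without a network connection. Here is a minimal sketch, assuming an in-memory iterator of `io::Result<Vec<u8>>` chunks stands in for the async stream. The name `count_newlines` is hypothetical (it is not part of cloud-file or object_store), and the sketch counts bytes with the standard library rather than with bytecount.

```rust
use std::io;

/// Counts newline bytes across a sequence of fallible chunks.
/// Hypothetical stand-in: a plain iterator replaces the async stream.
fn count_newlines<I>(chunks: I) -> io::Result<usize>
where
    I: IntoIterator<Item = io::Result<Vec<u8>>>,
{
    let mut chunks = chunks.into_iter();
    let mut newline_count = 0usize;

    // Same shape as the async loop, minus `.await`.
    while let Some(chunk) = chunks.next() {
        let chunk = chunk?; // A chunk can exist and still fail to arrive.
        newline_count += chunk.iter().filter(|&&b| b == b'\n').count();
    }

    Ok(newline_count)
}
```

Where the chunk boundaries fall does not change the result: the chunks `b"a\nb"` and `b"\nc\n"` give a count of 3, the same as the unsplit bytes would. An `Err` chunk stops the loop and is returned to the caller.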

In contrast with this cloud solution, think about how you would write a simple line counter for a local file. You might write this:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let path = "examples/line_counts_local.rs";
    let reader = BufReader::new(File::open(path)?);
    let mut line_count = 0;

    for line in reader.lines() {
        let _line = line?;
        line_count += 1;
    }

    println!("line_count: {}", line_count);
    Ok(())
}
```

Between the cloud-file version and the local-file version, three differences stand out. First, we can easily read local files as text. By default, we read cloud files as binary (but see Rule 2). Second, by default, we read local files synchronously, blocking program execution until completion. On the other hand, we usually access cloud files asynchronously, allowing other parts of the program to continue running while waiting for the relatively slow network access to complete. Third, iterators such as lines() work with for loops. Streams such as stream_chunks() do not, so we use while let.
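To make the text-versus-binary difference concrete, one way to recover text lines from binary chunks is with two nested loops: an outer loop over chunks and an inner loop over the complete lines seen so far. The following is a minimal sketch under stated assumptions, not the cloud-file API. The names `collect_lines` and `to_text` are hypothetical, an in-memory iterator stands in for the stream (so the outer loop can be a `for`; with a real stream it would be a `while let`), and chunk boundaries are assumed to fall anywhere, even mid-line.

```rust
use std::io;

/// Collects text lines from fallible byte chunks.
/// Hypothetical helper: chunk boundaries may fall anywhere, even mid-line.
fn collect_lines<I>(chunks: I) -> io::Result<Vec<String>>
where
    I: IntoIterator<Item = io::Result<Vec<u8>>>,
{
    let mut lines = Vec::new();
    let mut pending: Vec<u8> = Vec::new(); // Bytes of a line not yet terminated.

    // Outer loop: one pass per chunk.
    for chunk in chunks {
        pending.extend_from_slice(&chunk?);

        // Inner loop: peel off every complete line now in the buffer.
        while let Some(pos) = pending.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = pending.drain(..=pos).collect();
            lines.push(to_text(&line[..pos])?);
        }
    }

    // A final line without a trailing newline still counts.
    if !pending.is_empty() {
        lines.push(to_text(&pending)?);
    }
    Ok(lines)
}

/// Converts bytes to a String, reporting invalid UTF-8 as an I/O error.
fn to_text(bytes: &[u8]) -> io::Result<String> {
    let text = std::str::from_utf8(bytes)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.trim_end_matches('\r').to_string())
}
```

Given the chunks `b"ab\ncd"` and `b"e\nf"`, this returns the lines `ab`, `cde`, and `f`. Because bytes are decoded only once a whole line is available, a multi-byte character split across two chunks is handled correctly.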

I mentioned earlier that you didn’t need to use the cloud-file wrapper and that you could use the object_store crate directly. Let’s see what it looks like when we count the newlines in a cloud file using only object_store methods:

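```rust
use futures_util::StreamExt; // Enables `.next()` on streams.
pub use object_store::path::Path as StorePath;
use object_store::{parse_url_opts, ObjectStore};
use std::sync::Arc;
use url::Url;

async fn count_lines(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;

    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }

    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let url = Url::parse(url)?;
    // The source breaks off partway through the next statement. The rest of
    // `main` is a reconstruction inferred from the imports and `count_lines`.
    let (object_store, store_path) = parse_url_opts(&url, options)?;
    let object_store = Arc::new(object_store);
    let line_count = count_lines(&object_store, store_path).await?;

    println!("line_count: {}", line_count);
    Ok(())
}
```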


