Practical lessons from upgrading Bed-Reader, a bioinformatics library
Rust and Python reading DNA data directly from the cloud — Source: https://openai.com/dall-e-2/. All other figures from the author.
Would you like your Rust program to seamlessly access data from files in the cloud? When I refer to “files in the cloud,” I mean data housed on web servers or within cloud storage solutions like AWS S3, Azure Blob Storage, or Google Cloud Storage. The term “read”, here, encompasses both the sequential retrieval of file contents — be they text or binary, from beginning to end — and the capability to pinpoint and extract specific sections of the file as needed.
Upgrading your program to access cloud files can reduce annoyance and complication: the annoyance of downloading to local storage and the complication of periodically checking that a local copy is up to date.
Sadly, upgrading your program to access cloud files can also increase annoyance and complication: the annoyance of URLs and credential information, and the complication of asynchronous programming.
Bed-Reader is a Python package and Rust crate for reading PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. At a user’s request, I recently updated Bed-Reader to optionally read data directly from cloud storage. Along the way, I learned nine rules that can help you add cloud-file support to your programs. The rules are:
1. Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
2. Sequentially read text lines from cloud files via two nested loops.
3. Randomly access cloud files, even giant ones, with “range” methods, while respecting server-imposed limits.
4. Use URL strings and option strings to access HTTP, Local Files, AWS S3, Azure, and Google Cloud.
5. Test via tokio::test on HTTP and local files.
If other programs call your program — in other words, if your program offers an API (application programming interface) — four additional rules apply:
6. For maximum performance, add cloud-file support to your Rust library via an async API.
7. Alternatively, for maximum convenience, add cloud-file support to your Rust library via a traditional (“synchronous”) API.
8. Follow the rules of good API design, in part by using hidden lines in your doc tests.
9. Include a runtime, but optionally.
Aside: To avoid wishy-washiness, I call these “rules”, but they are, of course, just suggestions.
The powerful object_store crate provides full content access to files stored on HTTP servers, AWS S3, Azure, Google Cloud, and the local filesystem. It is part of the Apache Arrow project and has over 2.4 million downloads.
For this article, I also created a new crate called cloud-file. It simplifies the use of the object_store crate. It wraps and focuses on a useful subset of object_store’s features. You can either use it directly or pull out its code for your own use.
Let’s look at an example. We’ll count the lines of a cloud file by counting the number of newline characters it contains.
```rust
use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

async fn count_lines(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;
    let line_count = count_lines(&cloud_file).await?;
    println!("line_count: {}", line_count);
    Ok(())
}
```
When we run this code, it returns:
```
line_count: 500
```
Some points of interest:
– We use async (and, here, tokio). We’ll discuss this choice more in Rules 6 and 7.
– We turn a URL string and string options into a CloudFile instance with CloudFile::new_with_options(url, options)?. (We use ? to catch malformed URLs.)
– We create a stream of binary chunks with cloud_file.stream_chunks().await?. This is the first place that the code tries to access the cloud file. If the file doesn’t exist or we can’t open it, the ? will return an error.
– We use chunks.next().await to retrieve the file’s next binary chunk. (Note the use futures_util::StreamExt;.) The next method returns None after all chunks have been retrieved.
– What if there is a next chunk but also a problem retrieving it? We’ll catch any problem with let chunk = chunk?;.
– Finally, we use the fast bytecount crate to count newline characters.
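As a preview of Rule 5, the count_lines function above is also easy to test against a small HTTP-hosted file. Here is a minimal sketch (the test name is mine, and it assumes count_lines and the imports from the example above are in scope):

```rust
#[tokio::test]
async fn count_lines_via_http() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let cloud_file = CloudFile::new_with_options(url, [("timeout", "10s")])?;
    // The sample file contains 500 lines (see the output above).
    assert_eq!(count_lines(&cloud_file).await?, 500);
    Ok(())
}
```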
In contrast with this cloud solution, think about how you would write a simple line counter for a local file. You might write this:
```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let path = "examples/line_counts_local.rs";
    let reader = BufReader::new(File::open(path)?);
    let mut line_count = 0;
    for line in reader.lines() {
        let _line = line?;
        line_count += 1;
    }
    println!("line_count: {}", line_count);
    Ok(())
}
```
Between the cloud-file version and the local-file version, three differences stand out. First, we can easily read local files as text; by default, we read cloud files as binary (but see Rule 2). Second, by default, we read local files synchronously, blocking program execution until completion. Cloud files, by contrast, are usually accessed asynchronously, allowing other parts of the program to keep running while the relatively slow network access completes. Third, iterators such as lines() work with for loops; streams such as stream_chunks() do not, so we use while let.
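As a preview of Rule 2, one way to recover text lines from a binary chunk stream is two nested loops: an outer loop pulls chunks, and an inner loop consumes the complete lines buffered so far, carrying any partial trailing line into the next iteration. The following is a minimal sketch of that idea (print_lines and its buffering scheme are mine, not the cloud-file crate’s own line API):

```rust
use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

// Print each line of a cloud file, assuming UTF-8 text.
async fn print_lines(cloud_file: &CloudFile) -> Result<(), CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut carry: Vec<u8> = Vec::new(); // Holds a partial line between chunks.
    while let Some(chunk) = chunks.next().await {
        // Outer loop: one binary chunk at a time.
        carry.extend_from_slice(&chunk?);
        // Inner loop: every complete (newline-terminated) line buffered so far.
        while let Some(pos) = carry.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = carry.drain(..=pos).collect();
            println!("{}", String::from_utf8_lossy(&line).trim_end());
        }
    }
    if !carry.is_empty() {
        // A final line with no trailing newline.
        println!("{}", String::from_utf8_lossy(&carry));
    }
    Ok(())
}
```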
I mentioned earlier that you didn’t need to use the cloud-file wrapper and that you could use the object_store crate directly. Let’s see what it looks like when we count the newlines in a cloud file using only object_store methods:
```rust
use futures_util::StreamExt; // Enables `.next()` on streams.
use object_store::path::Path as StorePath;
use object_store::{parse_url_opts, ObjectStore};
use std::sync::Arc;
use url::Url;

async fn count_lines(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let url = Url::parse(url)?;
    let (object_store, store_path) = parse_url_opts(&url, options)?;
    let object_store = Arc::new(object_store); // Arc enables sharing across tasks.
    let line_count = count_lines(&object_store, store_path).await?;
    println!("line_count: {}", line_count);
    Ok(())
}
```
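Finally, a preview of Rule 3: object_store also supports random access. This is a minimal sketch using object_store’s get_range method (the range’s integer type varies across object_store versions); it fetches only the first 100 bytes of the sample file, so the rest of the file is never transferred:

```rust
use object_store::{parse_url_opts, ObjectStore};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = Url::parse(
        "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam",
    )?;
    let (object_store, store_path) = parse_url_opts(&url, [("timeout", "10s")])?;

    // Ask the server for bytes 0..100 only.
    let bytes = object_store.get_range(&store_path, 0..100).await?;
    println!("first {} bytes: {:?}", bytes.len(), bytes);
    Ok(())
}
```

As Rule 3 notes, servers impose limits on range requests, so batch your reads accordingly.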