Healthcare analytics firm targets $10B in savings with the help of speedy data pipelines

Improving the quality of cardiovascular care while reducing its enormous cost is a big-data problem. Biome Analytics LLC believes it’s primed for a breakthrough.

The San Francisco-based firm harvests data from more than 70 hospitals, health systems and clinicians and analyzes it to determine which medical treatments yield the best and most cost-effective results.

The company is tackling a huge problem. The Centers for Disease Control and Prevention estimates that one person dies every 36 seconds in the U.S. from cardiovascular disease. The American Heart Association calculates that more than $320 billion is spent annually treating CVD in the U.S. alone and that nearly half of U.S. adults are expected to have CVD-related conditions by 2035.

Decoding best practices

Biome believes it can eliminate $10 billion in unnecessary CVD-related costs over the next seven years without diminishing the quality of patient care. It uses a suite of machine learning-driven applications to analyze what healthcare institutions around the country are doing and identify opportunities for improvement.

Gathering, validating, normalizing and acting upon all that data require massive data pipelines and processing infrastructure. Just ensuring that data is accurate and consistent is a big problem in itself, said Biome Analytics founder and Chief Executive Stuart Jacobson.

“One of the challenges in healthcare is that the data is super-messy,” he said. “Most of our clients use a common electronic medical records system, but each one is implemented differently. We get a lot of inconsistent data.”

And there’s a lot of data to manage. Each patient who undergoes cardiac surgery generates about 1,000 clinical data elements and up to 10,000 cost-related line items, Jacobson said, and a large hospital can perform up to 100 such surgeries each month.
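
Those upper bounds compound quickly. A back-of-the-envelope calculation using only the figures above:

```python
# Upper-bound monthly record volume for a single large hospital,
# using only the figures Jacobson cites.
clinical_per_case = 1_000     # clinical data elements per surgery
cost_items_per_case = 10_000  # cost-related line items per surgery
cases_per_month = 100         # surgeries per month at a large hospital

print(cases_per_month * (clinical_per_case + cost_items_per_case))
# -> 1100000 records per hospital per month, before multi-site aggregation
```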

Biome Analytics’ data set spans more than 2 billion records that are collected via digital pipelines between the company, its partner data providers and customers. That data comes from a wide variety of equipment and databases, making the task of normalizing various formats into a set of standard metrics a daunting one.

The data just keeps on coming. “There may be four or five different procedures during a hospital stay done by different cardiologists,” Jacobson said. “Each one of those is a pipeline and we run those pipelines concurrently.”

Challenged to scale

Biome ingests and transforms the data that flows across its pipelines. It originally used multiple MySQL databases to store data mapping configurations with processing done by large and complex scripts. Code was stored in a single, largely unstructured repository.

As data volumes grew, performance and network costs began to get out of hand. Clients were also demanding more information faster. “We were on a cadence of delivering on a quarterly basis but doctors wanted more and more real-time data, so we had to increase our cycle time,” Jacobson said. “We went from quarterly delivery to monthly, and now we’re going to weekly. The whole cycle time has accelerated.”

The handcrafted data pipelines the company had initially built wouldn’t scale to meet demand, so Biome embarked on an overhaul of its pipeline infrastructure. It chose the Apache Spark analytics framework for data ingestion, transformation and analysis, and switched from writing scripts in SQL to Python, with GitLab for version control.
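
To make the shift concrete, here is a minimal sketch of the kind of PySpark ingest-and-normalize step such a pipeline might contain. The file path, column names and procedure vocabulary are hypothetical, not Biome’s actual schema:

```python
# Minimal PySpark sketch of an ingest-and-normalize step.
# Paths, columns and mapping values are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cvd-ingest").getOrCreate()

# Ingest a raw extract from one hospital's EMR export.
raw = spark.read.csv("/data/raw/hospital_a/cases.csv", header=True)

# Normalize inconsistently coded fields into a standard vocabulary.
procedure_map = {"CABG": "cabg", "Bypass Graft": "cabg", "PCI": "pci"}
normalized = (
    raw.withColumn(
            "procedure_code",
            F.coalesce(*[F.when(F.col("procedure") == k, F.lit(v))
                         for k, v in procedure_map.items()]))
       .withColumn("admit_date", F.to_date("admit_date", "MM/dd/yyyy"))
       .dropna(subset=["patient_id", "procedure_code"])
)
```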

Its library of Python programs is stored on a Python Package Index server, with container images held in a Microsoft Corp. Azure container registry. Intermediate batch processes were streamlined by moving to the open Parquet storage format and loading tables into the database once rather than after each batch of data was processed. That dramatically reduced processing times and network costs.
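
The load-once pattern is simple to express in Spark: stage every batch as Parquet, then perform one consolidated database load at the end of the run. A sketch of the idea, with hypothetical paths and a hypothetical JDBC target:

```python
# "Write intermediates to Parquet, load the database once" pattern.
# Paths, table names and the JDBC target are illustrative only.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cvd-batches").getOrCreate()

# Each batch lands as columnar Parquet instead of triggering a database load.
for batch_path in ["/data/staged/batch_01", "/data/staged/batch_02"]:
    spark.read.parquet(batch_path) \
         .write.mode("append").parquet("/data/intermediate/cases")

# A single consolidated load into the database at the end of the run.
(spark.read.parquet("/data/intermediate/cases")
      .write.format("jdbc")
      .option("url", "jdbc:postgresql://warehouse:5432/analytics")
      .option("dbtable", "cases")
      .option("user", "etl")
      .option("password", os.environ["DB_PASSWORD"])
      .mode("overwrite")
      .save())
```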

Automated orchestration

As Biome has progressed from batch SQL runs to parallel pipelines in Apache Spark, it’s been able to improve performance up to tenfold while reducing costs. Image: Biome

For pipeline orchestration, metrics and scheduling, Biome Analytics turned to an automated service provided by Ascension Labs Inc., which does business as Ascend.io. Ascend provides a single, Spark-based platform for building intelligent pipelines that detect and propagate changes, automate ingestion and transformation, and monitor operations in real time. Ascend claims it can reduce operating costs by up to 75% through tool consolidation while increasing the number of pipelines an engineer can manage sevenfold.

“The problem we solve is the difficulty companies have in ingesting and managing data at a velocity and simplicity level that gives them the ability to produce much more data with fewer people at much lower cost,” said Tom Weeks, Ascend’s chief operating officer. “A human being doesn’t have to get involved as data is being processed, cleansed, aggregated and transformed.”

With Ascend, Biome now has the option of building ad hoc pipelines for one-time data flows or using Ascend’s software development kit to hard-code pipelines that run across multiple clients and data sets. Once static and unit testing are complete, a Docker container image is generated containing data mapping configurations defined in YAML along with internal packages. That image is published to an Azure container registry for use by Ascend.
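
As an illustration of what such a configuration might look like, here is a sketch of a YAML data mapping loaded and applied in Python. The keys, values and helper function are invented for the example, not Biome’s actual format:

```python
# Illustrative YAML data-mapping configuration, loaded with PyYAML
# and applied to a Spark DataFrame. All names are hypothetical.
import yaml

MAPPING_YAML = """
source: hospital_a_emr
target_table: cardiac_cases
columns:
  pt_id: patient_id
  proc_cd: procedure_code
  adm_dt: admit_date
"""

config = yaml.safe_load(MAPPING_YAML)

def apply_mapping(df, config):
    """Rename raw EMR columns to the standard schema."""
    for src, dst in config["columns"].items():
        df = df.withColumnRenamed(src, dst)
    return df
```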

The building-block nature of Ascend’s platform has made pipeline development faster and less wasteful, Sarwat Fatima, principal data engineer at Biome, said in a video summary.

Like Legos

“Ascend is like a Lego set where various pieces can be deployed based on the data flow,” she said. “Our pipelines are automatically deployed via GitLab, and an ‘iBuilder’ class defines the interface and the construction stages for the pipeline.” A set of configurations is stored in a YAML file, and a builder class implements each of the construction stages, combining the necessary data marts into a pipeline class that defines the final data flow, which is then deployed via Ascend.
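
Fatima’s description maps onto the classic builder pattern. The sketch below is a generic illustration of that pattern, not Biome’s actual ‘iBuilder’ code; every name in it is invented for the example:

```python
# Generic builder-pattern sketch of the flow Fatima describes: an
# interface defines construction stages, a concrete builder implements
# them, and the result is a deployable pipeline object.
from abc import ABC, abstractmethod

class PipelineBuilder(ABC):
    """Interface defining the construction stages for a pipeline."""

    @abstractmethod
    def add_ingestion(self, source: str) -> "PipelineBuilder": ...

    @abstractmethod
    def add_transform(self, name: str) -> "PipelineBuilder": ...

    @abstractmethod
    def build(self) -> "Pipeline": ...

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def deploy(self):
        print(f"deploying pipeline with stages: {self.stages}")

class CardiacPipelineBuilder(PipelineBuilder):
    """Concrete builder that assembles stages, e.g. from a YAML config."""

    def __init__(self):
        self._stages = []

    def add_ingestion(self, source):
        self._stages.append(("ingest", source))
        return self

    def add_transform(self, name):
        self._stages.append(("transform", name))
        return self

    def build(self):
        return Pipeline(self._stages)

# Construct a pipeline from configuration and deploy it.
pipeline = (CardiacPipelineBuilder()
            .add_ingestion("hospital_a_emr")
            .add_transform("normalize_procedures")
            .build())
pipeline.deploy()
```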

The new approach has given Biome the ability to use common code between data pipelines rather than writing bespoke procedures for each repository. It has eliminated repetitive steps, allowed developers to work with a consistent set of design patterns and YAML configurations, and automated deployment through the software development group’s continuous integration/continuous deployment pipeline. “Ascend’s SDK also has many advanced features like dynamically creating connections and removing previous data flows,” Fatima said.

Running in parallel

Biome can now run multiple pipelines in parallel, stop and restart pipelines on demand and detect and respond to errors more efficiently. Its analysts can still access and manipulate data using SQL, thus freeing up data engineers to work on more advanced data science projects.
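
In Python, that kind of parallel fan-out with per-pipeline error handling can be sketched with the standard library alone; the pipeline names and run function below are placeholders, not Biome’s code:

```python
# Run independent pipelines in parallel, detect failures and restart
# just the failed pipeline. run_pipeline is a stand-in for a real run.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_pipeline(name: str) -> str:
    # Placeholder for an actual pipeline run (ingest, transform, load).
    return f"{name}: ok"

pipelines = ["diagnostics", "procedures", "costs", "outcomes"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_pipeline, p): p for p in pipelines}
    for future in as_completed(futures):
        name = futures[future]
        try:
            print(future.result())
        except Exception as exc:
            # Respond to the error by restarting only this pipeline.
            print(f"{name} failed ({exc}); restarting")
            pool.submit(run_pipeline, name)
```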

Processing times and network costs have been reduced and Biome can deliver results to clinicians faster. “It used to take us 12 to 24 hours to run one of these pipelines,” Jacobson said. “Because we can now parallelize it on different machines, we’re down to three to five hours. In some cases, there’s been a 10-times improvement from leveraging parallel computing and cloud infrastructure.”

Biome says its insights have helped hospitals reduce bleeding events by 69%, cut outpatient stay lengths by half, and reduce unnecessary variations in healthcare by 35%. All of that, it hopes, will add up to a $10 billion dividend over time. Though that won’t solve the problem of spiraling healthcare costs, it can at least put a dent in the growth curve.
