As Data Pile Up, Some Healthcare Organizations Turn To ‘Lakehouses’

MHE Publication, May 2024, Volume 34, Issue 5

Healthcare organizations generate and store vast amounts of data, but they often have to choose between speed and structure. Data "lakehouses" may be the answer.

Cloud computing offers a number of advantages, but it would be a mistake to assume that it automatically leads to lower costs, says Nick Stepro, the chief product and technology officer at Arcadia, a healthcare IT firm in Boston.

“Those that have gone to the cloud have probably learned that of all of the things the cloud has made easy, it has made spending money easier than anything,” Stepro says. “You can basically blink and wake up to a seven-digit cloud bill like that.”

A 2022 report by the data security firm Netwrix found that 73% of healthcare organizations store sensitive data in the cloud. Respondents said they expected more than half of their workloads to run in the cloud within 12 to 18 months.

Ironically, the most common reason respondents gave for moving to the cloud was to cut costs. Stepro says high cloud costs are one reason it is important to use the cloud efficiently. He says making cloud computing cost effective requires thinking not just about where an organization stores its data, but also how it stores the data. That is why some healthcare organizations are turning to a new approach to managing data: the data lakehouse.

Why a lakehouse?

If the coinage “data lakehouse” seems odd, that’s because the concept is actually a combination of two other methods of collecting and storing large amounts of data.

The first — and older — method is called data warehousing. Just like a warehouse for an online retailer, everything that goes into a data warehouse needs to be carefully structured and organized right from the start so that it is easily trackable and accessible from the moment it arrives in the warehouse.

“It’s incredibly reliable,” Stepro says. “It makes sure that your data always writes when you ask it to write, reliably, and there’s no drift in data.”

However, all of that structure comes with a trade-off. “It’s really, really hard to run fast,” he explains. “And it’s really, really hard to take advantage of cloud-native compute and cloud-native storage.”
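
For readers who think in code, the warehouse's schema-on-write contract can be reduced to a minimal sketch. The table and field names below are hypothetical, and SQLite's STRICT tables stand in for a full warehouse simply because they enforce the schema at write time:

```python
# Schema-on-write: every record must satisfy a fixed schema before it is
# accepted. Table and fields are hypothetical; STRICT needs SQLite 3.37+.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lab_results (
           patient_id TEXT NOT NULL,
           test_code  TEXT NOT NULL,
           value      REAL NOT NULL,
           taken_at   TEXT NOT NULL
       ) STRICT"""
)

record = {"patient_id": "p-001", "test_code": "HBA1C",
          "value": 5.9, "taken_at": "2024-05-01T08:30:00"}
conn.execute(
    "INSERT INTO lab_results VALUES (:patient_id, :test_code, :value, :taken_at)",
    record,
)
# A record with a missing field or the wrong type fails at this point,
# before it ever lands in the table: reliable, but slow to adapt.
```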

Those problems led to the creation of a different approach, the “data lake.” Stepro says this model started to gain traction about a decade ago. If data warehouses are akin to highly organized corporate warehouses, data lakes look more like your 9-year-old’s bedroom.

“It’s just like throwing your data over the wall,” he says. “We don’t really care. Data in any source in any format, any structure — throw it into the lake and we can apply really, really high-scale cloud compute on top of that to do all sorts of stuff.”

In other words, the emphasis is on collecting the data as quickly as possible, but at the expense of strict, front-end organization and quality control.

“And the challenge with that is that there’s basically no governance on it whatsoever,” Stepro says. “Your data can’t be reliably traced across schema changes, and it takes a lot of heavy processing people on top of that to govern that.”

If an organization is not careful, its data lake can turn into a swamp, full of duplicated, inaccurate or incomplete data.
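
The contrast with the warehouse sketch above is stark: in a lake, ingestion imposes no schema at all. A minimal illustration, with invented paths and payloads:

```python
# Schema-on-read: land raw records of any shape now; impose structure
# only when someone reads them. Paths and fields are illustrative.
import json
import uuid
from pathlib import Path

lake = Path("lake/raw/feeds")
lake.mkdir(parents=True, exist_ok=True)

# Two payloads with completely different shapes both go straight in.
for payload in (
    {"patient_id": "p-001", "hba1c": 5.9},
    {"site": "plant-7", "sample": {"virus": "SARS-CoV-2", "copies_per_ml": 1.2e4}},
):
    (lake / f"{uuid.uuid4()}.json").write_text(json.dumps(payload))

# Consumers discover the structure, or its absence, later at read time.
for path in sorted(lake.glob("*.json")):
    print(json.loads(path.read_text()))
```

Nothing in that flow stops duplicated, malformed or incomplete payloads from piling up, which is exactly how a lake turns into a swamp.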

Best of both worlds

The idea behind data lakehouses is to combine the strengths of data warehouses and data lakes while trying to minimize their downsides. That means leveraging the speed and lower cost of the data lake approach but also applying tier-based governance and technology in order to provide structure and usability.

“So you can still run really, really, really fast, you can still create massive, thousand-node compute layers on top of it, but you don’t turn it into a swamp, which is what happens to a lot of lakes,” says Stepro.
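
In practice, that combination usually means transactional tables layered over cheap file storage. As a rough sketch, the open-source deltalake package (Python bindings for delta-rs, one of several lakehouse table formats, and not necessarily what Arcadia uses) adds warehouse-style guarantees to plain files:

```python
# Lakehouse sketch: data lives in open Parquet files, but every write goes
# through a transaction log, giving readers warehouse-style reliability.
# Requires `pip install deltalake pandas`; path and columns are invented.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"patient_id": ["p-001"], "hba1c": [5.9]})

# ACID append: readers never see a half-written file.
write_deltalake("lake/clinical/labs", df, mode="append")

table = DeltaTable("lake/clinical/labs")
print(table.version())    # the transaction log preserves lineage...
print(table.to_pandas())  # ...while data stays in open, cheap file formats
```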

Arcadia markets a data lakehouse product designed to meet the needs of the healthcare industry, which poses some challenges because it is highly regulated and generates a wide array of data types. Other companies are selling data lakehouse products as well. Databricks, a data and artificial intelligence (AI) company in San Francisco, launched a healthcare-focused data lakehouse platform in 2022, and companies such as eClinical are promoting data lakehouses as tools to optimize digital clinical trials.

Among the healthcare organizations adopting data lakehouse infrastructures is Umpqua Health, an Oregon-based coordinated care organization. Juliana Landry, M.P.H., Umpqua’s vice president of health systems performance, said in a press release that she believes the lakehouse approach will ensure “the rapid delivery of updated data to our care teams, significantly improving our processes.

“From optimizing member outreach opportunities to accelerating care program enrollments and improving care coordination efficiency, this rapid data refresh means fresher insights and more informed decision-making across the enterprise and improves our ability to deliver better outcomes,” she added.

Stepro noted that healthcare organizations handle everything from mundane data, such as physician directories and work-hour logs, to highly technical clinical trial data and highly personal health records.

A tiered approach

One way Arcadia’s lakehouse structure deals with these diverse needs is through a tiered “medallion” system. Under that system, data can enter the lakehouse as raw, “untransformed” data — the “bronze” tier. Over time, the data can be structured into a more useful product while the system applies tier-appropriate controls to govern access.
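
A toy version of that progression, using pandas and invented cleaning rules rather than Arcadia's actual pipeline, shows how each tier tightens the data:

```python
# Medallion sketch: bronze (raw) -> silver (validated) -> gold (curated).
# Columns and rules are invented for illustration.
import pandas as pd

# Bronze: raw, untransformed data, loaded exactly as it arrived.
bronze = pd.DataFrame({
    "patient_id": ["p-001", "p-001", None],
    "hba1c": ["5.9", "5.9", "7.2"],   # strings, duplicates, nulls allowed
})

# Silver: typed, deduplicated, minimally validated.
silver = (
    bronze.dropna(subset=["patient_id"])
          .drop_duplicates()
          .assign(hba1c=lambda d: d["hba1c"].astype(float))
)

# Gold: aggregated into a production-ready analytics asset.
gold = silver.groupby("patient_id", as_index=False)["hba1c"].mean()
print(gold)
```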

“As they are progressing through our system, every piece of data is tagged with record-level security that is temporally applied,” Stepro says. “And so what we’re able to do is to give access to big data technologies and data stores but do that in such a way where it respects the privacy of individuals and locks down different end-users from being able to access more than they should.”
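
A toy version of that gating, with invented tags and roles, illustrates the principle; a production system would also scope each tag in time, per Stepro's point that the security is temporally applied:

```python
# Record-level security sketch: every row carries an access tag, and each
# query is filtered by the caller's entitlements before data is returned.
# Tags, roles and fields are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "patient_id": ["p-001", "p-002"],
    "diagnosis":  ["E11.9", "F32.9"],
    "access_tag": ["care_team_a", "behavioral_health"],
})

entitlements = {
    "nurse_a": {"care_team_a"},
    "bh_counselor": {"behavioral_health"},
}

def query(user: str) -> pd.DataFrame:
    """Return only the rows this user's entitlements permit."""
    allowed = entitlements.get(user, set())
    return records[records["access_tag"].isin(allowed)]

print(query("nurse_a"))  # sees only the care_team_a row
```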

Stepro gave an example of data from wastewater surveillance. Such data can be used to track the spread of illnesses such as COVID-19 through a population. The potential value of the data is high, yet it does not conform to any of the normal data structures that healthcare organizations are used to. “In the prior mindset, the work to negotiate that data into their existing structures and data strategy can be a monthlong exercise,” he says.

The lakehouse approach is to get the data into the database as soon as possible and worry about making it actionable later with the help of machine learning (ML) and other tools.

“We have this data, let it be loosely structured, throw it in the system, apply ML to traverse that dataset and do some initial insights in a relatively unstructured fashion,” he says, “and then you can graduate that data up the silver and gold chain of that modality and data model for more production analytics assets.”
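
Condensed into code, that load-first, structure-later flow might look like the sketch below; the fields, the threshold and the trivial spike rule are invented stand-ins for the ML pass Stepro describes:

```python
# Load first, structure later: an unfamiliar feed lands raw, gets a quick
# first-pass analysis, and only validated rows graduate to the silver tier.
import pandas as pd

# Bronze: the wastewater feed lands as-is, incomplete records and all.
bronze = pd.DataFrame([
    {"site": "plant-7", "date": "2024-04-01", "copies_per_ml": 9.0e3},
    {"site": "plant-7", "date": "2024-04-08", "copies_per_ml": 4.1e4},
    {"site": "plant-7", "date": "2024-04-15"},   # incomplete record
])

# First-pass insight on the raw data: flag week-over-week spikes.
signal = bronze.dropna(subset=["copies_per_ml"]).copy()
signal["spike"] = signal["copies_per_ml"].pct_change() > 2.0

# Graduate only the typed, validated rows up to the silver tier.
silver = signal.astype({"copies_per_ml": float})
print(silver[silver["spike"]])
```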

Lakehouses and artificial intelligence

Though one application of AI and ML is to make sense of unstructured data, another is to analyze that data to develop new insights that could improve patient care. Stepro says the lakehouse architecture can enable both functions.

“You can start leveraging this huge morass of unstructured data in which there’s a tremendous amount of signal,” he says, “but it needs to be married with appropriate context for these models to appropriately take advantage of it.”

After all, the “garbage in, garbage out” maxim is a major concern in the AI era. Flawed or incomplete data will lead to faulty insights, such as biased results that do not adequately account for different subgroups or scenarios. Stepro says the ideal situation is to be able to leverage a sufficient amount of data in a strategic way, one that does not require running up a sky-high cloud bill.

“Everyone right now is using a sledgehammer like GPT-4 on every problem,” he says, “but maturity will mean using finer-tuned, smaller models for very specific problems.”
