Tevogen Bio’s Journey to Streamlining Life-Saving Therapies

2026-03-25 · Source: Databricks · Field: Health & Wellbeing — Pharmaceuticals & Biotechnology, Health & Medical Research, Medical Devices & Health Technology · Depth: Intermediate, quick

Summary

Tevogen Bio, through its Tevogen.AI division, partnered with Microsoft and Databricks to accelerate target identification on its ExacTcell platform. Conventional drug development takes 10-12 years and costs over $3 billion per drug. The initial SARS-CoV-2 target selection, for a single HLA-restricted product, required 18-24 months of manual wet-lab testing. The collaboration aimed to cut this process from months to days or hours by ingesting and curating a multi-terabyte library of protein sequences. This dataset, comprising 24 million proteins refined into 16 billion data points and ~700 million unique peptides, along with ~37 million expert articles, is used to train Tevogen.AI's patented machine learning models to predict immunologically active peptides. On the Databricks Platform, using Medallion Architecture and Unity Catalog, data processing time fell from 50 days to 24 hours.

Key takeaway

For AI Engineers and Research Scientists developing drug discovery pipelines, adopting a data lakehouse platform such as Databricks, organized with Medallion Architecture, can cut data processing times from weeks to hours (in this case, from 50 days to 24 hours). This acceleration allows faster iteration on machine learning models for target identification, which directly affects the speed and cost of bringing new therapies to market.

Key insights

Integrating a data lakehouse architecture with ML significantly accelerates drug discovery target identification.

Principles

Parallelizing work that previously ran serially sharply reduces end-to-end processing time.
Structured data layers improve data governance and ML model development.
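
To make the first principle concrete, here is a minimal Python sketch; it is not Tevogen's or Databricks' code. It splits a toy list of protein sequences into chunks and maps a hypothetical count_peptides function over them with the standard library's concurrent.futures. The 9-residue window, the toy sequences, and every function name are assumptions for illustration. The final line works out the speedup implied by the 50-day and 24-hour figures in the summary.

```python
from concurrent.futures import ThreadPoolExecutor

PEPTIDE_LEN = 9  # assumed window length, for illustration only


def count_peptides(chunk):
    """Count the sliding-window peptides contained in a chunk of sequences."""
    return sum(max(len(seq) - PEPTIDE_LEN + 1, 0) for seq in chunk)


def split(items, n_chunks):
    """Split items into at most n_chunks contiguous, roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_count(sequences, workers=4):
    """Process chunks concurrently and combine the partial counts."""
    chunks = split(sequences, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_peptides, chunks))


# Toy input: three made-up sequences repeated to give the workers something to do.
sequences = ["MFVFLVLLPLVSSQCVNLT", "MKT", "ACDEFGHIKLMNPQRSTVWY"] * 100
serial_total = count_peptides(sequences)
parallel_total = parallel_count(sequences)

# Speedup implied by the summary's figures: 50 days (1,200 hours) down to 24 hours.
speedup = (50 * 24) / 24
```

Threads keep the sketch portable, but CPU-bound work in CPython would need separate processes or a cluster to see a real speedup; on Databricks the same split-map-combine pattern would typically be distributed by Spark.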

Method

A modern data lakehouse, built on Databricks with Medallion Architecture and Unity Catalog, enables scalable ingestion, cleansing, and organization of multi-terabyte protein sequence datasets for ML model training.
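
The layering behind Medallion Architecture can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the pipeline described in the original article: the record fields (id, sequence), the 9-residue peptide length, and the function names are all hypothetical, and on Databricks each layer would normally be a governed table rather than an in-memory list.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
PEPTIDE_LEN = 9  # assumed peptide length, for illustration only


def to_bronze(raw_records):
    """Bronze: keep raw records exactly as ingested."""
    return list(raw_records)


def to_silver(bronze):
    """Silver: normalize sequences, drop invalid ones, keep first record per ID."""
    silver = {}
    for rec in bronze:
        seq = rec.get("sequence", "").strip().upper()
        if seq and set(seq) <= VALID_RESIDUES:
            silver.setdefault(rec["id"], {"id": rec["id"], "sequence": seq})
    return list(silver.values())


def to_gold(silver):
    """Gold: derive the sorted set of unique fixed-length peptides."""
    peptides = set()
    for rec in silver:
        seq = rec["sequence"]
        for i in range(len(seq) - PEPTIDE_LEN + 1):
            peptides.add(seq[i:i + PEPTIDE_LEN])
    return sorted(peptides)


# Made-up input records, including a duplicate ID and an invalid residue.
raw = [
    {"id": "P1", "sequence": "mkvlaagivgl "},
    {"id": "P1", "sequence": "MKVLAAGIVGL"},  # duplicate ID, dropped in silver
    {"id": "P2", "sequence": "MKVLAAGIVXX"},  # X is not a standard residue
    {"id": "P3", "sequence": "KVLAAGIVGLA"},
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping bronze untouched means the cleansing rules in silver can be changed and replayed later without re-ingesting the source data.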

In practice

Implement Medallion Architecture for multi-stage data processing.
Use Unity Catalog for fine-grained data access control.
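
For the access-control item, Unity Catalog privileges are granted with SQL GRANT statements. The following Python helper is a hedged sketch that only builds those statements as strings; the catalog, schema, table, and group names are hypothetical and do not come from the original article.

```python
def quote(name):
    """Backtick-quote an identifier or principal name."""
    if "`" in name:
        raise ValueError("names containing backticks are not handled in this sketch")
    return "`" + name + "`"


def read_only_grants(catalog, schema, table, principal):
    """Build the grants a principal needs in order to read a single table."""
    cat, sch, tbl, who = (quote(x) for x in (catalog, schema, table, principal))
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {cat}.{sch} TO {who}",
        f"GRANT SELECT ON TABLE {cat}.{sch}.{tbl} TO {who}",
    ]


# Hypothetical names: a gold-layer peptide table read by an ML research group.
statements = read_only_grants("biotech", "gold", "unique_peptides", "ml-researchers")
for stmt in statements:
    print(stmt)  # in a Databricks notebook, each could be run with spark.sql(stmt)
```

Granting SELECT on one gold table, rather than on the whole schema, is one way to keep raw and intermediate layers out of reach of downstream consumers.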

Topics

Drug Discovery Acceleration
Machine Learning in Biotech
Data Lakehouse Architecture
Protein Sequence Analysis
Databricks Platform

Best for: AI Engineer, Data Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.