Tevogen Bio’s Journey to Streamlining Life-Saving Therapies
Summary
Tevogen Bio, through its Tevogen.AI division, partnered with Microsoft and Databricks to accelerate its ExacTcell platform for drug discovery, a process that traditionally takes 10-12 years and costs over $3 billion per drug. The initial SARS-CoV-2 target selection, for a single HLA-restricted product, required 18-24 months of manual wet lab testing. The collaboration aimed to cut this process from months to days or hours by building a multi-terabyte library of protein sequences. This dataset, comprising 24 million proteins refined into 16 billion data points and ~700 million unique peptides, along with ~37 million expert articles, is used to train Tevogen.AI's patented algorithmic models, which use machine learning to predict immunologically active peptides. The Databricks Platform, using Medallion Architecture and Unity Catalog, reduced data processing time from 50 days to 24 hours, a roughly 50-fold reduction.
Key takeaway
For AI Engineers and Research Scientists developing drug discovery pipelines, adopting a modern data lakehouse platform such as Databricks, organized with Medallion Architecture, can cut data processing times from weeks to hours (from 50 days to 24 hours in Tevogen's case). This acceleration allows faster iteration on machine learning models for target identification, directly improving the speed and cost-efficiency of bringing new therapies to market.
Key insights
Integrating a data lakehouse architecture with ML significantly accelerates drug discovery target identification.
Principles
- Parallelizing work that previously ran serially dramatically reduces end-to-end processing time.
- Structured data layers improve data governance and ML model development.
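The first principle can be illustrated with a minimal partition, map, and merge sketch. Everything in it is an assumption made for illustration: the toy sequences, the `count_residues` task, and the use of a local thread pool. It does not reflect Tevogen's actual workload, and on a platform like Databricks the partitions would be spread across a cluster rather than local threads (which are used here only to keep the example self-contained).

```python
from concurrent.futures import ThreadPoolExecutor


def count_residues(chunk):
    """Work done on one partition: total residues across its sequences."""
    return sum(len(seq) for seq in chunk)


def partition(items, n_parts):
    """Split items into at most n_parts order-preserving partitions."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical toy data: 1000 short amino-acid sequences.
sequences = ["MKTAYIAKQR", "GAVLIMFWP", "STCYNQ", "DEKRH"] * 250

# Serial baseline: one worker walks the whole dataset.
serial_total = count_residues(sequences)

# Parallel version: each partition is processed independently,
# then the partial results are merged.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_totals = list(pool.map(count_residues, partition(sequences, 4)))
parallel_total = sum(partial_totals)
```

The key property is that each partition can be processed without knowledge of the others, so the merged result matches the serial one while the work itself can be distributed.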
Method
A modern data lakehouse, built on Databricks with Medallion Architecture and Unity Catalog, enables scalable ingestion, cleansing, and organization of multi-terabyte protein sequence datasets for ML model training.
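A minimal, pure-Python sketch of the bronze, silver, and gold stages of a Medallion-style pipeline might look like the following. It is illustrative only: the record layout, the validation rules, the function names, and the fixed-length sliding window used to derive peptides (`k=9`) are all assumptions, not Tevogen's or Databricks' actual implementation, which would run on distributed tables rather than in-memory lists.

```python
from dataclasses import dataclass

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


@dataclass(frozen=True)
class ProteinRecord:
    protein_id: str
    sequence: str


def to_bronze(raw_lines):
    """Bronze: keep every ingested row as-is, only splitting id from sequence."""
    records = []
    for line in raw_lines:
        protein_id, _, sequence = line.partition("\t")
        records.append(ProteinRecord(protein_id, sequence))
    return records


def to_silver(bronze):
    """Silver: normalize, drop invalid or empty rows, deduplicate by id."""
    seen = set()
    silver = []
    for rec in bronze:
        pid = rec.protein_id.strip()
        seq = rec.sequence.strip().upper()
        if not pid or not seq or pid in seen:
            continue
        if set(seq) - VALID_RESIDUES:  # contains a non-standard character
            continue
        seen.add(pid)
        silver.append(ProteinRecord(pid, seq))
    return silver


def to_gold(silver, k=9):
    """Gold: unique length-k peptides (sliding window is an assumption)."""
    peptides = set()
    for rec in silver:
        for i in range(len(rec.sequence) - k + 1):
            peptides.add(rec.sequence[i:i + k])
    return peptides


# Hypothetical raw input: tab-separated id and sequence, with messy rows.
raw = [
    "P1\tMKTAYIAKQRQISFVK",
    "P2\tmktayiakqr ",        # lowercase with trailing space
    "P3\tMKTXYZ123",          # invalid characters
    "P1\tMKTAYIAKQRQISFVK",   # duplicate id
    "\tAAAA",                 # missing id
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver, k=9)
```

Keeping the raw bronze rows untouched means the silver and gold layers can be rebuilt whenever cleansing rules or peptide definitions change, without re-ingesting the source data.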
In practice
- Implement Medallion Architecture (bronze, silver, and gold layers) for multi-stage data processing.
- Use Unity Catalog for fine-grained data access control.
Topics
- Drug Discovery Acceleration
- Machine Learning in Biotech
- Data Lakehouse Architecture
- Protein Sequence Analysis
- Databricks Platform
Best for: AI Engineer, Data Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.