Calmcode, Explosion, Data Science
Summary
Vincent Warmerdam, a Machine Learning Engineer at Explosion (creators of spaCy and Prodigy), discusses his career, open-source contributions, and insights into the data science field. He describes a journey that began during the rise of random forests and credits blogging and organizing meetups for the recognition that led CTOs to hire him directly. Warmerdam created Calmcode, a free platform offering concise, opinionated 5-minute videos on data science topics, which attracts 10,000-20,000 monthly users. He details his open-source philosophy, driven by solving personal "itches" and building tools like Bulk for bulk labeling using UMAP embeddings. Warmerdam stresses the importance of rephrasing problems, citing a 5% cost reduction for the World Food Programme achieved by focusing on nutrients rather than specific foods. He also advocates for system thinking over isolated component optimization, warns against ML "hype," and advises new data scientists to blog "Today I Learned" snippets and consider analyst roles.
Key takeaway
Data scientists and ML engineers scoping a project should understand the problem and its system context deeply before jumping to complex algorithmic solutions. This approach often yields simpler, more robust outcomes and prevents "artificial stupidity." Consider starting a "Today I Learned" blog to document insights and build an online presence.
Key insights
Effective data science prioritizes problem rephrasing, system thinking, and community engagement over solely optimizing algorithms.
Principles
- Cultivate many small projects; some will flourish.
- Optimize systems, not just individual algorithms.
- Constraints are fundamental in problem optimization.
Method
For bulk labeling, embed data into a 2D UMAP plot, identify clusters, and make selections for efficient annotation.
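The selection step can be sketched in plain Python. This is a minimal illustration, not Bulk's actual implementation: it assumes the 2D coordinates were already produced by a UMAP projection (hard-coded here as hypothetical values), and it replaces interactive selection on the plot with a simple rectangle. The names `Point`, `select_box`, and `bulk_label` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Point:
    text: str
    x: float  # first coordinate of the 2D projection (assumed precomputed, e.g. by UMAP)
    y: float  # second coordinate of the 2D projection
    label: Optional[str] = None


def select_box(points: List[Point], x_min: float, x_max: float,
               y_min: float, y_max: float) -> List[Point]:
    """Return the points that fall inside an axis-aligned rectangle."""
    return [p for p in points if x_min <= p.x <= x_max and y_min <= p.y <= y_max]


def bulk_label(selection: List[Point], label: str) -> int:
    """Assign one label to every selected point; return how many were labeled."""
    for p in selection:
        p.label = label
    return len(selection)


# Hypothetical coordinates: the first three texts form one visual cluster.
points = [
    Point("refund not received", 0.10, 0.20),
    Point("where is my refund", 0.20, 0.10),
    Point("money back please", 0.15, 0.25),
    Point("app crashes on login", 3.00, 3.10),
    Point("cannot sign in", 3.20, 2.90),
]

selected = select_box(points, 0.0, 1.0, 0.0, 1.0)
n_labeled = bulk_label(selected, "refund")
unlabeled = [p.text for p in points if p.label is None]
```

The point of the sketch is that one selection labels a whole cluster at once, so annotation effort scales with the number of clusters rather than the number of examples. Selections still need a quick review, since a 2D projection can place unrelated examples near each other.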
In practice
- Develop internal Python packages for reusable utilities.
- Combine ML models with rule-based systems.
- Test generative models with awkward edge-case tasks.
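One way to combine a model with rules, as the second item above suggests, is to let a few high-precision rules decide first and fall back to the model otherwise. The sketch below is an assumption about how that might look, not a pattern taken from the original article: `model_predict` is a hypothetical stand-in for a trained classifier, and the regular expression and label names are illustrative only.

```python
import re
from typing import Callable, Optional, Tuple

# Illustrative rule: texts mentioning "order #1234" or "order 1234".
ORDER_ID = re.compile(r"\border\s*#?\d+\b", re.IGNORECASE)


def rule_predict(text: str) -> Optional[str]:
    """Return a label when a rule fires, otherwise None."""
    if ORDER_ID.search(text):
        return "order_inquiry"
    return None


def model_predict(text: str) -> str:
    """Hypothetical placeholder for a trained model's prediction."""
    return "complaint" if "broken" in text.lower() else "other"


def predict(text: str,
            model: Callable[[str], str] = model_predict) -> Tuple[str, str]:
    """Apply rules first, then the model; also report which one decided."""
    rule_label = rule_predict(text)
    if rule_label is not None:
        return rule_label, "rule"
    return model(text), "model"
```

Returning the source of each decision alongside the label makes it easier to audit where the rules and the model disagree.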
Topics
- Machine Learning Engineering
- Data Science Career
- Open-Source Software
- Calmcode
- Problem Framing
- Data-Centric AI
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.