Scraps

As machine learning has proliferated over the last decade, there has naturally been an increased focus on the technical personas responsible for enabling and building machine learning models. Most notably, the focus has been on the increasingly popular field of data science. This offshoot of statistics has become one of the most in-demand professions, and universities around the world are responding by developing new data science curricula.

Joey and our other cofounder, Joe Hellerstein, have both been helping develop the data science major at UC Berkeley since 2016. The first data science course taught at Berkeley had 50 students back in Fall 2016, and today Data 100 (the course Joey has been developing since 2017) is the largest upper-division course in the whole UC system.

Berkeley’s data science curriculum has helped set the standard for universities around the world. What’s reflected in that curriculum is somewhat obvious: Python programming (exclusively in Jupyter notebooks), a strong understanding of statistical basics, common machine learning libraries (scikit-learn and PyTorch), the basics of SQL and data manipulation, and, of course, the ability to understand and answer hard questions with data. In short, we teach you everything you need to know in order to build a great machine learning model.

That’s where our curriculum ends at Berkeley. We’ve never given much thought to teaching students what comes after the model is built, partially because students in class don’t have clear business applications, and partially because this responsibility is often dismissed as a software engineering concern. In this worldview, the output of data scientists is a machine learning model: a serialized weights file combined with some Python scripts that make predictions. How does the rest of the business (non-technical colleagues, customers, etc.) benefit from the model? Well, that’s for someone else to figure out.
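To make that worldview concrete, here’s a minimal sketch of the kind of artifact it treats as the finished product: a toy scikit-learn model serialized to disk, plus a script that loads it and makes predictions. The file name and model choice are illustrative assumptions, not anything prescribed by the curriculum.

```python
# Hypothetical example: the "serialized weights file plus prediction script"
# artifact. Trains a toy classifier, pickles it, then reloads it to predict.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a toy model and serialize it to disk.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Elsewhere, a prediction script reloads the file and runs inference.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict([[5.1, 3.5, 1.4, 0.2]]))  # e.g., [0] (setosa)
```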

We believe that’s the wrong way to think about things.

Setting the Right Goal

Over the last year and change, we’ve interviewed 165 data science teams. Most of them are building machine learning models of some sort (though very few are doing anything fancier than decision trees in scikit-learn), but we’ve also found that data scientists can very rarely write a notebook, put their feet up, and move on to the next project. The output data scientists produce isn’t just a machine learning model.

Most of the data scientists we’ve spoken to are in the business of delivering predictions, not machine learning models. It’s not just a question of how to build a good model; it’s also a question of getting that model packaged up, running repeatably and reliably, and available for stakeholders to use. Whether the model is implemented in scikit-learn or PyTorch, whether it’s linear regression or a CNN, whether it has a long featurization process or uses data right out of the database: none of these things matters to the end user, as long as they have the best predictions possible in their hands.
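What does “available for stakeholders to use” look like in practice? One common pattern is putting the model behind a small HTTP service. The sketch below uses Flask; the endpoint name, payload shape, and port are assumptions for illustration, and this is just one of many reasonable designs.

```python
# Hypothetical sketch: the pickled model from above, served over HTTP so that
# non-technical colleagues and downstream systems can request predictions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the serialized model once at startup rather than on every request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON payload like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

Even this toy version hints at the engineering that follows: the service has to be deployed, monitored, and kept in sync with the data the model was trained on.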

The catch here is that delivering predictions means that data scientists can’t stop caring about their models once they’ve been built. Models have to run somewhere (in the cloud!), they have to be reliable, changes in the incoming data have to be detected, and bugs and regressions have to be addressed. Effectively, data scientists need to do a whole lot of software engineering in order to deliver predictions.
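As one example of what “changes have to be detected” can mean, here is a minimal sketch of a per-feature drift check that compares live inputs against a training-time reference sample. The Kolmogorov–Smirnov test and the significance threshold are assumptions; real monitoring setups vary widely.

```python
# Hypothetical sketch: flag input features whose live distribution has
# drifted away from the distribution seen at training time.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.01):
    """Return (feature_index, p_value) for features that appear drifted."""
    drifted = []
    for i in range(reference.shape[1]):
        _, p_value = ks_2samp(reference[:, i], live[:, i])
        if p_value < alpha:
            drifted.append((i, p_value))
    return drifted

# Usage: compare a training-time sample against this week's inputs.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 4))
live = reference.copy()
live[:, 2] += 0.5  # simulate drift in one feature
print(detect_drift(reference, live))  # feature 2 should be flagged
```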

Data Scientists Aren’t Software Engineers


More Scraps — Ignore Everything Below This

We all know that machine learning is the next big thing that’s going to revolutionize the world. We’ve been talking about its potential as it’s made its steady march from niche interest to integral part of our daily lives over the last 15 years; there are a million other blog posts about this, so I’ll spare you the standard stuff. We all know machine learning is important.

What’s received comparatively less attention is the who. Who are the programmers who will build machine learning models over the next decade? The quick answer is obvious (data scientists and machine learning engineers), but we’ve become fascinated by who those people actually are.

Last week, Joey shared his reflections on a decade of research building systems for machine learning. Many of the lessons he shared have directly inspired what we’re building at Aqueduct. But just as critical to our perspective has been the fact that Joey and Joe have been building out the data science major at Berkeley since 2016.

The data science curriculum has grown incredibly quickly in the last five years, and the class they built together, Data 100, is now the largest upper-division course in the UC system. The fascinating thing about the curriculum is that, if you look at the introductory series of courses, a significant majority of the students at Berkeley aren’t coming from traditional computer science backgrounds. They’re either data science majors (focusing primarily on statistics and ML and less on computer science fundamentals) or students from a variety of other majors across the university (biology, chemistry, economics, business, etc.). This persona has become something of an obsession of ours: the Python-fluent, ML-enabled data scientist with expertise in a discipline beyond data science itself.

Critically, nowhere in this skillset (or in the Berkeley curriculum) is there discussion of distributed systems programming, low-level infrastructure configuration, or cloud server and resource management. Frankly, that’s largely by design: data scientists are experts in data cleaning, featurization, and model design, and they’re set up to apply those skills to the research or business problems they’re solving. The skillset required to build Docker containers, set up Kubernetes clusters, and maintain reliable cloud infrastructure is a different one entirely.