Modern data science emerged in the technology sector, from optimizing Google search rankings and LinkedIn recommendations to influencing Buzzfeed editors’ headlines. It is now poised to transform all industries, from retail, telecommunications, and agriculture to health, trucking, and the criminal justice system. Yet the terms “data science” and “data scientist” are not always well understood and are used to describe a broad range of data-related work.
What, exactly, do data scientists do? We had the chance to speak with more than 30 data scientists from a variety of industries and academic disciplines, and we asked them about their jobs, among other things.
Data science is indeed a diverse field. The data scientists we interviewed approached our conversations from a variety of perspectives. They described a wide range of work, including massive online experimental frameworks for product development at booking.com and Etsy, the methods Buzzfeed uses to implement a multi-armed bandit solution for headline optimization, and the impact machine learning has on business decisions at Airbnb. That last example came up during a discussion with Airbnb data scientist Robert Chang, who works on productionized machine-learning models. Data science can be applied in a variety of ways, depending not only on the industry but also on the business and its objectives.
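To make the multi-armed bandit idea concrete, here is a minimal sketch in Python. It uses Thompson sampling, which is one common bandit strategy; our guests did not say which algorithm Buzzfeed runs, so treat the choice as an assumption. The click-through rates, the function name `thompson_sampling`, and the simulated readers are all hypothetical.

```python
import random

def thompson_sampling(true_ctrs, rounds=5000, seed=0):
    """Simulate headline selection with Thompson sampling.

    true_ctrs holds made-up click-through rates that the algorithm
    never sees directly; it only observes clicks.
    """
    rng = random.Random(seed)
    clicks = [0] * len(true_ctrs)  # observed clicks per headline
    shows = [0] * len(true_ctrs)   # impressions per headline
    for _ in range(rounds):
        # Draw one plausible click rate per headline from its Beta posterior
        # (uniform Beta(1, 1) prior, updated with clicks and non-clicks).
        samples = [
            rng.betavariate(1 + c, 1 + s - c)
            for c, s in zip(clicks, shows)
        ]
        # Show the headline whose sampled rate is highest.
        choice = samples.index(max(samples))
        shows[choice] += 1
        # Simulated reader clicks with the headline's hidden probability.
        if rng.random() < true_ctrs[choice]:
            clicks[choice] += 1
    return shows, clicks

# Three hypothetical headlines; the third is the strongest.
shows, clicks = thompson_sampling([0.02, 0.05, 0.12])
```

Unlike a fixed split test, the bandit shifts traffic toward the better headline while it is still learning, which is why the approach suits short-lived content such as headlines.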
This is what data scientists do. We now understand how data science works, at least in the tech industry. To perform robust analytics, data scientists must first lay a solid data foundation. Then they use online experiments, among other methods, to achieve long-term growth. Finally, they construct machine learning pipelines and personalized data products in order to better understand their business and customers and to make better decisions. In other words, data science in technology is concerned with infrastructure, testing, machine learning for decision making, and data products.
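As an illustration of what analyzing an online experiment can involve, the sketch below compares conversion rates between a control group and a variant with a two-proportion z-test. This is a textbook method, not a description of any particular company's framework, and the visitor and conversion counts are invented for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: 10,000 visitors per group,
# 200 conversions in control (A) and 250 in the variant (B).
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
```

A small p-value suggests the difference is unlikely to be noise alone. Real experimentation platforms add much more, such as guardrail metrics and corrections for repeatedly checking results.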
Other industries, besides technology, are making great strides. We spoke with Ben Skrainka, a data scientist at Convoy, about how his company is using data science to transform the North American trucking industry. Sandy Griffith of Flatiron Health spoke with us about the impact data science is having on cancer research. We talked with Drew Conway about his company, Alluvium, which “uses machine learning and artificial intelligence to transform massive data streams generated by industrial operations into insights.” Mike Tamir, now Uber’s head of self-driving, discussed his collaboration with Takt to help Fortune 500 companies leverage data science, including his work on Starbucks’ recommendation systems. This non-exhaustive list illustrates the data-science revolutions under way in a variety of industries.
It’s not just self-driving cars and artificial general intelligence that hold promise. Many of our guests are skeptical not only of the mainstream media’s fetishization of artificial general intelligence (including headlines like VentureBeat’s “An AI god will emerge by 2042 and write its own bible. Will you worship it?”) but also of the recent buzz surrounding machine learning and deep learning. Machine learning and deep learning are powerful techniques with significant applications, but as with all buzzwords, a healthy dose of skepticism is required. Almost all of our guests understand that working data scientists earn their living by collecting and cleaning data, creating dashboards and reports, visualizing data, making statistical inferences, communicating results to key stakeholders, and persuading decision makers of their findings.
The skills data scientists require are changing (and deep learning experience isn’t the most important). “Which skill is more important for a data scientist: the ability to use the most sophisticated deep learning models, or the ability to make good PowerPoint slides?” we asked Jonathan Nolis, a data science leader in the Seattle area who works with Fortune 500 companies. He made the case for the latter, because communicating results is still an important part of data analysis.
Another recurring theme is that these skills, which are so important today, are likely to change in a short period of time. Amid rapid developments in both the open-source ecosystem of data-science tools and commercial, productized data-science tools, we’re seeing increasing automation of a lot of data-science drudgery, such as data cleaning and data preparation. It’s a common refrain that a data scientist spends 80% of his or her time simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis.
This, however, is unlikely to last. As we learned when we dedicated an episode to automated machine learning and spoke with Randal Olson, lead data scientist at Life Epigenetics, a lot of machine learning and deep learning is now automated.
Because of this rapid change, the vast majority of our guests tell us that the most important skills for data scientists aren’t the ability to build and use deep-learning infrastructures. Instead, they are the skills needed to learn on the fly and communicate effectively in order to respond to business questions and explain complex results to non-technical stakeholders. Aspiring data scientists should concentrate on questions rather than techniques. While new techniques come and go, critical thinking and quantitative, domain-specific skills will continue to be in demand.
The importance of specialization is growing. While the field lacks a well-defined career path and offers little support for junior data scientists, we are beginning to see some specialization. Emily Robinson described the difference between Type A and Type B data scientists: “Type A is analysis, sort of like a traditional statistician, and Type B is machine learning model development.”
Jonathan Nolis divides data science into three categories: (1) business intelligence, which is essentially “taking data that the company has and getting it in front of the right people” via dashboards, reports, and emails; (2) decision science, which is “taking data and using it to help a company make a decision”; and (3) machine learning, which is “how can we take data science models and put them continuously into production.” Although many working data scientists are currently generalists who do all three, distinct career paths, such as machine learning engineers, are emerging.
One of the most difficult issues in the field is ethics. As you might expect, the profession entails a great deal of risk for its practitioners. “Do you think that imprecise ethics, no standards of practice, and a lack of consistent vocabulary are not enough challenges for us today?” Hilary Mason replied in our first episode when we asked whether there were any other major challenges facing the data science community.
All three are crucial, and the first two, in particular, are on nearly everyone’s mind when they visit DataFramed. What role does ethics play in a world where so many of our interactions with the world are dictated by algorithms developed by data scientists? In our interview, Omoju Miller, GitHub’s senior machine learning data scientist, said:
“We need that ethical understanding, that training, and something akin to the Hippocratic oath. And we need proper licenses so that if you do something unethical, you have some kind of penalty, or disbarment, or some kind of recourse, something to say this is not what we want to do as an industry, and then we need to figure out how to remediate people who go off the rails and do things because they aren’t trained or know what they’re doing.”
Data science can have serious, harmful, and unethical consequences. One example is the COMPAS Recidivism Risk Score, which, according to ProPublica, has been “used across the country to predict future criminals” and is “biased against blacks.”
We’re approaching a consensus that ethical standards should come from data scientists themselves, as well as from legislators, grassroots movements, and other stakeholders. Part of this movement is a renewed emphasis on interpretable models over black-box models. That is, we must develop models capable of explaining why they make the predictions they do. Deep learning models are excellent at many things, but they are notoriously difficult to understand. A lot of dedicated, intelligent researchers, developers, and data scientists are making progress here with projects like LIME, which aims to explain what machine learning models are doing.
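To give a flavor of what explaining a prediction can mean, here is a minimal sketch of a perturbation-based sensitivity check: nudge one input at a time and record how the prediction moves. This is a far simpler relative of what LIME does (LIME fits a local surrogate model over many perturbed samples) and does not use the LIME library. The scoring function `black_box` and its feature names are hypothetical stand-ins for a trained model.

```python
def local_sensitivity(predict, instance, step=1.0):
    """Change in prediction when each feature is raised by `step`,
    holding the other features fixed at the instance's values."""
    baseline = predict(instance)
    effects = {}
    for name, value in instance.items():
        nudged = dict(instance)       # copy so the original is untouched
        nudged[name] = value + step
        effects[name] = predict(nudged) - baseline
    return effects

# Hypothetical black-box scoring function standing in for a trained model.
def black_box(x):
    return (0.5 * x["prior_visits"]
            + 0.1 * x["age"] * x["prior_visits"]
            - 0.05 * x["age"])

# For this one individual, which input moves the score most?
effects = local_sensitivity(black_box, {"age": 30.0, "prior_visits": 2.0})
```

The result describes the model's behavior only near this one input. Because the hypothetical model contains an interaction term, the same feature can matter more or less for a different individual, which is part of why local explanations need careful reading.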
The data science revolution is just getting started across industries and society as a whole. It’s unclear whether the title of data scientist will remain the “sexiest job of the twenty-first century,” become more specialized, or become a set of skills that all working professionals must have. “Will we even have data science in 10 years?” Hilary Mason asked. “I recall a time when we didn’t, and I wouldn’t be surprised if the title became synonymous with ‘webmaster.’”