Data science jobs in investment banks are the new hot thing. J.P. Morgan's got a chief data science officer. So does Morgan Stanley. So does Deutsche Bank. Banks that once wanted traders and salespeople now want data people with finance "domain knowledge" as an aside. As Bloomberg's Matt Levine points out, finance is being reduced to a set of data challenges. All rise the data-crunchers.
Except that people in the industry suggest data science jobs aren't as exciting as they seem. If you think you fancy being a data scientist, you might as well know the truth.
"Only around 10% of data science is science," declares Dominic Connor, a veteran quant headhunter in London. "The rest of your time will be spent cleaning the data - sucking it out of whatever stupid format it comes in, and subjecting it to a sanity check so that it's actually usable."
People working at the data science coal face wholeheartedly agree.
"It's universally true that data is crap," says Jeff Holman, chief investment officer of San Francisco-based Sentient Technologies, the artificial intelligence (AI) company which has developed an AI trading system. "Most of your time is spent acquiring and cleaning data and trying to automate those two steps - and for whatever reasons, the responsibility for this lies in the lap of the data scientists who should be using the data to develop machine learning."
The disconnect between the aspirations of data scientists and the reality can cause disappointment, especially for data scientists who haven't worked in the real world. "When people have done empirical work during their PhD they're used to it," says Holman. "When they're coming from a theoretical perspective and are applying their techniques for the first time, they're in for a shock."
"It can be frustrating," says Alexey Loganchuk, a former J.P. Morgan derivatives trader who now runs Upgrade Capital, a New York-based buy-side talent scout focused on big data. "When you look at the data programs at top universities you'll find students who are very interested in complex modelling techniques, but when you look at the data jobs in hedge funds it's usually about finding new data sets, evaluating them, and making them accessible."
Loganchuk says the really critical piece of value creation in data science is this so-called "data wrangling," which academic institutions don't focus on. Anyone can scrape the web for easily accessible company data, but web scraping jobs are "commoditized" and the data they deliver rarely offers new insights. "The most valuable data sets for hedge funds are those no one is looking at and those are rarely easy to access and analyse," says Loganchuk. "If you are looking at a data set that everyone else can access, there is not much of an edge to be gained from it."
For this reason, data science is less about complex modelling and machine learning and more about "data discovery" and wrangling. Classic examples include satellite data that tracks deliveries of raw materials to China before ships have even docked, or the number of cars parked alongside retail outlets and restaurants. "Our students looked at satellite imagery from RS Metrics and found that if you compare the number of cars parked in Chipotle parking lots with those parked alongside nearby rivals, you get an idea of competitor performance," says Loganchuk. Even the most desirable hedge fund data jobs can be prosaic. Winton Capital, the systematic hedge fund, recently wrote a blog post about its use of DNS records as a proxy for technology leadership among S&P 1500 companies, for example.
If hedge fund data jobs are unexciting, banks' data jobs can be even less so. Connor points out that banks' big need isn't for data scientists to work on the trading floor, but for data scientists to work in compliance: "Banks have got to the stage where they urgently need to automate their compliance activities. There simply aren't enough experienced compliance professionals out there."
Loganchuk says data scientists at big banks are often the most disillusioned of all: "These tend to be very large organisations with limited opportunities for greenfield development." He adds that every data scientist's dream is to work with data to solve a problem no one has ever solved: "In a bank, you might be tasked with creating a slightly better model for predicting default risk in credit cards or with identifying instances of fraud. It's certainly valuable work, but it won't excite a Millennial with a dream of making the world a better place."
None of this is likely to change any time soon. There's no getting away from the fact that data science in finance is done for commercial ends - although hedge fund Two Sigma allows its data scientists to work on charitable and environmental data projects alongside their day jobs. Nor is there any avoiding the data discovery and data wrangling aspects of the role. Holman says that delegating these tasks causes problems: "You make a lot of small assumptions along the way in how you handle and truncate the data. This needs to be done by the people who are going to be consuming the data and creating the models, otherwise it doesn't always work." Loganchuk points out that more and more "exhaust data" is becoming available: "There is more data being created - everything has a sensor on it, from the clothes you wear to individual screws holding together oil rigs. It's only a matter of time before these datasets become large enough for hedge funds to become interested."
This isn't to say that data science in finance isn't interesting - as long as you go in with your eyes open. And the alternatives may not offer opportunities to use nice clean data to save the world either. Sentient's use of machine learning and data science isn't restricted to finance - it also applies its technologies to other sectors. One of these is healthcare, where Holman says they developed a system that accurately predicted sepsis in an intensive care unit. However, the biggest growth area is online shopping, where scientists busy themselves using data signals to encourage people to buy more. In this context, sifting through messy compliance data in a bank looks quite worthy after all.