You hear a lot about Big Data, but a lot of it is a big mess — a headache for drug-development companies trying to get useful information out of the mountain of data they generate in research. It is also an opportunity for entrepreneurs who can help them.
I met one recently, Emerson Huitt, who is CEO of a software startup called Snthesis. The nine-employee Durham company, founded in 2018, has revenue of around $1.5 million. Huitt anticipates doubling that in 2023 and tripling his customer count.
Snthesis is focused on life sciences companies struggling to manage their data. This is a real problem. In order to raise money from investors and move a drug through the regulatory process, they must answer a lot of questions, and if the data is disorganized, answering them can be very difficult. And as companies grow, moving from spreadsheets and paper notebooks to a lab information management system can be a long, expensive process.
“The sweet spot for us, as customers,” says Huitt, “are either very, very large companies where they have data management, but they have so many different systems and groups using different databases that they need an integration between all those, or companies that are hitting that point where they can’t do work with their data anymore. Kind of around 70 to 100 people.
“And that’s where we get a lot of traction,” says Huitt, “because they’re far enough along in their research process that they have a ton of data. And they are typically capital stable, so they can invest in things like software, and they’re starting to hit a lot of friction because they can’t answer questions quickly.”
Huitt, a Boone native with a bachelor’s degree in computer science and biology from the University of Maryland, started a software company in 2006 to help the rising tide of millennials buy houses. That ran into the housing crash. He thought about going back to school for a Ph.D., and decided to work for a local vaccine research company to get lab experience and boost his credentials for grad school. He discovered that their “scientific data was just an absolute mess.” Scientists were emailing Excel files back and forth and had to be sure they had the latest version of whatever was being passed around. A lot of data was in paper lab notebooks. Some was on paper towels.
“Everybody that I know who’s worked in a lab has done that before where you don’t have your notebook handy, you’re in the lab, and you’ve got to write something down really quick, so [you] scrawl on a paper towel and you tape that into your lab notebook, and that’s your data collection.”
Huitt started up Snthesis in mid-2018. Initially, it was just Huitt, working out of the Frontier, a cluster of old IBM buildings in Research Triangle Park renovated for small offices and co-working space for startups. “I would just work in the lobby. I didn’t have money for an office,” he says.
By November, he started bringing in revenue, which meant that he could hire and get an office in downtown Durham.
His competitive advantage was understanding how small drug-discovery companies work. They don’t have money to spend on expensive data management systems. So they capture a lot of data on spreadsheets. “And every bit of capital that you can spend on reagents and things like more mice for your experiments . . . or more petri dishes means that you have better data. So, your conclusions that you’re presenting to your investors are better.”
What happens, to use a simple example, is that when three scientists are looking at petri dishes and trying to count how many cells are alive, there might be columns for cell type, cell count, and temperature. “So that’s three pieces of information you want. And the three scientists label each column differently. One scientist uses the label ‘count,’ and another uses ‘CC.’ The type and temperature columns get labeled three different ways, too.
“We use natural language processing technology to try to intelligently match up those columns,” says Huitt. “And, obviously, there’s a lot that goes on behind the scenes to allow scientists to set thresholds for what is an acceptable match for them, and what’s not, so you don’t have crazy matches. Our goal is to fit 80% of the data, so that humans can focus on the 20% that’s too messy for the state of natural language processing.”
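Snthesis hasn’t published its matching code, but the basic idea of harmonizing variant column labels against a threshold can be sketched with simple string similarity. In this toy illustration (the canonical field names, aliases, and 0.6 threshold are all hypothetical), each scientist’s label is mapped to a canonical field, and anything below the threshold is left for a human:

```python
from difflib import SequenceMatcher

# Hypothetical canonical fields and known aliases; a real system would
# learn and extend these mappings over time.
CANONICAL = {
    "cell_type": ["type", "cell type", "celltype"],
    "cell_count": ["count", "cc", "cells alive", "cell count"],
    "temperature": ["temp", "temperature (c)", "incubation temp"],
}

def best_match(label, threshold=0.6):
    """Return the canonical field whose name or aliases best match
    `label`, or None if nothing clears the user-set threshold."""
    label = label.strip().lower()
    best, best_score = None, 0.0
    for field, aliases in CANONICAL.items():
        for candidate in [field] + aliases:
            score = SequenceMatcher(None, label, candidate).ratio()
            if score > best_score:
                best, best_score = field, score
    return best if best_score >= threshold else None

# Three scientists, three labelings of the same column
for label in ["Count", "CC", "Cell Count"]:
    print(label, "->", best_match(label))
```

The threshold plays the role Huitt describes: tightening it avoids “crazy matches” at the cost of leaving more columns for humans to resolve.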
It’s more complicated than three spreadsheets.
“A lot of what companies are doing, is they’re ending up with five or six different systems over the course of time. They might have a compound registry where they’re keeping track of their compounds. They might have an electronic lab notebook tool where they keep track of exploratory stuff if they’re not keeping it in paper lab notebooks. They might have a system for aggregating some experimental data that’s considered high-throughput experimental data. And they may have a homegrown database for doing things like machine learning, and then they have Excel files, which is all of the data that doesn’t fit in any of those systems, but they still need.”
And then they end up with highly paid Ph.D.s spending most of their time wading through data.
“There was one poor doctor that we talked to,” says Huitt. “Their company was getting prepared for clinical trials. They’re preparing all this data to submit to the FDA, and they have to answer all these questions about their compound. Management’s saying, ‘Hey, we have to know about this compound, x, y and z results.’ And she said ‘I’ve been working nights and weekends for the past two months. It takes me two weeks to answer any of these questions, and every week I get 10 questions. So, I’m never gonna catch up.’ She was almost to the point of tears during this meeting.”
As life sciences companies get large enough, there are plenty of lab information management systems on the market, from vendors like ThermoFisher Scientific and Benchling. “Once your process is very rigid, you can use one of those systems and repeat the same experiment 10,000 times very easily. But at the beginning of the process, where you might do 5,000 experiments two or three times, it’s very messy data,” he says. “A lot of what our software does is not only harmonize the data, but integrate it into a single database, and we have search functionality that allows scientists to visually construct those first, complicated queries that a data scientist might write with a language like SQL. But we want people to be able to do that in a more hands-on way. We want to put the ability to query in the hands of scientists that aren’t computationally sophisticated.”
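A visual query builder of the kind Huitt describes ultimately has to emit something like the SQL a data scientist would write by hand. As a rough sketch (the `results` schema and operator whitelist here are invented for illustration), the builder collects simple (column, operator, value) filters from the scientist and assembles a parameterized query:

```python
import sqlite3

# Hypothetical experiment table; real schemas will differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results
    (compound TEXT, cell_type TEXT, cell_count INTEGER, temperature REAL)""")
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", [
    ("SNT-001", "HeLa", 4200, 37.0),
    ("SNT-001", "HEK293", 150, 37.0),
    ("SNT-002", "HeLa", 9800, 42.0),
])

def build_query(filters):
    """Turn (column, op, value) triples, the kind a visual builder
    collects from dropdowns, into a parameterized SQL query."""
    allowed_ops = {"=", "<", ">", "<=", ">="}
    clauses, params = [], []
    for col, op, val in filters:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{col} {op} ?")
        params.append(val)
    sql = "SELECT * FROM results"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# "Show me HeLa experiments with more than 5,000 live cells"
sql, params = build_query([("cell_type", "=", "HeLa"),
                           ("cell_count", ">", 5000)])
rows = conn.execute(sql, params).fetchall()
```

The point of the visual layer is that the scientist never sees the generated SQL; they only manipulate the filters.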
One opportunity may come from the largest lab information management system vendors, who are trying to sell to life sciences companies as they evolve from small startups to larger, better-funded businesses. The problem is migrating years of spreadsheets and other files onto new platforms. “And you might easily spend a million dollars manually cleaning up that data,” says Huitt. “A lot of time what vendors say is, we do that, and our professional services rate is $300 to $400 an hour. And companies are paying that, but it costs.
“And actually, a very interesting avenue for us is we’re talking to a couple of those large vendors now that have those professional services groups, and the folks in those professional services groups look just as haggard as that lady I was talking about earlier, trying to answer those questions herself. We’re talking to those companies right now to use our platform behind the scenes to accelerate their professional services practices. If they can take 60 to 80% of the workload off these professional services folks, they can get people on to their platform much, much faster.”
The challenge – and opportunity for Snthesis – is that life sciences data is growing rapidly, particularly in North Carolina, an industry hub. “Biological data doubles every seven months.” And it is not only Big Data, but deep data, with massive amounts of metadata about each piece of data. “The metadata is often in a worse state than the data,” says Huitt.
“That’s all going into the spreadsheets or into other files. And so, one of the other things we do, we don’t just process spreadsheets. We process all those other files too, and recognize things like sample identifiers, so that we can form the web of relationships between that data in an automated fashion. So, if you ask the system, give me all of the sequences related to sample xyz, you can pull all of those sequences up and figure out where they’re stored on your cloud computing system, or your internal data warehouse, where those actual genomic sequence files live.”
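The automated relationship-building Huitt describes amounts to scanning heterogeneous files for recognizable identifiers and indexing which files mention which samples. A minimal sketch, assuming a made-up `SAMP-NNNN` identifier format and toy in-memory file contents (real labs use wildly varied ID schemes and file stores):

```python
import re
from collections import defaultdict

# Hypothetical sample-ID pattern: "SAMP-" plus four digits.
SAMPLE_ID = re.compile(r"\bSAMP-\d{4}\b")

def index_files(files):
    """Scan file contents for sample identifiers and build the
    sample -> files relationship map in one automated pass."""
    index = defaultdict(set)
    for path, text in files.items():
        for sample in SAMPLE_ID.findall(text):
            index[sample].add(path)
    return index

# Toy stand-ins for sequence metadata, notebook exports, spreadsheets
files = {
    "s3://lab/seq/run42.fastq.meta": "run 42, source SAMP-0017",
    "notebook_2023.txt": "Treated SAMP-0017 and SAMP-0020 at 37C",
    "counts.csv": "sample,count\nSAMP-0020,4200\n",
}
index = index_files(files)

# "Give me everything related to sample SAMP-0017"
related = sorted(index["SAMP-0017"])
```

Once that index exists, a question like “all sequences related to sample xyz” becomes a single lookup, pointing back to wherever the underlying files actually live.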
And so, companies can answer questions quickly, such as “Can I file a patent about this? Can I demonstrate a chain of custody that links back to a legal agreement that I am legally allowed to use this sample for x, y and z purposes?”
He talked about a group of researchers he visited that was imaging zebrafish throughout their entire life cycle as part of the development of cancer treatments. “And these files are huge, like 10 terabytes per file per fish. And this lab, they have literally a storage closet full of boxes and boxes and boxes of hard drives where they’ll treat a fish with a particular compound to see how its cancer reacts. The cure for cancer could very well be on one of those hard drives.”