The Role of APIs in Genomics

When some people think of APIs, they think of things like marketing automation or maybe powering a smart home. In other words, it’s very easy to associate APIs with frivolity or, to put it less harshly, simply making things more convenient.

But as we’ve already seen in articles on this blog, APIs are capable of much more than that, perhaps even saving lives. Genomics is another area where APIs are not only capable of having a huge impact but are already doing so, through services like SMART Genomics, 23andMe, and Google Genomics.

Below, we’ll look more closely at these services and uncover why APIs and genomics are such a great match.

Genomics 101

Unlike genetics, which is the study of a single gene, genomics looks at all of an organism’s genes. Right off the bat, we can see why that would be a lot more data intensive than examining a single gene.

It’s also an incredibly useful field of study because, while genetics is often seen as focusing on identifying “problem genes” that may be responsible for diseases and the like, genomics has the potential to identify how and why an individual’s genetic makeup makes them immune to diseases.

When you think about some of the features of the field of genomics – large volumes of complex data that, ideally, need to be shared between research institutions over the web – it quickly becomes apparent that using APIs to manage genomics data is a slam dunk.

The Genomics Problem

One of the most prominent genomics projects in the UK is the 100,000 Genomes Project, run by Genomics England. This initiative grants researchers access to, as the name suggests, 100,000 genomes from tens of thousands of different people.

Currently, data is gathered from the NHS and de-identified before being funneled to a data center. Researchers and academics must then apply for access. If permission is granted, they work with the data through a virtual computer, and that access is monitored and recorded.

If you’ve ever used a virtual computer, you’ll already know that there are a few problems associated with them:

  • Processing power, speed, etc. are limited by the resources of the real machine hosting it
  • It exists on a different network to the rest of your hardware
  • Downloading/sharing data, particularly in a secure way, can be difficult

In other words, there are some tricky limitations associated with dealing with genomics data using virtual machines. And we haven’t even begun to talk about the problem of size…

Genomics and BIG Data

In an article for ComputerWeekly, Jim Davies, CTO at Genomics England, talks about the massive size of genomics data:

“You’re dealing with files of 150GB each…The application programming interfaces (APIs) and abstractions of that data are still under development – there is a lot of work going on with global standards.”

In the same article, Anthea Martin of Cancer Research UK states that “computer power has advanced, allowing us to analyze that massive pile of data” but later highlights the fact that data transmission is a huge issue for existing computer networks, remarking that members of her team have resorted to sending and receiving hard drives by post. Hardly the most secure way of sending information that may have serious data protection issues surrounding it!

Could APIs, through a combination of private access keys and database abstraction, offer a more effective solution? It sure seems like it. But we should point out that RESTful data exchanges are still subject to the limits of processing speeds on the web, so APIs aren’t necessarily a perfect solution where terabytes of data are involved. That said, let’s look at how three genomics API services are already making waves in this space.
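To make the transfer problem concrete, here is a minimal sketch of how a client might split one large file into authenticated, ranged HTTP requests instead of shipping it whole. The 150GB file size comes from the quote above. The `X-API-Key` header name, the key value, and the 1 GiB chunk size are assumptions made up for illustration and do not describe any real genomics service. Only the `Range: bytes=start-end` header format is standard HTTP.

```python
from typing import Dict, Iterator


def byte_ranges(total_size: int, chunk_size: int) -> Iterator[str]:
    """Yield HTTP Range header values that together cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # Range end positions are inclusive, hence the "- 1".
        end = min(start + chunk_size, total_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1


def request_headers(api_key: str, byte_range: str) -> Dict[str, str]:
    """Build headers for one chunk request.

    "X-API-Key" is a hypothetical header name; a real service
    would document its own authentication scheme.
    """
    return {"X-API-Key": api_key, "Range": byte_range}


FILE_SIZE = 150 * 10**9   # 150GB, the per-file figure quoted above
CHUNK_SIZE = 2**30        # 1 GiB per request (assumed)

ranges = list(byte_ranges(FILE_SIZE, CHUNK_SIZE))
first_headers = request_headers("example-key", ranges[0])

print(len(ranges), "requests needed")
print(first_headers)
```

Chunking does not make the network any faster, but it does let a failed transfer resume from the last completed chunk rather than from the beginning.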


23andMe

When it comes to making genomics accessible to a wider audience, the conversation has to start with the services offered by 23andMe. 23andMe is an interesting case for us because the company offers an API that allows developers to access genomic data to create apps. The downside? It’s only possible to access information about your own genome.

That offers some interesting potential for apps that let users plug in their genomic information and combine it with other data sources. It could help to suggest nutrition and exercise programs through MyFitnessPal, for example, or integrate with Apple’s Health app to suggest preventative healthcare measures for those at risk of diabetes.

But it does effectively close the door on any wider research implications because of data protection. Unfortunately, because of the sensitive nature of genomics data, that will probably end up being the downfall of 90% of genomic services. Or will it…?

Google Genomics

Google Genomics is such an enticing proposition because it facilitates access to petabytes, “rapidly growing towards exabytes” in Google’s own words, of genomic data for research groups. So where does all this data come from? Researchers have to bring their own.

In its whitepaper, Google Genomics provides the compelling case study of the MSSNG Project, managed by Autism Speaks. With 100 terabytes of data from 1,300 genomes (plus more queued), equating to around 10,000 individuals, it’s a great example of Google Genomics’ scalability.

At $0.022/GB per month, storing genomic data with Google Genomics (based on the ballpark figure provided by Jim Davies above) works out at around $3 per individual per month. A piece of research, like the study above, that involves 10,000 people could become prohibitively expensive quickly for those with limited funding.
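The arithmetic behind that estimate can be sketched in a few lines. The price and the 150GB per-individual figure are the ones quoted in this article. The function name is made up for illustration, and the sketch assumes one file per person and counts storage only, ignoring any compute, query, or transfer charges.

```python
PRICE_PER_GB_MONTH = 0.022  # USD per GB per month, as quoted above
GB_PER_INDIVIDUAL = 150     # ballpark file size from Jim Davies' quote


def monthly_storage_cost(individuals: int,
                         gb_per_individual: float = GB_PER_INDIVIDUAL,
                         price_per_gb: float = PRICE_PER_GB_MONTH) -> float:
    """Return the storage-only cost in USD per month (hypothetical helper)."""
    return individuals * gb_per_individual * price_per_gb


per_person = monthly_storage_cost(1)
cohort_monthly = monthly_storage_cost(10_000)
cohort_yearly = cohort_monthly * 12

print(f"Per individual: ${per_person:,.2f} per month")
print(f"10,000 people:  ${cohort_monthly:,.2f} per month")
print(f"10,000 people:  ${cohort_yearly:,.2f} per year")
```

Under those assumptions the figures come to about $3.30 per individual per month, $33,000 per month for a 10,000-person study, and roughly $396,000 per year.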

While other researchers can be granted individual access to all of this data, Google doesn’t seem to have any plans to make it more widely available to the public or medical professionals not in on the project. That’s unfortunate, because it limits the possibilities of looking at how diseases or conditions (or susceptibility to them) affect others with genomic similarities.

SMART Genomics

One of the most exciting genomics API projects on the market is SMART Genomics because of its direct healthcare implications. A project at Harvard, SMART Genomics interfaces with the FHIR (Fast Healthcare Interoperability Resources) framework and was developed to resolve the lack of genomics communication standards. The end goal is an easy method for doctors or web apps to query a patient’s genomic information.

The project is open sourced, and uses three main resources to help standardize the cataloguing and retrieval of genomics data:

  • Sequence – Stores data on patient amino acid, RNA or DNA sequences.
  • SequencingLab – Container resources that hold results about specific labs used to sequence genomes.
  • GeneticObservation – Resources that relay information about phenotype and genotype.
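As a rough illustration of what querying these resources could look like, the sketch below builds a search URL following FHIR's general REST convention, where resources of a given type live at `[base]/[type]` and search criteria are passed as query parameters. The base URL, the `patient` parameter, and the patient ID are assumptions made for the example. They are not taken from the SMART Genomics documentation, which should be checked for the actual endpoints and search parameters.

```python
from urllib.parse import urlencode

# The three resource types named in the list above.
RESOURCE_TYPES = ("Sequence", "SequencingLab", "GeneticObservation")


def search_url(base_url: str, resource_type: str, **params: str) -> str:
    """Build a FHIR-style search URL of the form [base]/[type]?param=value."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type}")
    url = f"{base_url.rstrip('/')}/{resource_type}"
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical server and patient ID, for illustration only.
url = search_url("https://fhir.example.org/api", "Sequence", patient="12345")
print(url)
```

A real app would also need to complete an authorization step before sending the request. That step is left out here.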

For more on SMART Genomics, read the developer’s guide.

“SMART Genomics provide support for using genomic information in the healthcare environment by providing new resource definitions and resource extensions to the existing SMART on FHIR framework. Using SMART Genomics, medical apps can quickly access a patient’s genomic information in conjunction with their standard clinical data to provide high quality, personalized care to their patients.”

In other words, SMART Genomics represents a move toward standardizing a vast treasure trove of human information that could be critical to diagnosis and preventative medicine. In an issue of Genetic Engineering & Biotechnology News, Kate Marusina, Ph.D., suggests that a genomic singularity is near, saying “once we’ve uploaded our genomes and the net is abuzz with our biosensor data, virtual doctors and clinics could become very real.”

Genomics: Another Industry Ripe for Standardization with APIs

The most significant barrier to genomics adopting APIs more widely is, as it is in so many other fields, a lack of consistency or standards. To return to Jim Davies’ views on the subject: “We don’t yet have a definitive stable data management architecture.”

To put it another way, there’s no standard for data abstraction or annotation when it comes to genomic sequences and genetic data. Establishing these standards requires a confluence of pure API tech with incredibly complex scientific principles, and that’s not an easy ask.

The implications for healthcare and lifestyle improvements, if swathes of genomics data were made public via APIs, are huge. Beyond identifying genetic components of diseases or conditions, it might be possible to research the impact of behaviors like diet and exercise on the onset or avoidance of those conditions.

But it’s tricky to even start talking about that without considering how funding would be affected by breakthroughs that rely on genomic information pulled from multiple institutions via APIs.

Let’s say that Joe Public identifies that lifting weights appears to delay, say, the onset of dementia. If he’s using data from three different clinical trials, a bunch of data uploaded by the weightlifters via a fitness app and APIs maintained by two different research institutions, who gets credit for the discovery?

As a result, even though some services like SMART Genomics appear to be starting to break down barriers, we should temper any expectations of true breakthroughs in the API space for genomics. But, given the potential there, we’d say that it’s not a case of if but when.