Once you fall down the genealogical rabbit hole, it's hard to find your way back out. My journey began with my grandfather, a polio survivor confined to a wheelchair who took to computers in his later years. One of his passions was researching his ancestors, and the tool he used to collect his findings was Brøderbund's Family Tree Maker. I was fascinated by the charts and tables that he'd print out on his bubble jet printer, but I didn't have the patience for all the data entry.
Fast forward twenty years and we have software and social networks built around genealogical data. The ease of sharing data with one another means that it's very likely that someone else has already found what you're looking for. We're all related, after all.
And now, with personal genomics services like 23andMe and deCODEme, we can ship away a cotton swab with some spit on it, and explore our genetic connections even more closely. If we open up and share that genetic data with one another, there's a lot we could discover about human phenotypes: how our height, eye color, and preferences for certain foods connect us and shape our lives and health.
openSNP is a non-profit, open source web application project that allows users to take consumer genotype tests and upload the raw data so that it's accessible to everyone. The tool parses and annotates the data, and lets users share it with others. I spent some time chatting with one of the founders of the project, Bastian Greshake, about why he started openSNP, what technology the project uses, and how they actively try to scare their users away before getting them to sign up.
Tell us the story behind the site. I understand that you were frustrated with the data available within the 23andMe test and wanted to explore it more completely. What led you to create the project?
Frustration is definitely the right word here. Back in 2011, when I was genotyped through 23andMe, there were few ways to publish your data. The first thing I did was publish it to GitHub, and some other people did similar things. But then you ended up with data on GitHub, Google Code, personal blogs, and who knows where else. Searching for those data sets took forever, they weren't annotated with phenotypes, and in the end, it wasn't worth the time.
That's when I approached Philipp Bayer, the co-creator, and asked him what he thought about doing a small side project, allowing people to publish their genetic data in a repository that's not only easily searchable but also offers APIs for standardized access to the data.
Can you elaborate a little bit on what sort of technology is involved in parsing and annotating the data?
The main technology used for openSNP is Ruby on Rails. When we started to work on the project, Philipp and I already had some experience from different projects we worked on separately while still studying.
For the moment, the parsing of the 23andMe data is also done using Ruby, and the data are stored in a Postgres database. Management of the parsing jobs (and all other background jobs) is done using Sidekiq, with Redis serving as the message queue. Our search backend is provided by Solr. We monitor errors and performance using New Relic and Errbit.
For annotating the data, we are largely relying on the JSON and XML APIs of different services: the SNPedia API, the Mendeley API, and the PLOS API. We also regularly download all SNP annotations from genome.gov and from the Personal Genome Project.
So technology-wise, openSNP is really a mixed bag, which can be challenging in a good way, as you keep up to date on many technologies, but also in a bad way, when things break down.
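To make the parsing step mentioned above more concrete, here is a minimal sketch of reading a 23andMe-style raw data export. It is written in Python for illustration, not the Ruby the project uses, and it is not openSNP's actual parser. It assumes the common layout of such files: comment lines starting with "#", followed by tab-separated rows of rsid, chromosome, position, and genotype. The names Genotype and parse_23andme are hypothetical, and the sample rows are made up.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Genotype(NamedTuple):
    """One row of a raw genotype export (hypothetical record type)."""
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(handle: TextIO) -> Iterator[Genotype]:
    """Yield one Genotype per data line of a 23andMe-style raw export."""
    for line in handle:
        line = line.rstrip("\r\n")
        # Header and comment lines start with '#'; blank lines are skipped.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows rather than guessing
        rsid, chromosome, position, genotype = fields
        yield Genotype(rsid, chromosome, int(position), genotype)


# Made-up sample rows, for illustration only.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "i0000003\tMT\t16\t--\n"
)

records = list(parse_23andme(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

In a setup like the one described, a function of this kind would run inside a background job, with each yielded record written to the database.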
Aside from satisfying curiosity, what benefits does a tool like openSNP offer?
This really depends on the perspective of the user. I know many people who are using openSNP as individuals for their private research. Say you have a rare disease that might possibly be genetic. openSNP allows you to look for others who share your symptoms and then you can even compare your genetic makeup if you want to. In that way it's also social, because people with shared symptoms can get in touch with one another.
Then there are teachers at universities who do courses on human genetics. With openSNP, they have the chance to use real data for their studies.
On the other side of things, you have researchers who are really keen on using the data provided through openSNP for their own studies. While the cost of genetic analysis has fallen faster than Moore's Law over the last couple of years, creating huge amounts of data is still prohibitively expensive in many instances. This is why open data holds such great promise for biomedicine. While the data hosted with openSNP alone still isn't enough to perform large-scale studies, it is a great proof of principle for the willingness of people to share.
The disclaimer on the site is pretty straightforward. There's zero privacy or anonymity. How would you approach someone who is concerned about their privacy and convince them of the benefits of openSNP?
In short: we wouldn't. We scare away people on purpose. Our aim is not to convince people to share their data. There are many legitimate (and maybe some less legitimate) concerns about sharing genetic data with the public domain. Dystopian visions like the movie Gattaca come to mind easily. And we wouldn't want anyone to regret the decision to share data through openSNP.
So we state what we consider to be amongst the worst case scenarios. And if you're still in for sharing your data, despite those risks? That's awesome, thanks for sharing. You're contributing to a project that will be of use to many people to come. But if you're not completely sure you want to do it? Feel uncomfortable with the idea and have the slightest doubt? Please, just do not do it. We won't judge you. There are tons of reasons why some people might deem the risks acceptable while others don't.
Are there any other open tools for those interested in researching their genealogy or genotypes? Any tools that need open source alternatives?
If you like the open source programming language R, there's a tutorial on analyzing 23andMe data using Bioconductor.
PLINK v2 has support for 23andMe files, but it's a tool to run genome-wide association studies and isn't very useful for just a single genotyping.
This script can convert 23andMe files to the common VCF format. The VCF file can then be fed into common SNP analysis tools like SnpEff, which gives a very detailed gene-by-gene report on the effects of each SNP.
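As a rough illustration of what such a conversion involves, here is a minimal Python sketch. It is not the script mentioned above. VCF records need a reference allele, which the 23andMe file does not contain, so this sketch assumes a hypothetical lookup table, reference_alleles, mapping (chromosome, position) to the reference base. A real converter would read this from a reference genome. It also assumes genotypes are reported on the same strand as the reference, and it skips no-calls, indels, and single-allele calls.

```python
from typing import Dict, Iterable, List, Optional, Tuple

VCF_HEADER = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
]


def to_vcf_line(rsid: str, chrom: str, pos: int,
                genotype: str, ref: str) -> Optional[str]:
    """Build one VCF data line, or return None for calls we do not handle."""
    # Only diploid SNP calls made of A/C/G/T are converted in this sketch.
    if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
        return None
    # Collect the distinct non-reference alleles in order of appearance.
    alts: List[str] = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)
    alleles = [ref] + alts
    # GT uses allele indexes: 0 is REF, 1 is the first ALT, and so on.
    gt = "/".join(str(i) for i in sorted(alleles.index(b) for b in genotype))
    alt_field = ",".join(alts) if alts else "."
    return "\t".join(
        [chrom, str(pos), rsid, ref, alt_field, ".", ".", ".", "GT", gt]
    )


def convert(records: Iterable[Tuple[str, str, int, str]],
            reference_alleles: Dict[Tuple[str, int], str]) -> List[str]:
    """Convert (rsid, chrom, pos, genotype) rows into VCF lines."""
    lines = list(VCF_HEADER)
    for rsid, chrom, pos, genotype in records:
        ref = reference_alleles.get((chrom, pos))
        if ref is None:
            continue  # no reference base known for this site
        line = to_vcf_line(rsid, chrom, pos, genotype, ref)
        if line is not None:
            lines.append(line)
    return lines


# Made-up rows and reference bases, for illustration only.
rows = [("rs0000001", "1", 1000, "AG"), ("rs0000002", "1", 2000, "--")]
vcf_lines = convert(rows, {("1", 1000): "A", ("1", 2000): "C"})
print("\n".join(vcf_lines))
```

The no-call row is dropped here, so only the first row appears in the output, as a heterozygous site.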
Any other ways in which you use open source software or philosophies in daily practice?
Being a PhD student in bioinformatics, I would say around 95% of the software I use for my daily work is open source. One of the hallmarks of modern research is that your work and your findings need to be reproducible. It's not good enough to claim that something worked fine in your lab if no one else is able to observe the same thing. For example, look at the recent STAP stem-cell controversy, in which no one was able to replicate the results.
In the same way, it's not good enough to perform your analyses with closed source software. You have no way to check for its correctness and whether the cool results you are observing are just due to some random bug. It gets even worse if the results change between different software versions, and you can't find out why. So having the possibility to look into the code yourself is paramount.
openSNP is on GitHub. What do you like about GitHub and have you seen anything interesting done with your source?
The whole project started on GitHub, as Philipp and I are separated by around nine time zones and needed an easy way to collaborate on this project, even though we can't meet in person. We both worked with git itself on earlier projects. What made GitHub attractive for us was how easy it was to track issues and bugs and comment on commits.
From what I have seen, not too many people are using our source for anything. But the social side of GitHub definitely helped us gain visibility and attract contributors. Helge Rausch, one of the most prolific contributors, an invaluable help and now a core member of our little team, found us on GitHub by chance, I guess. People report issues on the issue tracker. Sometimes we even receive pull requests from others not associated with the project.
Who is your open source mentor or hero?
As open source mainly works because it is a community effort, it would be unfair to single out individuals. But on the programming side of things, I discovered a lot through BioPython, an open source community that makes my daily life as a bioinformatician a lot easier. The effort of all contributors to the project is a stellar example of how great things can work and an inspiration to many, including me.
Just some helpful links:
- Coursera Useful Genetics course using openSNP in a side lecture
- Sound installations
- Studies on attacks on genomic privacy
- Two papers on ethical considerations around public genomics