Making science more open at GitHub

Arfon Smith works at GitHub and is involved in a number of activities at the intersection of open science, open source, and online research. He has worked on several successful citizen science projects, like Zooniverse, a platform he co-founded where people can analyze real astronomical data and make significant contributions. Since his move to GitHub, Arfon has broadened his focus to how GitHub can help make academic collaborations behave more like open source ones.

Read more on how Arfon Smith made it to his dream job in this interview.

What led you to building 'citizen science projects'?

Going way back, I studied chemistry at university and later received a PhD in Astrochemistry (the chemistry of space). During my PhD, I wrote a lot of code, but at that point, like most academic coders, I really didn't know anything about software development, such as how to use version control.

Towards the end of my PhD, I realised that I was as interested in the code I was writing as in the science results it was producing. I'd started to play around with web frameworks such as Ruby on Rails (this was 2006, so pretty early days for Rails), and I was really enjoying building tools for the web. I think this was one of my main reasons for making the move into academic software post-PhD. After a brief stint as a developer for a web marketing agency, I joined the Production Software Group at the Wellcome Trust Sanger Institute in Cambridge and built Rails applications to manage the processing of biological samples that were going to have their DNA sequenced.

After a brief but very enjoyable year at Sanger, I joined the Galaxy Zoo team at the University of Oxford and, together with Chris Lintott, co-founded the Zooniverse, working as the technical lead of the project. Over the next five years, I was responsible for building (and later leading a team doing all the hard work!) more than 20 citizen science projects, engaging more than 1 million members of the public in doing real science online.

Both my role at Sanger and my role with Zooniverse were about building better tools for science. Sanger was a formative time for me, as it was an eye-opening experience to be on a campus filled with bioinformaticians: people whose expertise was not in pure biological research but in the craft of building tools to facilitate and advance that research domain. Coming from astrophysics, where the astroinformatics field is still very niche, it was remarkable to see what could be done when significant resources were invested in a hybrid technical/science domain.

How did you get to GitHub?

My move to GitHub was prompted by my (not unique) observation that the role of a scientific software developer isn't really well recognised or rewarded in the current academic model. As my role evolved with Zooniverse and my team grew, I spent more and more time thinking about how to build careers for these people. As things stand today, if you want to be successful in an academic department then you should probably write a bunch of papers in high-profile journals. Yet as research becomes more data-intensive, an increasing fraction of scholarly activity is encoded in software, which doesn't typically receive the same amount of credit as a paper.

I'd known a few folks at GitHub for a number of years and had always been a huge fan of the product. Last year, we started talking about what GitHub could do to support those researchers writing software and sharing it on GitHub.

What is your job at GitHub? What does a typical day look like?

Broadly, my job at GitHub is to make it a better place for a researcher to share their work. In truth there are a number of ways this can be approached, so a typical day might include any of the following activities:

helping researchers incorporate version control into their workflow
being a voice for academic users when advising an internal team at GitHub on product development
working with community partners such as data publishers, government agencies, and journals to ensure that GitHub is providing value in the academic ecosystem
developing (and delivering) resources for academic users wishing to use GitHub

You are involved in several collaborative projects such as the Workshop for Sustainable Software for Science of the Mozilla Science Lab. Could you tell us more about your overall goals at GitHub to improve the state of collaboration in science, especially as it relates to source code, data, publication, and reproducibility?

It's taken me a while to get to this point, but the one-line job description I use these days is that I'm trying to "bring the sharing and credit mechanisms prevalent in open source communities to those of academia." This description is far from perfect, but it points to lots of the things I'm trying to address.

Every day, millions of people collaborate on GitHub, working together to build something better than they could produce on their own. My goal is to get researchers working together in the same way, versioning their work, reusing (and building upon) the work of others, and receiving credit for these activities. Software encodes an ever-growing fraction of the academic output and yet as a researcher, if you want to gain credit for software you've written you still need to publish a paper about it. David Donoho nailed it when he described these papers as 'advertising'. I think there's a bunch of opportunity here.

Reproducibility is an interesting topic and one that I'm very interested in. In my mind, reproducibility is a byproduct of using better tools (such as Git) to capture more effectively the provenance around research. I think it's a failure mode to build tools that have the primary goal of making research 'reproducible'. As a community we should focus on building tools that make people's research lives better. Ultimately, I believe that research will become more reproducible as individual researchers derive benefits from working with better tools and as we develop credit mechanisms for sharing more of the research outputs.

What do you hope to accomplish over the coming years at GitHub? How can others get involved?

As I mentioned earlier, making academic collaborations behave more like open source ones is a long-term goal. IPython's Fernando Perez wrote an excellent article about how open source communities succeed because they have reproducibility as a core foundation.

There are a number of ways to get involved. If you've written some code as part of your research then consider sharing it on GitHub or any of the other code-hosting platforms. When you share, I encourage you to do three things:

Choose and apply a license.
Write a README.
Write a 'bootstrap' task to get people up to speed as fast as possible on your project.
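
As a rough illustration of the three steps above, here is a minimal Python sketch that checks whether a project directory already contains a license, a README, and a bootstrap entry point. The file names it looks for (such as LICENSE, README.md, and script/bootstrap) and the function name missing_items are assumptions chosen for this example; the interview does not prescribe any particular names or layout.

```python
from pathlib import Path

# Hypothetical candidate file names for each checklist item.
# These are common conventions, not requirements stated in the interview.
CHECKLIST = {
    "license": ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"),
    "readme": ("README", "README.md", "README.rst", "README.txt"),
    "bootstrap": ("script/bootstrap", "bootstrap.sh", "Makefile"),
}


def missing_items(repo_dir):
    """Return the checklist items for which no candidate file exists."""
    root = Path(repo_dir)
    missing = []
    for item, candidates in CHECKLIST.items():
        # An item counts as present if any one of its candidate files exists.
        if not any((root / name).is_file() for name in candidates):
            missing.append(item)
    return missing


if __name__ == "__main__":
    import sys

    # Check the directory given on the command line, or the current one.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    gaps = missing_items(target)
    if gaps:
        print("Missing:", ", ".join(gaps))
    else:
        print("All three items are present.")
```

Run against a freshly created research repository, a check like this gives a quick reminder of what a newcomer would look for first. It only tests that the files exist; it says nothing about whether the license suits the project or whether the README is helpful.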

More generally, if you're a researcher writing software and you'd like to chat about your experiences then feel free to reach out to me at arfon@github.com. I'd love to talk.