We cannot do modern science unless it's open

Open is about sharing and collaboration. It's the idea that "we" is more powerful, more rewarding and fulfilling than "I". I can't promise jobs, but I do know that open is becoming very big. Governments and funders are pushing the open agenda, even though academics are generally uninterested or seriously self-interested.

Some governments and some companies recognize the value of teams; academia and academics generally don't. The false values of impact factor and the false values of academic publishing mean that open access is a poor reflection of open, or what you may recognize as the open source way.

I first started thinking about code re-use in 1980, when I had developed an approach to re-using crystallographic data as a research tool. Crystals were published as tens of thousands of single papers; my vision was that by using all of these together we'd discover patterns that would show new science. In particular, my collaborators and I showed that snapshots of a crystal in different environments could give information on vibrations and even chemical reactions. I wrote lots of software in FORTRAN IV. It built on Sam Motherwell's great CONNSER and GEOM packages. I built on a whole raft of statistical and analytical tools, and we published papers together.

Then, I went into the pharmaceutical industry to use these ideas in drug discovery and donated the software to an organization, on the basis that if they wanted to develop it they would contact me and we would work on it together. It didn't happen. This was before licences, before RMS, before people worried about ownership. The software got subsumed into their system, and my name got removed. I even sat through a lecture where they presented it as their own. I've gotten over it now, but I've learned.

View the complete collection of articles from Careers in Open Source Week

The code was very complex, and I realised that there must be a better way, where we write re-usable modules. I'd been impressed with NAG and took the modular approach as central. It was difficult to do this in chemistry as it's less clear what the fundamentals are. You can write a matrix diagonalizer because it's clear what the inputs and outputs are, but it's less clear how to calculate a molecular mass (it's harder than it looks: remember isotopes!). So I started writing a reusable set of routines in 1990 in FORTRAN. At that stage, I was also giving evening lectures at Birkbeck College on bio- and chemo-informatics and took these modules to the students. The problem was that languages were changing, and so I converted them to C using f2c (it works, but don't look at the generated code!). Then, I discovered Tcl/Tk and loved it because of the graphics, and soon after that I was discovered by a salesman from Sun Microsystems.
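To make the molecular-mass point concrete, here is a minimal Python sketch (not the original FORTRAN routines). It assumes a formula is already available as a dictionary of element counts, and it carries a tiny hand-entered isotope table for three elements only; the function names and the table layout are illustrative assumptions, and the numbers are rounded reference values.

```python
# Isotope data for three elements only: (isotope mass in u, natural abundance).
# Rounded reference values, hand-entered for illustration.
ISOTOPES = {
    "H": [(1.00782503, 0.999885), (2.01410178, 0.000115)],
    "C": [(12.0, 0.9893), (13.00335484, 0.0107)],
    "Cl": [(34.96885268, 0.7576), (36.96590259, 0.2424)],
}


def average_mass(formula):
    """Abundance-weighted mass over all listed isotopes of each element."""
    total = 0.0
    for element, count in formula.items():
        element_mass = sum(mass * abundance for mass, abundance in ISOTOPES[element])
        total += count * element_mass
    return total


def monoisotopic_mass(formula):
    """Mass using only the most abundant isotope of each element."""
    total = 0.0
    for element, count in formula.items():
        mass, _abundance = max(ISOTOPES[element], key=lambda pair: pair[1])
        total += count * mass
    return total


chloromethane = {"C": 1, "H": 3, "Cl": 1}  # CH3Cl
print(f"average:      {average_mass(chloromethane):.3f}")
print(f"monoisotopic: {monoisotopic_mass(chloromethane):.3f}")
```

For chloromethane the two answers differ by about half a mass unit, so "the molecular mass" is not a single well-defined output until the caller says which one is wanted. A reusable module has to make that choice explicit, which a matrix diagonalizer never has to do.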

They found me only because I was much more visible than others.

In 1994 Henry Rzepa and I had developed Chemical MIME. This was an open project (though not formally labelled) where we generated a chemical meme that swept the web in six weeks. It relied on the open programs RasMol and Mage, which we could freely distribute to run in browsers. Chemical MIME was the ideal open project: open specs, open software, and enough open molecules to give it a WOW factor! That visibility gave me my first (part-time) consultancy job and kept me alive for some years after I left Glaxo. At the same time, Alan Mills and I ran the first multimedia course on the web (1995), Principles of Protein Structure (PPS). We ran it in a derivative of BioMOO and the Globewide Network Academy; they were all completely open projects stemming from LambdaMOO (Pavel Curtis, Xerox). PPS showed the value of community, and we had 250 volunteers/students (we didn't distinguish) on the course. And PPS got me my second job, as a part-time Professor of Pharmacy at Nottingham, setting up virtual education.

We were all optimists and thought that it would take off rapidly, but we failed to realise that education is ultra-conservative and has to map into real-world constraints. For me, in 1993, the world wide web was transformative because there were no barriers. It engendered open systems, sources, and protocols. They were so prevalent you didn't think about them. We don't realise how powerful a force Tim Berners-Lee has been for open. As I whirled along in a portfolio career of research, consultancy, and hacking, I was able to stay alive and develop my ideas. It's worked well for me, as some of these ideas have needed 20 years to build and for the community to realise them. (That's not arrogance; many web protocols like MathML or SVG or RDF have had stuttering starts but are now mainstream.)

I was very heavily involved in XML and ran the XML-DEV mailing list, which had 10,000 emails a year and was the basis on which the community developed XML. I'm most proud of the SAX protocol, which was entirely developed on the list in 4 weeks. All this XML not only gave me the basis for the Chemical Markup Language (CML) but led to a consultancy with JB in London, delivering training in XML. Running courses can be hard work but rewarding enough to gain a living from it. (This was my third job.) Then, I saw the advert for Cyberinfrastructure in Chemistry in Cambridge (Unilever Centre), where one of the pillars was training. Because of my experience I was able to create and deliver training courses, and this led to my appointment in the Department. (This was my fourth job.)

Cambridge gave me great resources (especially through the 250 million GBP eScience program run by Tony Hey in Southampton). I set the goal for myself of building an artificially intelligent (AI) chemist (though I didn't make much noise about it). It was to be based on knowledge and code modules that I had been building for 10 years. I started building it all myself in Java. I got to one stage where I added graphics, using Java3D. Java3D was awful: a wrapper on C code and a closed binary. It was consuming far too much of my time. I'd earlier used XMol, which is Dan Gezelter's molecular viewer that ran under X windows. At that stage, it was fairly basic and Java was a better approach. I then noticed the emergence of Jmol, the port to Java. I suddenly thought: "If I don't try to compete with Jmol, I can do the things I really want to do (chemical semantics)." So, I decided to junk my code and link in Jmol at that stage.

This was a really important decision in the scheme of things. Although I generally acted openly, I wasn't really conscious of the open source way in terms of licensing and commitments. But, I became aware of it at this point and started to look for other codebases to link. The uniting architecture was Chemical Markup Language. CML is designed to support most of chemistry in semantic form. Because I was collaborating with the other code groups—CDK, Bioclipse, Jmol, JSpecview, OpenBabel, etc—they adopted CML. This was a massive community win and more than any commercial manufacturer can achieve.

No one will write code for a competitor, but many will write to interoperate with a collaborator. We got to know each other, and in 2005 most of us met at the American Chemical Society (ACS) meeting under the blue obelisk in San Diego. I suggested we form a close, informal community under the label Blue Obelisk and that we adopt the mantra: open data, open standards, open source (ODOSOS). We have a mailing list, and at intervals I buy Blue Obelisks as awards for publicly valuable contributions. There's a communal agreement to interoperate but no top-down control. It just happens in its own way and at its own speed. We reviewed progress 5 years on and had 20 groups authoring the paper, which is a remarkable achievement for a very conservative discipline (chemistry) where established companies are more valued than innovation.

And the AI chemist? What I hadn't reckoned on is that I couldn't build on knowledge because vested interests would throw lawyers at it to stop it. The major data "owners" fight to prevent re-use of data. When Wikipedia wanted to use CAS registry numbers they got a legal letter from ACS. When NIH developed a free database of chemical information (PubChem), the ACS lobbied Congress to have it closed down. So, I have developed tools to extract facts from the scientific literature; the STM publishers are throwing money and lobbyists at Brussels to stop it happening. It's no surprise that I am now known as an open activist (see my Wikipedia entry).

We cannot do modern science unless it's open. And, I am looking for allies.

Last year I applied for a Shuttleworth Foundation Fellowship ("We provide funding for dynamic leaders who are at the forefront of social change."). And, in March 2014 I was awarded one. (This was my fifth job in open source.) We are going to extract 100 million facts from the literature whether or not the publishers like it, because we've had the law changed.

For my fellow academics, the question is: Can open source get you a job? My answer is: By itself it probably won't get you a lectureship, but all my group have been able to get good jobs in the high-tech industry, or science. I think the public exposure of the open source way has helped. I'm very proud of them.
