A glimpse into R counterculture

The statistical computing languages R and Python offer similar features. The decision comes down to contrasting philosophies.

Back in 2009, Anne Milley of SAS dismissed the increasing significance of the R language (whose rivals include SAS, Python, and, more recently, Julia) in a New York Times article. She said:

"We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet."

After many readers expressed their indignation, Milley wrote a follow-up blog post on the SAS website, which took on a considerably more diplomatic tone. She defended SAS as software that can be valued for its "support, reliability, and validation." Recent history, however, has made it much more difficult to conflate proprietary software with reliability or functionality.

R presents a powerful case study in how an open source language has rendered long-dominant proprietary software, such as SAS, largely irrelevant. Although it is difficult to quantify the size of R's user base, one interesting metric of popularity is its use in academic journal articles. By that measure, R surpassed SAS in 2015. Additionally, although it is merely anecdotal, it is amusing to note a thread from 2017 on the Statistics subreddit, in which the original poster wonders why SAS is still around in substantial numbers. To paraphrase the prevailing response, companies still buy SAS because it's what they have always used, and change is hard! Or as Woodrow Wilson put it, "If you want to make enemies, try to change something."

In contrast, there are developers and data science professionals who don't want to make any concessions to functionality. They want the optimal tools for their analyses, even if it means having to dig through Stack Overflow every now and then. For them, there is R. It started as a statistical computing environment, but it's had so many additions that it can now be classified as a general-purpose language.

What about Python?

This raises the question: "What about Python?" Indeed, Python is also a popular open source language used for data analytics. And if we have Python, why should we care about R? This can no longer be answered by appealing to functionality; Python and R have been copying each other's features for years. For example, the R graphics library ggplot2 has been ported to Python; there are Jupyter notebook implementations that support R; and the DataFrame class in Python's pandas library has an uncanny conceptual similarity to the data.frame class in base R. Accordingly, it is now far less common for a data scientist to choose between R and Python on account of differing functionality. There are exceptions to this rule, such as (in Python's favor) the full-stack capabilities of Python and (in R's favor) Shiny, an API to HTML and JavaScript that is implemented as an R library, allowing for seamless integration between web app development and R's capabilities.

Instead, the "What about Python?" question is best answered by clarifying the contrasting design philosophies of R and Python, then choosing the one that most closely aligns with your personal style. The largest conceptual difference between the two languages is Python's preference for having only one obvious way to do something (a principle from the Zen of Python), versus R's belief in giving programmers many possibilities and letting them choose the approach they prefer. There is certainly no analogue in the R community to the use of the word "Pythonic" in the Python community. While this is certainly an issue of personal taste, I think it makes R more closely aligned than Python with the values upheld by the open source community.

Three reasons to choose R

At the end of the day, programmers should choose the language they feel is most comfortable, provided its utility meets their needs. I like that R syntax is very close to the way I think, which makes it very comfortable for me to use. Consider these three simple, but illustrative, examples.

R indexes from 1, rather than the usual 0. I have been surprised by the severity of reactions to this; one of my colleagues even prefers Python over R for this very reason. But the point of a programming language is to be a middleman between our thoughts and 1s and 0s. If a language is a more effective "middleman" (for example, counting from 1, the same way we do), then what's wrong with that? I'm generally a fan of following convention, except when there's a good enough reason not to.
One added benefit of R's approach to indexing is that you can remove elements from a vector by subsetting with negative indices (which requires the language to index from something greater than zero). For example:
```
> x = 1:5
> print(x)
[1] 1 2 3 4 5
> x = x[-3]
> print(x)
[1] 1 2 4 5
```
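As a slightly fuller sketch of the same idea, the snippet below drops several elements at once by passing a vector of negative indices. The vector name `scores` and its values are hypothetical and chosen only for illustration.

```r
# Hypothetical vector used only for illustration.
scores <- c(10, 20, 30, 40, 50)

# R is 1-based, so the first element lives at index 1.
first <- scores[1]

# A single negative index drops that position.
dropped_one <- scores[-3]

# A vector of negative indices drops several positions at once.
dropped_many <- scores[-c(1, 5)]

print(first)
print(dropped_one)
print(dropped_many)
```

Note that subsetting returns a new vector; `scores` itself is unchanged unless the result is assigned back to it.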
Base R offers several ways to assign a value: the assign() function plus a handful of assignment operators, which differ in operator precedence. The following four statements all produce the same effect:
```
assign('x', sqrt(pi))
x = sqrt(pi)
x <- sqrt(pi)
sqrt(pi) -> x
```
The third form above (called "leftward assignment") is the most common, and I would not be surprised if most R programmers (out of habit) use it exclusively. I find it useful to have all of these available, as I think certain options are better suited to expressing how I form certain thoughts. Also, optional arguments to the first one, the assign() function, can explicitly specify in which environment/namespace to store the new variable. Moreover, R has the super-assignment operators <<- and ->> (which parallel leftward and rightward assignment, respectively). These assign to a variable in an enclosing environment, or in the global environment if no such variable exists, even from deep within nested functions or structures. (This can also be accomplished through the assign() function.)
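To make the environment-related options concrete, here is a minimal sketch using assign() with its envir argument and the <<- operator inside a nested function. The names `store`, `make_counter`, and `counter` are hypothetical and exist only for this example.

```r
# Hypothetical environment used only for illustration.
store <- new.env()

# assign() can target a specific environment instead of the current one.
assign("x", sqrt(pi), envir = store)
stored_x <- get("x", envir = store)

# <<- rebinds a variable found in an enclosing environment
# (it falls back to the global environment only if none is found).
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
first_call <- counter()
second_call <- counter()

print(stored_x)
print(second_call)
```

Here `count` lives in the environment created by `make_counter()`, so each call to `counter()` updates the same variable rather than creating a new local one.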

I think R beats every other language when it comes to ease of implementing list comprehension, even though this is typically touted as a Python selling point. One of several list comprehension methods in R is the "apply" family of functions, which provide a feature-rich way to apply functions across vectors or lists (roughly R's equivalent of C structs). There is also a simpler approach based on R's convention of vectorization (not to be confused with "recycling," which stretches a shorter vector to match a longer one): many functions that are described in terms of a single input value also accept an entire vector and are evaluated at each of its elements. For example, the factorial() function is defined in terms of one input value, but you can nonetheless use it as:
```
> factorial(1:9)
[1]      1      2      6     24    120    720   5040  40320 362880
```
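For comparison, here is a minimal sketch that runs the same computation through two members of the "apply" family and through a direct vectorized call. The helper `square_plus_one` and the vector `inputs` are hypothetical names used only for illustration.

```r
# Hypothetical helper used only for illustration.
square_plus_one <- function(n) n^2 + 1

inputs <- 1:5

# sapply() applies the function to each element and simplifies to a vector.
via_sapply <- sapply(inputs, square_plus_one)

# lapply() does the same but always returns a list.
via_lapply <- lapply(inputs, square_plus_one)

# Arithmetic in R is vectorized, so the helper also accepts the whole vector.
via_vectorized <- square_plus_one(inputs)

print(via_sapply)
print(via_vectorized)
```

The vectorized call is the shortest when the function body is built from vectorized operations; the "apply" functions are the more general tool when it is not.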
Although the "apply" functions were originally considered a nuance in R, they inadvertently encouraged R programmers to set up their computations in embarrassingly parallel ways. Consequently, the R community naturally developed libraries for parallel and GPU computing.
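As a rough sketch of how an "apply"-style computation carries over to parallel execution, the snippet below swaps lapply() for mclapply() from the parallel package that ships with R. The function `slow_square`, the input vector, and the choice of two cores are assumptions made for illustration, not a recommendation.

```r
library(parallel)  # bundled with standard R installations

# Hypothetical task used only for illustration.
slow_square <- function(n) {
  n^2
}

inputs <- 1:8

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same call shape in both cases: a function applied independently to each element.
serial_result <- lapply(inputs, slow_square)
parallel_result <- mclapply(inputs, slow_square, mc.cores = cores)

print(identical(serial_result, parallel_result))
```

Because each element is processed independently, the only change needed is the function name and the core count.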

In these and many other ways, R's embrace of the open source philosophy has made it a niche but growing language whose capabilities rival those of any other high-level interpreted language.

Samuel Lurie will be presenting Highlights of R at SCaLE16x this year, March 8-11 in Pasadena, California. To attend and get 50% off your ticket, register using promo code OSDC.