R love you, R hate you

I do the majority of my coding in R. I’ve grown to love the language despite many of its quirks. Perhaps this is just Stockholm Syndrome as I joked in my “Intro to programming” class recently, though I genuinely enjoy the language and its functional programming roots.

That said, there are things that are objectively bad about it, with defaults that no sane person would choose. The more time I spend writing code for packages rather than data analysis, the more I’ve grown annoyed at how much extra work you need to do to avoid perplexing bugs and behavior. This is partly a weakness of dynamically typed languages - rather than depending on type-checking built into the language, you need to program defensively and explicitly handle user-input and whether it is what you expect it to be (see excellent discussions here and here). It’s a problem important enough that it has led to the proliferation of many packages that aim to assist you with it - e.g. checkmate, ensurer, tester, assertive, assertr. But another part is just due to some genuinely weird choices built in many of R’s base functions.

A case of bait and switch

Here’s a simple case which recently led to a head-scratcher of a bug in our bmm package. Consider a simple a function which gives a different greeting depending on a language parameter:

greeting <- function(name, language) {
  if (language == "chinese") {
    out <- paste0("嗨 ", name, "! 多么美好的一天")
  } else if (language == "german") {
    out <- paste0("Hallo ", name, "! Was für ein toller Tag!")
  } else if (language == "bulgarian") {
    out <- paste0("Здравей ", name, "! Какъв страхотен ден!")
  }
  cat(out, "\n")
}

Simple enough, right?

greeting("Ven", "german")
#> Hallo Ven! Was für ein toller Tag!

Currently, you will get weird errors with non-character input if you don’t properly check that the language argument is a character or that it is a valid choice. But at least the user will get an error:

greeting("Ven", 2)
#> Error in greeting("Ven", 2): object 'out' not found
greeting("Ven", "japanese")
#> Error in greeting("Ven", "japanese"): object 'out' not found

Ok, maybe a little confusing for the user who doesn’t know how the function is implemented (“object ‘out’ not found? What is object ‘out’?”), but fine. You can and should go the extra mile and validate the input, but let’s keep going. Now, if you had many different language options, you might decide to use a more efficient switch statement instead of an if/else structure:

greeting2 <- function(name, language) {
  out <- switch(language,
    chinese = paste0("嗨 ", name, "! 多么美好的一天"),
    german = paste0("Hallo ", name, "! Was für ein toller Tag!"),
    bulgarian = paste0("Здравей ", name, "! Какъв страхотен ден!")
  )
  cat(out, "\n")
}

Our switch handles 3 explicitly defined cases, just like the if/else version, so unless you read the documentation, or have been already burnt by this, you might expect that this should fail:

greeting2("Ven", 2)
#> Hallo Ven! Was für ein toller Tag!

Oh, it runs! Ok, so even though we have explicitly defined the cases by name, switch tries to be clever and interpret integer input as a case selection. Not only that, but it will try really hard to make the input an integer if it can. The following makes no sense as a language selector, but R happily forces 2.8 into an integer sleeve and greets us again:

greeting2("Ven", 2.8)
#> Hallo Ven! Was für ein toller Tag!

Fine, it’s strange, but who in their right mind will try to pass numeric values for a language variable? Maybe no one on purpose, but some things in R are (sometimes) secretly integers. Namely, factors. Consider this - we have some data.frame with names and language preferences of users, the latter of which is coded as a factor variable.

users <- data.frame(
  name = c("Ven", "Gidon", "Chenyu"),
  language = factor(c("bulgarian", "german", "chinese"))
)

users
#>     name  language
#> 1    Ven bulgarian
#> 2  Gidon    german
#> 3 Chenyu   chinese

If we were to use our if/else-based greeting function to greet each user, everything works as we would expect:

purrr::pwalk(users, greeting)
#> Здравей Ven! Какъв страхотен ден! 
#> Hallo Gidon! Was für ein toller Tag! 
#> 嗨 Chenyu! 多么美好的一天

What about the greeting2, which uses switch to determine which greeting to use?

purrr::pwalk(users, greeting2)
#> 嗨 Ven! 多么美好的一天
#> Здравей Gidon! Какъв страхотен ден!
#> Hallo Chenyu! Was für ein toller Tag!

What the hell? Well, I cheated a bit and hid the warning R helpfully gave us (naughty!). Here is the output again without suppressing the warning:

purrr::pwalk(users, greeting2)
#> Warning in switch(language, chinese = paste0("嗨 ", name, "! 多么美好的一天"), : EXPR is a "factor", treated as integer.
#>  Consider using 'switch(as.character( * ), ...)' instead.
#> 嗨 Ven! 多么美好的一天
#> Warning in switch(language, chinese = paste0("嗨 ", name, "! 多么美好的一天"), : EXPR is a "factor", treated as integer.
#>  Consider using 'switch(as.character( * ), ...)' instead.
#> Здравей Gidon! Какъв страхотен ден!
#> Warning in switch(language, chinese = paste0("嗨 ", name, "! 多么美好的一天"), : EXPR is a "factor", treated as integer.
#>  Consider using 'switch(as.character( * ), ...)' instead.
#> Hallo Chenyu! Was für ein toller Tag!

Ah, that explains it (although doesn’t excuse it). When a factor variable is passed to a switch statement, the variable is treated as an integer. By default factor levels are ordered alphabetically, which we can see if we examine our language factor structure:

users$language
#> [1] bulgarian german    chinese  
#> Levels: bulgarian chinese german
str(users$language)
#>  Factor w/ 3 levels "bulgarian","chinese",..: 1 3 2

So Gidon gets greeted in Bulgarian, because his “german” language is coded as 3 in the factor variable, and the third check in switch corresponds to “bulgarian”. The documentation (?switch) helpfully explains that

switch works in two distinct ways depending whether the first argument evaluates to a character string or a number.

If the value of EXPR is not a character string it is coerced to integer. Note that this also happens for factors, with a warning, as typically the character level is meant. If the integer is between 1 and nargs()-1 then the corresponding element of ... is evaluated and the result returned: thus if the first argument is 3 then the fourth argument is evaluated and returned.

If EXPR evaluates to a character string then that string is matched (exactly) to the names of the elements in ...…

Wow, ok, I never knew this, or if I did I have completely forgotten about it.

So what, we do get a warning don’t we?

We sure do, but warnings can be suppressed, just like I did above. It’s common to suppress output of functions when running many iterations of a chatty function in an analysis script and problems can easily go unnoticed. That’s exactly what happened recently in our Bayesian measurement modeling R package when a colleage reported a weird bug. We have one computational model that can apply different forms of Luce’s choice decision rule - a standard version and a version passed through a softmax normalization. Due to this weird way that switch treats factors, the wrong normalization was applied. This is a recent and not yet officially released model, so we are yet to write all input validations. This could have easily gone unnoticed and led to incorrect model specification that nevertheless lets the model run.

Warnings are not a reliable way to signal undesired behavior. Especially when the documentation of switch’s warning itself notes that “typically the character level is meant”. Well, if typically a character level is meant, why is the default EXACTLY THE OPPOSITE?!?

Fine, but you tell users “language” should be a character… right?

We do, but here’s the kicker - R loves to turn character vectors into factors. So much so that disabling such behavior was one of the main motivation behind the development of tibbles. Until R4.0.0, whenever you used a function like read.csv() to read a file as a data.frame, R by default converted character columns to factors. Thankfully now this default has been reversed, but here’s the deal - NOT EVERYWHERE!

One place which I never knew R created factors out of character vectors is expand.grid. Expand.grid is a commonly used function to get a data.frame with all combinations of several variables, which is useful for running models with orthogonality manipulated conditions. E.g.:

conditions <- expand.grid(
  value = c(1, 100),
  version = c("cs","ss"), 
  choice_rule = c("simple", "softmax")
)
conditions
#>   value version choice_rule
#> 1     1      cs      simple
#> 2   100      cs      simple
#> 3     1      ss      simple
#> 4   100      ss      simple
#> 5     1      cs     softmax
#> 6   100      cs     softmax
#> 7     1      ss     softmax
#> 8   100      ss     softmax

Can you tell that version and choice_rule are factors? I’ve used expand.grid for years without knowing that default behavior of expand.grid, and it’s partly that standard data.frame print method does not differentiate character and factor columns in any way. You can see that indeed we have factors underneath by using str for more details:

str(conditions, give.attr = FALSE)
#> 'data.frame':    8 obs. of  3 variables:
#>  $ value      : num  1 100 1 100 1 100 1 100
#>  $ version    : Factor w/ 2 levels "cs","ss": 1 1 2 2 1 1 2 2
#>  $ choice_rule: Factor w/ 2 levels "simple","softmax": 1 1 1 1 2 2 2 2

You can examine the documentation, or directly check with formals that expand.grid, just like read.csv has a argument stringsAsFactors, which defaults to TRUE:

formals(expand.grid)
#> $...
#> 
#> 
#> $KEEP.OUT.ATTRS
#> [1] TRUE
#> 
#> $stringsAsFactors
#> [1] TRUE

Wait, but didn’t I just write that R4.0.0 solves the problem? As of R4.4.2, it only does that for read.table and data.frame, but not expand.grid

version$version.string
#> [1] "R version 4.4.2 (2024-10-31)"
formals(read.table)["stringsAsFactors"]
#> $stringsAsFactors
#> [1] FALSE
formals(data.frame)["stringsAsFactors"]
#> $stringsAsFactors
#> [1] FALSE
formals(expand.grid)["stringsAsFactors"]
#> $stringsAsFactors
#> [1] TRUE

You can of course change that by being explicit about not wanting factors, or use tidyr::expand_grid alternative instead:

expand.grid(
  value = c(1, 100),
  version = c("cs","ss"), 
  choice_rule = c("simple", "softmax"),
  stringsAsFactors = FALSE
) |> str(give.attr = FALSE)
#> 'data.frame':    8 obs. of  3 variables:
#>  $ value      : num  1 100 1 100 1 100 1 100
#>  $ version    : chr  "cs" "cs" "ss" "ss" ...
#>  $ choice_rule: chr  "simple" "simple" "simple" "simple" ...

tidyr::expand_grid(
  value = c(1, 100),
  version = c("cs","ss"), 
  choice_rule = c("simple", "softmax")
) |> str()
#> tibble [8 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ value      : num [1:8] 1 1 1 1 100 100 100 100
#>  $ version    : chr [1:8] "cs" "cs" "ss" "ss" ...
#>  $ choice_rule: chr [1:8] "simple" "softmax" "simple" "softmax" ...

Consistency is important

Consistency is a loaded term. Whether two things are consistent or not necessarily depends on context and goals. One could go on and on about consistent naming styles, consistent argument order, consistent default behaviors, etc. At its core the problem with inconsistent design choices in a programming language is that it makes it very difficult, if not impossible, to build an accurate mental model of how the language works. It’s more than a bit ironic when programming languages, which are supposed to be the precise counterpart of natural languages, become a similar tangled mess of exceptions you need to learn by heart, just like irregular verbs in English.

The main example of this post concerns the concept of control flow in programming. Both if/else statements and case/switch statements aim to accomplish the same goal - to execute different parts of a program depending on a prespecified condition. One would expect that logical comparisons would work the same between those two constructs, but as this example has illustrated, they do not in R. Sometimes complex examples obscure the core problems, so let’s present it at it’s simplest form:

x <- factor("Peace", levels = c("Violence", "Peace"))

x == "Peace"
#> [1] TRUE
x == 2
#> [1] FALSE
x == "2"
#> [1] FALSE

switch(x, Peace = "I choose peace", Violence = "I choose violence")
#> Warning in switch(x, Peace = "I choose peace", Violence = "I choose violence"): EXPR is a "factor", treated as integer.
#>  Consider using 'switch(as.character( * ), ...)' instead.
#> [1] "I choose violence"

Was Cercei mislead by R? We’ll never know.

Logical comparisons treat factors as character vectors, switch comparisons treat them as integer values. This is not ok. Sure, you can learn that this is the case, but you shouldn’t have to.

These inconsistencies in base R have led to various attempts at reform, most notably through the tidyverse ecosystem. While the tidyverse strongly emphasizes consistency, it presents its own challenges. Breaking changes are frequent, forcing regular code updates and relearning of functionality. While useful for analysis code, the packages’ interdependence and heavy footprint make them difficult to justify in lightweight, stable package development. Many tidyverse functions on the surface appear to simply wrap basic R operations, unnecessarily expanding the language’s complexity. The tidyverse’s impact on the R community remains contentious, sparking numerous debates about its influence (see discussions here, here, here, here, here). My own perspective has evolved from enthusiasm to skepticism, though I still use tidyverse tools when appropriate. I don’t want to throw away the baby along with the bathwater though - despite its drawbacks, the tidyverse gets one thing right: function APIs should be consistent and avoid “magic” behavior. R along with its predecessor S, is an old language, and a lot can be forgiven due to its legacy.

This situation mirrors broader patterns in programming language evolution. Consider C++: despite being a modern, widely-used language, it carries significant legacy baggage due to its strict backwards compatibility requirements. Newer languages like Rust, free from such constraints, can implement better defaults and more consistent behavior from the start. The tidyverse represents a curious case of attempting to fix a language’s shortcomings ‘from within’ rather than through dialect separation or replacement. While this approach maintains ecosystem cohesion, it creates a complex divide within the R community.

Python offers an instructive contrast in language evolution. Through semantic versioning, Python made major breaking changes between versions 2 and 3, prioritizing language improvement over backwards compatibility. While the transition wasn’t painless—Python 2 still claimed 25% usage a decade after Python 3’s 2008 release—by 2023, Python 2 usage had dropped to just 6% (and the transition plans were announced well in advance - PEP 3000 was published in 2006). R, conversely, maintains strong API stability between major versions. Many of the tidyverse’s consistency-oriented features could serve as excellent candidates for a base R rewrite if the community embraced the possibility of meaningful breaking changes in major versions. Yes, such transitions can be challenging—Python’s experience demonstrates this—but they also show that systematic language improvement is achievable with proper planning and community support.”

What now?

First, of course, is for me to sit down and do the annoying grunt work of going through all user-facing functions and ensuring that we test and validate every input. There’s a ton of ways to do it, either with vanilla R and custom functions (stopifnot and match.arg are your friends), or with the help of some validation packages I listed earlier. Then add more unit tests about edge cases to make sure things like this don’t happen. And so on… We have done this for a lot of our existing code, but as this example taught me, it’s easy to forget - and sometimes easy to not know.

Long-term, however, I’m becoming more and more interested in programming languages that use a strong static type system. I’ve long taught that static typing is simply annoying - as a self-taught programmer, I haven’t had the benefit of learning things “the right way”. Over the last few months I’ve started digging deeper into programming as a core skill, exploring various languages and resources. I’m growing more and more attuned to the virtues of good type systems, for many other reasons. I’m sure another post on this topic is incoming at some time. In the meantime:

if (language == "R") {
  stopifnot(inputs_are_carefully_validated())
}

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{popov2025,
  author = {Popov, Vencislav},
  title = {R Love You, {R} Hate You},
  date = {2025-02-21},
  url = {https://venpopov.com/posts/2025/r-love-you-r-hate-you/},
  langid = {en}
}

For attribution, please cite this work as:

Popov, Vencislav. 2025. “R Love You, R Hate You.” February 21, 2025. https://venpopov.com/posts/2025/r-love-you-r-hate-you/.