As a person who’s worked with various programming languages over time, I have become interested in the nuances and overlaps among languages. In particular, concepts related to code syntax and organization–everything from technical concepts such as lexical scoping, to more broad concepts such as importing and naming data–really fascinate me. Organization “enthusiasts” like me truly appreciate software/applications that follow consistent norms.
In the R community, the
tidyverse ecosystem has become extremely
popular because it implements a consistent, intuitive framework that is,
consequently, easy to learn. Even aside from the tidyverse, R’s
system for package development, documentation, etc. has strong
community-wide standards. 1 In all,
there’s something magical about R’s framework that has made it my
favorite language to use.
While I tend to follow most of the R community’s “best practices”, there
are certain “traditional” or well-agreed-upon aspects of R programming
where I knowingly deviate from the norm. I think
spelling out exactly why I follow the norm on some fronts and not on
others is a good exercise in self-understanding, if not just a fun
exercise in seeing how many people disagree with me.
- Use quotes (i.e. standard evaluation) when importing a package.
library("dplyr") # instead of library(dplyr)
I believe most people use the non-standard evaluation (NSE) approach,
but I like not having to use NSE when it is not really intended to be
used as the primary means of invoking a function. (Note that with some
packages, such as
dplyr, NSE is intended to be the primary means
of working with variables and functions. In that case, I would not go
out of my way to invoke standard evaluation methods.)
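One practical upside of the quoted form is that the package name is an ordinary string, so the same habit extends to cases where the name is stored in a variable. A minimal sketch using the character.only argument of library(); the pkg variable is only for illustration, and stats is used because it ships with R:

```r
# The package name is stored as a string (hypothetical variable name).
pkg <- "stats"

# library() normally treats its argument as a symbol (NSE);
# character.only = TRUE tells it to evaluate `pkg` as a character string.
library(pkg, character.only = TRUE)

# Confirm that the package is now attached to the search path.
is_attached <- paste0("package:", pkg) %in% search()
is_attached
```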
- Import tidyverse packages explicitly instead of importing the tidyverse package itself.
library("dplyr")
library("stringr")
# instead of
library("tidyverse")
Again, I think I’m probably in the minority on this one (although most
people may not even have an opinion on which method is “best”). One can
argue that the whole point of bundling the six or so packages that get
attached by library("tidyverse") is to “abstract away” having to think
about what packages will be necessary for a specific analysis.
- Do not import a package if it is only being used once or twice.
Instead, use the package::function notation.
iris_cleaned <- janitor::clean_names(iris)
# instead of
library("janitor")
iris_cleaned <- clean_names(iris)
# ... Then go on to not use janitor again.
Additionally, I may sometimes use the
package::function notation even
after importing a package if the function name might be confused with a
variable name. For example, I might specify lubridate::year() rather than a bare
year(), because year on its own might not read like a
verb. (There’s more on that in another section.)
I’m not really sure if there is a general consensus or not on this matter.
- Use sprintf() instead of paste() (or another similar function) when inserting numerics or strings into some kind of “boiler-plate” string.
sprintf() offers slightly better “find-and-replace” syntax,
which can be useful when re-using code or even when debugging. Also, I
think that it makes the underlying string more readable in the case that
there are a lot of variables to substitute in to the string.
Nonetheless, my justification in this matter is not unwavering: one can
argue that paste() has the advantage of more clarity depending on the
length of the boiler-plate string and the number of variables to be
inserted. (In fact, I often find myself switching among sprintf(), paste(), and
stringr::str_c() with no real discretion,
so I can be hypocritical on this matter.)
person <- "Tony"
sprintf("%s is a wonderful person.", person)
# is not really any easier or more explicit than
paste(person, "is a wonderful person.")
# but for a case like this
team_1 <- "Yankees"
team_2 <- "Red Sox"
score_1 <- 3
score_2 <- 4
sprintf("The %s defeated the %s by a score of %.0f to %.0f.", team_1, team_2, score_1, score_2)
# is probably better than
paste("The", team_1, "defeated the", team_2, "by a score of", score_1, "to", score_2, ".")
- Use stringr functions over their base R equivalents (e.g. str_detect() instead of grep()).
My behavior in this matter is probably not all that controversial, but I
just wanted to make a note of it because I sometimes catch myself using
grep() in “tidy” pipelines, only to
realize that
str_detect() would save me some
effort because I don’t have to use a
. to anonymously specify the
string argument. Nonetheless, I have found that using the
stringr functions is not necessarily second nature for me because the
base R methods shown in many Stack Overflow answers
have ingrained in me a non-tidyverse
style. (I should be clear: these
other techniques are not “bad” or “incorrect” in any way.)
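The argument-order difference is easy to see in a small sketch. Base R's grepl() takes the pattern first and the string second, whereas stringr::str_detect() takes the string first, which is why it drops into a pipeline without a placeholder. To avoid assuming that stringr is installed, the example below mimics that ordering with a hypothetical wrapper, detect_string(), rather than calling stringr itself. It also assumes R 4.1 or later for the native pipe.

```r
fruits <- c("apple", "banana", "cherry")

# Base R: the pattern comes first, the data comes second.
hits_base <- grepl("an", fruits)

# Hypothetical string-first wrapper that mirrors the argument order of
# stringr::str_detect(string, pattern). Not part of any package.
detect_string <- function(string, pattern) {
  grepl(pattern, string)
}

# Because the data is the first argument, it can be piped in directly.
# (The native pipe |> requires R >= 4.1.)
hits_piped <- fruits |> detect_string("an")

identical(hits_base, hits_piped)
```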
Files and Folders
For naming files and folders, Jenny Bryan’s informative presentation “Naming Things” is a must-read. I do my best to follow the principles she outlines, while also adding in some of my own “flair”.
- Use “-” as a separator in file names to distinguish different topics/entities/components (e.g. ...scrape-[website]-...). Otherwise, use an underscore “_” as a separator for words that make up a single topic/entity/component.
I don’t think my choices in this matter are too controversial.
- Add a numeric prefix for files that are to be executed in a pre-defined order.
01-scrape.R
02-process.R
03-analyze.R
04-report.R
I don’t think my convention is unusual here–there is no conflict with Jenny’s principles.
Notably, I like to add a “main” file, i.e.
00-main.R, that sources
each of the files in a workflow. Something like GNU Make could also be
used, but I like to stick with R tools exclusively. In that case, an R
package such as drake could be used, but typically my project
workflow is not so complex that it necessitates the added functionality
that such packages present. Jenny discusses these things on her “All
the Automation Things” topics page.
Sometimes my project has a “configuration” or functions file (e.g.
functions-analyze.R) that might need to be “sourced”
at the beginning of a workflow. I tend not to add numeric prefixes to
these files, and I source them at the beginning of my “main” script.
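A minimal sketch of what such a main script might do, assuming the numbered scripts all live in a single folder. To keep the example self-contained, it first writes three tiny placeholder scripts into a temporary directory; the script contents and the run_log variable are made up for illustration.

```r
# Set up a throwaway folder with placeholder scripts (illustration only).
dir_scripts <- file.path(tempdir(), "main-demo")
dir.create(dir_scripts, showWarnings = FALSE)
run_log <- character(0)
writeLines('run_log <- c(run_log, "scrape")', file.path(dir_scripts, "01-scrape.R"))
writeLines('run_log <- c(run_log, "process")', file.path(dir_scripts, "02-process.R"))
writeLines('run_log <- c(run_log, "analyze")', file.path(dir_scripts, "03-analyze.R"))

# What a 00-main.R might contain: find the numbered scripts, skip the
# main file itself, and source them in order of their numeric prefix.
filepaths <- list.files(dir_scripts, pattern = "^[0-9]{2}-.*\\.R$", full.names = TRUE)
filepaths <- sort(filepaths[basename(filepaths) != "00-main.R"])
for (filepath in filepaths) {
  # local = TRUE evaluates each script in the calling environment.
  source(filepath, local = TRUE)
}

run_log
```

Relying on the two-digit prefix for ordering is what makes a plain sort() sufficient here.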
- Add numbers for file suffixes, i.e. -v01,
-v02, etc., for versioning.
This practice of mine is probably atypical. I realize that this is not really necessary if using git or some other version control service, but I have a difficult time completely deleting a file that I am not 100% sure I will never need to look back at. (I generally oppose “hoarding” of things in this manner, but some habits are hard to break.) In my opinion, keeping the file is simply easier than restoring previous snapshots in a git life cycle. Once I’m completely convinced that I no longer have use for an “old” file, then I delete it.
For functions, I do my best to implement the principles outlined in the Functions chapter of the R for Data Science e-book. The notion of using verbs for function names (unless the function is so simple that a noun might be more appropriate) makes a lot of sense to me.
Perhaps the reader might find it interesting that I like to use the verb
compute_ for functions that use
dplyr functions in a piped
action simply because it is not extremely common and, consequently, not
likely to interfere with an existing function name from an imported
package. I think my logic on this matter is in tune with general
practices–many packages nowadays use package-specific prefixes for their
functions in order to distinguish them (e.g. the
str_ prefix for stringr functions).
Additionally, sometimes I’ll use the simple verb prefix
get_ (a la the get and
set methods of an object class in an object-oriented
programming framework) if the function is fairly simple and there is no
other intuitive name to use.
I think, in general, my conventions with variables are “acceptable”.
Personally, I implement “snake case” for my variable names, although I
have no qualms with those who use some other convention (e.g. “Camel
Case”) as long as they are consistent. Interestingly, I went through a
phase with R programming where I implemented a form of the Visual Basic
style of prefixing variable names with the class type of the variable (e.g.
strVariable for a string,
intVariable for an integer, etc.).
2 I was prefixing my character variables with
ch_, and using similar prefixes for data-like variables,
etc. This practice is probably not all too abhorrent, but I think it’s
more verbose than is necessary, and goes against the original intent of
R, a language created by statisticians in order to
make scientific work more natural for those with little to no experience
with computer science.
Variables for Files and Folders
I tend to break up the names of files into the following components:
dir: Directory. i.e. what would be returned by
gsub(basename(.), "", .)
filename: Name of the file, without the directory or extension. i.e. what would be returned by
ext: File extension, without the “.”. i.e. What would be returned by
When combining these three components, I name the variable filepath
(using the OS-agnostic
file.path() function). Perhaps the more
traditional name for such a thing is simply
path, but I tend to think
that that term is a bit ambiguous.
# Should be a relative path!
dir <- "/path/to/file/"
filename <- "mysupercoolfile"
ext <- "txt"
filepath <- file.path(dir, paste0(filename, ".", ext))
# Don't care if an extra forward-slash is added here. It will still work.
filepath
## [1] "/path/to/file//mysupercoolfile.txt"
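Going the other direction, base R can split an existing path back into the same three pieces. The sketch below uses dirname(), basename(), and two helpers from the tools package that ships with R. It is one way to do the decomposition, not necessarily the exact calls referred to above, and the example path is made up.

```r
filepath <- file.path("path", "to", "file", "mysupercoolfile.txt")

# Directory portion of the path.
dir <- dirname(filepath)
# File name without the directory or the extension.
filename <- tools::file_path_sans_ext(basename(filepath))
# Extension without the leading ".".
ext <- tools::file_ext(filepath)

# Reassembling the pieces gives back the original path.
filepath_rebuilt <- file.path(dir, paste0(filename, ".", ext))
identical(filepath, filepath_rebuilt)
```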
Variables for Dates
Because I often work with dates, I have an opinion on naming date-related variables.
- I like to use yyyymmdd (as opposed to something maybe more intuitive like ymd) to refer to a date in the ISO 8601 international standard format (i.e. “2017-03-14” for March 14, 2017).
I feel that this format is more explicit and stands out. Similarly, if
assigning a number for a year, month, or day, I like to use yyyy, mm, and
dd respectively (even if the month or day is a single
digit). Others may prefer year, month, and
day, but these
names conflict directly with functions in the
lubridate package (which
I often use for converting numbers representing some kind of time/date).
The very short y, m, and
d might be another option, but I think
each can be extremely ambiguous. Is
d a day or date? Or data? Is
m a month or minute? Is
y a year or the response variable in a model?
Likewise, my logic advocating the use of
yyyymmdd for dates explains
my preference for
hhmmss for time variables (and
individual components of time-related variables).
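To make the naming concrete, here is a small base R sketch. It avoids lubridate so that it runs without extra packages; format() with strftime-style codes stands in for lubridate's year(), month(), and day() helpers.

```r
# A full date in ISO 8601 format.
yyyymmdd <- as.Date("2017-03-14")

# Individual components, named so they cannot be mistaken for the
# lubridate functions year(), month(), and day().
yyyy <- as.integer(format(yyyymmdd, "%Y"))
mm <- as.integer(format(yyyymmdd, "%m"))
dd <- as.integer(format(yyyymmdd, "%d"))

# Rebuild the date from its zero-padded components.
yyyymmdd_rebuilt <- as.Date(sprintf("%04d-%02d-%02d", yyyy, mm, dd))
yyyymmdd == yyyymmdd_rebuilt
```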
- I prefer to use colname instead of col_name when referencing the name of a column in a matrix. Since I don’t think col (for column) and name really represent separate entities, I don’t think it’s best to separate them with an underscore. (This convention also explains my preference for filename over file_name, as well as filepath over file_path.)
My stance on this matter might be “out of sync” with the tidyverse
mantra. Note that the
readr package uses
col_names as a parameter.
- I prefer
varwhen referencing the values in a
varcan be ambiguous. (Is it a variable in a formula?)
I think I’ve seen someone high-up in the R community (Hadley?) mention
that they prefer
var as well, so maybe my practice is
“advisable”, if not the norm.
- I prefer data as the name of the data parameter in a function (assuming that the operation is being performed on a
data.frame), as opposed to
d. In the case of a function that operates on columns/rows in a
matrix, I prefer the name
I think there’s probably a wide range of styles that people have
regarding this specific matter, especially the
I try to truncate longer names or actions to 3 or 4 letters. For example, I often use proc for
processed to indicate that a data object has been transformed in some manner. (In this example,
proc would be appended to the name of the input variable, i.e.
data_proc if the original variable is
data.) Some of my other “go-to”s include similar abbreviations for words such as
summarized. Notably, I like to pair a matching abbreviation for training sets with
tst when working with training and
testing sets for machine learning.
I like to use
out for the output variable of a function, regardless of what the function does (assuming that there is more than a single action in the function, which would require assignment to a variable).
I think this is fairly common. In fact, I think I picked up this convention when studying other people’s package code.
- I like to end a variable name with the “s” suffix when I know that it is a
vector. For example, I might use
yyyymmdds for a vector of dates.
yyyymmdds <- c(as.Date("2018-01-01"), as.Date("2018-02-01"))
This technique can sometimes result in awkward names, which makes me think that most people would be against it. (Perhaps the more likely case is that most people would see the difference as trivial and ask me why I’m thinking so hard about this anyways.)
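Several of the naming habits described in this section (a verb prefix for the function, out for its return value, a _proc suffix for transformed data, and a plural name for a vector) can be combined in one short sketch. The function compute_range_yyyymmdd() and its inputs are hypothetical, and base R is used instead of dplyr to keep the example dependency-free.

```r
# Hypothetical helper: returns the earliest and latest date in a vector.
compute_range_yyyymmdd <- function(yyyymmdds) {
  out <- range(yyyymmdds)
  names(out) <- c("min", "max")
  out
}

# The plural name signals a vector of dates.
yyyymmdds <- c(as.Date("2018-02-01"), as.Date("2018-01-01"))

# The _proc suffix marks a processed (here, sorted) version of the input.
yyyymmdds_proc <- sort(yyyymmdds)

yyyymmdd_range <- compute_range_yyyymmdd(yyyymmdds)
format(yyyymmdd_range)
```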
Projects and Workflow
- For project skeleton, emulate the structure of an R package as closely as possible.
To be explicit, my projects tend to look like this.
[project_name]/
|--data/
|--data-raw/
|--docs/
|--figs/
|--output/
|--R/
A short explanation of each:
- data/: Data created from scripts (e.g. a web scraper) that are not intended to be shared as part of the project’s output. Preferably stored in an R data format (e.g. .rds).
- data-raw/: Manually-collected data. Probably saved as a .csv or .xlsx file. Appropriate location for SQL files and output if a database connection (i.e. with a company/enterprise database) cannot be made directly through R.
- docs/: Documentation of the project. Preferably RMarkdown or Markdown file(s).
- figs/: Visualizations created interactively that are not designated for the project’s output.
- output/: Deliverables. Preferably files knitted from RMarkdown. Possibly includes nested figs folders.
- R/: R code. .R files that function primarily as scripts or host a set of functions that are “sourced” into a script file. (See Hadley Wickham’s book on R Packages for more explanation.)
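The skeleton above is easy to script. A minimal sketch that creates the folders with base R; the project name is a placeholder, and a temporary directory is used so the example does not touch a real project.

```r
# Placeholder project location (illustration only).
dir_proj <- file.path(tempdir(), "project_name")

# Folder names taken from the structure described above.
dirs_sub <- c("data", "data-raw", "docs", "figs", "output", "R")

# Create the project root and each subfolder.
for (dir_sub in dirs_sub) {
  dir.create(file.path(dir_proj, dir_sub), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
sort(list.dirs(dir_proj, full.names = FALSE, recursive = FALSE))
```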
Probably the most significant difference between my structure and that
of an R package is the
output/ folder (which might be considered the
analogue of a
vignettes/ folder in an R package).
Regarding project documentation, I prefer the
docs/ name for the
documentation folder, as opposed to the R package standard of man/. I think
docs/ is more appropriate because it seems to be the
standard name for a folder hosting extensive documentation of a package
(as opposed to
man/, which is for package functions). (Additionally, I
should note that the
pkgdown R package for creating a website for an R
package and its documentation generates a docs/ folder.)
Finally, in compliance with standard R package development, one should
add folders for src/ and
tests/ for C code and testing scripts, respectively. (I
personally have never written C code or tests for my projects.)
- Create an
old/ folder in the same location where a no-longer-used file previously lived.
This practice of mine goes hand-in-hand with my “manual” versioning of
files. When I begin using a newer version of a script, I’ll typically
put the previous version(s) in an old/ folder.
In general, Hadley Wickham and Jenny Bryan are the “go-to” knowledge
resources in the
R community on all things related to
style, practices, etc.
Of course, these resources represent only a very small subset of amazing resources out there. I know I could go on for days reading about style, syntax, etc., but maybe that’s just me. 3
- Of course, other programming languages/communities have their own sets of standards that deserve merit.
- Note that Camel Case is the preferred style amongst the Visual Basic community.
- If you enjoy this topic, then you probably enjoy classic software-related debates such as “spaces vs. tabs” and “R vs. python”.