I just finished reviewing a pull request in the knitr repo that tries to improve the error message when it fails to parse YAML, and I feel three base R functions are worth mentioning to more R users. I have been inspired by Maëlle Salmon’s 3 functions blog series, and finally started writing one by myself.
regexec(): get substrings with patterns
If you want to master string processing using regular expressions (regex) with
base R, the two help pages ?grep and ?regexp are pretty much all you need.
Although I had read them many times in the past, I did not discover regexec()
until about three years ago, even though the function had been part of base R
for years by then.
This function gives you the positions of substring groups captured by your
regular expressions. It will be much easier to understand if you actually get
the substrings instead of their positions, which can be done via regmatches(),
another indispensable function when you work with functions like regexec() and
regexpr(). For example:

    x = 'abbbbcdefg'
    m = regexec('a(b+)', x)  # positions
    regmatches(x, m)  # substrings

    [[1]]
    [1] "abbbb" "bbbb"

The length of the returned value depends on how many
() groups you have in the
regular expression. In the above example, the first value is the whole match
(abbbb is matched by a(b+)), and the second value is for the first group
(b+) (one or more consecutive b's).
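To see how the number of groups changes the result, here is a small sketch on a made-up string (the pattern and variable names are only for illustration). It also shows what comes back when nothing matches:

```r
# A made-up input; the pattern has two () groups
x = 'abbbbcdefg'
m = regmatches(x, regexec('a(b+)(c)', x))[[1]]
length(m)  # the whole match plus one value per group
m

# When the pattern does not match, the result for that string is empty
n = regmatches('xyz', regexec('a(b+)', 'xyz'))[[1]]
length(n)
```

Checking for a zero-length result is a simple way to tell "no match" apart from a match with empty groups.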
If you do not know regexec() or regmatches(), it is natural to extract the
substrings with substr(), like the aforementioned pull request originally did:

    message = e$message
    regex = "line (?<line>\\d+), column (?<column>\\d+)"
    regex_result = regexpr(regex, message, perl = TRUE)
    starts = attr(regex_result, "capture.start")
    lengths = attr(regex_result, "capture.length")
    line_index = substr(message, starts[,"line"], starts[,"line"] + lengths[,"line"] - 1)
    column_index = substr(message, starts[,"column"], starts[,"column"] + lengths[,"column"] - 1)
    line_index = as.integer(line_index)
    column_index = as.integer(column_index)

Its goal is to extract a line and column number from a string of the form
"line x, column y". I rewrote the code (using my obnoxious
one-letter-variable-name style) as:

    x = e$message
    r = "line (?<row>\\d+), column (?<col>\\d+)"
    m = regmatches(x, regexec(r, x, perl = TRUE))[[1]][-1]
    row = as.integer(m['row'])
    col = as.integer(m['col'])

(?<NAME>...) means a named capture, so you could later extract the
substrings by names instead of numeric indices, e.g., m['row'] instead of
m[1]. But this is not important. It is okay to use a numeric index.
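For completeness, here is a self-contained sketch of the numeric-index variant. The error message below is hypothetical (it only imitates the "line x, column y" form), and the groups are unnamed, so the default regex engine is enough:

```r
# A hypothetical error message in the "line x, column y" form
x = 'Scanner error: mapping values are not allowed at line 3, column 12'
r = 'line (\\d+), column (\\d+)'

# [[1]] picks the result for the first (and only) string;
# [-1] drops the whole match and keeps only the two groups
m = regmatches(x, regexec(r, x))[[1]][-1]
row = as.integer(m[1])
col = as.integer(m[2])
c(row, col)
```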
BTW, if you are new to regular expressions and not sure if you should use
perl = TRUE or FALSE (often the default) in the regex family of functions,
I recommend perl = TRUE. Perl-compatible regular expressions (PCRE) should
cause you fewer surprises and are more powerful.
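As one illustration of "more powerful", lookahead is a PCRE feature. The sketch below uses made-up strings. The call with the default engine is wrapped in tryCatch() because that engine is expected to reject the pattern rather than match it:

```r
x = c('line 3', 'column 12')

# (?= ...) is a lookahead: match "line" only when " 3" follows it
p = grepl('line(?= 3)', x, perl = TRUE)
p

# The same pattern without perl = TRUE; the handlers keep the script
# running if the default engine signals a warning or an error
d = tryCatch(
  grepl('line(?= 3)', x),
  warning = function(w) 'not supported',
  error = function(e) 'not supported'
)
d
```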
strrep(): repeat a string a number of times
How many times have you done this?

    paste(rep('abc', 10), collapse = '')

I have done it numerous times. Now, no more of that: strrep('abc', 10) does
the same thing. It is even vectorized like most other base R functions, e.g.,

    strrep(c('abc', 'defg'), c(3, 4))

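A quick sketch to confirm that the two approaches agree, and to show what the vectorized call returns (the variable names are only for illustration):

```r
# The old way and the strrep() way give the same string
s1 = paste(rep('abc', 10), collapse = '')
s2 = strrep('abc', 10)
identical(s1, s2)

# Vectorized over both arguments: 'abc' three times, 'defg' four times
v = strrep(c('abc', 'defg'), c(3, 4))
v

# A single string is recycled across several repeat counts
w = strrep('-', 1:3)
w
```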
I do not want to pretend that I have always known everything—in fact, I did not discover this function until about two years ago.
It is common to generate
N spaces like the original pull request did:

    spaces = paste(rep(" ", column_index), collapse = "")
    cursor = paste(spaces, "^~~~~~", collapse = "")

And I rewrote it as:

    cursor = paste0(strrep(" ", col), "^~~~~~")

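Here is a small sketch of what the cursor line looks like under a line of text. The YAML line and the column number are made up for illustration:

```r
# Hypothetical inputs: a line of YAML and a column reported by the parser
line = 'title: a: b'
col = 8

# col spaces followed by the marker
cursor = paste0(strrep(' ', col), '^~~~~~')
cat(line, cursor, sep = '\n')

# For comparison: paste() separates its arguments with a space by default,
# so this version is one character longer
cursor2 = paste(strrep(' ', col), '^~~~~~')
nchar(cursor2) - nchar(cursor)
```

The difference matters only if the marker has to sit exactly under a given column, and paste0() makes the width easy to reason about.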
append(): insert elements into a vector
Maëlle has mentioned
append() in her post. Interestingly, it could be used in
this pull request, too. Original code:

    split_indexes = seq_along(meta) <= line_index
    before_cursor = meta[split_indexes]
    after_cursor = meta[!split_indexes]
    error_message = c(
      "Failed to parse YAML: ", e$message, "\n",
      before_cursor, cursor, after_cursor
    )

I rewrote it as:

    x = c("Failed to parse YAML: ", x, "\n", append(meta, cursor, row))

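The third argument of append() is the position after which the new values are inserted. A minimal sketch with made-up YAML lines and a made-up row number shows that it gives the same result as the manual split:

```r
# Hypothetical YAML lines, cursor line, and row number
meta = c('title: Test', 'author: a: b', 'date: today')
cursor = paste0(strrep(' ', 9), '^~~~~~')
row = 2

# Insert the cursor after the 2nd element
out = append(meta, cursor, row)
out

# The manual split produces the same vector
i = seq_along(meta) <= row
identical(out, c(meta[i], cursor, meta[!i]))

# after = 0 inserts at the beginning; by default values go to the end
first = append(1:3, 0L, after = 0)
first
```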
I remember when I first learned S-Plus in 2004, I was surprised to see that a
classmate had written a t.test() function by herself (which was actually cool),
and she was equally surprised when I told her that there was a built-in
function. I think similar things still happen today. If you are not aware of
functions like append(), it is easy and tempting to reinvent them, which can
make your code lengthy and complicated.