Cache

Examples for the cache feature

2011-12-04

We can use the chunk option cache=TRUE to enable cache, and the option cache.path can be used to set the cache directory. See the options page.

Cache examples

The cache feature is used extensively in many of my documents, e.g. you can find it in the knitr main manual or its graphics manual. Here are a few more examples:

Important notes

You have to read the section on cache in the main manual very carefully to understand when cache will be rebuilt and which chunks should not be cached.

Let me repeat the factors that can affect cache (any change on them will invalidate old cache):

  1. all chunk options except include; e.g. change tidy=TRUE to FALSE will break the old cache, but changing include will not
  2. R code in a chunk; a tiny change in the R code will lead to removal of old cache, even if it is a change of a space or a blank line

It is extremely important to note that usually a chunk that has side-effects should not be cached. Although knitr tries to retain the side-effects from print(), there are still other side-effects that are not preserved. Here are some cases that you must not use cache for a chunk:

  1. setting R options like options('width') or pdf.options() or any other options in knitr like opts_chunk$set(), opts_knit$set() and knit_hooks$set()
  2. loading packages via library() in a cached chunk and these packages will be used by uncached chunks (it is entirely OK to load packages in a cached chunk and use them only in cached chunks because knitr saves the list of packages for cached chunks, but uncached chunks are unable to know which packages are loaded in previous cached chunks)

Otherwise next time the chunk will be skipped and all the settings in it will be ignored. You have to use cache=FALSE explicitly for these chunks.

The functions source() and setGeneric() have side effects of creating objects in the global environment even if the code is evaluated in a local environment. Before knitr v0.4, it was unable to cache these global objects (e.g. issue #138), but since v0.4, they can be cached as well because knitr checks newly created objects in globalenv() and save them as well.

Although the list of packages used in cached chunks is saved, this is not a perfect way of caching package names: if you loaded a package but removed it later, knitr will be unable to know it (knitr is only able to capture newly loaded packages). You have to manually edit the __packages file under the cache directory as described in #382.

Even more stuff for cache?

While the above objects seem reasonable to affect cache, reproducible research may be even more rigorous in the sense that cache can be invalidated by other changes. One typical example is the version of software; it is not impossible for two different versions of R to give you different results. In this case, we may set

opts_chunk$set(cache.extra = R.version.string)
opts_chunk$set(cache.extra = R.version) # or even consider platform

so the cached results are only applicable to a specific version of R. When you upgrade R and recompile the document, all the results will be re-computed.

Similarly you can put more variables into this option so that the cache is preserved only given environments. Here is an ambitious example:

## cache is only valid with a specific version of R and session info
## cache will be kept for at most a month (re-compute the next month)
opts_chunk$set(cache.extra = list(
  R.version, sessionInfo(), format(Sys.Date(), '%Y-%m')
))

The issue #238 shows another good use of this option: the cache is associated with the file modification time, i.e. when the data file is modified, the cache will be rebuilt automatically.

Note you can actually use any option name other than cache.extra to introduce more objects into the cache condition, e.g. you can call it cache.validation. The reason is that all chunk options are taken into account when validating the cache.

Associate cache directory with the input filename

Sometimes we may want to use different cache directories for different input files by default, and there is one solution in issue #234. However, I still recommend you to do this setting inside your source document to make it self-contained (use opts_chunk$set(cache.path = ...)).

More granular cache

Besides TRUE and FALSE for the chunk option cache, advanced users can also consider more granular cache by using numeric values for cache = 0, 1, 2, 3: 0 means FALSE, and 3 is equivalent to TRUE. For cache = 1, the results of the computation (from evaluate::evaluate()) are loaded from the cache, so the code is not evaluated again, but everything else is still executed, such the output hooks and saving recorded plots to files. For cache = 2, it is very similar to 1, and the only difference is that the recorded plots will not be resaved to files when the plot files already exist, which might save some time when the plots are big. It is recommended to use cache = 2 instead of 1, because there is no guarantee that recorded plots in a previous R session can be safely resaved in another R session, or using another version of R.

When cache = 1, 2, only a few chunk options affect the cache; see knitr:::cache1.opts and knitr:::cache2.opts for the option names. Basically, the cache will not be invalidated if a chunk option that does not affect the code evaluation is changed. For example, we change echo from TRUE to FALSE, or set fig.cap = 'a new caption'; however, if we change eval from TRUE to FALSE, or cache.path='foo/' to 'bar/', the cache has to be rebuilt.

See the example #101 (output) for some examples.

In this way, we can separate the computing from document output rendering, and it can be useful to tweak the output without breaking the cache. See #396 and #536.

Reproducibility with RNG

Knitr also caches .Random.seed and it is restored before the evaluation of each chunk to maintain reproducibility of chunks which involve with random number generation (RNG). However, there is a problem here. Suppose chunk A and B have been cached; now if we insert a chunk C between A and B (all three chunks have RNG in them), in theory B should be updated because RNG modifies .Random.seed as a side-effect, but in fact B will not be updated; in other words, the reproducibility of B is bogus.

To guarantee reproducibility with RNG, we need to associate .Random.seed with cache; whenever it is modified, the chunk must be updated. It is easy to do so by using an unevaluated R expression in the cache.extra option, e.g.

opts_chunk$set(cache.extra = rand_seed)

See ?rand_seed (it is an unevaluated R expression). In this case, each chunk will first check if .Random.seed has been changed since the last run; a different .Random.seed will force the current chunk to rebuild cache.