We can use the chunk option cache=TRUE
to enable cache, and the option cache.path
can be used to set the cache directory. See the options page.
Cache examples
The cache feature is used extensively in many of my documents, e.g. you can find it in the knitr main manual or its graphics manual. Here are a few more examples:
- basic examples
- cache large data: 056-huge-plot.Rmd (output)
- an example using the Rtex syntax: knitr-latex.Rtex
- automatic dependencies
- Rnw source: knitr-dependson.Rnw
- with the chunk option
autodep=TRUE
and the functiondep_auto()
, knitr can figure out the dependencies among chunks automatically, which may save some manual efforts to specify thedependson
option
Important notes
You have to read the section on cache in the main manual very carefully to understand when cache will be rebuilt and which chunks should not be cached.
Let me repeat the factors that can affect cache (any change on them will invalidate old cache):
- all chunk options except
include
; e.g. changetidy=TRUE
toFALSE
will break the old cache, but changinginclude
will not - R code in a chunk; a tiny change in the R code will lead to removal of old cache, even if it is a change of a space or a blank line
It is extremely important to note that usually a chunk that has side-effects should not be cached. Although knitr tries to retain the side-effects from print()
, there are still other side-effects that are not preserved. Here are some cases that you must not use cache for a chunk:
- setting R options like
options('width')
orpdf.options()
or any other options in knitr likeopts_chunk$set()
,opts_knit$set()
andknit_hooks$set()
- loading packages via
library()
in a cached chunk and these packages will be used by uncached chunks (it is entirely OK to load packages in a cached chunk and use them only in cached chunks because knitr saves the list of packages for cached chunks, but uncached chunks are unable to know which packages are loaded in previous cached chunks)
Otherwise next time the chunk will be skipped and all the settings in it will be ignored. You have to use cache=FALSE
explicitly for these chunks.
The functions source()
and setGeneric()
have side effects of creating objects in the global environment even if the code is evaluated in a local environment. Before knitr v0.4, it was unable to cache these global objects (e.g. issue #138), but since v0.4, they can be cached as well because knitr checks newly created objects in globalenv()
and save them as well.
Although the list of packages used in cached chunks is saved, this is not a perfect way of caching package names: if you loaded a package but removed it later, knitr will be unable to know it (knitr is only able to capture newly loaded packages). You have to manually edit the __packages
file under the cache directory as described in #382.
Even more stuff for cache?
While the above objects seem reasonable to affect cache, reproducible research may be even more rigorous in the sense that cache can be invalidated by other changes. One typical example is the version of software; it is not impossible for two different versions of R to give you different results. In this case, we may set
opts_chunk$set(cache.extra = R.version.string)
opts_chunk$set(cache.extra = R.version) # or even consider platform
so the cached results are only applicable to a specific version of R. When you upgrade R and recompile the document, all the results will be re-computed.
Similarly you can put more variables into this option so that the cache is preserved only given environments. Here is an ambitious example:
## cache is only valid with a specific version of R and session info
## cache will be kept for at most a month (re-compute the next month)
opts_chunk$set(cache.extra = list(
R.version, sessionInfo(), format(Sys.Date(), '%Y-%m')
))
The issue #238 shows another good use of this option: the cache is associated with the file modification time, i.e. when the data file is modified, the cache will be rebuilt automatically.
Note you can actually use any option name other than cache.extra
to introduce more objects into the cache condition, e.g. you can call it cache.validation
. The reason is that all chunk options are taken into account when validating the cache.
Associate cache directory with the input filename
Sometimes we may want to use different cache directories for different input files by default, and there is one solution in issue #234. However, I still recommend you to do this setting inside your source document to make it self-contained (use opts_chunk$set(cache.path = ...)
).
More granular cache
Besides TRUE
and FALSE
for the chunk option cache
, advanced users can
also consider more granular cache by using numeric values for cache = 0, 1, 2, 3
: 0
means FALSE
, and 3
is equivalent to TRUE
. For cache = 1
,
the results of the computation (from evaluate::evaluate()
) are loaded from
the cache, so the code is not evaluated again, but everything else is still
executed, such the output hooks and saving recorded plots to files. For
cache = 2
, it is very similar to 1
, and the only difference is that the
recorded plots will not be resaved to files when the plot files already
exist, which might save some time when the plots are big. It is recommended
to use cache = 2
instead of 1
, because there is no guarantee that
recorded plots in a previous R session can be safely resaved in another R
session, or using another version of R.
When cache = 1, 2
, only a few chunk options affect the cache; see
knitr:::cache1.opts
and knitr:::cache2.opts
for the option names.
Basically, the cache will not be invalidated if a chunk option that does not
affect the code evaluation is changed. For example, we change echo
from
TRUE
to FALSE
, or set fig.cap = 'a new caption'
; however, if we change
eval
from TRUE
to FALSE
, or cache.path='foo/'
to 'bar/'
, the cache
has to be rebuilt.
See the example #101 (output) for some examples.
In this way, we can separate the computing from document output rendering, and it can be useful to tweak the output without breaking the cache. See #396 and #536.
Reproducibility with RNG
Knitr also caches .Random.seed
and it is restored before the evaluation of each chunk to maintain reproducibility of chunks which involve with random number generation (RNG). However, there is a problem here. Suppose chunk A and B have been cached; now if we insert a chunk C between A and B (all three chunks have RNG in them), in theory B should be updated because RNG modifies .Random.seed
as a side-effect, but in fact B will not be updated; in other words, the reproducibility of B is bogus.
To guarantee reproducibility with RNG, we need to associate .Random.seed
with cache; whenever it is modified, the chunk must be updated. It is easy to do so by using an unevaluated R expression in the cache.extra
option, e.g.
opts_chunk$set(cache.extra = rand_seed)
See ?rand_seed
(it is an unevaluated R expression). In this case, each chunk will first check if .Random.seed
has been changed since the last run; a different .Random.seed
will force the current chunk to rebuild cache.