My Biggest Regret in the knitr Package

The knitr package was written in late 2011, when I knew little about character encodings. I still don’t know much about them today, but I have had enough pain. At RStudio, I have become the (self-nominated) “Character Encoding Ambassador”, mainly because I’m the only native Chinese speaker in the company, and Chinese characters are multibyte. Problems related to character encodings are often reported by Chinese users. If we can fix these problems, chances are the same problems in other languages will disappear at the same time.

In retrospect, the default knitr::knit(..., encoding = getOption("encoding")) was the biggest mistake in knitr. Now I regret and hate it day and night. I used this default because it is what all base R functions do when reading or writing files; for example, take a look at the help page ?file. The default means the system native encoding (native.enc), which is a total mess on Windows, where different languages have different default character encodings (e.g., Simplified Chinese uses GBK). On *nix, the default encoding is often UTF-8.
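To make the default concrete, here is a minimal base R sketch. It only illustrates the option and the explicit alternative; it is not how knitr reads files internally, and the temporary file and sample string are made up for illustration.

```r
# The option that base R connections fall back on when no encoding is given.
getOption("encoding")  # "native.enc" unless the user has changed it

# file() takes its default encoding from that same option.
deparse(formals(file)$encoding)

# Two Chinese characters, written with \u escapes so this script itself
# does not depend on the encoding of the source file.
x <- "\u4e2d\u6587"

# Write the UTF-8 bytes as they are, without re-encoding to the native encoding.
f <- tempfile(fileext = ".txt")
writeLines(enc2utf8(x), f, useBytes = TRUE)

# Read the file back and mark the strings as UTF-8, regardless of the locale.
y <- readLines(f, encoding = "UTF-8")
identical(y, x)
```

The point of the sketch is that nothing in the write or read step consults the system locale, so the same file round-trips the same way on Windows and on *nix.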

If Windows hadn’t introduced all these different encodings, I think half of the programmers in the world could spend three months on the beach enjoying the sunshine every year. FWIW, I have heard rumors about UTF-8 support in a certain future version of Windows. Even with that, I guess it will take several more years to clean up the mess.

UTF-8 works everywhere, but it takes courage for software developers to force Windows users to use UTF-8. Over the years, I have admired the decision of Pandoc’s author, John MacFarlane: UTF-8 in, and UTF-8 out. Most other software packages chose to compromise and use the system native encoding by default. The compromise brings endless headaches to developers, but the gain for users is marginal at most, if not nil. If users are forced to use UTF-8 consistently, the life of developers will be much easier, and the hassle for users is actually minimal: they only need to choose the UTF-8 encoding when reading or saving a file. In fact, if the file editor gives users an option to set the encoding globally, there will be no hassle at all. For example, in RStudio, you can set the default text encoding to UTF-8:

Set UTF-8 as the default encoding in RStudio

Needless to say, if you haven’t set this option on Windows, I won’t be your friend.

If you don’t use a language with multibyte characters, you probably won’t feel the pain. This issue has wasted so much of our time, as Xianying Tan wrote earlier this year (in Chinese). Like me in the past, he has been trying to fix these encoding problems in various R packages, e.g., in plumber, roxygen2, devtools, data.table, odbc, and so on.

Both knitr and rmarkdown default to the native encoding due to my initial blind decision, but since bookdown, I have started enforcing UTF-8. All of bookdown, blogdown, xaringan, tinytex, and my other packages since 2016 only support UTF-8, based on two simple helper functions in the xfun package: xfun::read_utf8() and xfun::write_utf8(). Before then, I was a little worried about whether this would offend any users. The fact is, I haven’t received any complaints about UTF-8 in these packages.
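A short usage sketch of the two xfun helpers mentioned above, assuming the xfun package is installed. The temporary file and its contents are made up for illustration.

```r
# Only run the example if xfun is available.
if (requireNamespace("xfun", quietly = TRUE)) {
  # Sample lines containing Chinese characters, written with \u escapes.
  lines <- c("# \u6807\u9898", "\u4e2d\u6587\u5185\u5bb9")

  # Write the lines to a temporary file as UTF-8.
  f <- tempfile(fileext = ".md")
  xfun::write_utf8(lines, f)

  # Read them back as UTF-8.
  res <- xfun::read_utf8(f)
  print(identical(res, lines))
}
```

Neither call has an encoding argument to get wrong, which is the whole idea: the encoding is a fixed assumption rather than a parameter.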

If you plan to write a package that involves file input/output, I’d strongly recommend that you support, and only support, UTF-8. The UTF-8 assumption can save you a lot of time guessing and debugging.
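As one possible shape for such a package, here is a minimal sketch using base R only. The function name read_input() is hypothetical, and failing with an error on invalid input is my assumption about a reasonable policy, not something the packages above are documented to do.

```r
# Hypothetical reader for a package that supports UTF-8 only: there is no
# encoding argument, and input that is not valid UTF-8 is rejected early.
read_input <- function(path) {
  x <- readLines(path, encoding = "UTF-8", warn = FALSE)
  bad <- which(!validUTF8(x))
  if (length(bad) > 0) {
    stop(
      "The file ", path, " is not encoded in UTF-8 (first invalid line: ",
      bad[1], "). Please re-save it as UTF-8.",
      call. = FALSE
    )
  }
  x
}

# A valid UTF-8 file is read as usual.
good <- tempfile()
writeLines(enc2utf8("\u4e2d\u6587"), good, useBytes = TRUE)
ok <- read_input(good)

# A file with a lone 0xE9 byte (e-acute in Latin-1) is not valid UTF-8.
bad_file <- tempfile()
writeBin(as.raw(c(0xe9, 0x0a)), bad_file)
err <- tryCatch(read_input(bad_file), error = function(e) e)
inherits(err, "error")
```

Failing early with a clear message moves the guessing from the developer to a single, explainable step for the user: re-save the file as UTF-8.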

Someday I’ll force knitr and rmarkdown into the UTF-8 world, too (perhaps in knitr and rmarkdown v2.0), but this will certainly take a long time to happen.

UTF-8, and UTF-8 only, or we cannot be friends

Yihui Xie 2018-11-09