Yihui Xie

Random Thoughts (Episode 1)

Yihui Xie / 2021-03-31


As an attempt to close the 300 tabs in my browser, I’m starting this series of blog posts to write down some random thoughts. Some of them are from the tweets that caught my attention, and some are from Github issues or random articles, etc. Each item may be worth a separate blog post, but I cannot possibly afford the time to write so many posts. However, I’m not sure if there will ever be an Episode 2, because these random thoughts include some unpopular opinions and open questions on my mind, to which I have no answers. I may have to just give up writing this type of post to avoid creating even more noise to this noisy enough world. To minimize the spread of my noise, I will hold the actual content of this post for at least 90 days before publishing it. I also wish that readers do not share this post. Thanks!

  1. I hate starting with this one, but I sigh a sigh upon seeing the cost of pursuing things that “look good visually,” such as the arrow in Fira Code. Ligatures are both clever and beautiful in Fira Code, but you get asked every single time by someone in the audience “how did you input that arrow or triangle?” Then you clarify that it was not a real arrow or triangle but actually two characters cleverly joined. The audience thought it was cool nonetheless, just like the past you. Once this person has been successfully “infected,” and gives a presentation the next time, there will be another person in the audience asking the same question, and the infection goes on and on…

    Does <- look so much worse than ←? Or |> worse than ▷?

    Well, do I use Fira Code? I actually do. I’m one of the infected. Will I miss it if it were gone or not invented in the first place? No, I will not. Perhaps we could be better off if it were not invented…

  2. Not surprisingly, the above tweet brought up one of the most controversial topics about R on Twitter—the assignment operator <- vs =. Grant has linked to a Github reply that I wrote in 2018 and a blog post by Matthew Martin (for the record, John Mount has also written a great post on this topic in 2016). This topic always reminds me of a mistake that John C Nash made in 2011 (fun(arg <- value) by accident), and a more recent one Roger Peng made in 2019 (typing i < -2 when he meant i <- 2). Although I have my own strong opinion, I often refrain from debating on it or recommending other people to use = (I have even provided an option in formatR::tidy_source() for users to automatically substitute = with <-.), because I’d be more worried if I win, which will only make the whole situation even more miserable. People who use <- know there is no compelling technical reason to use it, and I just wish they do not push us too hard. I promise I will be as quiet as possible.

    This is perhaps another example of something that looks so good visually that its appearance makes its cost negligible.

  3. A hex logo seems to be mandatory now when developing an R package, because everyone is making a hex sticker. I started to doubt if this trend was a good thing overall, because not all package authors have the graphic design talent or could afford (the time and cost of) hiring a designer or purchasing hex stickers. Well, Nan Xiao has at least provided a simple solution to generate a minimalist hex logo, which can be helpful.

  4. I do not understand why people need to be embarrassed for not being good at programming with the Tidyverse but good at base R. Do I feel bad for primarily using base R? No. The vast majority of my programming code is written with base R. I admit base R has its quirks, but I’m reasonably familiar with them. As I mentioned in this post, I have only a single complaint about base R, which is the partial matching by name of list elements. That said, I do not mean to recommend everyone to use base R. On the contrary, I’d still warn newcomers if they want to start with base R, but there is nothing embarrassing about being a master of base R.

  5. I have seen people tripped over base R’s concept of “vectors” several times, e.g., when they use is.vector(), what they actually need is often is.atomic() instead. If you realize that a list is a generic vector, you would probably not be too surprised by the behaviors of functions that operate on vectors. As a Chinese, I find the English phrase “make sense” interesting, because I have rarely consciously thought that “sense” needs to be “made” instead of being self-explanatory, although in reality, a lot of sense has to be “made.”

  6. Some quirks in base R are surprising or annoying, but I do not think R core wants to trip you over deliberately. Their hands are much more tightly tied than package authors and users, and they have to be more conservative. Stability is a valid choice. When they want to correct mistakes (especially design mistakes), it will often take a long time. We should acknowledge that they have been trying to do so over the years. For example, the infamous stringsAsFactors = TRUE was finally changed to FALSE in R 4.0.0. Similarly, they are aware of the && operator’s issue (i.e., it should not have operated on vectors of length greater than one), and have started to fix it since 2017. It is not a trivial change, and cannot happen overnight.

  7. Why does the typo lenght() occur frequently? My explanation is that it is faster to type h than t because h requires your right index finger to move horizontally by one key (from j), whereas t requires your left index finger to move diagonally to the up-right (from f), which means the distance is sqrt(2) keys.

  8. I wished to make a change in knitr, which will require the authors of packages that depend on knitr to make a relatively simple change accordingly. Originally I felt there was no hurry and was hoping to give these authors a year or two to make the change in their packages, but due to an email from CRAN, I had to make this change sooner, so I started to throw errors in R CMD build for packages that have not made the change. I did not throw this error during R CMD check on CRAN, because I did not want these packages to be removed from CRAN just because of this simple change, otherwise I’d be very guilty. However, I did not consider the case of BioC, and soon Hervé Pagès arrived, with smoke… Five days later, the smoke was gone, fortunately. I want to thank him and those authors who were affected by this issue for their patience and understanding. I wish you could trust me that I hate disturbing you with this type of issue, but sometimes things just go beyond my control, in which case I will try my best to make it up.

  9. When I have a scheduled meeting or phone call on a day, usually it will be hard for me to focus on anything important until the meeting/call is over. I’m certainly not alone (e.g., this and this). I’m all for on-demand meetings, but scheduled recurring meetings make me nervous from time to time, especially when I do not have an idea about what the meeting is going to cover or what I need to prepare.

  10. I’m also nervous about Slack direct messages. I liked Slack for a while when it first came out a few years ago, but found it too distracting to me later, so I disliked it (just like Hilary). Emails have been overwhelming enough to me. Slack messages are like an email split into small pieces and shooting out randomly. I’m nervous because I feel as if the other person were waiting on the other end for me to reply immediately (which I know is unlikely to be true), and if I quit Slack without replying, I will forget to reply since there are so many messages. Once I kept Slack open for about a whole week because I was asked a question that I did not have time to answer. Of course, simply saying “Hi” without any other messages following up can make everyone nervous.

    Tools have made it faster and more convenient to communicate (e.g., instant/direct messages), but our brain’s capacity to deal with the growing number of messages has not increased substantially at the same time. I wonder if we’d actually benefit from some hurdle in tools for communication. No one would write a letter to a friend with a single word “Hi” on paper and mail it out with a stamp.

  11. An educator received an email titled “F* you” from a student and mentioned it on Twitter. For some reason, the tweet was deleted later (replies still exist). I do not understand why she chose Twitter as an outlet for such a complicated matter. In my opinion, for anything that is potentially controversial, if you post it on social media, everyone will lose, and the only winner will be the social media platform. I’m not blaming the OP at all. I just want to understand why and wonder if there could be a better solution.

  12. I have mixed feelings about memes now. I do not know how I should take them—should I treat them as pure fun of the OP? I find it disturbing when a meme is about demeaning, e.g., “How to master R; Haha good luck; crying.” Perhaps it was not meant to be demeaning. Tweets are often too terse, and I do not know how to interpret them.

  13. I envied Karthik when he said there was almost no FOMO in 2020. I was able to get rid of the FOMO in previous years (which was not easy), but it struck me heavily again in late 2020, and I could not stop checking Twitter multiple times every day, because my sense of security was destroyed, and I could no longer afford reading Twitter only once a week.

  14. Sometimes I feel conclusions are drawn too quickly on Twitter, but it is hard to question them in the environment of social media, where it seems that people have to be either friends or enemies. For example, what if the software engineer really does not know that there exist last names with fewer than 3 characters? Would ignorance mean malice? It was a fault, and needed to be corrected (and was corrected soon with an apology), but could there have been a better way to resolve this issue? What if she contacted the company in private and confirmed the malice before blaming them on social media? When an individual is confronting a company, the individual may be on the weaker side, but when a company is confronting the social media, the company may be on the weaker side.

  15. Why is the internet such a brutal place? I’m very sorry for you.

  16. This bullet took me quite a bit of courage to write, and I had to hesitate for months. Finally I decided to write it in an HTML comment. Please do me a favor not to spread it. Someday I will try to say it more openly, but not now, because now I have too many perplexities and doubt if I have enough sanity. <!–# Sometimes I feel the comparison between Tidyverse with base R is not done fairly, such as https://twitter.com/drob/status/1341841693990150147 and https://twitter.com/andrewheiss/status/1359583543509348356. For @drob’s tweet, I think the comparison would have been much fairer, had it used base R’s expand.grid() instead of the obviously awkward nested loops. In this particular case, I do not see an obvious advantage of tidyr’s crossing() over base R’s expand.grid(). For @andrewheiss’s slide on convincing people of using the pipe, a fairer comparison IMHO would be something like this:

    me <- wakeup(me, time = “8:00”) me <- getoutofbed(me, side = “correct”) me <- getdressed(me, pants = TRUE, shirt = TRUE) me <- leavehouse(me, car = TRUE, bike = FALSE)

    Does that look horrible? For sure, it does not look as elegant as the pipe code, but is not that bad, either. I use this coding style myself all the time. It is a dumb style, but also super easy to develop and debug, since I can easily examine the result of any intermediate step.

    Both cases feel like attacking a straw man and exaggerating the merits of Tidyverse and the pipe.

    If you have comments on this, please feel free to email me in private. Again, please do not spread it on Twitter. Thanks! –>

  17. I feel the joke “for compatibility with S” has gone too far at times (especially on Twitter). It would be nicer if we could confirm with R core that a certain (mis)feature is really for compatibility with S before mocking at this reason, otherwise this possibly made-up reason may discourage newcomers from suggesting changes to R core. I think the stringsAsFactors example above has given us some hope that some improper features or design ideas could be changed or abandoned. Personally I do not believe that “compatibility with S” would be the sole reason that a certain peculiarity of base R has not been removed. How about thinking a little more of Chesterton’s fence? If a fence is apparently useless or silly, perhaps the right thing to do next is to ask why, instead of deciding to take it down right away.

    That said, I admit that sometimes I cannot hold the urge to take down this type of fence myself.

  18. BTW, speaking of R core, I really appreciate Martin Mächler’s openness and responsiveness. He has been so nice and patient over so many years. It is not easy. I truly respect him.

  19. Then should I talk about CRAN? My mixed feelings tell me no. Okay. But… how about the new URL checking policy enforced since the end of 2020? I can never understand why a redirected URL (not to mention dead URLs) inside a package’s documentation can make the whole package unacceptable to CRAN. This is too stern.

    I believe the intention was good. They want to make sure all URLs are actually accessible. I love the idea of automatically checking the validity of URLs, and I appreciate the NOTEs. I absolutely do. I just do not understand why a problematic (should 301 redirect really be considered problematic?) URL can fail a submission, which may contain hours, days, or even months of work. It hurts.

    I’m not sure if CRAN expected this consequence: since the policy is so stern, package authors would rather remove all URLs from the package to dodge the check. I have removed certain URLs myself, and hidden README files, and even package vignettes from the source package because of the “problematic” URLs, which I believe are valid. If the goal of CRAN’s policy maker was to improve the quality of package documentation, the actual effect could be the opposite—package authors have to remove working URLs (such as 301 redirects), and hide more documentation from the package. Is this an example of perverse incentive?

  20. I have been surprised for a few times when people say for-loops are too slow (example) or too verbose (example), so you should always use lapply() or purrr instead. I have also heard people say that for-loops should absolutely not be taught to beginners. I’m not convinced that for-loops are that bad or advanced. Vectorization in R is great, but it does not mean for-loops need to be banned, or only advanced users can write for-loops. I agree that if you teach people the for-loop hammer, they may tend to hit every nail with this hammer (just because it is easy to understand and works universally). Is that the fault of for-loops, or just because it takes time and effort to learn other specialized (vectorized) hammers?

    I did a quick Google search to see why people think for-loops are bad and found this one. A better hammer would be indexing by [] instead of using for-loops. The thing that bit the author was actually the unnecessary gsub() (stringr::str_replace_all() was also an overkill). I’d just use []:

    # if the data only contains codes R/C/M/W/1/2/3/4
    recode_data = function(x) {
      y = c(
        'R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family',
        'W' = '', '1' = '', '2' = '', '3' = '', '4' = ''
      )
      y[x]  # use the original data to index y
    }
    
    # slightly more complicated if the data also contains other codes
    recode_data = function(x) {
      y = c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family')
      z = y[x]
      z[x %in% c('W', 1:4)] = ''
      i = is.na(z)  # preserve the additional codes
      z[i] = x[i]
      z
    }
    

    I have also heard people say that the indexing methods in base R (e.g., [] , [[]], and $) should never be taught (at least to beginners). I do not know if that makes sense, but I think looking up elements in a table (e.g., a vector) is a fundamental and important skill. I do not feel [] is that bad or complicated. Perhaps it is just me.

  21. I learned a service named Typefully, which helps you convert an article into a Twitter thread. I feel sad that such services exist. Since when we find reading a full article so unbearable that we can only chew at most 280 characters at a time?

  22. We may well be used to this miserably polarized world in which people constantly and even stubbornly choose sides, but I was still quite surprised to come across this paper “Breiman’s two cultures: You don’t have to choose sides.” The very first sentence in the abstract shocked me:

    Breiman’s classic paper casts data analysis as a choice between two cultures: data modelers and algorithmic modelers.

    I’m not sure if I understand the word “cast” (shape?), but I do not remember Breiman saying that data analysts have to “choose sides.” Thomas’s reply confirmed my impression. Then I had no idea about why the authors implied that Breiman seemed to be promoting the dichotomy. Perhaps I misunderstood the authors, or the authors misunderstood Breiman, or it was just a matter of improper wording. BTW, thanks to Thomas, it was the first time that I had learned about The Two Cultures, which seemed to be very interesting to me.

  23. A few years ago I learned that #daddad happened to be a valid hex code for color and remembered it. This year I used it for the first time when Sharla asked how to show a color when typing a hex code. It seems that my solution was not what she was looking for, but here it is anyway:

    <input onkeyup="document.getElementById('circle').style.background=this.value;"/>
    <span id="circle" style="display:inline-block;border-radius:50%;width:1em;height:1em;"></span>
    

  24. I tend to agree with Colin that most people do not care if things on the Internet are real or fake. The reinforcement of views takes place all the time. I’m not blaming people whose false views are reinforced, but blaming the technology. Our brain simply cannot handle so much information, and we are left little room to verify if anything is real or fake. The verification is too darn hard, even if it is a quick Google search—typing is too hard, compared to scrolling down to browse the next hot tweet.

  25. I feel very sorry for you about the benchmark. This is a complicated matter, and unfortunately I do not fully understand how it happened. Once again, I doubt if I have the sanity to comment on it. Benchmarks can hurt people easily, unless people seriously consider Maximilian Kähler’s reply, i.e., you may be comparing your own best performance against other people’s non-optimal performance. How many benchmark authors would ask for feedback from their competitors before releasing the benchmarks into the public? Perhaps few.

    On a related note, Rob Tibshirani made an honest recommendation at JSM 2018:

    If someone shows you simulations that only show the superiority of their method, you should be suspicious. A good set of simulations will show where the method shines but also where it breaks.

  26. I have had the same experiences as Miles. I know people will ask “what’s the package?” Unfortunately, I cannot tell. It would be too expensive to reveal.

  27. The year 2020 was extremely traumatic to me. Some things fell apart, and some almost did. My anxiety level was higher than ever. I felt touched upon seeing Amelia’s tweet. It was one of the warmest moments to me. Thanks, Amelia!

    Similarly, thanks, Miles!

  28. One thing I really dislike about Twitter is its branching process in the replies, especially when we talk about anything serious (why discussing anything serious on Twitter is another question). Sometimes I even wonder if this branching process contributes (a lot) to today’s polarization of opinions. By “branching,” I mean replying to replies. This is nothing new, and has long existed in web forums and commenting systems. Twitter’s problem is that it collapses replies, so you cannot see all replies effortlessly—you have to keep clicking on certain replies or “Show replies” in all branches of conversations under the same tweet, as if Twitter deliberately wanted to hide the full picture from readers so people could bark endlessly on their own separate branches, without seeing other branches.

    Disqus users may remember the community’s outrage when Disqus rolled out and forced the awful new feature of “collapsed replies” in March 2019. The feature was finally disabled four months later due to the fierce pushback. The rationale of the new feature was:

    Through this testing, we found that by collapsing long sub-threads of replies, more overall readers choose participate in discussions. Because the collapsed experience exposes more unique viewpoints (more top-level comments), there are more opportunities for readers to both leave their own top-level comments and reply to several different opinions from others.

    I’m not sure if it is fair to summarize it as a single word: engagement (their testing found that collapsing replies increased engagement). When engagement is the goal, it could be very dangerous (it is fine to be a by-product of, say, quality). The article “How Facebook got addicted to spreading misinformation“ mentioned that “models that maximize engagement increase polarization.” In terms of engagement, perhaps Facebook’s weapon is to spread misinformation, and Twitter’s weapon is to force short text for no technical reason (to make it hard to articulate things on this platform when necessary—articulation often requires length) and let people branch out infitnitely under a thread (so it is harder to see what other branches are talking about). Perhaps I have assumed too much malice of these platforms. Anyway, I was glad that Disqus reverted this awful feature.

    Lastly, let me give a recent example to show the problem of this branching process, with my apologies to Jeroen in advance (you are still one of my favorite hackers). This example is meant to show a common problem of the Twitter platform, instead of Twitter users. Basically, Jeroen asked if there was a way to configure tinytex to automatically install missing packages. I did not see the tweet until 19 hours later. By that time, four people had reacted on three branches like “Wait a second… Isn’t that what tinytex already does?” Only one of the branches revealed the actual question he was asking. After I understood it, the paranoid me also explained it to people on other branches. Since the question was hard to answer on Twitter, I asked Jeroen to file a Github issue, which he did. If there were other people who replied with surprise in the Github issue, their surprise and confusion could be cleared with a single reply.

    That was just a normal example of asking a technical question on Twitter. There was nothing controversial or emotional. I cannot imagine how I could possible participate in discussions on social media that involve opinions. I cannot image the chaos. There was once a time when people would write “letters to editor” to express opinions or debate on controversial topics. Now every controversial topic could attract millions of “(re)tweets to OP.” My brain is too small to deal with them. I’m at a loss.

    It is quick and easy to send out a tweet. What is the price that we pay for this speed?

  29. A couple of months ago, I learned about the goal of the like button on social media in the film “The Social Dilemma.” To my surprise, the original goal was to show and feel love. I’d feel touched by this goal, had I learned about it ten years ago. In my opinion, the biggest design flaw of the like button, in addition to other problems I wrote about in 2018, is that the number of likes is publicly shown. These numbers distort love, because I do not think love should ever be quantified or the quantities should be publicly displayed on social media. With numbers in our hands, we will inevitably start to sort and compare, no matter if they are comparable, which leads to all sorts of problems.

  30. My daily intake of calories is fairly low. I rarely eat anything outside meals (such as snacks, candies, or ice creams), and primarily drink tap water. Normally I do not pay attention to the number of calories on food, but I suddenly noticed a few weeks ago that an ice cream box had labeled the total number of calories in addition to calories per serving. I was surprised because this had been something that I never understood in the US. Why and how did the producer determine the serving size for me? Why don’t they tell me the total number but force me to do the multiplication? The spiteful me thought it was because the number per serving would often be much smaller than the total number. If the consumer were not careful enough to read the label, they might think it is the total number and comfortably consume the whole package. The total number of calories would look too scary.

  31. In 2019, I wrote a Github issue guide to remind users of common problems found in past Github issues, so it would be easier for me to help them. I felt the language might be too strong or unwelcome. I was sorry that I had to be a bad guy like this, but before August 2020 (when I finally managed to hire a new colleague), I had been dealing with most Github issues related to rmarkdown and knitr by myself. There had been thousands of them. It was hard for me to deal with the exhaustion due to writing the reminder of providing reproducible examples, or asking people to ask pure questions in other places, since I could not afford doing customer service in Github issues. I wish to help as much as I could, if only you let me help you more easily.

    I felt a little relieved when I saw a Stack Overflow user said:

    It might not be a proper usage of GH issues, but Yihui tends to be forgiving for this sort of thing.

  32. I did not know what here::here() meant until I read its documentation a long time after I first heard of it. I saw people were extremely excited about it, but I found the term “here” confusing to me, because it is relative. If a path is relative, the key question to ask is, “relative to what?” It turns out that here::here() is relative to a project root. In other words, “here” means the project root. That is why I originally found it confusing, especially in the context of knitr/R Markdown documents, where “here” means the document file by default, i.e., I’m thinking “inside the document” instead of “from the project root.” Partly because of that, I wrote a simple experimental function xfun::from_root(). I guess this function name is less confusing.

    I saw people mentioned that here::here() could help solve the problem of setwd(). It definitely could, but to me, using relative paths can also avoid the need for setwd(), which is a more natural solution in my eyes, and does not assume the existence of a project. As I explained in the R Markdown Cookbook, there is not one single “correct” answer to the question “relative to what?” People just have different mental models as well as strong opinions about relative paths.

    Inevitably, popular tools can be abused in certain cases, perhaps because people think they are just magic. For example, I do not understand the point of passing an absolute path to here::here().

  33. I agree with Ben Hamner:

    It’s crazy how much our universities focus the next generation on test results, course completions, and degrees. I wish they empowered students to create and build. The transcript they should be aiming for is “here’s the ten best things we created during our time here”

    My own transcripts when I was a student often looked decent, but I did not feel particularly proud of them. Instead I did feel proud of things that I built and created.

  34. Comparing this tweet with its author’s comments under this blog post, once again I do not understand why people start discussions on Twitter. Twitter’s arbitrary limit of characters is a perfect means to discourage clarity. All lose. Twitter wins.

  35. Karl said “I don’t understand this talk at all.” I was relieved to hear that. I thought I was the only dumb person at JSM. Seriously, there have been talks of which I could not understand a single sentence other than “good morning” in the beginning and “thank you” at the end. Perhaps they were too specialized to me, e.g., I know little about biology and pharma. Even the talks are about statistics, I could not understand, either (perhaps because of too much math). Sometimes I felt the speakers were too nervous or inexperienced in giving talks, so they just kept reading their slides out loud word by word (that is probably what Max meant).

  36. There have been people who argue to death about Tidyverse vs base R. It always feels ironic to me that R is a programming language born for statistics and data analysis, but few people would respond to the debate with randomized controlled experiments and data. Most people tend to argue from their personal feelings. The paper “A Randomized Controlled Trial on the Wild Wild West of Scientific Computing with Student Learners” was a rare exception, and did not conclude which was better in terms of task completion time (base, tidy, or tilde).

    That said, even with randomized experiments, I doubt if we could get a clear answer, because the question is too broad to answer from a few experiments.

  37. I wrote about it before, but I feel downvoting without explanation is still the worst problem of Stack Overflow. I really dislike people who downvote and run. It hurts (here is one of the many examples).

  38. I learned for the first time on my 14 years’ R journey that we could cut() dates in base R (see ?cut.Date). There are still too many things that I do not know in base R. This post “R Base Gems” is a good read.

  39. I found it hard to believe that an Editor in Chief who seems to be influential (> 50K followers) would publicly ask R users to switch to Python. That was going way too far, and almost equivalent to saying R users are simply insane and sanity only exists in Python. The next day he apologized. I found two things interesting. One was that 475 people liked the offense (also retweeted and quoted 59 times), and by comparison, 86 people liked the apology (retweeted 5 times). Tell me why we should not delete Twitter like Facebook. The other thing was 8 months later, the same person tweeted (with 526 likes):

    is R really this hard to use? what am I doing wrong?

    To me, this would be a completely normal complaint about a language, had the previous offense not existed. Well, that is not a big deal. I’m fine with it. The big deal is, as I have mentioned several times before in this post, the terseness of tweets. “What am I doing wrong” is a great question, but no one would be able to answer this question in the tweet, unless you tell us exactly what you did. In other words, if we do not even know what you did, how can we possibly tell you what you did wrong? The terseness is great to set up people for language wars (one side would respond tersely “R is easy” and the other side would insist that “R is hard”).

  40. I was shocked by Jeremy Howard’s post “I violated a code of conduct” last year. A few things that he mentioned in the post resonated with me. For example, unilateral conclusions and decisions suck:

    I asked why they didn’t take a statement from me before that finding, and they said “we all watched the video, so we could see for ourselves the violation”.

    I also understand this feeling of his too well:

    One the call, I was surprised to find myself facing four people. The previous call had been with just one, and suddenly being so greatly outnumbered made me feel very intimidated.