Like Rob, I have long wished to ditch Disqus, which I have used for more than a decade. I was aware of Maëlle Salmon’s work in 2019 (and Nan Xiao’s in 2020), but hesitated at that time. Unlike Maëlle, I found removing Disqus both emotionally and technically hard because I had more than 10K comments there, which held a lot of memories of the old days!
The problems that I had with migrating Disqus comments
In September this year, I saw Rob’s post on migrating from Disqus to Giscus. It seemed that Mitch O’Hara-Wild had fully automated the job of extracting Disqus comments and re-posting them to GitHub Discussions so that we could use Giscus, which sounded awesome to me, but again, the number of Disqus comments on my site was huge. I knew GitHub had some rate limits, and I had no idea how long it would take to post the 10K comments. In addition, I was still hesitating over two other things:
1. I didn’t like the fact that all guest comments must be posted using my own GitHub account, even though there is a header line in each GitHub comment that says “this comment was originally posted by […] on […]”. If the readers don’t pay close attention, they may feel that I’ve been talking to myself on my own site for 17 years :) More importantly, the original authors will no longer be notified if someone replies to their comments. There isn’t an easy way for them to view their comment history, either (Disqus allows you to view all of your past comments on a single page).
2. I also felt a little sad that the original timestamp of a comment couldn’t be preserved but could only be written in a header note. Readers may see a comment which GitHub says was posted a week ago but was actually posted ten years ago.
Neither problem has a real solution. You can’t post a comment on another person’s
behalf on GitHub, nor can you modify the timestamp of a comment. However, I
did come up with a way to remedy #1: I registered a GitHub account @giscus-bot
to post guest comments, and used my personal account to post my own comments.
Then the whole comment thread looks like this (a real
example):
@giscus-bot 3 days ago
Guest *John Doe* @ 2012-03-02 08:38:03 originally posted:
Hi Yihui!
---------------------------------------------------------
@yihui 2 days ago
Hi John!
> Originally posted on 2012-03-02 20:28:48
To further remedy #1, I added @username mentions to the comments from friends
whose GitHub usernames and Disqus names I both knew. This means when
someone replies to their comments in the future, they will get notified.
However, this also means that during the migration to GitHub Discussions, these
friends could get tons of GitHub notifications, depending on how many comments
they have left on my site before. I told some of them that the closer our
friendship is, the more “spam emails” you will get from me this time.
Afterwards, among the people whom I didn’t notify in advance, two told me they
were surprised by the sudden flood of tens of email notifications, and another
told me he got a few thousand… My apologies! I hope our friendship has stood
up well to this pressure test.
I also took this chance to clean up some Disqus comments programmatically. For
example, Chinese readers love using ~~ to express cuteness, but that coincides
with the syntax for strikeout in GitHub’s Markdown. I substituted them with
the full-width ～～. Another example is that Disqus shortens long bare links,
and I expanded them back.
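The tilde substitution can be sketched in base R as follows. This is only a minimal illustration, assuming that runs of two or more ASCII tildes should be widened while a lone tilde is left alone; the helper name widen_tildes is hypothetical and not part of Rob’s script.

```r
# hypothetical helper: widen every tilde that has a tilde neighbor, so that
# GitHub no longer reads "~~text~~" as strikeout; a lone "~" is untouched
widen_tildes = function(x) {
  gsub('~(?=~)|(?<=~)~', '\uFF5E', x, perl = TRUE)
}

widen_tildes('Thanks~~ see you in ~5 minutes')
```

The lookahead and lookbehind make the pattern match a tilde only when another tilde sits right next to it, so approximations such as ~5 survive the cleanup.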
An Utterances problem
I also had some Utterances comments. Migrating them was much easier since GitHub allows us to convert GitHub Issues (on which Utterances is based) to GitHub Discussions. However, I had another complication on my site: I enabled both Utterances and Disqus, and got comments from both systems for the same posts. I had to merge these comments. Rob’s script was only for creating new GitHub discussions, so I modified it to check if a discussion exists (migrated from a GitHub issue) and post Disqus comments to existing discussions if possible.
To post to an existing discussion, you need to know its ID. Here is how I
obtained a data frame of all existing discussions (you need to set up a GitHub
token first, e.g., in the environment variable GITHUB_PAT):
get_discussions = function(owner, repo) {
  has_next = TRUE
  next_cursor = NULL
  info = NULL
  while (has_next) {
    # build the pagination argument; it is empty for the first page
    next_cursor = if (is.null(next_cursor)) '' else {
      paste0(', after: "', next_cursor, '"')
    }
    query = gh::gh_gql(sprintf('query FindRepo {
      repository(owner: "%s", name: "%s") {
        discussions(first: 100%s) {
          pageInfo {
            hasNextPage
            endCursor
          }
          edges {
            node {
              title
              body
              id
            }
          }
        }
      }
    }', owner, repo, next_cursor
    ))
    res = query$data$repository$discussions
    has_next = res$pageInfo$hasNextPage
    next_cursor = res$pageInfo$endCursor
    # one character vector (title, body, id) per discussion
    info = c(info, lapply(res$edges, function(x) unlist(x$node)))
  }
  # bind the vectors row-wise into a data frame
  info = do.call(rbind, info)
  as.data.frame(info)
}
# fetch all discussions from the repo yihui/yihui.org
discussions = get_discussions('yihui', 'yihui.org')
In Rob’s script, I added a check for whether a discussion already exists, so that a new discussion is created only when there is none.
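Such a check could look roughly like the sketch below. It assumes that each discussion title is exactly the URL pathname of the page and that the data frame has the title and id columns returned by get_discussions(); the helper name find_discussion_id, the pathnames, and the IDs are all made up for illustration.

```r
# hypothetical helper: return the ID of the discussion whose title matches
# exactly, or NA if there is none (in which case a new one must be created)
find_discussion_id = function(discussions, title) {
  i = match(title, discussions$title)
  if (is.na(i)) NA_character_ else as.character(discussions$id[i])
}

# a toy data frame in the shape returned by get_discussions() (made-up values)
discussions = data.frame(
  title = c('/en/2021/01/a-post/', '/cn/2020/05/another-post/'),
  id = c('D_aaa', 'D_bbb'),
  stringsAsFactors = FALSE
)

find_discussion_id(discussions, '/en/2021/01/a-post/')    # an existing ID
find_discussion_id(discussions, '/en/2022/12/new-post/')  # NA: create a new one
```

An exact match() is used instead of grep() on purpose, so that one pathname being a substring of another cannot send comments to the wrong discussion.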
Deal with character escaping in GitHub GraphQL
The core technique for posting comments to GitHub Discussions is GraphQL. I was not familiar with it before but it looked straightforward to learn (at least for some simple tasks like querying or updating discussions). Here is an example of querying your rate limit:
gh::gh_gql('query {
  viewer {
    login
  }
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
}')
I hit an error a few times in the beginning, and realized that some special
characters needed to be properly escaped. To avoid the backslash madness (i.e.,
thinking about how many backslashes I really need in gsub('"', '\\\\"', x),
which hurts a lot), I found it much easier to just use jsonlite, e.g.,
str_json = function(x) {
  jsonlite::toJSON(x, auto_unbox = TRUE)
}
str_json('A title containing "double quotes"')
Then you pass the result to gh::gh_gql(). It’s guaranteed to be valid GraphQL
syntax. For example, if you want to update the title of a discussion:
gh::gh_gql(sprintf('mutation {
  updateDiscussion(input: {discussionId: "%s", title: %s}) {
    discussion {
      id
    }
  }
}', id, str_json(title)))
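The same escaping trick applies when posting a comment to an existing discussion. Below is a minimal sketch that only builds the query string, using the addDiscussionComment mutation of the GitHub GraphQL API; the helper name comment_mutation and the ID D_xxxxxx are placeholders, and the actual call is commented out because it needs a token and a real discussion.

```r
# hypothetical helper: build the GraphQL mutation that adds a comment to an
# existing discussion; both values are escaped as JSON strings via jsonlite
comment_mutation = function(id, body) {
  str_json = function(x) jsonlite::toJSON(x, auto_unbox = TRUE)
  sprintf('mutation {
  addDiscussionComment(input: {discussionId: %s, body: %s}) {
    comment {
      id
    }
  }
}', str_json(id), str_json(body))
}

# a body with double quotes and a line break, both of which need escaping
query = comment_mutation('D_xxxxxx', 'He said "hi"\nand left.')
cat(query)
# gh::gh_gql(query)  # uncomment to actually post (requires a GitHub token)
```

Printing the query with cat() before sending it is a cheap way to spot escaping problems without spending any of the rate limit.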
Giscus’s strict matching
One thing about Utterances that bothered me a lot was the fuzzy matching. I sent a
pull request early last year, but it seemed to be ignored. The problem can lead
to comments being loaded under the wrong page. Giscus, as a successor of
Utterances, provides a very clever option to solve this problem:
data-strict="1" (great job, Sage Abdullah!).
GitHub doesn’t provide strict matching in searching discussions, but Giscus’s
clever method has made it possible. In short, when using the strict method,
Giscus searches for a hash of the search term instead of the term itself.
This has solved my problem well: I prefer using the pathname of the page URL as
the search term, but the match can be quite fuzzy if you search for the pathname
directly. Searching for the SHA-1 hash of the pathname gives much more
accurate results, which almost guarantees a one-to-one mapping between a web page
and a GitHub discussion. No more fuzziness.
To enable data-strict, I had to append the SHA-1 hashes of the URL pathnames
to GitHub discussions when creating them. The hash can be computed via
digest::digest():
sha1 = function(x) {
  digest::digest(x, 'sha1', serialize = FALSE)
}
sprintf('<!-- sha1: %s -->', sha1(pathname))
Be sure to avoid HTML escaping here, as that would escape the <!-- --> comment.
If you use whisker::whisker.render() in Rob’s R script, use {{{ }}} instead
of {{ }}.
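The difference between the two kinds of braces is easy to see in a tiny sketch. The hash below is a placeholder rather than a real SHA-1 value, and the variable name note is made up for this example.

```r
# placeholder note; in practice the hash comes from sha1(pathname)
data = list(note = '<!-- sha1: 0123abcd -->')

# double braces HTML-escape the value, which turns the comment into visible text
escaped = whisker::whisker.render('{{note}}', data)
# triple braces insert the value verbatim, keeping it a real HTML comment
unescaped = whisker::whisker.render('{{{note}}}', data)

escaped
unescaped
```

Only the unescaped version stays invisible in the rendered discussion while remaining searchable by its hash.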
Other little things
I had a lot of comments written in Chinese. To re-post them to GitHub, I added a header note in Chinese, and the test for (common) Chinese characters I used was:
has_chinese = function(x) {
  length(grep('[\u4E00-\u9FFF]', x)) > 0
}
To post with a different GitHub account (e.g., giscus-bot in my case), you can
use the .token argument of gh::gh_gql(), e.g.,
gh::gh_gql(..., .token = if (guest) 'ghp_xxxxxx')
When converting Disqus’s HTML comments to Markdown via Pandoc, I strongly
recommend using the gfm output format with the option --wrap=none.
rmarkdown::pandoc_convert(
  input = msg_html, from = "html",
  output = msg_md <- tempfile(fileext = ".md"),
  to = "gfm", options = '--wrap=none'
)
In Rob’s script, to = "markdown" is not a great choice (e.g., it results in a
lot of unnecessary escaping), and gfm is a much more natural choice for
GitHub. The option --wrap=none is also critical, because Pandoc will hard-wrap
long lines by default. Unfortunately, GitHub treats line breaks in Markdown as
hard breaks (i.e., <br/>). Without the --wrap=none option, you may see a lot
of unexpected line breaks in the comments.
The for loop
Mitch used a for loop to post discussions one by one. The for loop seems to
have a bad reputation in the R community (e.g., for loops are ugly and slow),
but I think sometimes it’s just rumor or misunderstanding. I’ll save this for
another post in the future. Here I think the for loop is indispensable and
actually extremely valuable. Why? Because you never know what kind of error you
will run into in the loop (GraphQL syntax errors, Internet problems, and so on).
If you do run into an unexpected error, it’s quite simple to resume the for
loop: you just start with the current step index instead of from the beginning
(usually 1).
A quick and silly example to illustrate this:
# take the square root of each element of a list
elements = list(1, 2, 3, '4', 5, 6)
roots = numeric(length(elements))
for (i in seq_along(elements)) {
  roots[i] = sqrt(elements[[i]])
}
Suppose that you didn’t know there was a character value in the data. When you
run the loop, you will hit an error. Don’t panic. Check the value of i and fix
the problem. After you know i is currently 4, you restart the loop from
4 instead of 1:
# convert character to numeric
elements[[i]] = as.numeric(elements[[i]])
# resume the loop
for (i in 4:length(elements)) {
  roots[i] = sqrt(elements[[i]])
}
This means you don’t need to repeat the computation for i = 1, 2, 3. When the
computation is relatively expensive, this can be a big time-saver. I don’t
remember how many times I have done this during the migration of the comments.
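If you would rather not edit the loop bounds by hand each time, one possible variation is to record the progress in a variable and let the loop skip whatever has been done. This is only a sketch of the idea, not the way the migration was actually run; the variable name done is made up, and try() is used here merely so that the whole example can run non-interactively.

```r
elements = list(1, 2, 3, '4', 5, 6)
roots = numeric(length(elements))
done = 0  # index of the last element processed successfully

# first attempt: stops with an error at the 4th element, leaving done = 3
try(for (i in seq_len(length(elements) - done) + done) {
  roots[i] = sqrt(elements[[i]])
  done = i
}, silent = TRUE)

# fix the offending element, then run the very same loop again;
# it picks up at done + 1 and never repeats the first three steps
elements[[done + 1]] = as.numeric(elements[[done + 1]])
for (i in seq_len(length(elements) - done) + done) {
  roots[i] = sqrt(elements[[i]])
  done = i
}
roots
```

Because the loop always starts after done, rerunning it any number of times is harmless: once everything has been processed, the sequence is empty and nothing happens.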
The while loop with browser()
I can’t praise Mitch enough for this:
while (!is.null((out <- gh::gh_gql(query))$errors)) {
  if (out$errors[[1]]$message == "was submitted too quickly") {
    Sys.sleep(60)
  } else {
    # Unknown error, debug interactively
    browser()
  }
}
It also gave me a chance to inspect the error and resume the loop after fixing the unexpected problem.
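To see how this retry pattern behaves without touching GitHub at all, here is a small offline sketch. Everything in it is hypothetical: make_fake_post() is a stand-in for gh::gh_gql() that reports the rate-limit error twice before succeeding, post_with_retry() wraps Mitch’s loop in a function, and browser() is replaced with stop() so that the example also works in a non-interactive session.

```r
# hypothetical stand-in for gh::gh_gql(): returns the rate-limit error for
# the first `failures` calls, then a successful response
make_fake_post = function(failures = 2) {
  n = 0
  function(query) {
    n <<- n + 1
    if (n <= failures) {
      list(errors = list(list(message = 'was submitted too quickly')))
    } else {
      list(data = list(ok = TRUE))
    }
  }
}

# keep posting until the response contains no errors
post_with_retry = function(query, post, wait = 60) {
  while (!is.null((out <- post(query))$errors)) {
    msg = out$errors[[1]]$message
    if (msg == 'was submitted too quickly') {
      Sys.sleep(wait)  # rate-limited: wait and try again
    } else {
      stop('Unexpected error: ', msg)  # browser() would go here interactively
    }
  }
  out
}

# wait = 0 only to keep the demonstration fast
res = post_with_retry('placeholder query', make_fake_post(), wait = 0)
str(res)
```

In an interactive migration, keeping browser() as in the original is the better choice, since it lets you inspect out and fix the problem before the loop continues.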
Summary
I’m happy that I can finally say bye to Disqus. It has served me for more than a decade, for which I feel thankful, but I can’t stand its heavy weight and tracking any more.
Special thanks to Rob, Maëlle, and Mitch for the R code, which has saved me
countless hours! I’m sorry that I didn’t share my full script in this post.
Although I believe Rob would grant me permission to modify and publish the code,
my code is just too messy and may be confusing, too. To be honest, it’s not a
single script any more: I have a few Untitled-N* scripts in my RStudio
editor. You won’t want to read them. For most people, I think Rob’s original
script would suffice. My situation is just too complicated.
Knowing that I can manipulate (not for evil, of course) comments
programmatically via GitHub GraphQL was a strong motivation for me to move the
comments over to GitHub. I have already taken advantage of this to batch modify
the discussion titles (to make them clearer instead of only having a pathname
in them).
Finally, I’m particularly grateful to people who have posted comments on my site over these 17 years (you can view all of them on GitHub if interested). I have noticed an obvious decline in the number of comments after social media became popular, but I have already made enough long-time friends. Feel free to sign in to Giscus with your GitHub account to leave comments at the bottom from now on. See you in the next 17 years!