Pandoc
Notes related to Pandoc.
It can be useful to generate a Pandoc identifier outside Pandoc. For example, I do this in the static site generator for this site. I use it to link each tag to the list of pages that have the tag on the tag page.
While the algorithm is documented in prose, I have not found official pseudocode for it. I looked up the original Haskell code to make sure everything was correct.
Here is the algorithm in Python. This version supports Unicode.
import re


def pandoc_id(s: str) -> str:
    s = s.lower()
    # The character class `\w` includes underscores.
    s = re.sub(r"[^\s\w.-]", "", s)
    s = re.sub(r"\s+", "-", s)
    s = re.sub(r"^[\d\W_]+", "", s)
    return s or "section"
Note that this is the default identifier algorithm in Pandoc. GitHub Flavored Markdown, which Pandoc also supports, uses a different algorithm.
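As a quick sanity check, the Python function can be run against heading texts like the examples in the Pandoc manual. This is a sketch, not a full test suite: the function is repeated so the snippet runs on its own, the `examples` dictionary is my own name, and the inputs are assumed to be plain text with formatting and links already removed.

```python
import re


def pandoc_id(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^\s\w.-]", "", s)
    s = re.sub(r"\s+", "-", s)
    s = re.sub(r"^[\d\W_]+", "", s)
    return s or "section"


# Heading text (markup already stripped) mapped to the expected identifier.
examples = {
    "Heading identifiers in HTML": "heading-identifiers-in-html",
    "Maître d'hôtel": "maître-dhôtel",
    "Dogs?--in my house?": "dogs--in-my-house",
    "HTML, S5, or RTF?": "html-s5-or-rtf",
    "3. Applications": "applications",
    "33": "section",
}

for heading, expected in examples.items():
    assert pandoc_id(heading) == expected, (heading, pandoc_id(heading))
```

The last two cases exercise the rule that everything before the first letter is dropped, with "section" as the fallback when nothing is left.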
If a generated identifier is the same as one that already exists in the document, append - and n to the new identifier, where n is an integer starting with 1.
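The deduplication rule can be sketched in Python as follows. The helper name `unique_id` and the `used` set are hypothetical, introduced only for illustration. The sketch assumes that n is the smallest positive integer that yields an identifier not already in use.

```python
def unique_id(base: str, used: set[str]) -> str:
    """Return base, or base-n for the smallest n >= 1 not yet in used.

    The chosen identifier is added to `used` so later calls see it.
    """
    candidate = base
    n = 0
    while candidate in used:
        n += 1
        candidate = f"{base}-{n}"
    used.add(candidate)
    return candidate


used: set[str] = set()
ids = [unique_id(base, used) for base in ["intro", "intro", "usage", "intro"]]
print(ids)  # ['intro', 'intro-1', 'usage', 'intro-2']
```

Keeping the identifiers seen so far in a set means a heading whose own text produces something like intro-1 is also taken into account when later duplicates are numbered.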
This implementation uses POSIX shell and POSIX Basic Regular Expressions (BRE). It reads the text from standard input and works correctly only on ASCII text.
tr '[:upper:]' '[:lower:]' \
| sed '
s/[^[:space:]a-z0-9._-]//g;
s/[[:space:]][[:space:]]*/-/g;
s/^[^a-z]*//;
s/^$/section/
'
The regular expressions do not use +, because + isn’t part of BRE.