Pandoc

Notes related to Pandoc.

It can be useful to generate a Pandoc identifier outside Pandoc. For example, I do this in the static site generator for this site. I use it to link each tag to its list of pages on the tag page.

While the algorithm is documented in prose, I have not found official pseudocode for it. I looked up the original Haskell code to make sure everything was correct.

Here is the algorithm in Python. This version supports Unicode.

import re


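def pandoc_id(s: str) -> str:
    """Return the default Pandoc identifier for the plain text of a heading."""
    s = s.lower()
    # Keep only whitespace, word characters, periods, and hyphens.
    # The character class `\w` includes underscores.
    s = re.sub(r"[^\s\w.-]", "", s)
    # Replace each run of whitespace with a single hyphen.
    s = re.sub(r"\s+", "-", s)
    # Remove everything up to the first letter.
    s = re.sub(r"^[\d\W_]+", "", s)
    # Fall back to "section" if nothing is left.
    return s or "section"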
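
As a quick sanity check, the function can be run against the example headings from the Pandoc manual. This is only a sketch: the function is repeated so that the snippet runs on its own, the name EXAMPLES is mine, and I stripped the Markdown formatting and link brackets from the headings by hand, because the function expects plain text.

```python
import re


def pandoc_id(s: str) -> str:
    # Same steps as the function above, repeated so this snippet is standalone.
    s = s.lower()
    s = re.sub(r"[^\s\w.-]", "", s)
    s = re.sub(r"\s+", "-", s)
    s = re.sub(r"^[\d\W_]+", "", s)
    return s or "section"


# Heading text (formatting already removed) paired with the identifier
# that the Pandoc manual lists for it.
EXAMPLES = {
    "Heading identifiers in HTML": "heading-identifiers-in-html",
    "Maître d'hôtel": "maître-dhôtel",
    "Dogs?--in my house?": "dogs--in-my-house",
    "HTML, S5, or RTF?": "html-s5-or-rtf",
    "3. Applications": "applications",
    "33": "section",
}

for heading, expected in EXAMPLES.items():
    # Fail loudly, showing the heading and the identifier actually produced.
    assert pandoc_id(heading) == expected, (heading, pandoc_id(heading))
```

If every assertion passes, the script prints nothing.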

Note that this is Pandoc's default identifier algorithm. GitHub Flavored Markdown, which Pandoc also supports, uses a different algorithm.

If a generated identifier is the same as one that already exists in the document, append -n to the new identifier, where n is an integer that starts at 1 and increases with each further duplicate.
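
The duplicate rule can be sketched in Python as follows. This is a minimal illustration, not Pandoc's code. The name unique_id and the used set are my own, and I assume the identifiers are processed in document order.

```python
def unique_id(base: str, used: set[str]) -> str:
    """Return base, or base-n for the smallest n >= 1 not already in used.

    The chosen identifier is added to used, so that later calls see it.
    """
    candidate = base
    n = 0
    while candidate in used:
        n += 1
        candidate = f"{base}-{n}"
    used.add(candidate)
    return candidate


# Identifiers in document order; "intro" appears three times.
used: set[str] = set()
ids = [unique_id(i, used) for i in ["intro", "intro", "usage", "intro"]]
print(ids)  # ['intro', 'intro-1', 'usage', 'intro-2']
```

Keeping the set of used identifiers outside the function lets the caller seed it with identifiers that were assigned explicitly.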

This is an implementation in POSIX shell using POSIX Basic Regular Expressions (BRE). It reads heading text from standard input, one heading per line. It works correctly only on ASCII text.

tr '[:upper:]' '[:lower:]' \
| sed '
    s/[^[:space:]a-z0-9._-]//g;
    s/[[:space:]][[:space:]]*/-/g;
    s/^[^a-z]*//;
    s/^$/section/
'

The regular expressions do not use +, because + isn’t part of BRE. Instead, [[:space:]][[:space:]]* matches one or more whitespace characters.