Pandoc

Notes related to Pandoc.

Generating Pandoc ids without Pandoc

It can be useful to generate a Pandoc identifier outside Pandoc. For example, I do this in the static site generator for this site. I use it to link each tag to its list of pages on the tag page.

While the algorithm is documented in prose, I have not found official pseudocode for it. I looked up the original Haskell code to make sure everything was correct.

First, an implementation in POSIX shell with POSIX Basic Regular Expressions (BRE). It reads the text from standard input and works correctly only on ASCII text.

tr '[:upper:]' '[:lower:]' \
| sed '
    s/[^[:space:]a-z0-9._-]//g;
    s/[[:space:]][[:space:]]*/-/g;
    s/^[^a-z]*//;
    s/^$/section/
'

The regular expressions do not use +, because + isn’t part of BRE.

Now, the same algorithm in Python. This version supports Unicode.

import re


def pandoc_id(s: str) -> str:
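    # Convert to lowercase.
    s = s.lower()
    # Drop everything except whitespace, word characters, periods, and
    # hyphens. The character class `\w` includes underscores.
    s = re.sub(r"[^\s\w.-]", "", s)
    # Replace each run of whitespace with a single hyphen.
    s = re.sub(r"\s+", "-", s)
    # Strip leading digits and punctuation so the result starts with a letter.
    s = re.sub(r"^[\d\W_]+", "", s)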
    return s if s else "section"
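
As a sanity check, the sketch below runs the function against the heading examples from the Pandoc manual's section on automatic identifiers. It repeats the function so that the snippet runs on its own. The inputs are plain text: the sketch assumes that formatting, links, and footnotes have already been stripped, which Pandoc does before it applies the steps above.

```python
import re


def pandoc_id(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^\s\w.-]", "", s)
    s = re.sub(r"\s+", "-", s)
    s = re.sub(r"^[\d\W_]+", "", s)
    return s if s else "section"


# Heading text (formatting already removed) and the expected identifier.
EXAMPLES = [
    ("Heading identifiers in HTML", "heading-identifiers-in-html"),
    ("Maître d'hôtel", "maître-dhôtel"),
    ("Dogs?--in my house?", "dogs--in-my-house"),
    ("HTML, S5, or RTF?", "html-s5-or-rtf"),
    ("3. Applications", "applications"),
    ("33", "section"),
]

for heading, expected in EXAMPLES:
    # Fail loudly, showing the heading and the identifier actually produced.
    assert pandoc_id(heading) == expected, (heading, pandoc_id(heading))
```

The last two cases exercise the final steps: a heading that starts with a number loses its numeric prefix, and a heading with no letters at all falls back to "section".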

Note that this is the default identifier algorithm in Pandoc. GitHub Flavored Markdown, which Pandoc also supports, uses a different algorithm.

If a generated identifier is the same as one that already exists in the document, append - and n to the new identifier, where n is an integer that starts at 1 and increases until the result is unique.
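
The deduplication rule can be sketched in Python as follows. This is a minimal illustration, not Pandoc's own code: the function name `unique_id` and the `seen` set are hypothetical, and the caller is assumed to keep one set of used identifiers per document.

```python
def unique_id(base: str, seen: set[str]) -> str:
    """Return `base`, or `base-n` for the smallest n >= 1 that is unused.

    `seen` holds the identifiers already used in the document and is
    updated in place. Both names are made up for this example.
    """
    candidate = base
    n = 0
    while candidate in seen:
        n += 1
        candidate = f"{base}-{n}"
    seen.add(candidate)
    return candidate


seen: set[str] = set()
ids = [unique_id("intro", seen) for _ in range(3)]
print(ids)  # ['intro', 'intro-1', 'intro-2']
```

Keeping the used identifiers in a set also handles the case where a suffixed identifier such as intro-1 is already taken by a heading whose own text produced it.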

License

The code on this page is distributed under the 0BSD license. It does not require attribution.

The text of the license follows.

Copyright (c) 2023 D. Bohdan

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.