# Pandoc

Notes related to Pandoc.

## Generating Pandoc ids without Pandoc {#pandoc-ids}

It can be useful to generate a
[Pandoc identifier](https://pandoc.org/MANUAL.html#extension-auto_identifiers)
outside Pandoc.
For example,
I do this in the static site generator for this site.
I use it to link each
[tag](#bottom)
to the list of pages that have the tag on the
[tag page](/tags).

While the algorithm is documented in prose,
I have not found official pseudocode for it.
I looked up the
[original Haskell code](https://github.com/jgm/pandoc/blob/e0e60871458329679c81cf8c589199d4b52922f8/src/Text/Pandoc/Shared.hs#L447)
to make sure everything was correct.

### Python

This is the algorithm in Python.
This version supports Unicode,
because `\w`, `\s`, and `\d` match Unicode characters in Python 3 `str` patterns.

```python
import re


def pandoc_id(s: str) -> str:
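    # Convert to lowercase.
    s = s.lower()
    # Remove everything except whitespace, alphanumerics, underscores,
    # periods, and hyphens.
    # The character class `\w` includes underscores.
    s = re.sub(r"[^\s\w.-]", "", s)
    # Replace each run of whitespace with a single hyphen.
    s = re.sub(r"\s+", "-", s)
    # Drop leading digits, non-word characters, and underscores.
    s = re.sub(r"^[\d\W_]+", "", s)
    # Fall back to "section" if nothing is left.
    return s or "section"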
```
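
As a quick check,
the sketch below runs the function on a few headings from the `auto_identifiers` section of the Pandoc manual.
The function is repeated so the snippet runs on its own.
The name `EXAMPLES` is made up for this illustration,
and the input is assumed to be plain text with formatting, links, and footnotes already removed.

```python
import re


def pandoc_id(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^\s\w.-]", "", s)
    s = re.sub(r"\s+", "-", s)
    s = re.sub(r"^[\d\W_]+", "", s)
    return s or "section"


# Heading text (formatting already stripped) mapped to the expected identifier.
# `EXAMPLES` is a hypothetical name used only for this check.
EXAMPLES = {
    "Heading identifiers in HTML": "heading-identifiers-in-html",
    "Maître d'hôtel": "maître-dhôtel",
    "Dogs?--in my house?": "dogs--in-my-house",
    "HTML, S5, or RTF?": "html-s5-or-rtf",
    "3. Applications": "applications",
    "33": "section",
}

for heading, expected in EXAMPLES.items():
    assert pandoc_id(heading) == expected, (heading, pandoc_id(heading))
```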

Note that this is the default identifier algorithm in Pandoc.
GitHub Flavored Markdown,
which Pandoc also supports,
uses a different algorithm.

If a generated identifier is the same as one that already exists in the document,
append `-` and *n* to the new identifier,
where *n* is an integer starting with 1.
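
The rule above can be sketched in Python as follows.
This is a minimal illustration, not Pandoc's own code:
the names `unique_id` and `used` are hypothetical,
and the sketch assumes the smallest unused *n* is chosen.

```python
from __future__ import annotations


def unique_id(base: str, used: set[str]) -> str:
    """Return `base`, or `base-n` with the smallest n >= 1 not in `used`.

    The chosen identifier is added to `used` (hypothetical helper).
    """
    candidate = base
    n = 0
    while candidate in used:
        n += 1
        candidate = f"{base}-{n}"
    used.add(candidate)
    return candidate


used: set[str] = set()
ids = [unique_id("intro", used) for _ in range(3)]
assert ids == ["intro", "intro-1", "intro-2"]
```

Keeping the identifiers seen so far in a set makes each lookup cheap,
and the caller passes the same set for every heading in a document.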

### POSIX shell and BRE {#posix-shell}

This is a POSIX shell and
[POSIX Basic Regular Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions)
implementation.
It reads the text from standard input
and works correctly only on ASCII text.

```shell
tr '[:upper:]' '[:lower:]' \
| sed '
    s/[^[:space:]a-z0-9._-]//g;
    s/[[:space:]][[:space:]]*/-/g;
    s/^[^a-z]*//;
    s/^$/section/
'
```

The regular expressions do not use `+`,
because `+` isn't part of BRE.
Instead, `[[:space:]][[:space:]]*` stands in for `[[:space:]]+`.
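
The workaround for the missing `+` is to write "one, then zero or more".
The equivalence can be checked in Python.
Note that Python's `re` module is not a BRE engine;
this sketch only illustrates that the two patterns match the same runs of whitespace.
The sample string is made up.

```python
import re

# Made-up sample with mixed runs of whitespace.
text = "a  \t b\nc"

# One or more whitespace characters, written with `+`.
with_plus = re.sub(r"\s+", "-", text)
# The same thing without `+`: one whitespace character, then zero or more.
without_plus = re.sub(r"\s\s*", "-", text)

assert with_plus == without_plus == "a-b-c"
```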

## Page metadata

URL: <https://dbohdan.com/pandoc.md>

Published 2023-08-26, updated 2025-01-14.

Tags:

- algorithm
- POSIX shell
- programming
- Python
- shell