Porting Starlark to pure Python with Claude

I wanted Starlark for Python programs without a Go or Rust dependency. Simon Willison’s work porting MicroQuickJS from C to Python as micro-javascript using Claude Code inspired me to attempt the same kind of port, and I realized I could build directly upon his work. One Claude Opus 4.7 session turned his transcript into a Claude Code-assisted porting guide for me. I used another session to settle a number of questions about the port, then asked it to turn the guide into an autonomous Claude Code prompt.

Opus 4.7 in Claude Code for the web one-shotted the initial port with high conformance in a single hour-long session. Some corner cases remained for later. Now there is starlark-python.

I think Starlark is underused as an advanced configuration language. Since it is a scripting language, Starlark is more flexible and modular than YAML with aliases. Compared to YAML with Jinja templates, as used in programs like Ansible, Starlark is easier to reason about and less error-prone. It is like Python programs using Python itself for configuration, only without the security risk, because Starlark is isolated from the host.
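As a taste, here is a hypothetical config.star of my own invention (not from any real project); functions and comprehensions replace YAML anchors and Jinja templating:

```python
# config.star -- hypothetical example of Starlark as configuration.
# Plain Starlark, which is also syntactically valid Python.
def service(name, replicas=2, port=8080):
    return {"name": name, "replicas": replicas, "port": port}

services = [service("api", replicas=4)] + [
    service("worker-%d" % i, port=9000 + i)
    for i in range(3)
]
```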

The three major implementations of Starlark are the original starlark/java, starlark-go, and starlark-rust. The latter two have Python bindings: python-starlark-go and starlark-pyo3 respectively.

While the bindings were good, I had some issues with them. First, they weren’t suited for zipapps. Cross-platform zipapps are a convenient way to distribute Python programs in a single file without the one-source-file limit and network dependency of a PEP 723 script. With the builders available today, zipapps are only cross-platform if you stick to pure Python. Second, rare platforms. I had used both python-starlark-go and starlark-pyo3 experimentally in different smallish programs since 2023. The bindings worked perfectly fine on x86-64 Linux but became a problem on Linux under a RISC-V computer emulator. Without binary wheels available, the bindings for starlark-go took a few minutes to build and required a Go compiler, which was acceptable but a hassle; I had the same issue on x86-64 FreeBSD. Building the bindings for starlark-rust in qemu-system-riscv64 took an excruciating hour.
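For reference, the standard library’s zipapp module can already produce such a single-file archive from a pure-Python tree. A minimal sketch; the myapp package and its entry point are hypothetical:

```python
# Minimal sketch: build a single-file zipapp with the standard library
# alone ("myapp" is a hypothetical pure-Python package).
import zipapp

zipapp.create_archive(
    "myapp/",                           # directory with the pure-Python code
    target="myapp.pyz",                 # the single distributable file
    interpreter="/usr/bin/env python3", # shebang written into the archive
    main="myapp.cli:main",              # entry point, module:function
)
```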

The issues had me thinking about implementing Starlark or an even smaller safe subset of Python for configuration. The prospect of implementing either from scratch was intimidating. Even a port looked like a lot of work (thousands of translated lines). In late 2025, Simon Willison’s micro-javascript project made me think of a port by an AI agent. In early 2026, equipped with more experience in agentic coding and access to Claude Opus 4.7, I attempted a fully automated port.

Anticipating that I’d want to port software with agents, I decided to turn Simon Willison’s transcript for micro-javascript into a reusable template. I asked Claude to examine the transcript and document what techniques Simon used to steer Claude in translating MicroQuickJS to Python. I also asked it to note anything unusual or surprising. This resulted in a substantial response. My next prompt was for Claude to draft a generic initial prompt template for translating software to Python, plus a short user guide to the subsequent prompts. I asked it to “target a user unfamiliar with Simon’s work”.

After just two user and two AI turns, the result was a porting playbook.

The text of the playbook.

All right. I want to try having Claude Code make the port. I’ve had a different Claude session write a guide based on Simon Willison’s impressive port of MicroJS to pure Python. I would like you to review the guide and write a specialized version for porting starlark-java that I’ll use as a Claude Code prompt. Since you, Opus 4.7, are more capable than the Claude simonw worked with, think how you can adapt the guide for more autonomy. Claude will start from a fork of the Bazel repository and can remove everything unnecessary from it and overwrite history.

This is distilled from one developer’s working session porting a C JavaScript interpreter to Python — about 18 hours of work, 71 commits, a fully functional result. The patterns below aren’t rules; they’re the moves that did the heavy lifting.

Fill in the bracketed parts. Send this as one message.

Clone [SOURCE_REPO_URL] to /tmp

Your goal is to create a [Python port / translation / reimplementation] 
of this [library / tool], [with no dependencies / using only X / 
matching its API / preserving its behavior under Y constraints].

The success criterion is passing all the tests in [PATH_TO_REFERENCE_TESTS].

For the first commit, create a tests/ folder and copy over those tests 
exactly as they are.

Use uv for this project. Start with "uv init --lib" to start a new 
library. Then "uv add pytest --dev" to add pytest.

Add a tests/test_basic.py file with a test asserting 1 + 1 == 2, then 
run "uv run pytest" to confirm the test harness works.

Now design the new Python library. Investigate the reference 
implementation and build a detailed spec in spec.md, complete with 
TODOs covering each phase of the project.

Commit and push once you have written that initial spec.
  • Cloning to /tmp gives Claude a concrete place to read source, run greps, and inspect tests. Without it, Claude has to ask or guess.
  • Stating the success criterion as “pass these tests” turns a vague goal into a measurable one. The reference test suite becomes the spec.
  • Copying tests verbatim before any code means every later commit can be evaluated against an unchanging target.
  • The trivial 1 + 1 == 2 test isn’t filler — it forces Claude to confirm the toolchain works before building on it. Catches uv problems, pytest config issues, etc., in the first 60 seconds.
  • “Build a detailed spec with TODOs” forces an investigation pass. Claude reads the reference implementation thoroughly before writing translation code, which dramatically reduces wandering later.
  • “Commit and push” in the first prompt establishes that this project lives in git from minute one, not at the end.

If there’s no reference test suite, replace the success criterion with something equally concrete:

  • “matching the output of original_tool --help and these example invocations: …”
  • “passing the examples in the README’s usage section”
  • “producing byte-identical output for these input files”

Avoid vague success criteria like “works the same way.” Claude will translate whatever vagueness you give it into vague code.

After the spec lands, the rest of the project runs on a small set of repeating moves.

Now start adding your own basic tests as JS files in tests/basic/*.js. 
Build a pytest parameterized test that runs them. Start with the most 
basic feature needed, build enough of the library to pass it, commit, 
then add another test for the next feature. Keep going.

This produces a clean commit log where each commit equals one passing test equals one new feature. It’s also how you get useful diffs to review.
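A sketch of what that parameterized harness might look like; the interpret function and the port’s name are hypothetical:

```python
# tests/test_basic.py -- hedged sketch of the harness described above.
# Each tests/basic/*.js file becomes one pytest case.
from pathlib import Path
import pytest

BASIC_DIR = Path(__file__).parent / "basic"

@pytest.mark.parametrize(
    "js_file",
    sorted(BASIC_DIR.glob("*.js")),
    ids=lambda p: p.name,
)
def test_basic(js_file):
    from myport import interpret   # hypothetical API of the port
    interpret(js_file.read_text()) # must run without raising
```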

When progress slows, ask Claude to inventory remaining work:

What's left to finish the project (excluding [hard thing])?

Claude will produce a categorized, prioritized list. Then:

Do them all, commit as you go with new passing tests, do TDD.

This pattern — inventory, then bulk execute — is dramatically more productive than asking for features one at a time. Claude can plan a sequence of 30+ commits in a single turn when you let it.

When you hit something architecturally tricky (concurrency, parsing, sandboxing, anything with subtle correctness requirements), don’t ask for a solution. Ask for analysis:

Take a look at how the reference implementation handles [X]. Muse on 
whether we should [approach A or approach B] given our constraints.

The word “muse” is genuinely useful here — it invites a recommendation rather than rushing to code. Claude will produce a deep analysis of the reference code, propose options with tradeoffs, and wait for your call. Then “yes, do it” is enough.

It would be cool to see a bunch of tests that pass both the original 
and our port the same way. Do we have that?

This produces a comparison test suite that runs the same input through both implementations and asserts matching output. It’s the strongest available proof of behavioral parity, and it catches drift that pure spec-conformance tests miss.
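Such a test can be as small as running both interpreters and diffing stdout. A hedged sketch; the command names are hypothetical:

```python
# Hedged sketch of a comparison test: feed the same file to the original
# interpreter and to the port, then assert matching output and status.
import subprocess
import sys

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, timeout=30)

def assert_parity(source_file):
    original = run(["mquickjs", source_file])  # hypothetical reference CLI
    ported = run([sys.executable, "-m", "myport", source_file])
    assert original.stdout == ported.stdout
    assert (original.returncode == 0) == (ported.returncode == 0)
```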

For features that aren’t ready yet, ask Claude to write the tests anyway and mark them xfail (pytest’s “expected failure”):

Make a detailed todo list for the remaining issues, then write 
fine-grained xfail pytest tests for each feature that needs fixing. 
Commit and push those.

Now the test suite is your TODO list. As features land, xfails become xpasses, then graduate to regular tests. The output of uv run pytest becomes a real-time progress dashboard.
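In pytest terms, each TODO item becomes a few lines like these (the feature and API names are hypothetical):

```python
# Hedged sketch of the xfail-as-TODO pattern.
import pytest

@pytest.mark.xfail(reason="generators not implemented yet")
def test_generators():
    from myport import interpret  # hypothetical API of the port
    interpret("function* g(){ yield 1; }; for (const x of g()) {}")
```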

When something seems off, point at the symptom rather than asking Claude to introspect:

  • “I think there may be an infinite loop, run the tests with a timeout and report how long they take.”
  • “Did you actually commit that?”
  • “Show me the summary line from pytest.”

Concrete, measurable, verifiable. Avoids long debugging monologues that go nowhere.

When Claude produces something user-facing (README, public API, config file), don’t accept the first version:

Read the whole [file] yourself.

Claude will catch inconsistencies on a second pass that it missed during generation. Especially useful for documentation, where small contradictions are easy to introduce and hard to spot in diff form.

I started a new claude.ai session to consult Opus 4.7 about which of the three major Starlark implementations to port. At the time, I was torn between the Bazel implementation in Java (OOP like Python) and starlark-go (smaller, and a codebase more familiar to me).

Claude produced a detailed comparison using its background knowledge and the cloned repositories. The comparison had some inaccuracies, such as the lines-of-code counts. It later summarized the comparison in the following table:

|  | starlark/java (Bazel) | starlark-go | starlark-rust |
| --- | --- | --- | --- |
| Architecture | Tree-walker | Bytecode VM | Bytecode VM with optimizations |
| Memory | JVM GC | Go GC | Custom moving GC |
| Integers | int32/int64/BigInteger union | int-as-pointer trick + big.Int | i32 in pointer + bigint fallback |
| Builtin registration | @StarlarkMethod annotations | Hand-written builtins | #[starlark_module] proc-macros |
| Audit you’d do to port | Mechanical | Some translation | Most of the codebase doesn’t apply |

The Go version had a bytecode VM and a compiler, and the Rust version used a custom GC. The simplicity of a tree-walking interpreter combined with the easy translation of OOP code won me over. I chose starlark/java.
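A tree-walker is the simplest interpreter shape: recursive evaluation straight over the AST, with no compilation step. A hedged miniature of the idea, far simpler than the real evaluator:

```python
# Hedged miniature of a tree-walking evaluator: recurse over AST nodes,
# no bytecode, no compiler. Node shapes here are invented for illustration.
def evaluate(node, env):
    kind = node[0]
    if kind == "int":                  # ("int", 42)
        return node[1]
    if kind == "name":                 # ("name", "x")
        return env[node[1]]
    if kind == "binop":                # ("binop", "+", left, right)
        left = evaluate(node[2], env)
        right = evaluate(node[3], env)
        if node[1] == "+":
            return left + right
        raise NotImplementedError(node[1])
    raise NotImplementedError(kind)

# evaluate(("binop", "+", ("int", 1), ("name", "x")), {"x": 41}) == 42
```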

With a gigabyte of .git in a fresh clone and almost 50,000 commits on Bazel’s master, I didn’t want to simply continue on top of Bazel and had git filter-repo in mind. Claude assessed for me the difficulty of extracting Starlark’s commit history from Bazel’s repository. The source had moved twice, and it looked like real effort to extract exactly what related to Starlark, and nothing else, at every point in history. On the other hand, Apache 2.0 didn’t appear to require preserving every committer’s name. We could copy the license file and credit the original code to “The Bazel Authors”, as Bazel itself did. So Claude would start from a fresh branch with just the latest version of starlark/java in the tree.

I gave Claude the porting guide from the previous section and asked it to adapt the guide for the greater autonomy of Opus 4.7. When I asked whether the project needed Simon Willison-style per-feature TDD, Claude’s answer was no: Simon had no oracle for “QuickJS in Python”, but Bazel shipped a corpus of conformance tests. This was the big workflow difference from Simon’s project: almost no custom tests needed to be written, which made the task dramatically easier.

The actual port was driven by one long prompt with 14 development phases and instructions on how to proceed through them. Claude called it the “bootstrap prompt”.

After a few hand edits (removing Claude’s requirement for PyPy and GraalPy compatibility), the prompt was ready to go.

The complete prompt for Claude Code.
You are porting the Java implementation of the Starlark configuration 
language to pure Python. The Java source lives in this repository, a 
fork of bazelbuild/bazel, at:

    src/main/java/net/starlark/java/         (post-2020)
    src/test/java/net/starlark/java/         (post-2020)

Plus historical paths whose history you may want when grepping with 
git log:

    src/main/java/com/google/devtools/build/lib/syntax/         (pre-2020)
    src/main/java/com/google/devtools/build/lib/skylarkinterface/

You may rewrite this fork's history freely — it's a fork. Be 
aggressive about deleting Bazel-specific code that isn't part of the 
core interpreter. Do NOT include starlarkbuildapi, depset, 
ctx.actions, select, BUILD-file glue, or anything Skyframe-related. 
These are Bazel built on top of Starlark, not the language.

## Goal

A pure-Python tree-walking Starlark interpreter that:
- Has zero non-stdlib runtime dependencies
- Imports cleanly under CPython 3.11+
- Runs as a zipapp
- Passes the .star conformance tests bundled with Bazel

Performance is explicitly NOT a goal. Correctness, clarity, and zero 
deps are the goals. Choose simplicity over cleverness at every fork 
in the road.

## Target layout

    /
    ├── reference/            # the Bazel Java source, kept for grepping
    │   └── src/main/java/net/starlark/java/...
    ├── conformance/          # .star test files, copied verbatim
    │   └── *.star
    ├── src/starlark/         # the Python port (the actual deliverable)
    │   ├── __init__.py
    │   ├── syntax/           # mirrors net.starlark.java.syntax
    │   ├── eval/             # mirrors net.starlark.java.eval
    │   └── ...
    ├── tests/
    │   ├── test_conformance.py   # parameterized over conformance/*.star
    │   ├── test_lexer.py / test_parser.py / test_eval.py / ...
    │   └── test_cross_validation.py
    ├── STATUS.md             # your progress journal
    ├── ROADMAP.md            # implementation phases with rationale
    ├── pyproject.toml
    └── README.md

Delete every file in the fork that doesn't belong in this layout. 
After cleanup, the working tree should be a small fraction of the 
original size.

## Architectural decisions (settled — do not relitigate)

1. **Tree-walking interpreter.** Recursive evaluation over the AST, 
   mirroring net.starlark.java.eval.Eval. No bytecode, no compiler.

2. **Integers are Python int.** Drop the Java StarlarkInt union of 
   int32/int64/BigInteger. Python's int is arbitrary precision; that 
   IS the model. There is no overflow.

3. **Strings are Python str, indexed by code point.** This 
   intentionally diverges from Java's UTF-16 indexing for non-BMP 
   characters. Document the divergence in the README. The Starlark 
   spec leaves this implementation-defined.

4. **Mutability is a per-Module token object.** Implement 
   net.starlark.java.eval.Mutability as a Python class with a 
   `frozen: bool` flag. Every mutable value (list, dict, struct) 
   holds a reference to its Mutability. Mutating methods check 
   the flag and raise EvalError when frozen. `Module.freeze()` is O(1).

5. **Builtin registration is a decorator, not an annotation.** Define 
   `@starlark_method(name=..., parameters=[...])` that stores metadata 
   on the function object. The dispatcher reads it. Use inspect.signature 
   for parameter metadata where it suffices.

6. **EvalException is a Python exception.** Subclass Exception. Carry 
   a list of (location, function-name) frames. Raise from eval; catch 
   at the API boundary.

7. **The `assert` test-support module is part of the port.** The 
   conformance .star files load it via `load("assert.star", "asserts")`. 
   Look at starlark-go's starlarktest/assert.star and Bazel's 
   StarlarkTestFn for the surface area.

If you want to deviate from one of these: stop, write a 3-paragraph 
analysis in STATUS.md under "Decisions Pending", and continue on 
something else until I respond.

## Bootstrap sequence (run autonomously)

Don't stop to check in unless a Decision is Pending.

1. Reduce the fork to the target layout. Use git filter-repo OR a 
   single rm -rf + squash commit — your call based on whether 
   preserving history of the kept paths matters to you. Copy Java 
   source to reference/, .star testdata to conformance/, scaffold 
   src/ and tests/ as empty.

2. `uv init` to scaffold. Zero runtime deps. Add pytest and ruff 
   as dev deps. Configure pytest to find tests/.

3. Write tests/test_smoke.py: assert `import starlark` works and 
   `starlark.eval("1 + 1") == 2`. Mark xfail. Run pytest, confirm 
   it xfails for the right reason.

4. Read these, in this order:
   - https://github.com/bazelbuild/starlark/blob/master/spec.md (full)
   - reference/src/main/java/net/starlark/java/eval/Starlark.java 
     (the public API)
   - reference/src/main/java/net/starlark/java/eval/Eval.java 
     (the evaluator, top to bottom)
   - reference/src/main/java/net/starlark/java/syntax/{Lexer,Parser,Resolver}.java
   - reference/src/main/java/net/starlark/java/eval/{StarlarkValue,
     StarlarkList,Dict,Tuple,StarlarkInt,StarlarkFunction,Module,
     Mutability}.java
   - A handful of .star files in conformance/ to see the test idiom
   You do not need to read the entire reference tree.

5. Write ROADMAP.md with phases in dependency order. For each phase 
   list (a) the relevant Java source files, (b) the conformance .star 
   files it should unlock. Suggested phases — adjust as needed:
     1. Lexer
     2. Parser + AST
     3. Resolver
     4. Value model + Mutability
     5. Evaluator: statements
     6. Evaluator: expressions
     7. Core builtins (len, type, range, print, etc.)
     8. String methods
     9. List/dict/tuple methods
    10. Function calls, closures, *args/**kwargs
    11. load() statement
    12. assert.star module
    13. Conformance suite turn-on
    14. Cleanup, docs, README, zipapp packaging

6. Write STATUS.md as a journal with today's date and "Bootstrap 
   complete." Append to it after every phase boundary or every ~10 
   commits, whichever comes first. Each entry: date, what landed, 
   what's next, any Decisions Pending.

7. Commit and push: "Scaffolding, roadmap, status."

8. Begin Phase 1. Implement, test, commit, append to STATUS.md, move 
   to Phase 2. Continue through all phases without stopping unless a 
   Decision is Pending or a stopping condition fires.

## Working agreements

- **Commit cadence.** One concept per commit. Semantic message. The 
  history should read like a tutorial.
- **Test cadence.** Every commit either adds a passing test, flips an 
  xfail to xpass, or adds an xfail for known-future work. No commits 
  without a test delta.
- **Conformance suite as progress dashboard.** From Phase 5 onward, 
  parameterize a pytest over conformance/*.star. Mark all xfail 
  initially. As features land, files flip to passing. The pytest 
  summary line is your progress signal.
- **Cross-validation.** tests/test_cross_validation.py: when 
  `starlark` (starlark-go CLI) is on PATH, run each .star file under 
  both implementations and assert matching success/failure status and 
  matching printed output. Skip the test cleanly when the binary is 
  absent.
- **Reference priority order.** When unsure about an edge case: 
  (1) spec.md, (2) the Java reference, (3) starlark-go's behavior, 
  (4) ask me. Don't guess.
- **Phase-end self-review.** At the end of each phase, before the 
  final commit, read every file you changed during the phase. Look 
  for half-finished work, dead code, inconsistent naming, missing 
  docstrings on public APIs, error messages that won't help users. 
  Fix in a cleanup commit.
- **STATUS.md is your memory.** Long sessions get compacted; 
  STATUS.md doesn't. Write it for the future Claude (or future you) 
  who'll resume work.
- **Don't ask permission for safe destructive ops on this fork.** 
  Rewriting history, deleting files, force-pushing your branch are 
  fine. Ask only before merging to main.

## Stopping conditions

Stop and ask me when:
- A Decision is Pending per the architectural rules.
- The conformance suite stops gaining passing tests for two 
  consecutive phases — something structural is wrong.
- The conformance pass rate exceeds 95%. We'll decide together how 
  to handle the long tail.

Do NOT stop because:
- A phase is hard. Read more reference code.
- A test fails. Fix it or xfail it with explanation.
- You're uncertain about a small detail. Pick the obvious option, 
  document the choice in STATUS.md, move on.

Begin.

The phases covered lexing, parsing, name scope resolution, the value model, evaluation, built-ins, functions, the specifics needed for conformance testing, like load() and assert.star, and finally self-review and documentation.
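Two of the settled architectural decisions translate almost directly into Python. A hedged sketch of the per-Module Mutability token (decision 4) and the registration decorator (decision 5); this is my illustration, not starlark-python’s actual code:

```python
# Hedged illustration of decisions 4 and 5; not the port's real code.
class EvalError(Exception):
    pass

class Mutability:
    """One token per Module; freezing it freezes every value holding it."""
    def __init__(self):
        self.frozen = False

    def freeze(self):  # O(1): flip one flag instead of walking the graph
        self.frozen = True

class StarlarkList:
    def __init__(self, items, mutability):
        self._items = list(items)
        self._mutability = mutability

    def append(self, item):
        if self._mutability.frozen:
            raise EvalError("cannot mutate a frozen list")
        self._items.append(item)

def starlark_method(name, parameters=()):
    """Decorator standing in for Java's @StarlarkMethod annotation."""
    def register(fn):
        fn._starlark_meta = {"name": name, "parameters": tuple(parameters)}
        return fn
    return register
```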

All 14 phases were executed in a single session. The result was 363 tests passing, 11 marked xfail, and four skipped. Of the 38 conformance files, 26 passed, 11 were marked xfail, and the one for json.star was skipped because JSON support hadn’t been implemented.

Agent Claude wrote a retrospective. Claude noted the port was unusually well-suited to LLM-driven development. The starlark-go CLI proved useful for ad hoc cross-validation; without it, Claude had to carefully reread the Java code. Claude wished it had had the CLI from the beginning of the project. (I had suggested installing it towards the end so Claude could also check the tests that relied on it.)

Matching the exact wording of error messages was annoying and accounted for half of the conformance suite’s surface. Claude would have deferred all error-message wording to a single pass at the end. Claude also regretted not paying the thread boilerplate tax upfront, and would write shorter files next time.

Both Claudes talked about project planning in terms of days, when the initial agentic phase only took an hour. They seemed to underestimate the pace of AI work (by implicitly assuming human software engineers?). In its retrospective, agent Claude noticed the discrepancy and then made the same type of prediction again.

Claude assigned credit for the project with characteristic humility:

  • You brought the use case (pure-Python config language, zipapp, RISC-V), made every final decision, and pushed back when I was sloppy. The “do we need simonw-style TDD?” question came from you, and the answer reshaped the whole approach.
  • Me, Claude Opus 4.7, in conversational mode did the comparative analysis, recommended Java, drafted the bootstrap prompt, and worked through the architectural decisions with you over multiple sessions. I was a planning collaborator, not a coder, in this phase.
  • Claude Code (also Opus 4.7) in autonomous mode executed the port. Read the Java reference, wrote the lexer/parser/resolver/evaluator/builtins, drove the conformance suite from xfail to passing, packaged the result. One session.

The sloppiness was not actually related to Starlark. Claude had fallen back on an attribute-blocklist approach when we considered porting PocketPy as a fallback in case the Starlark port failed. The blocklist was a bad direction because there is a long history of escapes from similar Python sandboxes. To quote the simpleeval README: “A lot of very clever people think the whole idea of trying to sandbox CPython is impossible.” When I noted this, Claude coined the term “disjoint object graph”: “The strong design is the one Starlark and Lua actually use: the user-language object graph is disjoint from the host-language object graph, with a narrow, explicit bridge between them.”
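The idea is easy to see in miniature. A hedged sketch of my own, with all names hypothetical:

```python
# Hedged sketch of a disjoint object graph: interpreter values never expose
# host attributes, and only an explicit whitelist bridges host values in.
class StarlarkValue:
    pass

class StarlarkString(StarlarkValue):
    def __init__(self, text):
        self._text = text  # wraps a host str without exposing its methods

def to_starlark(value):
    # the narrow, explicit bridge between the two object graphs
    if isinstance(value, str):
        return StarlarkString(value)
    raise TypeError("cannot bridge %s into Starlark" % type(value).__name__)
```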