Local cache — v10 → v13

Status: v13 shipped. This doc covers the design history from v10 (the move to SQLite + on-disk outputs) through v13 (the current unified per-entry layout). Each version’s rationale is preserved so later contributors can see why the layout looks the way it does.

TL;DR — what the cache looks like today (v13)

<cacheDir>/                  (default: <workspaceRoot>/.vx/cache)
├── cache.db                 SQLite metadata + run history
├── cache.db-wal             write-ahead log
├── cache.db-shm             shared memory
└── <hash>/                  one directory per cache entry
    ├── stdout               captured stdout (text)
    ├── stderr               captured stderr (text)
    └── outputs/             declared output files, project-relative
        ├── dist/index.js
        └── ...

One entry is one directory. Eviction is rm -rf <hash>/. Concurrent readers and writers coordinate via SQLite’s WAL plus atomic rename(2) of the per-entry temp dir. Outputs stay as files (not BLOBs) because cache-hit restore is a recursive file copy into the project anyway.
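As a rough illustration of why the one-directory layout keeps restore and eviction simple, here is a minimal sketch using only Node's standard library. The function names `restoreEntry` and `evictEntry` are hypothetical and not taken from the vx source. A real implementation would also consult and update the SQLite index, which is left out here.

```typescript
import { cpSync, existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: restore a cache hit by copying <hash>/outputs/ into the project.
// Returns false when the entry has no outputs directory (treated as a miss).
function restoreEntry(cacheDir: string, hash: string, projectDir: string): boolean {
  const outputs = join(cacheDir, hash, "outputs");
  if (!existsSync(outputs)) return false;
  // Paths under outputs/ are project-relative, so a recursive copy is the whole restore.
  cpSync(outputs, projectDir, { recursive: true });
  return true;
}

// Hypothetical helper: eviction is a single recursive delete of the entry directory.
function evictEntry(cacheDir: string, hash: string): void {
  rmSync(join(cacheDir, hash), { recursive: true, force: true });
}

// Demo in a throwaway directory: seed one entry, restore it, then evict it.
const root = mkdtempSync(join(tmpdir(), "vx-restore-demo-"));
const cacheDir = join(root, "cache");
const projectDir = join(root, "project");
mkdirSync(join(cacheDir, "abc123", "outputs", "dist"), { recursive: true });
mkdirSync(projectDir);
writeFileSync(join(cacheDir, "abc123", "outputs", "dist", "index.js"), "export {};\n");

const restored = restoreEntry(cacheDir, "abc123", projectDir);
const restoredText = readFileSync(join(projectDir, "dist", "index.js"), "utf8");
evictEntry(cacheDir, "abc123");
const stillCached = existsSync(join(cacheDir, "abc123"));
```

Because stdout, stderr, and outputs/ all live under the same `<hash>/` directory, the single `rmSync` call removes everything the entry owns.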

See docs/caching.md for user-facing reference and docs/modules/cache.md for the module-level contract.

The pre-v10 cache stored everything under <hash>/:

<cacheDir>/<hash>/
├── meta.json    # taskId, command, exitCode, durationMs, outputFiles, stdout, stderr
└── outputs/
    └── <project-relative paths>

Rough edges as the cache grew:

  • No index. Listing entries meant reading the directory. Sizes, ages, project/task associations required parsing every meta.json.
  • No eviction. Old entries piled up. “Delete older than N days” was doable with find but inefficient; “evict LRU until under X MB” required summing sizes from disk.
  • No run history. Couldn’t answer “what’s my cache hit rate” or “which tasks ran in the last hour and how long did they take?”
  • Metadata reads were stat storms. Each existence check had to stat <hash>/meta.json. At thousands of entries × multiple tasks per run, this added up.

v10 inverts the model: SQLite holds the metadata index; outputs stay as files on disk. This is the same model Nx adopted in its 19.x line, for the same reasons.

CREATE TABLE entries (
  hash         TEXT PRIMARY KEY,
  project      TEXT NOT NULL,
  task         TEXT NOT NULL,
  command      TEXT NOT NULL,
  exit_code    INTEGER NOT NULL,
  duration_ms  INTEGER NOT NULL,
  size_bytes   INTEGER NOT NULL,
  created_at   INTEGER NOT NULL,
  accessed_at  INTEGER NOT NULL
);
CREATE TABLE runs (...);
CREATE TABLE schema_meta (...);

Why these tables:

  • entries is the cache index: one row per cached output. Drives HEAD checks, restore, and eviction.
  • runs is the run history: one row per task execution (hit or miss, success or failure). Drives stats and debugging.
  • schema_meta carries the schema version. Bumped when the schema changes; we nuke + recreate on mismatch (pre-alpha).

Concurrency:

  • PRAGMA journal_mode = WAL + PRAGMA synchronous = NORMAL — WAL allows concurrent readers + one writer without conflicts.
  • PRAGMA busy_timeout = 5000 — concurrent vx run invocations queue instead of failing with SQLITE_BUSY.

v11 adds nullable analytics columns to the runs table without changing existing semantics:

| Column | Source |
| --- | --- |
| run_id | ULID stamped by orchestrator per invocation |
| cpu_ms | Bun.spawn().resourceUsage().cpuTime total |
| peak_rss_bytes | resourceUsage().maxRSS * 1024 |
| wallclock_start_ns | hrtime.bigint() relative to run t=0 |
| wallclock_end_ns | same |
| cache_hit | convenience boolean (derivable from status) |
| bytes_uploaded | remote-cache push size (LayeredCache) |
| bytes_downloaded | remote-cache pull size on hit |

Why nullable: old rows shouldn’t disappear, and the runner / remote layer populates these progressively. Querying with WHERE cpu_ms IS NULL shows you the rows from before the column landed.

Use cases unlocked:

  • --profile Chrome-trace JSON (hrtime spans).
  • --summarize per-run JSON (every analytics field).
  • Direct sqlite3 cache.db queries for slow-task ranking, hit-rate graphs, CI dashboards.
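To make the nullable-column semantics concrete, here is a hedged sketch of the kind of aggregation a dashboard might run over rows read from the runs table. The `RunRow` shape is an assumption: only `cpu_ms` and `cache_hit` come from the column list above, while `project` and `task` are assumed to exist on runs as they do on entries. `summarizeRuns` is a hypothetical name.

```typescript
// Assumed row shape; analytics columns are nullable because older rows predate them.
interface RunRow {
  project: string;
  task: string;
  cache_hit: boolean | null;
  cpu_ms: number | null;
}

interface RunSummary {
  hitRate: number | null; // null when no row carries hit information
  slowest: { task: string; cpuMs: number }[]; // total CPU time per task, descending
}

function summarizeRuns(rows: RunRow[]): RunSummary {
  // Rows written before the analytics columns landed have NULLs; skip them
  // rather than counting them as misses or as zero CPU time.
  const withHitInfo = rows.filter((r) => r.cache_hit !== null);
  const hits = withHitInfo.filter((r) => r.cache_hit === true).length;
  const hitRate = withHitInfo.length === 0 ? null : hits / withHitInfo.length;

  const cpuByTask = new Map<string, number>();
  for (const r of rows) {
    if (r.cpu_ms === null) continue;
    const id = `${r.project}:${r.task}`;
    cpuByTask.set(id, (cpuByTask.get(id) ?? 0) + r.cpu_ms);
  }
  const slowest = Array.from(cpuByTask.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([task, cpuMs]) => ({ task, cpuMs }));

  return { hitRate, slowest };
}

// Demo: three instrumented rows plus one legacy row with NULL analytics fields.
const summary = summarizeRuns([
  { project: "web", task: "build", cache_hit: false, cpu_ms: 900 },
  { project: "web", task: "build", cache_hit: true, cpu_ms: 10 },
  { project: "api", task: "test", cache_hit: false, cpu_ms: 400 },
  { project: "api", task: "test", cache_hit: null, cpu_ms: null },
]);
```

The same filtering could be expressed directly in SQL with `WHERE cpu_ms IS NOT NULL`; the TypeScript form is used here only to keep the example self-contained.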

v11 → v12: project package.json folded in

Pre-v12, the cache key incorporated the workspace fingerprint (lockfile + workspace yaml) and each input file’s contents. It did NOT incorporate the project’s package.json bytes directly. A narrow cache.inputs.files: ['src/**'] would miss:

  • A new dependencies entry that’s already in the lockfile (the lockfile changes, so the workspace fingerprint covers it).
  • A scripts.build rewrite that doesn’t affect anything in src/.
  • A version bump.

v12 folds sha256(<projectDir>/package.json) into every task’s cache key as a separate projectPackageJsonHash field on CacheKeyInput. Matches Turbo / Nx’s “implicit dependencies” behaviour. One-line addition in cache.ts:key() + a hashProjectPackageJson helper in orchestrator/execute-task.ts.
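The effect of the extra field can be sketched as follows. This is an illustrative reconstruction, not the code in cache.ts: the names `CacheKeyInput`, `projectPackageJsonHash`, `hashProjectPackageJson`, and `key` come from the text above, but every other field, the canonical serialization, and the choice to pass package.json contents as a string are assumptions.

```typescript
import { createHash } from "node:crypto";

// Assumed shape; only projectPackageJsonHash is named in this document.
interface CacheKeyInput {
  cacheVersion: number;
  workspaceFingerprint: string;
  command: string;
  inputFileHashes: Record<string, string>; // project-relative path -> content hash
  projectPackageJsonHash: string;
}

function sha256Hex(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Takes the file contents directly so the sketch stays free of disk I/O.
function hashProjectPackageJson(packageJsonText: string): string {
  return sha256Hex(packageJsonText);
}

function key(input: CacheKeyInput): string {
  // Sort paths so the key does not depend on object insertion order.
  const files = Object.keys(input.inputFileHashes)
    .sort()
    .map((path) => [path, input.inputFileHashes[path]]);
  const canonical = JSON.stringify([
    input.cacheVersion,
    input.workspaceFingerprint,
    input.command,
    files,
    input.projectPackageJsonHash,
  ]);
  return sha256Hex(canonical);
}

// Demo: a version bump in package.json changes the key even though src/ is untouched.
const base: CacheKeyInput = {
  cacheVersion: 13,
  workspaceFingerprint: "lockfile-and-workspace-yaml-hash",
  command: "bun run build",
  inputFileHashes: { "src/index.ts": "aaa", "src/util.ts": "bbb" },
  projectPackageJsonHash: hashProjectPackageJson('{"name":"web","version":"1.0.0"}'),
};
const keyBefore = key(base);
const keyAfter = key({
  ...base,
  projectPackageJsonHash: hashProjectPackageJson('{"name":"web","version":"1.0.1"}'),
});
```

Keeping the package.json hash as its own field, rather than adding the file to `inputFileHashes`, means a narrow `cache.inputs.files` glob cannot exclude it.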

v12’s on-disk layout still had a sibling logs/ tree:

<cacheDir>/
├── cache.db
├── <hash>/              # outputs, mixed with metadata
│   └── dist/...
└── logs/
    ├── <hash>.stdout
    └── <hash>.stderr

Two operational annoyances:

  • Eviction was multi-step: rm -rf <hash>/ + rm logs/<hash>.stdout + rm logs/<hash>.stderr. Easy to miss one branch.
  • Outputs and metadata were intermingled. <hash>/dist/... mixes user-controlled paths with our internal layout. If a future cache version added a per-entry metadata file under <hash>/, namespace collision would be a real concern (user output named meta.json → collision).

v13 moves to one-directory-per-entry, fully namespaced:

<cacheDir>/<hash>/
├── stdout
├── stderr
└── outputs/<rel paths>

Eviction: rm -rf <hash>/. Future per-entry metadata: <hash>/<file> adds cleanly without colliding with user output paths.

The runner-side <cacheDir>/logs/<run_id>/<project>__<task>.{stdout, stderr} dump from earlier versions was deleted in the same change. Reasoning: successful runs already capture stdout/stderr per cache entry; failures stream live AND surface on TaskOutcome.stderr; CI captures parent stdout natively; structured per-task metadata lives in the runs SQLite table. The sibling dump was pure redundancy.

Not implemented as of v13:

  • HMAC integrity check on cache entries. A corrupted disk could return wrong bytes; we trust the filesystem.
  • Compression. Typical dist/ is ~1–10 MB per entry; compressing on disk costs more than it saves locally. The remote-cache layer uses tar.gz for transport.
  • Symlink-aware traversal. Bun globs the real tree.
  • LRU eviction during a vx run. Capture is there (size_bytes, accessed_at); auto-evict isn’t. vx cache prune --max-size <X> is user-driven.
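For reference, the selection step behind a size-bounded prune can be sketched from the two captured columns alone. This is a hypothetical illustration of LRU selection, not the actual `vx cache prune` implementation; `planLruEviction` and the `EntryRow` shape are assumed names, with fields mirroring `size_bytes` and `accessed_at` from the entries table.

```typescript
// Subset of the entries table needed to plan an eviction.
interface EntryRow {
  hash: string;
  size_bytes: number;
  accessed_at: number; // timestamp; smaller means less recently used
}

// Returns the hashes to evict, least recently used first, until the
// remaining total fits within maxTotalBytes. Does not touch the disk.
function planLruEviction(entries: EntryRow[], maxTotalBytes: number): string[] {
  let total = entries.reduce((sum, e) => sum + e.size_bytes, 0);
  const oldestFirst = entries.slice().sort((a, b) => a.accessed_at - b.accessed_at);
  const evict: string[] = [];
  for (const entry of oldestFirst) {
    if (total <= maxTotalBytes) break;
    evict.push(entry.hash);
    total -= entry.size_bytes;
  }
  return evict;
}

// Demo: 600 bytes cached, budget of 400. The two least recently used entries go.
const plan = planLruEviction(
  [
    { hash: "a", size_bytes: 100, accessed_at: 1 },
    { hash: "b", size_bytes: 200, accessed_at: 3 },
    { hash: "c", size_bytes: 300, accessed_at: 2 },
  ],
  400,
);
```

Each hash in the returned plan would then map to one recursive delete of `<hash>/` plus one row delete in entries.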

Concurrent vx run invocations on the same filesystem

WAL handles the SQLite side. Output writes use the existing tmpdir + atomic rename pattern (<hash>.tmp-<pid>-<ms>/<hash>/). If two processes race on the same hash, both can write the temp dir; the first to rename wins; the second’s rename fails on EEXIST and we treat the entry as “already written” and proceed. The SQLite upsert is idempotent.
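The race described above can be sketched with Node's standard library. `publishEntry` is a hypothetical name, and the SQLite upsert is omitted. One assumption worth flagging: on Linux, renaming a directory onto an existing non-empty directory typically fails with ENOTEMPTY rather than EEXIST, so the sketch treats both codes as "already written".

```typescript
import { mkdirSync, mkdtempSync, readFileSync, readdirSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { pid } from "node:process";

type PublishResult = "written" | "already-written";

// Stage the entry under <hash>.tmp-<pid>-<ms>/<hash>/, then rename it into place.
function publishEntry(cacheDir: string, hash: string, stdout: string, stderr: string): PublishResult {
  const staging = join(cacheDir, `${hash}.tmp-${pid}-${Date.now()}`);
  const staged = join(staging, hash);
  mkdirSync(join(staged, "outputs"), { recursive: true });
  writeFileSync(join(staged, "stdout"), stdout);
  writeFileSync(join(staged, "stderr"), stderr);
  try {
    // The rename is the commit point: readers see the whole entry or none of it.
    renameSync(staged, join(cacheDir, hash));
    return "written";
  } catch (err) {
    const code = (err as { code?: string }).code;
    // Another process won the race; its entry is equivalent, so keep it.
    if (code === "EEXIST" || code === "ENOTEMPTY") return "already-written";
    throw err;
  } finally {
    // Remove the staging directory whether we won or lost.
    rmSync(staging, { recursive: true, force: true });
  }
}

// Demo: publishing the same hash twice into a throwaway cache directory.
const demoCacheDir = mkdtempSync(join(tmpdir(), "vx-publish-demo-"));
const first = publishEntry(demoCacheDir, "abc123", "built\n", "");
const second = publishEntry(demoCacheDir, "abc123", "built\n", "");
const publishedStdout = readFileSync(join(demoCacheDir, "abc123", "stdout"), "utf8");
const remaining = readdirSync(demoCacheDir);
```

Treating the losing rename as success is safe only because both writers produce the entry for the same hash, and therefore the same content.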

Concurrent invocations across hosts on the same shared filesystem work, but the WAL guarantee is per-host. Cross-host coordination is the remote cache’s job; that workstream is covered by design/remote-cache.md.

See docs/caching.md § Bumping CACHE_VERSION.

Bumps to date:

  • v9 → v10 (PR #7): SQLite metadata + on-disk outputs introduced.
  • v10 → v11 (PR #19): analytics columns added (nullable, backwards-compatible for reads).
  • v11 → v12 (PR #42): project package.json hash folded in.
  • v12 → v13 (PR #65): unified per-entry layout; run-logs sibling tree removed.

The version is the constant CACHE_VERSION in src/cache/cache.ts. Bumping orphans every previously-stored entry. Pre-alpha tolerates this freely.