finances/migration/README.md
Dane Sabo e446c4097a Migration: rebuild battle-test learnings + opening-balance orphan fix
- build_rebuild_dataset.py: subtract orphan paired-transfer amounts from
  destination card's derived opening; html.unescape descriptions.
- merchant_map.json: +110 auto-tail rules from rebuild long-tail, +20
  recurring rules + 135 auto-cluster acceptances; stripped all cached
  account_ids; Rock Auto -> Z(Mizumi) review:true; Duquesne Light ->
  Utilities; categories stripped from _auto_tail rules per user policy.
- migration/README.md: 'Lessons from the first rebuild' section.
- migration/rebuild_clusters.{json,md}: clustering proposal artifact.
2026-05-25 21:05:38 -04:00

# Firefly rebuild runbook
One-time migration: wipe the CSV-era transactions and rebuild from
FITID-stable QFX so every transaction has a permanent dedup key and a clean
account taxonomy. Read this before running anything in this folder.
## Why a rebuild (not in-place cleanup)
Firefly history is young (everything ~Aug 2025+, ~950 txns, minimal manual
data). Old CSV imports left ~343 fragmented junk expense accounts and no
stable external_ids. A clean rebuild keyed on QFX `FITID` is a better
foundation than reassigning junk in place. Decided 2026-05-17.
## Hard prerequisites (do not skip)
1. **Firefly DB backup.** Destructive, no undo. Do not run the wipe until a
DB dump/snapshot exists.
2. **Exports** (in `../EXPORTS/`, gitignored): Apple/PNC/Costco QFX, Aug 2025
-> now, FITID on 100% of rows. Schwab/Coinbase/Cash (~35 txns) are
CSV-only/manual, handled separately.
## Reconciliation (the trust gate)
Per account: `opening_balance = QFX_ledger - sum(all that account's lines)`.
Classification (transfer vs expense) never changes an account's own balance,
so `opening + sum == ledger` must hold to the cent before trusting the wipe.
Verified: PNC opening $6,866.10, Apple -$4,498.79, Costco -$2,541.57 (all
tie). `rebuild_dryrun.py` recomputes this; re-run after any change.
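
The identity above can be sketched in a few lines. This is a minimal illustration, not the contents of `rebuild_dryrun.py`: the function names and figures are hypothetical, and `Decimal` is an assumption made so the "to the cent" comparison is exact.

```python
from decimal import Decimal


def derive_opening(ledger_balance: Decimal, amounts: list[Decimal]) -> Decimal:
    """opening_balance = QFX_ledger - sum(all that account's lines)."""
    return ledger_balance - sum(amounts, Decimal("0"))


def reconciles(opening: Decimal, amounts: list[Decimal], ledger_balance: Decimal) -> bool:
    """True only if opening + sum == ledger, exact to the cent."""
    return opening + sum(amounts, Decimal("0")) == ledger_balance


# Illustrative figures only, not taken from any real export.
ledger = Decimal("1500.00")
lines = [Decimal("-42.10"), Decimal("2000.00"), Decimal("-707.50")]

opening = derive_opening(ledger, lines)
print(opening)                             # 249.60
print(reconciles(opening, lines, ledger))  # True

# A dropped or duplicated line breaks the tie, which is what the gate catches.
print(reconciles(opening, lines[:-1], ledger))  # False
```

The check only earns its keep when the opening is derived once and the line set changes afterwards (dedup, pairing, re-export), hence the advice to re-run after any change.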
## Classification rules (PNC = the hub)
- **Transfers** -- ALWAYS owned by the PNC leg: PNC's posting date and PNC's
FITID are authoritative, the card/brokerage counterpart line is paired by
amount (+/- a few days) and dropped. Every transfer lives under PNC, one
consistent date, never double-counted. Pairs: APPLECARD GSBANK -> Apple
Credit Card; CITI AUTOPAY -> Costco Visa Card; SCHWAB MONEYLINK -> Schwab
Stocks/Savings (disambiguate by amount); ATM WITHDRAWAL -> Cash; CARVANA
PAYOUT -> Illiquid Assets; big ATM DEPOSIT -> Coverdell; CAPITAL ONE ->
Capital One (closed). Codified in the skill's `references/transfers.md`.
- **Income/expense**: Pitt salary -> Wages; Duquesne Light -> Utilities:
Electric; Compeer -> Rent; etc.
- **Don't Know**: Venmo/CashApp/Zelle ("poker"), unrecallable checks, unknown
ATM deposits -> the `Don't Know` account, review later. Never guessed.
- **Special accounts**: `Illiquid Assets` (cars; sale = transfer in),
`Don't Know` (catch-all). See the skill's memory / taxonomy notes.
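
The transfer rule above (PNC leg authoritative, counterpart paired by amount within a few days and dropped) might look roughly like this. It is a sketch under assumptions: the `Line` type, the `pair_transfers` name, the 3-day default window, and the convention that the card-side amount is the exact negation of the PNC amount are all hypothetical, not taken from the skill's code.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    fitid: str
    posted: date
    amount: Decimal  # signed from the owning account's point of view


def pair_transfers(pnc_lines, card_lines, window_days=3):
    """Match each PNC leg to at most one card line of opposite amount.

    Returns (pairs, orphans). In each pair the PNC line is the one to keep;
    the card line is the counterpart to drop. Orphans are PNC legs with no
    counterpart inside the window.
    """
    unused = list(card_lines)
    pairs, orphans = [], []
    for pnc in sorted(pnc_lines, key=lambda l: l.posted):
        candidates = [
            c for c in unused
            if c.amount == -pnc.amount
            and abs((c.posted - pnc.posted).days) <= window_days
        ]
        if not candidates:
            orphans.append(pnc)
            continue
        # Prefer the counterpart closest in time, and consume it so it
        # cannot be paired twice.
        match = min(candidates, key=lambda c: abs((c.posted - pnc.posted).days))
        unused.remove(match)
        pairs.append((pnc, match))
    return pairs, orphans


# Illustrative data only.
pnc = [
    Line("P1", date(2025, 9, 1), Decimal("-100.00")),
    Line("P2", date(2025, 8, 1), Decimal("-500.00")),
]
card = [
    Line("A1", date(2025, 9, 2), Decimal("100.00")),
    Line("A2", date(2025, 9, 20), Decimal("500.00")),  # outside the window
]
pairs, orphans = pair_transfers(pnc, card)
print([(p.fitid, c.fitid) for p, c in pairs])  # [('P1', 'A1')]
print([o.fitid for o in orphans])              # ['P2']
```

Returning orphans separately matters because an unpaired PNC leg still gets posted as a transfer, and the destination account's opening balance has to account for it.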
## Investment accounts
Do NOT transaction-import Schwab/Roth/Coverdell/Coinbase (noise, and assets
!= currency). Model as monthly-valued: opening balance + external MoneyLink
transfers (from the PNC side) + one monthly valuation adjustment booked to
`Investment Appreciation` / `Investment: Interest`. Dane supplies the current
value at import; delta = the adjustment. Savings<->Stocks journals are
transfers.
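
As a worked example of the monthly valuation step, the adjustment is just the supplied current value minus what the books already show. The function name and all figures below are hypothetical.

```python
from decimal import Decimal


def valuation_adjustment(book_balance: Decimal, current_value: Decimal) -> Decimal:
    """Delta to book so the account matches the supplied current value."""
    return current_value - book_balance


# Illustrative figures only.
opening = Decimal("10000.00")
transfers_in = [Decimal("500.00"), Decimal("500.00")]  # MoneyLink, from the PNC side
prior_adjustments = [Decimal("120.25")]                # earlier monthly valuations

book = opening + sum(transfers_in, Decimal("0")) + sum(prior_adjustments, Decimal("0"))
delta = valuation_adjustment(book, Decimal("11400.00"))
print(book)   # 11120.25
print(delta)  # 279.75
```

A negative delta is handled the same way, as a downward adjustment for the month.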
## Execution order
1. `python rebuild_dryrun.py` -> confirm all accounts still reconcile.
2. Build the full normalized dataset (PNC + Apple + Costco, transfers typed,
payments paired/deduped, opening balances set).
3. Drive review via the skill's browser workflow
(`references/review-workflow.md`): `--review-html`, resolve the ~190 tail
merchants in-situ (search-then-ask, <80% => ask), Export `decisions.json`.
4. **Confirm DB backup exists.**
5. Wipe transactions, prune empty junk expense accounts.
6. `--decisions decisions.json --post`. Reconcile final balances against the
derived figures above.
## Files here
- `rebuild_pnc.py` -- PNC classifier + reconciliation (read-only)
- `rebuild_dryrun.py` -- consolidated per-account reconciliation (read-only)
- `pnc_classified.json` -- PNC classification output
- `merchant_clusters.{json,md}` -- cluster proposal (taxonomy bootstrap)
- `mock_firefly.py` -- stdlib mock used for skill eval/testing
- `*review_preview*.html` -- review-UI previews on real data

Nothing here writes to Firefly except the final `--post` in step 6.
## Lessons from the first rebuild (2026-05-20)
Captured here so a second rebuild doesn't re-discover them.
- **Orphan paired transfers**: the PNC->Apple payment from 2025-08-01 has no
Apple-side line (Apple's QFX starts 08-02). Its effect was already in
Apple's derived opening, so posting the transfer credited Apple a second
time and double-counted $3,218. Fix: `build_rebuild_dataset.py` now
subtracts orphan transfer amounts from the destination card's opening. See
`references/transfers.md` in the skill.
- **Asset accounts require `account_role`** on POST /accounts. `defaultAsset`
works universally.
- **Budgets do not auto-create.** If wiping to scratch, recreate Needs /
Wants / Savings via UI or POST before the import.
- **Wipe via UI leaves stale revenue accounts / categories** (only
transaction-referenced asset accounts go). Prune manually if you want a
truly clean slate.
- **Strip cached `account_id` from `merchant_map.json` before any rebuild.**
Pre-wipe ids are invalid post-wipe. The skill no longer caches to the map
(in-memory only) but old maps may still carry stale ids.
- **Background Python with `nohup ... &` can lose stdout to buffering.** Use
`python -u` for the import step. The first rebuild's log was empty because
Python buffered everything and we mistook it for "ran but did nothing."
- **`error_if_duplicate_hash` is now off**: Firefly's content-hash dedup was
too eager (it rejected legitimately distinct rows with the same date,
amount, and description, such as two parking sessions at the same garage).
The `external_id` precheck is the only dedup.
- **Wipe by deleting transactions, not by deleting accounts.** Otherwise you
end up with stale ids referenced by the `merchant_map.json` cache.
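
The orphan-transfer lesson above is easiest to see with numbers. The sketch below is not the code in `build_rebuild_dataset.py`; the function name and figures are hypothetical, and it assumes a card balance is negative when money is owed and a payment arrives as a positive credit.

```python
from decimal import Decimal


def adjusted_opening(derived_opening: Decimal, orphan_amounts: list[Decimal]) -> Decimal:
    """Subtract orphan transfer credits from the destination card's opening."""
    return derived_opening - sum(orphan_amounts, Decimal("0"))


# Illustrative figures only.
ledger = Decimal("-1000.00")        # card's QFX ledger balance
card_lines = [Decimal("-200.00")]   # lines present in the card's QFX
orphans = [Decimal("300.00")]       # payment posted from PNC, no card-side line

derived = ledger - sum(card_lines, Decimal("0"))  # -800.00, payment already baked in
naive_final = derived + sum(card_lines, Decimal("0")) + sum(orphans, Decimal("0"))
print(naive_final)                  # -700.00, off from the ledger by the orphan amount

fixed = adjusted_opening(derived, orphans)
fixed_final = fixed + sum(card_lines, Decimal("0")) + sum(orphans, Decimal("0"))
print(fixed)                        # -1100.00
print(fixed_final == ledger)        # True
```

The posted transfer stays in place under PNC; only the card's opening moves, so the final balance ties to the ledger again.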