- build_rebuild_dataset.py: subtract orphan paired-transfer amounts from
destination card's derived opening; html.unescape descriptions.
- merchant_map.json: +110 auto-tail rules from rebuild long-tail, +20
recurring rules + 135 auto-cluster acceptances; stripped all cached
account_ids; Rock Auto -> Z(Mizumi) review:true; Duquesne Light ->
Utilities; categories stripped from _auto_tail rules per user policy.
- migration/README.md: 'Lessons from the first rebuild' section.
- migration/rebuild_clusters.{json,md}: clustering proposal artifact.
108 lines
5.4 KiB
Markdown
# Firefly rebuild runbook

One-time migration: wipe the CSV-era transactions and rebuild from
FITID-stable QFX so every transaction has a permanent dedup key and a clean
account taxonomy. Read this before running anything in this folder.

## Why a rebuild (not in-place cleanup)

Firefly history is young (everything ~Aug 2025+, ~950 txns, minimal manual
data). Old CSV imports left ~343 fragmented junk expense accounts and no
stable external_ids. A clean rebuild keyed on QFX `FITID` is a better
foundation than reassigning junk in place. Decided 2026-05-17.

## Hard prerequisites (do not skip)

1. **Firefly DB backup.** The wipe is destructive and has no undo. Do not run
   it until a DB dump/snapshot exists.
2. **Exports** (in `../EXPORTS/`, gitignored): Apple/PNC/Costco QFX, Aug 2025
   -> now, FITID on 100% of rows. Schwab/Coinbase/Cash (~35 txns) are
   CSV-only/manual, handled separately.

## Reconciliation (the trust gate)

Per account: `opening_balance = QFX_ledger - sum(all that account's lines)`.
Classification (transfer vs expense) never changes an account's own balance,
so `opening + sum == ledger` must hold to the cent before trusting the wipe.
Verified: PNC opening $6,866.10, Apple -$4,498.79, Costco -$2,541.57 (all
tie). `rebuild_dryrun.py` recomputes this; re-run after any change.

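
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `rebuild_dryrun.py`: the function names and all figures are hypothetical, and it assumes each line's amount is already signed from the owning account's point of view. `Decimal` is used because a to-the-cent equality check is unreliable with floats.

```python
from decimal import Decimal


def derive_opening(ledger_balance: Decimal, amounts: list[Decimal]) -> Decimal:
    """Opening balance implied by the ledger balance and the account's lines."""
    return ledger_balance - sum(amounts, Decimal("0"))


def reconciles(opening: Decimal, amounts: list[Decimal], ledger_balance: Decimal) -> bool:
    """True only if opening + sum(lines) equals the ledger balance exactly."""
    return opening + sum(amounts, Decimal("0")) == ledger_balance


# Hypothetical figures, not taken from the real exports.
lines = [Decimal("-42.10"), Decimal("1500.00"), Decimal("-7.35")]
ledger = Decimal("2450.55")
opening = derive_opening(ledger, lines)  # Decimal("1000.00")

# Derive the opening once from the raw lines, then re-check against whatever
# lines survive later processing: a dropped or duplicated line breaks the tie.
still_ties = reconciles(opening, lines, ledger)            # True
after_lost_line = reconciles(opening, lines[:-1], ledger)  # False
```

The check is only informative when the opening is derived from the raw export and the comparison is run on the processed dataset; deriving and checking against the same list always ties.
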
## Classification rules (PNC = the hub)

- **Transfers** -- ALWAYS owned by the PNC leg: PNC's posting date and PNC's
  FITID are authoritative; the card/brokerage counterpart line is paired by
  amount (+/- a few days) and dropped. Every transfer lives under PNC, with one
  consistent date, and is never double-counted. Pairs: APPLECARD GSBANK -> Apple
  Credit Card; CITI AUTOPAY -> Costco Visa Card; SCHWAB MONEYLINK -> Schwab
  Stocks/Savings (disambiguate by amount); ATM WITHDRAWAL -> Cash; CARVANA
  PAYOUT -> Illiquid Assets; big ATM DEPOSIT -> Coverdell; CAPITAL ONE ->
  Capital One (closed). Codified in the skill's `references/transfers.md`.
- **Income/expense**: Pitt salary -> Wages; Duquesne Light -> Utilities:
  Electric; Compeer -> Rent; etc.
- **Don't Know**: Venmo/CashApp/Zelle ("poker"), unrecallable checks, unknown
  ATM deposits -> the `Don't Know` account, review later. Never guessed.
- **Special accounts**: `Illiquid Assets` (cars; sale = transfer in),
  `Don't Know` (catch-all). See the skill's memory / taxonomy notes.

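
The transfer-pairing rule (match by amount within a few days, keep the PNC leg, drop the counterpart) might look roughly like the sketch below. Everything here is an assumption for illustration: the `Line` shape, the `pair_transfers` name, the 3-day default window, and the sign convention that a payment is negative on PNC and positive on the card. The actual rules are the ones in `references/transfers.md`.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    fitid: str
    posted: date
    amount: Decimal  # signed from the owning account's point of view


def pair_transfers(pnc_legs, card_lines, window_days=3):
    """Match each PNC transfer leg to one card line of opposite sign.

    Returns (pairs, orphans, remaining_card_lines). Paired card lines are
    dropped from the remainder so the transfer is only represented once.
    """
    remaining = list(card_lines)
    pairs, orphans = [], []
    for leg in pnc_legs:
        candidates = [
            c for c in remaining
            if c.amount == -leg.amount
            and abs((c.posted - leg.posted).days) <= window_days
        ]
        if not candidates:
            orphans.append(leg)  # no counterpart inside the window
            continue
        # Prefer the counterpart closest in date to the PNC posting date.
        match = min(candidates, key=lambda c: abs((c.posted - leg.posted).days))
        remaining.remove(match)
        pairs.append((leg, match))
    return pairs, orphans, remaining


# Hypothetical data.
pnc = [
    Line("P1", date(2025, 9, 1), Decimal("-500.00")),
    Line("P2", date(2025, 8, 1), Decimal("-800.00")),
]
card = [
    Line("A1", date(2025, 9, 2), Decimal("500.00")),
    Line("A2", date(2025, 9, 3), Decimal("-12.99")),
]
pairs, orphans, remaining = pair_transfers(pnc, card)
```

Here `P1` pairs with `A1`, `P2` comes back as an orphan (no card-side line), and `A2` stays as an ordinary card expense. Returning orphans explicitly matters because they need separate handling in the opening balance.
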
## Investment accounts

Do NOT transaction-import Schwab/Roth/Coverdell/Coinbase (noise, and assets
!= currency). Model as monthly-valued: opening balance + external MoneyLink
transfers (from the PNC side) + one monthly valuation adjustment booked to
`Investment Appreciation` / `Investment: Interest`. Dane supplies the current
value at import; delta = the adjustment. Savings<->Stocks journals are
transfers.

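
As a worked example of the monthly-valued model, the adjustment is the reported value minus the book balance (opening plus transfers so far). The figures and the `valuation_adjustment` name below are hypothetical; how the delta is split between `Investment Appreciation` and `Investment: Interest` is not covered by this sketch.

```python
from decimal import Decimal


def valuation_adjustment(book_balance: Decimal, reported_value: Decimal) -> Decimal:
    """Amount to book so the account's balance equals the reported value."""
    return reported_value - book_balance


# Hypothetical figures.
opening = Decimal("10000.00")
transfers_in = [Decimal("500.00"), Decimal("250.00")]  # MoneyLink legs from the PNC side
book = opening + sum(transfers_in, Decimal("0"))        # Decimal("10750.00")

# Current value supplied at import time.
delta = valuation_adjustment(book, Decimal("11120.40"))  # Decimal("370.40")
```

A negative delta means the account lost value that month and would be booked in the opposite direction.
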
## Execution order

1. `python rebuild_dryrun.py` -> confirm all accounts still reconcile.
2. Build the full normalized dataset (PNC + Apple + Costco, transfers typed,
   payments paired/deduped, opening balances set).
3. Drive review via the skill's browser workflow
   (`references/review-workflow.md`): `--review-html`, resolve the ~190 tail
   merchants in-situ (search-then-ask, <80% => ask), Export `decisions.json`.
4. **Confirm DB backup exists.**
5. Wipe transactions, prune empty junk expense accounts.
6. `--decisions decisions.json --post`. Reconcile final balances against the
   derived figures above.

## Files here

- `rebuild_pnc.py` -- PNC classifier + reconciliation (read-only)
- `rebuild_dryrun.py` -- consolidated per-account reconciliation (read-only)
- `pnc_classified.json` -- PNC classification output
- `merchant_clusters.{json,md}` -- cluster proposal (taxonomy bootstrap)
- `mock_firefly.py` -- stdlib mock used for skill eval/testing
- `*review_preview*.html` -- review-UI previews on real data

Nothing here writes to Firefly except the final `--post` in step 6.

## Lessons from the first rebuild (2026-05-20)

Captured here so a second rebuild doesn't re-discover them.

- **Orphan paired transfers**: the PNC->Apple payment from 2025-08-01 has no
  Apple-side line (Apple's QFX starts 08-02). Its effect was already in
  Apple's derived opening, so posting the transfer credited Apple a second
  time and double-counted $3,218. Fix: `build_rebuild_dataset.py` now subtracts
  orphan transfer amounts from the destination card's opening. See
  `references/transfers.md` in the skill.
- **Asset accounts require `account_role`** on `POST /accounts`. `defaultAsset`
  works universally.
- **Budgets do not auto-create.** If wiping to scratch, recreate Needs /
  Wants / Savings via UI or POST before the import.
- **Wipe via UI leaves stale revenue accounts / categories** (only
  transaction-referenced asset accounts go). Prune manually if you want a
  truly clean slate.
- **Strip cached `account_id` from `merchant_map.json` before any rebuild.**
  Pre-wipe ids are invalid post-wipe. The skill no longer caches to the map
  (in-memory only), but old maps may still carry stale ids.
- **Background Python with `nohup ... &` can lose stdout to buffering.** Use
  `python -u` for the import step. The first rebuild's log was empty because
  Python buffered everything, and we mistook it for "ran but did nothing."
- **`error_if_duplicate_hash` is now off** -- Firefly's content-hash dedup
  was too eager (rejected legit-distinct rows with same date+amt+desc, such as
  two parking sessions at the same garage). `external_id` precheck is the only
  dedup.
- **Wipe by deleting transactions, not by deleting accounts.** Otherwise you
  end up with stale ids referenced by the `merchant_map.json` cache.

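
The orphan-transfer lesson is easiest to see with numbers. The sketch below uses made-up figures and a hypothetical `adjusted_opening` helper; it is not the code in `build_rebuild_dataset.py`, and it assumes amounts are signed from the card's side (charges negative, payments positive).

```python
from decimal import Decimal

ZERO = Decimal("0")


def adjusted_opening(derived_opening: Decimal, orphan_transfers: list[Decimal]) -> Decimal:
    """Remove orphan transfer credits that the derived opening already contains."""
    return derived_opening - sum(orphan_transfers, ZERO)


# Hypothetical card account.
ledger = Decimal("-1200.00")                           # balance reported by the QFX
card_lines = [Decimal("-300.00"), Decimal("-150.00")]  # lines present in the card's QFX
orphans = [Decimal("800.00")]                          # PNC payment with no card-side line

# Opening derived from the card's own lines; the orphan payment's effect is
# already baked into this number because it happened before the export began.
derived = ledger - sum(card_lines, ZERO)               # Decimal("-750.00")

# Posting the PNC-owned transfer on top of the derived opening double-counts it.
naive_final = derived + sum(card_lines, ZERO) + sum(orphans, ZERO)

# Subtracting the orphan from the opening restores the tie with the ledger.
fixed_final = adjusted_opening(derived, orphans) + sum(card_lines, ZERO) + sum(orphans, ZERO)
```

With these figures `naive_final` is off from the ledger by exactly the orphan amount (800.00), while `fixed_final` equals the ledger balance.
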