HTML-entity / encoding cleanup
Runbook for the recurring class of bug that shows up as `Rare &amp; Special`
in a product title, a category badge, a breadcrumb, or a JSON-LD payload.
The problem
Three legacy surfaces keep feeding double-encoded entities into the WordPress database:
- Word / ChatGPT paste into the classic editor: pastes smart quotes (“ ” ‘ ’), en-dashes (–), and `<span style="mso-…">` residue straight into `wp_posts.post_content` and `post_excerpt`.
- Elementor + KC page builders: wrap field values in HTML, then re-encode them a second time on save, producing `&amp;` and `&#038;`.
- qTranslate-X leftovers from a translation plugin uninstalled years ago: still emits `<!--:en-->…<!--:fr-->` markers inside a handful of Elementor widgets.
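Each of the three surfaces above leaves a recognisable fingerprint in the stored text. A minimal sketch of how such fingerprints could be detected, assuming hypothetical pattern names and regexes (they are illustrative and not the ones `reports/_deep_entity_audit.py` actually uses):

```python
import re

# Hypothetical patterns, one or more per legacy surface listed above.
# Illustrative only: the real scanner's patterns are not shown in this runbook.
PATTERNS = {
    # Smart quotes and en-dashes pasted from Word / ChatGPT.
    "smart_punctuation": re.compile("[\u2018\u2019\u201c\u201d\u2013]"),
    # Word's mso-* inline style residue.
    "mso_span": re.compile(r'<span\s+style="[^"]*mso-[^"]*"', re.IGNORECASE),
    # An entity that has itself been encoded again, e.g. &amp;amp; or &amp;#038;.
    "double_encoded": re.compile(r"&amp;(?:[a-z]+|#\d+|#x[0-9a-f]+);", re.IGNORECASE),
    # Numeric ampersand entity, e.g. &#038;.
    "numeric_amp": re.compile(r"&#0*38;"),
    # qTranslate-X language markers such as <!--:en--> and the closing <!--:-->.
    "qtranslate_marker": re.compile(r"<!--:(?:[a-z]{2})?-->"),
}


def match_patterns(text: str) -> list[str]:
    """Return the names of all patterns found in text, in declaration order."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]
```

A single-encoded `Rare &amp; Special` deliberately matches none of these; whether that counts as an offender depends on the field, which is a decision the scanner has to make separately.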
Downstream, any field that WooCommerce's Store API ships to the Next.js
frontend (product name, category name, short description) is serialised by
JSON.stringify when it becomes JSON-LD. JSON.stringify escapes JSON syntax,
not HTML entities, so the encoded text passes through verbatim and search
engines end up indexing the literal string `Rare &amp; Special` as the product name.
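To see how many encoding layers a stored value carries, one option is to unescape it repeatedly until it stops changing. A minimal sketch using only the standard library; `strip_entity_layers` and its `max_passes` cap are hypothetical names chosen for illustration:

```python
import html


def strip_entity_layers(text: str, max_passes: int = 5) -> tuple[str, int]:
    """Unescape repeatedly; return the stable text and how many passes changed it."""
    passes = 0
    while passes < max_passes:
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
        passes += 1
    return text, passes


print(strip_entity_layers("Rare &amp;amp; Special"))  # two layers
print(strip_entity_layers("Rare &#038; Special"))     # one layer
```

Decoding until stable is a blunt tool: a title that legitimately contains an escaped entity (for example documentation about `&lt;`) would be over-decoded, so a pass count above one is better treated as a signal for review than as permission to rewrite.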
Audit
# From the repo root.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
    "wp db export /home/fruitplu/backups/pre-deep-clean-$(date +%Y%m%d-%H%M%S).sql --add-drop-table --quiet"
PYTHONIOENCODING=utf-8 python reports/_pull_backup.py
PYTHONIOENCODING=utf-8 python reports/_deep_entity_audit.py
Output: reports/wp-deep-encoding-2026-04-25.csv — one row per offender,
columns table, primary_key_col, primary_key_val, field, original,
proposed, pattern_matched, serialised_field, risk.
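Before running any cleanup it can help to see how the offenders split across risk tiers. A minimal sketch, assuming the CSV has a header row with the column names listed above; the sample rows and the `summarise_by_risk` helper are invented for illustration:

```python
import csv
import io
from collections import Counter

# Column names come from the runbook; the three data rows are made up.
SAMPLE = """table,primary_key_col,primary_key_val,field,original,proposed,pattern_matched,serialised_field,risk
wp_terms,term_id,12,name,Rare &amp; Special,Rare & Special,amp_entity,0,low
wp_posts,ID,345,post_content,<p>A &amp;amp; B</p>,<p>A &amp; B</p>,double_amp,0,med
wp_postmeta,meta_id,678,meta_value,"s:9:""A &amp; B"";","s:5:""A & B"";",amp_entity,1,high
"""


def summarise_by_risk(csv_file) -> Counter:
    """Count audit rows per risk tier from an open CSV file object."""
    return Counter(row["risk"] for row in csv.DictReader(csv_file))


counts = summarise_by_risk(io.StringIO(SAMPLE))
for risk, n in sorted(counts.items()):
    print(f"{risk}: {n}")
```

Against the real report, the same function would be called with `open("reports/wp-deep-encoding-2026-04-25.csv", newline="", encoding="utf-8")` instead of the in-memory sample.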
Risk tiers (driven by the scanner, consumed by the cleanup script):
| risk | meaning | handling |
|---|---|---|
| `cosmetic` | whitespace-only diffs (double space, trailing ws, CRLF) | skipped by default |
| `low` | plain-text fields (term name, breadcrumb_title, post_title, excerpt) | auto-applied |
| `med` | HTML/shortcode-carrying fields (post_content, comment_content, desc) | preview + `--yes-med` |
| `high` | serialised PHP (wp_options, `*_meta`). SQL rewrite would break `s:N:"…"`. | flagged, never touched |
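
The handling column can be read as a small decision function. A minimal sketch, assuming parameters `apply` and `yes_med` that mirror the `--apply` and `--yes-med` flags; the `Action` enum and `decide()` are hypothetical and not taken from the cleanup script. Opting in to cosmetic rows is not modelled because the runbook does not name a flag for it.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"        # ignored entirely
    PREVIEW = "preview"  # reported, nothing written
    APPLY = "apply"      # written to the database
    FLAG = "flag"        # surfaced for manual handling, never written


KNOWN_RISKS = {"cosmetic", "low", "med", "high"}


def decide(risk: str, apply: bool = False, yes_med: bool = False) -> Action:
    """Map a risk tier plus CLI-style flags to the handling described in the table."""
    if risk not in KNOWN_RISKS:
        raise ValueError(f"unknown risk tier: {risk!r}")
    if risk == "high":
        return Action.FLAG
    if risk == "cosmetic":
        return Action.SKIP
    if not apply:
        return Action.PREVIEW  # dry run: report only
    if risk == "low":
        return Action.APPLY
    # Remaining tier is "med": persisted only with explicit confirmation.
    return Action.APPLY if yes_med else Action.PREVIEW
```

Checking `high` first means no combination of flags can ever route a serialised field to a write.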
Note: wp_posts.guid is intentionally excluded. It is an immutable
permalink-identifier per WP core and contains entity-encoded ampersands (for example `&#038;`) by design.
Cleanup
# Scripts live under infra/wp/scripts/. Copy to /tmp on the box and run there.
scp -P 7822 infra/wp/scripts/clean-encoding-deep.sh fruitplu@fruitplug.co.uk:/tmp/
scp -P 7822 infra/wp/scripts/verify-encoding-clean.sh fruitplu@fruitplug.co.uk:/tmp/
# And the CSV the clean script consumes:
scp -P 7822 reports/wp-deep-encoding-2026-04-25.csv \
    fruitplu@fruitplug.co.uk:/home/fruitplu/reports/
# Dry-run — reports counts, writes nothing.
A2_PASSPHRASE='…' python infra/a2/a2.py run "bash /tmp/clean-encoding-deep.sh"
# Apply low + med (with preview). Requires --yes-med to actually persist med.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
    "bash /tmp/clean-encoding-deep.sh --apply --yes-med"
# Verify.
A2_PASSPHRASE='…' python infra/a2/a2.py run "bash /tmp/verify-encoding-clean.sh"
The cleanup script:
- Refuses to run if the `fp_clean_encoding` WP transient is already set (prevents double-runs).
- Takes a fresh pre-clean DB dump to `~/backups/pre-encoding-clean-<ts>.sql` before the first write.
- Writes a per-row log to `~/reports/wp-deep-encoding-writelog-<ts>.md` (original vs proposed, truncated to 120 chars).
- Skips `high`-risk rows entirely; they need the PHP unserialize / re-serialize path, not a raw SQL rewrite.
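
The reason serialised rows cannot take a raw string replace is the length prefix: PHP stores each string as `s:N:"…";` where N is the payload's byte length, so shortening the payload without updating N makes the value unreadable to `unserialize()`. A minimal sketch of the failure, assuming simplified helpers `php_string` and `lengths_consistent` that are hypothetical and only handle flat string tokens whose payload does not itself contain `";`:

```python
import re

# One serialised PHP string token: s:<byte length>:"<payload>";
_PHP_STR = re.compile(r's:(\d+):"(.*?)";', re.DOTALL)


def php_string(value: str) -> str:
    """Serialise a str as a PHP string token; the length is counted in UTF-8 bytes."""
    return f's:{len(value.encode("utf-8"))}:"{value}";'


def lengths_consistent(serialised: str) -> bool:
    """True if every s:N:"..." token declares its payload's real byte length."""
    return all(
        int(declared) == len(payload.encode("utf-8"))
        for declared, payload in _PHP_STR.findall(serialised)
    )


before = php_string("Rare &amp; Special")  # declares 18 bytes
naive = before.replace("&amp;", "&")       # payload is now 14 bytes, prefix still says 18
fixed = php_string("Rare & Special")       # prefix recomputed

print(before, lengths_consistent(before))
print(naive, lengths_consistent(naive))
print(fixed, lengths_consistent(fixed))
```

In practice the safe route is to unserialize, fix the decoded values, and serialize again in PHP. WP-CLI's `wp search-replace` is one existing tool that is aware of serialised data, though whether it suits these rows is a separate decision.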
Frontend defence-in-depth
apps/web/lib/seo/structured-data.ts now runs every text field through
decodeSeoEntities() before handing it to JSON.stringify. This covers the
case where a freshly authored product slips past the DB cleanup: the
JSON-LD payload will still be clean, and the visible HTML is already safe
because React escapes text-node output.
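
The idea of decoding every text field before serialisation can be sketched as a recursive walk over the structured-data object. This is a Python analogue for illustration only: the real `decodeSeoEntities()` is TypeScript and its implementation is not shown in this runbook, and `decode_text_fields` and the sample product are hypothetical.

```python
import html
import json


def decode_text_fields(node):
    """Recursively unescape HTML entities in every string of a JSON-like tree."""
    if isinstance(node, str):
        return html.unescape(node)
    if isinstance(node, list):
        return [decode_text_fields(item) for item in node]
    if isinstance(node, dict):
        return {key: decode_text_fields(value) for key, value in node.items()}
    return node  # numbers, booleans and None pass through unchanged


# Invented sample data standing in for a Store API product.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Rare &amp; Special",
    "category": ["Fruit &amp; Veg"],
    "offers": {"@type": "Offer", "price": 4.5},
}

payload = json.dumps(decode_text_fields(product), ensure_ascii=False)
print(payload)
```

If decoded text can contain `<`, the step that embeds the payload in a script tag needs to escape it (for example as `\u003c`) so that a stray `</script>` cannot end the tag early.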
Rollback
# Restore the pre-clean dump. Each run takes one automatically.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
    "wp db import /home/fruitplu/backups/pre-encoding-clean-<ts>.sql"
If the site was written to between the dump and the rollback (orders, comments, subscription renewals), those writes are lost — coordinate the rollback window with ops.
Related
- Changelog entry 2026-04-25
- Scanner: `reports/_deep_entity_audit.py`
- Clean script: `infra/wp/scripts/clean-encoding-deep.sh`
- Verify script: `infra/wp/scripts/verify-encoding-clean.sh`