HTML-entity / encoding cleanup

Runbook for the recurring class of bug that shows up as Rare &amp; Special in a product title, a category badge, a breadcrumb, or a JSON-LD payload.

The problem

Three legacy surfaces keep feeding double-encoded entities into the WordPress database:

  1. Word / ChatGPT paste into the classic editor: brings smart quotes (“ ” ‘ ’), en-dashes (–), and <span style="mso-…"> residue straight into wp_posts.post_content and post_excerpt.
  2. Elementor + KC page builders: wrap field values in HTML, then re-encode them a second time on save, producing &amp;amp; and &amp;#038;.
  3. qTranslate-X leftovers from a translation plugin uninstalled years ago: <!--:en-->…<!--:fr--> markers still appear inside a handful of Elementor widgets.
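
Each of the three sources leaves a recognisable fingerprint. A minimal sketch of how they could be detected, assuming hypothetical pattern names and regexes; the real rules live in reports/_deep_entity_audit.py and may differ:

```python
import re

# Hypothetical pattern names and regexes, for illustration only.
PATTERNS = {
    # Curly quotes and en-dash pasted from Word or chat tools.
    "smart_punctuation": re.compile("[\u2018\u2019\u201c\u201d\u2013]"),
    # Word's mso-* inline styles left inside a <span>.
    "mso_residue": re.compile(r'<span[^>]*\bstyle="[^"]*mso-', re.IGNORECASE),
    # An entity whose leading ampersand was itself encoded again.
    "double_encoded": re.compile(r"&amp;(?:amp|#0?38|quot|lt|gt);"),
    # qTranslate-X language markers such as <!--:en-->.
    "qtranslate_marker": re.compile(r"<!--:[a-z]{2}-->"),
}


def match_patterns(value: str) -> list[str]:
    """Return the name of every pattern found in value, in dict order."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]
```

A singly encoded value such as Rare &amp; Special matches none of these patterns; only the stacked forms (&amp;amp;, &amp;#038;) are flagged as double-encoded.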

Downstream, any field that WooCommerce's Store API ships to the Next.js frontend (product name, category name, short description) reaches the JSON-LD payload still encoded: JSON.stringify escapes only JSON syntax and does not decode HTML entities. Search engines then index the literal string Rare &amp; Special as the product name.
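
To see how the layers stack up, here is a small Python illustration using only the standard library. It stands in for the PHP and JavaScript code paths, which this runbook does not show; the point is only that each extra HTML-escape pass adds one more amp; and that JSON serialisation does not decode entities.

```python
import html
import json

name = "Rare & Special"

# One escape pass is what a single, correct encode produces.
once = html.escape(name, quote=False)    # Rare &amp; Special
# A second pass on save is what produces the double-encoded form.
twice = html.escape(once, quote=False)   # Rare &amp;amp; Special

# JSON serialisation does not decode HTML entities, so whatever encoding
# is already in the string ends up in the payload.
payload = json.dumps({"@type": "Product", "name": once})
```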

Audit

# From the repo root.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
  "wp db export /home/fruitplu/backups/pre-deep-clean-$(date +%Y%m%d-%H%M%S).sql --add-drop-table --quiet"
PYTHONIOENCODING=utf-8 python reports/_pull_backup.py
PYTHONIOENCODING=utf-8 python reports/_deep_entity_audit.py

Output: reports/wp-deep-encoding-2026-04-25.csv — one row per offender, columns table, primary_key_col, primary_key_val, field, original, proposed, pattern_matched, serialised_field, risk.
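
Before running the cleanup it can help to see how the offenders split across risk tiers. A minimal sketch, assuming the CSV has a header row with the column names listed above; count_by_risk is a hypothetical helper (not part of the repo) and the sample rows are made up for illustration:

```python
import csv
import io
from collections import Counter


def count_by_risk(csv_file) -> Counter:
    """Count audit rows per value of the 'risk' column."""
    return Counter(row["risk"] for row in csv.DictReader(csv_file))


# Made-up sample rows in the documented column order.
SAMPLE = io.StringIO(
    "table,primary_key_col,primary_key_val,field,original,proposed,"
    "pattern_matched,serialised_field,risk\n"
    "wp_terms,term_id,12,name,Rare &amp;amp; Special,Rare &amp; Special,"
    "double_encoded,0,low\n"
    "wp_posts,ID,345,post_content,<p>A &amp;amp; B</p>,<p>A &amp; B</p>,"
    "double_encoded,0,med\n"
    "wp_postmeta,meta_id,99,meta_value,x,y,double_encoded,1,high\n"
)

# Against the real file (path from this runbook):
# with open("reports/wp-deep-encoding-2026-04-25.csv", newline="", encoding="utf-8") as fh:
#     print(count_by_risk(fh))
```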

Risk tiers (driven by the scanner, consumed by the cleanup script):

risk     | meaning                                                                            | handling
---------|------------------------------------------------------------------------------------|---------------------
cosmetic | whitespace-only diffs (double space, trailing ws, CRLF)                            | skipped by default
low      | plain-text fields (term name, breadcrumb_title, post_title, excerpt)               | auto-applied
med      | HTML/shortcode-carrying fields (post_content, comment_content, desc)               | preview + --yes-med
high     | serialised PHP (wp_options, *_meta); a SQL rewrite would break s:N:"…"             | flagged, never touched
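
The high tier exists because PHP's serialised strings carry a length prefix: in s:N:"…", N must equal the byte length of the payload. A minimal Python sketch of why a raw find-and-replace corrupts such values, and what the safe path has to recompute. The helper name is hypothetical, and it handles a single serialised string only, not nested arrays or objects:

```python
def php_serialize_str(value: str) -> str:
    """Serialise one string the way PHP does: s:<byte length>:"<value>";"""
    return f's:{len(value.encode("utf-8"))}:"{value}";'


# The double-encoded value as it would sit in the database.
stored = php_serialize_str("Rare &amp;amp; Special")

# Raw SQL-style rewrite: the payload shrinks by 4 bytes but the length
# prefix is left untouched, so PHP's unserialize() would reject the value.
naive = stored.replace("&amp;amp;", "&amp;")

# Safe path: fix the decoded payload, then serialise again so that the
# length prefix is recomputed.
fixed = php_serialize_str("Rare &amp; Special")
```

Here stored carries the prefix s:22, naive still claims 22 bytes for an 18-byte payload, and fixed correctly carries s:18.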

Note: wp_posts.guid is intentionally excluded — it is an immutable permalink-identifier per WP core and contains &#038; by design.
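
The tier rules and the guid exclusion can be summarised as a small decision function. This is a sketch under assumptions, not the scanner's actual logic: the column names in LOW_FIELDS, the serialised flag, and the fallback to med for unlisted fields are all guesses made for illustration.

```python
from typing import Optional

# Assumed column names for the plain-text fields; adjust to the real schema.
LOW_FIELDS = {"name", "breadcrumb_title", "post_title", "post_excerpt"}


def classify(table: str, field: str, original: str, proposed: str,
             serialised: bool) -> Optional[str]:
    """Return a risk tier, or None when the field is excluded from the audit."""
    if (table, field) == ("wp_posts", "guid"):
        return None            # immutable identifier, never rewritten
    if serialised:
        return "high"          # length-prefixed PHP data, no raw SQL rewrite
    if original.split() == proposed.split():
        return "cosmetic"      # the two values differ only in whitespace
    if field in LOW_FIELDS:
        return "low"           # plain text, safe to auto-apply
    return "med"               # assumption: everything else gets a preview
```

Checking the serialised flag before the whitespace test keeps even a cosmetic change out of serialised data, where any edit shifts the byte count.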

Cleanup

# Scripts live under infra/wp/scripts/. Copy to /tmp on the box and run there.
scp -P 7822 infra/wp/scripts/clean-encoding-deep.sh   fruitplu@fruitplug.co.uk:/tmp/
scp -P 7822 infra/wp/scripts/verify-encoding-clean.sh fruitplu@fruitplug.co.uk:/tmp/
# And the CSV the clean script consumes:
scp -P 7822 reports/wp-deep-encoding-2026-04-25.csv \
  fruitplu@fruitplug.co.uk:/home/fruitplu/reports/

# Dry-run — reports counts, writes nothing.
A2_PASSPHRASE='…' python infra/a2/a2.py run "bash /tmp/clean-encoding-deep.sh"

# Apply low + med (with preview). Requires --yes-med to actually persist med.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
  "bash /tmp/clean-encoding-deep.sh --apply --yes-med"

# Verify.
A2_PASSPHRASE='…' python infra/a2/a2.py run "bash /tmp/verify-encoding-clean.sh"

The cleanup script:

  • Refuses to run if the fp_clean_encoding WP transient is already set (prevents double-runs).
  • Takes a fresh pre-clean DB dump to ~/backups/pre-encoding-clean-<ts>.sql before the first write.
  • Writes a per-row log to ~/reports/wp-deep-encoding-writelog-<ts>.md (original vs proposed, truncated to 120 chars).
  • Skips high-risk rows entirely; they need the PHP unserialize / re-serialize path, not a raw SQL rewrite.

Frontend defence-in-depth

apps/web/lib/seo/structured-data.ts now runs every text field through decodeSeoEntities() before handing it to JSON.stringify. This covers the case where a freshly authored product slips past the DB cleanup: the JSON-LD payload will still be clean, and the visible HTML is already safe because React escapes text-node output.
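
The idea behind the frontend guard can be illustrated in a few lines. This is a Python sketch of the general technique (decode entities until the string stops changing, then serialise), not the TypeScript source of decodeSeoEntities(); the names decode_entities and product_json_ld and the pass limit are assumptions made for the example.

```python
import html
import json


def decode_entities(text: str, max_passes: int = 5) -> str:
    """Unescape HTML entities repeatedly until the text stops changing."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def product_json_ld(name: str) -> str:
    """Build a minimal Product payload with a fully decoded name."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": decode_entities(name),
    }
    return json.dumps(data, ensure_ascii=False)
```

Decoding to a fixed point handles any number of stacked encodes, at the cost of also decoding a name that was meant to contain a literal entity; the pass limit simply bounds the loop.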

Rollback

# Restore the pre-clean dump. Each run takes one automatically.
A2_PASSPHRASE='…' python infra/a2/a2.py run \
  "wp db import /home/fruitplu/backups/pre-encoding-clean-<ts>.sql"

If the site was written to between the dump and the rollback (orders, comments, subscription renewals), those writes are lost — coordinate the rollback window with ops.

  • Changelog entry 2026-04-25
  • Scanner: reports/_deep_entity_audit.py
  • Clean script: infra/wp/scripts/clean-encoding-deep.sh
  • Verify script: infra/wp/scripts/verify-encoding-clean.sh