Commit f07605f

ramonjd and claude committed
CSS API: Add WP_CSS_Token_Processor — streaming CSS tokenizer with sanitize() and validate()
Introduces `WP_CSS_Token_Processor`, a new class in `src/wp-includes/css-api/` modelled after `WP_HTML_Tag_Processor`. It tokenizes a CSS string into a typed token stream and exposes two high-level consumers:

- `sanitize(): string`: strips unsafe tokens/rules (injection guard, CDO/CDC, bad tokens, disallowed URL schemes, non-allowlisted at-rules) and returns a safe CSS string. Idempotent: sanitize(sanitize($css)) === sanitize($css).
- `validate(): true|WP_Error`: returns true if the CSS is safe, or a WP_Error with a specific error code (css_injection, css_html_comment, css_malformed_token, css_unsafe_url, css_disallowed_at_rule) on the first violation found.

The primary motivation is fixing the compounding corruption bug (PR WordPress#11104) where wp_kses(), an HTML sanitizer, was applied to CSS, mangling & and > characters used in CSS nesting selectors on each save for users without unfiltered_html.

Security policy:

- </style anywhere → sanitize() returns ''; validate() returns css_injection error
- url() with javascript:, data:, or non-wp_allowed_protocols() scheme → stripped
- @import, @charset, @namespace, unknown at-rules → stripped (safety-first)
- bad-url-token, bad-string-token → stripped
- CDO/CDC (<!-- / -->) → stripped
- Null bytes → stripped in constructor

Allowed at-rules: @media, @supports, @keyframes, @-webkit-keyframes, @layer, @container, @font-face.

Also adds low-level navigation (next_token, get_token_type, get_token_value, get_block_depth) and non-destructive modification (remove_token, set_token_value, get_updated_css) APIs, plus get_removed_tokens() for sanitize() introspection.

Integration with filter_block_kses_value() in blocks.php is a follow-on PR.

Includes:

- src/wp-includes/css-api/class-wp-css-token-processor.php (~1,250 lines)
- src/wp-includes/css-api/README.md
- tests/phpunit/tests/css-api/WpCssTokenProcessorTest.php (67 tests)
- tests/phpunit/tests/css-api/WpCssTokenSanitizeTest.php (40 tests)
- tests/phpunit/tests/css-api/WpCssTokenValidateTest.php (14 tests + data provider)
- docs/plans/2026-03-06-wp-css-token-processor-design.md
- docs/plans/2026-03-06-wp-css-token-processor.md

Fixes #64771

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 56a6768 commit f07605f

8 files changed

+5273
-0
lines changed

Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
# Design: WP_CSS_Token_Processor

**Date:** 2026-03-06
**Status:** Approved
**Related:** https://github.com/WordPress/wordpress-develop/pull/11104, https://core.trac.wordpress.org/ticket/64771

---

## Background

When a user without `unfiltered_html` (e.g. Author role, or site admins on some multisite configurations) saves a post containing block-level custom CSS (`attrs.style.css`) with `&` or `>` characters, the `filter_block_content()` pipeline corrupts the CSS through a three-step mangling chain:

1. `parse_blocks()` / `json_decode()`: `\u0026` becomes `&`
2. `filter_block_kses_value()` / `wp_kses()`: `&` becomes `&amp;`, `>` becomes `&gt;` (KSES treats CSS as HTML)
3. `serialize_block_attributes()` / `json_encode()`: `&amp;` becomes `\u0026amp;`

Each subsequent save compounds the corruption. The root cause is that `wp_kses()` is an HTML sanitizer being applied to CSS, which is the wrong tool for the job. `WP_CSS_Token_Processor` is intended as the CSS-aware replacement.
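
The three-step chain can be reproduced in isolation with a rough simulation. This is only a sketch under stated assumptions: `htmlspecialchars()` stands in for the entity-encoding effect attributed to `wp_kses()`, the `JSON_HEX_*` flags stand in for the escaping done by `serialize_block_attributes()`, and `simulate_save()` is a hypothetical name. The real WordPress code paths differ in detail.

```php
<?php
// Illustrative stand-ins only; none of this is the real WordPress pipeline.
function simulate_save( string $stored_json ): string {
	// Step 1: decoding turns "\u0026" back into "&".
	$css = json_decode( $stored_json );

	// Step 2: an HTML-oriented escaper turns "&" into "&amp;" and ">" into "&gt;".
	$escaped = htmlspecialchars( $css, ENT_NOQUOTES );

	// Step 3: re-encoding turns "&amp;" into "\u0026amp;".
	return json_encode( $escaped, JSON_HEX_AMP | JSON_HEX_TAG );
}

$stored    = json_encode( '& > p { margin: 0; }', JSON_HEX_AMP | JSON_HEX_TAG );
$after_one = simulate_save( $stored );
$after_two = simulate_save( $after_one );

echo json_decode( $after_one ), "\n"; // First save: entities appear.
echo json_decode( $after_two ), "\n"; // Second save: entities are escaped again.
```

In this simulation every save passes already-escaped text through the escaper once more, which is the compounding behaviour described above.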

---

## Scope

### In scope (this session)

- `WP_CSS_Token_Processor` class: streaming CSS tokenizer
- `sanitize()` instance method: strips unsafe tokens/rules, returns safe CSS string
- `validate()` instance method: returns `true|WP_Error`
- `get_updated_css()` instance method: reconstructs CSS after manual token modifications
- `get_removed_tokens()` instance method: inspection after `sanitize()`
- Low-level navigation and modification methods
- Full inline PHPDoc
- `README.md` in `src/wp-includes/css-api/`
- Full test suite

### Out of scope (follow-on sessions)

- Integration with `filter_block_kses_value()` in `blocks.php`
- `WP_CSS_Processor`: rule/declaration-aware layer (v2)
- Replacing `process_blocks_custom_css()` in `WP_Theme_JSON`
- CSS selector query engine (TODO in `class-wp-block.php:385`)
- Customizer CSS and Global Styles CSS pipeline adoption

---

## Architecture

### Directory structure

```
src/wp-includes/
└── css-api/
    ├── class-wp-css-token-processor.php
    └── README.md

tests/phpunit/tests/
└── css-api/
    ├── WpCssTokenProcessorTest.php
    ├── WpCssTokenSanitizeTest.php
    └── WpCssTokenValidateTest.php
```

### Component map

```
WP_CSS_Token_Processor             # tokenizes a CSS string into a typed token stream
  |
  |  sanitize(): string            # strips unsafe tokens/rules, returns safe CSS
  |  validate(): true|WP_Error     # returns true, or WP_Error with reason code
  |  get_updated_css(): string     # reconstruct after manual token modifications
```

The integration point (`filter_block_kses_value()` dispatching to `sanitize()` for `['style','css']` paths) is a follow-on PR and is not part of this session.

---

## `WP_CSS_Token_Processor`

### Design principles

- **Spec-inspired, safety-first**: follows the CSS Syntax Level 3 token vocabulary and structure, but prioritises correctness on security-relevant tokens over completeness. Gaps cause rejection/stripping rather than silent pass-through.
- **Forward-only streaming**: like `WP_HTML_Tag_Processor`, the processor advances a cursor through the input. No backtracking except via bookmarks (v2).
- **Non-destructive modification**: operates on the original string buffer and applies edits on output via `get_updated_css()`.
- **Instance-based API**: consistent with `WP_HTML_Tag_Processor`. Create an instance, call methods, retrieve output.
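
The non-destructive modification principle can be pictured as a list of deferred edits against byte offsets in the original string, applied only when output is requested. The class below is a minimal sketch of that idea, not the proposed implementation: `Deferred_Edit_Buffer` and its methods are hypothetical names, and it assumes edits never overlap.

```php
<?php
// Hypothetical helper, not part of the proposed class. Edits are recorded
// against offsets in the untouched source and applied only on output.
final class Deferred_Edit_Buffer {
	private string $source;
	private array $edits = array();

	public function __construct( string $source ) {
		$this->source = $source;
	}

	// Queue a replacement of $length bytes starting at $start.
	public function replace( int $start, int $length, string $text ): void {
		$this->edits[] = array( 'start' => $start, 'length' => $length, 'text' => $text );
	}

	// Build the output by copying untouched spans and splicing in edits.
	public function get_updated(): string {
		$edits = $this->edits;
		usort( $edits, static fn( $a, $b ) => $a['start'] <=> $b['start'] );

		$output = '';
		$cursor = 0;
		foreach ( $edits as $edit ) {
			$output .= substr( $this->source, $cursor, $edit['start'] - $cursor );
			$output .= $edit['text'];
			$cursor  = $edit['start'] + $edit['length'];
		}
		return $output . substr( $this->source, $cursor );
	}
}

$css    = 'a { color: red; } <!-- b { color: blue; }';
$buffer = new Deferred_Edit_Buffer( $css );
$buffer->replace( strpos( $css, '<!--' ), 4, '' );     // Remove a token.
$buffer->replace( strpos( $css, 'red' ), 3, 'green' ); // Replace a token's value.
echo $buffer->get_updated(), "\n";
```

Because the source string is never mutated, offsets recorded earlier stay valid no matter how many edits are queued afterwards.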

### Token types

#### Security-critical (must be correct)

| Constant | Examples | Notes |
|---|---|---|
| `WP_CSS_Token_Processor::URL_TOKEN` | `url(foo.png)` | Protocol-filtered against `wp_allowed_protocols()` |
| `WP_CSS_Token_Processor::BAD_URL_TOKEN` | `url(foo bar)` | Malformed URL, stripped |
| `WP_CSS_Token_Processor::STRING_TOKEN` | `"hello"`, `'world'` | Quoted strings |
| `WP_CSS_Token_Processor::BAD_STRING_TOKEN` | Unterminated string | Stripped |
| `WP_CSS_Token_Processor::AT_KEYWORD_TOKEN` | `@media`, `@import` | At-rule allowlist enforced in `sanitize()` |
| `WP_CSS_Token_Processor::OPEN_CURLY_TOKEN` | `{` | Block depth tracking |
| `WP_CSS_Token_Processor::CLOSE_CURLY_TOKEN` | `}` | Block depth tracking |

#### Structurally important

| Constant | Examples |
|---|---|
| `WP_CSS_Token_Processor::IDENT_TOKEN` | `color`, `red`, `sans-serif` |
| `WP_CSS_Token_Processor::FUNCTION_TOKEN` | `calc(`, `var(`, `rgb(` |
| `WP_CSS_Token_Processor::DELIM_TOKEN` | `&`, `>`, `+`, `~`, `*` |
| `WP_CSS_Token_Processor::DIMENSION_TOKEN` | `16px`, `1.5rem`, `100vh` |
| `WP_CSS_Token_Processor::PERCENTAGE_TOKEN` | `50%` |
| `WP_CSS_Token_Processor::NUMBER_TOKEN` | `42`, `1.5` |
| `WP_CSS_Token_Processor::HASH_TOKEN` | `#ff0000`, `#my-id` |
| `WP_CSS_Token_Processor::WHITESPACE_TOKEN` | Preserved in output |
| `WP_CSS_Token_Processor::SEMICOLON_TOKEN` | `;` |
| `WP_CSS_Token_Processor::COLON_TOKEN` | `:` |
| `WP_CSS_Token_Processor::COMMA_TOKEN` | `,` |

#### Stripped unconditionally

| Token or input | Reason |
|---|---|
| `WP_CSS_Token_Processor::CDO_TOKEN` | `<!--`: HTML comments have no place in CSS |
| `WP_CSS_Token_Processor::CDC_TOKEN` | `-->`: HTML comments have no place in CSS |
| Null bytes | Stripped in preprocessing, before tokenization |
| `</style` sequence | Injection guard: `sanitize()` returns `''`, `validate()` returns `WP_Error` |

#### Out of scope for v1 (documented gaps, treated as unknown and stripped)

- Unicode range tokens (`U+`)
- Surrogate pair edge cases beyond basic UTF-8

### API surface

#### Construction

```php
$processor = new WP_CSS_Token_Processor( string $css );
```

#### Low-level navigation

```php
$processor->next_token(): bool          // Advance cursor. Returns false at EOF.
$processor->get_token_type(): string    // Token type constant for current token.
$processor->get_token_value(): string   // Raw value of current token.
$processor->get_block_depth(): int      // Current { } nesting depth.
```

#### Low-level modification

```php
$processor->set_token_value( string $value ): bool  // Replace current token's value.
$processor->remove_token(): bool                    // Remove current token from output.
```

#### High-level consumers (primary public API)

```php
$processor->sanitize(): string           // Strip unsafe tokens/rules. Returns safe CSS string.
$processor->validate(): true|WP_Error    // true if safe, WP_Error with code if not.
$processor->get_updated_css(): string    // Reconstruct CSS after manual token modifications.
$processor->get_removed_tokens(): array  // Log of what was stripped and why, after sanitize().
```
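
To show how the forward-only navigation methods might be used together, here is a toy stand-in that can run on its own. It is not the proposed `WP_CSS_Token_Processor`: `Toy_CSS_Cursor` recognises only a handful of token kinds, its type strings (`'ident'`, `'delim'`, and so on) are assumptions made for this sketch, and it reports the block depth as it stands after the current token has been consumed.

```php
<?php
// Toy stand-in, NOT the proposed class: just enough tokenizing to drive a loop.
final class Toy_CSS_Cursor {
	private string $css;
	private int $offset   = 0;
	private int $depth    = 0;
	private string $type  = '';
	private string $value = '';

	public function __construct( string $css ) {
		$this->css = $css;
	}

	// Advance the cursor by one token. Returns false at end of input.
	public function next_token(): bool {
		if ( $this->offset >= strlen( $this->css ) ) {
			return false;
		}
		$rest = substr( $this->css, $this->offset );
		if ( preg_match( '/^\s+/', $rest, $m ) ) {
			$this->type = 'whitespace';
		} elseif ( preg_match( '/^-?[A-Za-z_][A-Za-z0-9_-]*/', $rest, $m ) ) {
			$this->type = 'ident';
		} else {
			// Single-character tokens; anything unrecognised is a delim.
			$punct      = array( '{' => 'open-curly', '}' => 'close-curly', ';' => 'semicolon', ':' => 'colon', ',' => 'comma' );
			$m          = array( $rest[0] );
			$this->type = $punct[ $rest[0] ] ?? 'delim';
		}
		$this->value   = $m[0];
		$this->offset += strlen( $m[0] );

		if ( 'open-curly' === $this->type ) {
			++$this->depth;
		} elseif ( 'close-curly' === $this->type && $this->depth > 0 ) {
			--$this->depth;
		}
		return true;
	}

	public function get_token_type(): string {
		return $this->type;
	}

	public function get_token_value(): string {
		return $this->value;
	}

	public function get_block_depth(): int {
		return $this->depth;
	}
}

$cursor = new Toy_CSS_Cursor( '& > p { color: red; }' );
$seen   = array();
while ( $cursor->next_token() ) {
	if ( 'whitespace' === $cursor->get_token_type() ) {
		continue;
	}
	$seen[] = $cursor->get_token_type() . ':' . $cursor->get_token_value() . '@' . $cursor->get_block_depth();
}
echo implode( ' ', $seen ), "\n";
```

Note that `&` and `>` come through as ordinary delimiter tokens with their raw values intact, which is the property the nesting-selector fix relies on.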

---

## Security Policy

### `sanitize()`: token-level rules

Applied during tokenization, before structural analysis:

| Condition | Action |
|---|---|
| `</style` anywhere in input | Return `''` immediately, do not continue |
| Null bytes | Strip in preprocessing |
| `bad-url-token`, `bad-string-token` | Strip token |
| `CDO-token`, `CDC-token` | Strip token |
| `url-token` with `javascript:` or `data:` | Strip token entirely |
| `url-token` with other disallowed protocol | Replace URL value with `''`, preserve `url()` wrapper |
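
The two `url-token` rows in the table above can be sketched as a small classifier. This is an illustration under assumptions, not the actual implementation: `classify_css_url()` and its return strings are hypothetical, `$allowed_schemes` stands in for the result of `wp_allowed_protocols()`, scheme-less (relative) URLs are assumed to be kept, and CSS escape sequences inside the URL are not decoded here.

```php
<?php
// Sketch of the url-token policy. All names here are hypothetical.
function classify_css_url( string $url, array $allowed_schemes ): string {
	// Drop whitespace and control characters that could hide a scheme,
	// for example a tab inside "java<TAB>script:".
	$normalized = strtolower( preg_replace( '/[\x00-\x20]+/', '', $url ) );

	if ( ! preg_match( '/^([a-z][a-z0-9+.-]*):/', $normalized, $m ) ) {
		return 'keep'; // No scheme: treated as a relative URL (assumption).
	}
	$scheme = $m[1];

	if ( 'javascript' === $scheme || 'data' === $scheme ) {
		return 'strip-token'; // Remove the whole url() token.
	}
	if ( ! in_array( $scheme, $allowed_schemes, true ) ) {
		return 'empty-url'; // Keep the url() wrapper, blank the value.
	}
	return 'keep';
}

$allowed = array( 'http', 'https' ); // Stand-in for wp_allowed_protocols().

foreach ( array( 'https://example.com/a.png', 'images/a.png', 'JavaScript:alert(1)', 'ftp://example.com/a.png' ) as $url ) {
	echo $url, ' => ', classify_css_url( $url, $allowed ), "\n";
}
```

Normalising before matching is a deliberate choice in this sketch: a scheme check that only looks at the literal prefix is easy to bypass with mixed case or embedded whitespace.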

### `sanitize()`: rule-level rules

Applied during structural traversal, after tokenization:

**At-rule allowlist:**

```
Allowed: @media, @supports, @keyframes, @layer, @container, @font-face
Blocked: @import, @charset, @namespace
Unknown: stripped (safety-first: gaps reject, not pass-through)
```

Strip granularity: a failing declaration is dropped on its own, a failing rule is dropped as a whole, and the rest of the CSS is preserved.
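
A minimal sketch of the allowlist lookup follows. The list mirrors the design above, while the constant and function names are hypothetical. Lower-casing reflects the fact that at-rule names are ASCII case-insensitive in CSS, so `@MEDIA` and `@media` should be treated alike.

```php
<?php
// Hypothetical names; the allowed set is copied from the design's allowlist.
const ALLOWED_AT_RULES = array( 'media', 'supports', 'keyframes', 'layer', 'container', 'font-face' );

function is_allowed_at_rule( string $at_keyword ): bool {
	// Accept "@media" or "media"; compare case-insensitively.
	$name = strtolower( ltrim( $at_keyword, '@' ) );
	return in_array( $name, ALLOWED_AT_RULES, true );
}

foreach ( array( '@media', '@MEDIA', '@import', '@property' ) as $keyword ) {
	echo $keyword, ' => ', is_allowed_at_rule( $keyword ) ? 'allowed' : 'stripped', "\n";
}
```

Because the check is an allowlist rather than a blocklist, explicitly blocked rules such as `@import` and rules the list has never heard of take the same path.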

### `validate()` rules

Returns `WP_Error` if any of the following are present:

| Condition | Error code |
|---|---|
| `</style` sequence | `css_injection` |
| `bad-url-token` or `bad-string-token` | `css_malformed_token` |
| Disallowed `url()` protocol | `css_unsafe_url` |
| Blocked or unknown at-rule | `css_disallowed_at_rule` |
| Null bytes | `css_null_byte` |
| `CDO-token` / `CDC-token` | `css_html_comment` |

If `validate()` returns `true`, `sanitize()` is guaranteed to be a no-op on the same input.
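
Two of the conditions in the table can be detected on the raw string, before any tokenization. The sketch below covers only those two and makes several assumptions: `precheck_css()` is a hypothetical name, it returns a plain error-code string (or `null`) instead of `WP_Error` so that it runs without WordPress, the `</style` match is case-insensitive, and the order of the checks is arbitrary.

```php
<?php
// Hypothetical pre-tokenization checks. Returns an error code, or null when
// neither condition is present. Token-level conditions are not covered here.
function precheck_css( string $css ): ?string {
	// Assumption: match case-insensitively, as HTML parsers do for end tags.
	if ( false !== stripos( $css, '</style' ) ) {
		return 'css_injection';
	}
	if ( false !== strpos( $css, "\0" ) ) {
		return 'css_null_byte';
	}
	return null;
}

var_dump( precheck_css( 'a { color: red; }' ) );
var_dump( precheck_css( 'a { color: red; } </STYLE><script>' ) );
```

The remaining conditions (malformed tokens, URL schemes, at-rules, CDO/CDC) depend on token boundaries, so a substring search would misfire on text inside strings or comments.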

### What the security policy explicitly does NOT do

- Does not validate property names or values: authoring intent, not a security concern
- Does not restrict CSS nesting depth
- Does not filter `var()` or custom properties: cannot execute code
- Does not block `expression()`: IE-era only, not worth the complexity

### Idempotency guarantee

`sanitize()` must be idempotent:

```
sanitize( sanitize( $css ) ) === sanitize( $css )
```

This is a hard requirement enforced by the test suite. It directly addresses the compounding corruption bug in PR #11104.
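
The idempotency requirement lends itself to a generic property check over a fixture set. The helper below is a sketch with hypothetical names and works with any `string -> string` callable. The two toy sanitizers exist only for demonstration: one is idempotent, the other (an HTML entity encoder) is not, which mirrors the kind of failure the requirement is meant to rule out.

```php
<?php
// Returns the fixtures for which applying $sanitize twice differs from once.
function find_idempotency_failures( callable $sanitize, array $fixtures ): array {
	$failures = array();
	foreach ( $fixtures as $css ) {
		$once  = $sanitize( $css );
		$twice = $sanitize( $once );
		if ( $once !== $twice ) {
			$failures[] = $css;
		}
	}
	return $failures;
}

// Toy sanitizers for demonstration only.
$strip_null_bytes = static fn( string $css ): string => str_replace( "\0", '', $css );
$entity_encode    = static fn( string $css ): string => htmlspecialchars( $css, ENT_NOQUOTES );

$fixtures = array(
	'color: blue; & p { color: red; }',
	'& > p { margin: 0; }',
	'a { color: red; }',
);

echo count( find_idempotency_failures( $strip_null_bytes, $fixtures ) ), "\n";
echo count( find_idempotency_failures( $entity_encode, $fixtures ) ), "\n";
```

In a PHPUnit suite the same loop would typically be driven by a data provider, with one assertion per fixture so that a failure names the offending input.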

---

## Documentation

### Inline PHPDoc

- Every public method: `@since`, `@param`, `@return`, usage example
- Class docblock: purpose, what it is not, spec reference, usage examples, known gaps
- Security decisions commented with *why*, not just *what*

### README.md

Located at `src/wp-includes/css-api/README.md`. Covers:

- Purpose and scope
- Quick usage examples for `sanitize()` and `validate()`
- Token type reference
- Security policy summary
- Known gaps and future work

---

## Testing

### Test files

```
tests/phpunit/tests/css-api/
├── WpCssTokenProcessorTest.php    # tokenizer unit tests
├── WpCssTokenSanitizeTest.php     # sanitize() tests
└── WpCssTokenValidateTest.php     # validate() tests
```

### Test categories

#### Tokenizer unit tests (`WpCssTokenProcessorTest.php`)

- Each token type in isolation: correct `get_token_type()` and `get_token_value()`
- Token sequences: declaration, qualified rule, nested rule
- Block depth tracking via `get_block_depth()`
- Edge cases: empty input, whitespace-only, single character
- Manual modification: `set_token_value()`, `remove_token()`, `get_updated_css()`

#### Sanitize tests (`WpCssTokenSanitizeTest.php`)

- CSS nesting selectors (`&`, `& > p`, `& + span`) survive unchanged
- Child combinator (`>`) survives unchanged
- Valid at-rules (`@media`, `@supports`, `@keyframes`) survive unchanged
- Blocked at-rule (`@import`) is stripped entirely
- Unknown at-rule is stripped
- `url()` with allowed protocol survives
- `url()` with `javascript:` is stripped entirely
- `url()` with `data:` is stripped entirely
- `bad-url-token` is stripped
- `bad-string-token` is stripped
- `</style` input returns `''`
- Null bytes are stripped
- `CDO` / `CDC` tokens are stripped
- `get_removed_tokens()` is populated after stripping
- `get_removed_tokens()` is empty when nothing is stripped
- **Idempotency**: `sanitize(sanitize($css)) === sanitize($css)` over a broad fixture set
- **Regression fixtures from PR #11104**:
  - `color: blue; & p { color: red; }` survives unchanged
  - `& > p { margin: 0; }` survives unchanged
  - Repeated saves do not compound corruption

#### Validate tests (`WpCssTokenValidateTest.php`)

- Valid CSS returns `true`
- Each blocked condition returns `WP_Error` with the correct error code
- `validate()` passing guarantees `sanitize()` is a no-op (tested over fixture set)

---

## Open questions (deferred)

- Should `get_removed_tokens()` be structured (array of `['token' => ..., 'reason' => ...]`) or flat? TBD during implementation.
- Should the at-rule allowlist be filterable via a WordPress filter hook (like `safe_style_css`)? Likely yes, deferred to implementation.
- Exact `@since` version tag: placeholder `X.X.0` during development.
