Merged
26 changes: 25 additions & 1 deletion spec/09-lexical-structure.md
Original file line number Diff line number Diff line change
@@ -542,6 +542,7 @@ A single-quoted string literal is always a constant expression.
<i>dq-simple-escape-sequence</i>
<i>dq-octal-escape-sequence</i>
<i>dq-hexadecimal-escape-sequence</i>
<i>dq-unicode-escape-sequence</i>

<i>dq-simple-escape-sequence:: one of</i>
\" \\ \$ \e \f \n \r \t \v
@@ -554,6 +555,13 @@ A single-quoted string literal is always a constant expression.
<i>dq-hexadecimal-escape-sequence::</i>
\x <i>hexadecimal-digit hexadecimal-digit<sub>opt</sub></i>
\X <i>hexadecimal-digit hexadecimal-digit<sub>opt</sub></i>

<i>dq-unicode-escape-sequence::</i>
\u{ codepoint-digits }

<i>codepoint-digits::</i>
<i>hexadecimal-digit</i>
<i>hexadecimal-digit codepoint-digits</i>
</pre>

*octal-digit* and *hexadecimal-digit* are defined in [§§](#integer-literals).
@@ -586,13 +594,28 @@ Escape sequence | Character name | Unicode character
\v | Vertical Tab | U+000B
\ooo | 1–3-digit octal digit value ooo
\xhh or \Xhh | 1–2-digit hexadecimal digit value hh
\u{xxxxxx} | UTF-8 encoding of Unicode codepoint U+xxxxxx | U+xxxxxx

Within a double-quoted string literal, except when recognized as the
start of an escape sequence, a backslash (\\) is retained verbatim.

Within a double-quoted string literal a dollar ($) character not
escaped by a backslash (\\) is handled using the variable substitution rules
described below.

The `\u{xxxxxx}` escape sequence produces the UTF-8 encoding of the Unicode
codepoint with the hexadecimal number specified within the curly braces.
Implementations MUST NOT allow Unicode codepoints beyond U+10FFFF as this is
outside the range UTF-8 can encode (see
[RFC 3629](http://tools.ietf.org/html/rfc3629#section-3)). If a codepoint
larger than U+10FFFF is specified, implementations MUST error.
Implementations MUST pass through `\u` verbatim and not interpret it as an
escape sequence if it is not followed by an opening `{`, but if it is,
implementations MUST produce an error if there is no terminating `}` or the
contents are not a valid codepoint. Implementations MUST support leading zeroes,
but MUST NOT support leading or trailing whitespace for the codepoint between
the opening and terminating braces. Implementations MUST allow Unicode
codepoints that are not Unicode scalar values, such as high and low surrogates.

Review comment (Contributor Author, on the RFC 3629 reference): I could mention the WHATWG spec as well, but "RFC 3629" sounds so much more authoritative!
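
The encoding rule in the `\u{xxxxxx}` paragraph above can be illustrated with a minimal sketch. The helper name `encode_codepoint_utf8` is hypothetical (it is neither part of the spec nor a PHP built-in), and the `RangeException` is only a stand-in for whatever error an implementation raises. Note that surrogates are deliberately encoded rather than rejected, as the paragraph requires.

```php
<?php

// Hypothetical helper for illustration only: encodes one codepoint as
// UTF-8 bytes, following the rules in the paragraph above.
function encode_codepoint_utf8(int $cp): string
{
    // Codepoints beyond U+10FFFF must be rejected.
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new RangeException('Codepoint out of range');
    }
    if ($cp < 0x80) {
        // One byte: identical to ASCII.
        return chr($cp);
    }
    if ($cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        // Surrogates (U+D800..U+DFFF) are intentionally NOT rejected here.
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

var_dump(bin2hex(encode_codepoint_utf8(0x61)));   // string(2) "61"
var_dump(bin2hex(encode_codepoint_utf8(0x2603))); // string(6) "e29883"
var_dump(bin2hex(encode_codepoint_utf8(0xD801))); // string(6) "eda081"
```

Built-in conversion functions may refuse surrogate codepoints, which is why this sketch spells out the bit manipulation by hand.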

**Variable substitution**

@@ -695,6 +718,7 @@ echo "\$myC->p1 = >$myC->p1<\n"; // → $myC->p1 = >2<
<i>hd-simple-escape-sequence</i>
<i>dq-octal-escape-sequence</i>
<i>dq-hexadecimal-escape-sequence</i>
<i>dq-unicode-escape-sequence</i>

<i>hd-simple-escape-sequence:: one of</i>
\\ \$ \e \f \n \r \t \v
@@ -0,0 +1,8 @@
<?php

var_dump("\u{61}"); // ASCII "a": codepoints up to U+007F encode as a single ASCII byte in UTF-8
var_dump("\u{FF}"); // y with diaeresis
var_dump("\u{ff}"); // case-insensitive
var_dump("\u{2603}"); // Unicode snowman
var_dump("\u{1F602}"); // FACE WITH TEARS OF JOY emoji
var_dump("\u{0000001F602}"); // Leading zeroes permitted
@@ -0,0 +1,6 @@
string(1) "a"
string(2) "ÿ"
string(2) "ÿ"
string(3) "☃"
string(4) "😂"
string(4) "😂"
@@ -0,0 +1,3 @@
<?php

var_dump("\u{}");
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence in %s on line %d
@@ -0,0 +1,3 @@
<?php

var_dump("\u{blah");
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence in %s on line %d
@@ -0,0 +1,3 @@
<?php

var_dump("\u{110000}"); // U+10FFFF + 1
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence: Codepoint too large in %s on line %d
@@ -0,0 +1,6 @@
<?php

// These are ignored to avoid breaking JSON string literals
var_dump("\u");
var_dump("\u202e");
var_dump("\ufoobar");
@@ -0,0 +1,3 @@
string(2) "\u"
string(6) "\u202e"
string(8) "\ufoobar"
@@ -0,0 +1,3 @@
<?php

var_dump("\u{+1F602}");
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence in %s on line %d
@@ -0,0 +1,3 @@
<?php

var_dump("\u{-1F602}");
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence in %s on line %d
@@ -0,0 +1,8 @@
<?php

// Encoded surrogates are not well-formed UTF-8. However, it is sometimes useful
// to be able to produce them (e.g. for CESU-8 handling)

var_dump(bin2hex("\u{D801}"));
var_dump(bin2hex("\u{DC00}"));
var_dump(bin2hex("\u{D801}\u{DC00}")); // CESU-8 encoding of U+10400
@@ -0,0 +1,3 @@
string(6) "eda081"
string(6) "edb080"
string(12) "eda081edb080"
@@ -0,0 +1,3 @@
<?php

var_dump("\u{1F602 }");
@@ -0,0 +1 @@
Fatal error: Invalid UTF-8 codepoint escape sequence in %s on line %d
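
Taken together, the tests above pin down three outcomes: pass-through when `\u` is not followed by `{`, an error for malformed or oversized contents, and a codepoint otherwise. The following is a rough sketch of that decision logic, not the actual lexer code. The function name `parse_unicode_escape` is hypothetical, and the exceptions are stand-ins for the fatal errors the tests expect.

```php
<?php

// Hypothetical helper for illustration only. $rest is the source text that
// immediately follows "\u" inside a double-quoted string.
// Returns null when "\u" must be passed through verbatim, or the codepoint
// as an int. Throws (as a stand-in for a fatal error) on invalid input.
function parse_unicode_escape(string $rest): ?int
{
    if ($rest === '' || $rest[0] !== '{') {
        return null; // e.g. "\u", "\u202e", "\ufoobar" stay untouched
    }
    $close = strpos($rest, '}');
    if ($close === false) {
        throw new UnexpectedValueException('Invalid UTF-8 codepoint escape sequence');
    }
    $digits = substr($rest, 1, $close - 1);
    // One or more hex digits: no sign, no whitespace, not empty.
    if (!preg_match('/\A[0-9A-Fa-f]+\z/', $digits)) {
        throw new UnexpectedValueException('Invalid UTF-8 codepoint escape sequence');
    }
    // Leading zeroes are permitted, so strip them before the range check.
    $significant = ltrim($digits, '0');
    if (strlen($significant) > 6 || hexdec($significant) > 0x10FFFF) {
        throw new RangeException('Invalid UTF-8 codepoint escape sequence: Codepoint too large');
    }
    return (int) hexdec($significant);
}

var_dump(parse_unicode_escape('{2603}')); // int(9731)
var_dump(parse_unicode_escape('202e'));   // NULL
```

Checking the digit count before calling `hexdec` keeps very long digit strings from overflowing into a float; this is a design choice of the sketch, not something the spec text mandates.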