-
Notifications
You must be signed in to change notification settings - Fork 8k
RFC: Flexible Heredoc and Nowdoc Syntaxes #2770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7185982 to
24c24fc
Compare
f3798bb to
476ec24
Compare
Zend/zend_language_scanner.l
Outdated
| end++; | ||
| } | ||
| if (!IS_LABEL_START(*end)) { | ||
| if ((spacing & 3) == 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this magic number 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change this to be:
if (spacing & USES_TABS && spacing & USES_SPACES) {The while loop a few lines above should probably also be changed then to:
while (YYCURSOR < YYLIMIT && (*YYCURSOR == ' ' || *YYCURSOR == '\t')) {
spacing |= *YYCURSOR == '\t' ? USES_TABS : USES_SPACES;| if (YYCURSOR[-2] == '\r' && YYCURSOR[-1] == '\n') { | ||
| newline = 2; /* Windows newline */ | ||
| } else { | ||
| newline = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no magic numbers, make a define for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually preexisting code (I've just removed the wrapping conditional). The numbers could be changed to macros (for better semantic meaning), but this is kind of tangential to this PR.
|
Zend/tests/flexible-nowdoc-error7.phpt and Zend/tests/flexible-nowdoc-error8.phpt are crashing for me. Here's the valgrind log: Presumably zendlval doesn't get initialized (or is initialized and destroyed) in this error case. |
|
The test-case <?php
{
$FOO = "FOO";
define("FOO", "FOO");
$b = <<<FOO
Test
${
FOO
}
FOO;
var_dump($b);
}fails with because the Similarly <?php
{
$FOO = "FOO";
define("FOO", "FOO");
$b = <<<FOO
Test
${
FOO
}
FOO;
var_dump($b);
}dies with because the indentation is incorrectly determined and assumed invariants no longer hold. |
|
I'm not sure how to fix this issue. We can only determine the indentation when we reach the ending delimiter (while taking into account interpolation, which can contain arbitrary code). At the same time, we already have to strip the indentation when we emit the ENCAPSED_AND_WHITESPACE tokens. |
c36db77 to
46d0549
Compare
|
Thanks for the code review, @nikic, and nice catch! I've now gone for a stateful forward scan instead of the naive approach I had previously (which would not have worked with nested heredocs either). This approach is a lot more robust now, and should also be compatible with my arbitrary expression interpolation RFC (in the off-chance that it passes). Also, thanks @derickr for the code review. Your comments should be resolved now. I've rebased the PR against master to resolve the conflict in the generated language scanner source file. |
|
Ping @nikic (I'm not in any hurry - I just don't want to forget about it) |
|
Sorry for the delay, looking at this patch again now... After squashing and rebasing, I'm seeing a segfault in I'd assume that this is because |
Zend/zend_language_scanner.l
Outdated
| if (EG(exception)) { | ||
| zend_restore_lexical_state(¤t_state); | ||
| RETURN_TOKEN(retval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think returning retval here is correct. I think this exception should actually be discarded and detected at a later time.
|
It would be good to add some tests of how this interacts with token_get_all. In particular:
|
Zend/zend_language_scanner.l
Outdated
| } | ||
| if (YYCURSOR == YYLIMIT) { | ||
| CG(increment_lineno) = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case does not initialize zendlval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (as well as in a couple of other places), thanks!
| heredoc_scan_done: | ||
| yyleng = YYCURSOR - SCNG(yy_text); | ||
| ZVAL_STRINGL(zendlval, yytext, yyleng); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be preferable to keep using yyleng - newline here and save the newline parameter to strip_multiline_string_indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not a good idea to do because indented closing markers will have a new line, and then some whitespace after that. If we just take away newline from yyleng here, it will remove 1 or 2 whitespaces from the end of yytext (or 1 whitespace and \n, leaving potentially just \r on Windows). This will cause a number of complications for strip_multiline_string_indentation, so leaving it as-is would be best.
Zend/zend_language_scanner.l
Outdated
| if (*end == '\n' || *end == '\r') { | ||
| BEGIN(ST_END_HEREDOC); | ||
| is_heredoc = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A direct RETURN_TOKEN(T_START_HEREDOC); would be more obvious here.
Zend/zend_language_scanner.l
Outdated
| RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE); | ||
| } | ||
| if (*end == ';') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check looks no longer necessary.
eb98935 to
d12c183
Compare
|
@nikic Thanks for the review again! And nice catch on the I have rebased the implementation to give CI a run through. |
Zend/zend_language_scanner.l
Outdated
| } | ||
| if (spacing == (HEREDOC_USING_SPACES | HEREDOC_USING_TABS)) { | ||
| zend_throw_exception(zend_ce_parse_error, "Invalid indentation - tabs and spaces cannot be mixed", 0); | ||
| RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tab/space mixing cases currently result in weird token_get_all results:
$code = <<<CODE
<?php
\t<<<'DOC'
\tXXX
\tDOC;
CODE;
output(token_get_all($code));gives
Line 1: T_OPEN_TAG ('<?php
')
Line 2: T_WHITESPACE (' ')
Line 2: T_START_HEREDOC ('<<<'DOC'
')
Line 3: T_ENCAPSED_AND_WHITESPACE (' ')
Line 3: T_ENCAPSED_AND_WHITESPACE ('DOC;')
Instead the output should be the same as if there were no indentation error, as the indentation only matters for the semantic value, not for the token assignment.
|
I'm wondering if it would be possible to move the handling of the indentation into the parser, rather than using the lexer scan ahead. After all the indentation does not really matter for the tokenization, it's only relevant for the semantic values. On the other hand the parser currently receives the already unescaped strings, so stripping the indentation at that point wouldn't be possible anymore. Hm... |
Yeah, delaying indentation stripping until later will be problematic. Delaying the escaping should be possible, but then ext/tokenizer will output unescaped strings for heredocs. |
|
@tpunt ext/tokenizer should not be stripping anything. Tokens have to be contiguous, there cannot be any character ranges missing in the output. I didn't notice that this was the case before you mentioned it just now... ext/tokenizer also doesn't care about the escaping. It does not expose the semantic values, so it always only sees the escaped values. |
|
@nikic My mistake, regarding the stripping of leading WS for indented closing markers. The following will output unstripped values: output(token_get_all(<<<'OUTER_END'
<?php
echo <<<'INNER_END'
a
INNER_END;
OUTER_END)); |
Fixes segfaults in two error tests
Instead also use indentation in ST_END_HEREDOC.
So we don't get nonesense in the case where there is no ending heredoc marker.
It's not enough to avoid setting it for heredocs in particular. There are also other tokens that might set increment_lineno, and it would remain set after scan ahead completes.
|
This code does not generate an indentation error :( $var = 'Bar';
var_dump(<<<TEST
Foo
$var
TEST); |
|
@nikic Hmm, that doesn't look like it's going to be easy to fix, either :/ I can take a deeper look at this tomorrow. |
|
@nikic I believe I have fixed the problem now. Your snippet was fixed by checking if: the next token was going to be an interpolated variable, there was a trailing newline, and there was an indentation level set. Interpolating an unindented variable without any leading characters (see erroneous test 10) was a little more problematic. I had to throw in the scan ahead (if the first token was a variable interpolation when indentation was set), and I don't see any other way it can be done. The fixes aren't exactly pretty... (There's not much pretty about this patch in general, tbh.) |
| } | ||
| c | ||
| DOC1); | ||
| --EXPECT-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: all tests are missing ?> in the end of the --FILE-- section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed where possible. Thanks.
Zend/zend_language_scanner.l
Outdated
| } | ||
| } | ||
| if (first_token == T_VARIABLE && SCNG(heredoc_indentation)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably has to handle a few more things like '{' and T_DOLLAR_OPEN_CURLY_BRACES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
| c | ||
| END; | ||
| --EXPECTF-- | ||
| Parse error: Invalid body indentation level (expecting an indentation level of at least 5) in %s on line %d No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these errors should check for the exact line number. I think right now we're always reporting on the first line (of the encapsed and whitespace token) rather than on the line where the indendation error occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would be a nice addition - this should be done now.
Zend/zend_language_scanner.l
Outdated
| nowdoc_scan_done: | ||
| yyleng = YYCURSOR - SCNG(yy_text); | ||
| zend_copy_value(zendlval, yytext, yyleng); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis segfault is due to the use of zend_copy_value rather than ZVAL_STRINGL here. If yyleng is 1 this is an interned string, which is mprotected if opcache is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip. I noticed the segfault, but couldn't figure out how I had messed something up to do with string interning (didn't even think to enable OPcache...). This should be fixed now.
Looks like this is now part of scan_escape_string, which I changed to no longer be called outside of parser mode, so this now has to happen explicitly.
|
@tpunt So, I think we're both not really happy with how the implementation turned out, but as the alternative (handling this in the parser) would be a larger refactoring I went ahead and merged this as-is in 4887357. We can improve this further in-tree. Thanks for seeing this through, looking forward to using this feature :) |
|
@nikic Yeah, the implementation turned out to be a lot messier and a lot more work than I had envisioned... Thanks for all of the reviews you gave, along with providing a helping hand with the implementation :) |
|
Hey @tpunt! Since problems arose with this new syntax, I would like to propose that you can embed php code within the Heredoc syntax. |
|
@ecreeth I do have an RFC to interpolate arbitrary expressions, but this certainly doesn't (and won't) cover statements. I don't see why you would want to do that either... |

This is the corresponding implementation to my Flexible Heredoc and Nowdoc Syntaxes RFC.