Skip to content

Conversation

@tpunt
Copy link
Contributor

@tpunt tpunt commented Sep 24, 2017

This is the corresponding implementation to my Flexible Heredoc and Nowdoc Syntaxes RFC.

@tpunt tpunt force-pushed the heredoc-nowdoc-indentation branch from 7185982 to 24c24fc Compare September 26, 2017 22:09
@krakjoe krakjoe added the RFC label Sep 27, 2017
@tpunt tpunt force-pushed the heredoc-nowdoc-indentation branch from f3798bb to 476ec24 Compare September 27, 2017 12:02
end++;
}
if (!IS_LABEL_START(*end)) {
if ((spacing & 3) == 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this magic number 3?

Copy link
Contributor Author

@tpunt tpunt Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change this to be:

if (spacing & USES_TABS && spacing & USES_SPACES) {

The while loop a few lines above should probably also be changed then to:

while (YYCURSOR < YYLIMIT && (*YYCURSOR == ' ' || *YYCURSOR == '\t')) {
	spacing |= *YYCURSOR == '\t' ? USES_TABS : USES_SPACES;

if (YYCURSOR[-2] == '\r' && YYCURSOR[-1] == '\n') {
newline = 2; /* Windows newline */
} else {
newline = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no magic numbers, make a define for them.

Copy link
Contributor Author

@tpunt tpunt Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually preexisting code (I've just removed the wrapping conditional). The numbers could be changed to macros (for better semantic meaning), but this is kind of tangential to this PR.

@nikic nikic self-requested a review November 15, 2017 21:49
@nikic
Copy link
Member

nikic commented Dec 9, 2017

Zend/tests/flexible-nowdoc-error7.phpt and Zend/tests/flexible-nowdoc-error8.phpt are crashing for me. Here's the valgrind log:

==31421== Conditional jump or move depends on uninitialised value(s)
==31421==    at 0xCED5F6: zend_ast_destroy (zend_ast.c:560)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==
==31421== Use of uninitialised value of size 8
==31421==    at 0xCED600: zend_ast_destroy (zend_ast.c:564)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==
==31421== Invalid read of size 2
==31421==    at 0xCED600: zend_ast_destroy (zend_ast.c:564)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==  Address 0x56400000000 is not stack'd, malloc'd or (recently) free'd

Presumably zendlval doesn't get initialized (or is initialized and destroyed) in this error case.

@nikic
Copy link
Member

nikic commented Dec 9, 2017

The test-case

<?php

{
    $FOO = "FOO";
    define("FOO", "FOO");
    $b = <<<FOO
    Test
    ${
        FOO
    }
    FOO;
    var_dump($b);
}

fails with

Parse error: Invalid body indentation level (expecting an indentation level of at least 8) in /home/nikic/php-src/t101.php on line 7

because the FOO that is part of the variable interpolation is incorrectly recognized as the ending delimiter for the purpose of indentation determination.

Similarly

<?php

{
    $FOO = "FOO";
    define("FOO", "FOO");
    $b = <<<FOO
        Test
        ${
        FOO
        }
    FOO;
    var_dump($b);
}

dies with

Fatal error: Allowed memory size of 134217728 bytes exhausted at /home/nikic/php-src/Zend/zend_string.h:136 (tried to allocate 4294967320 bytes) in /home/nikic/php-src/t101.php on line 10

because the indentation is incorrectly determined and assumed invariants no longer hold.

@nikic
Copy link
Member

nikic commented Dec 9, 2017

I'm not sure how to fix this issue. We can only determine the indentation when we reach the ending delimiter (while taking into account interpolation, which can contain arbitrary code). At the same time, we already have to strip the indentation when we emit the ENCAPSED_AND_WHITESPACE tokens.

@nikic nikic removed their request for review December 9, 2017 16:08
@tpunt tpunt force-pushed the heredoc-nowdoc-indentation branch 2 times, most recently from c36db77 to 46d0549 Compare December 11, 2017 16:35
@tpunt
Copy link
Contributor Author

tpunt commented Dec 11, 2017

Thanks for the code review, @nikic, and nice catch! I've now gone for a stateful forward scan instead of the naive approach I had previously (which would not have worked with nested heredocs either). This approach is a lot more robust now, and should also be compatible with my arbitrary expression interpolation RFC (in the off-chance that it passes).

Also, thanks @derickr for the code review. Your comments should be resolved now.

I've rebased the PR against master to resolve the conflict in the generated language scanner source file.

@tpunt
Copy link
Contributor Author

tpunt commented Jan 29, 2018

Ping @nikic (I'm not in any hurry - I just don't want to forget about it)

@nikic
Copy link
Member

nikic commented Mar 10, 2018

Sorry for the delay, looking at this patch again now...

After squashing and rebasing, I'm seeing a segfault in Zend/tests/flexible-nowdoc-error8.php:

==15544== Conditional jump or move depends on uninitialised value(s)
==15544==    at 0xC4EA65: zend_ast_destroy (zend_ast.c:560)
==15544==    by 0xB9B796: yydestruct (zend_language_parser.y:52)
==15544==    by 0xBA108A: zendparse (zend_language_parser.c:7008)
==15544==    by 0xBA32CF: zend_compile (zend_language_scanner.l:587)
==15544==    by 0xBA3C6B: compile_string (zend_language_scanner.l:770)
==15544==    by 0xC6C5B5: zend_include_or_eval (zend_execute.c:3140)
==15544==    by 0xC747BB: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3292)
==15544==    by 0xCED931: execute_ex (zend_vm_execute.h:54908)
==15544==    by 0xCEDA85: zend_execute (zend_vm_execute.h:60114)
==15544==    by 0xC05A30: zend_execute_scripts (zend.c:1541)
==15544==    by 0xB67071: php_execute_script (main.c:2467)
==15544==    by 0xCF0483: do_cli (php_cli.c:1011)

I'd assume that this is because zendlval is not getting populated in one of the error cases.

if (EG(exception)) {
zend_restore_lexical_state(&current_state);
RETURN_TOKEN(retval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think returning retval here is correct. I think this exception should actually be discarded and detected at a later time.

@nikic
Copy link
Member

nikic commented Mar 10, 2018

It would be good to add some tests of how this interacts with token_get_all. In particular:

  • How the various error conditions are handled. token_get_all() and similar functionality will discard the exceptions, but should still generate some kind of reasonable output (where reasonable might just be T_ENCAPSED_AND_WHITESPACE until the end of the code).
  • How the scan-ahead interacts with TOKEN_PARSE mode. I think right now it might actually emit the scan-ahead tokens as well.

}
if (YYCURSOR == YYLIMIT) {
CG(increment_lineno) = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case does not initialize zendlval.

Copy link
Contributor Author

@tpunt tpunt Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (as well as in a couple of other places), thanks!

heredoc_scan_done:
yyleng = YYCURSOR - SCNG(yy_text);
ZVAL_STRINGL(zendlval, yytext, yyleng);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be preferable to keep using yyleng - newline here and save the newline parameter to strip_multiline_string_indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not a good idea to do because indented closing markers will have a new line, and then some whitespace after that. If we just take away newline from yyleng here, it will remove 1 or 2 whitespaces from the end of yytext (or 1 whitespace and \n, leaving potentially just \r on Windows). This will cause a number of complications for strip_multiline_string_indentation, so leaving it as-is would be best.

if (*end == '\n' || *end == '\r') {
BEGIN(ST_END_HEREDOC);
is_heredoc = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A direct RETURN_TOKEN(T_START_HEREDOC); would be more obvious here.

RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE);
}
if (*end == ';') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check looks no longer necessary.

@tpunt tpunt force-pushed the heredoc-nowdoc-indentation branch from eb98935 to d12c183 Compare March 11, 2018 21:17
@tpunt
Copy link
Contributor Author

tpunt commented Mar 11, 2018

@nikic Thanks for the review again! And nice catch on the TOKEN_PARSE lexing. I believe all of your points have been addressed now.

I have rebased the implementation to give CI a run through.

}
if (spacing == (HEREDOC_USING_SPACES | HEREDOC_USING_TABS)) {
zend_throw_exception(zend_ce_parse_error, "Invalid indentation - tabs and spaces cannot be mixed", 0);
RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE);
Copy link
Member

@nikic nikic Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tab/space mixing cases currently result in weird token_get_all results:

$code = <<<CODE
<?php
 \t<<<'DOC'
 \tXXX
 \tDOC;
CODE;
output(token_get_all($code));

gives

Line 1: T_OPEN_TAG ('<?php
')
Line 2: T_WHITESPACE (' 	')
Line 2: T_START_HEREDOC ('<<<'DOC'
')
Line 3: T_ENCAPSED_AND_WHITESPACE (' ')
Line 3: T_ENCAPSED_AND_WHITESPACE ('DOC;')

Instead the output should be the same as if there were no indentation error, as the indentation only matters for the semantic value, not for the token assignment.

@nikic
Copy link
Member

nikic commented Mar 12, 2018

I'm wondering if it would be possible to move the handling of the indentation into the parser, rather than using the lexer scan ahead. After all the indentation does not really matter for the tokenization, it's only relevant for the semantic values. On the other hand the parser currently receives the already unescaped strings, so stripping the indentation at that point wouldn't be possible anymore. Hm...

@tpunt
Copy link
Contributor Author

tpunt commented Mar 12, 2018

On the other hand the parser currently receives the already unescaped strings, so stripping the indentation at that point wouldn't be possible anymore. Hm...

Yeah, delaying indentation stripping until later will be problematic. Delaying the escaping should be possible, but then ext/tokenizer will output unescaped strings for heredocs.

@nikic
Copy link
Member

nikic commented Mar 12, 2018

@tpunt ext/tokenizer should not be stripping anything. Tokens have to be contiguous, there cannot be any character ranges missing in the output. I didn't notice that this was the case before you mentioned it just now...

ext/tokenizer also doesn't care about the escaping. It does not expose the semantic values, so it always only sees the escaped values.

@tpunt
Copy link
Contributor Author

tpunt commented Mar 12, 2018

@nikic My mistake, regarding the stripping of leading WS for indented closing markers.

The following will output unstripped values:

output(token_get_all(<<<'OUTER_END'
<?php
echo <<<'INNER_END'
	a
	INNER_END;
OUTER_END));
Line 1: T_OPEN_TAG ('<?php
')
Line 2: T_ECHO ('echo')
Line 2: T_WHITESPACE (' ')
Line 2: T_START_HEREDOC ('<<<'INNER_END'
')
Line 3: T_ENCAPSED_AND_WHITESPACE ('	a
	')
Line 4: T_END_HEREDOC ('INNER_END')

tpunt and others added 9 commits March 20, 2018 12:50
Fixes segfaults in two error tests
Instead also use indentation in ST_END_HEREDOC.
So we don't get nonesense in the case where there is no ending
heredoc marker.
It's not enough to avoid setting it for heredocs in particular.
There are also other tokens that might set increment_lineno, and
it would remain set after scan ahead completes.
@nikic
Copy link
Member

nikic commented Mar 25, 2018

This code does not generate an indentation error :(

$var = 'Bar';
var_dump(<<<TEST
  Foo
$var
  TEST);

@tpunt
Copy link
Contributor Author

tpunt commented Mar 25, 2018

@nikic Hmm, that doesn't look like it's going to be easy to fix, either :/ I can take a deeper look at this tomorrow.

@tpunt
Copy link
Contributor Author

tpunt commented Mar 26, 2018

@nikic I believe I have fixed the problem now. Your snippet was fixed by checking if: the next token was going to be an interpolated variable, there was a trailing newline, and there was an indentation level set. Interpolating an unindented variable without any leading characters (see erroneous test 10) was a little more problematic. I had to throw in the scan ahead (if the first token was a variable interpolation when indentation was set), and I don't see any other way it can be done.

The fixes aren't exactly pretty... (There's not much pretty about this patch in general, tbh.)

}
c
DOC1);
--EXPECT--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: all tests are missing ?> in the end of the --FILE-- section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed where possible. Thanks.

}
}
if (first_token == T_VARIABLE && SCNG(heredoc_indentation)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably has to handle a few more things like '{' and T_DOLLAR_OPEN_CURLY_BRACES.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

c
END;
--EXPECTF--
Parse error: Invalid body indentation level (expecting an indentation level of at least 5) in %s on line %d No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these errors should check for the exact line number. I think right now we're always reporting on the first line (of the encapsed and whitespace token) rather than on the line where the indendation error occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be a nice addition - this should be done now.

nowdoc_scan_done:
yyleng = YYCURSOR - SCNG(yy_text);
zend_copy_value(zendlval, yytext, yyleng);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Travis segfault is due to the use of zend_copy_value rather than ZVAL_STRINGL here. If yyleng is 1 this is an interned string, which is mprotected if opcache is used.

Copy link
Contributor Author

@tpunt tpunt Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip. I noticed the segfault, but couldn't figure out how I had messed something up to do with string interning (didn't even think to enable OPcache...). This should be fixed now.

@nikic
Copy link
Member

nikic commented Apr 13, 2018

@tpunt So, I think we're both not really happy with how the implementation turned out, but as the alternative (handling this in the parser) would be a larger refactoring I went ahead and merged this as-is in 4887357. We can improve this further in-tree.

Thanks for seeing this through, looking forward to using this feature :)

@nikic nikic closed this Apr 13, 2018
@tpunt
Copy link
Contributor Author

tpunt commented Apr 14, 2018

@nikic Yeah, the implementation turned out to be a lot messier and a lot more work than I had envisioned... Thanks for all of the reviews you gave, along with providing a helping hand with the implementation :)

@ecreeth
Copy link

ecreeth commented Aug 20, 2019

Hey @tpunt! Since problems arose with this new syntax, I would like to propose that you can embed php code within the Heredoc syntax.

image

@tpunt
Copy link
Contributor Author

tpunt commented Aug 23, 2019

@ecreeth I do have an RFC to interpolate arbitrary expressions, but this certainly doesn't (and won't) cover statements. I don't see why you would want to do that either...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants