RFC: Flexible Heredoc and Nowdoc Syntaxes #2770

tpunt · 2017-09-24T18:06:29Z

This is the corresponding implementation to my Flexible Heredoc and Nowdoc Syntaxes RFC.

derickr · 2017-10-13T10:18:55Z

Zend/zend_language_scanner.l

-			end++;
-		}
+		if (!IS_LABEL_START(*end)) {
+			if ((spacing & 3) == 3) {


What is this magic number 3?

I could change this to be:

if (spacing & USES_TABS && spacing & USES_SPACES) {

The while loop a few lines above should probably also be changed then to:

while (YYCURSOR < YYLIMIT && (*YYCURSOR == ' ' || *YYCURSOR == '\t')) { spacing |= *YYCURSOR == '\t' ? USES_TABS : USES_SPACES;

derickr · 2017-10-13T10:19:21Z

Zend/zend_language_scanner.l

+					if (YYCURSOR[-2] == '\r' && YYCURSOR[-1] == '\n') {
+						newline = 2; /* Windows newline */
+					} else {
+						newline = 1;


Please no magic numbers, make a define for them.

This is actually preexisting code (I've just removed the wrapping conditional). The numbers could be changed to macros (for better semantic meaning), but this is kind of tangential to this PR.

nikic · 2017-12-09T15:58:07Z

Zend/tests/flexible-nowdoc-error7.phpt and Zend/tests/flexible-nowdoc-error8.phpt are crashing for me. Here's the valgrind log:

==31421== Conditional jump or move depends on uninitialised value(s)
==31421==    at 0xCED5F6: zend_ast_destroy (zend_ast.c:560)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==
==31421== Use of uninitialised value of size 8
==31421==    at 0xCED600: zend_ast_destroy (zend_ast.c:564)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==
==31421== Invalid read of size 2
==31421==    at 0xCED600: zend_ast_destroy (zend_ast.c:564)
==31421==    by 0xC0B96F: yydestruct (zend_language_parser.y:52)
==31421==    by 0xC11753: zendparse (zend_language_parser.c:7008)
==31421==    by 0xC15676: zend_compile (zend_language_scanner.l:585)
==31421==    by 0xC15A82: compile_file (zend_language_scanner.l:635)
==31421==    by 0x8F8D62: phar_compile_file (phar.c:3339)
==31421==    by 0xC9C7CA: zend_execute_scripts (zend.c:1532)
==31421==    by 0xBCFF71: php_execute_script (main.c:2467)
==31421==    by 0xDA8A7A: do_cli (php_cli.c:1011)
==31421==    by 0xDA9F0C: main (php_cli.c:1404)
==31421==  Address 0x56400000000 is not stack'd, malloc'd or (recently) free'd

Presumably zendlval doesn't get initialized (or is initialized and destroyed) in this error case.

nikic · 2017-12-09T16:04:17Z

The test-case

<?php

{
    $FOO = "FOO";
    define("FOO", "FOO");
    $b = <<<FOO
    Test
    ${
        FOO
    }
    FOO;
    var_dump($b);
}

fails with

Parse error: Invalid body indentation level (expecting an indentation level of at least 8) in /home/nikic/php-src/t101.php on line 7

because the FOO that is part of the variable interpolation is incorrectly recognized as the ending delimiter for the purpose of indentation determination.

Similarly

<?php

{
    $FOO = "FOO";
    define("FOO", "FOO");
    $b = <<<FOO
        Test
        ${
        FOO
        }
    FOO;
    var_dump($b);
}

dies with

Fatal error: Allowed memory size of 134217728 bytes exhausted at /home/nikic/php-src/Zend/zend_string.h:136 (tried to allocate 4294967320 bytes) in /home/nikic/php-src/t101.php on line 10

because the indentation is incorrectly determined and assumed invariants no longer hold.

nikic · 2017-12-09T16:07:52Z

I'm not sure how to fix this issue. We can only determine the indentation when we reach the ending delimiter (while taking into account interpolation, which can contain arbitrary code). At the same time, we already have to strip the indentation when we emit the ENCAPSED_AND_WHITESPACE tokens.

tpunt · 2017-12-11T17:29:34Z

Thanks for the code review, @nikic, and nice catch! I've now gone for a stateful forward scan instead of the naive approach I had previously (which would not have worked with nested heredocs either). This approach is a lot more robust now, and should also be compatible with my arbitrary expression interpolation RFC (in the off-chance that it passes).

Also, thanks @derickr for the code review. Your comments should be resolved now.

I've rebased the PR against master to resolve the conflict in the generated language scanner source file.

tpunt · 2018-01-29T20:25:16Z

Ping @nikic (I'm not in any hurry - I just don't want to forget about it)

nikic · 2018-03-10T14:41:35Z

Sorry for the delay, looking at this patch again now...

After squashing and rebasing, I'm seeing a segfault in Zend/tests/flexible-nowdoc-error8.php:

==15544== Conditional jump or move depends on uninitialised value(s)
==15544==    at 0xC4EA65: zend_ast_destroy (zend_ast.c:560)
==15544==    by 0xB9B796: yydestruct (zend_language_parser.y:52)
==15544==    by 0xBA108A: zendparse (zend_language_parser.c:7008)
==15544==    by 0xBA32CF: zend_compile (zend_language_scanner.l:587)
==15544==    by 0xBA3C6B: compile_string (zend_language_scanner.l:770)
==15544==    by 0xC6C5B5: zend_include_or_eval (zend_execute.c:3140)
==15544==    by 0xC747BB: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3292)
==15544==    by 0xCED931: execute_ex (zend_vm_execute.h:54908)
==15544==    by 0xCEDA85: zend_execute (zend_vm_execute.h:60114)
==15544==    by 0xC05A30: zend_execute_scripts (zend.c:1541)
==15544==    by 0xB67071: php_execute_script (main.c:2467)
==15544==    by 0xCF0483: do_cli (php_cli.c:1011)

I'd assume that this is because zendlval is not getting populated in one of the error cases.

nikic · 2018-03-10T14:45:18Z

Zend/zend_language_scanner.l

+
+			if (EG(exception)) {
+				zend_restore_lexical_state(&current_state);
+				RETURN_TOKEN(retval);


I don't think returning retval here is correct. I think this exception should actually be discarded and detected at a later time.

nikic · 2018-03-10T14:47:18Z

It would be good to add some tests of how this interacts with token_get_all. In particular:

How the various error conditions are handled. token_get_all() and similar functionality will discard the exceptions, but should still generate some kind of reasonable output (where reasonable might just be T_ENCAPSED_AND_WHITESPACE until the end of the code).
How the scan-ahead interacts with TOKEN_PARSE mode. I think right now it might actually emit the scan-ahead tokens as well.

nikic · 2018-03-10T14:55:24Z

Zend/zend_language_scanner.l

+				}
+
+				if (YYCURSOR == YYLIMIT) {
+					CG(increment_lineno) = 1;


This case does not initialize zendlval.

Fixed (as well as in a couple of other places), thanks!

nikic · 2018-03-10T15:00:32Z

Zend/zend_language_scanner.l


 heredoc_scan_done:
 	yyleng = YYCURSOR - SCNG(yy_text);
+	ZVAL_STRINGL(zendlval, yytext, yyleng);


I think it would be preferable to keep using yyleng - newline here and save the newline parameter to strip_multiline_string_indentation.

This is probably not a good idea to do because indented closing markers will have a new line, and then some whitespace after that. If we just take away newline from yyleng here, it will remove 1 or 2 whitespaces from the end of yytext (or 1 whitespace and \n, leaving potentially just \r on Windows). This will cause a number of complications for strip_multiline_string_indentation, so leaving it as-is would be best.

nikic · 2018-03-10T15:09:27Z

Zend/zend_language_scanner.l


-		if (*end == '\n' || *end == '\r') {
 			BEGIN(ST_END_HEREDOC);
+			is_heredoc = 0;


A direct RETURN_TOKEN(T_START_HEREDOC); would be more obvious here.

nikic · 2018-03-10T15:11:38Z

Zend/zend_language_scanner.l

+				RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE);
+			}
+
+			if (*end == ';') {


This check looks no longer necessary.

tpunt · 2018-03-11T21:18:14Z

@nikic Thanks for the review again! And nice catch on the TOKEN_PARSE lexing. I believe all of your points have been addressed now.

I have rebased the implementation to give CI a run through.

nikic · 2018-03-11T22:45:30Z

Zend/zend_language_scanner.l

-						}
+					if (spacing == (HEREDOC_USING_SPACES | HEREDOC_USING_TABS)) {
+						zend_throw_exception(zend_ce_parse_error, "Invalid indentation - tabs and spaces cannot be mixed", 0);
+						RETURN_TOKEN(T_ENCAPSED_AND_WHITESPACE);


The tab/space mixing cases currently result in weird token_get_all results:

$code = <<<CODE <?php \t<<<'DOC' \tXXX \tDOC; CODE; output(token_get_all($code));

gives

Line 1: T_OPEN_TAG ('<?php ') Line 2: T_WHITESPACE (' ') Line 2: T_START_HEREDOC ('<<<'DOC' ') Line 3: T_ENCAPSED_AND_WHITESPACE (' ') Line 3: T_ENCAPSED_AND_WHITESPACE ('DOC;')

Instead the output should be the same as if there were no indentation error, as the indentation only matters for the semantic value, not for the token assignment.

nikic · 2018-03-12T00:07:20Z

I'm wondering if it would be possible to move the handling of the indentation into the parser, rather than using the lexer scan ahead. After all the indentation does not really matter for the tokenization, it's only relevant for the semantic values. On the other hand the parser currently receives the already unescaped strings, so stripping the indentation at that point wouldn't be possible anymore. Hm...

tpunt · 2018-03-12T10:31:14Z

On the other hand the parser currently receives the already unescaped strings, so stripping the indentation at that point wouldn't be possible anymore. Hm...

Yeah, delaying indentation stripping until later will be problematic. Delaying the escaping should be possible, but then ext/tokenizer will output unescaped strings for heredocs.

nikic · 2018-03-12T10:36:15Z

@tpunt ext/tokenizer should not be stripping anything. Tokens have to be contiguous, there cannot be any character ranges missing in the output. I didn't notice that this was the case before you mentioned it just now...

ext/tokenizer also doesn't care about the escaping. It does not expose the semantic values, so it always only sees the escaped values.

tpunt · 2018-03-12T10:40:32Z

@nikic My mistake, regarding the stripping of leading WS for indented closing markers.

The following will output unstripped values:

output(token_get_all(<<<'OUTER_END'
<?php
echo <<<'INNER_END'
	a
	INNER_END;
OUTER_END));

Line 1: T_OPEN_TAG ('<?php
')
Line 2: T_ECHO ('echo')
Line 2: T_WHITESPACE (' ')
Line 2: T_START_HEREDOC ('<<<'INNER_END'
')
Line 3: T_ENCAPSED_AND_WHITESPACE ('	a
	')
Line 4: T_END_HEREDOC ('INNER_END')

Fixes segfaults in two error tests

Instead also use indentation in ST_END_HEREDOC.

So we don't get nonesense in the case where there is no ending heredoc marker.

It's not enough to avoid setting it for heredocs in particular. There are also other tokens that might set increment_lineno, and it would remain set after scan ahead completes.

nikic · 2018-03-25T15:34:16Z

This code does not generate an indentation error :(

$var = 'Bar';
var_dump(<<<TEST
  Foo
$var
  TEST);

tpunt · 2018-03-25T15:51:48Z

@nikic Hmm, that doesn't look like it's going to be easy to fix, either :/ I can take a deeper look at this tomorrow.

…ntation...

tpunt · 2018-03-26T21:24:35Z

@nikic I believe I have fixed the problem now. Your snippet was fixed by checking if: the next token was going to be an interpolated variable, there was a trailing newline, and there was an indentation level set. Interpolating an unindented variable without any leading characters (see erroneous test 10) was a little more problematic. I had to throw in the scan ahead (if the first token was a variable interpolation when indentation was set), and I don't see any other way it can be done.

The fixes aren't exactly pretty... (There's not much pretty about this patch in general, tbh.)

carusogabriel · 2018-03-28T10:27:06Z

Zend/tests/flexible-heredoc-complex-test1.phpt

+    }
+    c
+    DOC1);
+--EXPECT--


Minor: all tests are missing ?> in the end of the --FILE-- section

Fixed where possible. Thanks.

nikic · 2018-03-29T09:46:36Z

Zend/zend_language_scanner.l

+			}
+		}
+
+		if (first_token == T_VARIABLE && SCNG(heredoc_indentation)) {


This probably has to handle a few more things like '{' and T_DOLLAR_OPEN_CURLY_BRACES.

Fixed, thanks.

nikic · 2018-03-29T09:50:54Z

Zend/tests/flexible-nowdoc-error4.phpt

+    c
+     END;
+--EXPECTF--
+Parse error: Invalid body indentation level (expecting an indentation level of at least 5) in %s on line %d


I think these errors should check for the exact line number. I think right now we're always reporting on the first line (of the encapsed and whitespace token) rather than on the line where the indendation error occurs.

Yes, this would be a nice addition - this should be done now.

nikic · 2018-03-29T10:06:36Z

Zend/zend_language_scanner.l


 nowdoc_scan_done:
 	yyleng = YYCURSOR - SCNG(yy_text);
+	zend_copy_value(zendlval, yytext, yyleng);


Travis segfault is due to the use of zend_copy_value rather than ZVAL_STRINGL here. If yyleng is 1 this is an interned string, which is mprotected if opcache is used.

Thanks for the tip. I noticed the segfault, but couldn't figure out how I had messed something up to do with string interning (didn't even think to enable OPcache...). This should be fixed now.

Looks like this is now part of scan_escape_string, which I changed to no longer be called outside of parser mode, so this now has to happen explicitly.

nikic · 2018-04-13T19:59:16Z

@tpunt So, I think we're both not really happy with how the implementation turned out, but as the alternative (handling this in the parser) would be a larger refactoring I went ahead and merged this as-is in 4887357. We can improve this further in-tree.

Thanks for seeing this through, looking forward to using this feature :)

tpunt · 2018-04-14T08:42:50Z

@nikic Yeah, the implementation turned out to be a lot messier and a lot more work than I had envisioned... Thanks for all of the reviews you gave, along with providing a helping hand with the implementation :)

ecreeth · 2019-08-20T17:59:35Z

Hey @tpunt! Since problems arose with this new syntax, I would like to propose that you can embed php code within the Heredoc syntax.

tpunt · 2019-08-23T18:39:23Z

@ecreeth I do have an RFC to interpolate arbitrary expressions, but this certainly doesn't (and won't) cover statements. I don't see why you would want to do that either...

tpunt force-pushed the heredoc-nowdoc-indentation branch from 7185982 to 24c24fc Compare September 26, 2017 22:09

krakjoe added the RFC label Sep 27, 2017

tpunt force-pushed the heredoc-nowdoc-indentation branch from f3798bb to 476ec24 Compare September 27, 2017 12:02

derickr requested changes Oct 13, 2017

View reviewed changes

nikic self-requested a review November 15, 2017 21:49

nikic removed their request for review December 9, 2017 16:08

nikic added the Waiting on Author label Dec 9, 2017

tpunt force-pushed the heredoc-nowdoc-indentation branch 2 times, most recently from c36db77 to 46d0549 Compare December 11, 2017 16:35

nikic added Waiting on Review and removed Waiting on Author labels Feb 3, 2018

nikic reviewed Mar 10, 2018

View reviewed changes

tpunt force-pushed the heredoc-nowdoc-indentation branch from eb98935 to d12c183 Compare March 11, 2018 21:17

nikic reviewed Mar 11, 2018

View reviewed changes

tpunt and others added 9 commits March 20, 2018 12:50

Finish tokenizer updates with latest master changes

24a8b91

Use the right macro

4a12f44

Fixes segfaults in two error tests

Don't adjust heredoc label length

6e56536

Instead also use indentation in ST_END_HEREDOC.

Add one more test case

6d680f3

Move function defs above macros

ab02291

Run all tokenizer tests w/ and w/o TOKEN_PARSE

16939ad

Make sure to reset lookahead state

9f5435f

So we don't get nonesense in the case where there is no ending heredoc marker.

Always reset increment_lineno after scan ahead

1990cb2

It's not enough to avoid setting it for heredocs in particular. There are also other tokens that might set increment_lineno, and it would remain set after scan ahead completes.

Correctly handle unterminated heredoc with trailing WS

82c156c

tpunt added 2 commits March 26, 2018 22:02

Fix unindented interpolated variable problem

b252ee6

Unindented interpolated variables are only invalid when there is inde…

6ad9bf6

…ntation...

carusogabriel suggested changes Mar 28, 2018

View reviewed changes

nikic reviewed Mar 29, 2018

View reviewed changes

tpunt and others added 6 commits April 9, 2018 16:20

Cater for other variable interpolation types

2cb4531

Terminate test scripts with ?> (where possible)

ea719d5

use ZEND_STRINGL instead of zend_copy_value

0057e4b

Show erroneous line in exception error message

2e8826e

Return T_ERROR on error in parser mode

1c22f73

Fix newline counting in tokenizer

5aee65a

Looks like this is now part of scan_escape_string, which I changed to no longer be called outside of parser mode, so this now has to happen explicitly.

nikic removed the Waiting on Review label Apr 13, 2018

nikic closed this Apr 13, 2018

RFC: Flexible Heredoc and Nowdoc Syntaxes #2770

RFC: Flexible Heredoc and Nowdoc Syntaxes #2770

Uh oh!

Conversation

tpunt commented Sep 24, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tpunt Oct 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tpunt Oct 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic commented Dec 9, 2017

Uh oh!

nikic commented Dec 9, 2017

Uh oh!

nikic commented Dec 9, 2017

Uh oh!

tpunt commented Dec 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tpunt commented Jan 29, 2018

Uh oh!

nikic commented Mar 10, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic commented Mar 10, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tpunt Mar 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tpunt commented Mar 11, 2018

Uh oh!

nikic Mar 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic commented Mar 12, 2018

Uh oh!

tpunt commented Mar 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Mar 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tpunt commented Mar 12, 2018

Uh oh!

nikic commented Mar 25, 2018

Uh oh!

tpunt commented Mar 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tpunt commented Mar 26, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tpunt Oct 13, 2017 •

edited

Loading

tpunt Oct 13, 2017 •

edited

Loading

tpunt commented Dec 11, 2017 •

edited

Loading

tpunt Mar 11, 2018 •

edited

Loading

nikic Mar 11, 2018 •

edited

Loading

tpunt commented Mar 12, 2018 •

edited

Loading

nikic commented Mar 12, 2018 •

edited

Loading

tpunt commented Mar 25, 2018 •

edited

Loading

tpunt Apr 9, 2018 •

edited

Loading