
Conversation

@iSazonov (Collaborator) commented Oct 22, 2018

PR Summary

The current implementation does huge extra allocations for every line of a file, and at worst for every char (!). The fix removes the extra allocations by moving initializations into the constructor and using Span for the temporary buffer.

The Boyer-Moore string search algorithm was optimized by using a simple mapping of Unicode chars to a byte. It works well for common text files with chars from one or two languages.
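The low-byte table described here can be sketched roughly as follows (a simplified illustration with a hypothetical `BuildBadCharTable` name, not the actual PR code):

```csharp
using System;

static class BoyerMooreSketch
{
    // Sketch of a Boyer-Moore bad-character table that maps each
    // delimiter char to its low byte. Text in one or two languages
    // rarely has two distinct chars sharing a low byte, so lookups
    // stay collision-free in the common case.
    internal static int[] BuildBadCharTable(string delimiter)
    {
        // Default shift for chars not in the delimiter: the full length.
        var table = new int[256];
        Array.Fill(table, delimiter.Length);

        // For each char except the last, store the distance from its
        // occurrence to the end of the delimiter. Later occurrences
        // overwrite earlier ones, so the last index wins.
        for (int i = 0; i < delimiter.Length - 1; i++)
        {
            byte lowByte = unchecked((byte)delimiter[i]);
            table[lowByte] = delimiter.Length - i - 1;
        }

        return table;
    }
}
```

A flat 256-entry array replaces a `Dictionary<char, int>` lookup per mismatch, which is where the simple-delimiter perf concern raised below comes from.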

You could review commit by commit.

PR Checklist

/// <param name="isRawStream">
/// Indicates raw stream.
/// </param>
public FileSystemContentReaderWriter(


Why was this removed?

Collaborator Author


It is an unneeded, unused internal constructor, so we can safely remove it. (Otherwise we would have to update it to follow our changes in the other constructor with the delimiter parameter.)

@PaulHigin (Contributor) left a comment


Overall this looks good to me. I agree that the offset Dictionary could be a perf issue for simple delimiter characters. Have you performed any tests to verify perf gains? I imagine they would be significant in some cases by switching to use Span.
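For a rough idea of the gain under discussion: the old pattern allocated a fresh temporary buffer per read (at worst per char), while a stack-allocated Span buffer is reused with no heap traffic at all. A minimal sketch under those assumptions, not the actual FileSystemContentReaderWriter code:

```csharp
using System;
using System.IO;

static class ReaderSketch
{
    // Old pattern (simplified): a fresh char[] on the heap per char read.
    internal static int CountCharsAllocating(TextReader reader)
    {
        int count = 0;
        while (reader.Peek() != -1)
        {
            var oneChar = new char[1];   // heap allocation on every iteration (!)
            reader.Read(oneChar, 0, 1);
            count++;
        }
        return count;
    }

    // New pattern (simplified): one stack buffer reused across reads.
    internal static int CountCharsWithSpan(TextReader reader)
    {
        int count = 0;
        Span<char> buffer = stackalloc char[256]; // no heap allocation
        int read;
        while ((read = reader.Read(buffer)) > 0)  // Span overload, .NET Core 2.1+
        {
            count += read;
        }
        return count;
    }
}
```

Both return the same result; the difference is purely in GC pressure, which is what a PerfView trace would show.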

blocks.Add(contentRead);
}

return _reader.Peek() != -1;
Contributor


Doesn't this always have to be -1, when reading a raw stream to the end?

Collaborator Author


Fixed.

// we can read to generate another possible match.
// If we read more characters than this, we risk consuming
// more of the stream than we need.
_offsetDictionary = new Dictionary<char, int>(_delimiter.Length);
Contributor


This looks a little bit dangerous. If _usingDelimiter is set to true anywhere else, _currentLineContent and _offsetDictionary will be null during later reads. I feel we should at least have an assert where these are used, to ensure they are not null and to warn if the code somehow gets changed.
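The guard being suggested is the standard Debug.Assert pattern, which compiles away in release builds. A generic illustration of the invariant (field names borrowed from the discussion; the class shape is hypothetical):

```csharp
using System.Collections.Generic;
using System.Diagnostics;

class DelimitedReaderSketch
{
    private readonly bool _usingDelimiter;
    private readonly Dictionary<char, int> _offsetDictionary;

    internal DelimitedReaderSketch(string delimiter)
    {
        _usingDelimiter = delimiter != null;
        if (_usingDelimiter)
        {
            _offsetDictionary = new Dictionary<char, int>(delimiter.Length);
        }
    }

    internal int ReadUsingDelimiter()
    {
        // The suggested guard: fail loudly in debug builds if a future
        // change ever sets _usingDelimiter without initializing the
        // dictionary. Release builds compile this check out entirely.
        Debug.Assert(!_usingDelimiter || _offsetDictionary != null,
            "_offsetDictionary must be initialized whenever _usingDelimiter is true");
        return _offsetDictionary == null ? 0 : _offsetDictionary.Count;
    }
}
```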

Collaborator Author


If somebody makes a wrong change, our tests will crash with a null reference exception and we'll never merge the code. I don't know where we could put the assert or how it would help more than the null reference exception.

@iSazonov (Collaborator, Author) commented Oct 24, 2018

> I agree that the offset Dictionary could be a perf issue for simple delimiter characters.

I haven't found a quick way to fix this. Will investigate.

> Have you performed any tests to verify perf gains?

No. I discovered this accidentally while reading the code. Once I realized that we allocate on the heap for every char in the worst case, and for every line, I did not see the point in running PerfView.

Contributor


I think this can be

_offsetDictionary[lowByte] = _delimiter.Length - i - 1;

For duplicate chars, the last index will end up in _offsetDictionary[lowByte].

Also, do we need to be worried about char collisions using the low byte?
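The indexer-versus-Add distinction being suggested works like this in general: the indexer silently overwrites an existing key, whereas Dictionary.Add throws ArgumentException on a duplicate. A generic sketch (hypothetical `BuildOffsets` helper, not the PR code):

```csharp
using System.Collections.Generic;

static class IndexerSketch
{
    // Builds delimiter offsets with the indexer. For a delimiter with
    // duplicate chars, each assignment overwrites the previous one,
    // so the last occurrence's offset wins; Dictionary.Add would
    // instead throw on the duplicate key.
    internal static Dictionary<char, int> BuildOffsets(string delimiter)
    {
        var offsets = new Dictionary<char, int>(delimiter.Length);
        for (int i = 0; i < delimiter.Length - 1; i++)
        {
            offsets[delimiter[i]] = delimiter.Length - i - 1; // overwrite, don't Add
        }
        return offsets;
    }
}
```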

Collaborator Author


> Also, do we need to be worried about char collisions using the low byte?

There is a comment about this in the code. If a file contains chars from many languages, we'll have many collisions and lose performance. In practice, we can expect files to contain chars from one or two languages, so in the common case we'll have no collisions and the best performance.
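A collision here means two distinct chars whose code points agree modulo 256, e.g. 'A' (U+0041) and 'Ł' (U+0141) both have low byte 0x41. A quick illustration with a hypothetical helper:

```csharp
static class CollisionSketch
{
    // Two distinct chars collide in a low-byte table when their code
    // points agree modulo 256. Mixed-script text makes this more
    // likely; text in one or two languages rarely triggers it, and a
    // collision only costs an extra (still correct) comparison.
    internal static bool CollidesLowByte(char a, char b)
    {
        return a != b && unchecked((byte)a) == unchecked((byte)b);
    }
}
```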

@iSazonov (Collaborator, Author) Oct 25, 2018


> I think this can be

Thanks! Good catch!

Fixed.

Contributor


currentChar can be defined below where it is used.

Collaborator Author


Fixed.

@iSazonov iSazonov force-pushed the optimizations-in-getcontent2 branch from 4a7488a to 07af4ca on October 25, 2018 at 04:52
@PaulHigin (Contributor) left a comment


LGTM

@iSazonov (Collaborator, Author)

@anmenaga @SteveL-MSFT Do you have any comments?

@anmenaga

Get-Content is a frequently used cmdlet. I would like to see a green [feature] test run for this. Thanks.

@iSazonov iSazonov merged commit 0e19042 into PowerShell:master Oct 26, 2018
@iSazonov iSazonov deleted the optimizations-in-getcontent2 branch October 26, 2018 03:56
