
Conversation

@iSazonov (Collaborator) commented Oct 22, 2018

PR Summary

The current implementation does huge extra allocations for every line of a file, and at worst for every char (!). The fix removes the extra allocations by moving initializations into the constructor and using Span for the temporary buffer.

The Boyer-Moore string search algorithm was optimized by using a simple mapping of Unicode chars to a byte. It works well for common text files with chars from one or two languages.
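The low-byte table described here can be sketched roughly as follows (a simplified illustration with a hypothetical `BuildBadCharTable` name, not the actual PR code):

```csharp
using System;

static class BoyerMooreSketch
{
    // Sketch of a Boyer-Moore bad-character table that maps each
    // delimiter char to its low byte. Text in one or two languages
    // rarely has two distinct chars sharing a low byte, so lookups
    // stay collision-free in the common case.
    internal static int[] BuildBadCharTable(string delimiter)
    {
        // Default shift for chars not in the delimiter: the full length.
        var table = new int[256];
        Array.Fill(table, delimiter.Length);

        // For each char except the last, store the distance from its
        // occurrence to the end of the delimiter. Later occurrences
        // overwrite earlier ones, so the last index wins.
        for (int i = 0; i < delimiter.Length - 1; i++)
        {
            byte lowByte = unchecked((byte)delimiter[i]);
            table[lowByte] = delimiter.Length - i - 1;
        }

        return table;
    }
}
```

A flat 256-entry array replaces a `Dictionary<char, int>` lookup per mismatch, which is where the simple-delimiter perf concern raised below comes from.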

You could review commit by commit.

PR Checklist

/// <param name="isRawStream">
/// Indicates raw stream.
/// </param>
public FileSystemContentReaderWriter(


Why was this removed?

Collaborator Author


It is an unneeded, unused internal constructor, so we can safely remove it. (Otherwise we would have to update it to follow our changes in the other constructor with the delimiter parameter.)

@PaulHigin (Contributor) left a comment


Overall this looks good to me. I agree that the offset Dictionary could be a perf issue for simple delimiter characters. Have you performed any tests to verify perf gains? I imagine they would be significant in some cases by switching to use Span.
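For a rough idea of the gain under discussion: the old pattern allocated a fresh temporary buffer per read (at worst per char), while a stack-allocated Span buffer is reused with no heap traffic at all. A minimal sketch under those assumptions, not the actual FileSystemContentReaderWriter code:

```csharp
using System;
using System.IO;

static class ReaderSketch
{
    // Old pattern (simplified): a fresh char[] on the heap per char read.
    internal static int CountCharsAllocating(TextReader reader)
    {
        int count = 0;
        while (reader.Peek() != -1)
        {
            var oneChar = new char[1];   // heap allocation on every iteration (!)
            reader.Read(oneChar, 0, 1);
            count++;
        }
        return count;
    }

    // New pattern (simplified): one stack buffer reused across reads.
    internal static int CountCharsWithSpan(TextReader reader)
    {
        int count = 0;
        Span<char> buffer = stackalloc char[256]; // no heap allocation
        int read;
        while ((read = reader.Read(buffer)) > 0)  // Span overload, .NET Core 2.1+
        {
            count += read;
        }
        return count;
    }
}
```

Both return the same result; the difference is purely in GC pressure, which is what a PerfView trace would show.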

blocks.Add(contentRead);
}

return _reader.Peek() != -1;
Contributor


Doesn't this always have to be -1, when reading a raw stream to the end?

Collaborator Author


Fixed.

// we can read to generate another possible match.
// If we read more characters than this, we risk consuming
// more of the stream than we need.
_offsetDictionary = new Dictionary<char, int>(_delimiter.Length);
Contributor


This looks a little bit dangerous. If _usingDelimiter is set to true anywhere else, _currentLineContent and _offsetDictionary will be null during later reads. I feel we should at least have an assert where these are used, to ensure they are not null and to warn if the code somehow gets changed.
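The guard being suggested is the standard Debug.Assert pattern, which compiles away in release builds. A generic illustration of the invariant (field names borrowed from the discussion; the class shape is hypothetical):

```csharp
using System.Collections.Generic;
using System.Diagnostics;

class DelimitedReaderSketch
{
    private readonly bool _usingDelimiter;
    private readonly Dictionary<char, int> _offsetDictionary;

    internal DelimitedReaderSketch(string delimiter)
    {
        _usingDelimiter = delimiter != null;
        if (_usingDelimiter)
        {
            _offsetDictionary = new Dictionary<char, int>(delimiter.Length);
        }
    }

    internal int ReadUsingDelimiter()
    {
        // The suggested guard: fail loudly in debug builds if a future
        // change ever sets _usingDelimiter without initializing the
        // dictionary. Release builds compile this check out entirely.
        Debug.Assert(!_usingDelimiter || _offsetDictionary != null,
            "_offsetDictionary must be initialized whenever _usingDelimiter is true");
        return _offsetDictionary == null ? 0 : _offsetDictionary.Count;
    }
}
```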

Collaborator Author


If somebody makes a wrong change, our tests will crash with a null reference exception and we'll never merge the code. I don't know where we could put the assert or how it would help more than the null reference exception.

@iSazonov (Collaborator, Author) commented Oct 24, 2018

> I agree that the offset Dictionary could be a perf issue for simple delimiter characters.

I haven't found a quick way to fix this. Will investigate.

> Have you performed any tests to verify perf gains?

No. I discovered this accidentally while reading the code. Once I realized that we allocate on the heap for every char in the worst case, and for every line, I did not see the point in running PerfView.

Contributor


I think this can be

_offsetDictionary[lowByte] = _delimiter.Length - i - 1;

For duplicate chars, the last index will end up in _offsetDictionary[lowByte].

Also, do we need to be worried about char collisions using the low byte?
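The indexer-versus-Add distinction being suggested works like this in general: the indexer silently overwrites an existing key, whereas Dictionary.Add throws ArgumentException on a duplicate. A generic sketch (hypothetical `BuildOffsets` helper, not the PR code):

```csharp
using System.Collections.Generic;

static class IndexerSketch
{
    // Builds delimiter offsets with the indexer. For a delimiter with
    // duplicate chars, each assignment overwrites the previous one,
    // so the last occurrence's offset wins; Dictionary.Add would
    // instead throw on the duplicate key.
    internal static Dictionary<char, int> BuildOffsets(string delimiter)
    {
        var offsets = new Dictionary<char, int>(delimiter.Length);
        for (int i = 0; i < delimiter.Length - 1; i++)
        {
            offsets[delimiter[i]] = delimiter.Length - i - 1; // overwrite, don't Add
        }
        return offsets;
    }
}
```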

Collaborator Author


> Also, do we need to be worried about char collisions using the low byte?

There is a comment about this in the code. If a file contains chars from many languages, we'll have many collisions and lose performance. In practice, we can expect files to contain chars from one or two languages, so in the common case we'll have no collisions and the best performance.
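A collision here means two distinct chars whose code points agree modulo 256, e.g. 'A' (U+0041) and 'Ł' (U+0141) both have low byte 0x41. A quick illustration with a hypothetical helper:

```csharp
static class CollisionSketch
{
    // Two distinct chars collide in a low-byte table when their code
    // points agree modulo 256. Mixed-script text makes this more
    // likely; text in one or two languages rarely triggers it, and a
    // collision only costs an extra (still correct) comparison.
    internal static bool CollidesLowByte(char a, char b)
    {
        return a != b && unchecked((byte)a) == unchecked((byte)b);
    }
}
```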

@iSazonov (Collaborator, Author) Oct 25, 2018


> I think this can be

Thanks! Good catch!

Fixed.

Contributor


currentChar can be defined below where it is used.

Collaborator Author


Fixed.

@iSazonov iSazonov force-pushed the optimizations-in-getcontent2 branch from 4a7488a to 07af4ca on October 25, 2018 at 04:52
@PaulHigin (Contributor) left a comment


LGTM

@iSazonov (Collaborator, Author)

@anmenaga @SteveL-MSFT Do you have any comments?

@anmenaga

Get-Content is a frequently used cmdlet. I would like to see a green [feature] test run for this. Thanks.

@iSazonov iSazonov merged commit 0e19042 into PowerShell:master Oct 26, 2018
@iSazonov iSazonov deleted the optimizations-in-getcontent2 branch October 26, 2018 03:56
