Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? Or for example, a search between line l1 and line l2 without having to go through the first l1-1 lines?

  • Is bash your language of preference for this? Commented May 3, 2015 at 1:02
  • Correct, I use only bash on the terminal and when scripting. Commented May 3, 2015 at 1:06
  • How big is your file and are your lines the same size? If you have lines that are all the same size you can do fixed byte offsets which will be much faster. Commented May 3, 2015 at 1:39
  • A file can be the size of a book, let's say up to 1000 pages or a bit more. Commented May 3, 2015 at 1:41
  • That's tiny for a computer. You are very unlikely to see actual improved performance by parallelizing the task. Commented May 3, 2015 at 1:43

3 Answers


You can use tail -n +N file | grep to begin a grep at a given line offset.

You can combine head with tail to search over just a fixed range.
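For example (file.txt and 'pattern' are placeholders here), to search only lines 100 through 200:

tail -n +100 file.txt | head -n 101 | grep 'pattern'

The head count is 101 because lines 100 through 200 inclusive span 101 lines.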

However, this still must scan the file for end of line characters.

In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.

For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at about 1 KB per page, so 1000 pages would still be only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.

You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
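For example, on Linux with the sysstat package installed, you can watch the disk while a search runs:

iostat -x 1

Keep an eye on the %util column for the disk holding your file; if it stays well below 100%, the disk is not the bottleneck.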

If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any kind of Unicode translation.
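For example ('pattern' and file.txt are placeholders):

LC_ALL=C grep 'pattern' file.txt

Setting LC_ALL=C for just that one command lets grep compare plain bytes instead of doing multibyte character handling.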

If you need to do multiple searches over the same file, it would be worthwhile to build an inverted index to support the searches. For code there are tools like Exuberant Ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpora, but that's probably overkill for you. You could even load the file into a database like PostgreSQL that supports full-text search and have it build an index for you.
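As a minimal sketch of the indexing idea, narrowed to the original line-range question (assuming single-byte characters, Unix line endings, and a file that does not change between searches), you could record each line's starting byte offset once and then seek straight to it on later searches:

# build once: line number and starting byte offset of each line
awk '{ print NR, off; off += length($0) + 1 }' file.txt > file.idx

# later: look up line 100's offset and jump there (GNU tail can seek rather than read on regular files)
off=$(awk '$1 == 100 { print $2 }' file.idx)
tail -c +$((off + 1)) file.txt | grep 'pattern'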

Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an I/O throughput issue; you can see that for yourself by simply moving the file to a temporary RAM disk. That removes all potential disk I/O. If it's still not fast enough, you're going to have to pursue an entirely different solution.
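For example, on most Linux systems /dev/shm is already a RAM-backed tmpfs, so the test is two commands:

cp myfile.txt /dev/shm/
time grep 'pattern' /dev/shm/myfile.txt

If that is no faster than reading from disk, the bottleneck is not I/O.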


If your lines are fixed length, you can use dd to read a particular section of the file:

dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line - 1> | other_commands

Note that dd will read from disk using the block size specified for input (bs). Reading one line per block may be slow, so you can batch the reads by pulling a group of lines at once, ideally at least 4 KB per read. In that case, look at the skip_bytes and count_bytes flags, which let you start and end at positions that are not a multiple of your block size. Another option worth tuning is the output block size (obs), which may work best either matching the input block size or set to a single line.
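As a hedged example of that batching (assuming GNU dd, which provides these flags, and a hypothetical fixed line length of 90 bytes), this feeds lines 100 through 199 to grep using 4 KB reads:

dd if=myfile.txt bs=4096 iflag=skip_bytes,count_bytes skip=$((99 * 90)) count=$((100 * 90)) 2>/dev/null | grep 'pattern'

Here skip and count are measured in bytes rather than blocks, so the range does not need to align with bs.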


The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

5 Comments

Fair enough. Suppose I make the file have 90-character lines by padding them with whatever filler characters are necessary. What commands would I need to scan from line l1 to line l2?
In bash, you'd use head and tail as suggested before :) But they still do the read anyway, so it isn't as efficient as you might think. If you want this to be super efficient, you'd probably be better off writing a simple C program for it yourself. Basic file I/O isn't that hard to do, especially if it's only reading. Then you can seek (i.e. move the file pointer) within an open file without actually reading the data.
Does that require fixed line lengths?
It depends on how accurate you want it to be. Otherwise you'll always get back to the posed issue of scanning for newlines.
Accurate?? What's that about?
