Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? For example, can I search between line l1 and line l2 without having to read through the first l1-1 lines?
-
Is bash your language of preference for this? – Patrick Roberts, May 3, 2015 at 1:02
-
Correct, I use only bash on the terminal and when scripting. – Zeus, May 3, 2015 at 1:06
-
How big is your file and are your lines the same size? If you have lines that are all the same size you can do fixed byte offsets, which will be much faster. – b4hand, May 3, 2015 at 1:39
-
A file can be the size of a book, let's say up to 1000 pages or a bit more. – Zeus, May 3, 2015 at 1:41
-
That's tiny for a computer. You are very unlikely to see actual improved performance by parallelizing the task. – b4hand, May 3, 2015 at 1:43
3 Answers
You can use tail -n +N file | grep to begin a grep at a given line offset.
You can combine head with tail to search over just a fixed range.
However, this still must scan the file for end of line characters.
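For example, a range search over lines 100 through 200 might look like this (the line numbers, file name, and pattern are just placeholders):

# tail -n +100 starts output at line 100; head -n 101 keeps lines 100 through 200
tail -n +100 myfile.txt | head -n 101 | grep 'pattern'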
In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.
For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at roughly 1 kB per page, so 1000 pages is still only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.
You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
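If the sysstat package is installed, a quick way to watch this while a search is running (a rough check, not a benchmark):

# print extended per-device statistics once per second; watch the read throughput and %util columns
iostat -x 1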
If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any type of Unicode translation.
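For a single invocation, that can be as simple as overriding the locale on the command line:

# force the C locale so grep can match plain bytes instead of multibyte characters
LC_ALL=C grep 'pattern' myfile.txt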
If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code there are tools like Exuberant Ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpora, but that's probably overkill for you. You could even load the file into a database like PostgreSQL that supports full-text search and have it build an index for you.
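For the code case, a rough sketch with ctags might look like this (my_function is a hypothetical symbol name):

ctags -R .          # build a 'tags' index of the source tree once
vim -t my_function  # jump straight to the definition without re-scanning the tree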
Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an I/O throughput issue; you could see that for yourself by simply moving the file to a temporary RAM disk that you create, which removes all potential I/O. If that's still not fast enough for you, then you're going to have to pursue an entirely different solution.
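On most Linux systems /dev/shm is already a tmpfs (RAM-backed) mount, so a quick version of that experiment could be:

# copy the file into RAM and time the same search with disk I/O taken out of the picture
cp myfile.txt /dev/shm/
time grep 'pattern' /dev/shm/myfile.txt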
Comments
If your lines are fixed length, you can use dd to read a particular section of the file:
dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands
Note that dd will read from disk using the block size specified for input (bs). That might be slow, so reads can be batched by pulling a group of lines at once so that each read fetches at least 4 kB from disk. In that case you want to look at the skip_bytes and count_bytes flags to be able to start and end at lines that are not multiples of your block size.
Another interesting option is the output block size obs, which could benefit from being either the same as the input block size or a single line.
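With GNU dd, a sketch using those byte flags might look like this (the offsets and pattern are placeholders):

# read 1 MiB starting at byte offset 123456, independent of the block size, and search only that slice
dd if=myfile.txt bs=4k iflag=skip_bytes,count_bytes skip=123456 count=1048576 2>/dev/null | grep 'pattern'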
Comments
The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)
5 Comments
For reading lines l1 to l2: head and tail as suggested before :) But they do the read anyway, so it wouldn't be as optimal as you might think. If you want this to be super efficient, you'd probably be better off writing a simple C program for it yourself. Basic file I/O isn't that hard to do, especially if it's only reading. Then you can seek (i.e. move the file pointer) within an open file without actually reading the data.
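A rough shell analogue of that seek idea: on a regular file, GNU dd will typically lseek() over the skipped input rather than reading it (START and LEN below are hypothetical byte offsets):

START=1048576   # byte offset to begin reading at
LEN=65536       # number of bytes to read
# skip to START without reading the preceding data, then search only that window
# (in practice use a larger bs with iflag=skip_bytes,count_bytes, as in the dd answer above)
dd if=myfile.txt bs=1 skip="$START" count="$LEN" 2>/dev/null | grep 'pattern'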