Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? For example, can I search between line l1 and line l2 without having to read through the first l1-1 lines?
-
Is bash your language of preference for this? – Patrick Roberts, May 3, 2015 at 1:02
-
Correct, I use only bash on the terminal and when scripting. – Zeus, May 3, 2015 at 1:06
-
How big is your file and are your lines the same size? If you have lines that are all the same size you can do fixed byte offsets, which will be much faster. – b4hand, May 3, 2015 at 1:39
-
A file can be the size of a book, let's say up to 1000 pages or a bit more. – Zeus, May 3, 2015 at 1:41
-
That's tiny for a computer. You are very unlikely to see actual improved performance by parallelizing the task. – b4hand, May 3, 2015 at 1:43
3 Answers
You can use tail -n +N file | grep to begin a grep at a given line offset.
You can combine head with tail to search over just a fixed range.
However, this still must scan the file for end of line characters.
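For example, a range search over lines 100 through 200 might look like this (the line numbers, file name, and pattern are just placeholders):

# tail -n +100 starts output at line 100; head -n 101 keeps lines 100 through 200
tail -n +100 myfile.txt | head -n 101 | grep 'pattern'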
In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.
For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at roughly 1 kB per page, so 1000 pages is still only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.
You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
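If the sysstat package is installed, a quick way to watch this while a search is running (a rough check, not a benchmark):

# print extended per-device statistics once per second; watch the read throughput and %util columns
iostat -x 1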
If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any type of Unicode translation.
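For a single invocation, that can be as simple as overriding the locale on the command line:

# force the C locale so grep can match plain bytes instead of multibyte characters
LC_ALL=C grep 'pattern' myfile.txt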
If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code there are tools like Exuberant Ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpora, but that's probably overkill for you. You could even load the file into a database like PostgreSQL that supports full-text search and have it build an index for you.
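For the code case, a rough sketch with ctags might look like this (my_function is a hypothetical symbol name):

ctags -R .          # build a 'tags' index of the source tree once
vim -t my_function  # jump straight to the definition without re-scanning the tree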
Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an I/O throughput issue; you could see that for yourself by simply moving the file to a temporary RAM disk that you create, which removes all potential I/O. If that's still not fast enough for you, then you're going to have to pursue an entirely different solution.
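On most Linux systems /dev/shm is already a tmpfs (RAM-backed) mount, so a quick version of that experiment could be:

# copy the file into RAM and time the same search with disk I/O taken out of the picture
cp myfile.txt /dev/shm/
time grep 'pattern' /dev/shm/myfile.txt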
Comments
If your lines are fixed length, you can use dd to read a particular section of the file:
dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands
Note that dd will read from disk using the block size specified for input (bs). That might be slow, so reads can be batched by pulling a group of lines at once so that each read fetches at least 4 kB from disk. In that case you want to look at the skip_bytes and count_bytes flags to be able to start and end at lines that are not multiples of your block size.
Another interesting option is the output block size obs, which could benefit from being either the same as the input block size or a single line.
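With GNU dd, a sketch using those byte flags might look like this (the offsets and pattern are placeholders):

# read 1 MiB starting at byte offset 123456, independent of the block size, and search only that slice
dd if=myfile.txt bs=4k iflag=skip_bytes,count_bytes skip=123456 count=1048576 2>/dev/null | grep 'pattern'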
Comments
The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)
5 Comments
For reading lines l1 to l2: head and tail as suggested before :) But they do the read anyway, so it wouldn't be as optimal as you might think. If you want this to be super efficient, you'd probably be better off writing a simple C program for it yourself. Basic file I/O isn't that hard to do, especially if it's only reading. Then you can seek (i.e. move the file pointer) within an open file without actually reading the data.
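A rough shell analogue of that seek idea: on a regular file, GNU dd will typically lseek() over the skipped input rather than reading it (START and LEN below are hypothetical byte offsets):

START=1048576   # byte offset to begin reading at
LEN=65536       # number of bytes to read
# skip to START without reading the preceding data, then search only that window
# (in practice use a larger bs with iflag=skip_bytes,count_bytes, as in the dd answer above)
dd if=myfile.txt bs=1 skip="$START" count="$LEN" 2>/dev/null | grep 'pattern'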