I'm a complete pyparsing newbie, and am trying to parse a large file with multi-line blocks describing archive files and their contents.
So far I can parse a single item. Note that the string has no leading newline; this hardcoded test data approximates reading in a real file:
import pyparsing as pp
one_archive = \
"""archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
"""
pp.ParserElement.set_default_whitespace_chars(' \t')
EOL = pp.LineEnd().suppress()
start_of_archive_block = pp.LineStart() + pp.Keyword('archive (') + EOL
end_of_archive_block = pp.LineStart() + ')' + EOL
archive_filename = pp.LineStart() \
+ pp.Keyword('name').suppress() \
+ pp.Literal('"').suppress() \
+ pp.SkipTo(pp.Literal('"')).set_results_name("archive_name") \
+ pp.Literal('"').suppress() \
+ EOL
field_elem = pp.Keyword('name').suppress() + pp.SkipTo(pp.Literal(' size')).set_results_name("filename") \
^ pp.Keyword('size').suppress() + pp.SkipTo(pp.Literal(' date')).set_results_name("size") \
^ pp.Keyword('date').suppress() + pp.SkipTo(pp.Literal(' crc')).set_results_name("date") \
^ pp.Keyword('crc').suppress() + pp.SkipTo(pp.Literal(' )')).set_results_name("crc")
fields = field_elem * 4
filerow = pp.LineStart() \
+ pp.Literal('file (').suppress() \
+ fields \
+ pp.Literal(')').suppress() \
+ EOL
archive = start_of_archive_block.suppress() \
+ archive_filename \
+ pp.OneOrMore(pp.Group(filerow)) \
+ end_of_archive_block.suppress()
archive.parse_string(one_archive, parse_all=True)
The result is a ParseResults object with all the data I need from that single archive. (For some reason, the trailing newline in the input string causes no problems, even though I do nothing to handle it explicitly.)
However, try as I might, I cannot get from this point to a point where I could parse the following, more realistic data. The new features I need to handle are:
- a single file_metadata block that starts the file (I don't need it in my parsing results, it can be skipped entirely)
- multiple archive items
- newlines between the archive items
realistic_data = \
"""
file_metadata (
description: blah blah etc.
author: john doe
version: 0.99
)
archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
archive (
name "naughty or nice.zip"
file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
file ( name nice.exe size 298234 date 2021/8/4 10:19:56 crc 99FD31AE )
file ( name whatever.jpg size 25603 date 2021/8/5 11:03:09 crc ABFAC314 )
)
"""
I've been semi-randomly trying a variety of things, but I have large fundamental gaps in my understanding of how pyparsing works, so they're not worth itemizing here. Someone who knows what they're doing can probably immediately see what to do here.
My ultimate goal is to parse all of these archive items and store them in a database.
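For context, here is a rough sketch of what that storage step could look like, using the standard-library sqlite3 module. Everything in it is an assumption made for illustration: the table names (archive, archive_file), the column names, the store_archives helper, and the list-of-dicts layout are hypothetical, and the records are copied by hand from the sample data above rather than produced by the parser.

```python
import sqlite3

# Hypothetical, hand-written stand-in for the parser output: one dict per
# archive block, each holding its name and a list of file rows.
parsed_archives = [
    {"name": "something wicked this way comes.zip",
     "files": [
         {"filename": "wicked.exe", "size": 140084,
          "date": "2022/12/24 23:32:00", "crc": "B2CF5E58"},
         {"filename": "readme.txt", "size": 1704,
          "date": "2022/12/24 23:32:00", "crc": "37F73AEE"},
     ]},
    {"name": "naughty or nice.zip",
     "files": [
         {"filename": "naughty.exe", "size": 187232,
          "date": "2021/8/4 10:19:55", "crc": "638BC6AA"},
         {"filename": "nice.exe", "size": 298234,
          "date": "2021/8/4 10:19:56", "crc": "99FD31AE"},
         {"filename": "whatever.jpg", "size": 25603,
          "date": "2021/8/5 11:03:09", "crc": "ABFAC314"},
     ]},
]


def store_archives(conn, archives):
    """Insert each archive and its file rows (schema is an assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive_file ("
        "archive_id INTEGER NOT NULL REFERENCES archive(id), "
        "filename TEXT, size INTEGER, date TEXT, crc TEXT)"
    )
    for item in archives:
        # lastrowid gives the id of the archive row just inserted.
        cur = conn.execute(
            "INSERT INTO archive (name) VALUES (?)", (item["name"],)
        )
        archive_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO archive_file VALUES (?, ?, ?, ?, ?)",
            [(archive_id, f["filename"], f["size"], f["date"], f["crc"])
             for f in item["files"]],
        )
    conn.commit()


# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
store_archives(conn, parsed_archives)
```

So whatever the parser ends up returning, I need to be able to turn it into one record per archive with a list of file rows attached, roughly as above.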
What's the solution?