I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
list:
- XXitem
- XXanotherItem
key1: value1
product: milk
release: now
subSection:
skey : sval
slist:
- XXitem
mods:
- XXone
- XXtwo
version: last
sectionTwo:
base: base-0.1
config: config-7.0-7
As you can see is an indented configuration file, and this is more or less how I have tried to define the grammar
- The file can have one or more sections
- Each section is formed by a section name and a section content.
- Each section have an indented content
- Each section content can have one or more pairs of key/value or a subsection.
- Each value can be just a single word or a list of items.
- A list of items is a group of one or more items.
- Each item is an HYPHEN + a name starting with 'XX'
I have tried to create this grammar using pyparsing but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
try:
with open('simple.info', 'r') as content_file:
content = content_file.read()
print "content:\n", content
print "\n"
result = parser.parseString(content)
print "result1:\n", result
print "len", len(result)
pprint.pprint(result.asList())
except pyparsing.ParseException, err:
print err.line
print " " * (err.column - 1) + "^"
print err
except pyparsing.ParseFatalException, err:
print err.line
print " " * (err.column - 1) + "^"
print err
if __name__ == '__main__':
main()
This is the result :
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
['sectionOne',
[[
['list', ['XXitem', 'XXanotherItem']],
['key1', 'value1'],
['product', 'milk'],
['release', 'now'],
['subSection',
[[
['skey', 'sval'],
['slist', ['XXitem']],
['mods', ['XXone', 'XXtwo']],
['version', 'last']
]]
]
]]
],
['sectionTwo',
[[
['base', 'base-0.1'],
['config', 'config-7.0-7']
]]
]
]
As you can see I have two main problems:
1.- Each section content is nested twice into a list
2.- the key "version" is parsed inside the "subSection" when it belongs to the "sectionOne"
My real target is to be able to get a structure of python nested dictionaries with the keys and values to easily extract the info for each field, but the pyparsing.Dict is something obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )