Cannot parse correctly this file with pyparsing

Question

I am trying to parse a file using the amazing Python library pyparsing, but I am running into a lot of problems.

The file I am trying to parse is something like:

sectionOne:
  list:
  - XXitem
  - XXanotherItem
  key1: value1
  product: milk
  release: now
  subSection:
    skey : sval
    slist:
    - XXitem
  mods:
  - XXone
  - XXtwo
  version: last
sectionTwo:
  base: base-0.1
  config: config-7.0-7

As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:

The file can have one or more sections.
Each section is formed by a section name and a section content.
Each section has indented content.
Each section content can have one or more key/value pairs or subsections.
Each value can be either a single word or a list of items.
A list of items is a group of one or more items.
Each item is a hyphen followed by a name starting with 'XX'.

I have tried to create this grammar using pyparsing but with no success.

import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")

list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))

key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)

pair = pyparsing.Group(key + value)

indentStack = [1]

section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)

section << pyparsing.Group(section_name + section_content)

parser = pyparsing.OneOrMore(section)

def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()

        print("content:\n", content)
        print("\n")
        result = parser.parseString(content)
        print("result1:\n", result)
        print("len", len(result))

        pprint.pprint(result.asList())
    except (pyparsing.ParseException, pyparsing.ParseFatalException) as err:
        # Show the offending line with a caret under the failing column.
        print(err.line)
        print(" " * (err.column - 1) + "^")
        print(err)


if __name__ == '__main__':
    main()

This is the result:

result1:
  [['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
  len 2
  [
     ['sectionOne',
     [[
        ['list', ['XXitem', 'XXanotherItem']],
        ['key1', 'value1'],
        ['product', 'milk'],
        ['release', 'now'],
        ['subSection',
           [[
              ['skey', 'sval'],
              ['slist', ['XXitem']],
              ['mods', ['XXone', 'XXtwo']],
              ['version', 'last']
           ]]
        ]
     ]]
     ],
     ['sectionTwo', 
     [[
        ['base', 'base-0.1'], 
        ['config', 'config-7.0-7']
     ]]
     ]
  ]

As you can see, I have two main problems:

1. The content of each section is nested twice in a list.

2. The key "version" is parsed inside "subSection" when it belongs to "sectionOne".

My real goal is to get a structure of nested Python dictionaries with the keys and values, so that I can easily extract the info for each field, but pyparsing.Dict is rather obscure to me.

Could anyone please help me?

Thanks in advance.

(Sorry for the long post.)

Your config format looks like YAML. Can't you just use PyYAML instead of parsing "by hand"?
– Nikita Nemkin, Commented Jun 26, 2013 at 8:33
Thanks for your comment, @NikitaNemkin. Yes, I could use PyYAML; in fact, that is what we are using right now. But the files I have to parse come from a stakeholder who tends to introduce minor modifications, so we want to develop an in-house parser that we can change accordingly.
– thamurath, Commented Jun 26, 2013 at 12:45

PaulMcG · Accepted Answer · 2013-06-29 15:51:09Z

You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.

Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.

The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
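To see that extra layer in isolation, here is a minimal sketch. The `word` and `number` expressions and the input string are made up for illustration and are not part of the config grammar; the point is only that `Group` adds one level of nesting and `ungroup` removes one level again.

```python
import pyparsing as pp

# Hypothetical toy expressions, used only to show the nesting.
word = pp.Word(pp.alphas)
number = pp.Word(pp.nums)

# Group wraps the matched tokens in a nested list...
grouped = pp.Group(word + number)
# ...and ungroup strips exactly one such layer again.
flattened = pp.ungroup(grouped)

grouped_tokens = grouped.parseString("abc 123").asList()
flattened_tokens = flattened.parseString("abc 123").asList()

print(grouped_tokens)    # [['abc', '123']]
print(flattened_tokens)  # ['abc', '123']
```

In the grammar below, the same idea is applied to the outer group that indentedBlock produces, so that Dict receives the individual pair groups directly.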

Best of luck with pyparsing!

import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")

list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))

key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)

#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)

indentStack = [1]

section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)

#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))

#~ A: section << pyparsing.Group(section_name + section_content)
section << (section_name + section_content)

#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))

Now instead of pprint(result.asList()) you can write:

print (result.dump())

to show the Dict hierarchy:

[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7

allowing you to write statements like:

print (result.sectionTwo.base)
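As a smaller illustration of what Dict does with grouped pairs, the sketch below parses just the two `sectionTwo` lines as a flat key/value list. The names `key`, `value`, `pair` and `flat_config` are simplified stand-ins assumed for this example; they are not the grammar above and they ignore indentation entirely.

```python
import pyparsing as pp

# Simplified stand-ins for the key/value expressions (assumed for this sketch).
key = pp.Word(pp.alphanums + "_")
value = pp.Word(pp.alphanums + "_-.")

# Each pair must be its own Group: Dict uses the first token of every
# group as the name and the remaining token(s) as the value.
pair = pp.Group(key + pp.Suppress(":") + value)
flat_config = pp.Dict(pp.OneOrMore(pair))

flat_result = flat_config.parseString("base: base-0.1\nconfig: config-7.0-7")

print(flat_result.base)       # base-0.1
print(flat_result["config"])  # config-7.0-7
print(flat_result.asDict())   # plain dict with the same two keys
```

If plain Python dictionaries are the end goal, as in the question, `asDict()` converts the named results into a regular dict that no longer depends on pyparsing.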

Thanks for your quick response. It works perfectly now. Your library is very interesting; it just lacks a little documentation. Thanks again, and very nice work!
