Genbank LOCUS line parsing #833

@jamesmorris

Hi,

I have been upgrading the version of BioJava used in a code base that previously used version 1.7.

Most of the BioJava code was used for working with GenBank files, and I have been altering this code to be compatible with version 5 of BioJava.

During this work I came across an issue with the locus line parsing in the GenbankSequenceParser class.

This became apparent because tests that passed with 1.7 now fail with 5.1, throwing "Bad locus line" exceptions when parsing the unchanged test GenBank files.

These test failures are caused by a few changes in the way the locus line is parsed:
Units (bp/aa) - required to be lower case
Topology (linear/circular) - required to be lower case
Strand (ds-/ss-/ms-) - required to be lower case
At least one whitespace character is also required after the units, even though all of the following fields are captured optionally.
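
To make the points above concrete, here is a minimal sketch of a more lenient locus line pattern. It is not the Pattern used by GenbankSequenceParser: the class name `LocusLineSketch`, the simplified field layout, and the sample locus lines are all assumptions made for illustration. It matches units, strand, and topology case-insensitively and does not require anything after the units.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the pattern used by BioJava's GenbankSequenceParser.
final class LocusLineSketch {

    // Simplified LOCUS layout (an assumption): name, length, units, then an
    // optional strand + molecule type and an optional topology. Anything
    // after that (division, date) is accepted but ignored.
    private static final Pattern LOCUS = Pattern.compile(
            "^LOCUS\\s+(\\S+)\\s+(\\d+)\\s+(bp|aa)"
            + "(?:\\s+(ss-|ds-|ms-)?(\\w*[DR]NA))?"
            + "(?:\\s+(linear|circular))?"
            + "(?:\\s+.*)?$",
            Pattern.CASE_INSENSITIVE);

    // Returns {name, length, units, strand, moleculeType, topology},
    // or null if the line does not match. Missing optional fields are null.
    static String[] parse(String line) {
        Matcher m = LOCUS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1),
            m.group(2),
            lower(m.group(3)), // units normalised to lower case
            lower(m.group(4)), // strand normalised to lower case
            m.group(5),
            lower(m.group(6))  // topology normalised to lower case
        };
    }

    private static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Made-up locus lines: upper-case fields, and nothing after the units.
        System.out.println(Arrays.toString(
                parse("LOCUS       AB000001     1200 BP    DS-DNA     CIRCULAR BCT 01-JAN-2000")));
        System.out.println(Arrays.toString(parse("LOCUS       X 10 aa")));
    }
}
```

Normalising the captured values to lower case after matching would keep any downstream comparisons against "bp", "linear", and so on unchanged, while still accepting older files that use upper case.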

Before I submit a pull request that changes the Pattern capturing the locus line fields to correct these issues, I wanted to check: is there any reason for the case sensitivity?

Thanks,
James
