Add an EMBL file parser to BioJava #621#713
Conversation
josemduarte
left a comment
There was a problem hiding this comment.
Thanks for the pull request!
It looks to me like the EmblParser class shouldn't be called like that but rather EmblRecord or something similar, since it is the top level container for the parsed information. I'd also rather have the parse method as a static taking the file as argument and returning the EmblRecord. Javadocs explaining how the parser should be used would be nice.
| public class EmblId { | ||
|
|
||
|
|
||
| private String PrimaryAccession; |
There was a problem hiding this comment.
By convention, variable names in java should start with lower case.
There was a problem hiding this comment.
sorry for that.
| this.file = file; | ||
| } | ||
|
|
||
| public void parse() { |
There was a problem hiding this comment.
This method should throw an IOException instead of catching it. Otherwise the user would not be able to handle it properly in their own application.
|
|
||
| @Test | ||
| public void test(){ | ||
| File file = new File("/home/pslpt219/Desktop/Homo.dat"); |
There was a problem hiding this comment.
This won't work for any other user. You will need to add the file to the main/resources folder and read it from there. See how it is done elsewhere in biojava. If the file is very big, please gzip it before adding it.
There was a problem hiding this comment.
thank you for the comment i have moved the file into the src/test/resources.
| } | ||
|
|
||
| @Test | ||
| public void test(){ |
There was a problem hiding this comment.
You will need to assert something in the test, e.g. whether fields are not null and checking that values read for the fields are as expected.
There was a problem hiding this comment.
sorry for that i didn't write full tests for the class but i'm working to write cases to cover more percentage of the code.
|
|
||
| /** | ||
| * This class should parse the data of embl file identification | ||
| * @author Noor Aldeen Al Mbaidin |
There was a problem hiding this comment.
Could you add @since 5.0.0 tag to all the new classes here? Thanks!
|
@josemduarte |
josemduarte
left a comment
There was a problem hiding this comment.
Thanks a lot for the changes, they look good! Just one more small issue with the test. We can merge then unless someone else wants to comment something
|
|
||
| @Test(expected = IllegalArgumentException.class) | ||
| public void givenDirectoryWhenProcessEmblFileThenThrowException() throws IOException { | ||
| File file = new File("./src/test/resources"); |
There was a problem hiding this comment.
This is not quite right. See for instance how it is done here:
The important thing is to use getResourceAsStream which will work in all situations (in jar packaging for instance).
| private String databaseCrossReference; | ||
| private String assemblyHeader; | ||
| private String assemblyInformation; | ||
| private String CON; |
There was a problem hiding this comment.
This should be lower case
There was a problem hiding this comment.
I renamed as constructedSequence.
| */ | ||
| public class EmblReader { | ||
|
|
||
| private StringBuilder sequence = new StringBuilder(""); |
There was a problem hiding this comment.
This could be removed, it doesn't seem to get used
|
|
||
| import java.io.*; | ||
| import java.util.LinkedList; | ||
| import java.util.List; |
| @@ -0,0 +1,108 @@ | |||
| package org.biojava.nbio.core.sequence.io.embl; | |||
There was a problem hiding this comment.
All source code in biojava requires the license header
| private String taxonomicDivision; | ||
| private String sequenceLength; | ||
|
|
||
| public EmblId() { |
There was a problem hiding this comment.
This no-arg public constructor can be removed
| * @since 5.0.0 | ||
| * @author Noor Aldeen Al Mbaidin | ||
| */ | ||
| public class EmblId { |
There was a problem hiding this comment.
If all the fields were marked final and only settable in the constructor, this class could be marked @Immutable
| @@ -0,0 +1,147 @@ | |||
| package org.biojava.nbio.core.sequence.io.embl; | |||
There was a problem hiding this comment.
All source code in biojava requires the license header
|
|
||
| private StringBuilder sequence = new StringBuilder(""); | ||
|
|
||
| public EmblReader() { |
There was a problem hiding this comment.
This no-arg public constructor can be removed
| String lineInfo; | ||
| try (BufferedReader bufferedReader = new BufferedReader(fileReader)) { | ||
| while ((line = bufferedReader.readLine()) != null) { | ||
| lineInfo = line.substring(0, 2); |
There was a problem hiding this comment.
What is the difference between lineInfo and lineIdentifier here? Note also that there is no check if line is long enough for this substring call not to fail.
There was a problem hiding this comment.
Sorry this was a mistake , I have added validation for line length.
| @@ -0,0 +1,291 @@ | |||
| package org.biojava.nbio.core.sequence.io.embl; | |||
There was a problem hiding this comment.
All source code in biojava requires the license header
| @@ -0,0 +1,134 @@ | |||
| package org.biojava.nbio.core.sequence.io.embl; | |||
There was a problem hiding this comment.
All source code in biojava requires the license header
| @@ -0,0 +1,47 @@ | |||
| package org.biojava.nbio.core.sequence.io.embl; | |||
There was a problem hiding this comment.
All source code in biojava requires the license header
|
@heuermh could you approve the latest changes so that we can merge? |
|
@heuermh may i ask you to review the changes ? |
|
Dear Noor Aldeen Al Mbaidi,
Thank you for your work, I may be a potential heavy user of your parser.
I see that you have crafted an EmblRecord class. If I can add my humble
opinion, it would be great if you could convey the information parsed in
the standard data structure of biojava for this things, i.e. the sequence
class (to be more precise, classes inheriting AbstractSequence). In this
way all the sequence downstream processing out there will greatly benefit
of your work.
An easy parallelism can be made with Genbank sequence parsing. Look at
/biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/GenbankSequenceParser.java
which is triggered by
core/src/main/java/org/biojava/nbio/core/sequence/loader/GenbankProxySequenceReader.java
Unfortunately, I know that this would be more challenging and may go
beyhond what you meant, but it would be more consistent with the framework
provided.
Anyway, I really appreciate the already done work, thank you again.
Paolo
Il giorno mer 20 dic 2017 alle ore 07:40 Noor Aldeen Al Mbaidin <
notifications@github.com> ha scritto:
… @heuermh <https://github.com/heuermh> may i ask you to review the changes
?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#713 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABwuq_gKrA7O4PFQ2khPtodKf7q2nh71ks5tCKv3gaJpZM4QljWO>
.
|
|
Dear @paolopavan |
|
Great, wonderful news ! Remember to have a look to the proxy reader
mechanism (eg GenbankProxySequenceReader) to link your work consistently
with the project design.
Il giorno mer 27 dic 2017 alle 08:56 Noor Aldeen Al Mbaidin <
notifications@github.com> ha scritto:
… Dear @paolopavan <https://github.com/paolopavan>
thank you for your opinion, I will try to use the AbstractSequence class
and upload the changes as soon as possible.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#713 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABwuq9i32FkV3aEHqb1VIYDZROswrBIMks5tEfg4gaJpZM4QljWO>
.
|
|
I'm going to go ahead and merge this since @heuermh's requested changes have been satisfied. @paolopavan's suggestion to return sequence objects is a good idea. We've seen that this can be a bit tricky to do in a type-safe manner (#354), so I think it's worth submitting that as a second PR. |
License headers have been added. All requested changes are satisfied.
|
@sbliven I have further concerns but as you say those could be addressed in a future PR. |
This is an implementation for first version for embl parser feature with issue #621 Add an EMBL file parser to BioJava.