Comments (15)
I'm happy to review this.
from joss-reviews.
I've now added a "Contributing" section to the README, which additionally links to the other github repositories whose code is used here.
Regarding automated tests, I do agree they'd start to be useful now that the tool is seeing wider use. I'll start adding them as I do further work, but right now don't have the time to draft a comprehensive battery from the start.
from joss-reviews.
/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?
If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this
will suffice.
Reviewer instructions
- Please work through the checklist at the start of this issue.
- If you need any further guidance/clarification take a look at the reviewer guidelines here http://joss.theoj.org/about#reviewer_guidelines
- Please make a publication recommendation at the end of your review
Any questions, please ask for help by commenting on this issue!
from joss-reviews.
I'm teaching a class at a linguistics summer school all next week so I won't be able to get to this until the week after. I'm sorry for the delay and happy to yield to another reviewer if they want to review it in the meantime.
from joss-reviews.
Installation
There's no specific installation section just a section called "Running". It would be nice if we could just rename that section to "Installation" even though there's nothing really to install since that's what people generally look for. And it would be nice to have that section contain the actual steps needed to run LAS on the example referenced later.
Another point is that the las
binary that you download from the releases page needs to be chmod'ed 755 before it can run at least on my mac.
Functionality
I downloaded the file and tested on some Hindi text and it seems to work fine. However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi
(denoting Hindi) but for some reason it gets over-ruled by the language recognizer?
Update: Actually, I just noticed that the command line tool created .language
file which also misidentifies the Hindi text as German.
Another problem I had was that when I called las recognize test.txt
on my Hindi source file instead of las identify test.txt
, I get the following exception.
10:15:03.846 [main] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[hi:0.8542185083632905], DetectedLanguage[mr:0.14570001916332004]]
Exception in thread "main" java.lang.IllegalArgumentException: in must not be null!
at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:125)
at fi.seco.lexical.combined.CombinedLexicalAnalysisService.getTokenizer(CombinedLexicalAnalysisService.java:125)
at fi.seco.lexical.combined.CombinedLexicalAnalysisService.recognize(CombinedLexicalAnalysisService.java:617)
at LASCommandLineTool$.recognize(LASCommandLineTool.scala:299)
at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:219)
at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:218)
at scala.collection.immutable.List.foreach(List.scala:381)
at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:218)
at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:209)
at scala.collection.immutable.List.foreach(List.scala:381)
at LASCommandLineTool$.main(LASCommandLineTool.scala:209)
at LASCommandLineTool.main(LASCommandLineTool.scala)
Is it simply because the LAS recognizer doesn't actually support Hindi (according to the help message)? If so, a more interpretable exception might be more useful here.
Performance
Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that. For example, running las identify
on the same single sentence text file took about 25-30 seconds no matter how many times I re-ran it.
Documentation
This section seems lacking in general.
There isn't really a statement of need. It would be nice to have something that explains what the tool is useful for and also all of the modes of analysis it provides (recognition, identification etc.).
The package is self-contained so there's no need for dependencies, I guess. However, there's no information for people who are interested in contributing. There also don't seem to be any automated tests to run.
There's an example which is useful but it seems quite contrived. It would be nice to have actual examples for all of the analysis modes.
Software paper
The paper actually does have actual text motivating the need for this software. This text can be used to add the statement of need to the repository README.
Recommendation
Accept with minor changes. I think a more detailed README that explains the motivation, has better examples, and better installation/run instructions is all this package needs. It should also emphasize more prominently the fact that it's a wrapper around other existing projects.
from joss-reviews.
This software is essentially a 300+ line script that wraps existing Java tools. I would like to ask the author to confirm here that he thinks this software is a valuable contribution to research. I want to point out that once published this work will be visible for a long time. Adding an example of having used this tool in research would be a good idea.
from joss-reviews.
From @desilinguist:
Installation
I've reworded the section to be "Installation and running".
However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi (denoting Hindi) but for some reason it gets over-ruled by the language recognizer?
The tool employs a simple voting mechanism to tally scores between recognizers. In this case, probably because the other tools don't understand Hindi, it gets outvoted by the poorer best guesses of the other methods. (For such cases, the results of the individual recognizers are included in the results, so you can add your own post-processing selector that is better tuned to your usage scenario.)
Another problem I had was that when I called las recognize test.txt on my Hindi source file instead of las identify test.txt, I get the following exception.
This is because recognize
is actually a separate (and previously undocumented) function that counts the number of words a particular language processor recognizes, and it doesn't support Hindi. However, you also encountered a bug that caused an exception to be thrown instead of handling this in a better manner, which I've now fixed.
The reason for this functionality is so I could estimate the number of OCR errors in automatically scanned historical newspapers. I've added a section to the README to document this.
Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that.
This was a miscommunication. What the README means to say is that at each run, initial load up takes a significant time, so one should configure their workflow so that as much data as possible is funneled to the tool in a single run (e.g. giving the tool multiple files to process on the command line instead of running the tool once for every file). I've tried to make this more clear in the README.
Documentation
I've copied into the README the statement of need as well as the note about it being first and foremost a wrapper from the paper.
Regarding the examples, while they are contrived, I do want to keep them short for the README, and a full worked-through example for all the different modes of operation would in my opinion be too much. What I've done is added a second simple example for each mode of operation that isn't the one Finnish example (the Finnish ones being both important and already understandable for the primary audience of the tool).
I've also added additional information on how to run each mode to their respective sections.
The above changes are visible in this commit.
From @pjotrp:
This software is essentially a 300+ line script that wraps existing Java tools.
While I do think the worth of the tool is precisely in packaging existing tools together and tuning them to work with each other, it actually isn't just 300 lines. I submitted the (repository containing the) command line version, because I think it is the one most usable by the intended audience. 5500 more lines of code are contained in the seco-lexicalanalysis repository, which contains the code common to this command line version and the web service version (seco-lexicalanalysis-play). They in turn refer to seco-hfst, which has 1500 additional LOC.
In addition, as the README and paper both point out, while I view the integration in general useful, the real core contribution of the work is in the in-depth work on integrating and expanding the Finnish pipeline included in the tool, where for example the diff of our omorfi fork is ~12000 lines against the base version.
Adding an example of having used this tool in research would be a good idea.
The second paragraph of the paper already deals with this, referring to three academic articles that use the tool. The reason I submitted the tool in the first place was also that I had a request from an outside user, who wanted to know how they could refer to the tool in their own article. So, to state it explicitly, I do think this is a valuable contribution to research.
from joss-reviews.
@jiemakel thanks so much for addressing most of my comments! I think the changes you have made are quite useful. There are still no automated tests and still not information on how a new developer can contribute. I think some tests should be added but I am not going to make that issue a blocker for publication. It would be nice for the README to contain a little blurb about how one would go about contributing as a developer etc.
As far as the utility of the tool goes, as someone who works in the NLP (natural language processing) field, I think las
is definitely going to be useful to people in the field.
from joss-reviews.
@jiemakel thanks for the clarification. If you have not done so, could you add some pointers to the other work in the paper or the README, so it is clear that this is a major effort?
from joss-reviews.
@desilinguist Could this be moved forward?
from joss-reviews.
@desilinguist - This looks like this is in reasonable shape to accept. Can you confirm?
from joss-reviews.
I'm happy to accept this. Sorry for the delay.
from joss-reviews.
I'm happy to accept this. Sorry for the delay.
@jiemakel - at this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.
from joss-reviews.
The DOI for the latest version is: 10.5281/zenodo.160256
from joss-reviews.
Thanks again for the review @desilinguist!
@jiemakel your paper is now accepted into JOSS. Your DOI is http://dx.doi.org/10.21105/joss.00035
from joss-reviews.
Related Issues (20)
- [PRE REVIEW]: shapfs: A Python package for feature selection by shapley value HOT 27
- [PRE REVIEW]: Defisheye: A Python Package for Fast Correction of Fisheye Distortion in Images HOT 22
- [PRE REVIEW]: nnogada: neural networks optimized by genetic algorithms for data analysis HOT 8
- [PRE REVIEW]: DataScribe: An Omeka S module for structured data transcription HOT 10
- [REVIEW]: PII-Codex: a Python library for PII detection, categorization, and severity assessment HOT 17
- [REVIEW]: Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing HOT 5
- [PRE REVIEW]: bayes-toolbox: A Python package for Bayesian statistics HOT 14
- [PRE REVIEW]: PxMCMC: A Python package for proximal Markov Chain Monte Carlo HOT 7
- [PRE REVIEW]: Assessing Warmth and Competence With the warmthcompetence R Package HOT 9
- [REVIEW]: CoReCon: an open, community-powered collection of Reionization constraints HOT 6
- [REVIEW]: G2Aero: A Python package for separable shape tensors HOT 7
- [REVIEW]: Swiftest: An N-body Integrator for Gravitational Systems HOT 9
- [REVIEW]: Ethical Smart Grid: a Gym environment for learning ethical behaviours HOT 5
- [REVIEW]: SleepECG: a Python package for sleep staging based on heart rate HOT 8
- [PRE REVIEW]: Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science HOT 11
- [PRE REVIEW]: OmniTrax: A deep learning-driven multi-animal tracking and pose-estimation add-on for Blender HOT 17
- [PRE REVIEW]: xinvert: A Python package for inversion problems in geophysical fluid dynamics HOT 16
- [REVIEW]: xclim: xarray-based climate data analytics HOT 8
- [PRE REVIEW]: SyntheticEddyMethod.jl: A Julia package for the creation of inlet flow conditions for LES HOT 11
- [REVIEW]: EDP: a program for projecting electron densities from VASP onto planes HOT 13
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from joss-reviews.