Submitting author: <a class="user-mention notranslate" data-hovercard

This software is essentially a 300+ line that wraps existing Java tools. I woul

From <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

[REVIEW]: LAS: an integrated language analysis tool for multiple languages about joss-reviews HOT 15 CLOSED

openjournals commented on July 20, 2024

[REVIEW]: LAS: an integrated language analysis tool for multiple languages

from joss-reviews.

Comments (15)

desilinguist commented on July 20, 2024 1

I'm happy to review this.

from joss-reviews.

jiemakel commented on July 20, 2024 1

I've now added a "Contributing" section to the README, which additionally links to the other github repositories whose code is used here.

Regarding automated tests, I do agree they'd start to be useful now that the tool is seeing wider use. I'll start adding them as I do further work, but right now don't have the time to draft a comprehensive battery from the start.

from joss-reviews.

arfon commented on July 20, 2024

/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?

If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this will suffice.

Reviewer instructions

Please work through the checklist at the start of this issue.
If you need any further guidance/clarification take a look at the reviewer guidelines here http://joss.theoj.org/about#reviewer_guidelines
Please make a publication recommendation at the end of your review

Any questions, please ask for help by commenting on this issue! 🚀

from joss-reviews.

desilinguist commented on July 20, 2024

I'm teaching a class at a linguistics summer school all next week so I won't be able to get to this until the week after. I'm sorry for the delay and happy to yield to another reviewer if they want to review it in the meantime.

from joss-reviews.

desilinguist commented on July 20, 2024

Installation

There's no specific installation section just a section called "Running". It would be nice if we could just rename that section to "Installation" even though there's nothing really to install since that's what people generally look for. And it would be nice to have that section contain the actual steps needed to run LAS on the example referenced later.

Another point is that the las binary that you download from the releases page needs to be chmod'ed 755 before it can run at least on my mac.

Functionality

I downloaded the file and tested on some Hindi text and it seems to work fine. However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi (denoting Hindi) but for some reason it gets over-ruled by the language recognizer?

Update: Actually, I just noticed that the command line tool created .language file which also misidentifies the Hindi text as German.

Another problem I had was that when I called las recognize test.txt on my Hindi source file instead of las identify test.txt, I get the following exception.

10:15:03.846 [main] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[hi:0.8542185083632905], DetectedLanguage[mr:0.14570001916332004]]
Exception in thread "main" java.lang.IllegalArgumentException: in must not be null!
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:125)
    at fi.seco.lexical.combined.CombinedLexicalAnalysisService.getTokenizer(CombinedLexicalAnalysisService.java:125)
    at fi.seco.lexical.combined.CombinedLexicalAnalysisService.recognize(CombinedLexicalAnalysisService.java:617)
    at LASCommandLineTool$.recognize(LASCommandLineTool.scala:299)
    at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:219)
    at LASCommandLineTool$$anonfun$main$9$$anonfun$apply$6.apply(LASCommandLineTool.scala:218)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:218)
    at LASCommandLineTool$$anonfun$main$9.apply(LASCommandLineTool.scala:209)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at LASCommandLineTool$.main(LASCommandLineTool.scala:209)
    at LASCommandLineTool.main(LASCommandLineTool.scala)

Is it simply because the LAS recognizer doesn't actually support Hindi (according to the help message)? If so, a more interpretable exception might be more useful here.

Performance

Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that. For example, running las identify on the same single sentence text file took about 25-30 seconds no matter how many times I re-ran it.

Documentation

This section seems lacking in general.

There isn't really a statement of need. It would be nice to have something that explains what the tool is useful for and also all of the modes of analysis it provides (recognition, identification etc.).

The package is self-contained so there's no need for dependencies, I guess. However, there's no information for people who are interested in contributing. There also don't seem to be any automated tests to run.

There's an example which is useful but it seems quite contrived. It would be nice to have actual examples for all of the analysis modes.

Software paper

The paper actually does have actual text motivating the need for this software. This text can be used to add the statement of need to the repository README.

Recommendation

Accept with minor changes. I think a more detailed README that explains the motivation, has better examples, and better installation/run instructions is all this package needs. It should also emphasize more prominently the fact that it's a wrapper around other existing projects.

from joss-reviews.

pjotrp commented on July 20, 2024

This software is essentially a 300+ line script that wraps existing Java tools. I would like to ask the author to confirm here that he thinks this software is a valuable contribution to research. I want to point out that once published this work will be visible for a long time. Adding an example of having used this tool in research would be a good idea.

from joss-reviews.

jiemakel commented on July 20, 2024

From @desilinguist:

Installation

I've reworded the section to be "Installation and running".

However, as an aside, the web service inaccurately detects the same text as German for some reason (even though the underlying language detector says hi (denoting Hindi) but for some reason it gets over-ruled by the language recognizer?

The tool employs a simple voting mechanism to tally scores between recognizers. In this case, probably because the other tools don't understand Hindi, it gets outvoted by the poorer best guesses of the other methods. (For such cases, the results of the individual recognizers are included in the results, so you can add your own post-processing selector that is better tuned to your usage scenario.)

Another problem I had was that when I called las recognize test.txt on my Hindi source file instead of las identify test.txt, I get the following exception.

This is because recognize is actually a separate (and previously undocumented) function that counts the number of words a particular language processor recognizes, and it doesn't support Hindi. However, you also encountered a bug that caused an exception to be thrown instead of handling this in a better manner, which I've now fixed.

The reason for this functionality is so I could estimate the number of OCR errors in automatically scanned historical newspapers. I've added a section to the README to document this.

Although there are no specific claims made about the performance, the README does seem to imply that re-running after the first time should be much faster. I didn't really experience that.

This was a miscommunication. What the README means to say is that at each run, initial load up takes a significant time, so one should configure their workflow so that as much data as possible is funneled to the tool in a single run (e.g. giving the tool multiple files to process on the command line instead of running the tool once for every file). I've tried to make this more clear in the README.

Documentation

I've copied into the README the statement of need as well as the note about it being first and foremost a wrapper from the paper.

Regarding the examples, while they are contrived, I do want to keep them short for the README, and a full worked-through example for all the different modes of operation would in my opinion be too much. What I've done is added a second simple example for each mode of operation that isn't the one Finnish example (the Finnish ones being both important and already understandable for the primary audience of the tool).

I've also added additional information on how to run each mode to their respective sections.

The above changes are visible in this commit.

From @pjotrp:

This software is essentially a 300+ line script that wraps existing Java tools.

While I do think the worth of the tool is precisely in packaging existing tools together and tuning them to work with each other, it actually isn't just 300 lines. I submitted the (repository containing the) command line version, because I think it is the one most usable by the intended audience. 5500 more lines of code are contained in the seco-lexicalanalysis repository, which contains the code common to this command line version and the web service version (seco-lexicalanalysis-play). They in turn refer to seco-hfst, which has 1500 additional LOC.

In addition, as the README and paper both point out, while I view the integration in general useful, the real core contribution of the work is in the in-depth work on integrating and expanding the Finnish pipeline included in the tool, where for example the diff of our omorfi fork is ~12000 lines against the base version.

Adding an example of having used this tool in research would be a good idea.

The second paragraph of the paper already deals with this, referring to three academic articles that use the tool. The reason I submitted the tool in the first place was also that I had a request from an outside user, who wanted to know how they could refer to the tool in their own article. So, to state it explicitly, I do think this is a valuable contribution to research.

from joss-reviews.

desilinguist commented on July 20, 2024

@jiemakel thanks so much for addressing most of my comments! I think the changes you have made are quite useful. There are still no automated tests and still not information on how a new developer can contribute. I think some tests should be added but I am not going to make that issue a blocker for publication. It would be nice for the README to contain a little blurb about how one would go about contributing as a developer etc.

As far as the utility of the tool goes, as someone who works in the NLP (natural language processing) field, I think las is definitely going to be useful to people in the field.

from joss-reviews.

pjotrp commented on July 20, 2024

@jiemakel thanks for the clarification. If you have not done so, could you add some pointers to the other work in the paper or the README, so it is clear that this is a major effort?

from joss-reviews.

jiemakel commented on July 20, 2024

@desilinguist Could this be moved forward?

from joss-reviews.

arfon commented on July 20, 2024

@desilinguist - This looks like this is in reasonable shape to accept. Can you confirm?

from joss-reviews.

desilinguist commented on July 20, 2024

I'm happy to accept this. Sorry for the delay.

from joss-reviews.

arfon commented on July 20, 2024

I'm happy to accept this. Sorry for the delay.

✨ thanks for the update @desilinguist.

@jiemakel - at this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.

from joss-reviews.

jiemakel commented on July 20, 2024

The DOI for the latest version is: 10.5281/zenodo.160256

from joss-reviews.

arfon commented on July 20, 2024

Thanks again for the review @desilinguist!

@jiemakel your paper is now accepted into JOSS. Your DOI is http://dx.doi.org/10.21105/joss.00035 🎉 🚀 💥

from joss-reviews.

[REVIEW]: LAS: an integrated language analysis tool for multiple languages about joss-reviews HOT 15 CLOSED

Comments (15)

Installation

Functionality

Performance

Documentation

Software paper

Recommendation

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent