Comments (2)
NLP pipelines use the antiword R package, for example through the
readtext::readtext() function. readtext() has a general parameter "...":
...: additional arguments passed through to low-level file reading
function, such as ‘file’, ‘fread’, etc. Useful for
specifying an input encoding option, which is specified in
the same way as it would be given to ‘iconv’. See the
Encoding section of file for details.
This could be used to pass -w 0 to antiword, which puts each paragraph on a single line instead of wrapping it. Currently, a workaround that replaces a plain readtext(file=full_path_ext) call
is this:
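if (file_ext == "doc") {
  # Call the antiword binary directly; -w 0 disables line wrapping.
  # shQuote() keeps paths that contain spaces in one argument.
  file_text <- system2(command = "antiword",
                       args = c("-w", "0", shQuote(full_path_ext)),
                       stdout = TRUE)
} else {
  file_text <- readtext(file = full_path_ext)
}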
This workaround does not always work, because help(system2) says:
Output lines of more than 8095 bytes will be split.
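One possible way around that limit is to let system2() redirect stdout to a temporary file and read the file back with readLines(), instead of capturing the output with stdout = TRUE. The sketch below is only an illustration: read_doc_unwrapped() and its antiword_bin argument are hypothetical names, it assumes the antiword binary is on the PATH, and it assumes that the redirected file is not subject to the line splitting described in help(system2).

```r
# Hypothetical helper (not part of antiword or readtext): run the antiword
# binary with -w 0 and send its output to a temporary file, so that long
# paragraph lines are read back with readLines() rather than captured
# through system2(stdout = TRUE).
read_doc_unwrapped <- function(path, antiword_bin = "antiword") {
  out_file <- tempfile(fileext = ".txt")
  on.exit(unlink(out_file), add = TRUE)  # always remove the temp file

  # When stdout is a file name, system2() returns the exit status.
  status <- system2(command = antiword_bin,
                    args = c("-w", "0", shQuote(path)),
                    stdout = out_file)
  if (!identical(status, 0L)) {
    stop("antiword exited with status ", status)
  }
  readLines(out_file, warn = FALSE)
}
```

It would be called as file_text <- read_doc_unwrapped(full_path_ext) in the "doc" branch of the workaround above.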
The problem with readtext and antiword without the -w parameter is that NLP pipelines often fail on the additional line breaks that antiword inserts when it wraps text.
For example, take the sentence "I spoke to Mr. Walking in the street, I saw him." If antiword breaks the line after "Mr.", most NLP pipelines will see two sentences:
- "I spoke to Mr."
- "Walking in the street, I saw him."
In the original sentence, "Mr. Walking" is a named entity that NLP tools recognise. In the second fragment, "Walking" is instead read as a form of the verb "to walk", with "I" as its subject.
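Where -w 0 cannot be passed, the wrapped lines could in principle be rejoined before the text reaches the NLP pipeline. The sketch below is a rough illustration in base R, not something antiword or readtext provides: unwrap_paragraphs() is a hypothetical helper, and it assumes that paragraphs in the wrapped output are separated by blank lines.

```r
# Hypothetical helper: collapse hard-wrapped lines into one string per
# paragraph, assuming that blank lines separate paragraphs.
unwrap_paragraphs <- function(lines) {
  is_blank <- !nzchar(trimws(lines))
  kept <- !is_blank
  if (!any(kept)) {
    return(character(0))
  }
  # The paragraph id increases by one at every blank line.
  para_id <- cumsum(is_blank)
  paragraphs <- tapply(trimws(lines[kept]), para_id[kept],
                       paste, collapse = " ")
  unname(as.character(paragraphs))
}

wrapped <- c("I spoke to Mr.",
             "Walking in the street, I saw him.",
             "",
             "Next paragraph.")
unwrap_paragraphs(wrapped)
```

With the example input, the first two lines should come back as a single string, so a sentence splitter sees "Mr. Walking" on one line again. This does not help when the document itself contains single line breaks that are meaningful.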
from antiword.
Can you send a PR?