Comments (2)

mayeulk commented on June 24, 2024

NLP pipelines use the antiword R package, for example through the
readtext::readtext() function. readtext() has a general parameter "...":

...: additional arguments passed through to low-level file reading
          function, such as ‘file’, ‘fread’, etc.  Useful for
          specifying an input encoding option, which is specified in
          the same was as it would be give to ‘iconv’.  See the
          Encoding section of file for details.

This could be used to pass -w 0 to antiword. Currently, a workaround for readtext(file=full_path_ext) is this:

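 if (file_ext == "doc") {
   # Call the antiword command-line tool directly.
   # -w 0 puts each paragraph on a single line; shQuote() protects paths with spaces.
   file_text <- system2(command = "antiword",
                        args = c("-w", "0", shQuote(full_path_ext)),
                        stdout = TRUE)
 } else {
   file_text <- readtext(file = full_path_ext)
 }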

This workaround does not always work because, as help(system2) says:

Output lines of more than 8095 bytes will be split.
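
One possible way around that limit, as a minimal sketch: system2() also accepts a file name for stdout, so the output can be written to a temporary file and read back with readLines(), which does not split long lines. The helper name read_doc_unwrapped and its command and args parameters are hypothetical, introduced only for illustration, and the sketch assumes the antiword command-line tool is on the PATH.

```r
# Hypothetical helper: run a converter and read its output from a temp file,
# so long lines are not split the way they can be with stdout = TRUE.
read_doc_unwrapped <- function(path, command = "antiword", args = c("-w", "0")) {
  out_file <- tempfile(fileext = ".txt")
  on.exit(unlink(out_file), add = TRUE)
  # A character string for stdout redirects the command's output to that file.
  status <- system2(command, args = c(args, shQuote(path)), stdout = out_file)
  if (status != 0) {
    stop("command failed with exit status ", status)
  }
  readLines(out_file, warn = FALSE)
}
```

With this in place, the "doc" branch of the workaround could become file_text <- read_doc_unwrapped(full_path_ext).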

The problem with readtext and antiword without the -w parameter is that NLP pipelines often fail because of the additional line breaks that antiword adds.
For example, take the sentence "I spoke to Mr. Walking in the street, I saw him." If antiword breaks the line after "Mr.", most NLP pipelines will see two sentences:

  1. "I spoke to Mr."
  2. "Walking in the street, I saw him."

In the original sentence, "Mr. Walking" is a named entity that NLP tools recognise. In the second fragment, "Walking" is read as the verb "to walk", with "I" as its subject.
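
Where -w 0 cannot be passed through, a post-processing step can approximate it. The sketch below assumes that paragraphs in the antiword output are separated by blank lines and joins the lines inside each paragraph with a single space. unwrap_paragraphs is a hypothetical name, and the blank-line assumption may not hold for tables or lists.

```r
# Hypothetical helper: merge hard-wrapped lines back into one line per paragraph.
# Assumes blank lines separate paragraphs.
unwrap_paragraphs <- function(lines) {
  lines <- trimws(lines)
  is_blank <- lines == ""
  keep <- !is_blank
  if (!any(keep)) {
    return(character(0))
  }
  # Every blank line increments the counter, so each paragraph gets its own group id.
  group <- cumsum(is_blank)
  paragraphs <- tapply(lines[keep], group[keep], paste, collapse = " ")
  unname(as.character(paragraphs))
}

wrapped <- c("I spoke to Mr.", "Walking in the street, I saw him.", "", "Second paragraph.")
unwrapped <- unwrap_paragraphs(wrapped)
```

Here the first two lines are rejoined into the original sentence, so a sentence splitter no longer sees a line break after "Mr.".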

jeroen commented on June 24, 2024

Can you send a PR?
