cl-nlp's Introduction

CL-NLP -- a Lisp NLP toolkit

Brief description

Eventually, CL-NLP will provide a comprehensive and extensible set of tools to solve natural language processing problems in Common Lisp.

The goals of the project include the following:

  • support for constructing arbitrary NLP pipelines on top of it
  • support for easy and fast experimentation and development of new models and approaches
  • support for teaching NLP concepts, by serving as a good framework for it

It comprises a number of utility/horizontal and end-user/vertical modules that implement the basic functions and provide a way to add your own extensions and models.

The utility layer includes:

  • tools for transforming raw natural language text, as well as various corpora into a form suitable for further processing
  • basic support for language modelling
  • support for a number of linguistic concepts
  • support for working with machine learning models and a number of training algorithms

The end-user layer will provide:

  • POS taggers
  • constituency parsers
  • dependency parsers
  • other stuff (will be added step-by-step, suggestions are welcome)

How to start working with CL-NLP

The project has already reached a stage of usefulness for the primary author: for instance, it supports my current language modelling experiments by providing easy access to treebanks and other utilities.

Yet, it is far from being production-ready. So, if you want to use it for production tasks, expect to bleed on the bleeding edge.

Otherwise, if you want to contribute to developing the toolkit, you're very welcome. Here are a few write-ups to give you a sense of the project and to help you get started:

You will probably also need to track the latest version of RUTILS from git.

For CL-NLP to reach v.0.1, which may be considered suitable for limited use by non-contributors, the following things should be finished (work in progress):

  • implement a comprehensive test-suite and fix all bugs encountered in the process
  • describe available models and their quality metrics

Technical notes

Dependencies

For development:

License

The license of CL-NLP is Apache 2.0.

Specific models may have a different license due to the limitations of the dataset they are built with. Please see the <model>.license file accompanying each model for details.

(c) 2013-2014, Vsevolod Dyomkin

cl-nlp's People

Contributors

dkochmanski, dmsurti, html, proger, sudodoki, vseloved

cl-nlp's Issues

Installation issue

During installation of cl-nlp, the process fails at the step shown below. I have cut out the earlier status messages and kept only the immediately relevant section.

; Loading "cl-nlp"
[package nlp.util]................................
[package nlp.core]................................
[package nlp.corpora].............................
[package nlp.tagging].............................
[package nlp.parsing].............................
[package nlp.generation]..........................
[package nlp-user]......
debugger invoked on a SB-INT:SIMPLE-FILE-ERROR:
failed to find the TRUENAME of /Users/maheshcr/tools/cl-nlp/src/core/general.lisp:
No such file or directory

Could you please help?

Add developer documentation, initial version

Cover the following in the initial developer documentation:

  1. Introduction to CL-NLP, refer to README.md for introductory articles.
  2. Set up.
  3. Travis CI integration and running tests, with current state of tests.
  4. Contributing instructions.
  5. Overall design/architecture
  6. Tokenizers with design details, source/test code pointers, quick REPL run details

Doesn’t build on LispWorks 7

When I try to load cl-nlp on LispWorks 7, I get the following:

CL-USER 1 > (ql:quickload :cl-nlp)
To load "cl-nlp":
  Load 1 ASDF system:
    cl-nlp
; Loading "cl-nlp"
;;; Checking for wide character support... yes, using code points.
;;; Checking for wide character support... yes, using code points.
;;; Building Closure with CHARACTER RUNES
..

**++++ Error in NLP.UTIL:UNIQ: 
  The variable #:|table1111365| is unbound.
; *** 1 error detected, no fasl file produced.

What could be wrong?

Lispworks issues with chars.lisp

I've found a couple of compatibility issues between the chars.lisp file in src/utils/ and LispWorks.

The first was in the +WHITE-CHARS+ parameter: LispWorks names the character #\NO-BREAK-SPACE, so I did:

(defparameter +white-chars+
  '(#\Space #\Tab #\Newline #\Return #\Linefeed
    ;; lispworks uses #\no-break-space
    #+(and lispworks unicode) #\no-break-space
    #+(or (and sbcl sb-unicode) (and allegro ics) (and clisp i18n)
          (and openmcl openmcl-unicode-strings))
    #\no-break_space)
  "Chars considered WHITESPACE.")

I expect there may be a better way to do this that fits your project's coding standards, but I leave that integration to you. Example used for fix: link to CLSQL project

Once I put that in, the compile got further into the file and hit a character encoding issue. Some of the quotation characters are multi-byte characters that LispWorks can't read properly. Emacs has no problem displaying them, but LispWorks doesn't display them properly when the file is opened in it, and its compiler can't read them. LispWorks uses UTF-16 internally, so if the characters you are using have char-codes that are the same across UTF-8/16, that might work. There may also be a more elegant solution, but I don't know enough about how LispWorks treats the characters to figure anything else out.

I may just switch over to SBCL to test out this project. What Lisp implementation are you developing in?
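
One possible workaround, sketched here as an assumption rather than the project's actual fix, is to keep the source file ASCII-only and build the quotation characters from their Unicode code points with CODE-CHAR. The names *curly-quote-chars* and curly-quote-p are hypothetical, and the sketch assumes a Unicode-capable implementation.

```lisp
;; Hypothetical sketch; these names are not part of CL-NLP.
;; Assumes char-code-limit exceeds #x201D (true on Unicode-capable Lisps).
(defparameter *curly-quote-chars*
  (mapcar #'code-char '(#x2018 #x2019 #x201C #x201D))
  "Left/right single and left/right double quotation marks, built from code points.")

(defun curly-quote-p (char)
  "Return T if CHAR is one of the curly quotation marks, NIL otherwise."
  (and (member char *curly-quote-chars*) t))
```

Because the file then contains no multi-byte characters, the reader never has to guess its encoding. The cost is that the characters are no longer visible in the source.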

--Eric

Add CI support

  1. Initially start off with Travis CI using SBCL 64 bit only.
  2. Later on expand to more lisps.
  3. Finally also use test-cl-grid.

dict-lemmatizer fails to build

The error is below. Kindly resolve it most urgently.

;;; Error:
;;; in file dict-lemmatizer.lisp, position 709
;;; at (DEFMETHOD LEMMATIZE ...)
;;; * The macro form (DEFMETHOD LEMMATIZE ((LEMMATIZER MEM-DICT) WORD &OPTIONAL POS) (UNLESS (LOOKUP (SMART-SLOT-VALUE LEMMATIZER 'WORDS) WORD) (RETURN-FROM LEMMATIZE WORD)) (LET ((POSS (POS-TAGS LEMMATIZER WORD) :TEST 'EQUALP)) (IF-IT (OR (MEMBER POS POSS) (UNLESS POS (MEMBER-IF (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (= 2 (LENGTH (PRINC-TO-STRING (? % 0))))) POSS))) (VALUES WORD IT) (WITH ((WORD-POS PRESENT? (IF POS (COND-IT ((? (SMART-SLOT-VALUE LEMMATIZER 'FORMS) (WORD/POS WORD POS)) (VALUES IT T)) ((? (SMART-SLOT-VALUE LEMMATIZER 'FORMS) (WORD/POS WORD (FIRST (MKLIST POS)))) (VALUES (ARGMAX 'IDENTITY IT :KEY (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (PRECEDENCE LEMMATIZER (? % 0 0)))) T))) (|GET#| WORD (SMART-SLOT-VALUE LEMMATIZER 'FORMS))))) (:= WORD-POS (REMOVE-DUPLICATES WORD-POS :TEST 'EQUALP)) (IF PRESENT? (VALUES (? WORD-POS 0 0) (? WORD-POS 0 1) (REST WORD-POS)) (VALUES NIL NIL (MAPCAR (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (PAIR WORD %)) POSS))))))) was not expanded successfully.
;;; Error detected:
;;; In form
;;; (LET ((POSS (POS-TAGS LEMMATIZER WORD) :TEST 'EQUALP)) (IF-IT (OR (MEMBER POS POSS) (UNLESS POS (MEMBER-IF (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (= 2 (LENGTH (PRINC-TO-STRING (? % 0))))) POSS))) (VALUES WORD IT) (WITH ((WORD-POS PRESENT? (IF POS (COND-IT ((? (SMART-SLOT-VALUE LEMMATIZER 'FORMS) (WORD/POS WORD POS)) (VALUES IT T)) ((? (SMART-SLOT-VALUE LEMMATIZER 'FORMS) (WORD/POS WORD (FIRST (MKLIST POS)))) (VALUES (ARGMAX 'IDENTITY IT :KEY (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (PRECEDENCE LEMMATIZER (? % 0 0)))) T))) (|GET#| WORD (SMART-SLOT-VALUE LEMMATIZER 'FORMS))))) (:= WORD-POS (REMOVE-DUPLICATES WORD-POS :TEST 'EQUALP)) (IF PRESENT? (VALUES (? WORD-POS 0 0) (? WORD-POS 0 1) (REST WORD-POS)) (VALUES NIL NIL (MAPCAR (RUTILS.READTABLE::TRIVIAL-POSITIONAL-LAMBDA (PAIR WORD %)) POSS))))))
;;; LET: Ill formed declaration.
Condition of type: COMPILE-FILE-ERROR
COMPILE-FILE-ERROR while compiling #<cl-source-file "cl-nlp" "src" "lexics" "dict-lemmatizer">

Available restarts:

  1. (RETRY) Retry compiling #<cl-source-file "cl-nlp" "src" "lexics" "dict-lemmatizer">.
  2. (ACCEPT) Continue, treating compiling #<cl-source-file "cl-nlp" "src" "lexics" "dict-lemmatizer"> as having been successful.
  3. (RETRY) Retry ASDF operation.
  4. (CLEAR-CONFIGURATION-AND-RETRY) Retry ASDF operation after resetting the configuration.
  5. (ABORT) Give up on "cl-nlp"
  6. (REGISTER-LOCAL-PROJECTS) Register local projects and try again.
  7. (RESTART-TOPLEVEL) Go back to Top-Level REPL.
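
The "LET: Ill formed declaration" line points at the binding list: a LET binding has the shape (var init-form), but the binding for POSS carries two extra elements, :TEST and 'EQUALP, after its init-form. A minimal sketch of that shape and one possible repair follows. Wrapping the call in REMOVE-DUPLICATES is a guess at the intent (the same call with :TEST 'EQUALP appears later in the method), not a confirmed fix, and pos-tags-stub is a hypothetical stand-in for POS-TAGS.

```lisp
;; Malformed shape reported by the compiler (kept as a comment so this file loads):
;;   (let ((poss (pos-tags lemmatizer word) :test 'equalp)) ...)

(defun pos-tags-stub (word)
  "Hypothetical stand-in for POS-TAGS; returns a list that contains a duplicate."
  (declare (ignore word))
  (list 'nn 'vbd 'nn))

(defun distinct-pos-tags (word)
  "Bind POSS with a well-formed (var init-form) binding and return it."
  ;; Assumed repair: the :test argument belongs to a wrapping call.
  (let ((poss (remove-duplicates (pos-tags-stub word) :test 'equalp)))
    poss))
```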

v.1.0 checklist

For v.1.0

  • Sort out all exports
  • Restore GREEN status in Travis (some APIs have changed and made some tests obsolete), write more tests
  • Finish pprint-syntax etc.
  • Implement NER tagger for English (using embeddings and FNN)
  • Implement Dependency parser for English (using stack-buffer-parser)
  • Implement Constituency parser for English (using stack-buffer-parser)
  • In the process, try to use Logistic Regression alongside AvgPerceptron, fix possible issues in the implementation
  • Improve getting dictionary data from Wiktionary (test on several langs)
  • Implement HTTP API
  • Documentation: general usage scenarios (main pipeline), multilang support, API, working with dictionaries, embeddings, corpora, useful utilities

Post-1.0 experiments

  • Implement NNSE word embeddings
  • Implement AMR parsing (using stack-buffer-parser)
  • Implement Anchor topic model
  • Maybe, use AROW instead of AvgPerceptron (see cl-online-learning)
  • Implement classifier calibration

Following the instructions on writing a POS tagger results in error on text-tokens (CCL::UNDEFINED-FUNCTION-CALL).

docs/user-guide/examples/eng-pos-tagger.md gives some instructions that fail.

The following code:

NLP> (let ((words-dist #h(equal))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)
#<HASH-TABLE :TEST EQUAL :COUNT 51457 {10467E6543}>
NLP> (reduce #'+ (mapcan #'ht-vals (ht-vals *)))
1289201

... appears to be two separate forms: the let form and the reduce form.

  • The let form appears to be unbalanced. It lacks one parenthesis.
  • If we add a closing parenthesis, giving (words-dist #h(equal))), two errors appear:
NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)

The first error is that there is no file WSJ under corpora/ptb/TAGGED/POS/.

But if we change it to an existing corpus under corpora/, such as "onf-wsj":

NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "onf-wsj")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)

Then CCL:UNDEFINED-FUNCTION-CALL is signalled. There is no such function.

Any clues?

I'm using Clozure Common Lisp 1.10 on Windows 8 64-bit. Under SBCL, just running the first let form produced a thread error.
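
For reference, the counting logic in the snippet does not depend on the corpus reader or on the RUTILS reader macros. A minimal sketch in plain Common Lisp follows. It assumes sentences are lists of (word . tag) conses, which is an illustrative representation and not CL-NLP's token structure, and count-word-tags and total-count are hypothetical names.

```lisp
;; Hypothetical sketch; not CL-NLP's API.
(defun count-word-tags (sentences)
  "Return a hash table mapping each word to a hash table of tag -> count."
  (let ((words-dist (make-hash-table :test 'equal)))
    (dolist (sent sentences words-dist)
      (dolist (tok sent)
        (destructuring-bind (word . tag) tok
          ;; Create the inner table the first time a word is seen.
          (let ((tags (or (gethash word words-dist)
                          (setf (gethash word words-dist)
                                (make-hash-table :test 'equal)))))
            (incf (gethash tag tags 0))))))))

(defun total-count (words-dist)
  "Sum all tag counts over all words."
  (let ((total 0))
    (maphash (lambda (word tags)
               (declare (ignore word))
               (maphash (lambda (tag n)
                          (declare (ignore tag))
                          (incf total n))
                        tags))
             words-dist)
    total))
```

Running the two functions on a small hand-made list of sentences is a way to check the logic separately from corpus loading.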

can't load using quicklisp

I had the following problem after cloning the repo and trying to load it using quicklisp:

failed to find the TRUENAME of /Users/arademaker/quicklisp/local-projects/cl-nlp/src/corpora/util.lisp:
No such file or directory
[Condition of type SB-INT:SIMPLE-FILE-ERROR]

In fact, util.lisp is not in the corpora directory. Any idea?
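
A quick way to see which of the files a system definition expects are absent from a checkout is to test each path with PROBE-FILE. This is a sketch under assumptions: missing-files is a hypothetical helper, and the list of paths has to be supplied by hand (for example, copied from the .asd file).

```lisp
;; Hypothetical helper; not part of CL-NLP or ASDF.
(defun missing-files (paths)
  "Return the elements of PATHS for which PROBE-FILE finds no file."
  (remove-if #'probe-file paths))
```

Called with the path from the error message, it returns a one-element list when the file is indeed absent.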
