capfalcnlp's Introduction

capfalcnlp

Requirements

Python 3.7

Install

pip install -r requirements.txt
pip install -e .
python -m spacy download fr_core_news_md

Usage

$ python cli.py --input-file example_text.txt
[{'char_offset': 2, 'detected_type': 'Rare', 'text': 'intelligence'},
 {'char_offset': 15, 'detected_type': 'Rare', 'text': 'artificielle'},
 {'char_offset': 29, 'detected_type': 'Rare', 'text': 'IA'},
 {'char_offset': 29, 'detected_type': 'Accronyme', 'text': 'IA'},
 {'char_offset': 140, 'detected_type': 'Rare', 'text': 'simuler'},
 {'char_offset': 150, 'detected_type': 'Rare', 'text': 'intelligence'},
 {'char_offset': 180, 'detected_type': 'Rare', 'text': 'correspond'},
 {'char_offset': 255, 'detected_type': 'Rare', 'text': 'discipline'},
 {'char_offset': 266, 'detected_type': 'Rare', 'text': 'autonome'},
 {'char_offset': 275, 'detected_type': 'Rare', 'text': 'constituée2'},
 {'char_offset': 298, 'detected_type': 'Rare', 'text': 'instances'},
 {'char_offset': 322, 'detected_type': 'Rare', 'text': 'CNIL'},
 {'char_offset': 322, 'detected_type': 'Accronyme', 'text': 'CNIL'},
 {'char_offset': 328, 'detected_type': 'Rare', 'text': 'relevant'},
 {'char_offset': 381, 'detected_type': 'Rare', 'text': 'IA'},
 {'char_offset': 381, 'detected_type': 'Accronyme', 'text': 'IA'},
 {'char_offset': 385, 'detected_type': 'Rare', 'text': 'introduisent'},
 {'char_offset': 424, 'detected_type': 'Rare', 'text': 'mythe'},
 {'char_offset': 457, 'detected_type': 'Rare', 'text': 'classée'},
 {'char_offset': 493, 'detected_type': 'Rare', 'text': 'cognitives'},
 {'char_offset': 526, 'detected_type': 'Rare', 'text': 'neurobiologie'},
 {'char_offset': 540, 'detected_type': 'Rare', 'text': 'computationnelle'},
 {'char_offset': 575, 'detected_type': 'Rare', 'text': 'aux'},
 {'char_offset': 587, 'detected_type': 'Rare', 'text': 'neuronaux'},
 {'char_offset': 587, 'detected_type': 'Emprunt Anglais', 'text': 'neuronaux'},
 {'char_offset': 612, 'detected_type': 'Rare', 'text': 'mathématique'},
 {'char_offset': 637, 'detected_type': 'Rare', 'text': 'mathématiques'},
 {'char_offset': 725, 'detected_type': 'Rare', 'text': 'résolution'},
 {'char_offset': 757, 'detected_type': 'Rare', 'text': 'complexité'},
 {'char_offset': 779, 'detected_type': 'Rare', 'text': 'algorithmique'},
 {'char_offset': 813, 'detected_type': 'Rare', 'text': 'désigne'},
 {'char_offset': 863, 'detected_type': 'Rare', 'text': 'imitant'},
 {'char_offset': 940, 'detected_type': 'Rare', 'text': 'cognitives'},
 {'char_offset': 0,
  'detected_type': 'Phrase Longue',
  'text': "L'intelligence artificielle (IA) est « l'ensemble des théories et des techniques mises en œuvre en vue de réaliser des machines capables de simuler l'intelligence humaine »."},
 {'char_offset': 449,
  'detected_type': 'Phrase Longue',
  'text': "Souvent classée dans le groupe des sciences cognitives, elle fait appel à la neurobiologie computationnelle (particulièrement aux réseaux neuronaux), à la logique mathématique (partie des mathématiques et de la philosophie) et à l'informatique."},
 {'char_offset': 794,
  'detected_type': 'Phrase Longue',
  'text': "Par extension elle désigne, dans le langage courant, les dispositifs imitant ou remplaçant l'homme dans certaines mises en œuvre de ses fonctions cognitives."}]
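
The output above is a flat list of dictionaries, each with `char_offset`, `detected_type`, and `text` keys. As a minimal sketch of how one might post-process it, the hypothetical helper below (`group_by_type` is not part of capfalcnlp) groups detections by type. The sample entries are copied from the output above.

```python
from collections import defaultdict


def group_by_type(detections):
    """Group detections by 'detected_type' (hypothetical helper, not in capfalcnlp).

    Returns a dict mapping each type to a list of (char_offset, text) tuples,
    in the order the detections appear in the input list.
    """
    grouped = defaultdict(list)
    for detection in detections:
        grouped[detection["detected_type"]].append(
            (detection["char_offset"], detection["text"])
        )
    return dict(grouped)


# A few entries copied from the example output above.
sample = [
    {"char_offset": 2, "detected_type": "Rare", "text": "intelligence"},
    {"char_offset": 29, "detected_type": "Rare", "text": "IA"},
    {"char_offset": 29, "detected_type": "Accronyme", "text": "IA"},
]

grouped = group_by_type(sample)
for detected_type, items in grouped.items():
    print(detected_type, items)
```

The type labels are used exactly as the tool emits them, including the spelling 'Accronyme'.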

capfalcnlp's People

Contributors

louismartin, psawa


capfalcnlp's Issues

Error message saying punkt is missing

In my environment, when I run the example:

python cli.py --input-file example_text.txt

I get the following error:

Traceback (most recent call last):
  File "cli.py", line 14, in <module>
    detections = get_detections(text)
  File "/home/codatalab/capfalcnlp/capfalcnlp/features.py", line 197, in get_detections
    for sentence in get_long_sentences(text):
  File "/home/codatalab/capfalcnlp/capfalcnlp/features.py", line 184, in get_long_sentences
    sentences = split_in_sentences(text)
  File "/home/codatalab/capfalcnlp/capfalcnlp/processing.py", line 74, in split_in_sentences
    return _split_in_sentences_nltk(text, **kwargs)
  File "/home/codatalab/capfalcnlp/capfalcnlp/processing.py", line 69, in _split_in_sentences_nltk
    return get_nltk_sentence_tokenizer(**kwargs).tokenize(text)
  File "/home/codatalab/capfalcnlp/capfalcnlp/processing.py", line 58, in get_nltk_sentence_tokenizer
    return nltk.data.load(f'tokenizers/punkt/{language}.pickle')
  File "/home/codatalab/.local/lib/python3.7/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "/home/codatalab/.local/lib/python3.7/site-packages/nltk/data.py", line 875, in _open
    return find(path_, path + [""]).open()
  File "/home/codatalab/.local/lib/python3.7/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/french.pickle

  Searched in:
    - '/home/codatalab/nltk_data'
    - '/opt/conda/envs/capfalcnlp/nltk_data'
    - '/opt/conda/envs/capfalcnlp/share/nltk_data'
    - '/opt/conda/envs/capfalcnlp/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Adding nltk.download('punkt') on line 50 of processing.py solves the issue.
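
Calling `nltk.download('punkt')` unconditionally works, but it contacts the download server on every run. A minimal sketch of an alternative, assuming it is acceptable to fetch the resource lazily: check for it first with `nltk.data.find` and download only on `LookupError`. The helper name `ensure_nltk_resource` and its `find`/`download` parameters are hypothetical (they exist only to make the sketch testable without NLTK installed) and are not part of capfalcnlp.

```python
def ensure_nltk_resource(resource_path, package, find=None, download=None):
    """Download an NLTK package only if its resource is not already installed.

    Hypothetical helper, not part of capfalcnlp. `find` and `download`
    default to nltk.data.find and nltk.download; they can be overridden
    for testing. Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is missing,
        # which is the same error shown in the traceback above.
        find(resource_path)
    except LookupError:
        download(package)
        return True
    return False
```

With NLTK installed, one would call `ensure_nltk_resource('tokenizers/punkt', 'punkt')` once before the tokenizer is loaded in processing.py, for example at the spot where the issue suggests adding the download call.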
