
spacy-phrases

This repository contains scripts for rule-based matching and noun-chunk extraction with spaCy.

Google colabs

  • make_dataset: transform a document-based dataset into a paragraph/sentence-based one
  • noun_chunks: extract noun phrases using spaCy's noun_chunks attribute
  • dep_matcher: match documents/sentences on the dependency tree

Installation (locally)

To use or contribute to this repository, first check out the code. Then create a new virtual environment and install the dependencies:

Windows

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python -m venv env 
$ . env/Scripts/activate
$ pip install -r requirements.txt

MacOS / Linux

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python3 -m venv env 
$ . env/bin/activate
$ pip install -r requirements.txt
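
Depending on what requirements.txt pins, you may also need to download the spaCy model(s) the scripts use by default, for example:

$ python -m spacy download en_core_web_sm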

Usage

Data transformation

Because we sometimes need several 'versions' of a corpus (document-, paragraph-, and sentence-based), the repository provides scripts/make_dataset.py, which transforms a document-based dataset into a paragraph- or sentence-based one, and scripts/process.py, which handles text preprocessing.

To prepare a dataset, run python scripts/make_dataset.py:

Usage: make_dataset.py [OPTIONS] INPUT_TABLE OUTPUT_TABLE

  Typer app that processes datasets.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_TABLE  [required]

Options:
  --lang [en|ru]                  sentencizer's base model  [default:
                                  Languages.EN]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --paragraph / --sentence        [default: sentence]
  --text TEXT                     [default: fulltext]
  --uuid TEXT                     [default: uuid]
  --lemmatize / --no-lemmatize    [default: no-lemmatize]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
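
A hypothetical invocation (the file paths are placeholders; the options mirror the help output above):

$ python scripts/make_dataset.py data/documents.csv data/sentences.csv --lang en --sentence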

Matching phrases

We've developed two different approaches to extracting noun phrases:

  • our first guess was to use the Doc's noun_chunks attribute (we iterate over noun_chunks and keep those that fit our criteria; see the sketch below). But this approach isn't perfect and doesn't work for ru models.
  • we then moved to rule-based matching, which is more flexible as long as you write accurate patterns (and works for both en and ru models).
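
For reference, a minimal sketch of the first approach (the sentence and the substring pattern are placeholders; scripts/noun_chunks.py adds batching, uuid bookkeeping, and JSONL output on top of this idea):

import spacy

nlp = spacy.load("en_core_web_sm")  # Doc.noun_chunks is not available for the ru models
doc = nlp("Foreign actors exert malign influence on public opinion.")

# Keep only the chunks that fit our criteria, here a simple substring match.
pattern = "influenc"
phrases = [chunk.text for chunk in doc.noun_chunks if pattern in chunk.text.lower()]
print(phrases)  # e.g. ['malign influence']
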
Noun_chunks

To extract phrases using the noun_chunks approach, run python scripts/noun_chunks.py:

Usage: noun_chunks.py [OPTIONS] INPUT_TABLE OUTPUT_JSONL

  Extract noun phrases using spaCy.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_JSONL  [required]

Options:
  --model TEXT                    [default: en_core_web_sm]
  --docs-max-length INTEGER       [default: 2000000]
  --batch-size INTEGER            [default: 50]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --pattern TEXT                  [default: influenc]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
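
A hypothetical invocation (the file paths are placeholders):

$ python scripts/noun_chunks.py data/sentences.csv output/noun_chunks.jsonl --model en_core_web_sm --pattern influenc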

Dependency Matcher

To extract phrases using the Dependency Matcher approach, run python scripts/dep_matcher.py:

Usage: dep_matcher.py [OPTIONS] INPUT_TABLE PATTERNS OUTPUT_JSONL

  Match dependencies using spaCy's dependency matcher.

Arguments:
  INPUT_TABLE   Input table containing text & metadata  [required]
  PATTERNS      Directory or a single pattern file with rules  [required]
  OUTPUT_JSONL  Output JSONLines file where matches will be stored  [required]

Options:
  --model TEXT                    SpaCy model's name  [default:
                                  en_core_web_sm]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --batch-size INTEGER            [default: 50]
  --context-depth INTEGER
  --merge-entities / --no-merge-entities
                                  [default: no-merge-entities]
  --merge-noun-chunks / --no-merge-noun-chunks
                                  [default: no-merge-noun-chunks]
  --keep-sentence / --no-keep-sentence
                                  [default: no-keep-sentence]
  --keep-fulltext / --no-keep-fulltext
                                  [default: no-keep-fulltext]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
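
A hypothetical invocation (the file paths are placeholders):

$ python scripts/dep_matcher.py data/sentences.csv patterns/ output/matches.jsonl --model en_core_web_sm --context-depth 1

The matching builds on spaCy's DependencyMatcher. The sketch below shows the underlying technique only, not this repository's exact pattern schema: it anchors on a token whose lemma starts with "influenc" and requires a direct nsubj child.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # anchor token: lemma starts with "influenc"
    {"RIGHT_ID": "anchor", "RIGHT_ATTRS": {"LEMMA": {"REGEX": "^influenc"}}},
    # a direct child of the anchor with the nsubj dependency
    {
        "LEFT_ID": "anchor",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("INFLUENCE", [pattern])

doc = nlp("Disinformation campaigns influence public opinion.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['influence', 'campaigns']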

Counting

Once phrases/matches have been extracted, you can transform them into a usable format and/or count their frequencies.
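
No counting script ships with the repository yet (see the "count frequencies" issue below), but the idea is a small aggregation pass. A minimal sketch, assuming each output line is a JSON object keyed by uuid with a list of matched phrases as its value (the exact schema may differ):

import json
from collections import Counter
from pathlib import Path


def count_phrases(path: Path) -> Counter:
    """Aggregate phrase frequencies over a JSONL file of {uuid: [phrases]} records."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as lines:
        for line in lines:
            record = json.loads(line)
            for phrases in record.values():
                counts.update(phrases)
    return counts


print(count_phrases(Path("output/noun_chunks.jsonl")).most_common(10))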


Issues

nlp's pipeline

How would additional pipeline components (merge_entities, merge_noun_chunks) influence the outcomes of dependency matching?
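
A quick way to see the effect (merge_entities and merge_noun_chunks are built-in spaCy components; whether dep_matcher.py should enable them is what this issue asks):

import spacy

nlp = spacy.load("en_core_web_sm")
# Retokenize so that named entities and noun chunks become single tokens,
# which changes the dependency tree the matcher operates on.
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

doc = nlp("The Russian Federation influences public opinion in Eastern Europe.")
print([token.text for token in doc])  # multi-word spans now appear as single tokens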

pattern filtering

Continuing from #5.
As corpora preprocessing turned out to have components that require a lot of rewriting, or adding new scripts altogether, I decided to drop the 'assigning uuid' part and move filtering sentences based on a pattern file to a separate issue, closing the previous one.

count frequencies

add script to process output .jsonl files & save aggregated frequencies

add lemmatizer

so that others can build lemmatized versions of our corpora

pipeline's robustness

Using the Influence project as an example, make sure the whole pipeline is robust enough:

  • make paragraph- and sentence-based versions of the document-based dataset (make_dataset)
    • add a uuid lookup to skip already-processed rows (as in dep_matcher)
  • clean each of those versions (process)
    • add a uuid lookup to skip already-processed rows (as in dep_matcher)
    • also keep the unit_id column, as we'd like to join on document_id, unit_id
  • identify relevant 'units' (document/paragraph/sentence) within a 'text' using the dependency tree (dep_matcher)
  • transform 'relevant matches' into a usable tabular final form (add a new notebook, as with counts)
  • update the colabs (or make one giant colab)

corpora preprocessing

steps:

  • transform the document-based dataset into a sentence-based one
  • assign uuids
  • add an option to filter sentences based on a pattern file

End goal: automate the data-prep step and reduce the number of sentences we feed into the dependency parser.

uuid lookups

while working on #20, I noticed a sort of 'bug'/opportunity for enhancement: the lookup doesn't work as expected when matches (writes) are infrequent.
I did implement a uuid lookup [1] that saved time because we could skip data we had already processed; but it only works if you assume that each row is logged, e.g.

  • while transforming or cleaning a dataset, I update another table on the fly, row by row (naturally storing each uuid in the table)
  • if I stop for some reason but want to resume later on, the script collects the uuids stored in the table and skips them once matched (so we continue from the latest row)

The problem arises once we start matching 'rare' patterns, e.g.

  • assume that we have 10 entries in the table, but only entries 1, 2, and 10 have relevant matches
  • the script goes through the first 7 rows before it stops, storing info from rows 1 and 2
  • once resumed, we continue from the 2nd row rather than from the 8th, as nothing was logged after the 2nd row because there was no relevant info
  • in the end we end up with 3 entries anyway, but since the lookup wasn't efficient, we waste time going through irrelevant rows we already visited

[1]

import json
from pathlib import Path
from typing import Set


def read_processed_data(path: Path) -> Set[str]:
    """Collect processed uuids."""
    seen: Set[str] = set()
    if not path.exists():
        return seen
    with path.open("r", encoding="utf-8") as lines:
        for line in lines:
            data = json.loads(line)
            for uuid in data.keys():
                seen.add(uuid)
    return seen

multiple patterns

build the matcher using patterns passed either as a single pattern file or as a non-empty directory of pattern files (see the sketch below)
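
A minimal sketch of that loading logic, assuming the pattern files are JSON (the helper name and file layout are hypothetical):

import json
from pathlib import Path
from typing import Dict, List


def load_patterns(path: Path) -> Dict[str, List]:
    """Load patterns from a single file or from every *.json file in a directory."""
    files = sorted(path.glob("*.json")) if path.is_dir() else [path]
    if not files:
        raise ValueError(f"{path} is an empty directory")
    return {file.stem: json.loads(file.read_text(encoding="utf-8")) for file in files}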

capture "paragraph"

  • ideally, find a way to grab the full paragraph within which a match occurs
  • realistically, grab the relevant sentence plus the previous (i-N) and following (i+N) sentences (see the sketch below)
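
A minimal sketch of the realistic option, assuming we know the matched token's index; depth plays the role of N (compare the --context-depth option of dep_matcher.py):

import spacy


def sentence_window(doc, match_start: int, depth: int = 1) -> str:
    """Return the sentence containing the match plus `depth` neighbours on each side."""
    sents = list(doc.sents)
    idx = next(i for i, sent in enumerate(sents) if sent.start <= match_start < sent.end)
    window = sents[max(0, idx - depth): idx + depth + 1]
    return " ".join(sent.text for sent in window)


nlp = spacy.load("en_core_web_sm")
doc = nlp("First sentence. The influence campaign grew. Last sentence.")
print(sentence_window(doc, match_start=3, depth=1))  # prints all three sentences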
