
spacy-phrases

This repository contains scripts for rule-based matching and noun-chunk extraction with spaCy.

Google colabs

  • make_dataset: transform a document-based dataset into a paragraph/sentence-based one
  • noun_chunks: extract noun phrases using spaCy's noun_chunks attribute
  • dep_matcher: match documents/sentences on the dependency tree

Installation (locally)

To use or contribute to this repository, first check out the code. Then create a new virtual environment and install the dependencies:

Windows

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python -m venv env 
$ . env/Scripts/activate
$ pip install -r requirements.txt

MacOS / Linux

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python3 -m venv env 
$ . env/bin/activate
$ pip install -r requirements.txt
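
Depending on what requirements.txt pins, you may also need to download the spaCy model(s) the scripts use by default, for example:

$ python -m spacy download en_core_web_sm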

Usage

Data transformation

Because we sometimes need several 'versions' of a corpus (document-, paragraph-, and sentence-based), the repository provides scripts/make_dataset.py, which transforms a document-based dataset into a paragraph- or sentence-based one, and scripts/process.py, which handles text preprocessing.

To prepare a dataset, run python scripts/make_dataset.py:

Usage: make_dataset.py [OPTIONS] INPUT_TABLE OUTPUT_TABLE

  Typer app that processes datasets.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_TABLE  [required]

Options:
  --lang [en|ru]                  sentencizer's base model  [default:
                                  Languages.EN]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --paragraph / --sentence        [default: sentence]
  --text TEXT                     [default: fulltext]
  --uuid TEXT                     [default: uuid]
  --lemmatize / --no-lemmatize    [default: no-lemmatize]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
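
A hypothetical invocation (the file paths are placeholders; the options mirror the help output above):

$ python scripts/make_dataset.py data/documents.csv data/sentences.csv --lang en --sentence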

Matching phrases

We've developed two different approaches to extracting noun phrases:

  • our first guess was to use the Doc's noun_chunks attribute (we iterate over noun_chunks and keep those that fit our criteria; see the sketch below). But this approach isn't perfect and doesn't work for ru models.
  • we then moved to rule-based matching, which is more flexible as long as you write accurate patterns (and works for both en and ru models).
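
For reference, a minimal sketch of the first approach (the sentence and the substring pattern are placeholders; scripts/noun_chunks.py adds batching, uuid bookkeeping, and JSONL output on top of this idea):

import spacy

nlp = spacy.load("en_core_web_sm")  # Doc.noun_chunks is not available for the ru models
doc = nlp("Foreign actors exert malign influence on public opinion.")

# Keep only the chunks that fit our criteria, here a simple substring match.
pattern = "influenc"
phrases = [chunk.text for chunk in doc.noun_chunks if pattern in chunk.text.lower()]
print(phrases)  # e.g. ['malign influence']
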
Noun_chunks

To extract phrases using the noun_chunks approach, run python scripts/noun_chunks.py:

Usage: noun_chunks.py [OPTIONS] INPUT_TABLE OUTPUT_JSONL

  Extract noun phrases using spaCy.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_JSONL  [required]

Options:
  --model TEXT                    [default: en_core_web_sm]
  --docs-max-length INTEGER       [default: 2000000]
  --batch-size INTEGER            [default: 50]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --pattern TEXT                  [default: influenc]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
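
A hypothetical invocation (the file paths are placeholders):

$ python scripts/noun_chunks.py data/sentences.csv output/noun_chunks.jsonl --model en_core_web_sm --pattern influenc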

Dependency Matcher

To extract phrases using the Dependency Matcher approach, run python scripts/dep_matcher.py:

Usage: dep_matcher.py [OPTIONS] INPUT_TABLE PATTERNS OUTPUT_JSONL

  Match dependencies using spaCy's dependency matcher.

Arguments:
  INPUT_TABLE   Input table containing text & metadata  [required]
  PATTERNS      Directory or a single pattern file with rules  [required]
  OUTPUT_JSONL  Output JSONLines file where matches will be stored  [required]

Options:
  --model TEXT                    SpaCy model's name  [default:
                                  en_core_web_sm]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --batch-size INTEGER            [default: 50]
  --context-depth INTEGER
  --merge-entities / --no-merge-entities
                                  [default: no-merge-entities]
  --merge-noun-chunks / --no-merge-noun-chunks
                                  [default: no-merge-noun-chunks]
  --keep-sentence / --no-keep-sentence
                                  [default: no-keep-sentence]
  --keep-fulltext / --no-keep-fulltext
                                  [default: no-keep-fulltext]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
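
A hypothetical invocation (the file paths are placeholders):

$ python scripts/dep_matcher.py data/sentences.csv patterns/ output/matches.jsonl --model en_core_web_sm --context-depth 1

The matching builds on spaCy's DependencyMatcher. The sketch below shows the underlying technique only, not this repository's exact pattern schema: it anchors on a token whose lemma starts with "influenc" and requires a direct nsubj child.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # anchor token: lemma starts with "influenc"
    {"RIGHT_ID": "anchor", "RIGHT_ATTRS": {"LEMMA": {"REGEX": "^influenc"}}},
    # a direct child of the anchor with the nsubj dependency
    {
        "LEFT_ID": "anchor",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("INFLUENCE", [pattern])

doc = nlp("Disinformation campaigns influence public opinion.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['influence', 'campaigns']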

Counting

Once phrases/matches have been extracted, you can transform them into a usable format and/or count their frequencies.
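
No counting script ships with the repository yet (see the "count frequencies" issue below), but the idea is a small aggregation pass. A minimal sketch, assuming each output line is a JSON object keyed by uuid with a list of matched phrases as its value (the exact schema may differ):

import json
from collections import Counter
from pathlib import Path


def count_phrases(path: Path) -> Counter:
    """Aggregate phrase frequencies over a JSONL file of {uuid: [phrases]} records."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as lines:
        for line in lines:
            record = json.loads(line)
            for phrases in record.values():
                counts.update(phrases)
    return counts


print(count_phrases(Path("output/noun_chunks.jsonl")).most_common(10))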


Issues

nlp's pipeline

How would additional pipeline components (merge_entities, merge_noun_chunks) influence the outcomes of dependency matching?
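
A quick way to see the effect (merge_entities and merge_noun_chunks are built-in spaCy components; whether dep_matcher.py should enable them is what this issue asks):

import spacy

nlp = spacy.load("en_core_web_sm")
# Retokenize so that named entities and noun chunks become single tokens,
# which changes the dependency tree the matcher operates on.
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

doc = nlp("The Russian Federation influences public opinion in Eastern Europe.")
print([token.text for token in doc])  # multi-word spans now appear as single tokens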

pattern filtering

Continuing from #5.
As corpora preprocessing turned out to have components that require a lot of rewriting, or adding new scripts altogether, I decided to drop the 'assigning uuid' part and move filtering sentences based on a pattern file to a separate issue, closing the previous one.

count frequencies

add script to process output .jsonl files & save aggregated frequencies

add lemmatizer

so that others can build lemmatized versions of our corpora

pipeline's robustness

Using the Influence project as an example, make sure the whole pipeline is robust enough:

  • make paragraph- and sentence-based versions of the document-based dataset (make_dataset)
    • add a uuid lookup to skip already-processed rows (as in dep_matcher)
  • clean each of those versions (process)
    • add a uuid lookup to skip already-processed rows (as in dep_matcher)
    • also keep the unit_id column, as we'd like to join on document_id, unit_id
  • identify relevant 'units' (document/paragraph/sentence) within a 'text' using the dependency tree (dep_matcher)
  • transform 'relevant matches' into a usable tabular final form (add a new notebook, as with counts)
  • update the colabs (or make one giant colab)

corpora preprocessing

steps:

  • transform the document-based dataset into a sentence-based one
  • assign uuids
  • add an option to filter sentences based on a pattern file

End goal: automate the data-prep step and reduce the number of sentences we feed into the dependency parser.

uuid lookups

while working on #20, I noticed a sort of 'bug'/opportunity for enhancement: the lookup doesn't work as expected when matches (writes) are infrequent.
I did implement a uuid lookup [1] that saved time because we could skip data we had already processed; but it only works if you assume that each row is logged, e.g.

  • while transforming or cleaning a dataset, I update another table on the fly, row by row (naturally storing each uuid in the table)
  • if I stop for some reason but want to resume later on, the script collects the uuids stored in the table and skips them once matched (so we continue from the latest row)

The problem arises once we start matching 'rare' patterns, e.g.

  • assume that we have 10 entries in the table, but only entries 1, 2, and 10 have relevant matches
  • the script goes through the first 7 rows before it stops, storing info from rows 1 and 2
  • once resumed, we continue from the 2nd row rather than from the 8th, as nothing was logged after the 2nd row because there was no relevant info
  • in the end we end up with 3 entries anyway, but since the lookup wasn't efficient, we waste time going through irrelevant rows we already visited

[1]

import json
from pathlib import Path
from typing import Set


def read_processed_data(path: Path) -> Set[str]:
    """Collect processed uuids."""
    seen: Set[str] = set()
    if not path.exists():
        return seen
    with path.open("r", encoding="utf-8") as lines:
        for line in lines:
            data = json.loads(line)
            for uuid in data.keys():
                seen.add(uuid)
    return seen

multiple patterns

build the matcher using patterns passed either as a single pattern file or as a non-empty directory of pattern files (see the sketch below)
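
A minimal sketch of that loading logic, assuming the pattern files are JSON (the helper name and file layout are hypothetical):

import json
from pathlib import Path
from typing import Dict, List


def load_patterns(path: Path) -> Dict[str, List]:
    """Load patterns from a single file or from every *.json file in a directory."""
    files = sorted(path.glob("*.json")) if path.is_dir() else [path]
    if not files:
        raise ValueError(f"{path} is an empty directory")
    return {file.stem: json.loads(file.read_text(encoding="utf-8")) for file in files}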

capture "paragraph"

  • ideally, find a way to grab the full paragraph within which a match occurs
  • realistically, grab the relevant sentence plus the previous (i-N) and following (i+N) sentences (see the sketch below)
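
A minimal sketch of the realistic option, assuming we know the matched token's index; depth plays the role of N (compare the --context-depth option of dep_matcher.py):

import spacy


def sentence_window(doc, match_start: int, depth: int = 1) -> str:
    """Return the sentence containing the match plus `depth` neighbours on each side."""
    sents = list(doc.sents)
    idx = next(i for i, sent in enumerate(sents) if sent.start <= match_start < sent.end)
    window = sents[max(0, idx - depth): idx + depth + 1]
    return " ".join(sent.text for sent in window)


nlp = spacy.load("en_core_web_sm")
doc = nlp("First sentence. The influence campaign grew. Last sentence.")
print(sentence_window(doc, match_start=3, depth=1))  # prints all three sentences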
