1. Term matching and identification

Objective

  1. There is a text document (a few pages of plain text)
  2. There is a large database of terms (a dictionary): single or multiple words

The objective is to identify all dictionary terms that occur in the document (note that querying the database separately for each word of the document is not an option).

Measure performance and scalability of chosen approaches.

Secondary objective

Provide fuzzy matching to deal with inflection.

Guidelines to keep in mind

  • Must run under Linux and macOS.
  • The repository must be self-sufficient, with all data and instructions for setting up the dependencies included.
  • Keep setup minimal.
  • Documentation in markdown included.
  • Results easy to evaluate.
  • Self-evaluation of the project included.

Project realization

The project started with research on possible solutions. A summary of that research is located in the brainstorming_solutions file. The remaining part of the research and the work was documented in the research_log. The workplan file contains a section with todos; in case of unclarities it may be useful as an extra resource. The benchmarks of the tested solutions are summarized in the report.

Python dependencies

Install them either with pip install or from other preferred package repositories (e.g. Manjaro provides packages for all of them). The initialize.sh script will run pip install for you; a sketch of a manual installation follows the list below.

  • Python 3
  • wikipedia
  • psycopg2
  • pdftotext - depends on poppler library
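
If you prefer to install the dependencies manually rather than through initialize.sh, a typical invocation might look like the sketch below. The package names come from the list above; the system package names in the comments are examples and may differ on your distribution.

```sh
# Install the Python dependencies listed above (requires Python 3 and pip).
# pdftotext needs the poppler C++ headers from the system package manager,
# e.g. poppler on Manjaro or libpoppler-cpp-dev on Debian/Ubuntu;
# psycopg2 may additionally need libpq headers and a C compiler when built from source.
pip install wikipedia psycopg2 pdftotext
```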

Project structure

The project is organized into 4 main categories:

  • benchmarks - containing tests, collected into sets, that can be run (the name 'collection' may also appear in the project)
  • research - containing mainly text files
  • results - storing markdown files that summarize the project
  • setup - containing scripts that will initialize the database, generate dictionaries and insert texts, as well as sample PDFs that can be efficiently parsed into such texts (see workplan for details on the planned restructuring)

Initialize

The easiest way to start using the repository is to run the initialize.sh script and wait for results.
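
Assuming the script is executed from the repository root, that amounts to:

```sh
# From the repository root: installs the Python dependencies, prepares the data
# and runs the prepared benchmark sets (see the sections below for details).
./initialize.sh
```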

Running benchmarks

Benchmarks are most often called tests here. It is best to execute them using the initialize script, but new ones that depend on different data sets can be created. Each test execution is logged to the database and can be analyzed later with queries; examples of such queries can be found in the report.
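
As a rough illustration only, such an analysis could be run through psql. The database, table and column names below are hypothetical placeholders; the actual schema and example queries are documented in the report.

```sh
# Hypothetical sketch: average execution time per test_id.
# Database, table and column names are placeholders; see the report for real queries.
psql -d term_matching -c "
  SELECT test_id, avg(execution_time) AS avg_time
  FROM test_log
  GROUP BY test_id
  ORDER BY avg_time;
"
```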

Using scripts

  • ./setup/dictionaries/wikigraph.py - generates a dictionary by following links from a start page until a specified depth is reached. It takes as arguments:
    1. Start page
    2. Search depth
    3. Dictionary table name
    4. Language prefix of Wikipedia (optional; "en" is recommended anyway)
  • ./setup/texts/parse_pdf.py - parses a PDF file and inserts it into the database. It takes as arguments:
    1. Filename or path to the PDF file
    2. Document title
    3. Number of size variations to create (optional, but "3" is required for the tests to work correctly)
  • ./setup/dictionaries/generate_postgresdict.py - creates a file that can be used as a text search dictionary, based on an existing table which is meant to be used as a dictionary. It takes as arguments:
    1. The name of the table with the dictionary
    2. The path to the directory storing Postgres' text search dictionaries (try: `pg_config --sharedir`/tsearch_data)
  • ./setup/dictionaries/configure_dicts.sh - creates and alters a text search configuration of a given name. It takes as arguments:
    1. Name of a dictionary for which the configuration is to be altered
    2. Name of a new configuration (optional, "dicts_config" by default)
  • ./benchmarks/test_sets/set* - runs a chosen collection of tests (a set). It takes as arguments:
    1. The base name of the dictionary to be used, with the assumption that two other dictionaries with the suffixes _small and _medium also exist
    2. The title of a text/document, with the assumption that it was inserted into the database with the parse_pdf.py script, which results in the title being suffixed: _0, _1, _2 (3 size variations)
    3. An index that varies the resulting test_id (optional). It is not necessary, since tests can already be distinguished by the data they used, but it can make that simpler.

Example usage of almost all of the scripts can be found in the data_sets directory, which stores prepare_data scripts (not described here) that utilize the documented scripts. A minimal sketch of typical invocations is also shown below.
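
In the sketch, all argument values (page name, depth, table and document names, file paths, and the exact set* filename) are hypothetical placeholders, not values shipped with the repository:

```sh
# Hypothetical invocations; argument values are placeholders.

# Build a dictionary by following Wikipedia links from "Biology" to depth 2,
# storing the terms in a table called biology_dict (English Wikipedia).
./setup/dictionaries/wikigraph.py "Biology" 2 biology_dict en

# Parse a sample PDF, insert it under the title sample_doc and create
# 3 size variations (sample_doc_0, sample_doc_1, sample_doc_2).
./setup/texts/parse_pdf.py ./setup/texts/sample.pdf sample_doc 3

# Export the biology_dict table as a Postgres text search dictionary file.
./setup/dictionaries/generate_postgresdict.py biology_dict \
    "$(pg_config --sharedir)/tsearch_data"

# Create/alter a text search configuration that uses the new dictionary.
./setup/dictionaries/configure_dicts.sh biology_dict dicts_config

# Run one of the benchmark sets against the generated data
# (assumes biology_dict_small and biology_dict_medium also exist).
./benchmarks/test_sets/set1 biology_dict sample_doc 1
```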
