Herelles project

Automatic protocol for the constitution of spatiotemporal and thematic corpora for the Herelles project.

This protocol builds spatiotemporal and thematic corpora:

  • Build a corpus by specifying a thematic keyword and a spatial footprint (e.g., a city or territory name).
  • Automatically attach metadata to each collected document, such as:
    • date of publication or modification
    • title
    • absolute spatial named entities found in the document title
    • relevance of the document to the theme (evaluated automatically during scraping)
    • URL
    • text
    • etc.

How does it work?

=======Input=======

  • Vocabulary of thematic concepts (VC): a set of concepts related to a specific theme (e.g. urbanisation).

  • Spatial footprint: the name of a spatial entity, such as a city or a country.

=======Output=======

  • Corpus: a .jsonl file containing the collected documents
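
A .jsonl file stores one JSON object per line, so the corpus can be read back with the standard library alone. The sketch below is a minimal illustration. The field names (`title`, `url`, `score`) and the sample records are hypothetical, chosen to mirror the metadata listed above, and may not match the keys the collector actually writes.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical records: the real keys written by the collector may differ.
sample = io.StringIO(
    '{"title": "Doc A", "url": "https://example.org/a", "score": 0.42}\n'
    '{"title": "Doc B", "url": "https://example.org/b", "score": 0.87}\n'
)

documents = list(read_jsonl(sample))
# Keep only documents above an arbitrary, illustrative relevance threshold.
relevant = [doc for doc in documents if doc["score"] >= 0.5]
print([doc["title"] for doc in relevant])
```

With a real corpus file, an opened file handle (for example `open(path, encoding="utf-8")`) would be passed to `read_jsonl` in place of the in-memory sample.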

=======Document Scoring process=======

  • To ensure the quality of the collected documents, we set up an automated evaluation based on a similarity measure computed with a Transformer model (we used DistilBERT). The measure is computed for each document against an extended concept vocabulary.

    Why an extended vocabulary? During collection we use an extended vocabulary of concepts for each theme so that we can also capture documents of a societal nature, whose content uses less formal, everyday language and which generally come from forums, blogs, etc. The extended vocabulary is obtained by generating synonyms for each term of the initial vocabulary using WordNet.

  • The most relevant documents are those with the highest similarity scores.
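
The scoring idea can be sketched without the model itself. The example below is a simplified stand-in, not the project's implementation: it replaces DistilBERT embeddings with bag-of-words counts and replaces WordNet with a tiny hand-written synonym table (`SYNONYMS`, hypothetical). It follows the same steps, though: extend the vocabulary, compute a cosine similarity between each document and the extended vocabulary, then rank documents by score.

```python
import math
import re
from collections import Counter

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "flood": ["inundation", "flooding"],
    "risk": ["hazard", "danger"],
}


def extend_vocabulary(terms):
    """Return the initial terms plus their synonyms, without duplicates."""
    extended = []
    for term in terms:
        for word in [term, *SYNONYMS.get(term, [])]:
            if word not in extended:
                extended.append(word)
    return extended


def bag_of_words(text):
    """Count lowercase word occurrences (stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(documents, terms):
    """Score each document against the extended vocabulary, best first."""
    vocab_vector = bag_of_words(" ".join(extend_vocabulary(terms)))
    scored = [(cosine(bag_of_words(doc), vocab_vector), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "The town hall opened a new library.",
    "Residents fear flooding and other hazard after the storm.",
]
ranking = rank(docs, ["flood", "risk"])
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

Without the extension step, the second document would share no word with the initial vocabulary (`flood`, `risk`) and would score 0 in this stand-in. This is the kind of informally worded document the extended vocabulary is meant to capture.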

Quick start

Clone the project:

git clone https://github.com/rdius/herelles.git

# You can specify your own vocabulary of concepts in the ./terms directory

Install the required packages:

pip install -r requirements.txt
# In main.py, set the concept vocabulary file and the spatial extent to your own.

from src.collector import scrapper

# Concept vocabulary files
vcUrb = "./terms/urbanisme.txt"  # urbanism vocabulary
vcRisq = "./terms/risque.txt"    # risk vocabulary
spatial_extent = 'Montpellier'   # spatial footprint of the corpus

# Start scraping
if __name__ == '__main__':
    print("scraping...")
    scrapper(spatial_extent, vcRisq)

Visualising the corpus

streamlit run thecob_app.py  # run the application

Specify the query parameters

Display a sample of the queried corpus. You can download the entire corpus by clicking on the link.

Visual statistics on the collected corpus
