Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE

Christian Møller Dahl, Torben Johansen, Christian Vedel, University of Southern Denmark


Welcome to the GitHub repository for OccCANINE, a tool designed to transform occupational descriptions into standardized HISCO (Historical International Standard Classification of Occupations) codes automatically. Developed by Christian Møller Dahl, Torben Johansen and Christian Vedel from the University of Southern Denmark, this tool leverages the power of a finetuned language model to process and classify occupational descriptions with high accuracy, precision, and recall.

Paper: https://arxiv.org/abs/2402.13604

Huggingface: Christianvel/OccCANINE

Slides: Breaking the HISCO Barrier

How to use OccCANINE: YouTube video

How to cite

Dahl, C. M., Johansen, T., Vedel, C. (2024). Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE. arxiv.org/abs/2402.13604

@misc{OccC2024breaking,
    title={Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE}, 
    author={Christian Møller Dahl and Torben Johansen and Christian Vedel},
    year={2024},
    eprint={2402.13604},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2402.13604}
}

Getting started

  • See the colab notebook for a demonstration of OccCANINE
  • A step-by-step installation guide can be found in GETTING_STARTED.md
  • Run python predict.py --fn-in path/to/input/data.csv --col occ1 --fn-out path/to/output/data.csv --language [lang] in the command line to get HISCO codes for all the descriptions found in the occ1 column of the input data. See predict.py for details.
  • For a simple script that reads data and uses OccCANINE to obtain HISCO codes, see PREDICT_HISCOs.py.
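
The command-line call listed above can also be assembled from a script. The following is a minimal sketch that uses only the four flags shown in this README (--fn-in, --col, --fn-out, --language). The helper name build_predict_command and the file names are hypothetical, and the language value "en" is an assumed example, since the README only shows [lang]. The sketch builds and prints the command without running it.

```python
import shlex
import sys


def build_predict_command(fn_in, col, fn_out, language, python=sys.executable):
    """Return the argument list for a predict.py call (hypothetical helper)."""
    return [
        python, "predict.py",
        "--fn-in", fn_in,       # CSV file with occupational descriptions
        "--col", col,           # column holding the descriptions
        "--fn-out", fn_out,     # where the predictions should be written
        "--language", language, # assumed example value; see predict.py
    ]


cmd = build_predict_command("input.csv", "occ1", "output.csv", "en", python="python")
print(shlex.join(cmd))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when file paths contain spaces.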

Overview

This repository provides everything needed to generate automatic HISCO codes from occupational descriptions using OccCANINE. It also provides replication files for all steps from raw training data to the final trained model.

Structure

  • Data_cleaning_scripts: Contains R scripts for processing raw data from 'Data/Raw_data' into a format suitable for training, which is then stored in 'Data/Training_data', 'Data/Validation_data', and 'Data/Test_data'.
  • histocc: Contains Python scripts for training OccCANINE and using the already finetuned version of it.
  • Model_evaluation_scripts: Contains a mix of R and Python scripts which generate the model evaluation statistics and plots found in the associated paper.

histocc folder

The histocc folder contains all the code used for training and application of OccCANINE.

  • Data/: Contains 'key.csv', which maps integer codes (generated by OccCANINE) to HISCO codes based on definitions by https://github.com/cedarfoundation/hisco. It also contains toy data to use when trying out OccCANINE for the first time.
  • model_assets.py: Defines the underlying PyTorch model.
  • attacker.py: Defines text attack procedure used for text augmentation in training.
  • trainer.py: Defines training procedures.
  • dataLoader.py: Defines how data is loaded and fed to the model in training.
  • prediction_assets.py: Functions and classes to use OccCANINE. This also contains the 'OccCANINE' class, which serves as the main user interface in most cases.
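
To illustrate how a key file such as 'key.csv' can turn integer codes back into HISCO codes, here is a minimal sketch. The column names code and hisco, the helper names, and the sample rows are all assumptions made for illustration; the real file layout may differ.

```python
import csv
import io

# Illustrative stand-in for key.csv; column names and rows are assumptions,
# not values copied from the repository.
KEY_CSV = """code,hisco
0,61110
1,79100
2,02120
"""


def load_key(handle):
    """Read a two-column key file into an {integer code: HISCO string} dict."""
    reader = csv.DictReader(handle)
    return {int(row["code"]): row["hisco"] for row in reader}


def decode(int_codes, key):
    """Map integer codes to HISCO codes; unknown codes become None."""
    return [key.get(c) for c in int_codes]


key = load_key(io.StringIO(KEY_CSV))
print(decode([2, 0, 99], key))
```

HISCO codes are kept as strings in this sketch so that codes with a leading zero are not silently shortened.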

Model_evaluation_scripts folder

The Model_evaluation_scripts folder contains all the code used to generate the model evaluation results shown in the paper.

Python scripts

  • n001_Predict_eval.py: Runs predictions on 1 million validation observations.
  • n002_Copenhagen_burial.py: Runs predictions on 200 observations from the Copenhagen Burial Records from Link Lives.
  • n003_Training_ship_data.py: Runs predictions on 200 observations of parents' occupations from the Indefatigable training ship.
  • n004_Dutch_familiegeld.py: Runs predictions on 200 observations of occupations in the Dutch familiegeld.
  • n005_Swedish_strikes.py: Runs predictions on 200 observations of the professions recorded for Swedish strikes.

R scripts

  • 000_Functions.R: Contains functions used in evaluation.
  • 001_Generate_eval_stats.R: Generates accuracy, precision, etc. for validation data across various subgroups.
  • 002_Nature_of_mistakes.R: Returns plots and statistics which give insight into the nature of mistakes in cases where OccCANINE disagrees with the validation data.
  • 101_Eval_illustrations.R: Generates most of the illustrations and statistics shown in the paper.
  • 102_Embeddings_visualisation.R: Generates the t-SNE illustrations of the embeddings.
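
As a rough illustration of the kind of statistics mentioned for 001_Generate_eval_stats.R, the sketch below computes observation-level accuracy, precision, and recall when each description can carry more than one HISCO code. It is written in Python rather than R, the function name evaluate is hypothetical, and the definitions used here (exact set match for accuracy, per-observation overlap averaged across observations for precision and recall) are one common choice, not necessarily the ones used in the evaluation scripts or the paper.

```python
def evaluate(true_sets, pred_sets):
    """Average exact-match accuracy, precision and recall over observations."""
    n = len(true_sets)
    if n == 0 or n != len(pred_sets):
        raise ValueError("inputs must be non-empty and of equal length")

    exact = precision = recall = 0.0
    for true, pred in zip(true_sets, pred_sets):
        true, pred = set(true), set(pred)
        hits = len(true & pred)                       # correctly predicted codes
        exact += true == pred                         # 1 only if the sets match exactly
        precision += hits / len(pred) if pred else 0.0
        recall += hits / len(true) if true else 0.0

    return {"accuracy": exact / n, "precision": precision / n, "recall": recall / n}


# Illustrative labels only: the second observation has two true codes,
# of which the prediction recovers one.
truth = [{"61110"}, {"79100", "41030"}]
preds = [{"61110"}, {"79100"}]
print(evaluate(truth, preds))
```

In this example the second prediction is fully precise but recovers only half of the true codes, so recall drops while precision stays at 1.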

Data Cleaning

Scripts for data cleaning are located in 'Data_cleaning_scripts' and should be run in numeric order as indicated by the script names.

  • 000_Function.R: Contains functions shared across all data cleaning scripts.
  • 001_Assets_for_cleaning.R: Generates assets for data cleaning, such as the encoding of HISCO to a 1 to N encoding.
  • 00[x]_...R (where x>1): Cleans individual data sources, saving one file 'Clean_....Rdata' for each source.
  • 101_Train_test_val_split.R: Ensures consistency and saves training, validation, and test data.

Data Cleaning Process

  1. Sanity Check: Manual verification of data content and consistency.
  2. Extracting Relevant Data: Extraction of relevant variables, keeping both 'raw' and 'cleaned' occupational descriptions as separate observations when available.
  3. Combinations: Synthetic creation of descriptions representing more than one occupation by combining descriptions with the respective language's word for 'and'.
  4. Filtering: Removal of observations with invalid HISCO codes based on the 'hisco' R library.
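
Steps 3 and 4 above can be sketched as follows. This is a Python illustration under stated assumptions, not the repository's R code: the language keys, the helper names, the sample rows, and the hard-coded set of valid codes are hypothetical, and the set stands in for the lookup that the cleaning scripts base on the 'hisco' R library.

```python
import random

# Word for 'and' per language; the language keys are assumed for illustration.
AND_WORDS = {"en": "and", "da": "og", "nl": "en", "se": "och"}


def combine(desc_a, desc_b, lang):
    """Join two descriptions with the language's word for 'and'."""
    return f"{desc_a} {AND_WORDS[lang]} {desc_b}"


def make_combinations(rows, lang, n, seed=0):
    """Create n synthetic rows, each built from two distinct input rows.

    rows is a list of (description, list_of_hisco_codes) pairs.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        (d1, c1), (d2, c2) = rng.sample(rows, 2)
        out.append((combine(d1, d2, lang), c1 + c2))
    return out


def filter_valid(rows, valid_codes):
    """Keep only rows whose codes all appear in the set of valid codes."""
    return [(d, codes) for d, codes in rows if all(c in valid_codes for c in codes)]


# Illustrative rows; the last one carries a deliberately invalid code.
rows = [("landmand", ["61110"]), ("skrædder", ["79100"]), ("ukendt", ["9999x"])]
clean = filter_valid(rows, valid_codes={"61110", "79100"})
print(make_combinations(clean, "da", n=3, seed=1))
```

Each synthetic row carries the codes of both source descriptions, so the combined description is labelled with more than one occupation.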

Structure of Training Data

The training data is structured with variables including the year of observation, a unique ID for every observation, the occupational description string, HISCO codes, integer codes for HISCO, the language of the description, and a string indicating the data split (train, val1, val2, or test).
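
The record layout described above could be represented as in the following sketch. The field names are hypothetical (the paragraph lists the variables but not their column names), and the example values are made up; only the four split labels are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

SPLITS = ("train", "val1", "val2", "test")  # split labels named in the text


@dataclass
class Observation:
    year: int            # year of observation
    obs_id: str          # unique ID of the observation
    description: str     # occupational description string
    hisco: List[str]     # HISCO codes
    codes: List[int]     # integer codes for the HISCO codes
    lang: str            # language of the description
    split: str           # one of SPLITS

    def __post_init__(self):
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")
        if len(self.hisco) != len(self.codes):
            raise ValueError("hisco and codes must have the same length")


# Made-up example records.
data = [
    Observation(1880, "a1", "farmer", ["61110"], [0], "en", "train"),
    Observation(1880, "a2", "tailor", ["79100"], [1], "en", "train"),
    Observation(1901, "b7", "skrædder", ["79100"], [1], "da", "test"),
]
print(Counter(obs.split for obs in data))
```

Validating the split label on construction catches mislabelled rows early, before they leak between training and evaluation data.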

