
PoTeC - Potsdam Textbook Corpus

This repository contains the Potsdam Textbook Corpus (PoTeC), a naturalistic reading eye-tracking corpus. Four groups of participants (beginner- and expert-level students of physics and biology) read 12 short texts taken from physics and biology textbooks while their eye movements were recorded. The final dataset contains reading data for 75 participants, each of whom read all 12 texts. The study follows a 2x2x2 fully-crossed factorial design:

  • Factor 1: Study discipline of participant with the levels either physics or biology
  • Factor 2: Study level of participant with the levels either beginner or expert
  • Factor 3: Text domain with the levels either physics or biology
Number of participants per group:

|          | Physics | Biology |
|----------|---------|---------|
| Beginner | 12      | 16      |
| Expert   | 20      | 27      |

The two participant factors (study discipline and study level) are quasi-experimental and manipulated between subjects. The readers' text comprehension as well as their background knowledge of the topics presented in the texts were assessed with multiple-choice questions.
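The fully-crossed design above can be enumerated explicitly; a minimal sketch using the factor levels listed above:

```python
from itertools import product

# Factor levels as described above
discipline = ["physics", "biology"]   # participant's study discipline
level = ["beginner", "expert"]        # participant's study level
text_domain = ["physics", "biology"]  # domain of the text being read

# Fully crossed: every combination of the three binary factors occurs
cells = list(product(discipline, level, text_domain))
print(len(cells))  # 8 design cells
```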

More information can be found in the READMEs of the respective subfolders.

Download the data

The data files are stored in an OSF repository. Once this repository has been cloned, they can be downloaded and extracted automatically using the following script:

# or python3
python download_data_files.py

# OR to extract the files directly
python download_data_files.py --extract

Alternatively, they can be downloaded manually from the OSF repository and extracted into the respective folders.

pymovements integration

PoTeC is integrated into the pymovements package, which makes it easy to download the raw data and process it further. The following code snippet shows how to download the data:

# pip install pymovements
import pymovements as pm

dataset = pm.Dataset('PoTeC', path='data/PoTeC')

dataset.download()

Note on reading in the data files

The German text p3 includes the word "null". If, for example, the word features are read with pandas, the word "null" is interpreted as an NA value by default. To avoid this behavior, `read_csv` can be called with the following arguments:

import pandas as pd

# keep_default_na=False disables the default NA markers (which include "null");
# na_values then restores all default markers except "null"/"NULL"
word_features = pd.read_csv(
    'word_features_p3.tsv', sep='\t',
    keep_default_na=False,
    na_values=['#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NaN', 'None', 'n/a',
               'nan', ''],
)
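To see the difference, here is a quick check with a small in-memory TSV (hypothetical columns, not the real word-features schema):

```python
import io
import pandas as pd

# Hypothetical two-column TSV containing the word "null"
tsv = "word\tlength\nnull\t4\nund\t3\n"

# Default behavior: "null" is parsed as NaN
default_df = pd.read_csv(io.StringIO(tsv), sep='\t')
print(default_df['word'].isna().any())   # True

# With keep_default_na=False, "null" stays a string
fixed_df = pd.read_csv(io.StringIO(tsv), sep='\t', keep_default_na=False)
print(fixed_df['word'].tolist())         # ['null', 'und']
```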

Data Overview

The data used to create the corpus, as well as the data obtained during the experiments, is made available at various processing stages. The data is stored in subfolders, each of which contains a README that provides more information about the data and how to use it. For a detailed description of the data types, formats, and content, please refer to the CODEBOOK.

This repository contains the following data:

  • Eye-tracking data
    • raw eye-tracking data
    • preprocessed eye-tracking data
  • Stimuli
    • stimuli texts
    • text and background questions
  • Anonymized participant data
  • Scripts (in Python)
    • scripts to preprocess the data
    • additional scripts that have been used to process the data further

The scripts were run using Python 3.9 with the dependencies specified in the requirements.txt file.

Technical set-up

The experiment was run with the following technical set-up:

| Category | Setting | Value |
|----------|---------|-------|
| Technical set-up | Eye-tracking device | EyeLink 1000, desktop-mounted camera system with a 35 mm lens |
| | Sampling rate | 1000 Hz |
| | Monitor size | 47.5 x 30 cm, 22 inch |
| | Monitor resolution | 1680 x 1050 pixels |
| | Eye-to-screen distance | 61 cm |
| | Eye-to-camera distance | 65 cm |
| | Experiment software | Experiment Builder, provided by SR Research |
| Stimulus presentation | Background color | Black |
| | Font color | White |
| | Font size | 18 |
| | Font | Courier |
| | Stimulus size | On average 158 words shown on multiple lines on one page |
| | Characters per degree of visual angle (middle of screen) | 2.8 |
| | Spacing | |
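The figure of 2.8 characters per degree can be roughly cross-checked from the monitor geometry above; a back-of-the-envelope sketch (assuming a fixed-width Courier character grid):

```python
import math

# Values from the table above
screen_width_cm = 47.5
screen_width_px = 1680
distance_cm = 61.0

# Size of one degree of visual angle at screen center
cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
px_per_cm = screen_width_px / screen_width_cm
px_per_degree = cm_per_degree * px_per_cm  # ~37.7 px per degree

# With 2.8 characters per degree, each character occupies
char_width_px = px_per_degree / 2.8        # ~13.4 px per character
print(round(px_per_degree, 1), round(char_width_px, 1))
```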

Stimuli

Please note that the full stimuli texts are not yet available. Contact deborahnoemie.jakobi(at)uzh.ch for more information.

Stimuli Annotation

The stimuli have been manually annotated with part-of-speech tags and other linguistic information. The annotations are described in a separate file: ANNOTATION.

Citation

@misc{potec,
    url={\url{https://github.com/DiLi-Lab/PoTeC}},
    author={Jakobi, Deborah N. and Kern, Thomas and Reich, David R. and Haller, Patrick and J\"ager, Lena A.},
    title={{PoTeC}: A {German} Naturalistic Eye-tracking-while-reading Corpus},
    year={2024},
    note={under review}
}

Repository Structure

PoTeC-data
├── CODEBOOK.md
├── README.md
├── requirements.txt
├── additional_scripts
│   ├── ADDITIONAL_SCRIPTS.md
│   ├── compute_reading_measures.py
│   ├── generate_scanpaths.py
│   ├── merge_reading_measures.py
│   ├── create_codebook_tables.py
│   ├── surprisal.py
│   ├── get_surprisal.py
│   ├── merge_fixations_and_coordinates.py
│   ├── merge_scanpaths.py
│   ├── analyses.R
│   ├── run_bayesian_models.R
│   ├── run_freq_models.R
│   ├── all_colls_description.csv
│   └── all_codebook_texts.csv
├── eyetracking_data
│   ├── EYETRACKING_DATA.md
│   ├── original_uncorrected_fixation_report.txt
│   ├── fixations
│   │   └── ...
│   ├── fixations_uncorrected
│   │   └── ...
│   ├── asc_files
│   │   └── ...
│   ├── raw_data 
│   │   └── ...
│   ├── reader_merged
│   │   └── ...
│   ├── reading_measures
│   │   └── ...
│   ├── scanpaths
│   │   └── ...
│   └── scanpaths_merged
│       └── ...
├── participants
│   ├── PARTICIPANTS.md
│   └── participant_data.tsv
├── preprocessing_scripts
│   ├── PREPROCESSING_SCRIPTS.md
│   ├── char_index_to_word_index.py
│   ├── create_word_aoi_limits.py
│   ├── correct_fixations.py
│   ├── split_fixation_report.py
│   ├── asc_to_csv.py
│   ├── aoi_to_word.tsv
│   ├── sent_limits.json
│   └── word_limits.json
└── stimuli
    ├── ANNOTATION.md
    ├── STIMULI.md
    ├── practice_items.txt
    ├── dependency_trees_manually_corrected.tsv
    ├── aoi_texts
    │   └── ...
    ├── stimuli
    │   ├── stimuli.bib
    │   ├── items.tsv
    │   └── stimuli.tsv
    └── word_features
        └── ...

Issues

couldn't test split_fixation_report.py

split_fixation_report.py

Used to split the fixation report created by the Data Viewer into individual fixation files for each text and reader.
In addition, it creates a file containing all reader IDs (RECORDING_SESSION_LABEL): participants/readerIDs.txt.
Note: since we do not provide the corrected fixation report (see preprocessing pipeline), this script cannot be used
to recreate the fixation files provided in the repository.
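The splitting described above could be sketched with pandas; `RECORDING_SESSION_LABEL` comes from the issue text, while the `text_id` column name is a hypothetical stand-in:

```python
import io
import pandas as pd

# Toy fixation report; RECORDING_SESSION_LABEL is mentioned in the issue,
# the "text_id" column name is a hypothetical stand-in
report = io.StringIO(
    "RECORDING_SESSION_LABEL\ttext_id\tfix_dur\n"
    "reader01\tp1\t210\n"
    "reader01\tp2\t185\n"
    "reader02\tp1\t230\n"
)
df = pd.read_csv(report, sep='\t')

# One output table per (reader, text) combination
groups = {key: g for key, g in df.groupby(['RECORDING_SESSION_LABEL', 'text_id'])}
print(sorted(groups))  # [('reader01', 'p1'), ('reader01', 'p2'), ('reader02', 'p1')]

# All reader IDs, as would be written to participants/readerIDs.txt
reader_ids = sorted(df['RECORDING_SESSION_LABEL'].unique())
print(reader_ids)      # ['reader01', 'reader02']
```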

delete raw data part of preprocessing/README.md

parse_asc_files.py

TODO
Old comment: This script extracts the lines with samples (timestamp, x-screen coordinate, y-screen coordinate, pupil diameter)
from EyeLink 1000 raw data files (previously converted from EDF to ASCII).
The data was recorded with different scripts; in some sessions, practice trials (e.g. session 1.asc) were recorded,
in most sessions not. In the sessions where practice trials were recorded, all trial variables TRIAL_VAR were written
12 times rather than once to the data (always after the eye movement samples had been written).
This script handles both kinds of data files.

rename reading-measures_definitions.md

we should rename it:

reading-measures_definitions.md => READING-MEASURES_DEFINITIONS.md

and maybe move it so that we only have CODEBOOK.md and README.md on the top level, with the reading-measure definitions moved to where they belong (eyetracking_data, presumably). We might want to discuss where to move it.

publish manually corrected data

  1. unpublished: Manually corrected fixation report
    : We used a script to manually correct the fixations, as they were not always aligned correctly. The script used to correct the fixations is not published either.

We should publish them.

missing parse_asc_files.py and OSF/eyetracking_data/FixRep_20_Mai_2017.txt

in the preprocessing_scripts/README.md it is mentioned:

  1. published: .csv files
    : .asc files parsed into .tsv files, contains one sample per line. Script parse_asc_files.py creates the .csv file from the .asc files.

and

  1. published: Original fixation report
    : Based on the csv files we used the SR Research Data Viewer to create a fixation report. OSF/eyetracking_data/FixRep_20_Mai_2017.txt.

both are missing in the repo

publish raw data

@LenaJaeger @theDebbister

In addition to the fixation data, we might want to release the raw sample data, since we are now also doing reading comprehension work at the sample level.

stimuli/texts missing

stimuli/texts are mentioned in the stimuli/README.md.

They are missing:

In addition, the texts are also provided as separate text files in the [texts](./texts) folder.

There are both texts_and_questions and text_examples, and I'm unsure which we want to link. Neither is easily processable with Python, so we might need to change that.
