Hi, I'm @luispintoc
Welcome to my repo!
You can reach me via email at [email protected] or via LinkedIn
https://www.kaggle.com/competitions/linking-writing-processes-to-writing-quality/overview
License: Apache License 2.0
The generated essays seem way too large (almost 1 GB of text!), so the reconstruction is likely wrong and needs to be reworked. Fix the generation so the output is correct.
Turn some predictions into categoricals to get a perfect score (0 RMSE)
Idea:
Use conformal prediction (uncertainty)
Since we'll have many models, use the std across their predictions
The rate of written language production can be measured by counting the number of characters, words, clauses, sentences, or T-units in the writing process or written product generated per unit of time. Example measures are as follows.
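As a minimal sketch of such a rate measure, here is characters produced per minute per essay, assuming a `train_logs` frame with the competition's `id`, `down_time`, `up_time` (milliseconds) and `activity` columns, and treating each "Input" event as roughly one character:

```python
import pandas as pd

def chars_per_minute(train_logs: pd.DataFrame) -> pd.Series:
    # Count "Input" events per essay as a proxy for characters produced.
    inputs = train_logs[train_logs["activity"] == "Input"].groupby("id").size()
    # Session length in minutes: first key-down to last key-up.
    minutes = (
        train_logs.groupby("id")["up_time"].max()
        - train_logs.groupby("id")["down_time"].min()
    ) / 60_000
    return (inputs / minutes).rename("chars_per_minute")
```

The same pattern extends to words or sentences by swapping the event count for a word or sentence count over the reconstructed text.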
Reconstruct a given essay's text in order to extract structural features.
Leads:
https://people.eng.unimelb.edu.au/baileyj/papers/paper249-EDM.pdf
Check what the pressed-keys field is
check process_variance
expand on cursor visits
do aggregations in "segments visits"
expand on relative size paragraph
Punctuation per intro/body/conclusion
Add aggs to time features
Take a look at deletions
IKI papers read
Recap of issues opened so far + TODOs for these issues
Basic:
TODO:
Text changes:
TODO:
Occurrences of Activities:
Moving selections of text: (Move From action)
Misc:
TODO:
Calculating runtime on the 10 smallest essays.
Note: I'm too lazy to time these accurately, so the run times are variable; take them only as rough estimates of how long feature creation takes. All we care about is spotting big speed-ups: if one way of calculating something doesn't give a big speed-up, we change the approach.
Dump of feature ideas:
Output of the model is continuous, from 0.5 to 6
We can optimize the output by making it discrete, i.e. in steps of 0.5
use quantification?
use a classifier to decide?
majority voting? (when using ensemble, instead of avg, take the one closest to the discrete value)
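A sketch of the first and last ideas, under the assumption that valid scores form a 0.5-step grid from 0.5 to 6.0 (function names are illustrative, not from the repo):

```python
import numpy as np

GRID = np.arange(0.5, 6.01, 0.5)  # valid scores: 0.5, 1.0, ..., 6.0

def snap_to_grid(preds):
    # Replace each continuous prediction with the nearest grid value.
    preds = np.asarray(preds)
    idx = np.abs(preds[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[idx]

def closest_member_vote(member_preds):
    # Ensemble variant: instead of averaging, for each sample keep the
    # member prediction closest to some grid value, then snap it.
    member_preds = np.asarray(member_preds)  # shape (n_models, n_samples)
    dist = np.abs(member_preds[:, :, None] - GRID[None, None, :]).min(axis=2)
    best = member_preds[dist.argmin(axis=0), np.arange(member_preds.shape[1])]
    return snap_to_grid(best)
```

Whether snapping actually helps RMSE depends on how well-calibrated the continuous predictions are near grid midpoints; it's worth validating on a holdout before committing to it.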
Revisions are operations of deletions or insertions in writing. A deletion is defined as the removal of any stretch of characters from a text whereas an insertion refers to a sequence of activities to add characters to a growing text (except the end). Below are some commonly used revision measures:
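A rough per-essay count of both operation types can be sketched directly from the log, assuming the competition's `activity` column where insertions show up as "Input" events and deletions as "Remove/Cut" (with "Replace" doing both at once):

```python
import pandas as pd

def revision_counts(train_logs: pd.DataFrame) -> pd.DataFrame:
    acts = train_logs.groupby("id")["activity"]
    return pd.DataFrame({
        # Insertions: events that add characters to the growing text.
        "n_insertions": acts.apply(lambda s: (s == "Input").sum()),
        # Deletions: events that remove a stretch of characters.
        "n_deletions": acts.apply(lambda s: s.isin(["Remove/Cut", "Replace"]).sum()),
    })
```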
Bursts refer to the periods in text production in which stretches of texts were continuously produced with no pauses and/or revisions. There are mainly two types of bursts: P-bursts that refer to the written segments terminated by pauses, and R-bursts that describe the segments terminated by an evaluation, revision or other grammatical discontinuity.
Pauses are generally defined as inter-keystroke intervals (IKI) above a certain threshold (e.g., 2000 milliseconds). The IKI refers to the gap time between two consecutive key presses typically expressed in milliseconds. To illustrate, suppose a writer types a character "A" at time 1 and then a character "B" at time 2. One can obtain the IKI between the two characters simply using the formula: IKI = Time2 - Time1. Global measures of pausing are usually associated with the duration and frequency of pauses calculated from different dimensions. Below are some typical pause measures.
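The Time2 − Time1 formula above maps directly onto a groupby-diff, assuming `down_time` is the key-press timestamp in milliseconds and events are ordered by `event_id` within each essay:

```python
import pandas as pd

def add_iki(train_logs: pd.DataFrame, threshold_ms: int = 2000) -> pd.DataFrame:
    out = train_logs.sort_values(["id", "event_id"]).copy()
    # IKI = Time2 - Time1 between consecutive key presses within an essay.
    out["iki"] = out.groupby("id")["down_time"].diff()
    # Flag IKIs above the threshold as pauses (first event has no IKI).
    out["is_pause"] = out["iki"] > threshold_ms
    return out
```

Global pause features (count, mean duration, pauses per minute, etc.) then fall out of simple aggregations over `iki` and `is_pause`.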
Idea: they are usually connectors
Something along the lines of:
```python
map(
    something,  # placeholder from the original note
    re.match(
        # the square brackets must be escaped inside the regex
        r"Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]",
        row["activity"],
    ).groups(),
)
```
Could be faster than the current way of doing it, which takes ~4 seconds
```python
def calculate_bursts(df):
    df = df.sort_values(['id', 'event_id']).copy()
    df['next_down_time'] = df.groupby('id')['down_time'].shift(-1)
    df['pause'] = df['next_down_time'] - df['up_time']
    df['is_revision'] = df['activity'].isin(['Remove/Cut', 'Replace'])

    p_bursts = []
    r_bursts = []
    for _, group in df.groupby('id'):
        p_burst = r_burst = 0
        for _, row in group.iterrows():
            if row['pause'] > 2000:  # pause ends the current P-burst
                p_bursts.append(p_burst)
                p_burst = 0
            if row['is_revision']:  # revision ends the current R-burst
                r_bursts.append(r_burst)
                r_burst = 0
            p_burst += 1
            r_burst += 1
        # Close out the final burst of each essay here, inside the per-essay
        # loop; otherwise every essay's last burst except the final one is lost.
        p_bursts.append(p_burst)
        r_bursts.append(r_burst)
    return p_bursts, r_bursts

p_bursts, r_bursts = calculate_bursts(train_logs)
```
Does this code make sense then? We should include 'Replace', I think.
P-burst:
R-burst:
Between P-bursts and R-bursts, we apply a condition when selecting entries for P-bursts, but no condition before selecting entries for R-bursts. From what I understand, this means some entries used to calculate R-burst features are necessarily also used to calculate P-burst features. Is this an issue? Because my understanding is that the two types of bursts are mutually exclusive.
Originally posted by @lucselmes in #28 (comment)
Process variance attends to the dynamics of the writing process in relation to time and thus represents how the writer's fluency may differ at different stages.
Process variance is generally measured by first dividing the whole writing process into a certain number of equal time intervals (e.g., 5 or 10) and then calculating the total number of characters produced in the intervals (often normalized to the average number of characters per minute), or to make it more comparable among writers, the proportion of characters produced per interval. The standard deviation of characters produced per interval is also calculated from keystroke logs as an indicator of process variance.
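A minimal sketch of that measure, assuming a `train_logs` frame with `id`, `down_time` (ms) and `activity` columns, and taking the std of the proportion of "Input" events per interval:

```python
import numpy as np
import pandas as pd

def process_variance(train_logs: pd.DataFrame, n_intervals: int = 10) -> pd.Series:
    def per_essay(g):
        inputs = g.loc[g["activity"] == "Input", "down_time"]
        if inputs.empty:
            return 0.0
        # Split the session into equal time intervals and bin key presses.
        bins = np.linspace(g["down_time"].min(), g["down_time"].max() + 1,
                           n_intervals + 1)
        counts, _ = np.histogram(inputs, bins=bins)
        # Std of the proportion of characters produced per interval.
        return (counts / counts.sum()).std()

    return train_logs.groupby("id").apply(per_essay).rename("process_variance")
```

A perfectly steady writer scores near 0; a writer who front-loads or back-loads their production scores higher.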