Hi, I'm @luispintoc
Welcome to my repo!
You can reach me via email at [email protected] or via LinkedIn
https://www.kaggle.com/competitions/linking-writing-processes-to-writing-quality/overview
License: Apache License 2.0
The generated essays seem way too large (almost 1 GB of text!), so the reconstruction is likely wrong and needs to be reworked. Fix the generation so the output is correct.
Turn some predictions into categoricals to get a perfect score (0 RMSE)
Idea:
Use conformal prediction (uncertainty)
Since we'll have many models, use the std across their predictions
The rate of written language production can be measured by counting the number of characters, words, clauses, sentences, or T-units in the writing process or written product generated per unit of time. Example measures are as follows.
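As a minimal sketch of such a rate measure, here is characters produced per minute per essay, assuming a `train_logs` frame with the competition's `id`, `down_time`, `up_time` (milliseconds) and `activity` columns, and treating each "Input" event as roughly one character:

```python
import pandas as pd

def chars_per_minute(train_logs: pd.DataFrame) -> pd.Series:
    # Count "Input" events per essay as a proxy for characters produced.
    inputs = train_logs[train_logs["activity"] == "Input"].groupby("id").size()
    # Session length in minutes: first key-down to last key-up.
    minutes = (
        train_logs.groupby("id")["up_time"].max()
        - train_logs.groupby("id")["down_time"].min()
    ) / 60_000
    return (inputs / minutes).rename("chars_per_minute")
```

The same pattern extends to words or sentences by swapping the event count for a word or sentence count over the reconstructed text.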
Reconstruct a given essay's text in order to extract structural features.
Leads:
https://people.eng.unimelb.edu.au/baileyj/papers/paper249-EDM.pdf
Check what the pressed-keys field is
check process_variance
expand on cursor visits
do aggregations in "segments visits"
expand on relative size paragraph
Punctuation per intro/body/conclusion
Add aggs to time features
Take a look at deletions
IKI papers read
Recap of issues opened so far + TODOs for these issues
Basic:
TODO:
Text changes:
TODO:
Occurrences of Activities:
Moving selections of text: (Move From action)
Misc:
TODO:
Calculating runtime on the 10 smallest essays.
Note: I'm too lazy to time these accurately, so the run times are variable; take them only as rough estimates of how long feature creation takes. All we care about is spotting big speed-ups: if one way of calculating something doesn't give a big speed-up, we change the approach.
Dump of feature ideas:
Output of the model is continuous, from 0.5 to 6
We can optimize the output by making it discrete, i.e. in steps of 0.5
use quantification?
use a classifier to decide?
majority voting? (when using ensemble, instead of avg, take the one closest to the discrete value)
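A sketch of the first and last ideas, under the assumption that valid scores form a 0.5-step grid from 0.5 to 6.0 (function names are illustrative, not from the repo):

```python
import numpy as np

GRID = np.arange(0.5, 6.01, 0.5)  # valid scores: 0.5, 1.0, ..., 6.0

def snap_to_grid(preds):
    # Replace each continuous prediction with the nearest grid value.
    preds = np.asarray(preds)
    idx = np.abs(preds[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[idx]

def closest_member_vote(member_preds):
    # Ensemble variant: instead of averaging, for each sample keep the
    # member prediction closest to some grid value, then snap it.
    member_preds = np.asarray(member_preds)  # shape (n_models, n_samples)
    dist = np.abs(member_preds[:, :, None] - GRID[None, None, :]).min(axis=2)
    best = member_preds[dist.argmin(axis=0), np.arange(member_preds.shape[1])]
    return snap_to_grid(best)
```

Whether snapping actually helps RMSE depends on how well-calibrated the continuous predictions are near grid midpoints; it's worth validating on a holdout before committing to it.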
Revisions are operations of deletions or insertions in writing. A deletion is defined as the removal of any stretch of characters from a text whereas an insertion refers to a sequence of activities to add characters to a growing text (except the end). Below are some commonly used revision measures:
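A rough per-essay count of both operation types can be sketched directly from the log, assuming the competition's `activity` column where insertions show up as "Input" events and deletions as "Remove/Cut" (with "Replace" doing both at once):

```python
import pandas as pd

def revision_counts(train_logs: pd.DataFrame) -> pd.DataFrame:
    acts = train_logs.groupby("id")["activity"]
    return pd.DataFrame({
        # Insertions: events that add characters to the growing text.
        "n_insertions": acts.apply(lambda s: (s == "Input").sum()),
        # Deletions: events that remove a stretch of characters.
        "n_deletions": acts.apply(lambda s: s.isin(["Remove/Cut", "Replace"]).sum()),
    })
```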
Bursts refer to the periods in text production in which stretches of texts were continuously produced with no pauses and/or revisions. There are mainly two types of bursts: P-bursts that refer to the written segments terminated by pauses, and R-bursts that describe the segments terminated by an evaluation, revision or other grammatical discontinuity.
Pauses are generally defined as inter-keystroke intervals (IKI) above a certain threshold (e.g., 2000 milliseconds). The IKI refers to the gap time between two consecutive key presses typically expressed in milliseconds. To illustrate, suppose a writer types a character "A" at time 1 and then a character "B" at time 2. One can obtain the IKI between the two characters simply using the formula: IKI = Time2 - Time1. Global measures of pausing are usually associated with the duration and frequency of pauses calculated from different dimensions. Below are some typical pause measures.
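The Time2 − Time1 formula above maps directly onto a groupby-diff, assuming `down_time` is the key-press timestamp in milliseconds and events are ordered by `event_id` within each essay:

```python
import pandas as pd

def add_iki(train_logs: pd.DataFrame, threshold_ms: int = 2000) -> pd.DataFrame:
    out = train_logs.sort_values(["id", "event_id"]).copy()
    # IKI = Time2 - Time1 between consecutive key presses within an essay.
    out["iki"] = out.groupby("id")["down_time"].diff()
    # Flag IKIs above the threshold as pauses (first event has no IKI).
    out["is_pause"] = out["iki"] > threshold_ms
    return out
```

Global pause features (count, mean duration, pauses per minute, etc.) then fall out of simple aggregations over `iki` and `is_pause`.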
Idea: they are usually connectors
Something along the lines of:
```python
map(
    something,  # placeholder from the original note
    re.match(
        # the square brackets must be escaped inside the regex
        r"Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]",
        row["activity"],
    ).groups(),
)
```
Could be faster than the current way of doing it, which takes ~4 seconds
```python
def calculate_bursts(df):
    df = df.sort_values(['id', 'event_id']).copy()
    df['next_down_time'] = df.groupby('id')['down_time'].shift(-1)
    df['pause'] = df['next_down_time'] - df['up_time']
    df['is_revision'] = df['activity'].isin(['Remove/Cut', 'Replace'])

    p_bursts = []
    r_bursts = []
    for _, group in df.groupby('id'):
        p_burst = r_burst = 0
        for _, row in group.iterrows():
            if row['pause'] > 2000:  # pause ends the current P-burst
                p_bursts.append(p_burst)
                p_burst = 0
            if row['is_revision']:  # revision ends the current R-burst
                r_bursts.append(r_burst)
                r_burst = 0
            p_burst += 1
            r_burst += 1
        # Close out the final burst of each essay here, inside the per-essay
        # loop; otherwise every essay's last burst except the final one is lost.
        p_bursts.append(p_burst)
        r_bursts.append(r_burst)
    return p_bursts, r_bursts

p_bursts, r_bursts = calculate_bursts(train_logs)
```
Does this code make sense then? We should include 'Replace', I think.
P-burst:
R-burst:
Between P-bursts and R-bursts, we apply a condition when selecting entries for P-bursts, but no condition before selecting entries for R-bursts. From what I understand, this means some entries used to calculate R-burst features are necessarily also used to calculate P-burst features. Is this an issue? Because my understanding is that the two types of bursts are mutually exclusive.
Originally posted by @lucselmes in #28 (comment)
Process variance attends to the dynamics of the writing process in relation to time and thus represents how the writer's fluency may differ at different stages.
Process variance is generally measured by first dividing the whole writing process into a certain number of equal time intervals (e.g., 5 or 10) and then calculating the total number of characters produced in the intervals (often normalized to the average number of characters per minute), or to make it more comparable among writers, the proportion of characters produced per interval. The standard deviation of characters produced per interval is also calculated from keystroke logs as an indicator of process variance.
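A minimal sketch of that measure, assuming a `train_logs` frame with `id`, `down_time` (ms) and `activity` columns, and taking the std of the proportion of "Input" events per interval:

```python
import numpy as np
import pandas as pd

def process_variance(train_logs: pd.DataFrame, n_intervals: int = 10) -> pd.Series:
    def per_essay(g):
        inputs = g.loc[g["activity"] == "Input", "down_time"]
        if inputs.empty:
            return 0.0
        # Split the session into equal time intervals and bin key presses.
        bins = np.linspace(g["down_time"].min(), g["down_time"].max() + 1,
                           n_intervals + 1)
        counts, _ = np.histogram(inputs, bins=bins)
        # Std of the proportion of characters produced per interval.
        return (counts / counts.sum()).std()

    return train_logs.groupby("id").apply(per_essay).rename("process_variance")
```

A perfectly steady writer scores near 0; a writer who front-loads or back-loads their production scores higher.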