linking-writing's People

Contributors

lucselmes, luispintoc

linking-writing's Issues

Fix Essay Reconstruction Funcs

The generated essay output is far too large (almost 1 GB of text!), so the essays are likely being reconstructed incorrectly. Rework the generation so that the output is correct.

Production Rate

The rate of written language production can be measured by counting the number of characters, words, clauses, sentences, or T-units in the writing process or written product generated per unit of time. Example measures are as follows.

  • number of characters (including spaces) produced per minute during the process
  • number of characters (including spaces) produced per minute in the product
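The two example measures above can be sketched as follows. This is a minimal illustration, not the project's implementation: the event layout `(down_time_ms, up_time_ms, chars_typed)` and both function names are assumptions made for the example, and the session duration is taken as last key release minus first key press.

```python
def chars_per_minute(n_chars, duration_ms):
    """Characters per minute over a duration given in milliseconds."""
    if duration_ms <= 0:
        return 0.0
    return n_chars / (duration_ms / 60_000)


def production_rates(events, final_text):
    """Return (process rate, product rate) in characters per minute.

    events: list of (down_time_ms, up_time_ms, chars_typed) tuples,
    a hypothetical layout chosen for this sketch.
    """
    if not events:
        return 0.0, 0.0
    # Session duration: first key press to last key release.
    duration_ms = events[-1][1] - events[0][0]
    # Process: every character typed, including ones deleted later.
    process_chars = sum(chars for _, _, chars in events)
    # Product: only the characters that survive in the final text.
    return (
        chars_per_minute(process_chars, duration_ms),
        chars_per_minute(len(final_text), duration_ms),
    )


events = [(0, 100, 1), (30_000, 30_100, 1), (60_000, 60_100, 1), (119_900, 120_000, 1)]
process_rate, product_rate = production_rates(events, "abc")
```

Here four characters were typed over two minutes but only three remain in the final text, so the process rate is higher than the product rate.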

Feature eng so far

Overview

Recap of issues made so far + TODOs for these issues

Basic:

  • No. of activities
  • Average time of an action
  • Total time of all actions
  • Total essay size (max word count)
  • Number of unique text changes
  • Average cursor position

TODO:

  • Normalize average cursor position wrt essay size
  • Ensure that total time of all actions is calculated properly

Text changes:

  • No. of large text changes (size=20)
  • No. of extremely large text changes (size=100)
  • No. of very small text changes (size=5)

TODO:

  • Make the size of the text selections relative to the essay

Occurrences of Activities:

  • Count of nonproduction
  • Count of input
  • Count of remove
  • Count of replace
  • Count of paste

Moving selections of text: (Move From action)

  • Size of the moved text (['mean', 'max'])
  • Distance between the original and final locations of the moved text (['mean', 'max'])

Misc:

  • Amount of time spent writing (relative to the maximum allowed time of 30 minutes)

TODO:

  • Fix the way the amount of time spent writing is normalized (the up_time measure seems to exceed the maximum allowed time)

Optimize Essay Reconstruction

  • Optimize for memory usage
  • Optimize for speed (priority)

Calculating runtime on 10 smallest essays.

Note: I haven't measured run time accurately, so the run times vary and should be taken only as an estimate of the time it takes to create features. All we care about is seeing big speed-ups. If a way of calculating something doesn't give a big speed-up, we change the approach.

Misc features

Dump of feature ideas:

  • Actual word count
  • Length of an action
  • Last word count
  • [DONE] Number of paragraphs (number of times \n is used)
  • Create time series features and then aggregate them
  • Time it took to write
  • Did they review (reread) what they wrote?
  • Distance of mouse movement
  • Angle of mouse movement
  • Movement around essay with non mouse movement
  • Different common patterns of activities (maybe look at an X-sized window of actions and determine which ones are more common among different scoring groups)
  • [DONE] Essay structure reconstruction
    • What is the structure of the essay
    • Average size of
  • Usage of punctuation
    • Double quotations --> How many citations are there?
    • Square brackets --> References
  • Time allocation
    • What is the average time of their pauses?
    • What is the pattern of the pauses?
    • What is the patterns/length of pauses at the beginning of the essay writing?
    • What is the patterns/length of pauses at the end of the essay writing?
  • Perform clustering on the features to try to separate them into 4 groups (4 prompts)
  • perplexity --> LLM scores
  • Words per minute --> Speed of typing
  • Introduction is for the end
    • Can you isolate when each paragraph was written over the course of the total essay-writing time?

Optimize output submission

The output of the model is continuous, from 0.5 to 6.
We can optimize the output by making it discrete, i.e. steps of 0.5.
Use quantization?
Use a classifier to decide?
Majority voting? (When using an ensemble, instead of averaging, take the prediction closest to a discrete value.)
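A minimal sketch of the discretization idea, assuming predictions are plain floats. `snap_to_grid` and `vote` are hypothetical names, and `vote` is only one possible reading of the majority-voting idea (snap each ensemble member, then take the most frequent grid value), not a method taken from this project.

```python
from collections import Counter


def snap_to_grid(pred, low=0.5, high=6.0, step=0.5):
    """Clip a continuous prediction to [low, high] and round to the nearest step."""
    clipped = min(max(pred, low), high)
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(clipped / step) * step


def vote(preds):
    """Snap each ensemble member, then return the most frequent grid value."""
    snapped = [snap_to_grid(p) for p in preds]
    return Counter(snapped).most_common(1)[0][0]


single = snap_to_grid(3.74)
too_high = snap_to_grid(7.2)
ensemble = vote([3.4, 3.6, 4.1])
```

Whether snapping helps depends on the metric: for a squared-error metric, rounding can hurt when the model is uncertain between two grid values, so it is worth checking on validation data first.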

Revision

Revisions are deletion or insertion operations in writing. A deletion is defined as the removal of any stretch of characters from a text, whereas an insertion refers to a sequence of activities that adds characters to a growing text at any point other than its end. Below are some commonly used revision measures:

  • number of deletions (in total or per minute)
  • number of insertions (in total or per minute)
  • length of deletions (in characters)
  • length of insertions (in characters)
  • proportion of deletions (as a % of total writing time)
  • proportion of insertions (as a % of total writing time)
  • product vs. process ratio (The number of characters in the product divided by the number of characters produced during the writing process)
  • number/length of revisions at the point of inscription (i.e., at the current end of a text being produced)
  • number/length of revisions after the text has been transcribed (i.e., at a previous point in the text)
  • number of immediate revisions (the distance between the position of the flashing cursor and the revision point equal to zero)
  • number of distant revisions (the distance between the position of the flashing cursor and the revision point larger than zero)
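A few of the measures above can be sketched as follows. The `(activity, n_chars)` event layout and the function name are assumptions made for illustration. The sketch covers only the deletion counts and the product vs. process ratio, since separating insertions from ordinary typing at the end of the text would also need cursor positions.

```python
def revision_summary(events, final_text, total_minutes):
    """Summarise deletions and the product vs. process ratio.

    events: list of (activity, n_chars) tuples, a hypothetical layout.
    """
    deletions = [n for activity, n in events if activity == "Remove/Cut"]
    # Characters produced during the process, including ones deleted later.
    produced = sum(n for activity, n in events if activity == "Input")
    return {
        "n_deletions": len(deletions),
        "deletions_per_minute": len(deletions) / total_minutes if total_minutes > 0 else 0.0,
        "deleted_chars": sum(deletions),
        "product_process_ratio": len(final_text) / produced if produced else 0.0,
    }


events = [("Input", 10), ("Remove/Cut", 4), ("Input", 6)]
summary = revision_summary(events, final_text="x" * 12, total_minutes=2)
```

A ratio close to 1 means almost everything typed ended up in the final text, while a low ratio indicates heavy revision.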

Burst

Bursts refer to the periods in text production in which stretches of texts were continuously produced with no pauses and/or revisions. There are mainly two types of bursts: P-bursts that refer to the written segments terminated by pauses, and R-bursts that describe the segments terminated by an evaluation, revision or other grammatical discontinuity.

  • number of P-bursts (in total or per minute)
  • number of R-bursts (in total or per minute)
  • proportion of P-bursts (as a % of total writing time)
  • proportion of R-bursts (as a % of total writing time)
  • length of P-bursts (in characters)
  • length of R-bursts (in characters)

Pause

Pauses are generally defined as inter-keystroke intervals (IKI) above a certain threshold (e.g., 2000 milliseconds). The IKI refers to the gap time between two consecutive key presses typically expressed in milliseconds. To illustrate, suppose a writer types a character "A" at time 1 and then a character "B" at time 2. One can obtain the IKI between the two characters simply using the formula: IKI = Time2 - Time1. Global measures of pausing are usually associated with the duration and frequency of pauses calculated from different dimensions. Below are some typical pause measures.

  • number of pauses (in total or per minute)
  • proportion of pause time (as a % of total writing time)
  • pause length (usually the mean duration of all pauses in text production)
  • pause lengths or frequencies within words, between words, between sentences, between paragraphs, etc.
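The IKI formula and the first three measures above can be sketched as follows, assuming a list of key-press times in milliseconds. The function name and the returned keys are hypothetical, and the 2000 ms threshold is just the example value from the text.

```python
def pause_features(down_times_ms, threshold_ms=2000):
    """Compute simple pause measures from consecutive key-press times."""
    # IKI = Time2 - Time1 for each pair of consecutive key presses.
    ikis = [t2 - t1 for t1, t2 in zip(down_times_ms, down_times_ms[1:])]
    pauses = [iki for iki in ikis if iki > threshold_ms]
    total_ms = down_times_ms[-1] - down_times_ms[0] if len(down_times_ms) > 1 else 0
    return {
        "n_pauses": len(pauses),
        "pause_time_proportion": sum(pauses) / total_ms if total_ms else 0.0,
        "mean_pause_ms": sum(pauses) / len(pauses) if pauses else 0.0,
    }


features = pause_features([0, 200, 400, 3400, 3600, 9600])
```

In this toy sequence the IKIs are 200, 200, 3000, 200 and 6000 ms, so two of them count as pauses under the 2000 ms threshold.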

Use Regex to extract Move From values?

Something along the lines of:

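import re

# Pull the four integers out of a "Move From [a, b] To [c, d]" activity string.
# The square brackets are escaped so they match literally
# instead of opening a character class.
from_start, from_end, to_start, to_end = map(
    int,
    re.match(
        r"Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]",
        row["activity"],
    ).groups(),
)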

This could be faster than the current way of doing it, which takes ~4 seconds.
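A self-contained sketch of the regex idea, with the pattern compiled once and the square brackets escaped so they match literally. The helper names are hypothetical, and the size and distance definitions (end minus start, and the shift of the start index) are assumptions about how the Move From features are meant to be computed.

```python
import re

# Compile once so the pattern is not re-parsed for every row.
MOVE_RE = re.compile(r"Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]")


def parse_move(activity):
    """Return (from_start, from_end, to_start, to_end), or None for other activities."""
    match = MOVE_RE.fullmatch(activity)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def move_size_and_distance(activity):
    """Size of the moved text and how far its start position shifted."""
    parsed = parse_move(activity)
    if parsed is None:
        return None
    from_start, from_end, to_start, _ = parsed
    return from_end - from_start, abs(to_start - from_start)


parsed = parse_move("Move From [284, 292] To [282, 290]")
size, distance = move_size_and_distance("Move From [284, 292] To [282, 290]")
```

Returning `None` for non-matching rows means the function can be applied to the whole activity column without filtering first.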

Checking bursts:

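def calculate_bursts(df, pause_threshold_ms=2000):
    """Return the event counts of all P-bursts and R-bursts across essays."""
    # sort_values returns a copy, so the caller's DataFrame is not modified.
    df = df.sort_values(['id', 'event_id'])
    # Gap between releasing this key and pressing the next one, within an essay.
    df['next_down_time'] = df.groupby('id')['down_time'].shift(-1)
    df['pause'] = df['next_down_time'] - df['up_time']
    df['is_revision'] = df['activity'].isin(['Remove/Cut', 'Replace'])

    p_bursts = []
    r_bursts = []
    for _, group in df.groupby('id'):
        p_burst = r_burst = 0
        # zip over the two columns instead of iterrows, which is much slower.
        for pause, is_revision in zip(group['pause'], group['is_revision']):
            # 'pause' is the gap AFTER this event, so count the event first;
            # otherwise it would be attributed to the next P-burst.
            p_burst += 1
            if pause > pause_threshold_ms:  # End of a P-burst
                p_bursts.append(p_burst)
                p_burst = 0
            if is_revision:  # End of an R-burst
                r_bursts.append(r_burst)
                r_burst = 0
            # As before, a revision event is counted toward the next R-burst.
            r_burst += 1
        p_bursts.append(p_burst)  # Last P-burst (the final pause is NaN)
        r_bursts.append(r_burst)  # Last R-burst

    return p_bursts, r_bursts

# train_logs is the keystroke-log DataFrame for the training essays.
p_bursts, r_bursts = calculate_bursts(train_logs)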

Does this code make sense then? I think we should include 'Replace'.

Checking bursts:

P-burst:

  • Seems to be correct. The only thing that could be up for discussion is the 2-second threshold for labeling a P-burst.

R-burst:

  • Filters on the input and remove/cut activities, then filters again on remove/cut, which seems redundant. This only affects efficiency, so we don't care.

For P-burst we apply a condition when selecting entries, but for R-burst we don't apply any condition before selecting entries. From what I understand, this means that some entries used in calculating R-burst features are necessarily also used in calculating P-burst features. Is this an issue? My understanding is that the two types of bursts are mutually exclusive.

Originally posted by @lucselmes in #28 (comment)

Ideas

  • Change 0.5 to 1
  • Change 6 to 5.5
  • Add noise to labels
  • High classifier
  • Low classifier
  • Conformal pred
  • Multiclass

Process Variance

Process variance attends to the dynamics of the writing process in relation to time and thus represents how the writer's fluency may differ at different stages.

Process variance is generally measured by first dividing the whole writing process into a certain number of equal time intervals (e.g., 5 or 10) and then calculating the total number of characters produced in the intervals (often normalized to the average number of characters per minute), or to make it more comparable among writers, the proportion of characters produced per interval. The standard deviation of characters produced per interval is also calculated from keystroke logs as an indicator of process variance.
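The procedure described above can be sketched as follows, assuming one timestamp and one character count per event. The function name, the argument layout and the choice of the population standard deviation are assumptions made for this illustration.

```python
import statistics


def process_variance(event_times_ms, chars_per_event, n_intervals=5):
    """Split the session into equal time intervals and summarise characters per interval."""
    start, end = min(event_times_ms), max(event_times_ms)
    width = (end - start) / n_intervals
    counts = [0] * n_intervals
    for t, chars in zip(event_times_ms, chars_per_event):
        # Events at the very end of the session fall into the last interval.
        idx = min(int((t - start) / width), n_intervals - 1) if width > 0 else 0
        counts[idx] += chars
    total = sum(counts)
    proportions = [c / total for c in counts] if total else [0.0] * n_intervals
    # Standard deviation of characters per interval as the variance indicator.
    return counts, proportions, statistics.pstdev(counts)


times = [0, 5, 10, 15, 50, 100]
counts, proportions, std = process_variance(times, [1] * len(times), n_intervals=5)
```

In this toy session most characters are typed in the first interval, which shows up as a large standard deviation relative to the mean of 1.2 characters per interval.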
