cs448_midterm's Introduction

cs448_midterm


Split the three algorithms among the three of us:

  • Bayesian Classifier
  • SVM
  • Linear / Logistic Regression (for this task, most likely logistic regression)

Ideas for maximizing performance: passing in the word alone probably won't be enough to predict its part of speech. We might also want to incorporate positional encoding (where the word sits in the sentence), or the words that come before and after it.
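A minimal sketch of what such context-aware features could look like. The function name extract_context_features, the feature keys, and the "<START>" / "<END>" padding markers are assumptions made for illustration, not part of the existing code.

```python
def extract_context_features(tokens, i):
    """Build a feature dict for tokens[i] from the word, its position, and its neighbours."""
    last = len(tokens) - 1
    return {
        "token": tokens[i].lower(),
        "position": i,                 # index of the word in the sentence
        "is_first": i == 0,
        "is_last": i == last,
        # Hypothetical padding markers for words at the sentence boundary.
        "prev_token": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_token": tokens[i + 1].lower() if i < last else "<END>",
    }


sentence = ["The", "dog", "barks"]
features = [extract_context_features(sentence, i) for i in range(len(sentence))]
print(features[1])
```

Dictionaries of this shape can go straight into scikit-learn's DictVectorizer, which one-hot encodes the string values and passes numeric values through unchanged.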

Linear/Logistic Regression


Here is an explanation of how the logistic regression model for POS tagging works, with a brief overview of its implementation:

Data Preparation: The code reads training data from "train.txt", which contains sentences with tokens and their corresponding POS tags. It extracts the token and POS tag from each line and stores them as a list of (token, POS tag) tuples.
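The loading step might look roughly like the sketch below. It assumes each non-blank line is whitespace-separated with the token first and the POS tag second; the actual layout of "train.txt" is not shown here, so treat that as a guess. The helper name load_tagged_tokens and the sample rows are made up.

```python
import io


def load_tagged_tokens(lines):
    """Return a list of (token, pos_tag) tuples from an iterable of text lines."""
    data = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:  # skip blank lines (sentence breaks) and malformed rows
            continue
        data.append((parts[0], parts[1]))
    return data


# Stand-in for open("train.txt", encoding="utf-8"); these rows are invented.
sample = io.StringIO("The DT B-NP\ndog NN I-NP\n\nbarks VBZ B-VP\n")
data = load_tagged_tokens(sample)
print(data)
```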

Feature Extraction: It defines a feature extraction function extract_features(token) that extracts features for each token. In the provided code, it uses the token itself as a feature.

Data Vectorization: It vectorizes the features using scikit-learn's DictVectorizer. This step converts the feature dictionaries into a numerical format that can be used for training a machine learning model.
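The feature extraction and vectorization steps can be sketched as follows. The token list is invented, and the body of extract_features is an assumption based on the description above (the token itself as the only feature).

```python
from sklearn.feature_extraction import DictVectorizer


def extract_features(token):
    """Use the token itself as the only feature."""
    return {"token": token}


tokens = ["The", "dog", "barks", "The"]
vectorizer = DictVectorizer()
# Each distinct token string becomes its own one-hot column, e.g. "token=dog".
X = vectorizer.fit_transform([extract_features(t) for t in tokens])
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Because every distinct word gets its own column, a word never seen during training maps to an all-zero row at prediction time, which is one reason the context features discussed earlier may help.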

Logistic Regression Model Training: It trains a logistic regression model using scikit-learn's LogisticRegression class. The features and corresponding labels (POS tags) are used for training the model.
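A small sketch of the training step on a made-up six-token dataset. The max_iter=1000 setting is an assumption chosen to avoid convergence warnings on larger data; it is not taken from the project code.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Invented (token, POS tag) pairs standing in for the real training data.
data = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"),
        ("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
features = [{"token": token} for token, _ in data]
labels = [tag for _, tag in data]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)  # string labels are accepted directly

print(clf.classes_)
print(clf.predict(vectorizer.transform([{"token": "dog"}])))
```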

Model Evaluation: It evaluates the trained model using a small part of the training data as a dev set. It calculates and prints the classification report, which includes metrics such as precision, recall, and F1-score for each POS tag.
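For reference, the report can be produced as sketched below. The gold and predicted tags are invented; in the real code they would be the dev-set labels and the output of clf.predict on the dev-set features.

```python
from sklearn.metrics import classification_report

# Made-up dev-set labels and model predictions.
y_dev = ["DT", "NN", "VBZ", "NN"]
y_pred = ["DT", "NN", "NN", "NN"]

# zero_division=0 reports 0.0 (instead of warning) for tags that are never predicted.
report = classification_report(y_dev, y_pred, zero_division=0)
print(report)
```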

Prediction: Finally, it provides a function predict_pos_tags(sentence, vectorizer, clf) that allows you to predict POS tags for new sentences using the trained classifier.
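A sketch of how predict_pos_tags might be written. Only the function name and its three parameters come from the description above; the body, the whitespace tokenization, and the toy training data are assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def extract_features(token):
    return {"token": token}


def predict_pos_tags(sentence, vectorizer, clf):
    """Return (token, predicted_tag) pairs for a whitespace-tokenized sentence."""
    tokens = sentence.split()
    if not tokens:
        return []
    X = vectorizer.transform([extract_features(t) for t in tokens])
    return list(zip(tokens, clf.predict(X)))


# Invented training data, just enough to make the example run.
train = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"),
         ("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([extract_features(tok) for tok, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X_train, [tag for _, tag in train])

print(predict_pos_tags("The dog barks", vectorizer, clf))
```

The same fitted vectorizer must be reused at prediction time (transform, not fit_transform) so that the columns line up with what the classifier saw during training.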

cs448_midterm's People

Contributors

noahschiro, azizakaa


cs448_midterm's Issues

Dev dataset not needed

Just an issue I noticed, whoever wants to take it may.

In our train/test split we are also splitting out a dev dataset. A dev dataset makes sense when we are training over multiple epochs and want to prevent overfitting: it gives us a "runtime" estimate of the network's performance without having to go through the whole test set. If you want more explanation of why we don't need a dev dataset, reach out to me here or comment on this issue.

So I want to cut out the dev dataset and just have train / test with an 80 / 20 split. However, this is going to break a bunch of code in Logistic_reg.py and bayes.py, because certain functions there expect three datasets.
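One possible way to do the 80 / 20 split is scikit-learn's train_test_split, sketched below on invented data. The random_state value is arbitrary, and splitting individual tokens (rather than whole sentences) is a simplification that may not match how Logistic_reg.py and bayes.py organize their data.

```python
from sklearn.model_selection import train_test_split

# Ten made-up (token, POS tag) pairs standing in for the real dataset.
data = [(f"tok{i}", "NN" if i % 2 else "DT") for i in range(10)]

# test_size=0.2 gives the 80 / 20 split; random_state makes it reproducible.
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_data), len(test_data))
```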

Test on new test set

Could someone please:

  • Add the test set provided to us
  • Write logic to load this in
  • Test on this test set. NLTK has a POS tagger; please use its output as the "ground truth" that we measure our accuracy against. This is probably the best way we have to measure accuracy, since we don't have labels.

There is now a PR linked to this. I am updating what has been done by checking the boxes.
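The comparison against NLTK described in this issue could be sketched roughly as follows. The helper names reference_tags and agreement are hypothetical, and the example tags are made up. nltk.pos_tag needs NLTK and its tagger model to be installed (via nltk.download), so it is wrapped in a function that the example itself does not call.

```python
def reference_tags(tokens):
    """Tag tokens with NLTK's default tagger and return only the tags."""
    import nltk  # imported lazily so the scoring helper works without NLTK
    return [tag for _, tag in nltk.pos_tag(tokens)]


def agreement(predicted, reference):
    """Fraction of positions where our tag matches the reference tag."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Invented tags; in practice `reference` would come from reference_tags(tokens).
predicted = ["DT", "NN", "NN"]
reference = ["DT", "NN", "VBZ"]
print(agreement(predicted, reference))
```

Since NLTK's tagger makes its own mistakes, this number measures agreement with NLTK rather than true accuracy. NLTK's default English tagger emits Penn Treebank tags, so the comparison is only meaningful if our training data uses the same tagset.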
