cs448_midterm's Introduction

cs448_midterm

Split the three algorithms among the three of us:

Bayesian Classifier
SVM
Linear / Logistic Regression (for this task, most likely logistic regression)

Ideas for maximizing performance: passing in the word alone probably won't be enough to predict its part of speech. We may also want to encode position (where the word sits in the sentence) and context (which words come before or after it).
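A minimal sketch of what such context features could look like. The function name `extract_context_features`, the feature keys, and the `<START>` / `<END>` padding markers are all hypothetical and are not taken from the repository code.

```python
def extract_context_features(tokens, i):
    """Build a feature dict for tokens[i] from its position and neighbours.

    Hypothetical helper: the keys and padding markers are illustrative only.
    """
    return {
        "token": tokens[i].lower(),
        # Position of the word in the sentence (0-based index).
        "position": i,
        # Neighbouring words, padded at the sentence boundaries.
        "prev_token": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_token": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


sentence = ["The", "dog", "barks"]
for index in range(len(sentence)):
    print(extract_context_features(sentence, index))
```

Dictionaries of this shape can be fed to the same vectorizer as single-token features: string values are one-hot encoded, and the numeric `position` is passed through as a number.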

Linear/Logistic Regression

Here is an explanation of how the logistic regression for POS tagging works, with a brief overview of its implementation:

Data Preparation: The code reads training data from "train.txt," which contains sentences with tokens and their corresponding POS tags. It extracts the token and POS tag from each line and stores them in a list of tuples, where each tuple contains a token and its POS tag.
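The loading step might look roughly like the sketch below. It assumes each non-blank line is whitespace-separated with the token in the first column and the POS tag in the second, and that blank lines separate sentences; the actual layout of train.txt is not shown here, so adjust the column indices if it differs. The name `load_tagged_tokens` is hypothetical.

```python
def load_tagged_tokens(lines):
    """Return a list of (token, pos_tag) tuples from an iterable of lines.

    Assumes "token POS ..." columns separated by whitespace (an assumption
    about the file format, not confirmed by the repository).
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank separator lines and malformed rows
        pairs.append((parts[0], parts[1]))
    return pairs


# In-memory sample so the sketch runs without the real file.
sample = ["The DT B-NP\n", "dog NN I-NP\n", "\n", "barks VBZ B-VP\n"]
print(load_tagged_tokens(sample))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

To read the real data, pass an open file handle instead of the sample list, for example `with open("train.txt") as f: data = load_tagged_tokens(f)`.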

Feature Extraction: It defines a feature extraction function extract_features(token) that extracts features for each token. In the provided code, it uses the token itself as a feature.
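As a rough illustration, a token-only version of extract_features could be as small as the following. The feature key "token" is an assumption; the repository code may use a different key.

```python
def extract_features(token):
    """Return a feature dict for one token, using only the token itself."""
    # The key name "token" is an assumption made for this sketch.
    return {"token": token}


print(extract_features("dog"))  # {'token': 'dog'}
```

Returning a dictionary rather than a bare string makes it easy to add more features later without changing the rest of the pipeline.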

Data Vectorization: It vectorizes the features using scikit-learn's DictVectorizer. This step converts the feature dictionaries into a numerical format that can be used for training a machine learning model.
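A small sketch of the vectorization step on toy feature dictionaries. The data is made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dicts; "dog" appears twice and maps to the same column.
feature_dicts = [
    {"token": "The"},
    {"token": "dog"},
    {"token": "barks"},
    {"token": "dog"},
]

vectorizer = DictVectorizer()  # produces a sparse matrix by default
X = vectorizer.fit_transform(feature_dicts)

print(X.shape)                    # (4, 3): four tokens, three distinct values
print(vectorizer.feature_names_)  # one "token=<value>" column per distinct token
```

String-valued features are one-hot encoded, so each distinct token gets its own column. A token that was never seen during fitting maps to an all-zero row at transform time, which is one reason token-only features generalize poorly to new words.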

Logistic Regression Model Training: It trains a logistic regression model using scikit-learn's LogisticRegression class. The features and corresponding labels (POS tags) are used for training the model.
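The training step can be sketched as follows on a handful of made-up tagged tokens. The `max_iter=1000` setting is an assumption chosen to avoid convergence warnings on larger data, not a value taken from the repository.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy (token, POS tag) pairs standing in for the real training data.
tagged = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
]

feature_dicts = [{"token": token} for token, _ in tagged]
labels = [tag for _, tag in tagged]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(feature_dicts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

print(clf.classes_)  # the distinct POS tags the model can predict
```

LogisticRegression accepts string labels directly and handles the multiclass case itself, so the POS tags do not need to be encoded as integers first.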

Model Evaluation: It evaluates the trained model using a small part of the training data as a dev set. It calculates and prints the classification report, which includes metrics such as precision, recall, and F1-score for each POS tag.
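For reference, this is roughly how a classification report is produced from dev-set labels and predictions. The two label lists below are made-up toy values, not results from the project.

```python
from sklearn.metrics import classification_report

# Toy gold labels and predictions for five dev-set tokens.
y_dev = ["DT", "NN", "VBZ", "NN", "DT"]
y_pred = ["DT", "NN", "NN", "NN", "DT"]

# zero_division=0 avoids warnings for tags that are never predicted.
print(classification_report(y_dev, y_pred, zero_division=0))
```

In the real pipeline, `y_pred` would come from calling `predict` on the vectorized dev-set features.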

Prediction: Finally, it provides a function predict_pos_tags(sentence, vectorizer, clf) that allows you to predict POS tags for new sentences using the trained classifier.
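A possible shape for predict_pos_tags, shown with a toy model so the sketch runs on its own. Only the function name and signature come from the description above; the whitespace tokenization, the "token" feature key, and the training data are assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def predict_pos_tags(sentence, vectorizer, clf):
    """Return (token, predicted_tag) pairs for a whitespace-tokenized sentence."""
    tokens = sentence.split()  # assumption: simple whitespace tokenization
    if not tokens:
        return []
    X = vectorizer.transform([{"token": token} for token in tokens])
    return list(zip(tokens, clf.predict(X)))


# Toy model trained on six made-up tokens.
train = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
]
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([{"token": token} for token, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X_train, [tag for _, tag in train])

print(predict_pos_tags("the cat barks", vectorizer, clf))
```

The features built at prediction time must match the ones used in training, so if context features are added to training, the same extraction has to be applied here.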

cs448_midterm's Issues

Implement support vector machine

Dev dataset not needed

Just an issue I noticed, whoever wants to take it may.

In our train / test split we are also splitting out a dev dataset. A dev dataset makes sense when we are training over multiple epochs and want to prevent overfitting: it gives us a "runtime" estimate of the network's performance without having to go through the whole test set. If you want more explanation of why we don't need a dev dataset, reach out to me here or comment on this issue.

So I want to cut out the dev dataset and just have train / test with an 80 / 20 split. However, this is going to break a bunch of code in Logistic_reg.py and bayes.py, because certain functions there expect three datasets.
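For whoever picks this up, a two-way 80 / 20 split could be sketched with scikit-learn as below. The toy data and the `random_state` value are placeholders, and this does not show the changes needed in Logistic_reg.py or bayes.py.

```python
from sklearn.model_selection import train_test_split

# Toy (token, tag) pairs standing in for the real dataset.
tagged = [(f"tok{i}", "NN" if i % 2 else "DT") for i in range(10)]

# test_size=0.2 gives the 80 / 20 split; random_state makes it reproducible.
train_data, test_data = train_test_split(tagged, test_size=0.2, random_state=42)

print(len(train_data), len(test_data))  # 8 2
```

Note that this shuffles individual tokens. If we later add features based on neighbouring words, it may be safer to split by sentence so that context is not shared between train and test.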

Implement logistic regression

Implement Bayesian classifier

Test on new test set

Could someone please:

Add the test set provided to us
Write logic to load this in
Test on this test set (NLTK has a POS tagger. Please use its output as the "ground truth" that we measure our accuracy against). This is probably the best measure of accuracy available to us, since the test set has no labels.
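One way the comparison against NLTK could be sketched. The helper names `nltk_reference_tags` and `agreement_with_reference` are hypothetical. `nltk.pos_tag` needs its tagger model downloaded first via `nltk.download`, and the resource name depends on the NLTK version, so check the NLTK docs for the installed release.

```python
def nltk_reference_tags(tokens):
    """Tag a list of tokens with NLTK's default tagger (Penn Treebank tags)."""
    import nltk  # imported lazily so the metric below works without NLTK

    return [tag for _, tag in nltk.pos_tag(tokens)]


def agreement_with_reference(predicted_tags, reference_tags):
    """Fraction of positions where our tags match the reference tags."""
    if len(predicted_tags) != len(reference_tags):
        raise ValueError("tag sequences must have the same length")
    if not reference_tags:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted_tags, reference_tags))
    return matches / len(reference_tags)


# Toy tags; in practice the reference would come from nltk_reference_tags(tokens).
ours = ["DT", "NN", "VBZ"]
reference = ["DT", "NN", "NN"]
print(agreement_with_reference(ours, reference))
```

Since NLTK's tags are themselves predictions, this number measures agreement with NLTK rather than true accuracy. It is also only meaningful if our training data uses the same tag set as NLTK's default tagger.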

There is now a PR linked to this issue. I'm updating what has been done by checking the boxes.
