
CS224u: Natural Language Understanding

Code for the Stanford course.

Spring 2023

Christopher Potts

Core components

setup.ipynb

Details on how to get set up to work with this code.

hw_*.ipynb

The set of homeworks for the current run of the course.

tutorial_* notebooks

Introductions to Jupyter notebooks, scientific computing with NumPy and friends, and PyTorch.

torch_*.py modules

A generic optimization class (torch_model_base.py) and subclasses for GloVe, Autoencoders, shallow neural classifiers, RNN classifiers, tree-structured networks, and grounded natural language generation.

tutorial_pytorch_models.ipynb shows how to use these modules as a general framework for creating original systems.
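
As a rough sketch of that pattern (a hypothetical example, not a class from the repo; the full set of hooks, including dataset construction and loss, is covered in the tutorial notebook):

import torch.nn as nn
from torch_model_base import TorchModelBase

class MyLinearClassifier(TorchModelBase):
    """Hypothetical illustration of the subclassing pattern."""

    def __init__(self, input_dim, n_classes, **base_kwargs):
        self.input_dim = input_dim
        self.n_classes = n_classes
        super().__init__(**base_kwargs)  # batch size, optimizer, etc.

    def build_graph(self):
        # The base class's fit loop optimizes whatever nn.Module
        # this returns.
        return nn.Linear(self.input_dim, self.n_classes)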

evaluation_*.ipynb and projects.md

Notebooks covering key experimental methods and practical considerations, and tips on writing up and presenting work in the field.

iit* and feature_attribution.ipynb

Part of our unit on explainability and model analysis.

np_*.py modules

This is now considered background material for the course.

Reference implementations for the torch_*.py models, designed to reveal more about how the optimization process works.

vsm_*

This is now considered background material for the course.

A unit on vector space models of meaning, covering traditional methods like PMI and LSA as well as newer methods like Autoencoders and GloVe. vsm.py provides a lot of the core functionality, and torch_glove.py and torch_autoencoder.py are the learned models that we cover. vsm_03_contextualreps.ipynb explores methods for deriving static representations from contextual models.

sst_*

This is now considered background material for the course.

A unit on sentiment analysis with the English Stanford Sentiment Treebank. The core code is sst.py, which includes a flexible experimental framework. All the PyTorch classifiers are put to use as well: torch_shallow_neural_network.py, torch_rnn_classifier.py, and torch_tree_nn.py.

finetuning.ipynb

This is now considered background material for the course.

Using pretrained parameters from Hugging Face for featurization and fine-tuning.

utils.py

Miscellaneous core functions used throughout the code.

test/

To run these tests, use

py.test -vv test/*

or, for just the tests in test_shallow_neural_classifiers.py,

py.test -vv test/test_shallow_neural_classifiers.py

If the above commands don't work, try

python3 -m pytest -vv test/test_shallow_neural_classifiers.py

License

The materials in this repo are licensed under the Apache 2.0 license and a Creative Commons Attribution-ShareAlike 4.0 International license.


cs224u's Issues

minor error in sanity check test_op_unigrams_phi

in hw2_sst.ipynb

I think you meant `amazing` instead of `enlightening` in `expected`.
(BTW: Thank you so much for your amazing work)

def test_op_unigrams_phi(func):
    tree = Tree.fromstring("""(4 (2 NLU) (4 (2 is) (4 amazing)))""")
    expected = {"enlightening": 1}
    result = func(tree)
    assert result == expected, \
        ("Error for `op_unigrams_phi`: "
         "Got `{}` which differs from `expected` "
         "in `test_op_unigrams_phi`".format(result))

hw_colors - Decoder forward method requiring lengths

Hi,

I was working on the hw_colors notebook to create the updated EncoderDecoder model following the instructions. I thought I got all my updates correct until I received an error in the last test.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-217-f204130de1b0> in <module>
----> 1 test_full_system(ColorizedInputDescriber)

<ipython-input-216-df38441535ee> in test_full_system(describer_class)
      8     toy_mod = describer_class(toy_vocab)
      9 
---> 10     _ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)
     11 
     12     acc = toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

~/Downloads/Stanford-CS224U/codebase/torch_model_base.py in fit(self, *args)
    359                 y_batch = batch[-1]
    360 
--> 361                 batch_preds = self.model(*X_batch)
    362 
    363                 err = self.loss(batch_preds, y_batch)

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

<ipython-input-214-6cb1cf97c969> in forward(self, color_seqs, word_seqs, seq_lengths, hidden, targets)
     18         output, hidden = self.decoder(
     19             word_seqs,
---> 20             target_colors=color_seqs[:,-1,:])
     21 
     22         # Your decoder will return `output, hidden` pairs; the

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/Downloads/Stanford-CS224U/codebase/torch_color_describer.py in forward(self, word_seqs, seq_lengths, hidden, target_colors)
    232                 batch_first=True,
    233                 lengths=seq_lengths,
--> 234                 enforce_sorted=False)
    235             # RNN forward:
    236             output, hidden = self.rnn(embs, hidden)

/usr/local/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first, enforce_sorted)
    232                       'the trace incorrect for any other combination of lengths.',
    233                       stacklevel=2)
--> 234     lengths = torch.as_tensor(lengths, dtype=torch.int64)
    235     if enforce_sorted:
    236         sorted_indices = None

TypeError: an integer is required (got type NoneType)

I was having a hard time understanding this error, because I thought we didn't need to pass hidden and seq_lengths in the forward step.

I am defining my decoder call as:

output, hidden = self.decoder(
            word_seqs,
            target_colors=color_seqs[:,-1,:])

Is it a mistake in my code or do I have to include an additional length variable?
I know this is part of the homework and not necessarily related to the code base, but any help would be appreciated.

Thank you
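
Judging from the traceback, pack_padded_sequence received lengths=None, so one plausible fix (an untested sketch, not a confirmed answer) is to forward the lengths in the decoder call:

# Hypothetical fix: pass the lengths through so the decoder can pack
# the padded batch; pack_padded_sequence fails when lengths is None.
output, hidden = self.decoder(
    word_seqs,
    seq_lengths=seq_lengths,
    target_colors=color_seqs[:, -1, :])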

data.zip, the data used in this course, cannot be unpacked

I downloaded data.zip from the "course data" link in this notebook: https://github.com/cgpotts/cs224u/blob/spring-2019/vsm_01_distributional.ipynb

System:
macOS High Sierra 10.13.6
java 12.0.2 2019-07-16
Java(TM) SE Runtime Environment (build 12.0.2+10)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

I run jar xvf data.zip and get:

created: data/
inflated: data/.DS_Store
created: __MACOSX/
created: __MACOSX/data/
inflated: __MACOSX/data/._.DS_Store
created: data/glove.6B/
inflated: data/glove.6B/glove.6B.100d.txt
created: __MACOSX/data/glove.6B/
inflated: __MACOSX/data/glove.6B/._glove.6B.100d.txt
inflated: data/glove.6B/glove.6B.200d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.200d.txt
inflated: data/glove.6B/glove.6B.300d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.300d.txt
inflated: data/glove.6B/glove.6B.50d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.50d.txt
created: data/negotiate/
inflated: data/negotiate/data.txt
created: __MACOSX/data/negotiate/
inflated: __MACOSX/data/negotiate/._data.txt
inflated: data/negotiate/selfplay.txt
inflated: __MACOSX/data/negotiate/._selfplay.txt
inflated: data/negotiate/test.txt
inflated: __MACOSX/data/negotiate/._test.txt
inflated: data/negotiate/train.txt
inflated: __MACOSX/data/negotiate/._train.txt
inflated: data/negotiate/val.txt
inflated: __MACOSX/data/negotiate/._val.txt
inflated: __MACOSX/data/._negotiate
created: data/nlidata/
created: data/nlidata/.ipynb_checkpoints/
inflated: data/nlidata/.ipynb_checkpoints/prep_wordentail_data-checkpoint.ipynb
inflated: data/nlidata/.ipynb_checkpoints/prep_wordentail_data-Copy1-checkpoint.ipynb
created: data/nlidata/multinli_1.0/
extracted: data/nlidata/multinli_1.0/Icon
created: __MACOSX/data/nlidata/
created: __MACOSX/data/nlidata/multinli_1.0/
inflated: __MACOSX/data/nlidata/multinli_1.0/._Icon
inflated: data/nlidata/multinli_1.0/manuscript.pdf
inflated: __MACOSX/data/nlidata/multinli_1.0/._manuscript.pdf
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_matched.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_matched.jsonl
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_matched.txt
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_matched.txt
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_mismatched.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_mismatched.jsonl
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_mismatched.txt
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_mismatched.txt
inflated: data/nlidata/multinli_1.0/multinli_1.0_train.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_train.jsonl
java.io.EOFException: Unexpected end of ZLIB input stream
at java.base/java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:245)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
at java.base/java.util.zip.ZipInputStream.read(ZipInputStream.java:195)
at java.base/java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:141)
at jdk.jartool/sun.tools.jar.Main.extractFile(Main.java:1457)
at jdk.jartool/sun.tools.jar.Main.extract(Main.java:1364)
at jdk.jartool/sun.tools.jar.Main.run(Main.java:409)
at jdk.jartool/sun.tools.jar.Main.main(Main.java:1681)

I run unzip data.zip and get:

Archive: data.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data.zip or
data.zip.zip, and cannot find data.zip.ZIP, period.
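
Both tools point to a truncated archive, so the download likely needs to be repeated. A quick integrity check first (a minimal Python sketch, not part of the original report):

import zipfile

# A truncated download typically raises BadZipFile (missing central
# directory); otherwise testzip() names the first corrupt member.
try:
    with zipfile.ZipFile("data.zip") as zf:
        print("first bad member:", zf.testzip())
except zipfile.BadZipFile as err:
    print("archive is corrupt or truncated:", err)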

YouTube videos for later lectures?

Thank you for the wonderful course resources you've made available! Will the videos for the later lectures such as grounded language understanding, semantic parsing, evaluation metrics, and contextual word embeddings ever make their way to YouTube?

hw_wordentail - define_graph() error

While working on the hw_wordentail notebook, I found an error in the test_TorchDeepNeuralClassifier function.

The test aims to check whether the nn.Module returned by the TorchDeepNeuralClassifier class has been implemented correctly. However, the test extracts the graph by calling a define_graph() method, which is not implemented.

This should be changed to build_graph(). I can make the change and create a pull request if needed.
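
A minimal form of the corrected check might look like this (a hedged sketch only; the notebook's actual test and the class's constructor may differ):

import torch.nn as nn

def test_TorchDeepNeuralClassifier(model_class):
    # The original test called model.define_graph(), which is not
    # implemented; build_graph() is the method the class defines.
    model = model_class()
    graph = model.build_graph()
    assert isinstance(graph, nn.Module)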

Trouble with hw_wordsim: Combining PPMI and LSA

In the second homework question for hw_wordsim, we are provided with the following instructions:

Gigaword with LSA at different dimensions [0.5 points]
We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyperparameter k – the dimensionality of the final representations – that will impact performance. For this problem, write a wrapper function run_ppmi_lsa_pipeline that does the following:

1. Takes as input a count pd.DataFrame and an LSA parameter k.
2. Reweights the count matrix with PPMI.
3. Applies LSA with dimensionality k.
4. Evaluates this reweighted matrix using full_word_similarity_evaluation. The return value of run_ppmi_lsa_pipeline should be the return value of this call to full_word_similarity_evaluation.
The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem.

The function test_run_ppmi_lsa_pipeline will test your function on the count matrix in data/vsmdata/giga_window20-flat.csv.gz.

When I construct run_ppmi_lsa_pipeline with the following steps, I get the wrong output:

  1. Reweight input count_df using: ppmi_df = vsm.pmi(count_df, positive=True)
  2. Perform LSA using: vsm.lsa(ppmi_df, k)
  3. Calculate similarities

This leads to a similarity evaluation for men of 0.57, not the expected 0.16.

However, when I construct run_ppmi_lsa_pipeline using the following steps (without applying PPMI), I get the correct output:

  1. Perform LSA using vsm.lsa(count_df, k)
  2. Calculate similarities

This leads to the expected similarity for men of 0.16.

Is there a mistake in the instructions/expected results? Perhaps I did something wrong? Any and all help would be appreciated.
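
For reference, here is a direct transcription of the four steps into code (a sketch; vsm.pmi and vsm.lsa are the calls used above, and full_word_similarity_evaluation is defined in the homework notebook, not imported here):

import vsm  # course module providing pmi and lsa

def run_ppmi_lsa_pipeline(count_df, k):
    # Step 2: reweight the raw counts with Positive PMI.
    ppmi_df = vsm.pmi(count_df, positive=True)
    # Step 3: reduce the reweighted matrix to k dimensions with LSA.
    lsa_df = vsm.lsa(ppmi_df, k=k)
    # Step 4: evaluate the reduced matrix.
    return full_word_similarity_evaluation(lsa_df)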

Add environment variable check

The autograder used in XCS224U, when converting the notebook hw_colors.ipynb to a Python script, adds a get_ipython() call to the script. This won't work unless you import it: from IPython import get_ipython. So it would be good to guard those lines of code accordingly; a sketch follows.

I already have permission to make changes to the repo; just wanted to run it by you before making changes and committing. Thanks @cgpotts

super minor : link correction

For the main interface, we can just subclass TorchRNNClassifier and change the build_graph method to use TorchVecAvgModel. (For more details on the code and logic here, see the notebook: tutorial_torch_models.ipynb)

This should be:

tutorial_pytorch_models.ipynb

Typo in hw_wordentail.ipynb

I wanted to create a PR, but I do not have permission to create a branch.

Minor thing, but could improve understanding. In hw_wordentail.ipynb:

"That is, if a word w appears in a training pair, it does not occur in any text pair. "

should read as:

"That is, if a word w appears in a training pair, it does not occur in any test pair. "

Consider using os.environ.get() instead of test ENV presence

The code base consistently uses the `in` idiom to test environment variable presence as a flag, like

if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass

To make this condition true, one has to unset IS_GRADESCOPE_ENV. Flipping the value has no effect: IS_GRADESCOPE_ENV=0 and IS_GRADESCOPE_ENV=1 behave identically.

However, os.environ.get() is usually the better alternative in production systems. For example:

if not os.environ.get('IS_GRADESCOPE_ENV', False):
    pass

# or
if not os.environ.get('IS_GRADESCOPE_ENV', None):
    pass

# or strictly checking against pre-defined value
if not os.environ.get('IS_GRADESCOPE_ENV', '0') == '1':
    pass

In shell scripts, we usually test whether an environment variable flag is on by checking that it is present and non-empty (i.e., [[ ! -z "$VAR" ]]). With os.environ.get(), each of the following is evaluated as false, as expected, matching that convention even if users forget to unset the variable:

IS_GRADESCOPE_ENV=
unset IS_GRADESCOPE_ENV
export IS_GRADESCOPE_ENV=''

Just a minor issue, feel free to ignore. 😄

Average of the context vector in lecture "Contextual Word Representation"

Thank you for the great course! The course lectures and the other materials are really valuable to learn more about NLU.

I am not an enrolled student, but I've decided to ask a minor question here about the first lecture on "Contextual Word Representation".

In slide 5 (https://web.stanford.edu/class/cs224u/slides/cs224u-contextualreps-part1-handout.pdf), the "context vector" is evaluated as $\kappa = \text{mean}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$.

My question: is it really necessary to do the "mean" operation instead of a "sum"?

The attention weights $\alpha_n$ are already from a softmax. The term $\text{sum}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$ would be a "weighted average" of the hidden states.
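
In symbols, restating that premise: since the weights come from a softmax, $\sum_n \alpha_n = 1$, so $\kappa_{\text{sum}} = \sum_n \alpha_n h_n$ is already a convex combination of the hidden states, and $\kappa_{\text{mean}} = \frac{1}{N}\sum_n \alpha_n h_n = \kappa_{\text{sum}}/N$ only rescales it by the sequence length $N$.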

What I see often is to scale the dot products (before the softmax) $h_C^\top h_n$ by $1/\sqrt{d_k}$, where $d_k$ is the vector dimension, to normalize the variance (and get better results), as presented in the paper "Attention Is All You Need".

Thanks again!

add expected score for `run_knn_score_model`?

Hi @cgpotts,

I've just done all of the assignments (except the original system) in hw_wordrelatedness.ipynb and related notebooks and have a couple of thoughts that may be useful to share.

  • First, it is great that you have open-sourced the class material. Thank you!
  • I noticed that in the Learned distance functions section, the function run_knn_score_model that students are asked to write is not tested at all. It could be helpful to check the output score so that students know they have written it correctly. You could make train_test_split deterministic by setting shuffle=False (see the sketch after this list). Another option would be to add a note about what approximate score one should expect.
  • In vsm_01_distributional.ipynb your proper_cosine function looks to be returning the angular distance. Calling it proper_cosine may be confusing to some unless that's a standard name for it.
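
A sketch of the deterministic split (standard scikit-learn usage, with toy placeholder data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10)                  # toy targets

# shuffle=False makes the split (and hence the test score) reproducible:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
# Alternatively, keep shuffling but pin the seed:
# train_test_split(X, y, test_size=0.2, random_state=42)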

Looking forward to going through the next notebooks!

torch_model_base.py error in collate_fn

Hi,

I have been trying to run the code in the notebooks. However, whenever I run code that uses torch_model_base, I get an error when I fit the model. How can I resolve this issue?

Thank you,

Joey

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-77ee602003f0> in <module>
      1 giga_ae = TorchAutoencoder(max_iter=1000,
      2                           hidden_dim=100,
----> 3                           eta=0.03).fit(giga5_svd500)

~/Downloads/Stanford-CS224U/codebase/torch_autoencoder.py in fit(self, X)
    124 
    125         """
--> 126         super().fit(X, X)
    127         # Hidden representations:
    128         with torch.no_grad():

~/Downloads/Stanford-CS224U/codebase/torch_model_base.py in fit(self, *args)
    351             epoch_error = 0.0
    352 
--> 353             for batch_num, batch in enumerate(dataloader, start=1):
    354 
    355                 batch = [x.to(self.device, non_blocking=True) for x in batch]

/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

TypeError: 'NoneType' object is not callable
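
For what it's worth, the traceback shows the DataLoader's collate_fn is None; in older PyTorch versions, an explicit collate_fn=None overrode the default batching, so a version mismatch between the codebase and the installed PyTorch is a likely cause, and reinstalling per setup.ipynb may be the real fix. A hedged sketch of a defensive construction:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))

# Hypothetical guard: only pass collate_fn when the dataset actually
# provides one, so DataLoader's default batching is never overridden
# with None.
kwargs = {}
if getattr(dataset, "collate_fn", None) is not None:
    kwargs["collate_fn"] = dataset.collate_fn
dataloader = DataLoader(dataset, batch_size=4, **kwargs)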

Course setup, Pytorch CPU

Hi - I've followed the instructions to set up an environment for the course on my machine. The only difference I am aware of is that miniconda was pre-installed.
When following the instructions, the version of PyTorch installed was CPU-only. To fix that, I ran the following command:

conda install pytorch=1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

This seems to have resolved it for me.
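
To confirm the reinstall picked up a CUDA build (standard PyTorch check):

import torch

print(torch.__version__)          # CPU-only builds are often tagged "+cpu"
print(torch.cuda.is_available())  # True once a CUDA build is active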
