
CS224u: Natural Language Understanding

Code for the Stanford course.

Spring 2023

Christopher Potts

Core components

setup.ipynb

Details on how to get set up to work with this code.

hw_*.ipynb

The set of homeworks for the current run of the course.

tutorial_* notebooks

Introductions to Jupyter notebooks, scientific computing with NumPy and friends, and PyTorch.

torch_*.py modules

A generic optimization class (torch_model_base.py) and subclasses for GloVe, Autoencoders, shallow neural classifiers, RNN classifiers, tree-structured networks, and grounded natural language generation.

tutorial_pytorch_models.ipynb shows how to use these modules as a general framework for creating original systems.
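
As a rough sketch of that pattern (a hypothetical example, not a class from the repo; the full set of hooks, including dataset construction and loss, is covered in the tutorial notebook):

import torch.nn as nn
from torch_model_base import TorchModelBase

class MyLinearClassifier(TorchModelBase):
    """Hypothetical illustration of the subclassing pattern."""

    def __init__(self, input_dim, n_classes, **base_kwargs):
        self.input_dim = input_dim
        self.n_classes = n_classes
        super().__init__(**base_kwargs)  # batch size, optimizer, etc.

    def build_graph(self):
        # The base class's fit loop optimizes whatever nn.Module
        # this returns.
        return nn.Linear(self.input_dim, self.n_classes)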

evaluation_*.ipynb and projects.md

Notebooks covering key experimental methods and practical considerations, and tips on writing up and presenting work in the field.

iit* and feature_attribution.ipynb

Part of our unit on explainability and model analysis.

np_*.py modules

This is now considered background material for the course.

Reference implementations for the torch_*.py models, designed to reveal more about how the optimization process works.

vsm_*

This is now considered background material for the course.

A unit on vector space models of meaning, covering traditional methods like PMI and LSA as well as newer methods like Autoencoders and GloVe. vsm.py provides a lot of the core functionality, and torch_glove.py and torch_autoencoder.py are the learned models that we cover. vsm_03_contextualreps.ipynb explores methods for deriving static representations from contextual models.

sst_*

This is now considered background material for the course.

A unit on sentiment analysis with the English Stanford Sentiment Treebank. The core code is sst.py, which includes a flexible experimental framework. All the PyTorch classifiers are put to use as well: torch_shallow_neural_network.py, torch_rnn_classifier.py, and torch_tree_nn.py.

finetuning.ipynb

This is now considered background material for the course.

Using pretrained parameters from Hugging Face for featurization and fine-tuning.

utils.py

Miscellaneous core functions used throughout the code.

test/

To run these tests, use

py.test -vv test/*

or, for just the tests in test_shallow_neural_classifiers.py,

py.test -vv test/test_shallow_neural_classifiers.py

If the above commands don't work, try

python3 -m pytest -vv test/test_shallow_neural_classifiers.py

License

The materials in this repo are licensed under the Apache 2.0 license and a Creative Commons Attribution-ShareAlike 4.0 International license.


cs224u's Issues

minor error in sanity check test_op_unigrams_phi

in hw2_sst.ipynb

I think you meant `amazing` instead of `enlightening` in `expected`.
(BTW: Thank you so much for your amazing work)

def test_op_unigrams_phi(func):
    tree = Tree.fromstring("""(4 (2 NLU) (4 (2 is) (4 amazing)))""")
    expected = {"enlightening": 1}
    result = func(tree)
    assert result == expected, \
        ("Error for `op_unigrams_phi`: "
         "Got `{}` which differs from `expected` "
         "in `test_op_unigrams_phi`".format(result))

hw_colors - Decoder forward method requiring lengths

Hi,

I was working on the hw_colors notebook to create the updated EncoderDecoder model following the instructions. I thought I got all my updates correct until I received an error in the last test.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-217-f204130de1b0> in <module>
----> 1 test_full_system(ColorizedInputDescriber)

<ipython-input-216-df38441535ee> in test_full_system(describer_class)
      8     toy_mod = describer_class(toy_vocab)
      9 
---> 10     _ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)
     11 
     12     acc = toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

~/Downloads/Stanford-CS224U/codebase/torch_model_base.py in fit(self, *args)
    359                 y_batch = batch[-1]
    360 
--> 361                 batch_preds = self.model(*X_batch)
    362 
    363                 err = self.loss(batch_preds, y_batch)

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

<ipython-input-214-6cb1cf97c969> in forward(self, color_seqs, word_seqs, seq_lengths, hidden, targets)
     18         output, hidden = self.decoder(
     19             word_seqs,
---> 20             target_colors=color_seqs[:,-1,:])
     21 
     22         # Your decoder will return `output, hidden` pairs; the

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/Downloads/Stanford-CS224U/codebase/torch_color_describer.py in forward(self, word_seqs, seq_lengths, hidden, target_colors)
    232                 batch_first=True,
    233                 lengths=seq_lengths,
--> 234                 enforce_sorted=False)
    235             # RNN forward:
    236             output, hidden = self.rnn(embs, hidden)

/usr/local/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first, enforce_sorted)
    232                       'the trace incorrect for any other combination of lengths.',
    233                       stacklevel=2)
--> 234     lengths = torch.as_tensor(lengths, dtype=torch.int64)
    235     if enforce_sorted:
    236         sorted_indices = None

TypeError: an integer is required (got type NoneType)

I was having a hard time understanding this error, because I thought we didn't need to pass hidden and seq_lengths in the forward step.

I am defining my decoder call as:

output, hidden = self.decoder(
            word_seqs,
            target_colors=color_seqs[:,-1,:])

Is it a mistake in my code or do I have to include an additional length variable?
I know this is part of the homework and not necessarily related to the code base, but any help would be appreciated.

Thank you
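
Judging from the traceback, pack_padded_sequence received lengths=None, so one plausible fix (an untested sketch, not a confirmed answer) is to forward the lengths in the decoder call:

# Hypothetical fix: pass the lengths through so the decoder can pack
# the padded batch; pack_padded_sequence fails when lengths is None.
output, hidden = self.decoder(
    word_seqs,
    seq_lengths=seq_lengths,
    target_colors=color_seqs[:, -1, :])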

data.zip, the data used in this course, cannot be unpacked

I downloaded data.zip from the "course data" link in this notebook: https://github.com/cgpotts/cs224u/blob/spring-2019/vsm_01_distributional.ipynb

System:
macOS High Sierra 10.13.6
java 12.0.2 2019-07-16
Java(TM) SE Runtime Environment (build 12.0.2+10)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

I run jar xvf data.zip and get:

created: data/
inflated: data/.DS_Store
created: __MACOSX/
created: __MACOSX/data/
inflated: __MACOSX/data/._.DS_Store
created: data/glove.6B/
inflated: data/glove.6B/glove.6B.100d.txt
created: __MACOSX/data/glove.6B/
inflated: __MACOSX/data/glove.6B/._glove.6B.100d.txt
inflated: data/glove.6B/glove.6B.200d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.200d.txt
inflated: data/glove.6B/glove.6B.300d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.300d.txt
inflated: data/glove.6B/glove.6B.50d.txt
inflated: __MACOSX/data/glove.6B/._glove.6B.50d.txt
created: data/negotiate/
inflated: data/negotiate/data.txt
created: __MACOSX/data/negotiate/
inflated: __MACOSX/data/negotiate/._data.txt
inflated: data/negotiate/selfplay.txt
inflated: __MACOSX/data/negotiate/._selfplay.txt
inflated: data/negotiate/test.txt
inflated: __MACOSX/data/negotiate/._test.txt
inflated: data/negotiate/train.txt
inflated: __MACOSX/data/negotiate/._train.txt
inflated: data/negotiate/val.txt
inflated: __MACOSX/data/negotiate/._val.txt
inflated: __MACOSX/data/._negotiate
created: data/nlidata/
created: data/nlidata/.ipynb_checkpoints/
inflated: data/nlidata/.ipynb_checkpoints/prep_wordentail_data-checkpoint.ipynb
inflated: data/nlidata/.ipynb_checkpoints/prep_wordentail_data-Copy1-checkpoint.ipynb
created: data/nlidata/multinli_1.0/
extracted: data/nlidata/multinli_1.0/Icon
created: __MACOSX/data/nlidata/
created: __MACOSX/data/nlidata/multinli_1.0/
inflated: __MACOSX/data/nlidata/multinli_1.0/._Icon
inflated: data/nlidata/multinli_1.0/manuscript.pdf
inflated: __MACOSX/data/nlidata/multinli_1.0/._manuscript.pdf
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_matched.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_matched.jsonl
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_matched.txt
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_matched.txt
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_mismatched.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_mismatched.jsonl
inflated: data/nlidata/multinli_1.0/multinli_1.0_dev_mismatched.txt
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_dev_mismatched.txt
inflated: data/nlidata/multinli_1.0/multinli_1.0_train.jsonl
inflated: __MACOSX/data/nlidata/multinli_1.0/._multinli_1.0_train.jsonl
java.io.EOFException: Unexpected end of ZLIB input stream
at java.base/java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:245)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
at java.base/java.util.zip.ZipInputStream.read(ZipInputStream.java:195)
at java.base/java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:141)
at jdk.jartool/sun.tools.jar.Main.extractFile(Main.java:1457)
at jdk.jartool/sun.tools.jar.Main.extract(Main.java:1364)
at jdk.jartool/sun.tools.jar.Main.run(Main.java:409)
at jdk.jartool/sun.tools.jar.Main.main(Main.java:1681)

I run unzip data.zip and get:

Archive: data.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data.zip or
data.zip.zip, and cannot find data.zip.ZIP, period.
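
Both tools point to a truncated archive, so the download likely needs to be repeated. A quick integrity check first (a minimal Python sketch, not part of the original report):

import zipfile

# A truncated download typically raises BadZipFile (missing central
# directory); otherwise testzip() names the first corrupt member.
try:
    with zipfile.ZipFile("data.zip") as zf:
        print("first bad member:", zf.testzip())
except zipfile.BadZipFile as err:
    print("archive is corrupt or truncated:", err)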

YouTube videos for later lectures?

Thank you for the wonderful course resources you've made available! Will the videos for the later lectures such as grounded language understanding, semantic parsing, evaluation metrics, and contextual word embeddings ever make their way to YouTube?

hw_wordentail - define_graph() error

While working on the hw_wordentail notebook, I found an error in the test_TorchDeepNeuralClassifier function.

The test aims to check whether the nn.Module returned by the TorchDeepNeuralClassifier class has been implemented correctly. However, the test extracts the graph by calling a define_graph() method, which is not implemented.

This should be changed to build_graph(). I can make the change and create a pull request if needed.
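
A minimal form of the corrected check might look like this (a hedged sketch only; the notebook's actual test and the class's constructor may differ):

import torch.nn as nn

def test_TorchDeepNeuralClassifier(model_class):
    # The original test called model.define_graph(), which is not
    # implemented; build_graph() is the method the class defines.
    model = model_class()
    graph = model.build_graph()
    assert isinstance(graph, nn.Module)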

Trouble with hw_wordsim: Combining PPMI and LSA

In the second homework question for hw_wordsim, we are provided with the following instructions:

Gigaword with LSA at different dimensions [0.5 points]
We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyperparameter k – the dimensionality of the final representations – that will impact performance. For this problem, write a wrapper function run_ppmi_lsa_pipeline that does the following:

1. Takes as input a count pd.DataFrame and an LSA parameter k.
2. Reweights the count matrix with PPMI.
3. Applies LSA with dimensionality k.
4. Evaluates this reweighted matrix using full_word_similarity_evaluation. The return value of run_ppmi_lsa_pipeline should be the return value of this call to full_word_similarity_evaluation.
The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem.

The function test_run_ppmi_lsa_pipeline will test your function on the count matrix in data/vsmdata/giga_window20-flat.csv.gz.

When I construct run_ppmi_lsa_pipeline with the following steps, I get the wrong output:

  1. Reweight input count_df using: ppmi_df = vsm.pmi(count_df, positive=True)
  2. Perform LSA using: vsm.lsa(ppmi_df, k)
  3. Calculate similarities

This leads to a similarity evaluation for men of 0.57, not the expected 0.16.

However, when I construct run_ppmi_lsa_pipeline using the following steps (without applying PPMI), I get the correct output:

  1. Perform LSA using vsm.lsa(count_df, k)
  2. Calculate similarities

This leads to the expected similarity for men of 0.16.

Is there a mistake in the instructions/expected results? Perhaps I did something wrong? Any and all help would be appreciated.
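
For reference, here is a direct transcription of the four steps into code (a sketch; vsm.pmi and vsm.lsa are the calls used above, and full_word_similarity_evaluation is defined in the homework notebook, not imported here):

import vsm  # course module providing pmi and lsa

def run_ppmi_lsa_pipeline(count_df, k):
    # Step 2: reweight the raw counts with Positive PMI.
    ppmi_df = vsm.pmi(count_df, positive=True)
    # Step 3: reduce the reweighted matrix to k dimensions with LSA.
    lsa_df = vsm.lsa(ppmi_df, k=k)
    # Step 4: evaluate the reduced matrix.
    return full_word_similarity_evaluation(lsa_df)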

Add environment variable check

The autograder used in XCS224U, when converting the notebook hw_colors.ipynb to a Python script, adds a get_ipython() call to the script. This won't work unless you import it: from IPython import get_ipython. So it would be good to guard those lines of code accordingly; a sketch follows.

I already have permission to make changes to the repo; just wanted to run it by you before making changes and committing. Thanks @cgpotts

super minor : link correction

For the main interface, we can just subclass TorchRNNClassifier and change the build_graph method to use TorchVecAvgModel. (For more details on the code and logic here, see the notebook: tutorial_torch_models.ipynb)

This should be:

tutorial_pytorch_models.ipynb

Typo in hw_wordentail.ipynb

I wanted to create a PR, but I do not have permission to create a branch.

Minor thing, but could improve understanding. In hw_wordentail.ipynb:

"That is, if a word w appears in a training pair, it does not occur in any text pair. "

should read as:

"That is, if a word w appears in a training pair, it does not occur in any test pair. "

Consider using os.environ.get() instead of test ENV presence

The code base consistently uses the `in` idiom to test environment variable presence as a flag, like

if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass

To make this condition true, one has to unset IS_GRADESCOPE_ENV. Flipping the value has no effect: IS_GRADESCOPE_ENV=0 and IS_GRADESCOPE_ENV=1 behave identically.

However, os.environ.get() is usually the better alternative in production systems. For example:

if not os.environ.get('IS_GRADESCOPE_ENV', False):
    pass

# or
if not os.environ.get('IS_GRADESCOPE_ENV', None):
    pass

# or strictly checking against pre-defined value
if not os.environ.get('IS_GRADESCOPE_ENV', '0') == '1':
    pass

In shell scripts, we usually test whether an environment variable flag is on by checking that it is present and non-empty (i.e., [[ ! -z "$VAR" ]]). With os.environ.get(), each of the following is evaluated as false, as expected, matching that convention even if users forget to unset the variable:

IS_GRADESCOPE_ENV=
unset IS_GRADESCOPE_ENV
export IS_GRADESCOPE_ENV=''

Just a minor issue, feel free to ignore. 😄

Average of the context vector in lecture "Contextual Word Representation"

Thank you for the great course! The course lectures and the other materials are really valuable to learn more about NLU.

I am not an enrolled student, but I've decided to ask a minor question here about the first lecture on "Contextual Word Representation".

In slide 5 (https://web.stanford.edu/class/cs224u/slides/cs224u-contextualreps-part1-handout.pdf), the "context vector" is evaluated as $\kappa = \text{mean}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$.

My question: is it really necessary to do the "mean" operation instead of a "sum"?

The attention weights $\alpha_n$ are already from a softmax. The term $\text{sum}([\alpha_1 h_1, \alpha_2 h_2, \alpha_3 h_3])$ would be a "weighted average" of the hidden states.
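
In symbols, restating that premise: since the weights come from a softmax, $\sum_n \alpha_n = 1$, so $\kappa_{\text{sum}} = \sum_n \alpha_n h_n$ is already a convex combination of the hidden states, and $\kappa_{\text{mean}} = \frac{1}{N}\sum_n \alpha_n h_n = \kappa_{\text{sum}}/N$ only rescales it by the sequence length $N$.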

What I see often is to scale the dot products (before the softmax) $h_C^\top h_n$ by $1/\sqrt{d_k}$, where $d_k$ is the vector dimension, to normalize the variance (and get better results), as presented in the paper "Attention Is All You Need".

Thanks again!

add expected score for `run_knn_score_model`?

Hi @cgpotts,

I've just done all of the assignments (except the original system) in hw_wordrelatedness.ipynb and related notebooks and have a couple of thoughts that may be useful to share.

  • First, it is great that you have open-sourced the class material. Thank you!
  • I noticed that in the Learned distance functions section, the function run_knn_score_model that students are asked to write is not tested at all. It could be helpful to check the output score so that students know they have written it correctly. You could make train_test_split deterministic by setting shuffle=False (see the sketch after this list). Another option would be to add a note about what approximate score one should expect.
  • In vsm_01_distributional.ipynb your proper_cosine function looks to be returning the angular distance. Calling it proper_cosine may be confusing to some unless that's a standard name for it.
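
A sketch of the deterministic split (standard scikit-learn usage, with toy placeholder data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10)                  # toy targets

# shuffle=False makes the split (and hence the test score) reproducible:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
# Alternatively, keep shuffling but pin the seed:
# train_test_split(X, y, test_size=0.2, random_state=42)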

Looking forward to going through the next notebooks!

torch_model_base.py error in collate_fn

Hi,

I have been trying to run the code in the notebooks. However, whenever I run code that uses torch_model_base, I get an error when I fit the model. How can I resolve this issue?

Thank you,

Joey

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-77ee602003f0> in <module>
      1 giga_ae = TorchAutoencoder(max_iter=1000,
      2                           hidden_dim=100,
----> 3                           eta=0.03).fit(giga5_svd500)

~/Downloads/Stanford-CS224U/codebase/torch_autoencoder.py in fit(self, X)
    124 
    125         """
--> 126         super().fit(X, X)
    127         # Hidden representations:
    128         with torch.no_grad():

~/Downloads/Stanford-CS224U/codebase/torch_model_base.py in fit(self, *args)
    351             epoch_error = 0.0
    352 
--> 353             for batch_num, batch in enumerate(dataloader, start=1):
    354 
    355                 batch = [x.to(self.device, non_blocking=True) for x in batch]

/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

TypeError: 'NoneType' object is not callable
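
For what it's worth, the traceback shows the DataLoader's collate_fn is None; in older PyTorch versions, an explicit collate_fn=None overrode the default batching, so a version mismatch between the codebase and the installed PyTorch is a likely cause, and reinstalling per setup.ipynb may be the real fix. A hedged sketch of a defensive construction:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))

# Hypothetical guard: only pass collate_fn when the dataset actually
# provides one, so DataLoader's default batching is never overridden
# with None.
kwargs = {}
if getattr(dataset, "collate_fn", None) is not None:
    kwargs["collate_fn"] = dataset.collate_fn
dataloader = DataLoader(dataset, batch_size=4, **kwargs)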

Course setup, Pytorch CPU

Hi - I've followed the instructions to set up an environment for the course on my machine. The only difference I am aware of is that miniconda was pre-installed.
When following the instructions, the version of PyTorch installed was CPU-only. To fix that, I ran the following command:

conda install pytorch=1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

This seems to have resolved it for me.
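
To confirm the reinstall picked up a CUDA build (standard PyTorch check):

import torch

print(torch.__version__)          # CPU-only builds are often tagged "+cpu"
print(torch.cuda.is_available())  # True once a CUDA build is active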
