I found that the k-means step in mauve.compute_mauve takes quite a long time, so I implemented k-means in PyTorch with CUDA (with an API similar to sklearn's KMeans) and replaced faiss.Kmeans with it.
In my observations, the k-means time on 5000 texts dropped from 16.07–668.27s to 0.65–1.39s, with only a small change (<0.01) in the MAUVE score (running on an A100 GPU).
I think this might be useful to others, so I am sharing my implementation (i.e., compute_mauve.py) here.
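For reference, the entry point is called the same way as in the original package; a minimal usage sketch (assuming `p_text` and `q_text` are lists of strings and a GPU is available):

out = compute_mauve(p_text=p_text, q_text=q_text, device_id=0, verbose=True)
print(out.mauve)        # MAUVE score in [0, 1]
print(out.num_buckets)  # number of k-means clusters used

The full compute_mauve.py is below.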
# Author: Krishna Pillutla
# License: GPLv3
import math
import numpy as np
import time
from types import SimpleNamespace
import faiss
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.metrics import auc as compute_area_under_curve
try:
import torch
FOUND_TORCH = True
except (ImportError, ModuleNotFoundError):
FOUND_TORCH = False
try:
import transformers
FOUND_TRANSFORMERS = True
except (ImportError, ModuleNotFoundError):
FOUND_TRANSFORMERS = False
if FOUND_TORCH and FOUND_TRANSFORMERS:
# only needed for tokenizing
from .utils import get_tokenizer, get_model, featurize_tokens_from_model, get_device_from_arg
MODEL, TOKENIZER, MODEL_NAME = None, None, None
def compute_mauve(
p_features=None, q_features=None,
p_tokens=None, q_tokens=None,
p_text=None, q_text=None,
num_buckets='auto', pca_max_data=-1, kmeans_explained_var=0.9,
kmeans_num_redo=5, kmeans_max_iter=500,
featurize_model_name='gpt2-large', device_id=-1, max_text_length=1024,
divergence_curve_discretization_size=25, mauve_scaling_factor=5,
verbose=False, seed=25, batch_size=1, use_float64=False,
):
"""
Compute the MAUVE score between two text generations P and Q.
P is either specified as ``p_features``, ``p_tokens``, or ``p_text``. Same with Q.
:param ``p_features``: ``numpy.ndarray`` of shape (n, d), where n is the number of generations.
:param ``q_features``: ``numpy.ndarray`` of shape (n, d), where n is the number of generations.
:param ``p_tokens``: list of length n, each entry is torch.LongTensor of shape (1, length).
:param ``q_tokens``: list of length n, each entry is torch.LongTensor of shape (1, length).
:param ``p_text``: list of length n, each entry is a string.
:param ``q_text``: list of length n, each entry is a string.
:param ``num_buckets``: the size of the histogram to quantize P and Q. Options: ``'auto'`` (default, which is n/10) or an integer.
    :param ``pca_max_data``: the number of data points to use for PCA. If ``-1``, use all the data. Default -1.
:param ``kmeans_explained_var``: amount of variance of the data to keep in dimensionality reduction by PCA. Default 0.9.
:param ``kmeans_num_redo``: number of times to redo k-means clustering (the best objective is kept). Default 5.
Try reducing this to 1 in order to reduce running time.
:param ``kmeans_max_iter``: maximum number of k-means iterations. Default 500.
Try reducing this to 100 in order to reduce running time.
:param ``featurize_model_name``: name of the model from which features are obtained. Default 'gpt2-large'.
We support all models which can be loaded from ``transformers.AutoModel.from_pretrained(featurize_model_name)``.
:param ``device_id``: Device for featurization. Supply gpu_id (e.g. 0 or 3) to use GPU or -1 to use CPU.
:param ``max_text_length``: maximum number of tokens to consider. Default 1024.
:param ``divergence_curve_discretization_size``: Number of points to consider on the divergence curve. Default 25.
Larger values do not offer much of a difference.
    :param ``mauve_scaling_factor``: The constant ``c`` from the paper. Default 5.
See `Best Practices <index.html#best-practices-for-mauve>`_ for details.
:param ``verbose``: If True, print running time updates.
:param ``seed``: random seed to initialize k-means cluster assignments.
    :param ``batch_size``: Batch size for feature extraction.
    :param ``use_float64``: If True, run the featurization model in float64 (double precision). Default False.
:return: an object with fields p_hist, q_hist, divergence_curve and mauve.
    * ``out.mauve`` is a number between 0 and 1, the MAUVE score. Higher values mean that P is closer to Q.
* ``out.frontier_integral``, a number between 0 and 1. Lower values mean that P is closer to Q.
* ``out.p_hist`` is the obtained histogram for P. Same for ``out.q_hist``.
    * ``out.divergence_curve`` contains the points in the divergence curve. It is of shape (m, 2), where m is ``divergence_curve_discretization_size``.
    * ``out.num_buckets`` is the number of buckets (clusters) used to quantize P and Q.
"""
if p_features is None and p_tokens is None and p_text is None:
raise ValueError('Supply at least one of p_features, p_tokens, p_text')
if q_features is None and q_tokens is None and q_text is None:
raise ValueError('Supply at least one of q_features, q_tokens, q_text')
p_features = get_features_from_input(
p_features, p_tokens, p_text, featurize_model_name, max_text_length,
device_id, name="p", verbose=verbose, batch_size=batch_size, use_float64=use_float64,
)
q_features = get_features_from_input(
q_features, q_tokens, q_text, featurize_model_name, max_text_length,
device_id, name="q", verbose=verbose, batch_size=batch_size, use_float64=use_float64,
)
if num_buckets == 'auto':
# heuristic: use num_clusters = num_generations / 10
num_buckets = max(2, int(round(min(p_features.shape[0], q_features.shape[0]) / 10)))
elif not isinstance(num_buckets, int):
raise ValueError('num_buckets is expected to be an integer or "auto"')
    # Actual binning
t1 = time.time()
p, q = cluster_feats(p_features, q_features,
num_clusters=num_buckets,
norm='l2', whiten=False,
pca_max_data=pca_max_data,
explained_variance=kmeans_explained_var,
num_redo=kmeans_num_redo,
max_iter=kmeans_max_iter,
seed=seed, verbose=verbose)
t2 = time.time()
if verbose:
print('total discretization time:', round(t2-t1, 2), 'seconds')
# Divergence curve and mauve
mixture_weights = np.linspace(1e-6, 1-1e-6, divergence_curve_discretization_size)
divergence_curve = get_divergence_curve_for_multinomials(p, q, mixture_weights, mauve_scaling_factor)
x, y = divergence_curve.T
idxs1 = np.argsort(x)
idxs2 = np.argsort(y)
mauve_score = 0.5 * (
compute_area_under_curve(x[idxs1], y[idxs1]) +
compute_area_under_curve(y[idxs2], x[idxs2])
)
fi_score = get_fronter_integral(p, q)
to_return = SimpleNamespace(
p_hist=p, q_hist=q, divergence_curve=divergence_curve,
mauve=mauve_score,
frontier_integral=fi_score,
num_buckets=num_buckets,
)
return to_return
def get_features_from_input(features, tokenized_texts, texts,
featurize_model_name, max_len, device_id, name, batch_size,
verbose=False, use_float64=False):
global MODEL, TOKENIZER, MODEL_NAME
if features is None:
# Featurizing is necessary. Make sure the required packages are available
if not FOUND_TORCH:
raise ModuleNotFoundError(
"""PyTorch not found. Please install PyTorch if you would like to use the featurization.
For details, see `https://github.com/krishnap25/mauve`
and `https://pytorch.org/get-started/locally/`.
""")
if not FOUND_TRANSFORMERS:
raise ModuleNotFoundError(
"""Transformers not found. Please install Transformers if you would like to use the featurization.
For details, see `https://github.com/krishnap25/mauve`
and `https://huggingface.co/transformers/installation.html`.
""")
if tokenized_texts is None:
# tokenize texts
if TOKENIZER is None or MODEL_NAME != featurize_model_name:
if verbose: print('Loading tokenizer')
TOKENIZER = get_tokenizer(featurize_model_name)
if verbose: print('Tokenizing text...')
tokenized_texts = [
TOKENIZER.encode(sen, return_tensors='pt', truncation=True, max_length=max_len)
for sen in texts
]
# use tokenized_texts to featurize
if TOKENIZER is None or MODEL_NAME != featurize_model_name:
if verbose: print('Loading tokenizer')
TOKENIZER = get_tokenizer(featurize_model_name)
if MODEL is None or MODEL_NAME != featurize_model_name:
if verbose: print('Loading model')
MODEL = get_model(featurize_model_name, TOKENIZER, device_id)
MODEL_NAME = featurize_model_name
else:
MODEL = MODEL.to(get_device_from_arg(device_id))
if use_float64:
MODEL = MODEL.double()
if verbose: print('Featurizing tokens')
features = featurize_tokens_from_model(MODEL, tokenized_texts, batch_size, name).detach().cpu().numpy()
else:
features = np.asarray(features)
return features
def cluster_feats(p, q, num_clusters,
norm='none', whiten=True,
pca_max_data=-1,
explained_variance=0.9,
num_redo=5, max_iter=500,
seed=0, verbose=False):
assert 0 < explained_variance < 1
if verbose:
print(f'seed = {seed}')
assert norm in ['none', 'l2', 'l1', None]
data1 = np.vstack([q, p])
if norm in ['l2', 'l1']:
data1 = normalize(data1, norm=norm, axis=1)
pca = PCA(n_components=None, whiten=whiten, random_state=seed+1)
if pca_max_data < 0 or pca_max_data >= data1.shape[0]:
pca.fit(data1)
elif 0 < pca_max_data < data1.shape[0]:
rng = np.random.RandomState(seed+5)
idxs = rng.choice(data1.shape[0], size=pca_max_data, replace=False)
pca.fit(data1[idxs])
else:
raise ValueError(f'Invalid argument pca_max_data={pca_max_data} with {data1.shape[0]} datapoints')
s = np.cumsum(pca.explained_variance_ratio_)
idx = np.argmax(s >= explained_variance) # last index to consider
if verbose:
print(f'performing clustering in lower dimension = {idx}')
data1 = pca.transform(data1)[:, :idx+1]
# Cluster
data1 = data1.astype(np.float32)
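    # The original faiss-based k-means is kept below for reference (commented out):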
"""
t1 = time.time()
print(f"k-means by faiss")
kmeans = faiss.Kmeans(data1.shape[1], num_clusters, niter=max_iter,
verbose=verbose, nredo=num_redo, update_index=True,
seed=seed+2, gpu=True)
kmeans.train(data1)
_, labels = kmeans.index.search(data1, 1)
labels = labels.reshape(-1)
t2 = time.time()
if verbose:
print('kmeans time:', round(t2-t1, 2), 's') # 668.27 s
"""
t1 = time.time()
print(f"k-means by me")
kmeans = KMeans(n_clusters=num_clusters, n_init=num_redo, max_iter=max_iter)
clusters = kmeans.fit(torch.from_numpy(data1).cuda())
labels = [None] * data1.shape[0]
for i, cluster in enumerate(clusters):
for index in cluster:
labels[index] = i
labels = np.array(labels)
t2 = time.time()
if verbose:
print('kmeans time:', round(t2-t1, 2), 's') # 0.65 s
q_labels = labels[:len(q)]
p_labels = labels[len(q):]
q_bins = np.histogram(q_labels, bins=num_clusters,
range=[0, num_clusters], density=True)[0]
p_bins = np.histogram(p_labels, bins=num_clusters,
range=[0, num_clusters], density=True)[0]
return p_bins / p_bins.sum(), q_bins / q_bins.sum()
def kl_multinomial(p, q):
assert p.shape == q.shape
if np.logical_and(p != 0, q == 0).any():
return np.inf
else:
idxs = np.logical_and(p != 0, q != 0)
return np.sum(p[idxs] * np.log(p[idxs] / q[idxs]))
def get_divergence_curve_for_multinomials(p, q, mixture_weights, scaling_factor):
# TODO: check if extreme points are needed
divergence_curve = [[0, np.inf]] # extreme point
for w in np.sort(mixture_weights):
r = w * p + (1 - w) * q
divergence_curve.append([kl_multinomial(q, r), kl_multinomial(p, r)])
divergence_curve.append([np.inf, 0]) # other extreme point
return np.exp(-scaling_factor * np.asarray(divergence_curve))
def get_fronter_integral(p, q, scaling_factor=2):
total = 0.0
for p1, q1 in zip(p, q):
if p1 == 0 and q1 == 0:
pass
elif p1 == 0:
total += q1 / 4
elif q1 == 0:
total += p1 / 4
elif abs(p1 - q1) > 1e-8:
t1 = p1 + q1
t2 = p1 * q1 * (math.log(p1) - math.log(q1)) / (p1 - q1)
total += 0.25 * t1 - 0.5 * t2
# else: contribution is 0
return total * scaling_factor
# k-means through pytorch on cuda
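# Note: this is spherical k-means: inputs are L2-normalized, the distance to a center is
# 1 - cosine similarity, and the centers are re-normalized after every update step.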
class KMeans:
def __init__(self, n_clusters, n_init=10, max_iter=300, min_variation=1e-3, device="cuda"):
self.n_clusters = n_clusters
self.n_init = n_init
self.max_iter = max_iter
self.min_variation = min_variation
self.device = device
def fit(self, encodings):
        if self.n_clusters < 1:
            raise ValueError(f"the number of clusters should be >= 1, but got {self.n_clusters}")
if self.n_clusters==1:
clusters = [list(range(encodings.shape[0]))]
else:
best_group_index, min_loss = None, 1e10
for init in range(self.n_init):
loss = np.nan
while np.isnan(loss):
group_index, loss = self.fit_once(encodings, init)
                if loss < min_loss:
                    best_group_index, min_loss = group_index, loss
clusters = [[] for _ in range(self.n_clusters)]
for i,index in enumerate(best_group_index.tolist()):
clusters[index].append(i)
return clusters
@torch.no_grad()
def fit_once(self, encodings, init):
encodings = torch.nn.functional.normalize(encodings.to(self.device), dim=-1)
unique_encodings = torch.unique(encodings, dim=0)
if unique_encodings.shape[0]<self.n_clusters:
self.n_clusters = unique_encodings.shape[0]
centers = None
ceil = torch.Tensor([1.0]).to(self.device)
for i,idx in enumerate(torch.randperm(unique_encodings.shape[0]).tolist()):
if i==0:
centers = unique_encodings[idx].unsqueeze(0)
continue
new_center = unique_encodings[idx].unsqueeze(0)
if not torch.isclose(torch.mm(centers, new_center.T).max(), ceil):
centers = torch.cat([centers, new_center], dim=0)
if centers.shape[0]==self.n_clusters:
break
from tqdm import tqdm
with tqdm(total=self.max_iter, desc=f"KMeans ({init+1}/{self.n_init})") as bar:
for iter_step in range(self.max_iter):
old_centers = centers
group_index, loss = self.group_points(centers, encodings)
centers = self.update_centers(group_index, encodings, old_centers)
centers_max_movement = ((old_centers-centers)**2).sum(dim=-1).max().item()
bar.set_description(f"KMeans ({init+1}/{self.n_init}): "
f"loss: {loss:.3f} | "
f"movement: {centers_max_movement:.3f}")
if centers_max_movement < self.min_variation or np.isnan(loss):
bar.total = iter_step + 1
bar.update(1)
break
else:
bar.update(1)
return group_index, loss
@torch.no_grad()
def group_points(self, centers, encodings, capacity=int(1e10)):
# centers: [n_clusters, hs]
# encodings: [N, hs]
split_len = capacity // (encodings.shape[0] * centers.shape[0])
        split_num = math.ceil(encodings.shape[0] / split_len)
group_index = []
loss = 0.0
for i in range(split_num):
split_encodings = encodings[i*split_len:(i+1)*split_len, :]
split_distances = 1 - torch.mm(split_encodings, centers.T)
group_index.append(split_distances.argmin(dim=-1).detach())
loss += split_distances.min(dim=-1).values.sum().item()
group_index = torch.cat(group_index, dim=0)
        return group_index, loss  # group_index: [N], loss: float
@torch.no_grad()
def update_centers(self, group_index, encodings, old_centers):
#sum_vec = torch.zeros([self.n_clusters, encodings.shape[1]],
# dtype=encodings.dtype, device=self.device)
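        # Starting from old_centers * 1e-6 (with count_vec initialized to 1e-6) means an empty
        # cluster keeps its previous center after the division and normalization below,
        # rather than collapsing to a zero vector.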
sum_vec = old_centers * 1e-6
count_vec = torch.zeros([self.n_clusters], device=self.device) + 1e-6
        index = group_index.unsqueeze(1).repeat(1, encodings.shape[1]) # [N, hs]
sum_vec = sum_vec.scatter_add_(dim=0, index=index, src=encodings)
count_vec = count_vec.scatter_add_(dim=0, index=group_index, src=torch.ones_like(group_index).float())
mean_vec = sum_vec.div_(count_vec.unsqueeze(1))
centers = torch.nn.functional.normalize(mean_vec, dim=-1)
return centers
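For standalone use outside compute_mauve, note that fit returns a list of index lists (one per cluster), which is then converted to per-point labels as in cluster_feats above. A minimal sketch with a placeholder random feature matrix, assuming a CUDA device is available:

feats = np.random.randn(5000, 64).astype(np.float32)    # placeholder features
kmeans = KMeans(n_clusters=500, n_init=5, max_iter=500)
clusters = kmeans.fit(torch.from_numpy(feats).cuda())   # list of index lists, one per cluster
labels = np.empty(feats.shape[0], dtype=np.int64)
for cluster_id, indices in enumerate(clusters):
    labels[indices] = cluster_id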