jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.

Home Page: https://texthero.org

License: MIT License

Shell 0.63% Python 71.46% Dockerfile 0.07% JavaScript 14.94% CSS 12.90%
text-preprocessing text-representation text-visualization nlp word-embeddings machine-learning text-mining nlp-pipeline text-clustering texthero

texthero's Introduction


Text preprocessing, representation and visualization from zero to hero.

From zero to hero • Installation • Getting Started • Examples • API • FAQ • Contributions

From zero to hero

Texthero is a Python toolkit for working with text-based datasets quickly and effortlessly. It is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power as Pandas and is extensively documented. It is modern and conceived for programmers of the 2020s with little if any background in linguistics.

You can think of Texthero as a tool to help you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts. Given a text dataset, however, it's harder to get quick insights into the underlying data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines.

Texthero includes tools for:

  • Preprocessing text data: it offers out-of-the-box solutions but is also flexible enough for custom solutions.
  • Natural language processing: keyphrase and keyword extraction, and named entity recognition.
  • Text representation: TF-IDF, term frequency, and custom word embeddings (wip).
  • Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
  • Text visualization: vector space visualization, place localization on maps (wip).

Texthero is free, open-source and well documented (and that's what we love most by the way!).

We hope you will find as much pleasure working with Texthero as we had during its development.

Do you speak Spanish? Do you speak Hindi? Do you speak Japanese?

Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.

Now, the next major milestone is to provide multilingual support, and for this big step we need the help of all of you. Do you speak Spanish? German? Chinese? Japanese? Portuguese? Italian? Russian? If so, or if you speak another language not mentioned here, you can help us develop multilingual support! Even if you haven't contributed before or have just started with NLP, contact us or open a Github issue; there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your next job as an NLP developer!

To improve the Python toolkit and provide an even better experience, your help and feedback are crucial. If you have any problem or suggestion, please open a Github issue; we will be glad to support and help you.

Beta version

Texthero's community is growing fast. Texthero, though, is still in beta; a faster and better version will be released soon, and it will bring some major changes.

For instance, to give more granular control over the pipeline, starting from the next version all preprocessing functions will require an already-tokenized text as input. This will be a major change.

Once the stable version (Texthero 2.0) is released, backward compatibility will be respected. Until then, backward compatibility will be maintained, but more loosely.

If you want to be part of this fast-growing movement, do not hesitate to contribute: CONTRIBUTING!

Installation

Install texthero via pip:

pip install texthero

☝️ Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, spaCy and scikit-learn. You don't need to install them all separately; pip will take care of that.

For faster performance, make sure you have installed spaCy version >= 2.2. Also, make sure you have a recent version of Python: the higher, the better.

Getting started

The best way to learn Texthero is through the Getting Started docs.

In case you are an advanced Python user, then help(texthero) should do the trick.

Examples

1. Text cleaning, TF-IDF representation and Visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)   # clean the raw text
    .pipe(hero.tfidf)   # represent it as TF-IDF vectors
    .pipe(hero.pca)     # project the vectors to 2D with PCA
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

2. Text preprocessing, TF-IDF, K-means and Visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)   # clean the raw text
    .pipe(hero.tfidf)   # represent it as TF-IDF vectors
)

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)   # cluster the TF-IDF vectors
    .astype(str)
)

# Project the TF-IDF vectors to 2D for visualization
df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")

3. Simple pipeline for text cleaning

>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé    (123 /) needs to [OK!] be cleaned!   "
>>> s = pd.Series(text)
>>> s
0    This sèntencé    (123 /) needs to [OK!] be cleane...
dtype: object

Remove all digits:

>>> s = hero.remove_digits(s)
>>> s
0    This sèntencé    (  /) needs to [OK!] be cleaned!
dtype: object

remove_digits replaces only blocks of digits by default: the digits in the string "hello123" will not be removed. To remove all digits, set only_blocks to False.
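For example, a sketch of that call (the exact whitespace left in place of the digits may differ):

>>> hero.remove_digits(pd.Series(["hello123"]), only_blocks=False)
0    hello
dtype: object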

Remove all types of brackets and their content.

>>> s = hero.remove_brackets(s)
>>> s 
0    This sèntencé    needs to  be cleaned!
dtype: object

Remove diacritics.

>>> s = hero.remove_diacritics(s)
>>> s 
0    This sentence    needs to  be cleaned!
dtype: object

Remove punctuation.

>>> s = hero.remove_punctuation(s)
>>> s 
0    This sentence    needs to  be cleaned
dtype: object

Remove extra white-spaces.

>>> s = hero.remove_whitespace(s)
>>> s 
0    This sentence needs to be cleaned
dtype: object

Sometimes we also want to get rid of stop-words.

>>> s = hero.remove_stopwords(s)
>>> s
0    This sentence needs cleaned
dtype: object

API

Texthero is composed of four modules: preprocessing.py, nlp.py, representation.py and visualization.py.

1. Preprocessing

Scope: prepare text data for further analysis.

Full documentation: preprocessing

2. NLP

Scope: provide classic natural language processing tools such as named_entity and noun_phrases.

Full documentation: nlp

3. Representation

Scope: map text data into vectors and do dimensionality reduction.

Supported representation algorithms:

  1. Term frequency (count)
  2. Term frequency-inverse document frequency (tfidf)

Supported clustering algorithms:

  1. K-means (kmeans)
  2. Density-Based Spatial Clustering of Applications with Noise (dbscan)
  3. Meanshift (meanshift)

Supported dimensionality reduction algorithms:

  1. Principal component analysis (pca)
  2. t-distributed stochastic neighbor embedding (tsne)
  3. Non-negative matrix factorization (nmf)

Full documentation: representation

4. Visualization

Scope: summarize the main facts about the text data and visualize them. This module is opinionated. It's handy for anyone who needs a quick solution to visualize text data on screen, for instance during a text exploratory data analysis (EDA).

Supported functions:

  • Text scatterplot (scatterplot)
  • Most common words (top_words)

Full documentation: visualization

FAQ

Why Texthero

Sometimes we just want things done, right? Texthero helps with that. It makes things easier and gives the developer more time to focus on their custom requirements. We believe that cleaning text should take just a minute. The same goes for finding the most important parts of a text, and for representing it.

In a very pragmatic way, Texthero has just one goal: save the developer time. Working with text data can be a pain, and in most cases a default pipeline is a good starting point. There is always time to come back and improve previous work.

Contributions

"Texthero has been developed by a member of the NLP community for the whole NLP-community"

Texthero is for all of us NLP-developers and it can continue to exist with the precious contribution of the community.

Your level of expertise in Python and NLP does not matter; anyone can help and anyone is more than welcome to contribute!

Are you an NLP expert?

  • Open an issue and tell us what you like and dislike about Texthero and what we can do better!

Are you good at creating websites?

The website will soon be moved from Docusaurus to Sphinx: read the open issue there. Good news: the website will look the same as it does now :) Average news: we need to do some web development to adapt this Sphinx template to our needs. Can you help us?

Are you good at writing?

This is probably the most important piece missing from Texthero right now: more tutorials and more "Getting Started" guides.

If you are good at writing, you can help us! Why don't you start by adding a FAQ page to the website or explaining how to create a custom pipeline? Need help? We are there for you.

Are you good at Python?

There are a lot of open issues for technical contributors. Which one will you choose?

If you have any other questions or inquiries, drop me a line at jonathanbesomi__AT__gmail.com

Contributors (in chronological order)

The MIT License (MIT)

Copyright (c) 2020 Texthero

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

texthero's People

Contributors

andre-sacilotti, andrewbird2, bobfang1992, cclauss, cedricconol, henrifroese, hugoabonizio, ishanarora04, jbesomi, mk2510, parthgandhi, patrickphat, peritract, richecr, robertrosca, selimelawwa, shreyasminocha, sleeper, thewchan, vidyap-xgboost


texthero's Issues

K-means topic detection

A feature for topic detection with unsupervised learning (K-means, for example) would be very welcome!

TokenSeries as input to every representation function

One of the principles of Texthero is to give to the NLP developer more control.

Motivation

A simple example is the TfidfVectorizer object from scikit-learn. It's fast and great, but it has too many parameters, and before applying TF-IDF it actually preprocesses the text data. I just discovered that TfidfVectorizer even L2-normalizes the output and that there is no option to avoid normalization.

With Texthero's tf-idf we just want the code to apply TF-IDF. That's it. No stopword removal, no tokenization, no normalization. All these essential steps can be done by the NLP developer in the pipeline (the drawback is that it might be less efficient, but with the advantage of clear and expected behavior).

Solution

All representation functions will require the Pandas Series to be already tokenized. In the beginning, we can still accept a Text Pandas Series; in this case the default hero.tokenize function will be applied, but a warning message will be printed (see the example below).

Interested in working on this task?
For the tfidf + term_frequency function, the code has already (almost) been made. The body of the function would look like this:

# Refuse non-tokenized input (a warning may later be used instead of an error).
if type(s.iloc[0]) != list:
    raise ValueError(
        "🤔 It seems like the given Pandas Series is not tokenized. Have you tried passing the Series through `hero.tokenize(s)`?"
    )

# Identity tokenizer/preprocessor: the input is already tokenized,
# so TfidfVectorizer must not re-process it.
tfidf = TfidfVectorizer(
    use_idf=True,
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
)

If you are interested in helping out, just leave a comment!

Explain how to read text data from PDF and PowerPoint and use it with Texthero

PDFs, PowerPoint presentations and other unstructured text contain very valuable data that can be used for analysis.
There are many tools providing these features. It would be nice if we could provide a single method to read such files and not bother the user with the details.

There is a Python library, textract, that provides this functionality; unfortunately, it is not maintained.

We could provide a method, loadData or similar, with different implementations depending on the file type.

Add a flag to remove_punctuation to prevent removing punctuation in a token

Overview

During pre-processing, when we need to remove punctuation, sometimes we want to preserve punctuation inside a token. Example:

spider-man is powerful, isn't?

In this case, we might expect remove_punctuation to return:

spider-man is powerful isn't

Approach

We need to modify remove_punctuation and add a new argument, keep_tokens or remove_in_between or something like that.

For the implementation, we can either tokenize the text (see texthero.preprocessing.tokenize) and remove all "punctuation tokens" or, probably better, add a regex that drops all punctuation symbols that are not between two characters (see again the tokenize function for an example of such a regular expression). A sketch of the regex idea follows.
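A minimal sketch of that regex approach, assuming we only drop punctuation that is not flanked by word characters on both sides (the pattern is illustrative, not the final implementation):

import pandas as pd

s = pd.Series(["spider-man is powerful, isn't?"])

# Drop punctuation not preceded by a word character, or not followed by one;
# punctuation between two word characters (spider-man, isn't) survives.
pattern = r"(?<!\w)[^\w\s]+|[^\w\s]+(?!\w)"
s.str.replace(pattern, "", regex=True)
# 0    spider-man is powerful isn't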

Open question

Decide what the default behaviour should look like. It's probably better to remove all punctuation by default, but make it clear that there is the option to keep punctuation present inside tokens.

Move website from Docusaurus to Sphinx

Motivation

Initially, docusaurus.io was chosen as the tool to render Texthero's documentation on the website.
Docusaurus is great for two reasons: it's very beautiful, and it's super easy to use, as users need to write simple markdown files.
Docusaurus's main drawback is that it does not natively support Sphinx, the Python documentation generator used to create documentation from the docstrings.

As of now, the API documentation is generated through Sphinx, and a Python script (to_docusaurs.py) is used to map Sphinx's HTML files to Docusaurus's markdown files.

This solution, even if it works, is not practical, and that's why we should move Texthero to Sphinx.

Implementation

Goal: move Texthero's website from Docusaurus to Sphinx.

Sphinx theme: pydata-sphinx-theme, as it is very elegant and resembles the current website design.

The design should be adapted (not really a complex task) to match the current Texthero design almost perfectly. This basically means changing some CSS to match Texthero's colors and adapting the menu bar.

There is some extra work to do to create the homepage. The homepage will probably be hard-coded in an HTML file, without using React (JavaScript) as is the case now with Docusaurus.

Extra

Once ported on Sphinx, it will be possible to add new functionalities to the API's page. Some of the most important are:

  • Add the 'source' button to every function for easy lookup of the code
  • Add Sphinx internationalization, i.e. multilingual support. I have no idea yet how this will/should work; any help is much welcomed!
  • New ideas

Collaboration

If you've read this far: wow, congrats! If you are interested in helping with this (very important!) task, just leave a comment or, if you prefer, contact me directly: jonathanbesomi__AT__gmail.com

Colab notebook crashing while calculating PCA/K-Means. CSV file contains 80,000+ rows!

Hello,

I'm trying to visualize K-means for the dataset I have, which has 80K+ rows and 9 columns.

The notebook keeps crashing whenever I try to run this particular code:

# Add pca value to dataframe to use as visualization coordinates
df1['pca'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Add k-means cluster to dataframe
df1['kmeans'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.kmeans)
)

df1.head()

Is it because texthero can't handle that many rows yet?
Any other solution?

Add tests for (and implement when missing) using the input's index when returning a new Series.

For functions that take a pandas Series s as input and return a new pandas Series t, we have to be careful to use the index from s for the Series t as well. This allows users to seamlessly integrate the function's result and continue working with it.

An example of such a test would be the following (note that it does not test whether the function works correctly; other test cases should check that. It only checks the index):

import unittest
import pandas as pd
from texthero import nlp

# e.g. in tests/test_indexes.py
class TestIndexes(unittest.TestCase):
    def test_count_sentences_index(self):
        s = pd.Series(["Test"], index=[5])
        counted_sentences_s = nlp.count_sentences(s)
        t_same_index = pd.Series([""], index=[5])

        self.assertTrue(counted_sentences_s.index.equals(t_same_index.index))

    def test_count_sentences_wrong_index(self):
        s = pd.Series(["Test", "Test"], index=[5, 6])
        counted_sentences_s = nlp.count_sentences(s)
        t_different_index = pd.Series(["", ""], index=[5, 7])

        self.assertFalse(counted_sentences_s.index.equals(t_different_index.index))

This makes sure that the index returned by the function nlp.count_sentences is the same as the index in the input.

The tests could go into an extra file in the tests folder, e.g. test_indexes

Feature Request - Stats for Speech Corpus CSV's

Hi ,

@jbesomi This is a great initiative, and you gave it to us when we needed it most. Thanks, Hero!

I have been working on ASR, for which I deal with text corpora as well as speech corpora. I hope the following features, if added, will be helpful for people working on ASR.

Stats of a text corpus

  • count_sentence = count the number of sentences in the given column (text corpus)
  • count_words = count the number of words in the given column
  • count_unique_sentence = count the number of unique sentences in the given column
  • count_unique_words = count the number of unique words in the given column
  • count_unique_char = list and count the unique characters in the given column
  • remove_if_content = remove a cell if it contains a specific char/word
  • words_in_sentence = give the highest, lowest, and mean number of words per sentence

Stats of Speech Corpus
Since this is Texthero, I don't know if we will be working on audio data; however, here are some needs I encountered while working with DeepSpeech:

  • Get audio durations
  • Calculate the total duration
  • Similarly, calculate the mean, lowest and highest duration of the audio files, given their paths in the CSV

PS: Kindly excuse any mistakes. This is the first GitHub repo I wish to contribute to.

Add POS tagging

Under the nlp module, add part-of-speech tagging: hero.pos. Again, the solution here would be to use spaCy. This should not be particularly complex, as the code would resemble named_entities or noun_chunks.
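A minimal sketch of what hero.pos could look like, mirroring the shape of named_entities (the function signature and return format are assumptions, not a settled design):

import pandas as pd
import spacy

def pos(s: pd.Series) -> pd.Series:
    # Hypothetical sketch: one (token, coarse POS, fine-grained tag) tuple
    # per token, one list per document, index preserved from the input.
    nlp = spacy.load("en_core_web_sm")
    new_data = []
    for doc in nlp.pipe(s.values, batch_size=32):
        new_data.append([(token.text, token.pos_, token.tag_) for token in doc])
    return pd.Series(new_data, index=s.index)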

Food for thought: having separate functions allows for simple code, but it makes the pipeline quite inefficient, as every time an NLP function is called spaCy has to go through the whole corpus again. Other ideas and solutions are also welcome.

Will be implemented only after #65 is solved.

Linked PR: #57

How to contribute: CONTRIBUTE.md

My wordcloud looks ugly. Which argument to change to make it look cleaner?

Right now my word cloud looks like this:

Wordcloud

There's a lot of blur in this cloud; how can I reduce it and make it look cleaner/sharper?
I tried all the arguments provided in this document, but nothing seems to work properly.

My code:

hero.visualization.wordcloud(df1['clean_tweet'], width=200, height=200, background_color='White')

Spanish translation contribution

Hello 👋. My name is José De Freitas. I'm a native Spanish speaker, and I can help contribute to this project by translating it into Spanish. I'm opening this issue because I saw in the README file that you'd like multilingual support and content.

About me

I was born in a Spanish-speaking country; Spanish is my main language. I started learning English at the age of 4 (and I'm still learning).
I recently translated the clean-code-typescript repository from English to Spanish. I've also translated some small texts for friends.

Where can I contribute

I can help translate the documentation and the tutorials about how to use texthero, as well as add multilanguage support in Spanish.
I have no experience with Python or text data and related things, but if it's only about translating English text into Spanish, I can do it.
The time I can invest in this project is moderate but not fixed; I can work some hours per week (depending on how many texts I have to translate).

More translators

It seems that this project will grow a lot; that's why, if any other Spanish speakers want to help me with the Spanish translation, they are welcome to. It would be great to get your help.

Implement/support/explain topic modelling

Goal
Implement topic modeling on Texthero.

Topic modeling
There are mainly two ways to do topic modeling: LSA/LSI (latent semantic analysis/indexing) and LDA (Latent Dirichlet Allocation). This simple tutorial explains how to implement them in Python.

Python implementation
LSA/LSI is basically just TF-IDF + SVD. What's important is to understand how to visualize the result and how to return the topic model information from the function.
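A minimal sketch of that LSA/LSI idea (illustrative only; the names and parameters are not Texthero API):

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.Series(["the cat sat on the mat", "the dog sat", "stocks fell sharply today"])

# LSA/LSI: TF-IDF followed by a truncated SVD; each row of topic_vectors
# is a document expressed in the latent topic space.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
topic_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf_matrix)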

Documentation
Other than adding the docstring, it's probably useful to write a "getting started" tutorial on how topic modeling works and how to use Texthero's function.

We will probably want to implement both LSI and LDA, possibly in two separate functions.

This issue is a work in progress. Any help is very appreciated!

Add drop_duplicates

(Edit)

Add hero.drop_duplicates(s, representation, distance_algorithm, threshold).

Where:

  • s is a Pandas Series
  • representation is either a Flair embedding or a hero representation function. Need to define a default value.
  • distance_algorithm is either a string or a function that takes as input two vectors and it computes their distance. Example of such a function is sklearn.metrics.pairwise.euclidean_distances (see scikit-learn repository)
  • threshold is a numeric value: all vectors whose distance is less than this value will be considered a single document. The first in order of appearance in the Pandas Series will be kept.

Task:
Drop all duplicates from the given Pandas Series and return a cleaned version of it.

TODO:
It would be interesting to drop_duplicates from a DataFrame, specifying which column to consider (as done in Pandas).

Add a FAQ page to the website

Some of the user's most common questions are:

  • Does Texthero support other languages than English?
  • How fast is Texthero?

Other important questions are:

  • Why Texthero in the era of Transformers? (spoiler: preprocessing and understanding data is crucial before using fancy ML model)
  • How can I contribute?
  • Why is Texthero model-agnostic? See #155

These questions should be added and answered in a clean page on texthero.org.

Call to action

Is there anything that's not clear? Do you have question/answer pairs to add to the FAQ? Let me know in the comments. Thanks!

count(s) and term_frequency(s)

Texthero's hero.term_frequency(s) should be renamed hero.count(s), as in reality it just counts the terms.
Because of that, we should add another function that properly implements hero.term_frequency, that is: (number of times term t appears in a document) / (total number of terms in the document).

The distinction between the two should be made clear in the docstrings, and both functions should have a "See Also" section to let the user quickly move from one function's documentation to the other.

Both implementations might be written using scikit-learn's CountVectorizer, as sketched below.
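A minimal sketch of the distinction (illustrative only, not the final implementation):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

s = pd.Series(["to be or not to be", "to do"])

# hero.count: raw counts of each term per document.
counts = CountVectorizer().fit_transform(s).toarray()

# hero.term_frequency: counts divided by the number of terms per document.
term_frequency = counts / counts.sum(axis=1, keepdims=True)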

Remove Diacritics for Urdu Language

I would like to contribute for Urdu language support, let me start with simple issue now,

For the Urdu text with diacritics text = "اِس, اُس"

the following code produces incorrect output:

import pandas as pd
import texthero as hero

text = "اِس, اُس"
s = pd.Series(text)
s1 = hero.remove_diacritics(s)
s1
# 0    is, us

This produces the output is, us, which is not what was intended; it is a transliterated output.

The intended output is اس, اس

This can probably be achieved by removing the following diacritic characters:

# Urdu diacritics
zabar = u'\u064e'
pesh = u'\u064f'
zer = u'\u0650'
tashdid = u'\u0651'
jazam = u'\u0652'

Preprocessing: explain how to create a custom pipeline

(Edit)

Add, under Getting Started - Preprocessing, a section that explains how to create a custom pipeline. This solution is easier than #9.

Explain in the docstring of clean how to create a custom pipeline. Code example:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

Check if Series consists of strings only, instead of casting to unicode

Currently, some functions do not check whether the Series they receive as input really consists of strings only, and they give unexpected results, e.g. if there are missing values.

Example:

import texthero as hero
import pandas as pd
import numpy as np

s = pd.Series(["Test", np.nan])
hero.noun_chunks(s)
>>0                   []
>>1    [(nan, NP, 0, 3)]

This could be fixed by no longer using s.astype('unicode'), which e.g. converts np.nan -> "nan". Instead, a function should check whether the Series consists of strings only. Something along the lines of:

def _check_series_strings(s):
    # Every cell must contain a string.
    if not s.map(type).eq(str).all():
        raise TypeError("Non-string values in series. Use hero.drop_no_content(s) to drop those values.")

Adding metadata to Series

When we apply tfidf or term_frequency, for instance, it would be very valuable to have the get_feature_names() information integrated into the Series.

There are different methods to do that, yet I still haven't found a reliable solution.

>>> import pandas as pd
>>> df = pd.DataFrame({'text': ['a','b','c']})
>>> data = pd.Series([1,2,3])
>>> data.features_names = ['a b']
>>> print(data.features_names)
['a b']
>>> df['data'] = data
>>> print(df.data.features_names)
AttributeError: 'Series' object has no attribute 'features_names'

This works, but if we assign the Series to a DataFrame, the metadata is lost.

The solution is probably one from the following:

Extending Pandas:

  1. Registering custom accessors - not sure how (see the sketch after this list).
  2. Subclassing pandas data structures - may work, but it may be unnecessary and unnecessarily complicated.
  3. Extension types - this may work, but solution 1 should be preferable.

geopandas is an example of solution 2; it may be useful for understanding the problem better.
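A minimal sketch of solution 1 (the accessor name is hypothetical; note that metadata stored this way still does not survive assignment into a DataFrame):

import pandas as pd

@pd.api.extensions.register_series_accessor("hero_meta")
class HeroMetaAccessor:
    # A registered accessor gives every Series an s.hero_meta namespace
    # where metadata such as feature names could be stored.
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        self.features_names = None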

Ideas and suggestions are very welcome!

[EDIT]

The attrs property may work, but it's an experimental feature: pandas.DataFrame.attrs

cannot import NLTKWordTokenizer

Should I install a specific version of NLTK?

when I run :
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

I got this output:

[nltk_data] Downloading package stopwords to
[nltk_data] /home/aistudio/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
⚠ Skipping model package dependencies and setting --no-deps. You
don't seem to have the spaCy package itself installed (maybe because you've
built from source?), so installing the model dependencies would cause spaCy to
be downloaded, which probably isn't what you want. If the model package has
other dependencies, you'll have to install them manually.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 1 import texthero as hero
      2 import pandas as pd
      3
      4 df = pd.read_csv(
      5     "dataset/bbcsport.csv"

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/texthero/__init__.py in <module>
     10 from .representation import *
     11
---> 12 from . import visualization
     13 from .visualization import *
     14

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/texthero/visualization.py in <module>
      7
      8 from wordcloud import WordCloud
----> 9 from nltk import NLTKWordTokenizer
     10
     11 from texthero import preprocessing

ImportError: cannot import name 'NLTKWordTokenizer' from 'nltk' (/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/nltk/__init__.py)

WordCloud: TypeError: wordcloud() got an unexpected keyword argument 'min_font_size' and 'max_font_size'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-9c76fea65749> in <module>()
----> 1 hero.visualization.wordcloud(df1['clean_tweet'],width = 200, height= 200,background_color='White',min_font_size=2)

TypeError: wordcloud() got an unexpected keyword argument 'min_font_size'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-951ae06e0982> in <module>()
----> 1 hero.visualization.wordcloud(df1['clean_tweet'],width = 200, height= 200,background_color='White',max_font_size=8)

TypeError: wordcloud() got an unexpected keyword argument 'max_font_size'

I am working on Google Colab and tried using the argument min_font_size to change it from default value 4 to 2. But it gives me a TypeError.

Although the documentation says the method has these arguments, it doesn't accept them when they are set to another value.

Was this feature removed, or is it yet to be added?

Add most_similar

Add hero.most_similar(s, representation, distance_algorithm, threshold, vector)

This task is very similar to #4.

Given a Pandas Series, the code returns a Pandas Series ordered by the distance from the vector.

Open question (same question as #4)

If s is a Text Series, representation is required.
If s is already represented (i.e., it is a Representation Series), representation is not required.

We can either accept that s can have both structures or stick to only one (in this case, the second solution is probably preferable). Opinions?

Make spaCy-nlp functions faster

(Edit)

Almost all functions of the nlp module make use of spaCy under the hood.

In general, spaCy is quite fast as it uses Cython.

The core code looks like this:

new_data = []
for row in nlp.pipe(s.values, batch_size=32):
    new_data.append( ... row ...)

spaCy's pipe was initially chosen as it supports multi-threading. An alternative might be to use apply (which is probably slower).

The pipe function has, among others, n_threads and batch_size arguments. Tuning these values might be very important.

This task consists of (see the benchmark sketch after this list):

  1. Understanding spaCy's pipe
  2. Testing different combinations of n_threads and batch_size values on a large dataset
  3. (if it makes sense) Comparing these results with the pandas apply approach
  4. Picking the best solution and implementing it in all NLP functions that use spaCy under the hood
  • We might find that the optimal values of n_threads and batch_size are not always the same; in this case, we will need to add them as arguments to the NLP functions and update the docstring.
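A minimal benchmark sketch for step 2 (the model name, corpus and batch sizes are arbitrary choices; recent spaCy versions deprecate n_threads, so only batch_size is varied here):

import time
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
s = pd.Series(["This is an example sentence for timing."] * 10_000)

# Time spaCy's pipe with different batch sizes on the same corpus.
for batch_size in (32, 128, 512, 2048):
    start = time.perf_counter()
    for doc in nlp.pipe(s.values, batch_size=batch_size):
        pass
    print(batch_size, round(time.perf_counter() - start, 2), "seconds")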

Useful resources:

Turbo-charge your spaCy NLP pipeline

Arabic support

Hi, my name is Adam. I'm a native Arabic speaker, and I'm interested in helping out as much as I can in this project.

Previous Experience
I have never worked on an NLP project before, but I'm used to normal classification and regression problems using SVM, KNN and Logistic Regression, and I've seen in the README that this is not an issue and you're fine with someone new to NLP.

Add replace_hashtags, remove_hashtags

(Edit)

Add replace_hashtags, remove_hashtags

Example:

>>> s = pd.Series(["Hey #git123 how are you doing?"])
>>> hero.replace_tag(s, symbol="X")
0    Hey Xgit123 how are you doing? 
dtype: object

Implementation

Please refer to the other replace_* and remove_* examples by looking at the code in preprocessing.py. remove_hashtags should simply call replace_hashtags with the appropriate symbol.

Thank you /u/penatbater from Reddit for the suggestion.

Preprocessing issue "E-I-E-I-O" with pipeline lowercase + stop_words

Problem:

Given the sentence

"E-I-E-I-O\nAnd on"

And the pipeline pre.lowercase, pre.remove_stopwords. The clean method returns:

"e--e--\n "

It should return:

"e-i-e-i-o\n "

Code:

import texthero as hero
import pandas as pd
from texthero import preprocessing as pre

s = pd.Series("E-I-E-I-O\nAnd on")
pipeline = [pre.lowercase, pre.remove_stopwords]

hero.clean(s, pipeline=pipeline)

Kind of Pandas Series

Motivation

Having a unified view and a clear idea of the expected Pandas Series input is useful both for users and for developers.

Receiving precise and correct errors is very valuable for users, as it permits easy and pleasant debugging. We can summarize three kinds of Pandas Series that a Texthero function can receive as input (or produce as output):

Types

  • "Pandas Text Series" --> every cell has some text
  • "Pandas Tokenized Series" --> every cell has a list of tokens
  • "Pandas Representation Series" --> every cell is a representation of a text ( it's a list of float values). This will be improved soon (See issue #43)

In the best scenario, every Texthero function receives as input a Pandas Series of one of these three kinds. Testing that the given Pandas Series is of the expected type is therefore useful.

Go further

  • preprocessing.py: almost all functions (with the exception of tokenize) take a Pandas Text Series as input and return a Pandas Text Series.
  • representation.py: the input (will be #44) is a Tokenized Pandas Series and the output will be a Representation Pandas Series
  • nlp.py: the input is a Text Pandas Series, whereas the output is TODO
  • visualization.py: TODO.

It would be great to have a unified and clear view of all this:

  1. Every function should check for the right type (we will need to define the "check" functions, probably in a new file, something like _helper.py; see the sketch after this list)
  2. Once everything is in place and defined, add to the website (documentation) a clear document that explains all this. It will be so easy to use Texthero then!
  3. New ideas
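A minimal sketch of such check functions (the file and function names are placeholders):

import pandas as pd

# e.g. in _helper.py
def is_text_series(s: pd.Series) -> bool:
    # Every cell contains a string.
    return s.map(type).eq(str).all()

def is_tokenized_series(s: pd.Series) -> bool:
    # Every cell contains a list of tokens.
    return s.map(type).eq(list).all()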

Extra

Unfortunately, there are more variants of Pandas Series (the output of named_entities, the output of pca, ...); there is still some design work to do there ...

Work in progress ...

Chinese language support

Hello, my name is Guoao Wei. I am a Chinese student interested in NLP and I can help with the Chinese language support for this amazing repository.

About me

I received a bachelor's degree in Software Engineering in China. I worked as a research intern in the Chinese Academy of Sciences for a year, focusing on NLP-related topics.

I had been searching for tools that save time on writing redundant preprocessing code when dealing with text data (I wrote my own simple one, AlfredWGA/nlputils), until I found Texthero. Therefore I am happy to contribute to this toolkit.

Things I can do

  • Translate documents & tutorials into Chinese
  • Add Chinese support for modules
  • Add more features (pre-trained word embedding & language models, etc.)

tfidf(s): remove normalization, improve docstring

The tfidf function, under the hood, makes use of sklearn.feature_extraction.text.TfidfVectorizer.

By default, TfidfVectorizer returns an L2-normalized TF-IDF vector. In Texthero, we would like to avoid this hidden behavior, to let the user have more granular control over each action.

This task consists of removing (and testing the removal of) the normalization, as well as making it clearer in the docstring which tf-idf formula is used. For extra information, the text feature extraction chapter of the scikit-learn documentation might be useful.

Task:

  1. Remove L2-normalization.
  2. Test the new function (using the mathematical formula for each value)
  3. Improve the docstring making it clear which TF-IDF function is used

Implementation:

There are probably two ways to solve this:

  1. Using TfidfVectorizer and de-L2-normalizing the output
  2. Using pure NumPy. This would be nicer from a code perspective, but it might be tricky, as working with NumPy vectors is not always trivial. As the result should be a sparse matrix (see #43), this means we will have to deal with sparse matrices.
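For approach 1, note that scikit-learn's TfidfVectorizer also exposes a norm parameter, so the normalization can likely be switched off directly rather than undone afterwards:

from sklearn.feature_extraction.text import TfidfVectorizer

# norm=None skips the L2 normalization step entirely.
tfidf = TfidfVectorizer(use_idf=True, norm=None)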

Extra:
After having implemented #43, we will add two new functions, L1 and L2, to normalize any "Pandas Representation Series".

Add replace_punctuation and replace_digits

Example:

>>> replace_digits(pd.Series("123"), "#", block=True)
pd.Series(["#"])
>>> replace_digits(pd.Series("123"), "#", block=False)
pd.Series(["###"])

(argument names can be changed)

Add hero.infer_lang(s)

(Edit)

Add a function hero.infer_lang(s) (a suggestion for a better function name is more than welcome!) that, given a Pandas Series, finds the language of each row.

Implementation

  1. Probably, we will need to define a helper _infer_lang function inside the main function that takes a text as input and returns its language. Then infer_lang will just apply it to s (see the sketch after this list).
  2. Searching Google for "infer language python" might be a good start.
  3. There are probably two ways to solve this: rule-based and model-based. If the rule-based approach has high accuracy, then we can stick to it, as it's probably faster.
  4. Add it under nlp.py
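A minimal sketch of point 1, using the langdetect package purely as an example (the library choice is an assumption, not a decision):

import pandas as pd
from langdetect import detect

def infer_lang(s: pd.Series) -> pd.Series:
    # Inner helper: text in, language code out.
    def _infer_lang(text):
        return detect(text)
    return s.apply(_infer_lang)

infer_lang(pd.Series(["Hello, how are you?", "Bonjour tout le monde"]))
# 0    en
# 1    fr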

Improvement

A more complex solution would return not only the language but also the probability, or even better, a dictionary like this one (or similar):

{
   'en': 0.8, 
   'fr': 0.1, 
   'es': 0.0
}

Expected PR

  1. Should motivate the choice of algorithm or external library
  2. Should explain how complex it would be to add the "improvement"
  3. Should show a concrete example on a large and multilingual dataset, or at least give proof that it works well (for example, by citing that under the hood the function uses package X, which achieved Y accuracy on ...)
  4. All other requirements as stated in CONTRIBUTING.md
  • Removed good first issue label

Add hero.phrases(s)

(Edited)

For some NLP tasks, it's useful to merge tokens together. For instance, "New York" should be considered a single word: "New_York".

As of now, tokenize_with_phrases from preprocessing.py does exactly that. This function is not currently published on the website.

tokenize_with_phrases uses Gensim's PhrasesTransformer on tokenized text. In turn, PhrasesTransformer implements the "phrases algorithm" from Mikolov et al. (the Word2Vec authors), Distributed Representations of Words and Phrases and their Compositionality.

The current solution of using tokenize_with_phrases is not optimal, as this function calls hero.tokenize. A better alternative would be to add a separate function, i.e. hero.phrases(s), that, given an already tokenized Series, creates phrases by merging tokens together.

hero.phrases(s, ...)

Where:

  • s is a Tokenized Pandas Series
  • for the other arguments look at tokenize_with_phrases

Returns:
A Tokenized Pandas Series with phrases.

TODO:

  • The function might instead be called add_phrases or merge_with_phrases.
  • We could add an argument symbol for the merging symbol. Example: symbol="__" --> New__York

How to
Rename tokenize_with_phrases and adapt it. Not much needs to be done, actually :)

Allow a custom stop-words list

Allow passing a custom stop-words list to the remove_stop_words function.

Edit: starting from v1.0.6, the function is called remove_stopwords (without the second underscore).

Create a method to summarize text data

Overview
We need a method that takes in a pd.Series of text data and is able to summarize it, identifying the topic, important entities and figures.

Approach
Research deep learning approaches to summarizing text and decide on the most suitable one.
Use spaCy or similar libraries to identify the main entities.

Help is required and all ideas are welcome!

Support "Pandas Series Representation"

This is one of the most interesting (future) aspects of Texthero: the ability to represent any text dataset with ease, even very large datasets.

Motivation
One of the big limitations of the current version of Texthero is that the output of the tfidf function, or of any other "representation" function, is not particularly interpretable. The user does not even know which tf-idf weight is associated with which word/token.

The solution is to return a MultiIndex Pandas Series where the first level represents the document and the second level represents the word. See the example below:

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["I am GROOT", "Flame on"])
>>> s = hero.tokenize(s)
>>> hero.tfidf(s)
document  word 
0        GROOT    0.577350
          I        0.577350
          am       0.577350
1        Flame    0.707107
          on       0.707107
dtype: Sparse[float64, nan]

The advantage of this approach is that:

  1. The result is much more interpretable
  2. The result is a sparse Pandas Series! This is very good, as the output of TF-IDF (especially when max_features=None) is often a very large and very sparse matrix.

The drawback is that this Pandas Series cannot be appended directly to a Pandas DataFrame.

We refer to this MultiIndex Series, where the first level is the document and the second level is the term, as a
"Pandas Series Representation" (a better name is welcome!).

Texthero 2.0

Starting from Texthero 2.0, all(?) "representation" functions will return such a Pandas Representation Series. The pca/nmf functions will accept a Pandas Representation Series as input and will (probably) return a flat representation, as it no longer makes sense to have a second level called "pca-component-1".

From Pandas Representation Series to Pandas Series

A function to_flat_series, or something similar, will transform the Pandas Representation Series into a (flattened) Pandas Series (like the current output of tfidf). This will permit appending the Series to the initial df.

From Pandas Representation Series to a document-term matrix

Just by calling .unstack() on the Pandas Representation Series, it will be possible to convert it to a Pandas DataFrame where rows are the documents and every column is a term. Nice, right? We will need to explain clearly how to deal with a MultiIndex (the basics are not particularly hard).
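A quick sketch, continuing the tfidf example above (the variable name is illustrative):

>>> df_document_term = hero.tfidf(s).unstack()
>>> df_document_term.shape
(2, 5)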

Interested in helping out?
Most of the code has already been written. If you are interested in helping out with these important changes, leave a comment. We will be glad to have you on board!

Your opinion
Your opinion matters; let us know your thoughts!

Any way I can upload texthero visualization plot to my Plotly dashboard?

Hi,

I am just curious whether it is doable to directly upload the plot to my Plotly dashboard once hero.scatterplot() is called?

chart_studio.tools.set_credentials_file(username='username', api_key='token')
chart_studio.tools.set_config_file(world_readable=True, sharing='public')

Regards,
Jack
