5pk's Issues
Structure for sentiment and the other approach
Each text is a list of sentences
Each sentence is a list of tokens
Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags
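The nested text/sentence/token structure described above can be sketched in plain Python. The names and tag values here are illustrative assumptions, not taken from the source:

```python
from collections import namedtuple

# One token: word form, lemma, and a list of associated tags
Token = namedtuple("Token", ["form", "lemma", "tags"])

# A text is a list of sentences; a sentence is a list of tokens
text = [
    [  # sentence 1
        Token(form="Räuber", lemma="Räuber", tags=["NOUN", "nom"]),
        Token(form="kämpften", lemma="kämpfen", tags=["VERB", "past"]),
    ],
    [  # sentence 2
        Token(form="Schiller", lemma="Schiller", tags=["PROPN"]),
    ],
]

# Access pattern: text[sentence_index][token_index]
first = text[0][0]
print(first.form, first.lemma, first.tags)
```

A namedtuple keeps the three-element tuple shape while letting later code refer to `token.lemma` instead of `token[1]`.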
Something to cite
German novels text mined...
http://www.jltonline.de/index.php/conferences/article/view/502/1306
Semantic networks with Gephi tutorial
Database
Take the Leipzig Corpora files and load them into a SQLite3 database. After that, process the SQL data with SQLAlchemy from Python.
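Before wiring up the SQLAlchemy layer, the SQLite side can be sketched with the standard library alone. The table name, columns, and sample rows below are assumptions for illustration, not the Leipzig Corpora schema:

```python
import sqlite3

# In-memory database for illustration; a file path would be used in practice
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schema for corpus sentences
cur.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, author TEXT, text TEXT)")
rows = [
    ("Schiller", "Früh übt sich, was ein Meister werden will."),
    ("Goethe", "Hier bin ich Mensch, hier darf ich's sein."),
]
cur.executemany("INSERT INTO sentences (author, text) VALUES (?, ?)", rows)
conn.commit()

# Query back all sentences by one author
cur.execute("SELECT text FROM sentences WHERE author = ?", ("Schiller",))
schiller_sentences = [r[0] for r in cur.fetchall()]
print(schiller_sentences)
```

SQLAlchemy would then map the same table to Python classes, so the analysis code never writes raw SQL.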
Text preparation
Semantic analysis tutorial
Find a rhyme
Title
- Goethe the aesthete and Schiller a fighter? Can modern methods of computational linguistics confirm this?
[spaCy] TTR (type-token ratio) analysis
Implementation: https://stackoverflow.com/q/49247590/7477664
Definition:
https://en.m.wikipedia.org/wiki/Lexical_density
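The linked implementation aside, the type-token ratio itself reduces to a few lines. This sketch uses naive lowercasing and whitespace tokenization for illustration; a spaCy pipeline would supply proper tokens:

```python
def type_token_ratio(text: str) -> float:
    """Number of distinct word types divided by the total token count."""
    tokens = text.lower().split()  # naive tokenization for illustration
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 7 distinct types over 8 tokens ("the" appears twice)
ratio = type_token_ratio("the quick fox jumps over the lazy dog")
print(ratio)  # 0.875
```

Note that raw TTR falls as texts grow longer, so comparing Goethe and Schiller fairly requires equal-length samples or a length-corrected variant.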
Corpora
Next steps to glory
- Create the database
- Load every book into the database
[spaCy] Find the most common words / nouns / verbs / ...
Examples
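Frequency counting can be sketched with `collections.Counter` over (form, lemma, tag) tokens. The sample tokens and the tag values used in the noun filter are assumptions about the tagging scheme, not confirmed by the source:

```python
from collections import Counter

# Hypothetical tagged tokens: (form, lemma, tag)
tokens = [
    ("Die", "der", "DET"),
    ("Räuber", "Räuber", "NOUN"),
    ("fliehen", "fliehen", "VERB"),
    ("und", "und", "CONJ"),
    ("die", "der", "DET"),
    ("Räuber", "Räuber", "NOUN"),
    ("kämpfen", "kämpfen", "VERB"),
]

# Most common lemmas overall
all_counts = Counter(lemma for _, lemma, _ in tokens)

# Most common nouns only, by filtering on the tag
noun_counts = Counter(lemma for _, lemma, tag in tokens if tag == "NOUN")

print(all_counts.most_common(2))
print(noun_counts.most_common(1))
```

Counting lemmas rather than surface forms merges inflected variants ("Die"/"die") into a single entry.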
Schiller books
Works
Dramas:
Die Räuber (1781)
Kabale und Liebe (1783)
Die Verschwörung des Fiesco zu Genua (1784)
Don Carlos (1787)
Wallenstein-Trilogie (1799)
Wallensteins Lager
Die Piccolomini
Wallensteins Tod
~~Maria Stuart (1800)~~
Die Jungfrau von Orleans (1801)
Die Braut von Messina (1803)
Wilhelm Tell (1804)
Demetrius (1805, Fragment)
Stories:
Der Verbrecher aus verlorener Ehre (1786)
Der Geisterseher (1789, Fragment)
Poetry:
An die Freude (1785)
An die Freunde (1802)
An Emma (1797)
Berglied (1804)
Das Ideal und das Leben (1795)
Das Geheimniß (1797)
Das Lied von der Glocke (1797)
Das Siegesfest (1802)
Das verschleierte Bild zu Sais (1795)
Der Abend (1776)
Der Alpenjäger (1804)
Der Besuch (1797)
Der Gang nach dem Eisenhammer (1797)
Der Graf von Habsburg (1803)
Der Handschuh (1797)
Der Jüngling am Bache (1803)
Der Kampf mit dem Drachen (1798)
Der Pilgrim (1803)
Der Ring des Polykrates (1797)
Der Taucher (1797)
Des Mädchens Klage (1798)
Die Begegnung (1798)
Die Bürgschaft (1798)
Die Erwartung (1799)
Die Götter Griechenlands (1788)
Die Gunst des Augenblicks (1802)
Die Ideale (1795)
Die Kindsmörderin (1782)
Die Kraniche des Ibykus (1797)
Die Teilung der Erde (1795)
Die vier Weltalter (1802)
Hero und Leander (1801)
Kassandra (1802)
Klage der Ceres (1796)
Nadowessische Todtenklage (1797)
Nänie (1800)
Punschlied (1803)
Punschlied - Im Norden zu singen (1803)
Ritter Toggenburg (1797)
Sehnsucht (1802)
Sprüche des Confucius (1795)
Xenien (1796)
Writings on literary theory:
Über Bürgers Gedichte (1791)
Über epische und dramatische Dichtung (1797, with Goethe)
Historical writings:
Geschichte des Abfalls der vereinigten Niederlande von der Spanischen Regierung (1788)
Was heißt und zu welchem Ende studiert man Universalgeschichte? (1789)
Geschichte des dreißigjährigen Krieges (1790)
Philosophical writings:
Über Anmuth und Würde (1793)
Über die ästhetische Erziehung des Menschen (1795)
Über naive und sentimentalische Dichtung (1795)
Über den Dilettantismus (1799, with Goethe)
Über das Erhabene (1801)
Die Schaubühne als moralische Anstalt (1802)
Coding environment
- Docker with the Kaggle Python image
- pyenv with auto-activation
- Jupyter notebook server on port 8080
More information about text mining
Go further with mining
Once NLTK is installed and you have a Python console running, we can start by creating a
paragraph of text:
para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split para into sentences. First we need to import the sentence tokenization
function, and then we can call it with the paragraph as an argument.
from nltk.tokenize import sent_tokenize
sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
So now we have a list of sentences that we can use for further processing.
How it works...
sent_tokenize uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module. This instance has already been trained, and works
well for many European languages, so it knows which punctuation and characters
mark the end of one sentence and the beginning of the next.
There's more...
The instance used in sent_tokenize() is actually loaded on demand from a pickle
file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the
PunktSentenceTokenizer once, and call its tokenize() method instead.
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
Other languages
If you want to tokenize sentences in languages other than English, you can load one of the
other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer.
Here's an example for Spanish:
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
See also
In the next recipe, we'll learn how to split sentences into individual words. After that, we'll
cover how to use regular expressions for tokenizing text.
Tokenizing sentences into words
In this recipe, we'll split a sentence into individual words. The simple task of creating a list of
words from a string is an essential part of all text processing.
How to do it...
Basic word tokenization is very simple: use the word_tokenize() function:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']
How it works...
word_tokenize() is a wrapper function that calls tokenize() on an instance of the
TreebankWordTokenizer. It's equivalent to the following:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']
It works by separating words using spaces and punctuation. And as you can see, it does not
discard the punctuation, allowing you to decide what to do with it.
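The split-on-spaces-but-keep-punctuation behaviour can be approximated with a regular expression from the standard library. This is a rough sketch, not the Treebank rules; for example, it splits a clitic like "It's" into three tokens rather than Treebank's two:

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    # Words as runs of word characters; each punctuation mark kept
    # as its own token instead of being discarded
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hello World."))  # ['Hello', 'World', '.']
```

Keeping punctuation as separate tokens, as above, leaves the decision of whether to drop it to later processing steps.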
Datasets for sentiment analysis
https://www.w3.org/community/sentiment/wiki/Datasets
There is also a smiley (emoticon) dataset
Text summarization