5pk's Issues
Structure for sentiment and the other approach
Each text is a list of sentences
Each sentence is a list of tokens
Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags
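The nested text/sentence/token structure described above can be sketched in plain Python. The names and tag values here are illustrative assumptions, not taken from the source:

```python
from collections import namedtuple

# One token: word form, lemma, and a list of associated tags
Token = namedtuple("Token", ["form", "lemma", "tags"])

# A text is a list of sentences; a sentence is a list of tokens
text = [
    [  # sentence 1
        Token(form="Räuber", lemma="Räuber", tags=["NOUN", "nom"]),
        Token(form="kämpften", lemma="kämpfen", tags=["VERB", "past"]),
    ],
    [  # sentence 2
        Token(form="Schiller", lemma="Schiller", tags=["PROPN"]),
    ],
]

# Access pattern: text[sentence_index][token_index]
first = text[0][0]
print(first.form, first.lemma, first.tags)
```

A namedtuple keeps the three-element tuple shape while letting later code refer to `token.lemma` instead of `token[1]`.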
Something to cite
German novels text mined...
http://www.jltonline.de/index.php/conferences/article/view/502/1306
Semantic networks with Gephi tutorial
Database
Take the Leipzig Corpora files and load them into a SQLite3 database. After that, process the SQL data with SQLAlchemy from Python.
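Before wiring up the SQLAlchemy layer, the SQLite side can be sketched with the standard library alone. The table name, columns, and sample rows below are assumptions for illustration, not the Leipzig Corpora schema:

```python
import sqlite3

# In-memory database for illustration; a file path would be used in practice
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schema for corpus sentences
cur.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, author TEXT, text TEXT)")
rows = [
    ("Schiller", "Früh übt sich, was ein Meister werden will."),
    ("Goethe", "Hier bin ich Mensch, hier darf ich's sein."),
]
cur.executemany("INSERT INTO sentences (author, text) VALUES (?, ?)", rows)
conn.commit()

# Query back all sentences by one author
cur.execute("SELECT text FROM sentences WHERE author = ?", ("Schiller",))
schiller_sentences = [r[0] for r in cur.fetchall()]
print(schiller_sentences)
```

SQLAlchemy would then map the same table to Python classes, so the analysis code never writes raw SQL.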
Text preparation
Semantic analysis tutorial
Find a rhyme
Title
- Goethe the aesthete and Schiller a fighter? Can modern methods of computational linguistics confirm this?
[spaCy] TTR (type-token ratio) analysis
Implementation: https://stackoverflow.com/q/49247590/7477664
Definition:
https://en.m.wikipedia.org/wiki/Lexical_density
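The linked implementation aside, the type-token ratio itself reduces to a few lines. This sketch uses naive lowercasing and whitespace tokenization for illustration; a spaCy pipeline would supply proper tokens:

```python
def type_token_ratio(text: str) -> float:
    """Number of distinct word types divided by the total token count."""
    tokens = text.lower().split()  # naive tokenization for illustration
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 7 distinct types over 8 tokens ("the" appears twice)
ratio = type_token_ratio("the quick fox jumps over the lazy dog")
print(ratio)  # 0.875
```

Note that raw TTR falls as texts grow longer, so comparing Goethe and Schiller fairly requires equal-length samples or a length-corrected variant.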
Corpora
Next steps to glory
- Create the database
- Load every book into the database
[spaCy] Find the most common words / nouns / verbs / ...
Examples
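Frequency counting can be sketched with `collections.Counter` over (form, lemma, tag) tokens. The sample tokens and the tag values used in the noun filter are assumptions about the tagging scheme, not confirmed by the source:

```python
from collections import Counter

# Hypothetical tagged tokens: (form, lemma, tag)
tokens = [
    ("Die", "der", "DET"),
    ("Räuber", "Räuber", "NOUN"),
    ("fliehen", "fliehen", "VERB"),
    ("und", "und", "CONJ"),
    ("die", "der", "DET"),
    ("Räuber", "Räuber", "NOUN"),
    ("kämpfen", "kämpfen", "VERB"),
]

# Most common lemmas overall
all_counts = Counter(lemma for _, lemma, _ in tokens)

# Most common nouns only, by filtering on the tag
noun_counts = Counter(lemma for _, lemma, tag in tokens if tag == "NOUN")

print(all_counts.most_common(2))
print(noun_counts.most_common(1))
```

Counting lemmas rather than surface forms merges inflected variants ("Die"/"die") into a single entry.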
Schiller books
Works
Dramas:
Die Räuber (1781)
Kabale und Liebe (1783)
Die Verschwörung des Fiesco zu Genua (1784)
Don Carlos (1787)
Wallenstein-Trilogie (1799)
Wallensteins Lager
Die Piccolomini
Wallensteins Tod
~~Maria Stuart (1800)~~
Die Jungfrau von Orleans (1801)
Die Braut von Messina (1803)
Wilhelm Tell (1804)
Demetrius (1805, Fragment)
Stories:
Der Verbrecher aus verlorener Ehre (1786)
Der Geisterseher (1789, Fragment)
Poetry:
An die Freude (1785)
An die Freunde (1802)
An Emma (1797)
Berglied (1804)
Das Ideal und das Leben (1795)
Das Geheimniß (1797)
Das Lied von der Glocke (1797)
Das Siegesfest (1802)
Das verschleierte Bild zu Sais (1795)
Der Abend (1776)
Der Alpenjäger (1804)
Der Besuch (1797)
Der Gang nach dem Eisenhammer (1797)
Der Graf von Habsburg (1803)
Der Handschuh (1797)
Der Jüngling am Bache (1803)
Der Kampf mit dem Drachen (1798)
Der Pilgrim (1803)
Der Ring des Polykrates (1797)
Der Taucher (1797)
Des Mädchens Klage (1798)
Die Begegnung (1798)
Die Bürgschaft (1798)
Die Erwartung (1799)
Die Götter Griechenlands (1788)
Die Gunst des Augenblicks (1802)
Die Ideale (1795)
Die Kindsmörderin (1782)
Die Kraniche des Ibykus (1797)
Die Teilung der Erde (1795)
Die vier Weltalter (1802)
Hero und Leander (1801)
Kassandra (1802)
Klage der Ceres (1796)
Nadowessische Todtenklage (1797)
Nänie (1800)
Punschlied (1803)
Punschlied - Im Norden zu singen (1803)
Ritter Toggenburg (1797)
Sehnsucht (1802)
Sprüche des Confucius (1795)
Xenien (1796)
Writings on literary theory:
Über Bürgers Gedichte (1791)
Über epische und dramatische Dichtung (1797, with Goethe)
Historical writings:
Geschichte des Abfalls der vereinigten Niederlande von der Spanischen Regierung (1788)
Was heißt und zu welchem Ende studiert man Universalgeschichte? (1789)
Geschichte des dreißigjährigen Krieges (1790)
Philosophical writings:
Über Anmuth und Würde (1793)
Über die ästhetische Erziehung des Menschen (1795)
Über naive und sentimentalische Dichtung (1795)
Über den Dilettantismus (1799, with Goethe)
Über das Erhabene (1801)
Die Schaubühne als moralische Anstalt (1802)
Coding environment
- Docker with the Kaggle Python image
- pyenv with auto-activation
- Jupyter notebook server on port 8080
More information about text mining
Go further with mining
Once NLTK is installed and you have a Python console running, we can start by creating a
paragraph of text:
para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split para into sentences. First we need to import the sentence tokenization
function, and then we can call it with the paragraph as an argument.
from nltk.tokenize import sent_tokenize
sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
So now we have a list of sentences that we can use for further processing.
How it works...
sent_tokenize uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module. This instance has already been trained, and works
well for many European languages, so it knows which punctuation and characters
mark the end of one sentence and the beginning of the next.
There's more...
The instance used in sent_tokenize() is actually loaded on demand from a pickle
file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the
PunktSentenceTokenizer once, and call its tokenize() method instead.
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
Other languages
If you want to tokenize sentences in languages other than English, you can load one of the
other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer.
Here's an example for Spanish:
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
See also
In the next recipe, we'll learn how to split sentences into individual words. After that, we'll
cover how to use regular expressions for tokenizing text.
Tokenizing sentences into words
In this recipe, we'll split a sentence into individual words. The simple task of creating a list of
words from a string is an essential part of all text processing.
How to do it...
Basic word tokenization is very simple: use the word_tokenize() function:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']
How it works...
word_tokenize() is a wrapper function that calls tokenize() on an instance of the
TreebankWordTokenizer. It's equivalent to the following:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']
It works by separating words using spaces and punctuation. And as you can see, it does not
discard the punctuation, allowing you to decide what to do with it.
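The split-on-spaces-but-keep-punctuation behaviour can be approximated with a regular expression from the standard library. This is a rough sketch, not the Treebank rules; for example, it splits a clitic like "It's" into three tokens rather than Treebank's two:

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    # Words as runs of word characters; each punctuation mark kept
    # as its own token instead of being discarded
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hello World."))  # ['Hello', 'World', '.']
```

Keeping punctuation as separate tokens, as above, leaves the decision of whether to drop it to later processing steps.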
Datasets for sentiment analysis
https://www.w3.org/community/sentiment/wiki/Datasets
There is also a smiley (emoticon) dataset
Text summarization