
5pk's Issues

Structure for sentiment and the other approach

Each text is a list of sentences

Each sentence is a list of tokens

Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags
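The nested structure described above can be sketched as plain Python data; the word forms, lemmas, and tag names below are illustrative, not taken from the actual corpus:

```python
# A text is a list of sentences, a sentence is a list of tokens, and a token
# is a (form, lemma, tags) tuple. All values here are made-up examples.
text = [
    [  # first sentence
        ("Räuber", "räuber", ["NN", "Nom", "Pl"]),
        ("zogen", "ziehen", ["VVFIN", "Past"]),
    ],
    [  # second sentence
        ("Freude", "freude", ["NN", "Nom", "Sg"]),
    ],
]

# Unpacking the three elements of the first token in the first sentence:
form, lemma, tags = text[0][0]
```

Tuples keep each token immutable and cheap, while the outer lists stay mutable for building the structure sentence by sentence.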

Database

Take the Leipzig Corpora files and put them into a SQLite3 database. After that, process the SQL data with SQLAlchemy inside Python.
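The ingestion step might look roughly like this, using only the stdlib sqlite3 module (the table name, columns, and tab-separated input format are assumptions about the Leipzig sentence files; the SQLAlchemy layer would then point at the same database file):

```python
import sqlite3

# Hypothetical schema for Leipzig corpora sentence files, assumed to be
# tab-separated "id<TAB>sentence" lines. Names are illustrative only.
conn = sqlite3.connect(":memory:")  # use a file path like "corpora.db" in practice
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT)")

lines = ["1\tDie Räuber zogen aus.", "2\tFreude schöner Götterfunken."]
rows = [tuple(line.split("\t", 1)) for line in lines]
conn.executemany("INSERT INTO sentences VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
```

SQLAlchemy can then reflect or map this table without changing the ingestion code, which keeps the loading step dependency-free.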

Title

  • Goethe der Schöngeist und Schiller ein Kämpfer? Können moderne Methoden der Computerlinguistik dies bestätigen? (Goethe the aesthete and Schiller a fighter? Can modern methods of computational linguistics confirm this?)

Schiller books

Works

Dramas:

Die Räuber (1781)

Kabale und Liebe (1783)

Die Verschwörung des Fiesco zu Genua (1784)

Don Carlos (1787)

Wallenstein-Trilogie (1799)

Wallensteins Lager

Die Piccolomini

Wallensteins Tod

Maria Stuart (1800)

Die Jungfrau von Orleans (1801)

Die Braut von Messina (1803)

Wilhelm Tell (1804)

Demetrius (1805, Fragment)

Stories:

Der Verbrecher aus verlorener Ehre (1786)

Der Geisterseher (1789, Fragment)

Poetry:

An die Freude (1785)

An die Freunde (1802)

An Emma (1797)

Berglied (1804)

Das Ideal und das Leben (1795)

Das Geheimniß (1797)

Das Lied von der Glocke (1797)

Das Siegesfest (1802)

Das verschleierte Bild zu Sais (1795)

Der Abend (1776)

Der Alpenjäger (1804)

Der Besuch (1797)

Der Gang nach dem Eisenhammer (1797)

Der Graf von Habsburg (1803)

Der Handschuh (1797)

Der Jüngling am Bache (1803)

Der Kampf mit dem Drachen (1798)

Der Pilgrim (1803)

Der Ring des Polykrates (1797)

Der Taucher (1797)

Des Mädchens Klage (1798)

Die Begegnung (1798)

Die Bürgschaft (1798)

Die Erwartung (1799)

Die Götter Griechenlands (1788)

Die Gunst des Augenblicks (1802)

Die Ideale (1795)

Die Kindsmörderin (1782)

Die Kraniche des Ibykus (1797)

Die Teilung der Erde (1795)

Die vier Weltalter (1802)

Hero und Leander (1801)

Kassandra (1802)

Klage der Ceres (1796)

Nadowessische Todtenklage (1797)

Nänie (1800)

Punschlied (1803)

Punschlied - Im Norden zu singen (1803)

Ritter Toggenburg (1797)

Sehnsucht (1802)

Sprüche des Confucius (1795)

Xenien (1796)

Writings on literary theory:

Über Bürgers Gedichte (1791)

Über epische und dramatische Dichtung (1797, with Goethe)

Historical writings:

Geschichte des Abfalls der vereinigten Niederlande von der Spanischen Regierung (1788)

Was heißt und zu welchem Ende studiert man Universalgeschichte? (1789)

Geschichte des dreißigjährigen Krieges (1790)

Philosophical writings:

Über Anmuth und Würde (1793)

Über die ästhetische Erziehung des Menschen (1795)

Über naive und sentimentalische Dichtung (1795)

Über den Dilettantismus (1799, with Goethe)

Über das Erhabene (1801)

Die Schaubühne als moralische Anstalt (1802)

Coding environment

  • Docker with the Kaggle Python image
  • pyenv with auto-activation
  • Jupyter notebook server on port 8080
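The environment above might be assembled roughly as follows; the image name, volume mount, and notebook flags are assumptions about the setup, not the project's actual configuration:

```shell
# Sketch only: pull the Kaggle Python image and serve Jupyter on port 8080.
docker pull kaggle/python
docker run -it -p 8080:8080 -v "$PWD":/workspace kaggle/python \
    jupyter notebook --ip=0.0.0.0 --port=8080 --no-browser --allow-root
```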

Go further with mining

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

para = "Hello World. It's good to see you. Thanks for buying this book."

Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.

from nltk.tokenize import sent_tokenize
sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.

How it works...
sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of one sentence and the beginning of the next.

There's more...
The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the PunktSentenceTokenizer once and call its tokenize() method instead.

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
Other languages
If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer. Here's an example for Spanish:

spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
See also
In the next recipe, we'll learn how to split sentences into individual words. After that, we'll
cover how to use regular expressions for tokenizing text.
Tokenizing sentences into words
In this recipe, we'll split a sentence into individual words. The simple task of creating a list of
words from a string is an essential part of all text processing.

How to do it...
Basic word tokenization is very simple: use the word_tokenize() function:

from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']
How it works...
word_tokenize() is a wrapper function that calls tokenize() on an instance of the
TreebankWordTokenizer. It's equivalent to the following:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']
It works by separating words using spaces and punctuation. And as you can see, it does not
discard the punctuation, allowing you to decide what to do with it.
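Since the tokenizer keeps punctuation as separate tokens, one common follow-up is to filter it out. Here is a minimal sketch operating on an already-tokenized list, so it runs without NLTK or its data files:

```python
import string

# Output of word_tokenize('Hello World.') from the recipe above.
tokens = ['Hello', 'World', '.']

# Drop tokens that consist solely of punctuation characters.
words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
# words is now ['Hello', 'World']
```

Filtering after tokenization, rather than stripping punctuation from the raw string, preserves intra-word characters such as apostrophes and hyphens.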
