leonieweissweiler / cistem
Stemmer for German
Home Page: http://www.cis.lmu.de/~weissweiler/
License: MIT License
Looking for complete stemmer implementations for German, I stumbled upon this. Why not add a WebAssembly export? I know it's very experimental, but it would already open up a lot of platforms. ;) (pretty please)
Thank you for the useful stemmer. I have stumbled upon an issue:
There should be some exceptions for removing the prefix "ge".
E.g.
"Geschlecht" is stemmed to "schlecht"
"Gesellschaft" is stemmed to "sellschaft"
"gesamt" is stemmed to "samt"
"genau" is stemmed to "nau"
etc.
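One possible way to handle such exceptions is a small lookup that is checked before the prefix is removed. The following is only a sketch: `GE_EXCEPTIONS` and `strip_ge_prefix` are hypothetical names that are not part of CISTEM, the list contains just the four words from this report, and the real stemmer applies further conditions that are left out here.

```python
# Hypothetical exception list: only the words named in this report.
GE_EXCEPTIONS = frozenset({"geschlecht", "gesellschaft", "gesamt", "genau"})


def strip_ge_prefix(word: str, exceptions: frozenset = GE_EXCEPTIONS) -> str:
    """Remove a leading "ge" unless the lower-cased word is a known exception."""
    lowered = word.lower()
    if lowered.startswith("ge") and lowered not in exceptions:
        return word[2:]
    return word


for example in ("Geschlecht", "genau", "gemacht"):
    print(example, "->", strip_ge_prefix(example))
```

A fixed list like this will never be complete, so it would probably have to be curated from a dictionary or a corpus.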
If the input word is not normalised to NFC, the following parts will not work:
the replacement of äöü with aou
the length() calculation
substr($original, - $rest_length)
Working on "whatever" normalised would be complicated.
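To illustrate the effect, here is a minimal Python sketch (not the Perl code of the stemmer) that compares a decomposed and a composed spelling of the same word. `UMLAUT_TABLE` and `normalise_and_fold` are names made up for this example. Normalising once at the entry point is assumed to be the simplest fix.

```python
import unicodedata

# Maps precomposed umlauts to their base vowels.
UMLAUT_TABLE = str.maketrans("äöü", "aou")


def normalise_and_fold(word: str) -> str:
    """Normalise to NFC first, then replace the umlauts."""
    nfc = unicodedata.normalize("NFC", word)
    return nfc.translate(UMLAUT_TABLE)


decomposed = "wa\u0308sser"  # "a" followed by a combining diaeresis (NFD form)
composed = unicodedata.normalize("NFC", decomposed)

# The decomposed form is one code point longer, so length-based offsets shift.
print(len(decomposed), len(composed))  # 7 6

# Without normalisation the table finds no precomposed "ä" to replace.
print(decomposed.translate(UMLAUT_TABLE) == "wasser")  # False

print(normalise_and_fold(decomposed))  # wasser
```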
I guess having constants with the same value is not intended. In fact, DOLLAR1_PATTERN does not match any "$1" occurrences.
Line 5 in ddfa683
Even in case-sensitive mode, the output is always lower case.
Is that on purpose?
stem() and segment() do not apply the same transformations.
Example:
#!perl
use strict;
use warnings;
use utf8;

binmode(STDOUT, ":encoding(UTF-8)");
binmode(STDERR, ":encoding(UTF-8)");

use lib qw(../CISTEM);
use Cistem;

my @words = qw/geheilwässert/;

for my $word (@words) {
    # Stem with case sensitivity off (0) and on (1).
    for my $case_sensitive (0..1) {
        print 'Cistem::stem(', $word, ',', $case_sensitive, '): ',
            Cistem::stem($word, $case_sensitive), "\n";
    }
    # Segment the same word with both settings.
    for my $case_sensitive (0..1) {
        print 'Cistem::segment(', $word, ',', $case_sensitive, '): ',
            join('-', Cistem::segment($word, $case_sensitive)), "\n";
    }
}
Which results in:
~/github/perl/CISTEM-test$ perl cistem.t
Cistem::stem(geheilwässert,0): heilwass
Cistem::stem(geheilwässert,1): heilwass
Cistem::segment(geheilwässert,0): geheilwäss-ert
Cistem::segment(geheilwässert,1): geheilwäss-ert
I would expect the same segmentation:
ge-heilwass-ert
This would also allow sharing most of the code.
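As a rough illustration of the code sharing, both functions could be thin wrappers around one splitting routine. The Python sketch below uses toy rules chosen only to reproduce the example above. `split_word`, `TOY_SUFFIXES` and the length limits are assumptions and do not reflect the actual CISTEM rules.

```python
UMLAUTS = str.maketrans("äöü", "aou")
TOY_SUFFIXES = ("er", "t")  # hypothetical; the real stemmer strips more endings


def split_word(word: str) -> tuple:
    """Return (prefix, core, suffix) after lower-casing and umlaut folding."""
    normalised = word.lower().translate(UMLAUTS)
    prefix, core = "", normalised
    if core.startswith("ge") and len(core) > 6:
        prefix, core = "ge", core[2:]
    suffix = ""
    stripped = True
    # Strip one suffix per pass until nothing matches or the core gets short.
    while stripped and len(core) > 3:
        stripped = False
        for candidate in TOY_SUFFIXES:
            if core.endswith(candidate) and len(core) - len(candidate) >= 3:
                core = core[: -len(candidate)]
                suffix = candidate + suffix
                stripped = True
                break
    return prefix, core, suffix


def stem(word: str) -> str:
    """The stem is the middle part of the shared split."""
    return split_word(word)[1]


def segment(word: str) -> list:
    """The segmentation is the same split with empty parts dropped."""
    return [part for part in split_word(word) if part]


print(stem("geheilwässert"))               # heilwass
print("-".join(segment("geheilwässert")))  # ge-heilwass-ert
```

With this layout the stem is always one of the returned segments, so the two outputs cannot drift apart.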
Hi!
I noticed that the first line of "cistem.cpp" says
#include "Cistem.hpp"
I assume it should be
#include "cistem.hpp"
Best,
Florian
I made a Rust translation of CISTEM for a project of mine: cistemrs. Probably you don't want to merge it into the repo, but maybe it's useful for other people using the language.
This issue is more of a comment and can be closed immediately. Thanks for the stemmer! It works very well for information retrieval. :)