What happens? First, thanks for making DuckDB! I might be miss

French stemmer doesn't work using FTS? (duckdb issue, closed)

charnould commented on June 26, 2024

French stemmer doesn't work using FTS?

Comments (3)

lnkuiper commented on June 26, 2024

Hi, we use the Snowball stemmer, which supports a number of languages. In the dictionary you will find, for example, aim rather than aimer, because aim is the stemmed form of aimer. Our stemmer behaves the same way as the snowballstemmer package in Python:

import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
stemmer.stemWords(['aimer'])
# ['aim']

In DuckDB:

select stem('aimer', 'french') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ aim     │
└─────────┘

Stemming reduces words to a common base form, so that small variations of a word (such as its plural) map to the same stem, which makes them easier to search:

D select stem('bicycle', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘
D select stem('bicycles', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘

So I think there is some confusion here, because DuckDB's stemmer does exactly this.
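
To make the idea concrete, here is a minimal Python sketch of why stemming both the indexed text and the query helps search. The `toy_stem`, `build_index`, and `search` names are hypothetical, and `toy_stem` is a naive suffix stripper written for illustration only; it is not the Snowball algorithm that DuckDB uses.

```python
# Toy illustration only: a naive suffix stripper, NOT the Snowball algorithm.
def toy_stem(word):
    """Lowercase a word and strip one of a few common English suffixes."""
    word = word.lower()
    for suffix in ("es", "s", "e"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(toy_stem(term), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query term the same way as the indexed terms, then look it up."""
    return index.get(toy_stem(query), set())


docs = {1: "red bicycle", 2: "two bicycles", 3: "a car"}
index = build_index(docs)

# Both spellings reduce to the same stem, so both find documents 1 and 2.
print(sorted(search(index, "bicycle")))
print(sorted(search(index, "bicycles")))
```

Because the index stores stems rather than the original words, looking up aimer directly in such a dictionary finds nothing, while looking up its stem does.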

You don't need to stem your query yourself; just use the fts_main_knowledge.match_bm25 macro, as explained in the docs.
You can use our tokenize function to stem an entire sentence:

CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
SELECT fts_main_knowledge.tokenize('je m''appelle laurens') tokens;
┌───────────────────────────┐
│          tokens           │
│         varchar[]         │
├───────────────────────────┤
│ [je, m, appelle, laurens] │
└───────────────────────────┘

charnould commented on June 26, 2024

Thanks, I was indeed misunderstanding stemming!

EDIT: @lnkuiper, maybe one last question.

SELECT fts_main_knowledge.tokenize('J''aime beaucoup les chiens') returns the tokens: [j, aime, beaucoup, les, chiens].

But how to get stems?
SELECT fts_main_knowledge.stem('J aime beaucoup les chiens', 'french') does not work.
Thanks again.

lnkuiper commented on June 26, 2024

@charnould, when you use the tokenize function, you get a list of stemmed words. The stem function only works on individual words, not sentences.
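
Since a stemmer works on one word at a time, one way to get stems for a whole sentence is to split it into tokens first and then stem each token. The Python sketch below shows that pattern under some assumptions: `tokenize` and `stem_sentence` are hypothetical helpers, the tokenizer is only a rough approximation and not DuckDB's implementation, and `strip_plural` is a stand-in for a real per-word stemmer.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters (rough approximation)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def stem_sentence(sentence, stem_word):
    """Apply a single-word stemmer to every token of a sentence."""
    return [stem_word(token) for token in tokenize(sentence)]


def strip_plural(word):
    """Stand-in stemmer for the demo: drops a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


print(tokenize("J'aime beaucoup les chiens"))
print(stem_sentence("J'aime beaucoup les chiens", strip_plural))

# With the snowballstemmer package installed, a real stemmer could be passed
# instead, for example: snowballstemmer.stemmer('french').stemWord
```

Passing the stemmer in as a function keeps the splitting logic separate from the choice of stemming algorithm and language.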
