Comments (3)
Hi, we use the snowballstemmer to stem, which supports a bunch of different languages. In the dictionary you find, for example, aim, not aimer, because aim is the stemmed version of aimer. Our stemmer has the same behavior as the snowballstemmer in Python:
import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
stemmer.stemWords(['aimer'])
# ['aim']
In DuckDB:
select stem('aimer', 'french') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ aim     │
└─────────┘
Stemming works by reducing words to their base form, so that slightly different forms of a word yield the same stem, which makes them easier to search:
D select stem('bicycle', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘
D select stem('bicycles', 'english') stemmed;
┌─────────┐
│ stemmed │
│ varchar │
├─────────┤
│ bicycl  │
└─────────┘
So I think there is some confusion here, because DuckDB's stemmer does exactly this.
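To make the search benefit concrete, here is a minimal sketch in plain Python. The toy_stem function is a hypothetical stand-in that only strips a few English suffixes; it is not the Snowball algorithm and not what DuckDB runs internally. The index and search helpers are likewise made up for illustration. The point is only that indexing and querying by stem lets a query for "bicycles" find a document that contains "bicycle".

```python
from collections import defaultdict

# Hypothetical toy stemmer: strips a few English suffixes.
# This is NOT the Snowball algorithm, only an illustration.
SUFFIXES = ("es", "s", "e")


def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stem to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Stem the query term the same way as the indexed text."""
    return index.get(toy_stem(query), set())


docs = {1: "I ride my bicycle", 2: "two bicycles for sale", 3: "a red car"}
index = build_index(docs)

print(toy_stem("bicycle"), toy_stem("bicycles"))  # both reduce to the same stem
print(sorted(search(index, "bicycles")))          # ids of documents 1 and 2
```

Because the query is stemmed with the same function as the indexed text, the caller never has to stem anything by hand, which is the same idea as letting the match_bm25 macro handle it.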
- You don't need to stem your own query, just use the fts_main_knowledge.match_bm25 macro as explained in the docs.
- You can use our tokenize function to stem an entire sentence:
CREATE TABLE knowledge AS SELECT * FROM 'knowledge.json';
PRAGMA create_fts_index('knowledge', 'id', 'content', stemmer = 'french');
SELECT fts_main_knowledge.tokenize('je m''appelle laurens') tokens;
┌───────────────────────────┐
│          tokens           │
│         varchar[]         │
├───────────────────────────┤
│ [je, m, appelle, laurens] │
└───────────────────────────┘
from duckdb.
Thanks, I was indeed misunderstanding stemming!
EDIT: @lnkuiper, maybe one last question.
SELECT fts_main_knowledge.tokenize('J''aime beaucoup les chiens')
returns the tokens [j, aime, beaucoup, les, chiens].
But how do I get the stems?
SELECT fts_main_knowledge.stem('J aime beaucoup les chiens', 'french')
does not work.
Thanks again.
@charnould, when you use the tokenize function, you get a list of stemmed words. The stem function only works on individual words, not sentences.
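Since stem takes one word at a time, a sentence has to be split into tokens first and each token stemmed on its own. Below is a rough Python sketch of that pattern. Everything in it is an assumption for illustration: simple_tokenize only approximates a tokenizer (lowercase, keep runs of letters), stem_sentence is a hypothetical helper, and toy_french_stem is a made-up placeholder, not the Snowball French stemmer. The stemmer is passed in as a callable, so with the snowballstemmer package shown earlier in the thread you could pass snowballstemmer.stemmer('french').stemWord instead.

```python
import re
from typing import Callable


def simple_tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep only runs of letters (assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def toy_french_stem(word: str) -> str:
    """Placeholder stemmer: strips a few endings. NOT the Snowball algorithm."""
    for suffix in ("er", "es", "e", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_sentence(sentence: str, stem_word: Callable[[str], str]) -> list[str]:
    """Apply a single-word stemmer to every token of a sentence."""
    return [stem_word(token) for token in simple_tokenize(sentence)]


tokens = simple_tokenize("J'aime beaucoup les chiens")
stems = stem_sentence("J'aime beaucoup les chiens", toy_french_stem)
print(tokens)
print(stems)
```

Passing the whole sentence to a single-word stemmer treats it as one long word, which is why the stem call on the full sentence does not give per-word stems.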