The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podc

get_entries(has_enclosures=...) should be a plugin(?) (reader issue, closed)

lemon24 commented on August 15, 2024

get_entries(has_enclosures=...) should be a plugin(?)

Comments (4)

lemon24 commented on August 15, 2024

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

lemon24 commented on August 15, 2024

What would reader look like if you could only filter and sort by tags?

- Both feed (user) title and entry recent sort are derived values, so we wouldn't have an issue here.
- Ensuring consistency would be left to reader code (e.g. required "tag" attributes, updating computed values, modeling tristate attributes like important).
- The same goes for a lot of migrations (which may be better for databases that do not support transactional schema changes).
- What would indices look like? (Use cases: filter by [tag1, tag2, ...], sort by tag1 value, count by tag1.)
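
To make the index question a bit more concrete, here is a minimal sketch using plain sqlite3. The `entries` and `entry_tags` tables, their columns, and the `entries_with_tags` / `count_by_tag` helpers are all hypothetical and are not reader's actual schema; the only point is that a composite index on (key, entry_id) could plausibly serve both "filter by [tag1, tag2, ...]" and "count by tag1".

```python
import sqlite3

# Hypothetical schema for illustration only; NOT reader's real schema.
SCHEMA = """
CREATE TABLE entries (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE entry_tags (
    entry_id TEXT NOT NULL REFERENCES entries(id),
    key TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (entry_id, key)
);
-- Tag key first, so "filter by tag" and "count by tag" can both use it.
CREATE INDEX entry_tags_by_key ON entry_tags (key, entry_id);
"""


def entries_with_tags(db, tags):
    """Return the ids of entries that have ALL of the given tag keys."""
    tags = list(tags)
    placeholders = ", ".join("?" * len(tags))
    query = f"""
        SELECT entry_id FROM entry_tags
        WHERE key IN ({placeholders})
        GROUP BY entry_id
        HAVING count(DISTINCT key) = ?
        ORDER BY entry_id
    """
    return [row[0] for row in db.execute(query, [*tags, len(tags)])]


def count_by_tag(db, key):
    """Count the entries carrying a given tag key."""
    (count,) = db.execute(
        "SELECT count(*) FROM entry_tags WHERE key = ?", (key,)
    ).fetchone()
    return count


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO entries VALUES (?, ?)", [("a", "A"), ("b", "B"), ("c", "C")]
)
db.executemany(
    "INSERT INTO entry_tags (entry_id, key) VALUES (?, ?)",
    [("a", "has-enclosures"), ("a", "read"), ("b", "read"), ("c", "has-enclosures")],
)
```

Sorting by a tag's value would need a different index (for example one that includes `value`), which is part of why the question is open-ended.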

lemon24 commented on August 15, 2024

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

- has_enclosures may become a tag, but that would likely lock in a performance penalty (right now, we don't have indexes on it, but if we make it a tag it won't be possible to add one)
  - it might be interesting to see what query performance looks like with tags, though (update: slightly worse, see #327 (comment))
  - the more specific enclosures filtering (e.g. .has-audio-enclosures) can still be achieved via the plugin
- read and important are integral to the data model / filtering, so we still want them as regular columns
  - related: #253
- feed filtering attributes do not matter all that much, since feeds are both much fewer and smaller than entries (a full table scan of feeds is likely negligible)
  - on one hand, this is an argument for "do nothing", since the code is already there
  - on the other hand, it may be an argument for "use tags" (since we can afford the performance penalty)
    - we may do this once we can set tags in a transaction, and get tags in a single query
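
As a rough illustration of the "more specific enclosures filtering" idea, the decision a plugin would have to make could look like the sketch below. The `Enclosure` dataclass is a stand-in rather than reader's own class, and `has_audio_enclosures` / `desired_tags` are hypothetical helpers; how they would be wired into reader (for instance, from an update hook that ends up calling `reader.set_tag()`) is deliberately left out, since it depends on the plugin API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

TAG = ".has-audio-enclosures"  # tag name taken from the comment above


@dataclass(frozen=True)
class Enclosure:
    """Stand-in for an entry enclosure: a URL plus an optional MIME type."""

    href: str
    type: Optional[str] = None


def has_audio_enclosures(enclosures: Iterable[Enclosure]) -> bool:
    """True if at least one enclosure declares an audio/* MIME type."""
    return any((e.type or "").lower().startswith("audio/") for e in enclosures)


def desired_tags(enclosures, current_tags):
    """Return the tag set an entry should end up with (hypothetical helper)."""
    tags = set(current_tags)
    if has_audio_enclosures(enclosures):
        tags.add(TAG)
    else:
        # Drop a stale tag if the entry no longer has audio enclosures.
        tags.discard(TAG)
    return tags
```

Relying on the declared MIME type is an assumption; feeds that omit it would need a fallback such as checking the file extension.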

lemon24 commented on August 15, 2024

Ran some benchmarks, here's a summary:

With only the has-enclosures entry tag, there seems to be almost no difference between using has_enclosures or the tag.
Adding 1-2 more tags to each entry seems to make using tags only slightly worse.
Adding 20 more tags to each entry seems to make using tags ~1.5x worse.

Single entry tag results.

Given a has-enclosures entry tag set like this:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))

...and this benchmark script:

export BENCH_TIME_STAT='avg min'
lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'
while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"

The output is:

# for _ in reader.get_entries(has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.702
 min      1     10   0.374
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.571
 min      1     10   0.393
# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.010
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.020
 min      1     10   0.010
# for _ in reader.search_entries("python", has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.538
 min      1     10   0.384
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.514
 min      1     10   0.395
# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat number repeat snippet
 avg      1     10   0.250
 min      1     10   0.110
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.226
 min      1     10   0.112
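
For readers without the repository at hand, the avg/min columns above can be approximated with the standard library. This is only a sketch of the idea, not what scripts/bench.py actually does; `time_snippet` is a hypothetical helper, and the defaults (10 repeats of 1 run each) simply mirror the "number" and "repeat" columns in the output.

```python
import statistics
import timeit


def time_snippet(stmt, setup="pass", repeat=10, number=1):
    """Time stmt `repeat` times, `number` runs each; return avg and min seconds per run."""
    times = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    per_run = [t / number for t in times]
    return {"avg": statistics.mean(per_run), "min": min(per_run)}


# Placeholder statement; a real run would pass one of the snippets above,
# with a setup string that creates the `reader` object.
stats = time_snippet("sum(range(1000))")
print(f"avg {stats['avg']:.6f}")
print(f"min {stats['min']:.6f}")
```

The min is usually the more stable of the two, since the average is easily skewed by a cold first run.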

1-2 entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.592
 min      1     10   0.408
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.011
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.536
 min      1     10   0.408
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.245
 min      1     10   0.115

20+ entry tags results.

Extra generic tags (19 of them; the list skips "fifteen") were set on every entry like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   1.170
 min      1     10   0.613
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.042
 min      1     10   0.016
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.789
 min      1     10   0.548
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.342
 min      1     10   0.174
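
To put the "~1.5x" figure in context, the min timings reported above can be turned into slowdown ratios relative to the has_enclosures=True runs. The numbers below are copied from the outputs in this comment; `slowdowns` is just a hypothetical helper for the division.

```python
# Minimum timings (seconds) copied from the benchmark output above.
# Order: get_entries, get_entries limit=100, search_entries, search_entries limit=20.
column = [0.374, 0.010, 0.384, 0.110]  # has_enclosures=True
tag_runs = {
    "1 tag": [0.393, 0.010, 0.395, 0.112],
    "1-2 extra tags": [0.408, 0.011, 0.408, 0.115],
    "20+ tags": [0.613, 0.016, 0.548, 0.174],
}


def slowdowns(baseline, timings):
    """Ratio of each tag-based timing to the matching column-based timing."""
    return [round(t / b, 2) for t, b in zip(timings, baseline)]


ratios = {name: slowdowns(column, timings) for name, timings in tag_runs.items()}
for name, values in ratios.items():
    print(name, values)
```

On these min values the 20+ tags case lands at roughly 1.4x to 1.6x; the avg values are noisier and would give a somewhat different picture.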
