
Comments (4)

lemon24 commented on August 15, 2024

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

from reader.

lemon24 commented on August 15, 2024

Related: http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/, vaguely reminiscent of https://en.wikipedia.org/wiki/Star_schema; also see https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

What would reader look like if you could only filter and sort by tags?

  • Both feed (user) title and entry recent sort are derived values, so we wouldn't have an issue here.
  • Ensuring consistency would be left to reader code (e.g. required "tag" attributes, updating computed values, modeling tristate attributes like important).
  • The same goes for a lot of migrations (which may be better for databases that do not support schema transactions).
  • What would indices look like? (use cases: filter by [tag1, tag2, ...], sort by tag1 value, count by tag1)
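
One way to explore the index question is a small sqlite3 sketch. The schema below is hypothetical (the `entries` and `entry_tags` tables, the `entry_tags_by_key` index, and the helper names are assumptions for illustration, not reader's actual storage); it only shows that a single (key, value, id) index can plausibly serve all three use cases.

```python
import sqlite3

# Hypothetical schema for illustration only; not reader's actual tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entries (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE entry_tags (
        id TEXT NOT NULL REFERENCES entries(id),
        key TEXT NOT NULL,
        value TEXT,
        PRIMARY KEY (id, key)
    );
    -- Intended to serve "filter by tag", "sort by tag value", "count by tag".
    CREATE INDEX entry_tags_by_key ON entry_tags (key, value, id);
""")
db.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("a", "A"), ("b", "B"), ("c", "C")],
)
db.executemany(
    "INSERT INTO entry_tags VALUES (?, ?, ?)",
    [("a", "tag1", "2"), ("a", "tag2", None), ("b", "tag1", "1"), ("c", "tag2", None)],
)


def filter_by_tags(db, tags):
    """Return the ids of entries that have all of the given tags."""
    placeholders = ", ".join("?" * len(tags))
    query = f"""
        SELECT id FROM entry_tags
        WHERE key IN ({placeholders})
        GROUP BY id
        HAVING count(DISTINCT key) = ?
        ORDER BY id
    """
    return [row[0] for row in db.execute(query, [*tags, len(tags)])]


def sort_by_tag_value(db, tag):
    """Return the ids of entries that have the tag, ordered by its value."""
    query = "SELECT id FROM entry_tags WHERE key = ? ORDER BY value, id"
    return [row[0] for row in db.execute(query, [tag])]


def count_by_tag(db, tag):
    """Return how many entries have the tag."""
    query = "SELECT count(*) FROM entry_tags WHERE key = ?"
    return db.execute(query, [tag]).fetchone()[0]
```

Running `EXPLAIN QUERY PLAN` on each query would show whether SQLite actually picks the index; that depends on the data and the SQLite version, so the sketch makes no claim about it.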


lemon24 commented on August 15, 2024

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

  • has_enclosures may become a tag, but that would likely lock in a performance penalty (right now, we don't have indexes on it, but if we make it a tag it won't be possible to add one)
    • it might be interesting to see what query performance looks like with tags, though (update: slightly worse, see #327 (comment))
    • the more specific enclosures filtering (e.g. .has-audio-enclosures) can still be achieved via the plugin
  • read and important are integral to the data model / filtering, so we still want them as regular columns
  • feed filtering attributes do not matter all that much, since feeds are both much fewer and smaller than entries (a feeds full table scan is likely negligible)
    • on one hand, this is an argument for "do nothing", since the code is already there
    • on the other hand, it may be an argument for "use tags" (since we can afford the performance penalty)
      • we may do this once we can set tags in a transaction, and get tags in a single query


lemon24 commented on August 15, 2024

Ran some benchmarks, here's a summary:

  • With only the has-enclosures entry tag, there seems to be almost no difference between using has_enclosures or the tag.
  • Adding 1-2 more tags to each entry seems to make using tags only a bit worse.
  • Adding 20 more tags to each entry seems to make using tags ~1.5x worse.

Single entry tag results.

Given a has-enclosures entry tag set like this:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))

...and this benchmark script:

export BENCH_TIME_STAT='avg min'
lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'
while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"

The output is:

# for _ in reader.get_entries(has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.702
 min      1     10   0.374
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.571
 min      1     10   0.393
# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.010
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.020
 min      1     10   0.010
# for _ in reader.search_entries("python", has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.538
 min      1     10   0.384
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.514
 min      1     10   0.395
# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat number repeat snippet
 avg      1     10   0.250
 min      1     10   0.110
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.226
 min      1     10   0.112

1-2 entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.592
 min      1     10   0.408
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.011
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.536
 min      1     10   0.408
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.245
 min      1     10   0.115

20+ entry tags results.

Extra tags, unrelated to any entry attribute, were set on every entry like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   1.170
 min      1     10   0.613
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.042
 min      1     10   0.016
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.789
 min      1     10   0.548
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.342
 min      1     10   0.174
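
The "~1.5x worse" figure in the summary can be checked against the min timings reported above. This is a small arithmetic sketch; the numbers are copied from the single-tag and 20+ tag outputs, and the dictionary names are only labels chosen for illustration.

```python
# Minimum timings in seconds, copied from the benchmark outputs above
# (tags snippets only).
single_tag_min = {
    "get_entries": 0.393,
    "get_entries limit=100": 0.010,
    "search_entries": 0.395,
    "search_entries limit=20": 0.112,
}
many_tags_min = {
    "get_entries": 0.613,
    "get_entries limit=100": 0.016,
    "search_entries": 0.548,
    "search_entries limit=20": 0.174,
}

# Slowdown of the 20+ tags run relative to the single-tag run.
ratios = {
    name: many_tags_min[name] / single_tag_min[name]
    for name in single_tag_min
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# get_entries: 1.56x
# get_entries limit=100: 1.60x
# search_entries: 1.39x
# search_entries limit=20: 1.55x
```

The min values are used rather than avg because the averages include slower cold-cache runs.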

