
Comments (8)

josecelano commented on September 10, 2024

Other potential solutions from ChatGPT that could be combined:

To improve the efficiency and speed of importing statistics from your BitTorrent Tracker to your BitTorrent Index, especially given the large scale of torrents and the potential mismatch between the two, you can consider several strategies. Each of these methods leverages different aspects of system design and algorithmic efficiency:

1. Batch Processing with Enhanced API

  • Idea: Enhance the Tracker's API to support batch processing. This means enabling the API to accept requests for statistics of multiple torrents at once, rather than one at a time.
  • Implementation: Update the Tracker's API to allow for batch requests. You can set a reasonable limit on the number of torrents per request to balance load.
  • Advantages: Reduces the number of API calls drastically, decreasing network overhead and API load.
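A client-side sketch of the batching idea, assuming a hypothetical `GET /v1/torrents/stats?info_hashes=…` batch endpoint on the tracker (the endpoint path and the 50-torrent limit are illustrative, not part of the current API):

```rust
/// Illustrative per-request limit to balance load on the tracker.
const MAX_BATCH_SIZE: usize = 50;

/// Split the full list of info-hashes into batches the tracker can accept.
fn build_batches(info_hashes: &[String], batch_size: usize) -> Vec<Vec<String>> {
    info_hashes
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let hashes: Vec<String> = (0..120).map(|i| format!("hash{i:04}")).collect();
    let batches = build_batches(&hashes, MAX_BATCH_SIZE);
    // 120 torrents -> 3 requests instead of 120 single-torrent requests.
    assert_eq!(batches.len(), 3);
    for batch in &batches {
        // Hypothetical batch endpoint; the real tracker API is one torrent per call.
        println!("GET /v1/torrents/stats?info_hashes={}", batch.join(","));
    }
}
```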

2. Differential Update Strategy

  • Idea: Instead of importing all torrent statistics every hour, determine which torrents have likely changed and update only those.
  • Implementation: Implement a mechanism (like a timestamp or a change log) on the Tracker to identify torrents that have been updated since the last import. The Index then requests statistics only for these torrents.
  • Advantages: Significantly reduces the amount of data transferred and processed.
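A minimal sketch of the selection step, assuming the tracker keeps a hypothetical `stats_updated_at` change marker (unix seconds) per torrent:

```rust
/// A row from a hypothetical change log kept by the tracker: every time a
/// torrent's stats change, `stats_updated_at` (unix seconds) is bumped.
struct TrackedTorrent {
    info_hash: String,
    stats_updated_at: u64,
}

/// Select only the torrents that changed since the last import run, so the
/// Index requests statistics for those instead of the full catalogue.
fn changed_since(torrents: &[TrackedTorrent], last_import: u64) -> Vec<&str> {
    torrents
        .iter()
        .filter(|t| t.stats_updated_at > last_import)
        .map(|t| t.info_hash.as_str())
        .collect()
}

fn main() {
    let torrents = vec![
        TrackedTorrent { info_hash: "aaa".into(), stats_updated_at: 1_000 },
        TrackedTorrent { info_hash: "bbb".into(), stats_updated_at: 2_000 },
    ];
    // Only "bbb" changed after the last import at t = 1_500.
    assert_eq!(changed_since(&torrents, 1_500), vec!["bbb"]);
}
```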

3. Using Bloom Filters

  • Idea: Use Bloom filters to quickly check if a torrent in the Index is also in the Tracker, and update statistics accordingly.
  • Implementation:
    • The Tracker maintains a Bloom filter of all its torrents.
    • The Index queries this Bloom filter to check if its torrents are in the Tracker before making an API call.
    • Implement a scheduled task to update the Bloom filter periodically.
  • Advantages: Bloom filters are space-efficient and fast for membership checking, which reduces unnecessary API calls for torrents not in the Tracker.
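A minimal std-only Bloom filter illustrating the membership check (a real deployment would likely use an existing crate and tune the bit-array size and hash count to the expected number of torrents):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter: `k` seeded hash functions over a fixed bit array.
struct BloomFilter {
    bits: Vec<bool>,
    k: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, k: u64) -> Self {
        Self { bits: vec![false; num_bits], k }
    }

    /// Compute the `k` bit positions for an item.
    fn indexes(&self, item: &str) -> Vec<usize> {
        (0..self.k)
            .map(|seed| {
                let mut hasher = DefaultHasher::new();
                seed.hash(&mut hasher);
                item.hash(&mut hasher);
                (hasher.finish() as usize) % self.bits.len()
            })
            .collect()
    }

    fn insert(&mut self, item: &str) {
        for i in self.indexes(item) {
            self.bits[i] = true;
        }
    }

    /// May return false positives, but never false negatives: `false` means
    /// the tracker definitely does not have the torrent, so the Index can
    /// skip the API call.
    fn might_contain(&self, item: &str) -> bool {
        self.indexes(item).iter().all(|&i| self.bits[i])
    }
}

fn main() {
    let mut filter = BloomFilter::new(10_000, 3);
    filter.insert("e2467cbf021192c241367b892230dc1e05c0580e");
    assert!(filter.might_contain("e2467cbf021192c241367b892230dc1e05c0580e"));
    // An empty filter never reports a member.
    assert!(!BloomFilter::new(10_000, 3).might_contain("anything"));
}
```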

4. Webhooks or Push Mechanism

  • Idea: Instead of pulling data from the Tracker, have the Tracker push updates to the Index.
  • Implementation: Implement a webhook system in the Tracker that sends updates to the Index whenever torrent statistics change.
  • Advantages: Real-time updates and reduced load on both systems, as data is transferred only when there is a change.
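A sketch of the push model, with an in-process channel standing in for the HTTP webhook so the example stays dependency-free; `StatsUpdate` is an illustrative payload, not an existing type:

```rust
use std::sync::mpsc;
use std::thread;

/// The payload the tracker would POST to an Index webhook whenever a
/// torrent's statistics change.
#[derive(Debug, PartialEq)]
struct StatsUpdate {
    info_hash: String,
    seeders: u32,
    leechers: u32,
}

fn main() {
    let (tx, rx) = mpsc::channel::<StatsUpdate>();

    // "Tracker" side: push only when something actually changed.
    let tracker = thread::spawn(move || {
        tx.send(StatsUpdate { info_hash: "aaa".into(), seeders: 5, leechers: 2 })
            .expect("index receiver dropped");
        // `tx` is dropped here, which closes the channel.
    });

    // "Index" side: apply each update as it arrives; no hourly full import.
    for update in rx {
        println!(
            "updating {}: {} seeders, {} leechers",
            update.info_hash, update.seeders, update.leechers
        );
    }

    tracker.join().expect("tracker thread panicked");
}
```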

5. Database Replication or Shared Access

  • Idea: If both services can securely access a shared database, replication or direct querying could be used.
  • Implementation: Implement a shared database or a replication mechanism where the Tracker updates a shared dataset that the Index can read directly.
  • Advantages: Eliminates the need for API calls and allows the Index to directly query the data it needs.

6. Caching and Incremental Updates

  • Idea: Cache statistics in the Index and only request updates for those statistics.
  • Implementation: Implement a caching layer in the Index that stores the most recent statistics. Periodically, request updates for these statistics from the Tracker.
  • Advantages: Reduces API calls to only those torrents whose cached data is out of date.
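A sketch of the staleness check, using an illustrative `CachedStats` entry stamped with the unix time (seconds) at which it was fetched:

```rust
use std::collections::HashMap;

/// Cached tracker statistics plus the unix time they were fetched.
struct CachedStats {
    seeders: u32,
    leechers: u32,
    fetched_at: u64,
}

/// Return the info-hashes whose cached stats are older than `max_age_secs`;
/// only these need a request to the tracker on the next run.
fn stale_entries(
    cache: &HashMap<String, CachedStats>,
    now: u64,
    max_age_secs: u64,
) -> Vec<String> {
    cache
        .iter()
        .filter(|(_, stats)| now.saturating_sub(stats.fetched_at) > max_age_secs)
        .map(|(info_hash, _)| info_hash.clone())
        .collect()
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("fresh".to_string(), CachedStats { seeders: 3, leechers: 1, fetched_at: 10_000 });
    cache.insert("stale".to_string(), CachedStats { seeders: 9, leechers: 4, fetched_at: 1_000 });
    // With a one-hour (3600 s) max age at t = 10_500, only "stale" needs a refresh.
    assert_eq!(stale_entries(&cache, 10_500, 3_600), vec!["stale".to_string()]);
}
```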

Conclusion

Each of these strategies has its advantages and can be combined for optimal performance. Given that both services are implemented in Rust, you can leverage Rust's performance and concurrency features to implement these solutions efficiently. The choice of strategy will depend on your specific requirements, such as the frequency of updates, the volume of data, and the infrastructure setup of your Tracker and Index.

from torrust-index.

josecelano commented on September 10, 2024

We now have 9945 torrents on the live demo, and we are seeing this nice pattern:

[image]

I suppose it's the importer running every hour.


josecelano commented on September 10, 2024

Hi, today @da2ce7 and I were discussing a solution. @da2ce7 proposed:

  • Make sure we don't overlap executions of the importer. I think the current solution does not overlap executions; it simply waits one hour after finishing the import. If the process takes less than 1 hour, statistics are imported every hour. If it takes longer, for example 3 hours, then statistics are imported every 4 hours (3 hours for the process plus 1 hour waiting for the next tick). I guess the intention was to update statistics at least once an hour, right @WarmBeer? In that case, maybe we could just wait the remaining time between the duration of the import and 1 hour. @WarmBeer @da2ce7 ?
  • We can also improve the logs. See #468 (comment)
  • We should also take advantage of threads. @da2ce7 proposed importing a batch of torrents at the same time using tokio-spawned tasks. Currently we make a single request per torrent. In the future, we could add a new tracker endpoint that returns statistics for more than one torrent at a time.
  • We should use pagination for the query that gets all torrents from the Index. We can take, for example, 50 torrents, import them in parallel, and then continue with the next page (ordered by the implicit DB table order).
  • If there is a problem with the tracker connection while importing a torrent, the current behavior is to just reset the torrent's statistics in the Index and try again in the next import.

This is the current importer loop:
        let interval = std::time::Duration::from_secs(torrent_info_update_interval);
        let mut interval = tokio::time::interval(interval);

        interval.tick().await; // first tick is immediate...

        loop {
            interval.tick().await;

            info!("Running tracker statistics importer ...");

            if let Err(e) = send_heartbeat(importer_port).await {
                error!("Failed to send heartbeat from importer cronjob: {}", e);
            }

            if let Some(tracker) = weak_tracker_statistics_importer.upgrade() {
                // Ignore per-run errors; the next tick retries the import.
                drop(tracker.import_all_torrents_statistics().await);
            } else {
                // The importer was dropped: stop the cronjob loop.
                break;
            }
        }
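
The pagination-plus-parallel-batches idea above could be sketched as follows; `fetch_page` and `import_torrent_statistics` are illustrative stand-ins for the paginated DB query and the per-torrent tracker request, and plain scoped threads stand in for tokio-spawned tasks so the sketch compiles with std alone:

```rust
use std::thread;

const PAGE_SIZE: usize = 50;

/// Stand-in for the paginated DB query (`LIMIT … OFFSET …`, implicit table order).
fn fetch_page(all: &[String], page: usize) -> &[String] {
    let start = (page * PAGE_SIZE).min(all.len());
    let end = (start + PAGE_SIZE).min(all.len());
    &all[start..end]
}

/// Stand-in for the single-torrent statistics request against the tracker API.
fn import_torrent_statistics(info_hash: &str) {
    println!("importing stats for {info_hash}");
}

fn main() {
    let all: Vec<String> = (0..120).map(|i| format!("hash{i:03}")).collect();

    let mut page = 0;
    loop {
        let batch = fetch_page(&all, page);
        if batch.is_empty() {
            break;
        }
        // Import the whole page concurrently (tokio::spawn in the real code;
        // scoped threads here to keep the sketch dependency-free).
        thread::scope(|scope| {
            for info_hash in batch {
                scope.spawn(move || import_torrent_statistics(info_hash));
            }
        });
        page += 1;
    }
}
```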

Relates to: #468 (comment)


josecelano commented on September 10, 2024

It seems that, as the number of torrents increases, the server has more trouble importing all the statistics within 1 hour.

[image]

We have now 16823 torrents.


josecelano commented on September 10, 2024

We have 27570 torrents in the live demo.

[image]

[image]

It seems the server is now always busy.

[image]


josecelano commented on September 10, 2024

By the way, @WarmBeer @da2ce7 @mario-nt, what do you think about my proposed solution 2: importing statistics on the fly?

We are importing statistics for all torrents every hour, even though we may not need them: the Index might have no users, or the users might only be interested in 20% of the torrents. As far as I know, we only show statistics on the list and details pages.

We could:

  • Remove statistics from the Index database.
  • Import statistics only when we need them (list and detail pages).
  • We can use an in-memory cache valid for one hour. Before fetching statistics from the tracker, we check whether the cache has fresh data (less than 1 hour old).

Pros:

  • If the Index does not have users we don't overload the Tracker.
  • We only get the data we use. If only 20% of the torrents are listed or viewed in detail we only import statistics for those torrents.

Cons:

  • Assuming high load on the Index and users equally interested in 100% of the torrents, responses take longer to build because they are no longer a direct SQL query. But this should not be too slow.
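
The on-the-fly approach could be sketched with an illustrative read-through cache with a one-hour TTL; `Stats`, `StatsCache`, and the `fetch` callback are hypothetical names, with the callback standing in for the tracker API call made when rendering the list or details page:

```rust
use std::collections::HashMap;

const TTL_SECS: u64 = 3_600; // cached stats are considered fresh for one hour

#[derive(Clone, Copy)]
struct Stats {
    seeders: u32,
    leechers: u32,
}

/// In-memory read-through cache keyed by info-hash; each value carries the
/// unix time (seconds) at which it was fetched from the tracker.
struct StatsCache {
    entries: HashMap<String, (Stats, u64)>,
}

impl StatsCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return cached stats if they are fresh, otherwise call `fetch`
    /// (the tracker API in the real Index) and cache the result.
    fn get_or_fetch(
        &mut self,
        info_hash: &str,
        now: u64,
        fetch: impl Fn(&str) -> Stats,
    ) -> Stats {
        if let Some((stats, fetched_at)) = self.entries.get(info_hash) {
            if now.saturating_sub(*fetched_at) < TTL_SECS {
                return *stats;
            }
        }
        let stats = fetch(info_hash);
        self.entries.insert(info_hash.to_string(), (stats, now));
        stats
    }
}

fn main() {
    use std::cell::Cell;
    let calls = Cell::new(0u32);
    let fetch = |_: &str| {
        calls.set(calls.get() + 1);
        Stats { seeders: 7, leechers: 3 }
    };

    let mut cache = StatsCache::new();
    cache.get_or_fetch("aaa", 0, &fetch); // miss: hits the tracker
    cache.get_or_fetch("aaa", 1_800, &fetch); // fresh: served from cache
    cache.get_or_fetch("aaa", 4_000, &fetch); // expired: hits the tracker again
    assert_eq!(calls.get(), 2);
}
```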


mario-nt commented on September 10, 2024

I like the idea of sharing a database, but that option completely couples the tracker and the index.

My favourite is number 4, Webhooks or Push Mechanism: that way we only push data when it is updated. It may still be slow if there are a lot of changes, though.

And Bloom filters look good too.

