Comments (9)

jkeen commented on July 24, 2024

For the staging site that kept running into issues yesterday:

DB_POOL = 30
RAILS_MAX_THREADS = 10
GRAPHITI_CONCURRENCY_MAX_THREADS = 3
WEB_CONCURRENCY = 2

It's running on a VPS, and when I inspected the process list no process seemed to be spiking the CPU, and nothing else jumped out at me as a problem. I wasn't even sure what to look for to spot hung threads from the command line. Any pointers there?

I can give #472 a try sometime this week. Appreciate the follow-up.

from graphiti.

MattFenelon commented on July 24, 2024

Sure, I can do that. There's no (real) limit on the number of threads the released version of graphiti can use. The problem is instead that those unbounded threads can eventually exhaust the database connection pool, because ActiveRecord has a 1 thread = 1 connection policy.

For this failure example, we have a puma thread pool of 1 and a database connection pool of 1:

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. The database connection pool is now empty
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. AR attempts to lease a database connection to the thread but can't because thread 1 has the only available connection.
  6. Thread 2 - raises ActiveRecord::ConnectionTimeoutError "all pooled connections were in use".
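
The six steps above can be reproduced without Rails. This is a minimal sketch, assuming a hypothetical TinyPool class that stands in for the database connection pool (it is not ActiveRecord's API, and the 0.2 second checkout timeout is arbitrary); only the error message text is taken from the example:

```ruby
# Hypothetical stand-in for a database connection pool; NOT ActiveRecord's API.
class TinyPool
  class CheckoutTimeout < StandardError; end

  def initialize(size, timeout: 0.2)
    @timeout = timeout
    @connections = Queue.new
    size.times { |i| @connections << "conn-#{i}" }
  end

  # Lease a connection for the duration of the block, then hand it back.
  def with_connection
    conn = checkout
    yield conn
  ensure
    @connections << conn if conn
  end

  private

  # Poll until a connection is free or the timeout expires.
  def checkout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    begin
      @connections.pop(true) # non-blocking; raises ThreadError when empty
    rescue ThreadError
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise CheckoutTimeout, "all pooled connections were in use"
      end
      sleep 0.01
      retry
    end
  end
end

# Thread 1 keeps its connection while it waits on Thread 2, which needs its own.
def load_parent_with_sideload(pool)
  pool.with_connection do |_parent_conn|
    child = Thread.new do
      Thread.current.report_on_exception = false # the error is re-raised by #value
      pool.with_connection { |_child_conn| :authors_loaded }
    end
    child.value # waits for Thread 2 and re-raises its exception, if any
  end
end

begin
  load_parent_with_sideload(TinyPool.new(1))
rescue TinyPool::CheckoutTimeout => e
  puts "pool of 1: #{e.message}"
end
puts "pool of 2: #{load_parent_with_sideload(TinyPool.new(2))}"
```

With a pool of 1 the sideload thread times out; with a pool of 2 the same request succeeds, which is why a larger pool hides the problem rather than removing it.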

In the real world this is less likely to happen because there have to be enough web threads running at the same time to exhaust the database pool.

For example, consider we have 5 puma threads and a database connection pool of 5:

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 4 connections left in the pool
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 3 connections left in the pool
  6. Thread 3 - a new request comes in. ActiveRecord leases the thread a connection from the connection pool. There are 2 connections left in the pool
  7. Thread 2 - finishes. The connection is made available to the pool again. There are 3 connections left in the pool
  8. Thread 1 - finishes waiting. The connection is made available to the pool again. There are 4 connections left in the pool

And so on. If someone has limited traffic, limited sideloads, or a large database connection pool, they may never see this issue.
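
The running count in the steps above is simple bookkeeping, and a few lines of Ruby can check it. This sketch assumes the same setup (a pool of 5, one connection per thread); the pool_trace helper and the event labels are illustrative, not part of graphiti or ActiveRecord:

```ruby
# [description, change in available connections], in the order of the steps above
EVENTS = [
  ["Thread 1 loads the parent resource", -1],
  ["Thread 2 loads :authors", -1],
  ["Thread 3 serves a new request", -1],
  ["Thread 2 finishes", +1],
  ["Thread 1 finishes", +1]
].freeze

# Hypothetical helper: replay the events and record how many connections remain.
def pool_trace(pool_size, events)
  available = pool_size
  events.map do |description, delta|
    available += delta
    raise "all pooled connections were in use" if available.negative?
    [description, available]
  end
end

pool_trace(5, EVENTS).each do |description, left|
  puts "#{description}: #{left} left"
end
```

Replaying the same events against a pool of 2 raises on the third lease, which is the failure from the first example arriving a little later.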

A naive fix is to just increase the database connection pool, but with enough traffic and sideloads the AR error will eventually rear its head again.

jkeen commented on July 24, 2024

@MattFenelon I was adjusting values for db pool, max threads, and web_concurrency today when I ran square into this exact problem! I never ran into it before because of how my database.yml is set up: RAILS_MAX_THREADS wasn't being used for the pool value, and instead I had a separate DB_POOL var defined (and set at 40). Others might have a similar setup and so never run into it.
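
For anyone comparing setups, the difference comes down to which environment variable sizes the pool. A minimal sketch, assuming a hypothetical helper db_pool_size that mirrors what an ERB tag in database.yml might compute; the fallback of 5 is an assumption, and the variable names are the ones mentioned in this thread:

```ruby
# Hypothetical helper: prefer a dedicated DB_POOL, fall back to
# RAILS_MAX_THREADS, then to 5 (an assumed default).
def db_pool_size(env = ENV)
  Integer(env.fetch("DB_POOL") { env.fetch("RAILS_MAX_THREADS", "5") })
end

puts db_pool_size({ "DB_POOL" => "40", "RAILS_MAX_THREADS" => "10" })
puts db_pool_size({ "RAILS_MAX_THREADS" => "10" })
```

With DB_POOL set well above the web thread count there is spare capacity for sideload threads; when the pool simply equals RAILS_MAX_THREADS there is none.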

github-actions commented on July 24, 2024

🎉 This issue has been resolved in version 1.6.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

jkeen commented on July 24, 2024

@MattFenelon moving our convo from #470 into here.

In short, I'm still having a ton of trouble with web server lockups after this change, and it's hard to figure out exactly what is going on... but it seems like without this change everything is a-ok. My DB pool values and the sideload thread max values seem reasonable, and yet sometimes things will just hang, causing the web server to stop responding to requests.

I think we need to roll this thing back and investigate further on an unreleased branch. The part that worries me most is that there's no error message or any sort of clue that this is the issue, meaning there could be people out there with apps breaking without a clear-cut cause. Also, for me, getting things running properly again has required a server restart, which is a real bummer.

I'm going to put out a new patch release without these changes, and then maybe we can work together to try and resolve what's going on by working off a branch?

MattFenelon commented on July 24, 2024

Sorry to hear that @jkeen. I see you've reverted; that makes sense until we can clear this up.

In terms of diagnosing:

  1. What are your puma threads and workers set to?
  2. Did you use the default concurrency_max_threads setting?
  3. What's pool set to in database.yml?
  4. Does your database have a connection limit? If so, what is it?
  5. How many dynos/containers/vms do you have running?

MattFenelon commented on July 24, 2024

The fact that it's locking up completely makes me think it's a deadlock related to the use of the Mutex. I've gone with a different approach in #472 but I haven't been able to test those changes yet. Do you want to give it a try? I can test them in about a week or so.

MattFenelon commented on July 24, 2024

I think I've found the issue. It only seems to happen when side-loading children of child resources. Does that tally up with the side loading you're doing, e.g. include=resource1.resource2(.resource3)+? I don't think this happens when the includes are only 1 level deep, e.g. include=resource1,resource2,(resource3)+.

For the sake of an example, let's imagine a thread pool of 1 and an unbounded queue:

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for Promise 1 to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 - waits for Promise 2 to finish
  6. Promise 2 can't run because the thread pool is exhausted. Thread 1 can't be released because Promise 1 is waiting on Promise 2. This is a deadlock.
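
The six steps above can be sketched with plain Ruby threads. This is a rough illustration, assuming a hypothetical TinyExecutor (a fixed-size pool with an unbounded queue, not the executor graphiti actually uses) and a wait_for helper whose timeout exists only so the sketch terminates instead of hanging:

```ruby
require "timeout"

# Hypothetical fixed-size thread pool with an unbounded job queue.
class TinyExecutor
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        while (job = @jobs.pop) # nil once the queue is closed and drained
          job.call
        end
      end
    end
  end

  # Schedule the block; the returned queue receives its result.
  def post(&block)
    result = Queue.new
    @jobs << -> { result << block.call }
    result
  end

  def shutdown
    @jobs.close
  end
end

# Wait for a result, giving up after `seconds` so this sketch cannot hang.
def wait_for(result, seconds)
  Timeout.timeout(seconds) { result.pop }
rescue Timeout::Error
  :timed_out
end

def nested_sideload(pool_size)
  executor = TinyExecutor.new(pool_size)
  promise1 = executor.post do
    promise2 = executor.post { :books }   # queued behind Promise 1
    [:authors, wait_for(promise2, 0.3)]   # Promise 1 blocks its thread here
  end
  wait_for(promise1, 2)
ensure
  executor&.shutdown
end

p nested_sideload(1) # => [:authors, :timed_out]
p nested_sideload(2) # => [:authors, :books]
```

With one thread, Promise 2 never gets to run while Promise 1 is waiting on it; without the timeout that wait would never end. A second thread lets the nested sideload complete.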

Though it's technically a deadlock, the graphiti code uses sleep to wait for the promises, so it manifests itself as a hang. In my PR I replaced the sleep with Concurrent::Promise.zip(*promises).value!, which connects the threads together, allowing Ruby to detect the deadlock and raise the fatal exception "No live threads left. Deadlock?".

I think it can be fixed by making the child resource sideload logic non-blocking. That would remove the need for the child resources to wait, which would then free up the threads to be used for other promises in the queue. But it probably needs some kind of queue to allow the parent resource to know when all child resources are loaded.

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for all promises in the queue to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 finishes. Thread 1 is added back to the thread pool.
  6. Thread 1 is leased to run Promise 2.
  7. Thread 1 - Promise 2 - the books resource is loaded. There are no other resources to load.
  8. Thread 1 - Promise 2 finishes.
  9. The parent resource stops waiting as all promises in the queue have finished
  10. The parent resource returns. All resources have been loaded at this point.
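
One possible shape for the steps above, sketched with plain Ruby threads. It assumes a hypothetical SideloadTracker that counts outstanding sideloads so that only the parent waits; it is not the code in the PR, and the names are made up for illustration:

```ruby
# Hypothetical tracker: counts outstanding sideloads so only the parent waits.
class SideloadTracker
  def initialize(jobs)
    @jobs = jobs
    @lock = Mutex.new
    @all_done = ConditionVariable.new
    @pending = 0
    @loaded = []
  end

  # Queue a sideload. The block may enqueue further sideloads but never waits.
  def enqueue(name, &block)
    @lock.synchronize { @pending += 1 }
    job = lambda do
      block.call
      @lock.synchronize do
        @loaded << name
        @pending -= 1
        @all_done.signal if @pending.zero?
      end
    end
    @jobs << job
  end

  # Called by the parent only: block until every queued sideload has finished.
  def wait
    @lock.synchronize do
      @all_done.wait(@lock) until @pending.zero?
      @loaded.dup
    end
  end
end

def load_without_blocking
  jobs = Queue.new
  worker = Thread.new do # thread pool of 1
    while (job = jobs.pop) # nil once the queue is closed and drained
      job.call
    end
  end

  tracker = SideloadTracker.new(jobs)
  tracker.enqueue(:authors) do
    # Promise 1 queues Promise 2 and returns straight away, freeing the thread.
    tracker.enqueue(:books) { :books_loaded }
  end

  loaded = tracker.wait
  jobs.close
  worker.join
  loaded
end

p load_without_blocking # => [:authors, :books]
```

Because the child registers its own sideload before it finishes, the pending count cannot reach zero early, and a single thread is enough to load both levels.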

A few ifs there but I'm going to try this approach and see how I get on.

I've added a successfully failing test case that shows the deadlock: e4de93e

jkeen commented on July 24, 2024

And just for the sake of clarity, how is that scenario you outlined working now in the released version of graphiti?

I guess what I'm trying to understand is: if the strategy you're describing works, does it yield performance gains, or is it a safer approach than what currently happens by default?
