harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

Based upon recent investigation, I have updated the problem description above.

Chronos team, could you please help us resolve this issue?

harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

I have uploaded the Chronos thread dump here.

Relevant threads look like this:
...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152)
	- waiting to lock <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock)
	at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:244)
	at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:210)
	at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
	at com.sun.proxy.$Proxy30.statusUpdate(Unknown Source)
...

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
	at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
	at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
	at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
	- locked <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock)
	at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

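The two traces above can be reduced to a small reproduction. The status-update thread is "waiting for monitor entry" on an object that is a ReentrantLock, which suggests the scheduler synchronizes on that object's intrinsic monitor rather than calling lock()/unlock() (a ReentrantLock.lock() wait would normally show as WAITING/parking instead). The sketch below assumes that, and uses a CompletableFuture that never completes as a stand-in for the ZooKeeper fetch behind FetchFuture.get. The class, method, and thread names are hypothetical, not Chronos code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.locks.ReentrantLock;

class MonitorContentionDemo {

    /** Returns the state observed for the "status update" thread while "mainLoop" holds the monitor. */
    static Thread.State observeBlockedState() throws InterruptedException {
        // Stand-in for the scheduler's lock; like the dump, we use its intrinsic monitor.
        final Object lock = new ReentrantLock();
        // Stand-in for a ZooKeeper fetch that does not answer.
        final CompletableFuture<String> zkFetch = new CompletableFuture<>();
        final CountDownLatch monitorHeld = new CountDownLatch(1);

        Thread mainLoop = new Thread(() -> {
            synchronized (lock) {
                monitorHeld.countDown();
                try {
                    zkFetch.get(); // unbounded wait while holding the monitor
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "mainLoop");

        Thread statusUpdate = new Thread(() -> {
            synchronized (lock) {
                // replaceJob(...) would run here once the monitor is free
            }
        }, "statusUpdate");

        mainLoop.start();
        monitorHeld.await();
        statusUpdate.start();

        // Poll for up to ~2 s until the status-update thread is parked on the monitor.
        Thread.State observed = statusUpdate.getState();
        for (int i = 0; i < 200 && observed != Thread.State.BLOCKED; i++) {
            Thread.sleep(10);
            observed = statusUpdate.getState();
        }

        // Simulate ZooKeeper finally answering so both threads can finish.
        zkFetch.complete("jobs");
        mainLoop.join();
        statusUpdate.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeBlockedState()); // prints BLOCKED
    }
}
```

As long as the fetch never completes, the status-update thread stays BLOCKED, which matches the "Thread-264485" frame in the dump.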
harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

@brndnmtthws, could you please look into this issue?

brndnmtthws avatar brndnmtthws commented on August 25, 2024

@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.

janisz avatar janisz commented on August 25, 2024

Can you send the Mesos state JSON?

harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

The state JSON for the Mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c

janisz avatar janisz commented on August 25, 2024

I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might also help to set offer_timeout on the Mesos master.
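For anyone trying the offer_timeout suggestion: the Mesos master exposes it as the --offer_timeout flag (a duration after which unused offers are rescinded), and, like other master flags, it can also be supplied through a MESOS_-prefixed environment variable. The 5mins value below is an illustrative assumption, not a recommendation for this cluster.

```shell
# Illustrative value only; choose a timeout that fits your cluster.
# Equivalent to passing --offer_timeout=5mins on the mesos-master command line.
export MESOS_OFFER_TIMEOUT=5mins
```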

harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

Thanks @janisz for your reply !

Yes, restarting Chronos and ZK brings the cluster back into working condition, so restarting Chronos/ZK is our workaround for the time being. But we are looking for a permanent solution and need your help :)

Also, I am not sure that Chronos was stuck with a single offer. The thread dump shows that a Chronos thread was trying to load jobs and was waiting on ZK:

...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
	at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...
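The stuck frame is an unbounded FetchFuture.get() called while mainLoop holds the monitor, so any ZooKeeper stall also stalls every status update. One possible mitigation, sketched here under assumptions, is to bound that wait with the standard Future.get(timeout, unit) and release the monitor when it expires. Whether Mesos's FetchFuture honors a timed get, and what timeout is reasonable, are assumptions to verify; fetchUnderLock is a hypothetical helper, not a Chronos or Mesos API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedFetch {

    /**
     * Hypothetical helper: waits at most timeoutMs for a state fetch while holding
     * the monitor, then gives up so other threads (e.g. status updates) can proceed.
     * Returns null if the fetch did not complete in time.
     */
    static <T> T fetchUnderLock(Object lock, Future<T> fetch, long timeoutMs)
            throws InterruptedException, ExecutionException {
        synchronized (lock) {
            try {
                return fetch.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Skip this iteration; the caller can retry on the next loop.
                return null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Object lock = new Object();
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes
        CompletableFuture<String> ready = CompletableFuture.completedFuture("jobs");

        System.out.println(fetchUnderLock(lock, stuck, 100)); // prints null after ~100 ms
        System.out.println(fetchUnderLock(lock, ready, 100)); // prints jobs
        System.out.println(Thread.holdsLock(lock));           // prints false: monitor released
    }
}
```

A timeout still holds the monitor for up to timeoutMs per attempt. Performing the fetch before entering the synchronized block, and taking the lock only to apply the result, would keep status updates from waiting on ZooKeeper at all.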

harjinder-flipkart avatar harjinder-flipkart commented on August 25, 2024

@janisz, any pointers on this?
