
Comments (59)

godfat avatar godfat commented on June 7, 2024 1

Thank you for looking into this! (I can't reproduce the issue by myself)

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024 1

@godfat you're welcome! If it's my code, I have to fix it :-)

from fugit.

godfat avatar godfat commented on June 7, 2024 1

@conet I am in the middle of something else, but I can help with this quickly:

bundle exec ruby -r et-orbi -e "EtOrbi._make_info"

Run this to start a console: sudo gitlab-rails console and then run:

EtOrbi._make_info

which should print the results.

from fugit.

conet avatar conet commented on June 7, 2024 1

@conet, are you somehow killing the thread to get those WARN messages? The outcome is different each time (I think I start to understand your "it seems in a loop" statement).

Yes, that is sidekiq's way of getting a stacktrace dumped to the logs, by running:

kill -TTIN <sidekiq_pid>

from fugit.

conet avatar conet commented on June 7, 2024 1

It took a bit longer but here it is:

2019-03-21_11:57:05.48766 2019-03-21T11:57:05.487Z 19536 TID-or1c33vzs WARN:  /// should_enque() 2019-03-21 11:57:05 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculations, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "2019-03-21 11:57:05 +0000" / now: 2019-03-21 11:57:05 UTC (d 0.304640802s) / let: 2019-03-21 11:56:32 UTC

I assume that only the last entry is important; if you want more, please let me know.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024 1

A few early observations:

I restarted GitLab after the change; it was running without issues for a couple of minutes, then it entered the endless loop, proven this time by the logs:

2019-03-21_08:33:05.75627 2019-03-21T08:33:05.756Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.75793 2019-03-21T08:33:05.757Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76032 2019-03-21T08:33:05.759Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: start
2019-03-21_08:33:05.76533 2019-03-21T08:33:05.765Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76710 2019-03-21T08:33:05.767Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
...

The endless loop is not the one exposed in gh-11 / gh-13. That loop (

fugit/lib/fugit/cron.rb

Lines 186 to 215 in f1e2d00

loop do
  ti1 = t.to_i
  fail RuntimeError.new(
    "loop stalled for #{@original.inspect} #next_time, breaking"
  ) if stalling && ti == ti1
  stalling = (ti == ti1)
  ti = ti1
  fail RuntimeError.new(
    "too many loops for #{@original.inspect} #next_time, breaking"
  ) if (ti - ifrom).abs > BREAKER_S
  (ifrom == ti) && (t.inc(1); next)
  month_match?(t) || (t.inc_month; next)
  day_match?(t) || (t.inc_day; next)
  hour_match?(t) || (t.inc_hour; next)
  min_match?(t) || (t.inc_min; next)
  sec_match?(t) || (t.inc_sec(@seconds); next)
  st = t.time.strftime('%F/%T')
  (from, sfrom, ifrom = t.time, st, t.time.to_i; next) if st == sfrom
    #
    # when transitioning out of DST, this prevents #next_time from
    # yielding the same literal time twice in a row, see gh-6
  break
end
) is happening in #next_time while your case occurs around #previous_time.

As suggested by @godfat, there is now a breaker in the #next_time loop. At the same time, a breaker was added in the #previous_time loop (

fugit/lib/fugit/cron.rb

Lines 232 to 254 in f1e2d00

loop do
  ti1 = t.to_i
  fail RuntimeError.new(
    "loop stalled for #{@original.inspect} #previous_time, breaking"
  ) if stalling && ti == ti1
  stalling = (ti == ti1)
  ti = ti1
  fail RuntimeError.new(
    "too many loops for #{@original.inspect} #previous_time, breaking"
  ) if (ifrom - ti).abs > BREAKER_S
  (ifrom == ti) && (t.inc(-1); next)
  month_match?(t) || (t.dec_month; next)
  day_match?(t) || (t.dec_day; next)
  hour_match?(t) || (t.dec_hour; next)
  min_match?(t) || (t.dec_min; next)
  sec_match?(t) || (t.dec_sec(@seconds); next)
  break
end
). The breaker never triggered for you. No re-opening of gh-11 / gh-13.

The log proves that Sidekiq-Cron is calling #should_enque repeatedly, with the same time value :-(

I've looked at Sidekiq-Cron (https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/job.rb and https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/poller.rb) and I don't see yet how this loop comes to be.

I think the impression you got that the problem comes from #previous_time arises because this method is probably the costliest of the ones used in #should_enque, so your kills exposed it repeatedly.

from fugit.

conet avatar conet commented on June 7, 2024 1

Finally, here's the current end of the output, because it goes on and on:

Fugit#previous_time \ {:i=>180840000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180850000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180860000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180870000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180880000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180890000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180900000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>180910000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}

and the beginning:

Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 23 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"50 00 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"10 0 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"50-59 22 * * 0", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"19 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"*/1 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 0 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 */2 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 12 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 12 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"0 */1 * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"17 * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>0, :t=>"20190322T053910.000000Z", :d=>1553233150, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>10000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>20000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>30000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>40000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>50000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>60000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>70000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>80000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>90000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}
Fugit#previous_time \ {:i=>100000, :t=>"20190322T053910.000000Z", :d=>1, :from=>"20190322T053910.159113Z", :c=>"10 * * * * *", :tz=>nil, :df=>0}

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024 1

Thank you very much.

from fugit.

conet avatar conet commented on June 7, 2024 1

Now that you mention the match_second part, I do recall often seeing it at the top of the stack trace. I'm glad we got to the bottom of it, thanks for your support.

from fugit.

godfat avatar godfat commented on June 7, 2024 1

Wow, thank you both so much for getting to the bottom of this! Very impressive debugging process. Sorry I didn't try to jump in; apparently I didn't exactly follow the reasoning, and I still don't really understand what went wrong.

I can provide some general ideas though.

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

@jmettraux In general I think breaking on an iteration count is better than a time-based approach though, because of what @conet said:

Yeah, but that has the disadvantage that it might mean more or fewer iterations depending on various factors, including current load, so it might actually make this more unpredictable.

This can become very unpredictable. My favorite example is an old game, The 7th Guest: it has a mini-game whose AI was pretty weak when the game was released. Twenty years later the AI had become unbeatable, because it only stopped searching for the best move based on elapsed time, and a computer twenty years more powerful can find the absolute best move (the game wasn't that complex anyway).

Besides, a timer requires another thread to watch it. If we have a global timer, that's fine resource-wise, but if we need a new timer for each calculation, we double the number of threads we need. We could also check the elapsed time on each iteration, but again, heavy load, or suspending the execution and resuming it later, would interfere with it and might cause it to return too early, breaking the application under load. That would be very hard to debug unless we already know it's time-sensitive.
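For context, Ruby's standard-library way to express the timeout idea is Timeout.timeout, and it illustrates the watcher cost described above (a minimal sketch of the trade-off only, not a proposal for fugit; the watcher implementation varies across Ruby versions):

require 'timeout'

begin
  # Stand-in for a computation that never converges; a watcher interrupts it
  # after 1 second of wall-clock time, however many iterations that was.
  Timeout.timeout(1) { sleep 0.5 while true }
rescue Timeout::Error
  warn 'timed out after 1 second'
end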

Something like that 41 years and a few days... there won't be a next or a previous time, but expressed as the number of iterations that would be enough to reach that state under normal circumstances.

Yes, if something like this can be done that'll be ideal. It's like a sanity check, based on complexity. I don't know cron and time enough to know if this is feasible though.
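To make the idea concrete, here is a minimal, self-contained sketch of an iteration-count breaker (a toy example, not fugit code; the bound is hypothetical): the loop gives up after a fixed number of steps, so it behaves the same on fast and slow machines.

MAX_ITERATIONS = 1_000_000 # hypothetical bound

def search_with_iteration_breaker
  MAX_ITERATIONS.times do |i|
    return i if yield(i) # stop as soon as the block reports a match
  end
  raise 'too many iterations, breaking'
end

# Toy usage: the first multiple of 7 greater than 100.
p search_with_iteration_breaker { |i| i > 100 && i % 7 == 0 } # => 105

An unsatisfiable predicate raises instead of spinning forever, regardless of machine speed or load.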

It was right under my nose. #previous_time called within a "matching second".

Is this because of floating-point precision?

Thanks to both of you for the patience and the guidance.

@jmettraux Thank you so much for your dedication to get to the bottom of this, and for being willing to install GitLab to try it out. I know that can be cumbersome. Please don't stress yourself too much though; it's already much appreciated if you can take a look.

@conet Thank you for providing all the logs and traces and being willing to go through it patiently, especially when you're not familiar with Ruby. I know it takes a lot of effort to spin up an instance, prepare the patch, run it, wait for it, and finally collect the logs, especially when it might not reproduce at all.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024 1

Hello @conet,

If you have the time to confirm that the issue is fixed, that'd be great.

The fixed fugit has been running for 8 hours on my GitLab enterprise instance and I see no issues, no endless loops. (I just dropped the updated https://github.com/floraison/fugit/blob/b50bbfdb243f4ce4e8ce0d966eab25c6833ab594/lib/fugit/cron.rb file over the defective one).

If the fix is confirmed, I will release fugit 1.1.9.

Thanks in advance.

from fugit.

conet avatar conet commented on June 7, 2024 1

After several hours running the above container with your fix, the endless loop hasn't kicked in; I'd say that the issue can be closed.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024 1

Hello @conet and @godfat,

I have released fugit 1.1.9.

Thank you very much! If there is anything wrong, please tell me!

Special thanks to @conet for fighting for his case and pushing fugit in the right direction!

from fugit.

godfat avatar godfat commented on June 7, 2024 1

Thank you all! I created a merge request to update Fugit in GitLab: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26579

from fugit.

conet avatar conet commented on June 7, 2024

I just got online. I've managed to add a log statement in GitLab:

/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb

to get the time value going into:

parsed_cron.previous_time(current_time).utc

that is causing this, I'm waiting for the results.
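For reference, a standalone sketch of that call and the value fed into it (assuming only the fugit gem is available; parsed_cron and current_time below are stand-ins for sidekiq-cron's internal state, not the exact patch used):

require 'fugit'

parsed_cron  = Fugit.parse('10 * * * * *') # stand-in for the job's parsed cron
current_time = Time.now.utc                # stand-in for the time handed to should_enque?

# log exactly what goes into #previous_time, then make the same call as job.rb
warn "previous_time arg: #{current_time.inspect} (#{current_time.class})"
warn "previous_time => #{parsed_cron.previous_time(current_time).utc.inspect}"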

To answer the endless loop related question: by endless loop I mean that, based on successive stacktraces, it seems like the code is stuck in the loop in previous_time.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet thank you very much.

Could you please answer my second question?

@jmettraux wrote:

Does the LDAP sync work your settings seem to imply actually happen or not? Does the LDAP sync work happen every 5 / 10 seconds? Does it happen but with a different schedule? Or is it NOT happening at all?

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet's traces (a bit truncated):

.79856 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: Thread TID-ot89u8ea9 scheduler
.79857 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/time_or_datetime.rb:321:in `wrap'
.79857 gl/emb/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/timezone.rb:420:in `utc_to_local'
.79857 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
.79858 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:23:in `block in should_enque?'
.79859 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq.rb:97:in `block in redis'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
.79860 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq.rb:94:in `redis'
.79861 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:21:in `should_enque?'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:42:in `test_and_enque_for_time!'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:28:in `enqueue_job'
.79862 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:15:in `block in enqueue'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:14:in `each'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/poller.rb:14:in `enqueue'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/scheduled.rb:68:in `block in start'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/util.rb:16:in `watchdog'
.79863 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/util.rb:25:in `block in safe_thread'
.46726 2019-03-20T21:47:32.466Z 5353 TID-oxdlhwzy1 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/duration.rb:47:in `==='
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:236:in `plus_with_duration'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:129:in `utc'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
.46727 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
.46728 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
.46729 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
.46729 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
2019-03-20_21:58:36.80890 2019-03-20T21:58:36.808Z 5353 TID-oxdlhwzy1 WARN: gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:239:in `+'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/activesupport-5.0.7.1/lib/active_support/core_ext/time/calculations.rb:239:in `plus_with_duration'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:129:in `utc'
2019-03-20_21:58:36.80891 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:358:in `to_time'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:203:in `block (2 levels) in <class:EoTime>'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:76:in `block (2 levels) in <class:TimeCursor>'
2019-03-20_21:58:36.80892 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:119:in `sec_match?'
2019-03-20_21:58:36.80893 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:252:in `block in previous_time'
2019-03-20_21:58:36.80893 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:561:in `not_past_scheduled_time?'
2019-03-20_21:58:36.80894 gl/emb/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:23:in `block in should_enque?'

@conet, are you somehow killing the thread to get those WARN messages? The outcome is different each time (I think I start to understand your "it seems in a loop" statement).

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

Meanwhile, I built a small project at https://github.com/jmettraux/fugit_gh15 to try and reproduce the issue.

So far, I have tried to feed an ActiveSupport-enhanced Time to #previous_time, but no loop went out of control.

Looking forward to @conet log results.

from fugit.

conet avatar conet commented on June 7, 2024

Sorry for the delay, here are my results. @godfat, I will need your help to transform the GitLab-specific data into fugit-specific stuff so that @jmettraux can work with it. First, this is the logging I added to GitLab's should_enque:

      #crucial part of whole enquing job
      def should_enque? time
        enqueue = false
        @time_str = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
        logger.warn { "should_enque: #{@time_str}" }
        enqueue = Sidekiq.redis do |conn|
          status == "enabled" &&
            not_past_scheduled_time?(time) &&
            not_enqueued_after?(time) &&
            conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time))
        end
        enqueue
      end

I restarted GitLab after the change; it was running without issues for a couple of minutes, then it entered the endless loop, proven this time by the logs:

2019-03-21_08:33:05.75627 2019-03-21T08:33:05.756Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.75793 2019-03-21T08:33:05.757Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76032 2019-03-21T08:33:05.759Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: start
2019-03-21_08:33:05.76533 2019-03-21T08:33:05.765Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76710 2019-03-21T08:33:05.767Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76857 2019-03-21T08:33:05.768Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.76933 2019-03-21T08:33:05.769Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77176 2019-03-21T08:33:05.771Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77327 2019-03-21T08:33:05.773Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77441 2019-03-21T08:33:05.774Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77558 2019-03-21T08:33:05.775Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77670 2019-03-21T08:33:05.776Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77811 2019-03-21T08:33:05.778Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.77979 2019-03-21T08:33:05.779Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78125 2019-03-21T08:33:05.781Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78223 2019-03-21T08:33:05.782Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78332 2019-03-21T08:33:05.783Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78403 2019-03-21T08:33:05.784Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78503 2019-03-21T08:33:05.784Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78605 2019-03-21T08:33:05.786Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78708 2019-03-21T08:33:05.787Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78948 2019-03-21T08:33:05.789Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.79007 2019-03-21T08:33:05.790Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.86100 2019-03-21T08:33:05.771Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: start
2019-03-21_08:33:05.86361 2019-03-21T08:33:05.789Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: start
2019-03-21_08:33:05.86423 2019-03-21T08:33:05.765Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: Started LDAP group sync
2019-03-21_08:33:06.06309 2019-03-21T08:33:06.061Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: done: 0.29 sec
2019-03-21_08:33:06.06425 2019-03-21T08:33:06.063Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: done: 0.274 sec
2019-03-21_08:33:10.87072 2019-03-21T08:33:10.869Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: done: 0.511 sec

By proven I mean that nothing is logged by sidekiq after that... now when I dump stack traces I get:

2019-03-21_10:39:44.36153 2019-03-21T10:39:44.360Z 29013 TID-ovh3gnxw5 WARN: /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `initialize_dup'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `dup'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:292:in `inc'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/et-orbi-1.1.7/lib/et-orbi/eo_time.rb:223:in `+'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:79:in `inc'
2019-03-21_10:39:44.36154 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:247:in `block in previous_time'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `loop'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/fugit-1.1.7/lib/fugit/cron.rb:232:in `previous_time'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:563:in `not_past_scheduled_time?'
2019-03-21_10:39:44.36155 /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/sidekiq-cron-1.0.4/lib/sidekiq/cron/job.rb:25:in `block in should_enque?'

This means that the scheduler is stopped by this and the task does not run anymore (to answer the other question). @godfat, can you help me with the equivalent of this in GitLab's env?

bundle exec ruby -r et-orbi -e "EtOrbi._make_info"

from fugit.

conet avatar conet commented on June 7, 2024
-------------------------------------------------------------------------------------
 GitLab:       11.8.3-ee (822e412ad9a)
 GitLab Shell: 8.4.4
 postgresql:   9.6.11
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7.1)
irb(main):001:0> EtOrbi._make_info
(secs:1553165167.8255365,utc~:"2019-03-21 10:46:07.8255364894866943",ltz~:"UTC")
(etz:":/etc/localtime",tnz:"CET",tziv:"1.2.5",tzidv:nil,rv:"2.5.3",rp:"x86_64-linux",win:false,rorv:"5.0.7.1",astz:[ActiveSupport::TimeZone, "Etc/UTC"],eov:"1.1.7",eotnz:#<TZInfo::DataTimezone: Etc/UTC>,eotnfz:"+0000",eotlzn:"Etc/UTC",eotnfZ:"UTC",debian:"Europe/Zurich",centos:nil,osx:"Europe/Zurich")
=> nil
irb(main):002:0> 

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet Thanks for the great feedback! Working on digesting it.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet, could you please try again with this updated snippet:

#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  Sidekiq.redis { |conn|

    status == 'enabled' &&
    not_past_scheduled_time?(time) &&
    not_enqueued_after?(time) &&
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }
end

(sorry for the reformatting, I need to wrap my head around sidekiq-cron's API)

from fugit.

conet avatar conet commented on June 7, 2024

I applied the patch, waiting for the loop to kick in, usually it takes a couple of minutes.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet,

would you mind testing again with the following patch?

#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  if time == @__previous_should_enque_time
    logger.warn {
      "  \\\\\\ should_enque() called once again with the same time :-(\n" +
      caller.collect { |l| "  \\ #{l}" }.join("\n") }
  else
    @__previous_should_enque_time = time
  end

  Sidekiq.redis { |conn|

    status == 'enabled' &&
    not_past_scheduled_time?(time) &&
    not_enqueued_after?(time) &&
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }
end

Thanks in advance.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

(I'd need two complete caller stacks as output in the log; I want to know what calls this #should_enque repeatedly, or maybe my hypothesis does not stand)

from fugit.

conet avatar conet commented on June 7, 2024

I applied the patch, restarted, and am waiting for the loop to kick in; in the meantime:

grep "called once again with the same time" /var/log/gitlab/sidekiq/current

yields 0 results.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet wrote:

in the meantime (...) yields 0 results.

Please correct me if I am wrong.

As I understand your feedback, your cron job works for a while, until the 'loop' kicks in and #should_enque starts repeating itself in the WARN section of the log. The other jobs remain unaffected.

(What I can't explain myself is why your logs above only show the #should_enque for TID-ovgxgte1x, but that might be because you're only showing the repeat section of the log)

from fugit.

conet avatar conet commented on June 7, 2024

The loop is back, this gives no results:

grep "called once again with the same time" /var/log/gitlab/sidekiq/current

The last entry was:

2019-03-21_13:53:05.19952 2019-03-21T13:53:05.199Z 5133 TID-or644fx7x WARN:  /// should_enque() 2019-03-21 13:53:05 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculations, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "2019-03-21 13:53:05 +0000" / now: 2019-03-21 13:53:05 UTC (d 0.086686619s) / let: 2019-03-21 13:52:18 UTC

Several consecutive stack traces (extracted only the scheduler thread):
sidekiq.log

from fugit.

conet avatar conet commented on June 7, 2024

As I understand your feedback, your cron job works for a while, until the 'loop' kicks in and #should_enque starts repeating itself in the WARN section of the log. The other jobs remain unaffected.

(What I can't explain myself is why your logs above only show the #should_enque for TID-ovgxgte1x, but that might be because you're only showing the repeat section of the log)

No, from what I understand TID-ovgxgte1x is the scheduler thread which, when operating normally, calls should_enque for each job it has (thus producing multiple log entries), but once this loop kicks in, the scheduler thread never gets past it, thus all job processing stops working. That is why I thought that only the last log entry is important, since that is the one that triggers the endless loop.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet wrote:

but once this loop kicks in, the scheduler thread never gets past it, thus all job processing stops working. That is why I thought that only the last log entry is important, since that is the one that triggers the endless loop.

2019-03-21_08:33:05.78708 2019-03-21T08:33:05.787Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.78948 2019-03-21T08:33:05.789Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.79007 2019-03-21T08:33:05.790Z 29013 TID-ovgxgte1x WARN: should_enque: 2019-03-21 08:33:05 +0000
2019-03-21_08:33:05.86100 2019-03-21T08:33:05.771Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: start
2019-03-21_08:33:05.86361 2019-03-21T08:33:05.789Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: start
2019-03-21_08:33:05.86423 2019-03-21T08:33:05.765Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: Started LDAP group sync
2019-03-21_08:33:06.06309 2019-03-21T08:33:06.061Z 29013 TID-ovgzfnhv9 UpdateAllMirrorsWorker JID-97db5164488292f394f70d41 INFO: done: 0.29 sec
2019-03-21_08:33:06.06425 2019-03-21T08:33:06.063Z 29013 TID-ovgxlnqdh Geo::SidekiqCronConfigWorker JID-83b3af2c2e5ddf868978583e INFO: done: 0.274 sec
2019-03-21_08:33:10.87072 2019-03-21T08:33:10.869Z 29013 TID-ovgxn8kq5 LdapAllGroupsSyncWorker JID-4102ed0f02f8c832a3f2b3e4 INFO: done: 0.511 sec

To me, it seems that this log of yours shows that job processing happens after the #should_enque loop kicks in.

from fugit.

conet avatar conet commented on June 7, 2024

To me, it seems that this log of yours shows that job processing happens after the #should_enque loop kicks in.

Well, those are the last instances that the scheduler was able to launch; after those, no more jobs are launched.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

OK, so I am completely wrong. Argh.

from fugit.

conet avatar conet commented on June 7, 2024

Shouldn't a log statement which prints everything you need to set this up in a minimal demo be enough? I assume that if we have all parameters involved in the previous_time call then it should be reproducible with a single statement, right? Could you help me with this log statement? I assumed that the first patch you sent did that, but it seems like it didn't.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

Sorry, I was following a wrong track. I am preparing a new patch.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet is your Sidekiq-Cron setup logging errors happening in your cron jobs?

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

Could you please try with this:

#crucial part of whole enqueueing job
def should_enque? time

  logger.warn {

    s = time.strftime(LAST_ENQUEUE_TIME_FORMAT)
    c = time.class
    n = Time.now.utc

    " /// should_enque() #{time.inspect} (#{c}/#{c.ancestors.inspect})" +
    " / #{name.inspect} / #{cron.inspect}" +
    " / #{s.inspect} / now: #{n.inspect} (d #{n - time}s)" +
    " / let: #{@last_enqueue_time.inspect}" }

  status == 'enabled' &&
  not_past_scheduled_time?(time) &&
  not_enqueued_after?(time) &&
  Sidekiq.redis { |conn|
    conn.zadd(job_enqueued_key, formated_enqueue_time(time), formated_last_time(time)) }

rescue Exception => exc
  logger.error {
    " \\\\\\ should_enque() failed: #{exc.inspect}\n" +
    " \\ name: #{name.inspect} \\ cron: #{cron.inspect}\n" +
    " \\ parsed cron: #{parsed_cron.inspect}\n" +
    " \\ trace:\n" + caller.collect { |l| "  \\ #{l}" }.join("\n") }
  raise
end

from fugit.

conet avatar conet commented on June 7, 2024

@conet is your Sidekiq-Cron setup logging errors happening in your cron jobs?

Yes it does, but the jobs themselves run on a different thread so they should not influence the scheduler.

I'm running your latest version, waiting to see if the issue occurs so that I can send the new logs.

from fugit.

conet avatar conet commented on June 7, 2024

Endless loop, no failures:

grep 'should_enque() failed' /var/log/gitlab/sidekiq/current

no results, the last entry is:

2019-03-21_15:43:10.47610 2019-03-21T15:43:10.476Z 19788 TID-ouz6ebs6o WARN:  /// should_enque() 2019-03-21 15:43:10 UTC (Time/[Time, DateAndTime::Compatibility, DateAndTime::Calculations, DateAndTime::Zones, Comparable, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]) / "ldap_group_sync_worker" / "10 * * * * *" / "2019-03-21 15:43:10 +0000" / now: 2019-03-21 15:43:10 UTC (d 0.227103581s) / let: 2019-03-21 15:42:30 UTC

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@conet, do we agree that your "ldap_group_sync_worker" job triggers a few times, then goes into the loop?

from fugit.

conet avatar conet commented on June 7, 2024

@conet, do we agree that your "ldap_group_sync_worker" job triggers a few times, then goes into the loop?

The job does fire a few times, but it's not the job that has the loop; it's the scheduler that calls previous_time, which belongs to this project. No matter how buggy the library's clients are, they shouldn't be able to trigger an endless loop in the library. It might be that there is no endless loop but many calls to previous_time, but I don't see that based on the logs and the code. Then again, I have no Ruby expertise; I just want GitLab working properly.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

The job does fire a few times, but it's not the job that has the loop; it's the scheduler that calls previous_time, which belongs to this project. No matter how buggy the library's clients are, they shouldn't be able to trigger an endless loop in the library. It might be that there is no endless loop but many calls to previous_time, but I don't see that based on the logs and the code. Then again, I have no Ruby expertise; I just want GitLab working properly.

I do understand that the scheduler thread ends up in the loop, but the loop happens in a method of the job's cron instance. When the scheduler tries to determine whether it has to enqueue the job, it queries the job about that; it doesn't trigger the job.

What I am suspecting is #previous_time running OK most of the time, then choking on something. The probability of choking seems to increase when you go from "5 * * * *" to "5 * * * * *", as your first few comments at GitLab seem to indicate.

I am trying to extract facts from your feedback, hence my questions and my poking. You have to understand that we both have to do our best to communicate.

I am no GitLab employee or contractor. I am just a freelancer who has just spent 8 hours of a holiday debugging this issue, because I care about my tiny open source library.

Shouldn't a log statement which prints everything you need to set this up in a minimal demo be enough? I assume that if we have all parameters involved in the previous_time call then it should be reproducible with a single statement, right? Could you help me with this log statement? I assumed that the first patch you sent did that, but it seems like it didn't.

Indeed, something like that:

#require 'active_support/time'
#require 'active_support/time_with_zone'

p Time.class.ancestors
  #
  # On @conet's platform it looks like:
  #
  # [ Time, DateAndTime::Compatibility, DateAndTime::Calculations,
  #    DateAndTime::Zones, Comparable, NakayoshiFork::Behavior,
  #    ActiveSupport::ToJsonWithActiveSupportEncoder, Object,
  #    GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation,
  #    ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable,
  #    ActiveSupport::Tryable, JSON::Ext::Generator::GeneratorMethods::Object,
  #    Kernel, BasicObject ]

require 'fugit'; p Fugit::VERSION

cron = Fugit.parse('5 * * * * *')
#cron = Fugit.parse('10 * * * * *')

p cron.previous_time(Time.now.utc)

I'd love to know how Rails, Sidekiq and Sidekiq-cron add all those mixin modules to the Time class. On my system it looks like [Class, Module, Object, ActiveSupport::Tryable, Kernel, BasicObject]
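For what it's worth, the mechanism itself is plain Module#include on a reopened core class; a minimal sketch in plain Ruby (MyCalculations is a made-up module, not an ActiveSupport one):

# Reopening Time and including a module is how extra entries end up in
# Time.ancestors; ActiveSupport-style core extensions rely on the same mechanism.
module MyCalculations
  def beginning_of_next_minute
    self + (60 - sec) # toy helper: jump to second 0 of the following minute
  end
end

class Time
  include MyCalculations
end

p Time.ancestors.first(3)               # => [Time, MyCalculations, Comparable] (in plain Ruby)
p Time.now.beginning_of_next_minute.sec # => 0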

Tomorrow morning, I'll try to save some time from my day job to work on that. I will try with the latest data you gave me above, I'll try to augment the extensions to simulate the Rails environment.

An interesting question would be to know if the included class ancestors shown above are the same with GitLab 11.8.2 and 11.8.3.

I also would love to know if you can confirm that GitLab 11.8.3 really rids you of the endless loop.

The worst-case scenario for me? Having to install GitLab and try for myself; by the way, what operating system are you working on? That sounds like a lot of work. I would much prefer having just a simple script that proves that my code goes into an endless loop.

Now, it's time for me to go to bed. I hope the night and the rest will dissipate my confusion and provide me with inspiration.

Best regards.

Thanks to @godfat for his guidance and patience!

from fugit.

godfat avatar godfat commented on June 7, 2024

Thank you both for trying to get to the bottom of this!

This really looks bizarre. I did try to look at the relevant code mentioned in #15 (comment), and I almost wanted to ask whether it keeps re-enqueuing the same thing over and over, but after looking at it more closely I don't see why that would happen, so in the end I didn't ask.

I actually started to think it might be a bug in sidekiq-cron but I don't really have a clue yet. I am not really familiar with sidekiq either.

An interesting question would be to know if the included class ancestors shown above are the same with GitLab 11.8.2 and 11.8.3.

@jmettraux It should be the same. Here's the diff between 11.8.2 and 11.8.3; you can see that there's actually only one line change, and that shouldn't have anything to do with this. I think this might be a very, very rare edge case...

Here's when GitLab completely moved to Fugit 1.1.7. It was introduced in 11.8 which was released on 2019-02-22.

Or maybe it's really a bug in sidekiq-cron. I'll check the version updates in GitLab for sidekiq-cron tomorrow. Time to sleep for me, too.

from fugit.

conet avatar conet commented on June 7, 2024

This is how it looks on 11.8.3:

sudo gitlab-rails console
-------------------------------------------------------------------------------------
 GitLab:       11.8.3-ee (822e412ad9a)
 GitLab Shell: 8.4.4
 postgresql:   9.6.11
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7.1)
irb(main):001:0> p Time.class.ancestors
[Class, StateMachines::MacroMethods, Module, ActiveSupport::Dependencies::ModuleConstMissing, Module::Concerning, NakayoshiFork::Behavior, ActiveSupport::ToJsonWithActiveSupportEncoder, Object, GettextI18nRails::HtmlSafeTranslations, FastGettext::Translation, ERB::Util, PP::ObjectMixin, ActiveSupport::Dependencies::Loadable, JSON::Ext::Generator::GeneratorMethods::Object, ActiveSupport::Tryable, Kernel, BasicObject]

from fugit.

conet avatar conet commented on June 7, 2024

I have to tell you that I'm in almost the same position, devoting free time to this just to make sure others will not go through the same problems as I did, with the only difference that I don't maintain any of the involved projects.

Remember, this is solved for me with the upgrade to 11.8.3-ee and, based on the monitoring data, it appeared around March 14th (I'm not sure whether that's because I upgraded to 11.8.2-ee from 11.8.1, or because I enabled LDAP sync) by changing the cron from 5 * * * * * to 5 * * * *. So if you think that this is an edge case or something like that, fine with me; it might really be so. But I can definitely say that it's there, and if it happened to me it probably happened to others; others just might not see it that fast. I was lucky to run GitLab on a 1-CPU machine (I was developing infra as code).

About the OS: this is running on GCP, using a fresh ubuntu-minimal-1804-bionic-v20190320 image with GitLab installed. When I got feedback this morning, all I had to do to reproduce it was to set up a new instance running 11.8.2-ee using one of the snapshots of our data disk. But I'm really not willing to spend more time digging into the data.

from fugit.

conet avatar conet commented on June 7, 2024

Please help me with some log statements to debug what is going on in cron.rb's previous_time: log iterations, dump state, whatever helps. There is no such thing as a loop which goes on forever without a reason. Better to look directly at the core of the problem than to scratch at the edges.

from fugit.

conet avatar conet commented on June 7, 2024

I just updated my comment: it wasn't 11.8.3 that fixed things but changing the cron's precision; I was already running 11.8.3, see this comment. It might be that the scheduler does not support jobs which take longer than the scheduled interval.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

OK, that's great, let's go on working together.

Remember, this is solved for me with the upgrade to 11.8.3-ee and, based on the monitoring data, it appeared around March 14th (I'm not sure whether that's because I upgraded to 11.8.2-ee from 11.8.1, or because I enabled LDAP sync) by changing the cron from 5 * * * * * to 5 * * * *.

Sorry, I can only remember what is written here and at https://gitlab.com/gitlab-org/gitlab-ce/issues/59273. Thanks for clarifying that because I was working with the "5 then 4 or 5 stars" on 11.8.2-ee only case. Mostly because of this comment.

It might be that the scheduler does not support jobs which take longer than the scheduled interval.

That's a very important piece of information; I wish I had had it earlier. So it seems your LDAP jobs took more than one minute.

Please help me with some log statements to debug what is going on in cron.rb's previous_time: log iterations, dump state, whatever helps. There is no such thing as a loop which goes on forever without a reason. Better to look directly at the core of the problem than to scratch at the edges.

OK, but please give us a clear, up-to-date picture.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

Hello,

could you please use the following monkey patch and try again? https://gist.github.com/jmettraux/5e998fa0ed4981e4239243a034903bb0

I guess, @godfat will certainly correct me, that it can be placed at the end of https://gitlab.com/gitlab-org/gitlab-ee/blob/v11.8.3-ee/config/initializers/sidekiq.rb to take effect after restart.

It should emit a log warn entry every 10_000th iteration in the endless loop and tell us more about it.

Thanks in advance!

from fugit.

conet avatar conet commented on June 7, 2024

ti1 - ti seems to be different, I hope you can figure out why...

from fugit.

conet avatar conet commented on June 7, 2024

Anyway, it seems like an iteration-count-based circuit breaker would be a much safer idea.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

Anyway, it seems like an iteration-count-based circuit breaker would be a much safer idea.

Yes, I need to find the right max number (quick to reach yet safely unrealistic).

from fugit.

conet avatar conet commented on June 7, 2024

Anyway, it seems like an iteration-count-based circuit breaker would be a much safer idea.

Yes, I need to find the right max number (quick to reach yet safely unrealistic).

I'd go with a value that would limit the call to under a couple of seconds on an idle CPU; better to fail in some strange situation than to never return.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

I'd go with a value that would limit the call to under a couple of seconds on an idle CPU; better to fail in some strange situation than to never return.

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

Still I have to determine why -1 turns into -0.

from fugit.

conet avatar conet commented on June 7, 2024

You're giving me another idea: a simple timeout, whatever the iteration count, whatever the distance in time, if it takes too long, let's fail.

Yeah, but that has the disadvantage that it might mean more or fewer iterations depending on various factors, including current load, so it might actually make this more unpredictable. Even if you figure out why -1 turns into -0 (some type overflow? is that possible in Ruby?), I'd still add a fixed number of iterations. Something like that 41 years and a few days... there won't be a next or a previous time, but expressed as the number of iterations that would be enough to reach that state under normal circumstances.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

OK, installing a v11.8.3-ee on Debian 9.8.0. I'll use the .deb at https://about.gitlab.com/install/

Anything I should be aware of?

from fugit.

conet avatar conet commented on June 7, 2024

Yeah, in order to avoid using EE-specific cron jobs which might need a license, I would add one of these jobs in the config with a cron definition like 10 * * * * *.
After starting, follow this log file for cron-related logs:

/var/log/gitlab/sidekiq/current

If you apply your last patch the logs will show up in this file:

/var/log/gitlab/gitlab-rails/production.log

The endless loop should kick in anywhere between 5 and 60 minutes after start, so please be patient. I wouldn't wait for more than an hour; if it does not happen, I will also try to reproduce it on an empty, shareable instance.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

I have it running, I can see the logs exactly as you indicated.

I am all set up. Let's see.

from fugit.

jmettraux avatar jmettraux commented on June 7, 2024

@godfat @conet I can put my finger on it:

require 'fugit'  

cron = Fugit.parse('10 * * * * *')  

t = cron.previous_time.to_f # + 0.123 (some float x so that 0.0 <= x < 1.0)
  # generate a t that is on the target

cron.previous_time(Time.at(t)) 
  # feed it to `#previous_time`

It was right under my nose. #previous_time called within a "matching second".
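A small, self-contained illustration of what a time "within a matching second" looks like (an illustration of the snippet above, not a trace of the fugit internals): a value such as 2019-03-22 05:39:10.159113, as seen in the logs, sits in the second that matches 10 * * * * *, yet the sub-second fraction makes it unequal to the whole-second occurrence even though both truncate to the same second.

t  = Time.utc(2019, 3, 22, 5, 39, 10) # a whole second matching '10 * * * * *'
tf = t + 0.159113                     # same calendar second, plus a fraction

p t == tf           # => false
p t.to_i == tf.to_i # => true, both truncate to the same second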

I will fix that and I will place a better breaker system just as discussed above.

Thanks to both of you for the patience and the guidance.

(Now I have to stop the GitLab instance I built. Funny, as I typed this comment, it just entered the 'loop'. I didn't need it anymore; probably doing the install helped clear my head.)

from fugit.

conet avatar conet commented on June 7, 2024

Since I scrapped my test instance after you managed to reproduce it, I was looking today for an easier way to do it. If you have Docker installed, here's a one-liner:

docker run --detach --env GITLAB_OMNIBUS_CONFIG="gitlab_rails['stuck_ci_jobs_worker_cron'] = \"10 * * * * *\"" --name gitlab --volume ${PWD}/gitlab/config:/etc/gitlab --volume ${PWD}/gitlab/logs:/var/log/gitlab --volume ${PWD}/gitlab/data:/var/opt/gitlab gitlab/gitlab-ce:11.9.0-ce.0

Let it run a while and the endless loop will kick in (check CPU usage in the container). Now that I've managed to reproduce it this way, I will give your fix a try.

from fugit.
