
Comments (16)

vaibhavad commented on June 18, 2024

Stale states definitely look like a Heroku issue. I tried MockProvider with both LocalArchitect and HerokuArchitect, as you suggested. Socket disconnects and socket errors are not being detected by HerokuArchitect; I tried with both Safari and Google Chrome. It is very likely that all the other issues originate from this.

In one of the runs, the rare case happened. Here are the relevant Mephisto logs (HerokuArchitect and MockProvider).

[2020-12-08 15:16:52,506][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:16:56] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting'}
[2020-12-08 15:16:56,519][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:17:00] p42509 {supervisor.py:527} DEBUG - Incoming request to register agent 220-5.
[2020-12-08 15:17:00,388][mephisto.operations.supervisor][DEBUG] - Incoming request to register agent 220-5.
[12-08 15:17:00] p42509 {supervisor.py:405} DEBUG - Worker 5 is being assigned one of 19 units.
[2020-12-08 15:17:00,398][mephisto.operations.supervisor][DEBUG] - Worker 5 is being assigned one of 19 units.
[12-08 15:17:00] p42509 {supervisor.py:425} DEBUG - Created agent 23, y.
[2020-12-08 15:17:00,405][mephisto.operations.supervisor][DEBUG] - Created agent 23, y.
Assignment 101 is launching with ['22', '23']
[12-08 15:17:00] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting'}
[2020-12-08 15:17:00,567][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:17:04] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting', '23': 'in task'}
[2020-12-08 15:17:04,652][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting', '23': 'in task'}

Agent 22 had actually disconnected, but HerokuArchitect could not detect the socket disconnect, so the status remains waiting. Somehow, though, the status is also not updated to in task even though the assignment was launched with this agent. In 4-5 test runs, this only happened once.
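
To make the contradiction concrete, it boils down to this check (a minimal Python sketch, not Mephisto code; find_stale_statuses is a name I made up):

# Sketch, not Mephisto code: cross-check the architect's reported
# status map against the assignments we know were already launched.
def find_stale_statuses(reported_statuses, launched_agent_ids):
    # Agents still reported as 'waiting' even though an assignment
    # has already been launched with them (like agent 22 above).
    return [
        agent_id
        for agent_id, status in reported_statuses.items()
        if status == "waiting" and agent_id in launched_agent_ids
    ]

# From the log above: assignment 101 launched with ['22', '23'],
# yet the server still reports 22 as 'waiting'.
reported = {"22": "waiting", "23": "in task"}
assert find_stale_statuses(reported, {"22", "23"}) == ["22"]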

JackUrb commented on June 18, 2024

#347 is now in, so I'll be closing this issue. Huge thanks for helping us debug what was going on here! Let us know if anything else comes up 👍

vaibhavad commented on June 18, 2024

Apart from this, I also observed some other unusual behaviors. Unfortunately, I don't have specific logs for these, only some screenshots.

  1. The agent status is waiting (as reported by the Heroku server) but the agent is actually in task. This is evident from the Waiting to pair with a task... message, which is rendered only when the agent status is waiting.

[Screenshot: Screen Shot 2020-12-04 at 1 08 56 PM]

  2. Sometimes two agents are waiting but are not paired with each other (Agents 80 and 81).
    [Screenshot: Screen Shot 2020-12-04 at 2 07 29 AM]

vaibhavad commented on June 18, 2024

Looking further into the Heroku logs (for a different run), using DEBUG = true, the Client connected log is present but not socket disconnected or socket error, even though workers returned the HIT or closed the window. Because of this, handle_possible_disconnect(agent); is never called. Any directions on how I can fix this?

JackUrb commented on June 18, 2024

Hi @vaibhavad - thanks for the detailed reproduction steps and analysis! I'll dive into this and see if there's anything that pops out to me.

JackUrb commented on June 18, 2024

First thing I'm noting in your attached log file is that the additional logging may have broken some of the task cleanup steps, in that Mephisto couldn't receive confirmation from the worker that they're submitting the task:

ERROR - Channel heroku_channel_parlai-chat-example_1_0 encountered error on packet {'packet_type': 'agent_action', 'sender_id': '3', 'receiver_id': 'mephisto', 'data': {'task_data': {'final_data': {}}, 'MEPHISTO_is_submit': True}}
Traceback (most recent call last):
  File "/Users/vaibhav/Mephisto/mephisto/operations/supervisor.py", line 126, in _on_channel_message
    self._on_message(packet, channel_info)
  File "/Users/vaibhav/Mephisto/mephisto/operations/supervisor.py", line 671, in _on_message
    logger.debug(f'Message from Agent {packet.sender_id} - \'{packet.data["text"]}\'')
KeyError: 'text'

Updating a final status, was returned and want to set to in task

Yeah, this should definitely be impossible, even if the system doesn't get to determine disconnects until after an agent hits a timeout on their task. I'll need to ensure the checks are rigorous enough to prevent this (and the related pairings of non-waiting agents).

The agent status is waiting (as reported by the Heroku server) but the agent is actually in task. This is evident from the Waiting to pair with a task... message, which is rendered only when the agent status is waiting.

Wow this looks like an instance of a very peculiar race condition. The status had to move to in task in order for the worker to have sent any messages at all, and then it was moved back to waiting. I'll need to look at the code paths directly to see how a transition like this is even possible.

Sometimes two agents are waiting but are not paired with each other (Agents 80 and 81).

Is it possible both agent 80 and 81 are the same worker? Mephisto won't connect these into a task.

but not socket disconnected or socket error, even though workers returned the HIT or closed the window.

This is the biggest thing I'm going to look into; your logs look mostly consistent except for this part. This isn't the first time I've seen this not work though, and the issue was tied to a particular browser version. Do you also not see the disconnection messages if you try different browsers?

Overall this is going to require some more investigation on my part to clean up these code paths - will likely have some clearer actions + fixes on Tuesday or Wednesday.

vaibhavad commented on June 18, 2024

First thing I'm noting in your attached log file is that the additional logging may have broken some of the task cleanup steps, in that Mephisto couldn't receive confirmation from the worker that they're submitting the task:

Yeah, I noticed this too after posting; I fixed it in vaibhavad/Mephisto@5600321. I can re-run the setup, but I'm pretty sure this isn't the root cause.
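
For reference, the gist of the fix is just to stop indexing into packet.data unconditionally (a Python sketch of the idea; see the linked commit for the actual change):

import logging

logger = logging.getLogger(__name__)

def log_agent_message(packet):
    # Submit packets carry no "text" key in packet.data, so the
    # original packet.data["text"] raised KeyError; .get() keeps the
    # debug log safe for every agent_action packet. `packet` is
    # assumed to expose .sender_id and a .data dict, as in the
    # traceback above.
    text = packet.data.get("text")
    if text is not None:
        logger.debug(f"Message from Agent {packet.sender_id} - '{text}'")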

Wow this looks like an instance of a very peculiar race condition.

It indeed is. I see it once in 30-40 HITs.

Is it possible both agent 80 and 81 are the same worker?

Unlikely, as I had only one browser window open per Worker Sandbox account. This is also seen very rarely, and it may be connected to the issue of an agent actually being in task while its status is waiting.

Mephisto won't connect these into a task.

Is there a way to change this? I want to give the Workers an option to pair with themselves after 60 seconds if they can't find a partner.
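
Roughly this behavior, in pseudologic (not an existing Mephisto option; choose_partner and the 60-second constant are mine):

import time

PAIR_WITH_SELF_AFTER = 60.0  # seconds a worker may wait for a partner

def choose_partner(agent, waiting_pool, waiting_since):
    # Illustrative matchmaking rule, not Mephisto's actual logic:
    # prefer a different worker, but fall back to self-pairing once
    # the wait exceeds PAIR_WITH_SELF_AFTER.
    others = [a for a in waiting_pool
              if a["worker_id"] != agent["worker_id"]]
    if others:
        return others[0]
    if time.time() - waiting_since >= PAIR_WITH_SELF_AFTER:
        return agent  # pair the worker with themselves
    return None  # keep waiting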

This is the biggest thing I'm going to look into, your logs look mostly consistent except for this part.

I'll try to run the simulation again and record Heroku logs too in case it helps.

This isn't the first time I've seen this not work though, and the issue was tied to a particular browser version.

I'm using Google Chrome, Google Chrome Incognito, and Safari to open three different Worker Sandbox accounts. I'll specifically check Heroku logs for each of these and let you know. I do get proper disconnect logs while running locally (architect and provider both), so I assumed it was a Heroku issue.

Thanks a lot for looking into this! :)

JackUrb commented on June 18, 2024

Note - for testing whether this is just a Heroku issue, it's possible to launch with the HerokuArchitect but still use the MockProvider, then go to links like your-task-name.herokuapp.com?worker_id=x&assignment_id=1. This way you don't need to wait for the MTurk-related loading or any of that flow.
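
For instance, minting a few sandbox links for separate browser windows (a sketch; the base URL is a placeholder):

# Each browser window gets its own mock worker/assignment pair by
# varying the query-string ids (placeholder base URL).
base = "https://your-task-name.herokuapp.com"
for i in range(1, 4):
    print(f"{base}?worker_id=worker_{i}&assignment_id={i}")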

JackUrb commented on June 18, 2024

Diving into this, there's definitely an interaction where HerokuArchitect isn't properly keeping liveness checks against the frontend. I don't have any idea why the connection close isn't being registered by the socket protocol that HerokuArchitect uses, but I've been able to reproduce it on my end as well.

The next step is to confirm that the server is trying to send a ping packet to the disconnected agents (which happens in the server's ping-sending code). If the failure of that send doesn't force the socket to close, then we'll need to add a timeout to that ping that closes the socket if no heartbeat has been received after a send (and clear the timeout when a heartbeat does come in, at the server's } else if (packet["packet_type"] == PACKET_TYPE_HEARTBEAT) { branch).
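
Sketched in Python for clarity (the actual server is Node.js, so this is the shape of the fix, not the code itself):

import time

HEARTBEAT_TIMEOUT = 10.0  # illustrative; seconds allowed after a ping

class LivenessTracker:
    # Shape of the proposed fix, not Mephisto's server code: arm a
    # deadline whenever a ping is sent, clear it when a heartbeat
    # arrives, and force-close any socket whose deadline lapses.
    def __init__(self):
        self.deadlines = {}  # agent_id -> time a heartbeat is due by

    def on_ping_sent(self, agent_id):
        self.deadlines[agent_id] = time.time() + HEARTBEAT_TIMEOUT

    def on_heartbeat(self, agent_id):
        self.deadlines.pop(agent_id, None)

    def agents_to_close(self):
        now = time.time()
        return [a for a, due in self.deadlines.items() if due < now]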

vaibhavad commented on June 18, 2024

Hi @JackUrb,

Looks like stale agent states are not the cause (or at least not the only cause) of the problematic pairing. I tried MockProvider with LocalArchitect. LocalArchitect was detecting disconnects immediately, but pairing was still happening with disconnected agents. Here are the relevant logs (logs from both mephisto and mephisto-server are mixed, as both were running locally) -

[2020-12-21 02:13:41,645][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'67': 'disconnect', '68': 'disconnect', '69': 'disconnect', '70': 'waiting'}
Adding message to agent queue { packet_type: 'heartbeat',
  msg_id: '5208dac7-105e-4c68-9b3b-50bbd67bbdbc',
  receiver_id: '70',
  sender_id: 'mephisto_server',
  data:
   { last_mephisto_ping: 1608497021640,
     status: 'waiting',
     state: { wants_act: false, done_text: null, task_done: false } } }
[12-21 02:13:43] p32037 {supervisor.py:527} DEBUG - Incoming request to register agent 665-5.
[2020-12-21 02:13:43,596][mephisto.operations.supervisor][DEBUG] - Incoming request to register agent 665-5.
[12-21 02:13:43] p32037 {supervisor.py:405} DEBUG - Worker 5 is being assigned one of 16 units.
[2020-12-21 02:13:43,609][mephisto.operations.supervisor][DEBUG] - Worker 5 is being assigned one of 16 units.
[12-21 02:13:43] p32037 {supervisor.py:425} DEBUG - Created agent 71, y.
[2020-12-21 02:13:43,617][mephisto.operations.supervisor][DEBUG] - Created agent 71, y.
Updating a final status, was disconnect and want to set to in task
Assignment 338 is launching with ['69', '71']

Here is the complete log file, if you need it -
mephisto_logs2.txt

Note - I ran this with mephisto-server on HTTPS, along with other changes to support HTTPS (current code on vaibhavad/Mephisto)

JackUrb commented on June 18, 2024

Hi @vaibhavad - thanks for the detailed logs here! These are exactly what I needed. The pairing of 70 and 71 would have been appropriate, so I'm looking at how it ended up connecting with 69 instead. I suspect the issue here is really related to disconnecting specifically while in the waiting state. The agent isn't expecting to act while waiting, and as such it may not be clearing the associated unit on a disconnect the way it should. I'm drafting a fix.
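
Conceptually, the invariant to enforce looks like this (a Python sketch of the idea, not the eventual diff; StubAgent and TERMINAL_STATUSES are illustrative stand-ins):

from dataclasses import dataclass

TERMINAL_STATUSES = {"disconnect", "returned", "expired"}

@dataclass
class StubAgent:  # stand-in for Mephisto's Agent, for illustration only
    agent_id: str
    status: str

def launchable(waiting_pool):
    # Never pair an agent whose status is already final; a disconnect
    # while 'waiting' should remove the agent from the pool (and free
    # its unit) rather than leave it eligible for assignment.
    return [a for a in waiting_pool if a.status not in TERMINAL_STATUSES]

# From the log above: 69 had disconnected and 70 was waiting, so the
# new agent 71 should only ever have been paired with 70.
pool = [StubAgent("69", "disconnect"), StubAgent("70", "waiting")]
assert [a.agent_id for a in launchable(pool)] == ["70"]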

JackUrb commented on June 18, 2024

Alright, I think I have a fix for this in #347: I could repro this occurrence before, but can't anymore.

vaibhavad commented on June 18, 2024

Thanks for the elegant fix! I can no longer reproduce the issue in preliminary testing.

vaibhavad commented on June 18, 2024

Apart from this, I also observed some other unusual behaviors. Unfortunately, I don't have specific logs for these, only some screenshots.

  1. The agent status is waiting (as reported by the Heroku server) but the agent is actually in task. This is evident from the Waiting to pair with a task... message, which is rendered only when the agent status is waiting.
[Screenshot: Screen Shot 2020-12-04 at 1 08 56 PM]
  2. Sometimes two agents are waiting but are not paired with each other (Agents 80 and 81).
    [Screenshot: Screen Shot 2020-12-04 at 2 07 29 AM]

Hi @JackUrb, have you observed this race condition issue in live tasks on your own? I encountered one of its rare occurrences yesterday when one of our data collection tasks was live. I forked Mephisto a long time ago, so I wanted to know if you have pushed any specific changes to identify and prevent this issue.

JackUrb commented on June 18, 2024

We've made a lot of changes in the last few weeks addressing strange status propagation (#382, #372, #366), so this is likely no longer possible.

vaibhavad commented on June 18, 2024

Great, thanks!
