Comments (26)

otoolep avatar otoolep commented on June 6, 2024 1

OK, when you get this into this state again can I see a full, recursive directory listing of the data directory on each node?

from rqlite.

otoolep avatar otoolep commented on June 6, 2024

Thanks for the report. Some questions.

  • In step 3, how do you stop the nodes?
  • You mean you are executing the "Recovery Process"? By setting the peers.json file? But nothing is actually different in peers.json?

You mean that by adding the sync flag you sometimes get the log output you added in the gist?

dwco-z avatar dwco-z commented on June 6, 2024

In step 3, how do you stop the nodes?

By sending a Ctrl+C signal so they stop gracefully.

You mean you are executing the "Recovery Process"? By setting the peers.json file? But nothing is actually different in peers.json?

Exactly, I'm testing the recovery functionality by setting the peers.json, but nothing is actually different (no suffrage change, no removal or addition)

You mean that by adding the sync flag you sometimes get the log output you added in the gist?

Yes.
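For readers following along, the recovery step discussed here can be sketched roughly as below. This is a minimal sketch, not the reporter's actual script: the helper name `write_peers_json` is hypothetical, the node IDs and ports in the usage note are made up, and the file layout (a JSON array of objects with `id`, `address` and `non_voter`, placed in the `raft` subdirectory of the data directory) is an assumption based on the rqlite recovery documentation.

```python
import json
from pathlib import Path


def write_peers_json(data_dir, peers):
    """Write a peers.json recovery file under <data_dir>/raft/.

    `peers` is a list of (node_id, raft_address) tuples. The file layout
    used here is an assumption, not taken from this thread.
    """
    raft_dir = Path(data_dir) / "raft"
    raft_dir.mkdir(parents=True, exist_ok=True)

    # Every node is written as a voter, matching the "no suffrage change" test.
    entries = [
        {"id": node_id, "address": address, "non_voter": False}
        for node_id, address in peers
    ]

    target = raft_dir / "peers.json"
    target.write_text(json.dumps(entries, indent=2))
    return target
```

With all nodes stopped, something like `write_peers_json(node_data_dir, [("1", "127.0.0.1:8201"), ("2", "127.0.0.1:8202")])` would be called once per node before restarting them.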

dwco-z avatar dwco-z commented on June 6, 2024

Here it is

├───127.0.0.18101
│   │   db.sqlite
│   │   db.sqlite-shm
│   │   db.sqlite-wal
│   │   raft.db
│   │
│   ├───raft
│   │       peers.info
│   │
│   └───rsnapshots
│       │   3-3-1709231804147.db
│       │
│       └───3-3-1709231804147
│               meta.json
│
├───127.0.0.18102
│   │   db.sqlite
│   │   db.sqlite-shm
│   │   db.sqlite-wal
│   │   raft.db
│   │
│   ├───raft
│   │       peers.info
│   │
│   └───rsnapshots
│       │   4-4-1709231801294.db
│       │
│       └───4-4-1709231801294
│               meta.json
│
└───127.0.0.18103
    │   db.sqlite
    │   db.sqlite-shm
    │   db.sqlite-wal
    │   raft.db
    │
    ├───raft
    │       peers.info
    │
    └───rsnapshots
        │   3-3-1709231801184.db
        │
        └───3-3-1709231801184
                meta.json
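A listing like the one above can also be captured from the test script itself, which makes it easier to snapshot the data directories at the moment of failure. The sketch below is one possible way to do that; `list_tree` is a hypothetical helper and is not part of the reporter's repository.

```python
import os


def list_tree(root):
    """Return every file under `root` as a sorted list of relative paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so output is comparable across platforms.
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(paths)
```

Calling `print("\n".join(list_tree(data_dir)))` for each node's data directory gives a flat version of the same information.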

otoolep avatar otoolep commented on June 6, 2024

Is that the state of the directories while the nodes are emitting "system cannot find the path specified" messages?

otoolep avatar otoolep commented on June 6, 2024

Is your cluster also getting write traffic during this time?

otoolep avatar otoolep commented on June 6, 2024

It's hard to believe the sync flag could be the cause. It's purely a read operation -- it waits until one variable in the Raft system is greater than or equal to another variable in the Raft system. There are no writes, and the variables in question do not grab locks (not at the rqlite level).
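The kind of read-only wait described above can be illustrated with a small sketch. This is not rqlite's implementation (rqlite is written in Go), and the names `wait_until_caught_up`, `get_applied` and `get_target` are hypothetical. It only shows the shape of the operation: poll two values and return once one has reached the other, without writing anything.

```python
import time


def wait_until_caught_up(get_applied, get_target, timeout=5.0, interval=0.05):
    """Poll until get_applied() >= get_target(), or give up after `timeout` seconds.

    Both callables only read state. Returns True if the condition was met,
    False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_applied() >= get_target():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Because the loop only reads, its main observable effect on the rest of the system is timing: the caller is held back until the condition holds.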

otoolep avatar otoolep commented on June 6, 2024

One hypothesis I have is that writes are taking place on your cluster, while the recovery is in progress. This could cause the behaviour you are seeing.

dwco-z avatar dwco-z commented on June 6, 2024

Is that the state of the directories while the nodes are emitting "system cannot find the path specified" messages?

Yes

dwco-z avatar dwco-z commented on June 6, 2024

Is your cluster also getting write traffic during this time?

No. The only steps being run are the ones I mentioned, so the database is not even populated. It is quite a simple start-stop-configure_peers-start test, and during each node's start I monitor the readyz output with the sync flag. These are the database instances in the broken state:
TestDatabase.zip
I tried to reproduce the issue again without the sync flag but couldn't. Maybe the flag is the determining factor, or maybe removing it just makes the issue much rarer.
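The readyz monitoring mentioned above might look roughly like the following. It is a sketch under assumptions: the helper names are hypothetical, the `/readyz?sync` URL form is taken from the rqlite documentation rather than from this thread, and the host and port in the usage note are placeholders.

```python
import time
import urllib.error
import urllib.request


def readyz_url(host, port, sync=True):
    """Build the readiness URL, optionally with the sync flag."""
    url = f"http://{host}:{port}/readyz"
    return url + "?sync" if sync else url


def http_ok(url, timeout=2.0):
    """Return True only if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx responses all mean "not ready".
        return False


def wait_ready(host, port, attempts=30, delay=1.0, check=http_ok):
    """Poll readyz until it succeeds; return the attempt number that succeeded."""
    url = readyz_url(host, port)
    for attempt in range(1, attempts + 1):
        if check(url):
            return attempt
        time.sleep(delay)
    raise RuntimeError(f"node at {host}:{port} not ready after {attempts} attempts")
```

A test would call `wait_ready("127.0.0.1", 4001)` right after starting each node. The `check` parameter exists so the polling logic can be exercised without a running node.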

otoolep avatar otoolep commented on June 6, 2024

OK, using the sync flag might change the timing of the whole sequence, so it's possible it affects the test output, but it's difficult to see how it's a direct cause.

How reproducible is this for you?

dwco-z avatar dwco-z commented on June 6, 2024

When I start a Run Until Failure, sometimes it fails on the second attempt, sometimes on the 16th; it varies. I just managed to port the scenario I was describing to Python, and I was able to hit the issue. I put it in the "rqlite-8.22.1-issue" branch => https://github.com/dwco-z/rqlite-throubleshooting/tree/rqlite-8.22.1-issue
I have also uploaded the logs (node1.log, node2.log and node3.log)
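For an intermittent failure like this, a "run until failure" loop is easy to reproduce outside an IDE. The sketch below shows the general idea; `run_until_failure` and `scenario` are hypothetical names and do not come from the linked repository.

```python
def run_until_failure(scenario, max_attempts=100):
    """Call `scenario` repeatedly until it raises.

    Returns (attempt_number, exception) for the first failing attempt,
    or (None, None) if every attempt passed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            scenario()
        except Exception as exc:
            return attempt, exc
    return None, None
```

Recording the attempt number alongside the exception gives a rough sense of how often the failure occurs, for example the 2nd versus the 16th attempt.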

otoolep avatar otoolep commented on June 6, 2024

What version of Python are you running? I could execute your repo last time, but not this time.

I'm running Python 3.8.10

$ python3 main.py 
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
starting rqlited...
starting rqlited...
starting rqlited...
stopping rqlited...
stopping rqlited...
stopping rqlited...
Traceback (most recent call last):
  File "main.py", line 82, in <module>
    start_all_instances(all_instances, joinAddresses)
  File "main.py", line 55, in start_all_instances
    future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "main.py", line 49, in start_manager
    manager.start_rqlited(joinAddresses)
  File "/home/philip/repos/rqlite-throubleshooting/rqlite_manager.py", line 47, in start_rqlited
    self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP)
AttributeError: module 'subprocess' has no attribute 'CREATE_NEW_PROCESS_GROUP'

otoolep avatar otoolep commented on June 6, 2024

Runs if I patch like so:

$ git diff rqlite_manager.py
diff --git a/rqlite_manager.py b/rqlite_manager.py
index f4cb3c8..00e6860 100644
--- a/rqlite_manager.py
+++ b/rqlite_manager.py
@@ -31,7 +31,7 @@ class RqliteManager:
     def start_rqlited(self, join_addresses=""):
         print('starting rqlited...')
         cmd = [
-            "rqlited.exe",  # Path to the rqlited executable
+            "/home/philip/repos/rqlite/src/github.com/rqlite/rqlite/cmd/rqlited/rqlited",  # Path to the rqlited executable
             f"-http-addr={self.host}:{self.http_port}",
             f"-raft-addr={self.host}:{self.raft_port}",
             f"-raft-log-level=DEBUG",
@@ -44,7 +44,7 @@ class RqliteManager:
             cmd.insert(-2, "-join=" + ",".join(join_addresses))
         # print(cmd)
         # exit()
-        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP)
+        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
 
         # Start a thread to asynchronously log stdout to a file
         self.log_thread = threading.Thread(target=self._log_stdout, daemon=True)
@@ -59,7 +59,7 @@ class RqliteManager:
     def stop_rqlited(self):
         print('stopping rqlited...')
         if self.process:
-            self.process.send_signal(signal.CTRL_BREAK_EVENT)
+            self.process.send_signal(signal.SIGTERM)
             try:
                 self.process.wait(30)
             except:
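The patch above swaps Windows-only names for POSIX ones: `subprocess.CREATE_NEW_PROCESS_GROUP` and `signal.CTRL_BREAK_EVENT` exist only on Windows, which is why the original script raised `AttributeError` on Linux. One way to make a script like this run on both platforms is to pick the flags and signal at runtime. The following is a sketch with hypothetical helper names, not a change that was made to the repository.

```python
import signal
import subprocess
import sys


def popen_platform_kwargs():
    """Extra Popen keyword arguments needed on the current platform."""
    if sys.platform == "win32":
        # CTRL_BREAK_EVENT can only be sent to a process started in a new group.
        return {"creationflags": subprocess.CREATE_NEW_PROCESS_GROUP}
    return {}


def graceful_stop_signal():
    """Signal used to request a graceful shutdown on the current platform."""
    if sys.platform == "win32":
        return signal.CTRL_BREAK_EVENT
    return signal.SIGTERM


def start_process(cmd):
    """Start `cmd` with stdout and stderr merged into one text pipe."""
    return subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        **popen_platform_kwargs(),
    )


def stop_process(process, timeout=30):
    """Ask the process to stop, then kill it if it does not exit in time."""
    process.send_signal(graceful_stop_signal())
    try:
        process.wait(timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
```

Catching `subprocess.TimeoutExpired` specifically, rather than using a bare `except:`, avoids swallowing unrelated errors such as `KeyboardInterrupt`.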

otoolep avatar otoolep commented on June 6, 2024

Nice, I seem to have hit it -- good stuff.

dwco-z avatar dwco-z commented on June 6, 2024

What version of Python are you running? I could execute your repo last time, but not this time.

I'm using Python 3.12 this time

dwco-z avatar dwco-z commented on June 6, 2024

Nice, I seem to have hit it -- good stuff.

Great

otoolep avatar otoolep commented on June 6, 2024

I believe #1714 will fix this.

otoolep avatar otoolep commented on June 6, 2024

I do not see anything particularly different between earlier versions and 8.22. Start-up timing may have changed, which brought out this issue in later releases. Basically there was an extra call to "snapshot reap" and it was in the problematic place -- that was a bug which you were hitting occasionally.

Thanks for the Python script -- made it very easy to repro.

dwco-z avatar dwco-z commented on June 6, 2024

Oh, that's interesting. I thought it could be something in the new code. Thanks! I'm glad the script helped, and I really appreciate your prompt action on this issue. I'll test the most recent version and let you know if something comes up.

dwco-z avatar dwco-z commented on June 6, 2024

@otoolep I downloaded the 8.22.2 rqlited.exe artifact from AppVeyor and started using it. I noticed that the issue is still occurring in my environment. I put the executable in the Python 'rqlite-throubleshooting' repository and got the "Exception: something is wrong, leader is not ready..." there as well, first on the 19th attempt and then on the 2nd attempt.

MD5 hash of .\rqlited.exe:
23e93ae49e554ef691c6514165b4a7c5

otoolep avatar otoolep commented on June 6, 2024

I can't run Windows binaries.

Is this still the same issue? Are you saying that if I run your script I'll still see the same issue as you initially reported?

dwco-z avatar dwco-z commented on June 6, 2024

Oh, right. Yes, it seems to be the same "failed to install snapshot" error, and I haven't changed the code, so you should be able to hit the issue in your environment. I have updated the node1.log, node2.log and node3.log files in the repository with the issue happening on the new version => https://github.com/dwco-z/rqlite-throubleshooting/tree/rqlite-8.22.1-issue

otoolep avatar otoolep commented on June 6, 2024

OK, I see the logs. This appears to be a different issue. It may have a similar root cause, but the failure is happening at a different place in the start-up sequence (the logs show this). I'll need to look into this, and it could be a few days before I make any progress.

otoolep avatar otoolep commented on June 6, 2024

Opened new issue at #1717

dwco-z avatar dwco-z commented on June 6, 2024

Ok, if you need me to test a change or something else, just let me know. Thanks!
