Comments (26)
OK, when you get this into this state again can I see a full, recursive directory listing of the data directory on each node?
from rqlite.
Thanks for the report. Some questions.
- In step 3, how do you stop the nodes?
- You mean you are executing the "Recovery Process"? By setting the peers.json file? But nothing is actually different in peers.json?
- You mean that by adding the sync flag you sometimes get the log output you added in the gist?
from rqlite.
In step 3, how do you stop the nodes?
Sending a ctrl+c signal to stop gracefully
You mean you are executing the "Recovery Process"? By setting the peers.json file? But nothing is actually different in peers.json?
Exactly, I'm testing the recovery functionality by setting the peers.json, but nothing is actually different (no suffrage change, no removal or addition)
You mean that by adding the sync flag you sometimes get the log output you added in the gist?
Yes.
from rqlite.
Here it is
├───127.0.0.18101
│   │   db.sqlite
│   │   db.sqlite-shm
│   │   db.sqlite-wal
│   │   raft.db
│   │
│   ├───raft
│   │       peers.info
│   │
│   └───rsnapshots
│       │   3-3-1709231804147.db
│       │
│       └───3-3-1709231804147
│               meta.json
│
├───127.0.0.18102
│   │   db.sqlite
│   │   db.sqlite-shm
│   │   db.sqlite-wal
│   │   raft.db
│   │
│   ├───raft
│   │       peers.info
│   │
│   └───rsnapshots
│       │   4-4-1709231801294.db
│       │
│       └───4-4-1709231801294
│               meta.json
│
└───127.0.0.18103
    │   db.sqlite
    │   db.sqlite-shm
    │   db.sqlite-wal
    │   raft.db
    │
    ├───raft
    │       peers.info
    │
    └───rsnapshots
        │   3-3-1709231801184.db
        │
        └───3-3-1709231801184
                meta.json
from rqlite.
Is that the state of the directories while the nodes are emitting "system cannot find the path specified" messages?
from rqlite.
Is your cluster also getting write traffic during this time?
from rqlite.
It's hard to believe the sync flag could be the cause. It's purely a read operation -- it waits until one variable in the Raft system is equal to, or greater than, another variable in the Raft system. There are no writes, and the variables in question do not grab locks (not at the rqlite level).
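To illustrate why a wait of this kind is read-only, here is a minimal Python sketch of the pattern described above: poll two values and return once the first has caught up with the second. The function and parameter names are hypothetical, and this is not rqlite's actual implementation (rqlite is written in Go); it only shows the shape of the loop.

```python
import time


def wait_until_caught_up(read_current, read_target, timeout=5.0, poll_interval=0.01):
    """Poll until read_current() >= read_target(), or until the timeout expires.

    Both arguments are callables that only *read* a value; this function
    never writes anything. Returns True if caught up, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if read_current() >= read_target():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A loop like this can change the timing of whatever runs around it, but it does not modify the state it observes.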
from rqlite.
One hypothesis I have is that writes are taking place on your cluster, while the recovery is in progress. This could cause the behaviour you are seeing.
from rqlite.
Is that the state of the directories while the nodes are emitting "system cannot find the path specified" messages?
Yes
from rqlite.
Is your cluster also getting write traffic during this time?
No. The only steps being performed are the ones I mentioned, so the database is not even populated. It is quite a simple start-stop-configure_peers-start test, and during each node's start I'm monitoring the readyz output with the sync flag. These are the database instances in the broken state:
TestDatabase.zip
I tried reproducing the issue again without the sync flag but I couldn't. Maybe the flag is the determining factor, or removing it makes the issue really rare.
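For readers following along, the "configure peers" step can be sketched roughly as below. This assumes the layout I recall from the rqlite recovery documentation: a raft/peers.json file under each node's data directory, holding a JSON list of entries with id, address, and non_voter fields. Please check that against the documentation for your version. The helper name, node IDs, and Raft addresses are hypothetical.

```python
import json
from pathlib import Path


def write_peers_json(data_dir, peers):
    """Write a recovery peers file for one node (assumed path: <data_dir>/raft/peers.json).

    peers is a list of (node_id, raft_address) tuples; every entry is
    written as a voter, i.e. no suffrage change.
    """
    raft_dir = Path(data_dir) / "raft"
    raft_dir.mkdir(parents=True, exist_ok=True)
    entries = [
        {"id": node_id, "address": address, "non_voter": False}
        for node_id, address in peers
    ]
    path = raft_dir / "peers.json"
    path.write_text(json.dumps(entries, indent=2))
    return path
```

In the test described here the same, unchanged membership would be written to every stopped node before restarting them.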
from rqlite.
OK, using the sync flag might change the timing of the whole sequence, so it's possible it affects the test output, but it's difficult to see how it's a direct cause.
How reproducible is this for you?
from rqlite.
When I initiate a Run Until Failure, sometimes it fails on the second attempt, sometimes on the 16th attempt; it varies. I just managed to port the scenario I was describing to Python and I was able to hit the issue. I put it in the "rqlite-8.22.1-issue" branch => https://github.com/dwco-z/rqlite-throubleshooting/tree/rqlite-8.22.1-issue
I have also uploaded the logs (node1.log, node2.log and node3.log)
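A "run until failure" loop of this kind can be sketched in a few lines of Python. This is not the code from the repository above; run_until_failure and the scenario callable are hypothetical names used only to show the idea.

```python
def run_until_failure(scenario, max_attempts=100):
    """Call scenario() repeatedly until it raises.

    Returns (attempt_number, exception) for the first failing attempt,
    or (None, None) if every attempt up to max_attempts succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            scenario()
        except Exception as exc:
            return attempt, exc
    return None, None
```

Recording the attempt number on which the failure occurs makes it easier to compare how rare the issue is with and without the sync flag.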
from rqlite.
What version of Python are you running? I could execute your repo last time, but not this time.
I'm running Python 3.8.10
$ python3 main.py
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
starting rqlited...
starting rqlited...
starting rqlited...
stopping rqlited...
stopping rqlited...
stopping rqlited...
Traceback (most recent call last):
  File "main.py", line 82, in <module>
    start_all_instances(all_instances, joinAddresses)
  File "main.py", line 55, in start_all_instances
    future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "main.py", line 49, in start_manager
    manager.start_rqlited(joinAddresses)
  File "/home/philip/repos/rqlite-throubleshooting/rqlite_manager.py", line 47, in start_rqlited
    self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP)
AttributeError: module 'subprocess' has no attribute 'CREATE_NEW_PROCESS_GROUP'
from rqlite.
Runs if I patch like so:
$ git diff rqlite_manager.py
diff --git a/rqlite_manager.py b/rqlite_manager.py
index f4cb3c8..00e6860 100644
--- a/rqlite_manager.py
+++ b/rqlite_manager.py
@@ -31,7 +31,7 @@ class RqliteManager:
     def start_rqlited(self, join_addresses=""):
         print('starting rqlited...')
         cmd = [
-            "rqlited.exe", # Path to the rqlited executable
+            "/home/philip/repos/rqlite/src/github.com/rqlite/rqlite/cmd/rqlited/rqlited", # Path to the rqlited executable
             f"-http-addr={self.host}:{self.http_port}",
             f"-raft-addr={self.host}:{self.raft_port}",
             f"-raft-log-level=DEBUG",
@@ -44,7 +44,7 @@ class RqliteManager:
             cmd.insert(-2, "-join=" + ",".join(join_addresses))
         # print(cmd)
         # exit()
-        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP)
+        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
         # Start a thread to asynchronously log stdout to a file
         self.log_thread = threading.Thread(target=self._log_stdout, daemon=True)
@@ -59,7 +59,7 @@ class RqliteManager:
     def stop_rqlited(self):
         print('stopping rqlited...')
         if self.process:
-            self.process.send_signal(signal.CTRL_BREAK_EVENT)
+            self.process.send_signal(signal.SIGTERM)
             try:
                 self.process.wait(30)
             except:
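The AttributeError arises because subprocess.CREATE_NEW_PROCESS_GROUP and signal.CTRL_BREAK_EVENT exist only on Windows. One possible way to let the same script run on both platforms, instead of patching it by hand, is to select the options at runtime. The helper below is a sketch with a hypothetical name, not part of the repository.

```python
import signal
import subprocess
import sys


def platform_process_options(platform=sys.platform):
    """Return (popen_kwargs, stop_signal) suited to the given platform.

    On Windows the child is started in its own process group so that it can
    be sent CTRL_BREAK_EVENT; elsewhere no extra flags are used and SIGTERM
    is sent. getattr() is used so that this module still imports and runs
    on systems where the Windows-only names are missing.
    """
    if platform == "win32":
        popen_kwargs = {
            "creationflags": getattr(subprocess, "CREATE_NEW_PROCESS_GROUP", 0)
        }
        stop_signal = getattr(signal, "CTRL_BREAK_EVENT", signal.SIGTERM)
        return popen_kwargs, stop_signal
    return {}, signal.SIGTERM
```

The returned values would then be used as subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, **popen_kwargs) when starting the node and self.process.send_signal(stop_signal) when stopping it.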
from rqlite.
Nice, I seem to have hit it -- good stuff.
from rqlite.
What version of Python are you running? I could execute your repo last time, but not this time.
I'm running Python 3.8.10
I'm using Python 3.12 this time
from rqlite.
Nice, I seem to have hit it -- good stuff.
Great
from rqlite.
I believe #1714 will fix this.
from rqlite.
I do not see anything particularly different between earlier versions and 8.22. Start-up timing may have changed, which brought out this issue in later releases. Basically there was an extra call to "snapshot reap" and it was in the problematic place -- that was a bug which you were hitting occasionally.
Thanks for the Python script -- made it very easy to repro.
from rqlite.
Oh, that's interesting, I thought it could be something in the new code. Thanks! I'm glad the script helped, and I really appreciate your prompt action in handling this issue. I'll test the most recent version and let you know if something comes up.
from rqlite.
@otoolep I downloaded the 8.22.2 rqlited.exe artifact from AppVeyor and started using it. I noticed that the issue is still occurring in my environment. I put the executable in the Python 'rqlite-throubleshooting' repository and I could get the "Exception: something is wrong, leader is not ready..." there as well, first on the 19th attempt and then on the 2nd attempt.
MD5 hash of .\rqlited.exe:
23e93ae49e554ef691c6514165b4a7c5
from rqlite.
I can't run Windows binaries.
Is this still the same issue? Are you saying that if I run your script I'll still see the same issue as you initially reported?
from rqlite.
Oh right. Yes, it seems to be the same "failed to install snapshot" error, and I haven't changed the code, so you might be able to hit the issue in your environment. I have updated the node1.log, node2.log and node3.log files in the repository with the issue happening in the new version => https://github.com/dwco-z/rqlite-throubleshooting/tree/rqlite-8.22.1-issue
from rqlite.
OK, I see the logs. This appears to be a different issue. It may have a similar root cause, but the failure is happening at a different place in the start-up sequence (the logs show this). I'll need to look into this; it could be a few days before I make any progress.
from rqlite.
Opened new issue at #1717
from rqlite.
Ok, if you need me to test a change or something else, just let me know. Thanks!
from rqlite.