rook / rook
Storage Orchestration for Kubernetes
Home Page: https://rook.io
License: Apache License 2.0
You need a ceph.conf file to start a rados client, and with the tools as they are it is hard to create one.
I'd like to see an option on the rook tool that emits a usable ceph.conf file, either to a file or to stdout.
Likewise, I'd like a similar command that facilitates generating a keyring file for the cephx keys.
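For reference, the emitted files only need a handful of fields. A minimal ceph.conf might look like this (all values below are placeholders):

[global]
fsid = <cluster uuid>
mon initial members = mon0
mon host = 10.0.0.1:6790
keyring = /etc/ceph/keyring

And a minimal keyring for the admin cephx key:

[client.admin]
key = <base64 cephx secret>
caps mon = "allow *"
caps osd = "allow *"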
When castled starts up, it overwrites any existing ceph config files with its hard-coded values for logging levels. This makes it impossible to enable DEBUG logging without rebuilding the binary.
We should make it easier to enable DEBUG logging, especially for bootstrapping scenarios. Perhaps a --debug switch to castled?
If the state stored in etcd is lost, it may not be possible to recover the cluster.
Number 2 is a problem because the discovery.etcd.io token names etcd members which are no longer valid. This could be fixed by an enhanced discovery service.
When etcd membership changes in the castle cluster, the leader should update the discovery service so it reflects the latest etcd membership.
Look into using ccache to speed up clean builds, and into passing the -j parameter to make and to nested projects like rocksdb. Finally, for CircleCI, cache the docker image.
The process management needs to be cleaned up. The ProcManager needs to use the Executor to launch processes instead of launching processes itself. This will also ensure the logging from child processes is captured consistently.
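A rough sketch of the intended layering, with hypothetical names (this is not the actual castle API):

package proc

import "os/exec"

// Executor is a hypothetical stand-in for the existing Executor abstraction;
// it centralizes process launching and log capture.
type Executor interface {
	// StartProcess launches a child process and wires its stdout/stderr
	// into the daemon's logging so child output is captured consistently.
	StartProcess(name string, args ...string) (*exec.Cmd, error)
}

// ProcManager tracks child processes but delegates all launching to the
// Executor instead of calling exec.Command itself.
type ProcManager struct {
	executor Executor
}

func (p *ProcManager) Start(name string, args ...string) (*exec.Cmd, error) {
	return p.executor.StartProcess(name, args...)
}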
Switch to github.com/coreos/pkg/capnslog or something else that supports log levels, etc. It would also be good to tie that logging to ceph logging levels.
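A minimal sketch of what capnslog usage could look like; tying the level through to ceph's own debug levels would be a second step:

package main

import (
	"os"

	"github.com/coreos/pkg/capnslog"
)

var logger = capnslog.NewPackageLogger("github.com/quantum/castle", "cephmgr")

func main() {
	// Log to stdout with a human-readable format.
	capnslog.SetFormatter(capnslog.NewPrettyFormatter(os.Stdout, false))
	// A --debug style switch would flip this to capnslog.DEBUG.
	capnslog.SetGlobalLogLevel(capnslog.INFO)

	logger.Infof("starting up")
	logger.Debugf("only shown when the global level is DEBUG")
}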
Now that jemalloc/jemalloc#442 (comment) is fixed, we need to re-enable jemalloc.
It's not clear why we need to use it, though; maybe it should be removed completely.
Hardware discovery and orchestration currently rely on the disk serial number for a constant identity. However, a serial number is not always available, as seen in a container. We need to rely solely on uuids for a stable disk identity.
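A minimal sketch of building a uuid -> device map from the udev-maintained /dev/disk/by-uuid symlinks (note these entries only exist once a filesystem is present, so blank disks would still need blkid-level probing):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// mapByUUID returns filesystem UUID -> kernel device path (e.g. /dev/sdb1),
// read from the /dev/disk/by-uuid symlinks maintained by udev.
func mapByUUID() (map[string]string, error) {
	entries, err := os.ReadDir("/dev/disk/by-uuid")
	if err != nil {
		return nil, err
	}
	m := map[string]string{}
	for _, e := range entries {
		target, err := filepath.EvalSymlinks(filepath.Join("/dev/disk/by-uuid", e.Name()))
		if err != nil {
			continue // stale link; skip it
		}
		m[e.Name()] = target
	}
	return m, nil
}

func main() {
	m, err := mapByUUID()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for uuid, dev := range m {
		fmt.Printf("%s -> %s\n", uuid, dev)
	}
}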
When building on Windows with
build/run make -j4 cross
A build error occurs:
# github.com/quantum/castle/pkg/util/proc
pkg/util/proc/procmanager.go:141: undefined: syscall.Kill
That functionality is only available on linux. This dependency occurs because castlectl pulls in the proc package for the Executor functionality, but it doesn't need the procmanager that comes along with it (which contains the linux-only syscall.Kill).
Consider splitting into new packages or using build tags for the linux-only code.
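A sketch of the build-tag option: isolate the syscall into platform-specific files so castlectl compiles everywhere (file names and the helper are illustrative):

// procmanager_linux.go

// +build linux

package proc

import "syscall"

// kill sends SIGKILL to pid; only compiled on linux.
func kill(pid int) error {
	return syscall.Kill(pid, syscall.SIGKILL)
}

// procmanager_stub.go

// +build !linux

package proc

import "fmt"

// kill is a stub so castlectl can link the Executor bits on other platforms.
func kill(pid int) error {
	return fmt.Errorf("process kill is not supported on this platform")
}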
On macOS we rsync the source tree from the host to the build container. This is a one-way sync: if the vendor directory is not populated before calling build/run, it will be populated inside the container but never copied back to the host. As a result, each time build/run runs it has to fetch the vendor directory again.
An easy workaround is to call make vendor on the host before calling build/run. We need a better solution. We could bind mount just the vendor directory, but this issue gets in the way: Masterminds/glide#642.
Also related: when we install glide during make it goes in the tools dir, with the same problem as above. Glide is also installed for the build container's arch, so if we bind mount it, it will be the wrong arch for the host on the mac.
We can't log to /var/log/castle since we don't run privileged.
The release build currently builds rkt ACIs but does not publish them. We need a home for storing them that works with rkt trust. S3 seems like an obvious choice, but there are some issues with using HTTP/S and rkt trust; see appc/spec#319.
Also relevant: https://coreos.com/rkt/docs/latest/signing-and-verification-guide.html
When a node comes up, its devices should be configured with osds even if the orchestration fails to configure the monitors. The osd config does at least require mon quorum.
For example, say we already have one node in the cluster with a single monitor running. Now two new machines come online and the orchestration chooses to increase from 1 to 3 monitors. If the new monitors fail to start, there is no reason to skip configuration of osds on nodes 2 and 3, assuming the first mon is still healthy.
We cannot use the mocked types from cephmgr/client/test to test the cephmgr/client package itself because of a cyclical dependency. This indicates a layering issue.
We could move some of the helper methods (e.g. pool.go, auth.go) to a package that is a peer of cephmgr/client. Essentially, usage of the cephmgr/client interfaces should not be within the same package, or else they cannot be tested using the mocked implementations of those interfaces.
/castle/pkg/cephmgr/client
> go test
# github.com/quantum/castle/pkg/cephmgr/client
import cycle not allowed in test
package github.com/quantum/castle/pkg/cephmgr/client (test)
imports github.com/quantum/castle/pkg/cephmgr/client/test
imports github.com/quantum/castle/pkg/cephmgr/client
FAIL github.com/quantum/castle/pkg/cephmgr/client [setup failed]
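Besides moving the helpers to a peer package, Go's external test packages are another escape hatch: test files in the same directory declared as package client_test compile separately, so they may import both client and client/test without a cycle. A sketch with hypothetical type names:

// pool_test.go — lives in pkg/cephmgr/client but compiles as a separate
// package, so it can import both client and client/test without a cycle.
package client_test

import (
	"testing"

	"github.com/quantum/castle/pkg/cephmgr/client"
	clienttest "github.com/quantum/castle/pkg/cephmgr/client/test"
)

func TestWithMock(t *testing.T) {
	// MockConnection is a hypothetical mock implementing a hypothetical
	// client.Connection interface; the names are illustrative only.
	var conn client.Connection = &clienttest.MockConnection{}
	_ = conn
}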
Device initialization for OSDs takes on the order of minutes. Having the agent configure all devices before returning to the orchestration leader causes several problems.
Completion of OSD configuration is a process that is quite independent from core orchestration. The only thing the orchestrator really cares about is whether the agent is actively configuring the OSDs. There are no orchestration failover scenarios that OSDs need to worry about, unlike the more critical etcd and mon services.
Upon a request to configure OSDs, the agent can immediately return success to the orchestrator that it is working on the configuration. There is no need to wait for all the OSDs to complete.
For applications that need a guarantee that storage is available (i.e. OSD configuration is complete), the orchestrator can provide a helper to signal progress of available OSDs.
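A minimal sketch of the shape this could take in the agent; the status type and messages are illustrative, not the actual castle code:

package osd

import "log"

// statusWriter is a stand-in for wherever the agent publishes progress,
// e.g. an etcd key the orchestrator can watch. Purely illustrative.
type statusWriter struct{}

func (statusWriter) Set(s string) { log.Println("osd status:", s) }

type osdAgent struct {
	status statusWriter
}

// configureDevice stands in for the minutes-long partition/format/mkfs work.
func (a *osdAgent) configureDevice(device string) error { return nil }

// ConfigureOSDs kicks off device initialization in the background and returns
// immediately, so the orchestration leader is only told "the agent is on it".
func (a *osdAgent) ConfigureOSDs(devices []string) error {
	a.status.Set("configuring")
	go func() {
		for _, dev := range devices {
			if err := a.configureDevice(dev); err != nil {
				a.status.Set("failed on " + dev)
				return
			}
			a.status.Set("ready: " + dev) // per-device progress signal
		}
		a.status.Set("complete")
	}()
	return nil // success means "configuration started", not "osds ready"
}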
Storage pools are currently created with the default settings for replication. castlectl should also support operations on erasure code profiles, so that erasure coded storage pools can be created.
The listing of storage pool details should also include information about replication/erasure code profile.
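For reference, the underlying ceph operations castlectl would need to drive look roughly like this (standard ceph CLI; the castlectl command surface is still to be designed):

ceph osd erasure-code-profile set myprofile k=2 m=1
ceph osd pool create ecpool 128 128 erasure myprofile
ceph osd pool ls detail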
This is benign but should be removed nonetheless
Oct 24 16:03:26 castle00 rkt[1528]: sh: lsb_release: not found
Looks like we need to set the timezone in the container to match the host.
bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> stat pkg/clusterd/inventory/hardware.go
16777220 119333050 -rw-r--r-- 1 bassam staff 0 1988 "Oct 18 22:28:40 2016" "Oct 9 17:05:44 2016" "Oct 9 17:05:44 2016" "Oct 9 17:05:44 2016" 4096 8 0 pkg/clusterd/inventory/hardware.go
bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> build/run stat pkg/clusterd/inventory/disk.go
File: 'pkg/clusterd/inventory/disk.go'
Size: 9098 Blocks: 24 IO Block: 4096 regular file
Device: fe02h/65026d Inode: 2501685 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 501/ UNKNOWN) Gid: ( 20/ dialout)
Access: 2016-10-19 20:24:29.319294497 +0000
Modify: 2016-10-19 21:14:35.000000000 +0000
Change: 2016-10-19 20:21:08.969798242 +0000
Birth: -
We would like castled to automatically manage disk devices, including partitioning them, formatting them etc. For example
castled --data-devices=/dev/sdb,/dev/sdc
castled should prepare these disks any way it wants, including formatting them for bluestore, etc.
There are two considerations worth highlighting for this approach:
1. Preparing disks (partitioning, formatting) requires root privileges and host tools such as sgdisk.
2. It's our goal to minimize dependencies on the host distro, to support running in minimal containers or hosts like CoreOS.
One possible approach to balance the two issues is to add a new verb to castled:
castled prepare --data-devices=/dev/sdb,/dev/sdc
This would require root privileges and also requires some linux tools like sgdisk to be available. castled can also support a flag that automatically prepares disks if they are not already prepared:
castled --data-devices=/dev/sdb,/dev/sdc --data-devices-prepare=auto
--data-devices-prepare should default to "auto" but do nothing (and not require root) if the disks are already prepared. It can also be set to "disabled" or "false".
This would enable the caller to decide whether to prepare the disks ahead of time, and run castled in a non-root account, or give castled root privs and let it auto-prepare.
With regard to the tools castled needs to partition devices, it seems wise to use alpine linux in our containers for tools like sgdisk and lsblk. While it's possible to write these in go/cgo and remove the dependency (see bassam/rook-old@b07e06e for an example of a cgo lsblk), we should hold off on doing that until we understand all the dependencies we need. #73 is also related to this.
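A minimal sketch of what the per-device prepare step could look like, shelling out to sgdisk (the sgdisk flags are standard; the surrounding structure is illustrative):

package prepare

import (
	"fmt"
	"os/exec"
)

// prepareDevice wipes the partition table and creates a single partition
// spanning the disk. Requires root, and sgdisk must be on PATH.
func prepareDevice(device string) error {
	// --zap-all destroys the GPT and MBR data structures on the device.
	if out, err := exec.Command("sgdisk", "--zap-all", device).CombinedOutput(); err != nil {
		return fmt.Errorf("zap %s failed: %v: %s", device, err, out)
	}
	// -n 1:0:0 creates partition 1 using the largest available block.
	if out, err := exec.Command("sgdisk", "-n", "1:0:0", device).CombinedOutput(); err != nil {
		return fmt.Errorf("partition %s failed: %v: %s", device, err, out)
	}
	return nil
}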
The public and private ip addresses are passed as command-line parameters, but only the private ip is currently used everywhere. We need to utilize the public ip wherever public networking is involved, such as for mons and osds.
The demo vagrantfile generates a new etcd discovery token every time a machine comes up. We need to use the same discovery token to get the cluster going.
The timeout for the orchestration leader is two minutes when waiting for a node to respond. If a node has many disks on which to configure osds, it will take longer than this timeout and fail the orchestration even though the osds will still succeed.
A simple way to extend the timeout is a type of heartbeat after each osd is configured. This will signal to the leader that the node is still working and the timeout can be reset.
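A minimal sketch of the leader-side wait loop, assuming the agent posts a heartbeat after each osd (the channel wiring is illustrative):

package orchestrate

import (
	"errors"
	"time"
)

// waitForNode waits for done, but resets the two-minute deadline every time a
// heartbeat arrives, so a node with many disks isn't failed while it is still
// making progress.
func waitForNode(done <-chan struct{}, heartbeat <-chan struct{}) error {
	timeout := 2 * time.Minute
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-done:
			return nil
		case <-heartbeat:
			// Progress was made: push the deadline out again.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(timeout)
		case <-timer.C:
			return errors.New("node timed out with no progress")
		}
	}
}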
We need a simple vagrant file to bring up castle for dev/test scenarios
We need to enable bluestore for high-performance testing of the castle clusters. Filestore should still be the default option.
castled --store=[filestore | bluestore]
systemctl stop castled.service
Oct 13 22:29:03 castle00 rkt[2122]: 2016-10-13 22:29:03.708259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.709403 I | Discovered 1 nodes
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.710259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:12 castle00 systemd[1]: Stopping Castle Daemon - software defined storage...
Oct 13 22:29:12 castle00 systemd[1]: castled.service: Killing process 2213 (castled) with signal SIGKILL.
Oct 13 22:29:12 castle00 systemd[1]: Stopped Castle Daemon - software defined storage.
core@castle00 ~ $
When a castled server that is already a member of the etcd cluster fails, the etcdmgr should create an instance of embedded etcd on an existing healthy node to replace the failed one.
When running in rkt on CoreOS this is the output of systemctl status castled
├─machine.slice
│ └─castled.service
│ ├─2065 /usr/bin/castled daemon --type=mon -- --foreground --cluster=castlecluster --name=mon.mon0 --mon-data=/tmp/mon0/mon.mon0 --conf=/tmp/mon0/castlecluster.config --public-addr=172.20.20.10:6790
│ └─2233 /usr/bin/castled
Note that process 2065 is not a child of 2233.
Also, when process 2233 is killed, 2065 remains. This tells me something is wrong with how we start child processes.
Usage information should be displayed when there is an error:
> bin/castled
2016-10-12 13:00:22.717331 I | cluster max size is: 1
2016-10-12 13:00:22.744193 I | currentNodes: []
2016-10-12 13:00:22.744218 I | current localURL: http://127.0.0.1:2379
2016-10-12 13:00:22.744258 I | creating a new embedded etcd...
2016-10-12 13:00:22.744338 I | conf: {e16d06178d2c471eb4b77c9489f00be7 [{http <nil> 127.0.0.1:2380 false }] [{http <nil> 127.0.0.1:2379 false }] [{http <nil> 127.0.0.1:2380 false }] [{http <nil> 127.0.0.1:2379 false }] /tmp/etcd-data}
2016-10-12 13:00:22.744356 I | client urls to set listeners for: [{http <nil> 127.0.0.1:2379 false }]
Error: listen tcp 127.0.0.1:2379: bind: address already in use
Usage:
castled [flags]
castled [command]
Available Commands:
version Print the version number of castled
Flags:
--devices string comma separated list of devices to use
--discovery-url string etcd discovery URL. Example: http://discovery.castle.com/26bd83c92e7145e6b103f623263f61df
--etcd-members string etcd members to connect to. Overrides the discovery URL. Example: http://10.23.45.56:2379
--force-format true to force the format of any specified devices, even if they already have a filesystem. BE CAREFUL!
-h, --help help for castled
--location string location of this node for CRUSH placement
--private-ipv4 string private IPv4 address for this machine (default "127.0.0.1")
Use "castled [command] --help" for more information about a command.
castled error: listen tcp 127.0.0.1:2379: bind: address already in use
Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895783 I | Running command: lsblk --all -n -l --output KNAME
Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895985 I | error while discovering hardware: failed to list all devices: Failed to complete lsblk all: exec: "lsblk": executable file not found in $PATH
Oct 13 22:07:02 castle00 rkt[2122]: waiting for ctrl-c interrupt...
When starting the bluestore osds in the demo vagrant environment, we intermittently see the osd fail with the following error.
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.147976 c937080 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.177217 c937080 -1 bluestore(/var/lib/castled/osd0) _read_fsid unparsable uuid
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189627 c937080 -1 bdev(/var/lib/castled/osd0/block) open open got: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189686 c937080 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189758 c937080 -1 ** ERROR: error creating empty object store in /var/lib/castled/osd0: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.191071 I | ERROR: failed to config osd on device sdd. failed to initialize OSD at /var/lib/castled/osd0: failed osd mkfs for OSD ID 0, UUID 27eaf968-5e1c-4ddd-a967-6e02291c3c4e, dataDir /var/lib/castled/osd0: failed to run osd: exit status 1
Using a machine id is not as friendly as using a node name. We could possibly default to the host name or something like that.
Environment variables should be prefixed with CASTLED_ instead of CASTLE_ to match the binary name.
When an etcd cluster size of greater than one is specified, the first machine will wait for more machines. However, later machines will come up and attempt to continue even though quorum is not formed.
Specifying devices by name is not a reliable choice mechanism. The most common scenario is to bring up OSDs on all devices except the system disk so we should make that the default.
etcdmgr should be able to add new etcd members to the etcd cluster quorum dynamically as the castle cluster grows.
etcdmgr should be able to remove existing etcd members from the etcd cluster quorum dynamically as the castle cluster shrinks.
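For illustration, the raw membership operations are available in the etcd client API; a sketch using the v3 client (the castle code may use a different client version):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Grow: announce the new member's peer URL before its etcd starts.
	resp, err := cli.MemberAdd(ctx, []string{"http://10.20.30.40:2380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member %x", resp.Member.ID)

	// Shrink: remove the member by ID once the node leaves the cluster.
	if _, err := cli.MemberRemove(ctx, resp.Member.ID); err != nil {
		log.Fatal(err)
	}
}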
I only see private_ipv4; we need to add a public one as well.
Each castled server creates an etcd client and passes it to other parts of the code as part of the clusterd context. Since clusterd supports dynamic resizing of the etcd cluster, the original etcd client can become outdated, which leads to timeouts and failure of castled.
The fix should update the etcd client in the context whenever the etcd membership changes.
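A minimal sketch of one possible fix using the etcd v3 client, which can re-sync its endpoint list from the live cluster; the actual castle code may use a different client version:

package etcdmgr

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// refreshEndpoints keeps a shared client's endpoint list in step with the
// cluster: Sync queries current membership and swaps in the live client URLs.
func refreshEndpoints(cli *clientv3.Client, every time.Duration) {
	for range time.Tick(every) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		if err := cli.Sync(ctx); err != nil {
			log.Printf("etcd endpoint sync failed: %v", err)
		}
		cancel()
	}
}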
We are hurting without a health command in the castle tool. There's no insight into the health of a cluster.
In a production environment we expect two independent networks to be configured: a public network and a cluster network.
For this ceph configuration, see http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/.
castled currently requires /etc/machine-id to be available. When running in a scratch container this is not the case. We should support getting a node identity via the command line and environment.
To support quick/dirty dev scenarios where you may have manually copied some files into the vendor directory, it would be nice if the Makefile could be given an explicit option to skip vendoring (glide install) so the temporary dirty dev changes are not overwritten.
If no port number is specified on rook's --api-server-endpoint option value, we should default to trying the default port.
The default port number is not easy to remember and this would make the tool easier to use.
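A minimal sketch of the parsing, using net.SplitHostPort to detect a missing port (the default port value below is illustrative):

package main

import (
	"fmt"
	"net"
)

// withDefaultPort appends defPort when the endpoint has none, so both a bare
// host and a full host:port work for --api-server-endpoint.
func withDefaultPort(endpoint, defPort string) string {
	if _, _, err := net.SplitHostPort(endpoint); err == nil {
		return endpoint // already host:port
	}
	return net.JoinHostPort(endpoint, defPort)
}

func main() {
	fmt.Println(withDefaultPort("10.0.0.1", "8124"))      // 10.0.0.1:8124
	fmt.Println(withDefaultPort("10.0.0.1:9000", "8124")) // 10.0.0.1:9000
}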
The embedded etcd cluster sometimes returns a list of current client endpoints before they are fully initialized. This leads to failure of castled.
I'd like a command-line way to learn about the monitors in a cluster (where [ip, hostname], how many, health, other stats). I'm thinking about something that works like the "node ls" command.
Perhaps "rook monitor ls" or just "rook monitors".
When running a build inside a container using build/run make -j4, hitting CTRL+C sometimes hangs. An explicit docker stop cross-build will stop it. This is likely due to the make program not handling signals correctly when it's started inside a container as pid 1.
Currently CASTLED_DATA_DEVICES supports device names such as sdb, which are not stable and can vary on every boot. We need to support different options for specifying which disks to use and not use for storage.
One option would be to let the user use ALL disks for storage except the system disk. It would be easy to find the system disk and exclude it. For example,
CASTLED_DATA_DEVICES=all
For this case, the system disk should be skipped with an INFO message, instead of:
Oct 24 16:03:26 castle00 rkt[1528]: 2016-10-24 16:03:26.685353 I | ERROR: failed to config osd on device sdd. failed device sdd. device sdd already formatted with ext4
Another option would be to support arbitrary filtering criteria for disks based on information obtainable by libblkid or lsblk, for example:
CASTLED_DATA_DEVICES="SUBSYSTEM=block:scsi:pci,SIZE>=5TB"
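A rough sketch of evaluating a size filter against lsblk output; the filter grammar itself is the open design question:

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// listBigBlockDevices returns disks whose size is at least minBytes, using
// lsblk's machine-readable output: -d skips partitions, -n drops the header,
// -b prints SIZE in bytes.
func listBigBlockDevices(minBytes uint64) ([]string, error) {
	out, err := exec.Command("lsblk", "-d", "-n", "-b", "-o", "NAME,SIZE").Output()
	if err != nil {
		return nil, err
	}
	var devices []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		size, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		if size >= minBytes {
			devices = append(devices, fields[0])
		}
	}
	return devices, nil
}

func main() {
	devs, err := listBigBlockDevices(5 * 1000 * 1000 * 1000 * 1000) // ~5TB
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(devs)
}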
When castle is running on Ubuntu, creating storage pools fails because the mons appear to have a dependency on crushtool existing in $PATH.
the castled API handler log shows:
failed to create new pool '{Name:jaredPool1 Number:0}': mon_command osd pool create failed, buf: , info: crushtool check failed with -22: crushtool: exec failed: (2) No such file or directory: cephd: Invalid argument
This ceph mailing list thread explains that mons need crushtool:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003392.html
Here's the codepath where it's used:
https://github.com/quantum/ceph/blob/47b199109bbff1db37ddff9461652e30d79df330/src/mon/OSDMonitor.cc#L4849
This works on CoreOS because we have some ceph tools embedded in the image: https://github.com/quantum/coreos-overlay/blob/master/sys-cluster/ceph/ceph-9999.ebuild#L76
There may be more unexpected dependencies on the ceph tools lurking in the ceph code.
repro steps on Ubuntu:
./bin/linux_amd64/castled
./bin/linux_amd64/castlectl pool create --pool-name="mypool1"