azure-resource-manager-dse's People

Contributors

benofben, cpoczatek, raks100, scotthds, simongdavies, stinkymatt

azure-resource-manager-dse's Issues

Switch to OpenJDK

OpenJDK 7 recently became a supported VM for DSE. OpenJDK 8 is on deck for 4.7.2. We should switch the templates to use that.

Rack Awareness

Support rack-aware DC configuration by setting up the config file and snitch.
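
As a rough illustration, assuming the snitch in question is GossipingPropertyFileSnitch (which reads each node's data center and rack from cassandra-rackdc.properties), the per-node file could be rendered along these lines. The helper name and the DC/rack names are hypothetical and not taken from the templates.

```python
def render_rackdc(dc: str, rack: str) -> str:
    """Return the contents of cassandra-rackdc.properties for one node.

    The dc/rack values passed in are placeholders; how racks would map onto
    Azure constructs is left open here.
    """
    return f"dc={dc}\nrack={rack}\n"


if __name__ == "__main__":
    # Hypothetical names, used only to show the output format.
    print(render_rackdc("dc0", "rack1"), end="")
```

Each node would get its own copy of the file, with the rack value differing between nodes that should be treated as separate failure groups.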

IP Cluster Size Limitation

The template uses a single subnet. Azure reserves the lowest addresses in the subnet, and 10.0.0.255 is the broadcast address, leaving 10.0.0.5-10.0.0.254 for the template to use.

10.0.0.5 is currently used by OpsCenter. 10.0.0.6 and up are used by DSE nodes. So, a maximum of 254-6+1=249 DSE nodes can be created.

In the future we'll want to spread deployment over multiple broadcast groups.
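
The node limit can be double-checked with a few lines of Python. This is a minimal sketch that assumes the subnet is 10.0.0.0/24 (the issue does not state the prefix length) and that the last address in the subnet is not assignable; the function name is made up for illustration.

```python
import ipaddress


def max_dse_nodes(cidr: str = "10.0.0.0/24", first_dse_offset: int = 6) -> int:
    """Count addresses from the first DSE node address up to the last
    non-broadcast address in the subnet."""
    net = ipaddress.ip_network(cidr)
    first_dse = int(net.network_address) + first_dse_offset  # 10.0.0.6
    last_usable = int(net.broadcast_address) - 1             # 10.0.0.254
    return max(0, last_usable - first_dse + 1)


if __name__ == "__main__":
    print(max_dse_nodes())
```

Passing a wider prefix such as "10.0.0.0/16" shows how much headroom a larger subnet would give under the same assumptions.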

Template Refactor

The templates have over a year of history and during that time have accumulated a lot of cruft. I'm working on a major rewrite with the goal of compartmentalizing logic so it is reusable. With that complete, we'll have a revised simple template and can then begin work on a multi-data-center template (capable of deploying n nodes in each of m data centers/regions).

To reduce parameters passed between sub-templates, I'm making a lot of assumptions. I'm going to use this issue to document that logic and thought process.

The first step is to create a simple template that creates an OpsCenter instance. To remove dependencies, opsCenterNode.json will create its own storage account for the OpsCenter node (possibly nodes at a later date if we add HA). It will also always bind to the private IP 10.0.0.6. The vnet will be passed in as a parameter, and the region to deploy to will be determined from where the vnet is.

So, we're presuming the existence of a vnet and associated subnets. Besides that, I think the only other parameter we need is username/pw.

ssh key

Mahesh has modified the templates to accept either a password or an SSH key. opscenter.sh needs to be updated to work with the SSH key; it currently works only with a password.

SSL Certificate Error

If SSL is enabled for OpsCenter (by uncommenting lines in opscenter.sh), the user will see an SSL certificate error. This is because the SSL certificate OpsCenter uses is self-signed.

There is no obvious fix for this as the servers are created on demand with new IPs and fingerprints.

Ephemeral Drive Owned by Cassandra User

Ideally the ephemeral drive would be owned by the Cassandra user. However, the Cassandra user is created by the OpsCenter deploy, and the ephemeral drive is mounted in the node script before that.

DataStax VM Image

Add a "custom extension" to provide the username and password. This would improve template reporting and eliminate the need for users to register at datastax.com.

Azure Timeout

There's an intermittent issue where Azure times out when creating a cluster. We don't see this when creating 4- or 12-node clusters, but it occurs ~50% of the time when creating 36-node clusters. Mahesh believes this is due to an outdated driver in the Ubuntu image and has asked Canonical to update the image with a patch he is providing.

HTTP 401, 304 and 404 for dse apt-get

For some reason the script is generating a lot of HTTP errors. We need to understand why. Here are the stats for my user for the last week:

Status  Requests  Share
200     5,755     59.208%
404     3,822     39.321%
304     132       1.358%
401     11        0.113%

Only the 401s seem to generate an error visible to the client.
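
For reference, the percentages follow directly from the request counts. A small sketch (the variable and function names are illustrative only):

```python
# Request counts per HTTP status, as reported in the issue.
counts = {200: 5755, 404: 3822, 304: 132, 401: 11}


def shares(status_counts):
    """Return each status code's share of all requests, in percent."""
    total = sum(status_counts.values())
    return {status: 100.0 * n / total for status, n in status_counts.items()}


if __name__ == "__main__":
    for status, pct in shares(counts).items():
        print(f"{status}: {pct:.3f}%")
```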

Pre-runtime Template Input Validations

There are various validations (is there a name conflict, is a password of sufficient strength, is a username valid) that are performed on the backend but not in the web UI. Ideally this would happen in the web UI.

Linux Extensions Don’t Propagate Failure

Mahesh has asked me to return an error code if the Java install fails. It's likely that as part of this we should break out the Java install for all three templates.
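
One way the failure could be surfaced, sketched in Python rather than in the extension's actual shell script: run each install step, and stop with that step's exit code as soon as one fails so the extension reports a non-zero status. The function names are hypothetical, and the real install commands are not shown here.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one install step and return its exit code."""
    return subprocess.run(cmd).returncode


def run_all(steps):
    """Run steps in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        code = run_step(cmd)
        if code != 0:
            print(f"step failed ({code}): {' '.join(cmd)}", file=sys.stderr)
            return code
    return 0


# Usage sketch: sys.exit(run_all(steps)), where steps is a list of
# command lists (for example, the Java install followed by the DSE setup).
```

Breaking the Java install out into its own step, as suggested above, would make it one entry in that list shared by all three templates.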

Tune Templates by Machine Type

Parameters that would vary depending on the machine type include:

  • flush writers
  • compaction throughput
  • DataStax Agent heap from 128mb to 512mb
  • DataStax Agent queue size.

Synchronous DSE Install

Currently opscenter.sh makes a REST call to OpsCenter at the end of the script. The script then exits and the operation the REST call begins (installing a DSE cluster) continues asynchronously.

Azure will then return complete and successful, even if DSE is still installing (and potentially fails). Potentially we may want to make this behavior synchronous and monitor for errors in the DSE deployment.

... or we may wish to keep them decoupled. There are arguments for each...
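
If the synchronous route were chosen, the script would need to poll until the install finishes. The following is only a sketch: get_status stands in for whatever call would report the state of the OpsCenter install job, and the "success" and "failed" strings are assumed values, not OpsCenter's actual responses.

```python
import time


def wait_until_done(get_status, timeout_s=3600.0, interval_s=30.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports success or failure.

    Returns True on success and False on failure; raises TimeoutError if
    neither is reported before the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "success":   # assumed status value
            return True
        if status == "failed":    # assumed status value
            return False
        if clock() >= deadline:
            raise TimeoutError("install did not finish in time")
        sleep(interval_s)
```

The script could then exit non-zero when this returns False or times out, which would tie the Azure deployment result to the DSE install result. Keeping the two decoupled simply means skipping this step.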

Set ulimit above default

In a discussion with an existing Azure DataStax user today, it was noted that ulimits are too low by default. It was suggested that the templates should set those higher for system, cassandra and *. I don't see a downside to doing this.
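
A sketch of what the generated limits.conf entries might look like. The specific values below are assumptions chosen for illustration, not figures from the discussion, and "system" is interpreted here as the root user. Root is listed explicitly because the * wildcard in limits.conf does not apply to root.

```python
# Assumed values for illustration; tune to the actual recommendation.
LIMITS = {
    "memlock": "unlimited",
    "nofile": "100000",
    "nproc": "32768",
    "as": "unlimited",
}
DOMAINS = ["root", "cassandra", "*"]


def render_limits(domains=DOMAINS, limits=LIMITS):
    """Render limits.conf lines; the '-' type sets both soft and hard limits."""
    lines = []
    for domain in domains:
        for item, value in limits.items():
            lines.append(f"{domain} - {item} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_limits(), end="")
```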

Multi Datacenter Provision Fails

I had customized the pointer to reference a local file instead of a web URL.

The original entry:-
"https://raw.githubusercontent.com/DSPN/azure-resource-manager-dse/master/extensions/opsCenter.py",

Updated in the /home//azure-resource-manager-dse/multidc/opscenter.py file:-
"file:///home/indersingh/azure-resource-manager-dse/extensions/opsCenter.py",

The error while DSE resources were being created:-
Error Type:
Microsoft.Compute/virtualMachines/extensions

• Resource Id
/subscriptions//resourceGroups//providers/Microsoft.Compute/virtualMachines/opscenter/extensions/installopscenter

• StatusMessage
{"status":"Failed","error":{"code":"ResourceDeploymentFailure","message":"The resource operation completed with terminal provisioning state 'Failed'.","details":[{"code":"VMExtensionProvisioningError","message":"VM has reported a failure when processing extension 'installopscenter'. Error message: "Enable failed: <urlopen error [Errno 2] No such file or directory: '/home//azure-resource-manager-dse/extensions/opsCenter.py'>"."}]}}

What is the best way to implement my customizations (DSE data folders and other C* parameters) and deploy these changes from my PC to Azure in "multidc"? Should I update the file online (on GitHub), download it to my machine, and then revert the changes online?

Or how do I ensure that changes made on my local machine are used by the python script while building the DSE cluster in Azure using these scripts?

Thanks,
Inder

Extension File Permissions

Mahesh has made me aware of a potential issue where a non-root user could gain root access by finding the root password in some extension log files. He's providing me with a list of files, and I'll add a chmod command to the extension to restrict their permissions.

The longer term solution is to use a secure field in the extension. This is a feature that is not yet available.

Preconfigured Backups

The template deploys DSE nodes configured to use ephemeral storage. It also attaches a data disk that can be used for backups, in case a cluster failure results in the loss of the data on the ephemeral disks. Ideally we would automate this backup process.

Intermittent Linux Extension Fails to Run

Since October 2015 we’ve been observing an intermittent issue where the Linux extension that installs DSE fails to run at all. The /var/lib/waagent/Microsoft… directory is not even created. We had a similar issue that was attributed to a bug in Fabric and is believed to be fixed. It’s unclear whether this is a recurrence of that bug or an entirely new bug.

Storage Account Name Conflicts

Deletion of a storage account takes some time (one estimate is 12 minutes) after the command is entered. Given that, it is currently necessary to give new clusters a different name than previously created clusters to avoid a name collision.
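
Until that changes, one workaround is to derive a fresh name for every deployment. A minimal sketch, assuming the usual storage account naming rules (3-24 characters, lowercase letters and digits only); the helper name and the random-suffix scheme are illustrative and not part of the templates.

```python
import re
import uuid


def unique_storage_name(prefix: str) -> str:
    """Build a storage account name from a cluster prefix plus a random suffix."""
    # Keep only lowercase letters and digits, and leave room for the suffix.
    base = re.sub(r"[^a-z0-9]", "", prefix.lower())[:16]
    suffix = uuid.uuid4().hex[:8]
    return base + suffix


if __name__ == "__main__":
    print(unique_storage_name("My-Cluster"))
```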

Support for SSH keys

The template uses a username/password for provisioning the nodes in the cluster. Ideally it would offer an option to use an SSH key.

Hash Key Mismatch

We've seen this on apt-get update and apt-get install zulu-8. The error message is:

Hit http://security.ubuntu.com trusty-security/main Translation-en
Hit http://security.ubuntu.com trusty-security/universe Translation-en
Fetched 6,205 B in 2s (2,295 B/s)
W: Failed to fetch http://repos.azulsystems.com/ubuntu/dists/stable/main/binary-amd64/Packages  Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
root@dc0vm1:/var/lib/waagent/Microsoft.OSTCExtensions.CustomScriptForLinux-1.3.0.1/download/0# 

Storage Account Limitation

We understand that storage accounts have a limit of 40 attached drives. This is forcing us to break nodes over different subnets for clusters greater than 40 nodes and introduces a lot of complexity.

We would really like to see this limitation abstracted away/otherwise removed in Azure as it will simplify the templates substantially.
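
To make the bookkeeping concrete, here is a sketch of how nodes might be partitioned so that no group exceeds the 40-drive limit. It assumes one attached drive per node, which the issue does not state; the function name is hypothetical.

```python
def group_nodes(node_count: int, per_account: int = 40):
    """Split node indexes 0..node_count-1 into groups of at most per_account."""
    nodes = list(range(node_count))
    return [nodes[i:i + per_account] for i in range(0, node_count, per_account)]


if __name__ == "__main__":
    groups = group_nodes(100)
    print(len(groups), [len(g) for g in groups])
```

Each group would then get its own storage account, which is the extra layer of template logic the limitation forces.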

apt-get failure with HTTP 401

There is an intermittent issue where apt-get fails. When using my DataStax credentials I see this issue fairly frequently. It manifests as a 401 authorization error in OpsCenter. Interestingly, when using Matt's credentials, I don't see this error.

Intermittent OpsCenter Provision Failures

OpsCenter provisions currently fail in a number of circumstances. For instance, provisioning a large cluster with vnodes will cause the agents to time out. We understand improvements to this process are coming, but for now users should be aware that it's a weak point.

Attempting to fix the cluster and retry the provision is both complex and time-consuming. The best way to deal with these failures at the moment is to destroy the cluster and attempt the provision again.

Vnode Timeout

It turns out vnodes can cause OpsCenter provision to fail. Removing them until that's resolved with Spock.

Default Versions of DSE and OpsCenter

Set OpsCenter and DSE to use default versions rather than the newest, so that updates do not cause Marketplace offerings, etc. to break if compatibility changes.

Changes to DSE Config

Chuck has suggested the following changes to the default config in the opscenter.sh json. I'll work on this.

[3:33 PM] Chuck Droukas: Here you go:
endpoint_snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
rpc_address: 0.0.0.0
hinted_handoff_enabled: 'false'

[3:34 PM] Chuck Droukas: For multi-DC:
broadcast_rpc_address: 10.0.0.X
broadcast_rpc_address: 10.1.0.X
num_tokens: 30
phi_convict_threshold: 12
Remove:
initial_token: 4611686018427387901
Basically, Vnode setup.
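
A sketch of how these suggestions could be applied to a node's config before it is serialized into the opscenter.sh json. The function and constant names are hypothetical, the settings are the ones quoted above, and treating the initial_token removal as a multi-DC-only step is an assumption based on where it appears in the chat.

```python
import json

# Settings suggested for every deployment.
BASE_OVERRIDES = {
    "endpoint_snitch": "org.apache.cassandra.locator.GossipingPropertyFileSnitch",
    "rpc_address": "0.0.0.0",
    "hinted_handoff_enabled": "false",
}

# Additional settings suggested for multi-DC deployments.
MULTI_DC_OVERRIDES = {"num_tokens": 30, "phi_convict_threshold": 12}


def apply_overrides(existing, multi_dc=False, broadcast_rpc_address=None):
    """Return a copy of existing with the suggested settings applied."""
    config = dict(existing)
    config.update(BASE_OVERRIDES)
    if multi_dc:
        if broadcast_rpc_address is None:
            raise ValueError("multi-DC needs the node's broadcast_rpc_address")
        config.update(MULTI_DC_OVERRIDES)
        config["broadcast_rpc_address"] = broadcast_rpc_address
        config.pop("initial_token", None)  # vnodes replace the fixed token
    return config


if __name__ == "__main__":
    node = {"initial_token": 4611686018427387901}
    print(json.dumps(apply_overrides(node, True, "10.1.0.6"), indent=2))
```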
