appnexus / schema-tool
framework-agnostic, schema management tool for everyone
License: Other
While running an alter, when encountering a failure, we do one of two things:
1. halt immediately, or
2. continue past the error, if the -f flag is provided to the command
I am proposing either an alternative behavior for no. 1 or a new behavior (driven by configuration) in which, when we encounter an error, we also attempt to do a rollback (running the down alter).
The open question is whether or not this should be done for the -f option. An argument could be made either way as to the semantics of the -f option. We might also be able to introduce a new option which ignores rollbacks on a force.
Since -f means "force" I feel the semantics should imply that we drop safety features like automatic rollback and leave it as a "run at your own risk" kind of feature.
I'm unable to correctly run the schema console script after installing via python setup.py install.
(schema-tool)➜ schema-tool git:(master) ✗ schema
Traceback (most recent call last):
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/bin/schema", line 9, in <module>
    load_entry_point('schema-tool==0.3.0', 'console_scripts', 'schema')()
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/pkg_resources.py", line 356, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/pkg_resources.py", line 2431, in load_entry_point
    return ep.load()
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/pkg_resources.py", line 2147, in load
    ['__name__'])
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/schema_tool-0.3.0-py2.7.egg/schematool/schema.py", line 30, in <module>
    from command import CommandContext
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/schema_tool-0.3.0-py2.7.egg/schematool/command/__init__.py", line 1, in <module>
    from context import CommandContext
  File "/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/schema_tool-0.3.0-py2.7.egg/schematool/command/context.py", line 7, in <module>
    from db import MySQLDb, PostgresDb, MemoryDb, VerticaDb, HiveDb
ImportError: No module named db
This seems to be related to an issue with the value of sys.path, which is listed below.
['/Users/vincent.schumaker/.virtualenvs/schema-tool/bin', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/schema_tool-0.3.0-py2.7.egg', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python27.zip', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-darwin', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-mac', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/vincent.schumaker/.virtualenvs/schema-tool/Extras/lib/python', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-tk', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-old', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-dynload', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages']
Running the script manually, from the root of the git source directory works correctly and produces the following sys.path.
['/Users/vincent.schumaker/Documents/vschumaker/code/schema-tool/schematool', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages/schema_tool-0.3.0-py2.7.egg', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python27.zip', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-darwin', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-mac', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/vincent.schumaker/.virtualenvs/schema-tool/Extras/lib/python', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-tk', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-old', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/lib-dynload', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/vincent.schumaker/.virtualenvs/schema-tool/lib/python2.7/site-packages']
I suppose the issue is that the way imports are written relies on the fact that the current directory of the file being executed is implicitly added to sys.path; however, this won't work when the console_script that is installed via setuptools is used as an entry point.
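A plausible fix (a sketch, not the project's actual patch) is to make the imports package-qualified rather than implicit, since a console_script entry point does not put the package's own directory on sys.path. The demonstration below builds a throwaway schematool package to show the difference:

```python
import os
import sys
import tempfile

# Build a throwaway package layout: <tmp>/schematool/{__init__.py, db.py}.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "schematool"))
open(os.path.join(root, "schematool", "__init__.py"), "w").close()
with open(os.path.join(root, "schematool", "db.py"), "w") as f:
    f.write("NAME = 'db'\n")

# Like the egg/site-packages dir: it contains the package, not the
# package's internals.
sys.path.insert(0, root)

# The implicit import used by context.py ("from db import ...") fails here,
# because the schematool/ directory itself is not on sys.path.
try:
    import db  # noqa: F401
    implicit_import_ok = True
except ImportError:
    implicit_import_ok = False

# A package-qualified import works regardless of the working directory.
from schematool.db import NAME
```

Running the script from the source root only works because Python implicitly prepends the script's own directory to sys.path, which is exactly the behavior the entry point lacks.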
Maybe add an option to schema new to make the file name be the same as your current git branch?
We are using the schema tool to manage our Postgres DB schema at AppNexus for the AppNexus Exchange API. We ran into a slight issue because the way we set up our initial alters would CREATE DATABASE on the new DB we were creating (sadly Postgres has no concept of IF NOT EXISTS for DB creation). For Postgres, schema init tries to connect to the DB you have listed in your config.json, and it fails if it can't connect. In the end we decided to remove the CREATE and DROP DATABASE commands from the up/down files to make working with the schema tool easier. This means that the DBAs must be informed to create your DB before running your initial alter in prod.
If this idea sounds alright, I would like to contribute a small update to the Readme for schema tool to advise users not to create / drop their DB in their initial alter for Postgres. Basically just some simple best practices to prevent users from shooting themselves in the foot like we did.
You can see the accompanying change I made to the DBA documentation here, under NOTE: if using PostgreSQL: https://corpwiki.appnexus.com/display/engineering/How+to+Make+Database+Changes+as+a+Developer
Long term, it might make sense to either have the schema tool not error out on init if it can't connect to the DB, or else maybe have it create that DB if it doesn't exist. I'm not sure what implications that would have for the code though.
Running schema init on a postgres schema where the database exists but the revision schema hasn't been manually created throws the following error:
~/app-schemas_apm-schemas> schema init
Creating revision database
Error: Psycopg2 execution error: schema_name.
Query: SELECT EXISTS(SELECT 1 FROM pg_namespace WHERE nspname = %s) - Data: [u'revision']
Here's the config.json:
{
"username": "bkogan",
"host": "localhost",
"db_name": "REDACTED",
"schema_name": "REDACTED",
"revision_db_name": "REDACTED",
"revision_schema_name": "REDACTED",
"history_table_name": "REDACTED",
"type": "postgres",
"pre_commit_hook": "pre-commit-hook-static-dba-files.sh",
"static_alter_dir": "DBA_FILES/"
}
Schema-tool version 0.3.0, postgres version 9.4.4.
In addition to issue #47, we also need to support 2.6 for the same reasons.
I've realized that the vertica support has been broken for some time now because of at least 2 changes in the vertica-python library. Those changes came in version 0.5.x of vertica-python.
I have a fork and I'd be happy to contribute a PR. I'll check first that there are no other changes that broke the support.
There are a few places in the codebase that reference git commands. This should be extracted out into a vcs module that can be extended to support any version control system.
Any version control actions should be turned off by default and enabled with the use_vcs config key, whose value will contain the VCS to be used (e.g. git, hg, svn, etc.).
There's the ability to specify envs in alters and in your config to bypass certain alters. This needs documenting.
An alter should be run once and only once in an alter chain. Enforce this by adding a unique constraint on the alter_hash column on the history table.
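A minimal illustration of what that constraint buys, using an in-memory SQLite table as a stand-in for the real history table (table and column names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history ("
    "  id INTEGER PRIMARY KEY,"
    "  alter_hash TEXT NOT NULL UNIQUE"  # the proposed unique constraint
    ")"
)
conn.execute("INSERT INTO history (alter_hash) VALUES ('abc123')")

# Recording the same alter twice now fails at the database level,
# so a double-run can't silently corrupt the history.
try:
    conn.execute("INSERT INTO history (alter_hash) VALUES ('abc123')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```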
Create a standard setup.py file so the tool can properly be installed via pip/easy_install.
The tool provides linearity in the alter chain and history tracking in the DB but there is one thing the tool does not currently track and that is changes to existing files. If the tool is to enforce linearity as a way of ensuring proper execution on any environment (existing or new) then we must also know when files change once they have already been run.
This is a bit of an open question but the general ideas are:
We could also take a hash of the file contents and insert that into the history table. We could then check that nothing has changed when running alters against various environments (warnings? errors?). If we integrate this into the check command then we add it to the up, down, and rebuild commands for free.
Simple way to run tests (shell script). We'll worry about getting this setup with a CI system (like TravisCI) when the project goes public (we can run manually until then).
When running schema up, a chain of alters is constructed from the alter files. schema-tool then grabs the history of alters that have been run for a PostgreSQL DB via a simple SELECT *: schematool/db/_pg.py
The history is then compared with the chain of alters, and it seems that the ordering is significant: schematool/command/up.py
From the PostgreSQL docs:
If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on.
It seems that this should be patched to include an ordering, probably by the primary key and the timestamp.
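The fix is presumably a one-line change to the history query; a sketch, assuming the history table has an auto-increment primary key named id (the column name is an assumption):

```python
# Before: row order is whatever the planner returns, which the PostgreSQL
# docs explicitly say must not be relied on.
OLD_HISTORY_QUERY = "SELECT * FROM history"

# After: make the read deterministic by ordering on the primary key
# (ordering on the timestamp column would also work, with the primary
# key as a tiebreaker).
NEW_HISTORY_QUERY = "SELECT * FROM history ORDER BY id ASC"
```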
'schema resolve' working with references but not filenames:
usethi@92:/usr/local/adnxs/api-schemas[(no branch)]$ schema check
Divergent Branch:
This means that we have found alters that share a common parent. To fix
this you can run the 'resolve' command. When merging changes from your
feature-branch, ensure that you resolve your files that are in conflict
(not existing files that were previously in a good state).
Duplicate backref found (divergent branch): 139879308220-integration-timestamp-up.sql
Duplicate backref found (divergent branch): 139879511680-AS-1479-up.sql
usethi@92:/usr/local/adnxs/api-schemas[(no branch)]$ schema resolve 139879511680-AS-1479-up.sql
An exception has occurred... Sorry. You should file a ticket in
our issue tracker: http://github.com/appnexus/schema-tool/issues
Error: global name 'islice' is not defined
Traceback (most recent call last):
File "/usr/bin/schema", line 74, in main
globals()[handler](context).run()
File "/usr/local/adnxs/schema-tool/src/command/resolve.py", line 64, in run
if self._file_exists():
File "/usr/local/adnxs/schema-tool/src/command/resolve.py", line 104, in _file_exists
head = list(islice(my_file, 3))
NameError: global name 'islice' is not defined
usethi@92:/usr/local/adnxs/api-schemas[(no branch)]$ schema resolve 139879511680
Resolving
Re-Checking...
Everything looks good!
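The NameError suggests resolve.py simply never imports islice. A sketch of what the _file_exists code path needs (the helper function shown here is illustrative, not the real method):

```python
from itertools import islice

def read_file_head(my_file, n=3):
    # islice comes from itertools; without the import above, this call
    # raises NameError: global name 'islice' is not defined, which is
    # exactly the traceback seen when resolving by filename.
    return list(islice(my_file, n))
```

Resolving by reference avoids the broken code path entirely, which is why `schema resolve 139879511680` succeeds while the filename form crashes.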
When psycopg2 is missing, schema-tool throws Postgres module not found/loaded. Please make sure all dependencies are installed
There should be a more informative error message.
Had several people come to me about the updated contrib script. It would seem that some of the directories are a bit backwards and this script could use a bit of attention. A working / fixed version I patched together for a co-worker is below. A proper update should be made (with comments and cleanup and what not):
#!/bin/bash
# pre-commit-hook-static-dba-files.sh
#
# Purpose: Automatically commit static up/down files whenever an alter is
# added or updated.
#
# Usage:
#
# Place this script in the same directory as config.json, and update
# config.json to have the following value:
#
# "pre_commit_hook": "pre-commit-hook-static-dba-files.sh"
#
# Last modified: 16 July 2014
# Utility function to write to stderr
echoerr() { echo "$@" 1>&2; }
# Utility function to exit if last command exited non-zero.
exit_if_err() {
if [[ $? -ne 0 ]]
then
exit 1
fi
}
# The schema-tool has to be run from the directory where config.json lives,
# which may not be the same location as the root of the project. So, we
# have to move the working directory of the script to the location of
# config.json (which is the same real path as this hook), but before doing so
# save the repository root to a variable so that Git operations can
# be performed.
HOOK_DIR="$( cd "$(dirname "$0")" ; pwd -P )"
ORIG_DIR="$HOOK_DIR/../.."
cd "$ORIG_DIR"
STAGED_FILES=$(cd "$ORIG_DIR" && git diff --cached --name-only --relative --diff-filter=ACMR)
STATIC_ALTER_DIR=$(/usr/bin/env python2.7 -c "import json; print json.loads(open('config.json').read())['static_alter_dir']")
if [[ $? -ne 0 ]]
then
echoerr 'No static_alter_dir property found in config.json, but is required.'
exit 1
fi
# Remove the .sql files in the static alter directory from the diff output
STAGED_FILES_SQL_ONLY=$(echo "$STAGED_FILES" | grep -v "$STATIC_ALTER_DIR" | grep -E '\.sql')
SEEN=()
for f in $STAGED_FILES_SQL_ONLY
do
NODE=$(basename "$f" | grep -oE '^[0-9]+')
if [[ $? -ne 0 ]]
then
echo "Skipping invalid filename: $(basename $f)" 1>&2
continue
fi
SKIP=0
for var in "${SEEN[@]}"
do
if [[ "${var}" == "$NODE" ]]
then
SKIP=1
fi
done
if [[ $SKIP -eq 1 ]]
then
continue
fi
SEEN+=($NODE)
UP_RESULT=$(schema gen-sql -q -w $NODE)
exit_if_err
DOWN_RESULT=$(schema gen-sql -q -w -d $NODE)
exit_if_err
# Add the up and down files to git
ADD_UP=$(cd "$ORIG_DIR" && git add "$UP_RESULT")
exit_if_err
ADD_DOWN=$(cd "$ORIG_DIR" && git add "$DOWN_RESULT")
exit_if_err
echo "Added file to commit (up): $UP_RESULT"
echo "Added file to commit (down): $DOWN_RESULT"
done
This tool should be installable via PIP, whatever that takes to do... idk.
Currently we enforce a strictly linear alter-chain (history) but this may be overkill for some smaller projects. As such we should modify the current validation system to be extensible and/or interchangeable.
I can foresee allowing divergent alter-chains with optional resolutions (along the same line as the current one) with predictable mechanics for running alters. While more "risky" than a strictly linear alter-chain, it does lend itself well to small projects.
Open Questions
Need to add unit-tests to test the running of the down command. The following need to be tested:
- all undoes all current alters
- base undoes all but "base" alter
- -n option works as expected
- -f causes errors to be ignored
Note: may need to change how the MemoryDb works in order to make certain things pass/fail when running mock alters
Need to add unit-tests to test the running of the resolve command. The following need to be tested:
- ref
Note: May need to implement #6 (Version Control Agnosticism) to fully test this command
Both the up and the down commands support the ability to provide a reference (alter) to run up/down to (inclusively). Similar functionality should be added to the rebuild command, which should run down to (including) and up from (also including) the given reference (alter).
Need to add unit-tests to test the running of the init command. The following need to be tested:
- -f option drops and re-creates database
Note: May need to implement #6 (Version Control Agnosticism) to fully test this command
Write unit tests for the new command. Should test the following:
- backref
- backref created / used
- if option
Right now, the script won't run unless there is a file named conf.json in the same directory. Make the name and location of this configurable.
Following the wiki (https://corpwiki.appnexus.com/display/engineering/How+To+Use+The+%27schema%27+Script) and got to the point of 'schema rebuild' which fails with:
Bringing all the way down
Downgraded
Bringing all the way back up
Running alter: 137823105930-init-up.sql
An exception has occurred... Sorry. You should file a ticket in
our issue tracker: http://github.com/appnexus/schema-tool/issues
Error: 2, No such file or directory
Any way to get more details about which file/directory it's looking for?
BTW, the file 137823105930-init-up.sql does exist in the directory.
Need to add unit-tests to test the running of the up command. The following need to be tested:
- ref option causes to run up to, and including, the ref given
- -n option works as expected
- -f causes errors to be ignored
Note: may need to change how the MemoryDb works in order to make certain things pass/fail when running mock alters
See issue #36 for context.
Rather than throwing an exception all the way up the stack we should:
- … (the -v option controlling how much of it we show vs a normal failure)
- … (psql or mysql)
It might also make sense to provide a configuration option that allows you to provide a path to the specific pgsql or mysql binary that you would like to use, with the default value (if no config value is provided) being whatever the script can locate on the current path.
See: #60 (comment)
Environment flags are needed that can be added to the meta-data of an alter under the env key. The data MUST contain a comma-delimited list of environments in either a whitelist or a blacklist format, but MUST NOT be in both formats. The + symbol will be used for the whitelist and the - symbol will be used for blacklists. The symbol will be prefixed in front of the environment name.
Examples:
Only the sand environment is affected
-- env: +sand
Every environment affected except for sand and dev
-- env: -sand, -dev
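A sketch of how such a spec could be evaluated (the function name and exact behavior are illustrative, not the tool's implementation):

```python
def alter_runs_in(env_spec, current_env):
    """Return True if an alter with the given env meta-data should run in
    current_env. '+' entries form a whitelist, '-' entries a blacklist;
    mixing the two formats is rejected, per the rules above."""
    entries = [e.strip() for e in env_spec.split(",") if e.strip()]
    if all(e.startswith("+") for e in entries):
        # Whitelist: run only in the listed environments.
        return current_env in {e[1:] for e in entries}
    if all(e.startswith("-") for e in entries):
        # Blacklist: run everywhere except the listed environments.
        return current_env not in {e[1:] for e in entries}
    raise ValueError("env spec mixes whitelist and blacklist: %r" % env_spec)
```

With the examples above, alter_runs_in("+sand", "sand") is True, while alter_runs_in("-sand, -dev", "dev") is False.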
The tool should read in the SCHEMA_ENV environment variable and add it to the CommandContext. The commands will be responsible for determining how the specified environment affects the current command. For commands like up, down, and rebuild it will determine whether a given alter should actually be executed.
Open (for discussion) Questions
What happens when no environment is specified? Is there a default environment?
Some DBAs like to have more control over how alters are run against production databases (especially in large-scale or mission-critical environments) and not perform their operations and custom workflows through a tool. As such, the tool should provide a way to give the DBAs full control over the alter while still maintaining the features associated with the tool in terms of history tracking.
The easiest way to approach this is through SQL generation in which the original file is altered to include (or appended with) insert/update/delete statements to the alter-history table.
It would be ideal to also include VCS hooks to help automate this workflow if the user so chooses to use it. This should also build off of the existing gen-sql command, whose original purpose was along these lines.
$ schema rebuild -f
Bringing all the way down
An exception has occurred... Sorry. You should file a ticket in
our issue tracker: http://github.com/appnexus/schema-tool/issues
Error: too many values to unpack
Traceback (most recent call last):
File "/Users/bzajac/bin/schema", line 95, in main
globals()[handler](context).run()
File "/Volumes/Code/schema-tool/schematool/command/rebuild.py", line 32, in run
DownCommand(self.context).run()
File "/Volumes/Code/schema-tool/schematool/command/down.py", line 64, in run
for (_, alter_id, _) in history:
ValueError: too many values to unpack
I've run into a ValueError issue when attempting to run schema up on a database with too many alters applied. The for loop iterator attempting to check each alter_id errors out when the dict object is too large to iterate. The proposal here is to update these for loops on these dicts to utilize iteritems() (items() for Python 3) to get around the ValueError.
StackTrace:
Traceback (most recent call last):
File "/usr/bin/schema", line 95, in main
globals()[handler](context).run()
File "/usr/local/adnxs/schema-tool/schematool/command/up.py", line 67, in run
for (_, alter_id, _) in history:
ValueError: too many values to unpack
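A sketch of the proposed change (the shape of history shown here is an assumption for illustration):

```python
# Iterating a dict directly yields only its keys, not the full rows that
# the tuple unpack in up.py expects; iterating items()/iteritems() yields
# explicit (key, value) pairs instead.
history = {
    1: ("137823105930", "init"),
    2: ("139879308220", "integration-timestamp"),
}

# Proposed style: unpack key and value explicitly
# (use history.iteritems() on Python 2).
applied = [value[0] for _id, value in history.items()]
```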
Hey there, when doing schema -v or schema version, it looks like it pulls from constants.py => VERSION, which is listed as 0.2.18. This can cause some confusion with that number not changing.
Postgres goes a step further with the abstraction of databases than MySQL, using both databases (logically separated) and schemas (logical groupings, not necessarily separated). The postgres integration only creates the schema (or it should). Because it's not creating the database, this can be a little confusing when running "init".
Init should create the database and schemas if they do not already exist and it should log properly to the user to let them know what it is doing.
In order to do some proper testing, we really need a test-database so that we're not dependent upon any external resources. Ideally, since our DB usage is super simple, we could create an in-memory database that is only good for the life of the python process. This way we can stub information as is necessary for a given test run.
The new test database should:
- memory-db
It would be nice if the user didn't have to worry about what version of Python they are running to use the tool. Python 3 is the future (right??) so we should support it. In addition, this is meant to be used as a general tool on someone's computer, not as something that needs to be run inside a virtualenv or some nonsense.
The easiest solution may be to just edit the .travis.yml file and fix the test errors. We may want to get our test coverage up before attempting this. :-)
So if I have a feature branch with a pending schema change inside a file, unapply that change before git checkout master.
Cool idea?
Currently the meta-data parsing is weak sauce. We just read the first three lines and assume that it contains the correct data. The new parsing needs to scan more (possibly all) of the file with some rules about when it should stop processing (possibly first line of non-comment or whitespace?).
We also need to add constraints so that we know that all of the "required" meta-data has been collected. This solution should also be easily extendible for adding new meta-data keys.
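A sketch of header scanning under these rules (stopping at the first non-comment, non-blank line; the comment prefix, key names, and required set are assumptions for illustration):

```python
def parse_meta(lines):
    """Collect key: value pairs from leading '-- key: value' comment lines,
    stopping at the first line that is neither a comment nor blank."""
    meta = {}
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # tolerate blank lines inside the header
        if not stripped.startswith("--"):
            break  # first real SQL ends the meta-data block
        body = stripped[2:].strip()
        if ":" in body:
            key, _, value = body.partition(":")
            meta[key.strip()] = value.strip()
    return meta

# Hypothetical required-key constraint; extending it is just adding a name.
REQUIRED = {"ref", "backref"}

def check_required(meta):
    missing = REQUIRED - meta.keys()
    if missing:
        raise ValueError("missing meta-data keys: %s" % sorted(missing))
```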
Need to add unit-tests to test the running of the check command. The following need to be tested:
- up/down pair are missing
Note: may need to change how the MemoryDb works in order to make certain things pass/fail when running mock alters
Hi,
It appears whenever I make a change to my schema I am unable to run schema up or down without rerunning the pip install command/rebuilding the project because I get the following error:
"Running alter: 143267219460-update_template_table-up.sql
An exception has occurred... Sorry. You should file a ticket in
our issue tracker: http://github.com/appnexus/schema-tool/issues
Error: 2, No such file or directory"
Additionally, this is what my schema list looks like (are there supposed to be 3 arrows or is something off?):
*[142746732000] input_tables
-> *[142843051770] update_double_columns
-> *[143040473890] output-tables
-> [143267219460] update_template_table
Thanks!
Dan
Now that this project is open source (and on GitHub) we should really be adding support for TravisCI to run on pull requests and such.
Whenever we manage schema migrations for non production environments, we invariably need to follow up with stub data inserts. Integrating stub data with our schema tool makes sense for two reasons:
I propose that we track stub data in a separate directory of our schema repos, and add a new command analogous to running up and then also applying stub data SQL files from some conventional project directory.
It may seem desirable to also add commands to add/remove only stub data, without affecting table definitions, but I argue that dev/test environments should be essentially immutable and defined at the granularity of create/destroy. We don't need to track a stub data alter chain. The more we fiddle with stub data, the harder it is to reproduce our dev/test environment. I think the proposed command with rebuild -f should be adequate.
Need to add unit-tests to test the running of the list command. The following need to be tested:
Note: may need to change how the MemoryDb works in order to make certain things pass/fail when running mock alters
When running list we hit a max recursion error, due to the number of alters and the way that the __str__ method is implemented on Node.
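An iterative rewrite avoids the recursion limit entirely. A sketch, assuming Node is a singly linked chain (the attribute names and output format here are illustrative, not the real class):

```python
class Node:
    def __init__(self, ref, child=None):
        self.ref = ref
        self.child = child

    def __str__(self):
        # Walk the chain with a loop instead of recursing into
        # child.__str__(), so long alter-chains can't exhaust the
        # Python recursion limit.
        parts = []
        node = self
        while node is not None:
            parts.append("[%s]" % node.ref)
            node = node.child
        return " -> ".join(parts)
```

Even a chain far deeper than sys.getrecursionlimit() stringifies fine this way, since the traversal uses constant stack depth.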