
amazon-redshift-utils's Introduction

Amazon Redshift Utilities

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimise IO, provide high data compression rates, and offer fast performance. This repository provides a collection of scripts and utilities that will help you get the best performance possible from Amazon Redshift.

Admin Scripts

In the AdminScripts directory, you will find a collection of utilities for running diagnostics on your Cluster.

Admin Views

In the AdminViews directory, you will find a collection of views for managing your Cluster, generating Schema DDL, and ...

Stored Procedures

In the StoredProcedures directory, you will find a collection of stored procedures for managing your Cluster, or just to use as examples.

Column Encoding Utility

In order to get the best performance from your Redshift Database, you must ensure that database tables have the correct Column Encoding applied (http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html). Column Encoding specifies which algorithm is used to compress data within a column, and is chosen on the basis of the datatype, the number of unique values in the column, and so on. When the COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) is used to load data into a table, column encoding will be analyzed and applied by default. Other tables may be loaded via Extract/Load/Transform (ELT) processes, and these tables may require having the column encoding updated at some point.

The Redshift Column Encoding Utility gives you the ability to apply optimal Column Encoding to an established Schema with data already loaded. When run, it will analyze an entire schema or individual tables. The ANALYZE COMPRESSION (http://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE_COMPRESSION.html) command is used to determine if any of the columns in the table require updating, and if so a script is generated to convert to the optimal structure.
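
For reference, an invocation looks roughly like the following; the connection values are placeholders, and the flags shown are only those that appear in the usage examples later in this document, so check the utility's own documentation for the full list:

python3 analyze-schema-compression.py --db mydb --db-user admin \
    --db-host mycluster.abc123.us-east-1.redshift.amazonaws.com --db-port 5439 \
    --analyze-table my_table --comprows 200000 \
    --output-file encoding_changes.sql --do-execute false

With --do-execute false, the documentation indicates the utility only generates the migration script rather than applying it (see the related issue about this flag further down).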

Analyze & Vacuum Utility

The Redshift Analyze Vacuum Utility gives you the ability to automate VACUUM and ANALYZE operations. When run, it will analyze or vacuum an entire schema or individual tables. The utility analyzes and vacuums table(s) in a Redshift Database schema based on parameters such as the unsorted percentage, stale statistics, table size, and system alerts from stl_explain & stl_alert_event_log. By toggling the '--analyze-flag' and '--vacuum-flag' parameters, you can run it as a vacuum-only or analyze-only utility. The script can be scheduled to run VACUUM and ANALYZE as part of regular maintenance/housekeeping activities during quiet periods when there is less database activity.
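
As an illustration, a typical scheduled run might look like the command below; the connection values are placeholders, and the flags are the ones used in the troubleshooting examples later in this document:

python3 analyze-vacuum-schema.py --db mydb --db-user admin --db-port 5439 \
    --db-host mycluster.abc123.us-east-1.redshift.amazonaws.com \
    --schema-name public --vacuum-flag True --analyze-flag True \
    --min-unsorted-pct 5 --stats-off-pct 10 \
    --output-file analyze_vacuum.log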

Cloud Data Warehousing Benchmark

The Cloud DW Benchmark consists of a set of workloads used to characterize and study the performance of Redshift running a variety of analytic queries. The DDL to set up the databases, including COPY utility commands to load the data from a public S3 directory, as well as the queries for both single user and multi-user throughput testing are provided.

Unload/Copy Utility

The Redshift Unload/Copy Utility helps you migrate data between Redshift Clusters or Databases. It exports data from a source cluster to a location on S3, with all data encrypted using AWS Key Management Service (KMS). It then automatically imports the data into the configured Redshift Cluster, and will clean up S3 if required. This utility is intended to be used as part of an ongoing scheduled activity, for instance run as part of a Data Pipeline Shell Activity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html).
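
For reference, the utility is driven by a JSON configuration file (config.json in the usage reports later in this document) describing the source, target, and S3 staging location, and is invoked with the configuration path and a region. The values below are placeholders:

python3 redshift-unload-copy.py ./config.json us-east-1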

Simple Replay Utility

The Simple Replay Utility helps you collect and replay cluster workloads. It reads the user activity log files (when audit logging is enabled) and generates SQL files to be replayed. There are two replay tools: one replays at an arbitrary concurrency, and the other tries to reproduce the original cadence of the work.

Automation Module

This project includes code that is able to run the Amazon Redshift Utilities via AWS Lambda. By using a Lambda function scheduled via a CloudWatch Event (http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchEvents.html), you can ensure that these valuable utilities run automatically and keep your Redshift cluster running well.
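
As an illustration only (the function name, rule name, schedule, and ARN below are assumptions, and the Lambda function must already be deployed), wiring a CloudWatch Events schedule to a function from the AWS CLI looks roughly like this:

# Create a scheduled rule (hypothetical name and schedule)
aws events put-rule --name RedshiftUtilitiesDaily --schedule-expression "rate(1 day)"
# Allow CloudWatch Events to invoke the (hypothetical) function
aws lambda add-permission --function-name RedshiftUtilities \
    --statement-id cloudwatch-events-invoke --action lambda:InvokeFunction \
    --principal events.amazonaws.com
# Point the rule at the function
aws events put-targets --rule RedshiftUtilitiesDaily \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:RedshiftUtilities"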

Snapshot Manager

This project is now deprecated. The automatic capture and management of cluster snapshots is handled by AWS; see the Amazon Redshift documentation for details.

WLM Query Monitoring Rule (QMR) Action Notification Utility

This project enables a scheduled Lambda function to pull records from the QMR action system log table (stl_wlm_rule_action) and publish them to an SNS topic. This utility can be used to send periodic notifications based on the WLM query monitoring rule actions taken for your unique workload and rules configuration.
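
For context, stl_wlm_rule_action is a Redshift system log table; a query along the lines of the sketch below (column list assumed from the standard system table definition) shows the kind of records the Lambda function publishes:

select recordtime, query, service_class, rule, action
from stl_wlm_rule_action
where recordtime > dateadd(hour, -1, getdate())
order by recordtime desc;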

Investigations

This project includes a number of detailed investigations into various types of Redshift edge cases, nuances, and workload scenarios.

Authentication

You can provide a Redshift password as a base64-encoded, KMS-encrypted string in most tool configurations, or alternatively you can use .pgpass file or $PGPASS environment variable based authentication. In each module, or when packaging all of the modules for Lambda based automation, the use of .pgpass requires that you rebuild the module using the build.sh script, after which it should work as expected.
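
For reference, .pgpass uses the standard PostgreSQL one-line-per-connection format, and the file must be readable only by its owner (for example chmod 600 ~/.pgpass); the values below are placeholders:

mycluster.abc123.us-east-1.redshift.amazonaws.com:5439:mydb:admin:my_password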

Please note that this feature was added at customers' request, but it does not represent the most secure solution: it stores the password in plaintext, which, depending on how the modules are deployed, may be a security threat. Please use it with caution!

Running utilities

From the command line, you can run the utilities from the src directory with:

python3 ./<folder>/<utility> <args>

Docker executions

The Dockerfile provides an environment to execute the following utilities without having to install any dependencies locally:

  • Analyze & Vacuum Utility
  • Unload/Copy Utility
  • Column Encoding Utility

You can do this by building the image like so:

docker build -t amazon-redshift-utils .

And then executing any one of the 3 following commands (filling in the -e parameters as needed):

docker run --net host --rm -it -e DB=my-database .... amazon-redshift-utils analyze-vacuum
docker run --net host --rm -it -e DB=my-database .... amazon-redshift-utils column-encoding
docker run --net host --rm -it -e CONFIG_FILE=s3://.... amazon-redshift-utils unload-copy

The Docker entrypoint scripts are driven by environment variables, so you will need to provide those in the run commands above.

For convenience, you can create a .env file locally and pass it to the Docker container via the --env-file argument. For example, if your environment variables file is named redshift_utils.env, you could execute:

docker run --net host --rm -it --env-file redshift_utils.env .... amazon-redshift-utils analyze-vacuum

docker run --net host --rm -it --env-file redshift_utils.env -e CONFIG_FILE=s3://<bucket_name>/<config_file>.json amazon-redshift-utils unload-copy
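
A sketch of such an env file is shown below. Only DB and CONFIG_FILE appear in the examples above; the remaining variable names are assumptions, so confirm the exact names against the entrypoint scripts before relying on them:

# redshift_utils.env (illustrative sketch -- variable names other than DB are assumptions)
DB=my-database
DB_HOST=mycluster.abc123.us-east-1.redshift.amazonaws.com
DB_PORT=5439
DB_USER=admin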

Please see the entrypoint scripts for the environment variable configuration references that are needed.


License

This project is licensed under the Apache-2.0 License.

amazon-redshift-utils's People

Contributors

adedotua, alexlsts, benkim05, brandon-schur, devanathansabapathy1, ericfe, etspaceman, fjavieralba, frankfarrell, gaganawhad, ianmeyers, iansrobinson, indu-bhagavatula, jiteshsoni, jklukas, joeharris76, karle-nishi, kevinvirauddh, maryna-popova, mscaer, pvbouwel, ragsrina, rajsett, saeedma8, sathiish-kumar, sgromoll, styerp, timjell, tpcstld, zach-data


amazon-redshift-utils's Issues

v_generate_tbl_ddl.sql doesn't always return primary key constraints

The part of the sql that returns constraint information doesn't always return all constraints.
It looks like the join condition between pg_class and pg_constraint, c.relfilenode = con.conrelid, is not always true for a given table. I did not find anything on Redshift file storage, but this PostgreSQL documentation could explain it (https://www.postgresql.org/docs/9.0/static/storage-file-layout.html):
Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID. Avoid assuming that filenode and table OID are the same.
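
A minimal sketch of the kind of change this implies, joining on the table OID instead of the filenode, is shown below; this is an illustration of the suggestion, not a tested patch to the view:

-- current join condition (can miss constraints after TRUNCATE, some forms of ALTER TABLE, etc.):
--     join pg_constraint con on c.relfilenode = con.conrelid
-- suggested join condition:
--     join pg_constraint con on c.oid = con.conrelid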

AnalyzeVacuumUtility password management

Currently this utility requires the password to be provided as an argument.

This means that the userid & password are vulnerable: they can be picked up by anyone on the system watching for processes, or they may be captured by endpoint security software and kept in logs where others can access them. Beyond making the systems that use this utility vulnerable to attack, it would also be a security policy violation in many organizations.

I'd like to propose supporting either a .pgpass file or environmental variables for storing this password, ideally both. The python library pgpasslib (https://pgpasslib.readthedocs.io/en/latest/) is a simple module to assist in reading from the .pgpass file.
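
For illustration, a minimal sketch of how pgpasslib could be wired in; the connection values are caller-supplied, the prompt text mirrors the one the utility already uses, and the fallback behaviour here is an assumption rather than a proposed final design:

import getpass

import pgpasslib  # https://pgpasslib.readthedocs.io/en/latest/


def resolve_password(host, port, dbname, user):
    # Try the standard ~/.pgpass file first; fall back to an interactive
    # prompt if no matching entry is found.
    password = pgpasslib.getpass(host=host, port=port, dbname=dbname, user=user)
    if not password:
        password = getpass.getpass('Password <%s>: ' % user)
    return password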

Vacuum delete only

The column encoding utility doesn't actually run "vacuum delete only" when run with --do-execute=True; it only writes it to the generated sql script.

  1. I don't understand why a vacuum delete is needed at all, since the script doesn't seem to do any update/delete operations.
  2. If it is indeed needed, then why is there one behavior when do-execute is true and another when it is false?

Link - https://github.com/awslabs/amazon-redshift-utils/blob/master/src/ColumnEncodingUtility/analyze-schema-compression.py#L825

Value too long varchar(275) error with src/AdminViews/v_generate_tbl_ddl.sql

When running v_generate_tbl_ddl.sql on certain databases, the script errors with the message:

ERROR: value too long for type character varying(275)

When hacking at the script, I discovered that changing line 77 from this:
,'\t,' + pg_get_constraintdef(con.oid) AS ddl
to this:
,'\t,' + pg_get_constraintdef(con.oid)::VARCHAR(8000) AS ddl
addresses the problem for my use case. Unfortunately, I cannot speak to whether this would be the optimal way to solve this for all conditions due to the script's close ties with the deep inner workings of Redshift.

Column Encoding Utility

Hi,

Could the generated CREATE TABLE SQL include the column default values as well? I am currently forced to add all default values to the generated SQL file manually.

Thanks.
JJ

RedShift UDF to access S3 resource

Can AWS RedShift resource be used to access S3 objects?

I'd like to write a function that will import any given drop file from s3 directly into the cluster the UDF is running on.

Something like

CREATE FUNCTION import_s3_object(full_path_to_s3 VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
//COPY some_table FROM full_path_to_s3 
$$ LANGUAGE plpythonu
;

In doing so, I also would like the RedShift instance to not have to pass the credentials, but instead identify the role to be used that has the attached policies to access the file from RedShift.

To be explicit, I'd like my function to altogether not have to do

COPY public.some_table FROM 's3://some_bucket/drop_file.gz'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...;token=...'     delimiter '\t' gzip NULL AS '\000' TRUNCATECOLUMNS;"

I'd like the function to run the query like this:

COPY public.some_table FROM 's3://some_bucket/drop_file.gz'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
AS '\000' TRUNCATECOLUMNS;"

So, having verbalized my question, my requirement is a function to

CREATE FUNCTION import_s3_object(S3_FULL_PATH VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
//COPY public.some_table FROM 'S3_FULL_PATH'
//iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
//AS '\000' TRUNCATECOLUMNS;"
$$ LANGUAGE plpythonu
;

Unsafe migration order can cause lost data

Based on the description in https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ColumnEncodingUtility#data-migration, there's potential for data loss. The table that's getting migrated won't get locked until the ALTER TABLE statement that first modifies it. In between, other transactions could be writing to the table. None of those updates will be present in the final table once the migration transaction commits.

Would you consider moving the rename of the existing table to be the first thing that happens in order to lock the table early and prevent the potential for loss of updates?

See https://www.simple.com/engineering/safe-migrations-with-redshift for a more detailed discussion of this possibility.
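
A hedged sketch of the proposed ordering, loosely based on the generated script shown in a later issue in this document (table and column definitions are placeholders, and this is not the utility's actual output):

begin;
-- Renaming first takes the exclusive lock immediately, so concurrent writers
-- block here instead of landing updates on a table that is about to be replaced.
alter table public.some_table rename to some_table_$old;
create table public.some_table(
 "id" bigint encode lzo
 ,"payload" varchar(64) encode lzo
);
insert into public.some_table select * from public.some_table_$old;
analyze public.some_table;
commit;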

AnalyzeVacuumUtility produces missing shared library GLFW3 error running in Windows cmd

C:\Users\mc55168\Documents\Redshift\Analyze - Vacuum Utility>python analyze-vacuum-schema.py --db xxxxxdb --db-user admin --db-port 5439 --db-host us-east-xxxxx-us-cds-cpd1-redshift-xxxxxxxxxxxxxxx-xxxxx.xxxxxxxxxx.us-east-1.redshift.amazonaws.com --schema-name prod_tables --table-name cust_xxxx_xxxx_vers --output-file xxxxxxx_analyze_vacuum_cust_xxxx_xxxx_vers.log --debug True --ignore-errors False --slot-count 5 --vacuum-flag True --vacuum-parameter FULL --analyze-flag True --min-unsorted-pct 5 --max-unsorted-pct 50 --deleted-pct 15 --stats-off-pct 10 --max-table-size-mb 272417
Traceback (most recent call last):
  File "analyze-vacuum-schema.py", line 44, in <module>
    import pg
  File "C:\python27\lib\site-packages\pg\__init__.py", line 1, in <module>
    from .core import (
  File "C:\python27\lib\site-packages\pg\core.py", line 6, in <module>
    from . import glfw
  File "C:\python27\lib\site-packages\pg\glfw.py", line 140, in <module>
    raise ImportError("Failed to load GLFW3 shared library.")
ImportError: Failed to load GLFW3 shared library.

v_space_used_per_tbl gives divide by zero error when encountering table with zero records

These lines cause the error. Need to handle scenarios where tables can be empty.

,ROUND((a.unsorted_rows::FLOAT / a.rows::FLOAT) * 100, 5) AS pct_unsorted
    ,CASE WHEN (a.unsorted_rows::FLOAT / a.rows::FLOAT) * 100 >= 20 
        THEN 'VACUUM recommended' ELSE 'n/a' END AS recommendation
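
A guarded variant, mirroring the CASE-based form that appears in the v_space_used_per_tbl issue further down, would be along these lines:

,CASE WHEN a.rows = 0 THEN NULL
      ELSE ROUND((a.unsorted_rows::FLOAT / a.rows::FLOAT) * 100, 5) END AS pct_unsorted
,CASE WHEN a.rows = 0 THEN 'n/a'
      WHEN (a.unsorted_rows::FLOAT / a.rows::FLOAT) * 100 >= 20 THEN 'VACUUM recommended'
      ELSE 'n/a' END AS recommendation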

Note: Thanks for publishing these utilities. I have been putting together a similar one but you have done it several times better

Column Encoding Utility

Hi Eric,

I am using the column encoding utility and it is working great, but it has two missing pieces: a couple of extra flags to (1) apply the grants that exist on the old table to the new table, and (2) if any objects (views) depend on the old table, re-create them and copy the necessary privileges.

Adding these two missing pieces would make it a complete table migration script.

Preserving PK / constraints

Hi there!

This utility is fantastic. Thanks for the hard work of getting it out! I ran it on our analytics DB in Redshift, but noticed that primary keys and other constraints were not preserved, which we need for certain features of the ETL process that we built. Could you work on preserving those features of tables in the recreate process?

Thanks!
Andrew

An error occurred when executing v_generate_tbl_ddl

Hello

I tried to run the script "v_generate_tbl_ddl" but I get the error below.
My config is:
Cluster Version 1.0.1066
SQL Workbench/J Build 119 (2016-01-31)
driver --> RedshiftJDBC41-1.1.13.1013.jar

An error occurred when executing the SQL command:
CREATE OR REPLACE VIEW admin.v_generate_tbl_ddl
AS
SELECT
schemaname
,tablename
,seq
,ddl
FROM
(
SELECT
schemaname
,tablename
,seq
,ddl
...

Amazon Invalid operation: syntax error at or near "SELECT";

Execution time: 0.05s

1 statement failed.

Convert repository to python package

First of all, great work on this repo; it makes a Redshift admin's work much easier.
It would be cool, though, to turn this repository into a Python package, which would allow for a better programmatic experience.

Stuck using analyze-schema-compression

Hello

I've read the docs a few times, and a little bit of the code, but I don't get any results:

$ ./analyze-schema-compression.py --db '**' --db-user 'awsuser' --db-host '**.redshift.amazonaws.com' --db-port '5439' --comprows '200000' --do-execute true --analyze-table 'active_connections' --output-file  'report.txt' --debug true --force true --ignore-errors true
Password <awsuser>:
-- [7937] Redshift Column Encoding Utility Configuration
-- [7937] output_file: report.txt
-- [7937] db: ***
-- [7937] db_user: awsuser
-- [7937] db_host: ***.redshift.amazonaws.com
-- [7937] db_port: 5439
-- [7937] threads: 1
-- [7937] analyze_schema: None
-- [7937] analyze_table: active_connections
-- [7937] target_schema: None
-- [7937] debug: True
-- [7937] do_execute: True
-- [7937] query_slot_count: 1
-- [7937] ignore_errors: True
-- [7937] force: True
-- [7937] drop_old_data: False
-- [7937] comprows: 5824046
-- [7937] query_group: None
-- [7937] ssl_option: False
-- [7937] Connect [7937] **.redshift.amazonaws.com:5439:a***:awsuser
-- [7937] set search_path = '$user',public,None

Thanks again for the script

Error on View Permissions Query

When I try to select * from admin.v_get_view_priv_by_user I get an error:

ERROR:  cross-database references are not implemented: "facebookappmktg.facebook_adgroup_1234567__conversion_specs__action.type"
Query failed
PostgreSQL said: cross-database references are not implemented: "facebookappmktg.facebook_adgroup_1234567__conversion_specs__action.type"

I'm not sure what to make of that or how to begin debugging it.

Global variable values in ColumnEncodingUtility not passed as expected to multiprocessing pool on Windows

Using the column encoding utility on Windows leads to problems when providing values via the command line. An initial connection to Redshift is established from inside the main() function, but when analyze() is called by multiprocessing.map, the values in the global variables (db_user, db_pwd, etc...) that are passed to analyze() are not those provided in the command line, but rather the values defined by the block earlier in the script (lines 65-84). Some of these variables can be supplied by setting environment variables, however some (e.g. db_pwd, analyze_schema) are expected to be supplied via the command line. As a result, several variables required by analyze() are None.

The Python documentation recommends that, for Windows compatibility, arguments should be passed explicitly to functions called via multiprocessing, rather than relying on global variables.

An example of code that works on Linux but not as expected on Windows:

from multiprocessing import Pool

str_ex = "Not Updated"
def f(i):
    print(str_ex)

def main():
    global str_ex
    str_ex = "Updated"
    p = Pool(3)
    p.map(f, range(3))

if __name__ == "__main__":
    main()

On Linux the function f() sees the updated string, whereas on Windows, "Not Updated" is printed three times.
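
For comparison, a minimal sketch of the Windows-safe pattern the Python documentation recommends: pass the value to the worker explicitly instead of relying on a module-level global.

from functools import partial
from multiprocessing import Pool


def f(str_ex, i):
    # str_ex arrives as an argument, so spawn-based workers on Windows see the
    # caller's value rather than the module-level default.
    print(str_ex, i)


def main():
    str_ex = "Updated"
    with Pool(3) as p:
        p.map(partial(f, str_ex), range(3))


if __name__ == "__main__":
    main()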

top_queries.sql missing num_events column in subquery

This error is generated when I run the script. I added the COUNT(*) AS num_events expression to the subquery below to get it running.

ERROR: 42703: column alrt.num_events does not exist */

left outer join (
    select query, trim(split_part(event, ':', 1)) as event, COUNT(*) AS num_events
    from STL_ALERT_EVENT_LOG
    where event_time >= dateadd(day, -7, current_Date)
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query

Issues with v_generate_tbl_ddl.sql

Hello --
Thanks so much for these great scripts.
I've been using them for various Redshift diagnostics. I think I have found an issue with v_generate_tbl_ddl. When I try to run it I get this error message:

SELECT
 schemaname
 ,tablename
 ,seq
 ,ddl
FROM
 (
 SELECT
  schemaname
  ,tablename
  ,seq
  ,ddl
 FROM
  (
  --DROP TABLE
  SELECT
   n.nspname AS s...

[Amazon](500310) Invalid operation: syntax error at or near "SELECT";

I've been trying to troubleshoot this and figure out the query issue but it is a rather large query.
A couple questions to start:

  1. Is there a reason to have the ORDER BY clauses in the subqueries? I don't think these actually do anything. I'm actually a bit surprised Redshift allows this; SQL Server doesn't.

    Examples:
    Line 81, ORDER BY a.attnum
    Line 93, ORDER BY seq
    Line 141, ORDER BY abs(a.attsortkeyord))
    Line 152, ORDER BY abs(a.attsortkeyord)
    Line 181, ORDER BY seq

    More info on why ORDER BY in subqueries isn't useful, and could actually be harmful by making the database do extra unnecessary work: http://stackoverflow.com/questions/5119190/oracle-sql-order-by-in-subquery-problems.

    I realize the stackoverflow discussion above is about Oracle, but I think the general point is relevant for any RDBMS.

    I have found that sub-parts of the query will work if I remove the excess ORDER BY clauses.

  2. I think there are some extra parentheses?
    Examples:
    Line 127, UNION (SELECT
    Line 141, ORDER BY abs(a.attsortkeyord)) -- This time at the end, aside from the ORDER BY
    Line 157, WHERE c.relkind = 'r' )

  3. Does the query as a whole run correctly for you? I have been using SQL Workbench/J and just want to make sure it isn't some sort of query formatting or codepage issue.

Thanks!

Perf improvement opportunity long running transactions for ColumnEncodingUtility

ANALYZE COMPRESSION requires an exclusive table lock to begin, which is held for the duration of the transaction once granted, blocking reads.

It would be better to perform the analyze compression for each table (including the rebuild, if enabled) in its own single transaction. This approach would reduce lock contention.

getting list of tables by deleted-pct in AnalyzeVacuumUtility uses deprecated column and doesn't seem to work anymore

Hi,

The query used in https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AnalyzeVacuumUtility/analyze-vacuum-schema.py#L251 is using svv_table_info.empty which Amazon says is deprecated (http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_TABLE_INFO.html). From my testing, the column is always 0 for every table, even tables with a lot of deleted space to be reclaimed. So I believe this no longer works correctly. I opened a ticket on the redshift forum and they provided a new query:
select ae.event_time::date, qr.label,
       count(distinct ae.query) as query_count,
       count(*) as alert_count,
       max(ae.query) as sample_query
from stl_alert_event_log ae, svl_query_report qr, stl_query sq
where substring(ae.event, 1, 30) = 'Scanned a large number of dele'
  and ae.event_time::date between (getdate()-7)::date and getdate()::date
  and lower(sq.querytxt) not like 'padb_fetch_sample:%'
  and ae.query = qr.query
  and ae.segment = qr.segment
  and ae.step = qr.step
  and ae.slice = qr.slice
  and ae.query = sq.query
group by ae.event_time::date, qr.label
order by ae.event_time::date desc, 4 desc, 3 desc;

Anyhow, just thought I would report this issue. This new query isn't quite the same as before, so probably requires changing the script a bit.

Best,
Chris

Issue with primary key text not showing after table has been populated

The primary key line (from the CONSTRAINT LIST section) is only generated correctly right after the initial creation of a given table. Once the table has been populated, running the same script no longer shows the primary key line, and even deleting all rows in the table and re-running the script doesn't bring the primary key back.

Example below:


  DROP TABLE IF EXISTS temp_table;
  CREATE TABLE temp_table 
  (
    id                  INTEGER,
    text          VARCHAR,
    PRIMARY KEY (id)
  )
  DISTSTYLE ALL SORTKEY (id);

SELECT *
FROM admin.v_generate_tbl_ddl
WHERE tablename = 'temp_table'
AND   schemaname = 'schema1';

-- Primary key is shown

INSERT INTO temp_table
SELECT *
FROM actual_table;

SELECT *
FROM admin.v_generate_tbl_ddl
WHERE tablename = 'temp_table'
AND   schemaname = 'schema1';

-- Primary key isn't shown

TypeError: int() argument must be a string or a number, not 'NoneType'

Hello

First thanks for the tools. I get an error when using analyze-schema-compression

Traceback (most recent call last):
  File "./analyze-schema-compression.py", line 1004, in <module>
    main(sys.argv)
  File "./analyze-schema-compression.py", line 995, in main
    configure(output_file, db, db_user, db_pwd, db_host, db_port, analyze_schema, target_schema, analyze_table, threads, do_execute, query_slot_count, ignore_errors, force, drop_old_data, comprows, query_group, debug, ssl_option)
  File "./analyze-schema-compression.py", line 691, in configure
    comprows = None if _comprows == -1 else int(_comprows)
TypeError: int() argument must be a string or a number, not 'NoneType'

I removed the if statements for comprows and query_slot_count and set my values directly to avoid the problem.

Cheers

goback_no_of_days default value

In the Documentation of VACUUM Phase 1:

it takes goback_no_of_days as an argument, with a default value of 1, and uses it to find
recent logs via l.event_time >= DATEADD(DAY, $(goback_no_of_days), CURRENT_DATE)

This should be l.event_time >= DATEADD(DAY, -$(goback_no_of_days), CURRENT_DATE), or the default value should be -1.

Compression Utility - Table name started by a number

Hi guys,

I'm sorry, but I can't run the compression utility on tables whose name starts with a number, for example '8ball...'.

I'm getting this error:

(u'ERROR', u'42601', u'syntax error at or near ".8"', u'27', u'/home/ec2-user/padb/src/pg/src/backend/parser/parser_scan.l', u'699', u'yyerror', u'', u'')

Is there any way this can be quickly fixed?

Thanks a lot

python3 support for analyze-vacuum-schema.py

Hi, in order to support both Python 3 and Python 2, the following changes are needed:

  • remove leading zeros in integers
  • transform print str(err) to print(str(err))
  • remove "from string import uppercase"

I have no problem creating a branch and pull request if needed; I have the code written already.

[UnloadCopyUtility] Can't migrate between regions

After exporting the DB successfully into a staging S3 location in the same region as the source, I get an error when trying to import it into the target Redshift cluster from the staging S3: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint," which indicates that I am trying to access a bucket in the wrong region. The source Redshift and S3 are in us-east-1, and the target is in eu-west-1. The exact command I ran was "python redshift-unload-copy.py ./config.json us-east-1". Just for the sake of it I tried eu-west-1, but it failed at an even earlier stage, when trying to export the DB from the source to the staging S3.

Can anyone else confirm inability to migrate from one region to another? If it is a confirmed issue, I will submit a pull request.

Thank you.

admin.v_space_used_per_tbl Issue

The recommendation column on the above view doesn't seem to work. When you compare against select * from SVV_TABLE_INFO, tables that are unsorted are still showing up as 'VACUUM recommended' in the view below. I haven't given it much thought yet but will review the code tonight.

CASE WHEN a.rows = 0 THEN 'n/a'
            WHEN (a.unsorted_rows::FLOAT / a.rows::FLOAT) * 100 >= 20 THEN 'VACUUM recommended'
            ELSE 'n/a'
        END AS recommendation

I reviewed the code and it looks good. Here is the scenario: I run the view provided, pick a table that has VACUUM RECOMMENDED, and then run

vacuum zeus.audit_data_validation; analyze zeus.audit_data_validation;

I then run the view again expecting the message to go away, but it is still there. Is there a delay of some kind?

ColumnEncodingUtility needs to surround table and column names in double quotes

Ran into two bugs when I ran the column encoding utility on our database:

  1. It gives syntax errors when it encounters a table with a space in it, for example "Table 200".

  2. It gives syntax errors if it encounters a column name that is a reserved keyword, e.g. "open".

I believe the fix to both of these issues is that it needs to surround table and column names in double quotes in all of the queries that it generates.
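
For illustration, quoting every identifier in the generated SQL would produce statements along these lines (hypothetical table and column names matching the two cases above):

create table public."Table 200"(
 "open" varchar(64) encode lzo
 ,"date" timestamp encode lzo
);
analyze compression public."Table 200";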

Handle IDENTITY columns

The insert-select part of the migration in ColumnEncodingUtility does not handle columns of type IDENTITY, since SELECT gets all the columns but INSERT refuses to insert any columns of the type IDENTITY.
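
One possible workaround sketch (hypothetical table with an IDENTITY column named id): list the non-IDENTITY columns explicitly in both the INSERT and the SELECT. Note that the IDENTITY values are then regenerated in the new table rather than preserved, which may not be acceptable for every workload.

-- fails when "id" is an IDENTITY column:
--   insert into public.events_$mig select * from public.events;
-- workaround: name the non-IDENTITY columns explicitly
insert into public.events_$mig (event_time, event_name)
select event_time, event_name
from public.events;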

ColumnEncodingUtility - do-execute flag does nothing.

I tried running the script a few different ways to keep it from applying the changes, but couldn't find a way.
Here is an example:

~/dev/amazon-redshift-utils/src/ColumnEncodingUtility$ python analyze-schema-compression.py --db dev --db-user usr --db-pwd XXX --db-host "XXX.us-west-2.redshift.amazonaws.com" --db-port 00000 --output-file $(pwd)/redshift_output --analyze-table events_backup --do-execute false
-- [10944] Connected to reporting.c0ccf5cctsju.us-west-2.redshift.amazonaws.com:5439:dev as reporting
-- [10944] Analyzing Table 'events_backup' for Columnar Encoding Optimisations with 1 Threads...
-- [10944] Analyzing 1 table(s) which contain allocated data blocks
-- [10944] Table events_backup contains 5 unoptimised columns
-- [10944] Analyzing Table 'events_backup'
-- [10944] Column Encoding will be modified for public.events_backup
-- [10944] No encoding modifications required for public.events_backup
begin;
lock table public.events_backup;
create table public.events_backup_$mig(
"date" timestamp    encode lzo 
,"type" varchar(64)   encode lzo 
[.....]
)
;
insert into public.events_backup_$mig  select * from public.events_backup;
analyze public.events_backup_$mig;
alter table public.events_backup rename to events_backup_20170208_yFQbbzwoxs_$old;
alter table public.events_backup_$mig rename to events_backup;
commit;
-- [10944] Performed modification of 1 tables
-- [10944] Processing Complete

According to the documentation, the script should only modify the contents when --do-execute is set to true.
Am I missing something, or is the script ignoring this flag?

ColumnEncodingUtility - new feature - diff of suggested/current encoding

I've been using this utility and it's been working great. I thought it would be convenient if the utility could produce a diff of the suggested encodings and what is currently in use for a table/schema.

Just thought I'd post this for feedback to see if others think this would be useful.

Thanks,
Patrick

Admin Views :- Create or Replace View failing

Minor issue with master commit 54cc0fa77038e74303d1627a931d867bbe9a5bba.

The following views are breaking; we have to drop them explicitly because a column data type is changing.

  • v_open_session.sql
  • v_extended_table_info.sql
  • v_space_used_per_tbl.sql


Vacuum "delete only" on table with interleaved key

Hello,
we have an issue with the script AnalyzeVacuumUtility/analyze-vacuum-schema.py

Both of the phases,
"Extracting Candidate Tables for vacuum based on the alerts..." (line 262 as of 05/20)
and
"Extracting Candidate Tables for vacuum ..." (line 320 as of 05/20),

require tables to not have an interleaved sort key in order to be eligible for vacuum delete only.

We have some huge tables with an interleaved sort key, and we regularly delete rows from them, but the vacuum delete only is never launched by the script, so the tables get bigger every day.

The vacuum reindex is not launched either, because the max interleave skew is not very big.

Do you think it would be a good option to not check the table sort key when using the "delete only" vacuum parameter?

Thank you

analyze-vacuum-schema.py not working

analyze-vacuum-schema.py is not detecting any tables to vacuum and analyze. Sample output for reference: "Query Execution returned 0 Results".

While the queries in the above python script return 0 results, if I run the following SQL it fetches at least 10-15 tables:

select * from SVV_TABLE_INFO where unsorted > 10

ImportError when running redshift-upload-copy.py utility

Hello,

I'm getting this trace when I attempt to run the utility:

[root@host UnloadCopyUtility]# python redshift-unload-copy.py config.json us-west-2
Traceback (most recent call last):
  File "redshift-unload-copy.py", line 24, in <module>
    import pg
  File "/usr/lib64/python2.6/site-packages/pg.py", line 31, in <module>
    from _pg import *
ImportError: /usr/lib64/python2.6/site-packages/_pg.so: undefined symbol: PQescapeIdentifier

I'm not sure what this entails; I verified that my configuration's JSON looks correct. Could anyone who's familiar with this tell me what I'm missing?

rowcount in v_space_used_per_tbl.sql

What value should the rowcount field be showing?

If the distribution style is EVEN or KEY, the value of the rowcount field is the number of rows in the table.

If the distribution style is ALL, the value of the rowcount field is the number of nodes * the number of rows in the table. Is this correct?

DISTSTYLE not being set

DISTSTYLE isn't being set on tables. This is a problem for tables with DISTSTYLE ALL and EVEN

Support for PGPASSWORD and other environment variables

Folks at our organization typically use PGHOST, PGUSER, etc. when running command line tools to access Redshift or Postgresql. It would be nice to have analyze-schema-compression.py check for these so that the --db-host, etc. command line arguments don't need to be specified.

I'd be willing to make a PR for this. Do you consider PRs from the community?
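
For illustration, the fallback being requested could look something like the sketch below (function and variable names are hypothetical, and the current utility does not do this):

import os


def resolve_connection_settings(cli_host=None, cli_user=None, cli_port=None):
    # Prefer explicit command-line values, then fall back to the conventional
    # libpq environment variables (PGHOST, PGUSER, PGPORT).
    return {
        'host': cli_host or os.environ.get('PGHOST'),
        'user': cli_user or os.environ.get('PGUSER'),
        'port': cli_port or os.environ.get('PGPORT', '5439'),
    }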

Can we use the UnloadCopyUtility to migrate data between two Redshift clusters in two different accounts?

We have an EC2 instance that has VPC peering configured and is connected to a different user's VPC, so it is able to connect to that user's Redshift cluster. Can we run this UnloadCopyUtility to migrate data from the other user's Redshift cluster to our Redshift cluster?

(Our Redshift cluster is accessible over a public IP; in the future we might add one more network interface, assign it to our subnet, and make sure that it can connect to our Redshift cluster over a private IP as well.)

https://github.com/awslabs/amazon-redshift-utils/tree/master/src/UnloadCopyUtility
