
amazon-redshift-udfs's Introduction

Amazon Redshift UDFs

A collection of stored procedures and user-defined functions (UDFs) for Amazon Redshift. The intent of this collection is to provide examples for defining useful functions which extend Amazon Redshift capabilities and support migrations from legacy DB platforms.

Stored Procedures

Each procedure is allocated a folder. At minimum, each procedure will have a <procedure_name>.sql file which you may use to deploy the procedure. Optionally, it may contain a README.md file with instructions on how to use the procedure, as well as additional files used to test the procedure.

UDFs

Each function is allocated a folder. At minimum, each function will have the following files, which are used by the deployFunction.sh and testFunction.sh scripts (a sketch of a minimal function.sql follows the list):

  • function.sql - the SQL script to be run in the Redshift DB which creates the UDF. For a Lambda function, use the string :RedshiftRole for the IAM role; the deployment script substitutes the actual role.
  • input.csv - a list of sample input parameters to the function, delimited by comma (,) and with strings denoted by single quotes. Note: the input.csv file MUST have a trailing newline character.
  • output.csv - a list of expected output values from the function. Note: the output.csv file can be excluded when the UDF will not return consistent output between deployments or runs, but an output.csv.sample file should be created so users understand what to expect. See f_kms_decrypt & f_kms_encrypt for an example.
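For example, a minimal function.sql for a simple Python UDF might look like the sketch below (the function name and logic here are illustrative, not taken from the repository); input.csv would then hold one comma-separated parameter list per line, and output.csv the matching expected results:

-- function.sql (illustrative sketch)
CREATE OR REPLACE FUNCTION f_upper_varchar (s VARCHAR)
RETURNS VARCHAR
IMMUTABLE
AS $$
    -- return the upper-cased input, passing NULL through unchanged
    return s.upper() if s is not None else None
$$ LANGUAGE plpythonu;

-- input.csv (one parameter list per line, trailing newline required):
--   'hello'
--   'world'
-- output.csv (expected results, in the same order):
--   HELLO
--   WORLD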

python-udfs

Python UDFs may include the following additional file:

  • requirements.txt - a list of modules, if your function requires modules not already available in Redshift. The modules will be packaged, uploaded to S3, and mapped to a library in Redshift (see the sketch after this list). Note: the requirements.txt file MUST have a trailing newline.
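The deployment script maps each packaged module to a Redshift library with a statement along these lines (a sketch only; the library name, bucket path, and IAM role below are placeholders supplied via $S3_LOC and $REDSHIFT_ROLE):

-- illustrative CREATE LIBRARY statement issued by the deployment script
CREATE OR REPLACE LIBRARY ua_parser
LANGUAGE plpythonu
FROM 's3://<your-s3-location>/ua_parser.zip'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<your-redshift-role>';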

lambda-udfs

Lambda UDFs may include the following additional file:

  • lambda.yaml - [REQUIRED] a CFN template containing the Lambda function. The Lambda function name should match the Redshift function name with '_' replaced by '-' (e.g. f-upper-python-varchar); see the example function.sql after this list. The template may contain additional AWS services required by the Lambda function and should contain an IAM Role which can be assumed by the Lambda service and which grants access to those additional services (if applicable). These samples use "*" for IAM resource policies. In a production deployment, modify the IAM Role policies to scope down access.

  • resources.yaml - a CFN template containing external resources which may be referenced by the Lambda function. These resources are for testing only.

  • requirements.txt - (Python Only) a list of modules, if your function requires modules not already available in the Lambda Python runtime. These modules will be packaged and uploaded to S3. You will need to include an AWS::Lambda::LayerVersion resource for each module in your lambda.yaml file, referencing the module and version number. See f_glue_schema_registry_avro_to_json as an example. Note: the requirements.txt file MUST have a trailing newline.

  • package.json - (NodeJS Only) a list of modules, if your function requires modules not already available in Lambda. The modules will be packaged, uploaded to S3, and mapped to your Lambda function. See f_mysql_lookup_nodejs for an example.

  • index.js - (NodeJS Only) your JavaScript handler code. See f_upper_nodejs for an example.

  • pom.xml - (Java Only) lists out dependencies as well as the name of your handler function. See f_upper_java for an example. See the Maven Documentation for more details on writing a pom.xml file.

  • src/main/java/function/Handler.java - (Java Only) your Java handler code. See f_upper_java for an example.
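To illustrate the naming convention, the function.sql for a Lambda UDF such as f_upper_python would contain something along these lines (a sketch; the exact signature in the repository may differ):

CREATE OR REPLACE EXTERNAL FUNCTION f_upper_python (VARCHAR)
RETURNS VARCHAR
STABLE
LAMBDA 'f-upper-python-varchar'   -- '_' in the SQL name becomes '-' in the Lambda function name
IAM_ROLE :RedshiftRole;           -- substituted by deployFunction.sh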

sql-udfs

SQL UDFs do not require any additional files.
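Since a SQL UDF is self-contained, its function.sql is all that is needed besides input.csv and output.csv. A minimal sketch (hypothetical function, not from the repository):

CREATE OR REPLACE FUNCTION f_null_if_empty (VARCHAR)
RETURNS VARCHAR
IMMUTABLE
AS $$
    -- convert empty strings to NULL
    SELECT NULLIF($1, '')
$$ LANGUAGE sql;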

Networking Considerations

Lambda functions can be deployed in a VPC if they need access to resources within the private network. In this case, you may need to add network connectivity (e.g. VPC Endpoints) to allow communication to AWS services. If your Lambda function is deployed in a VPC and uses AWS services (e.g. S3, DDB, Glue, etc.) in its code, please define the VPC Endpoint resources in your resources.yaml file.

Deployment & Testing

Located in the bin directory are tools to deploy and test your UDFs.

deployFunction.sh

This script will orchestrate the deployment of the UDF to your AWS environment. This includes:

  1. Looping through modules in a requirements.txt file (if present)
    • For Python UDFs, installs dependencies using the libraryInstall.sh script by uploading the packages to the $S3_LOC and creating the library in Redshift using the $REDSHIFT_ROLE.
    • For Lambda UDFs, installs dependencies as Lambda layers which are referenced in the Lambda CloudFormation using the S3Bucket and S3Key parameters (-s and -k, respectively).
  2. If deploying a NodeJS Lambda UDF, using package.json to run npm install, packaging the code, and uploading the zip file to the $S3_LOC.
  3. If deploying a Java Lambda UDF, using pom.xml to run mvn package, packaging the code, and uploading the jar to the $S3_LOC.
  4. If deploying a Lambda UDF, using lambda.yaml to run aws cloudformation deploy and build the needed resources.
  5. Creating the UDF function in Redshift by executing the function.sql script. If deploying a Lambda UDF, substituting the :RedshiftRole parameter.
./deployFunction.sh -t lambda-udfs -f "f_upper_python(varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA -r $REDSHIFT_ROLE

./deployFunction.sh -t python-udfs -f "f_ua_parser_family(varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA -r $REDSHIFT_ROLE -s $S3_LOC

./deployFunction.sh -t sql-udfs -f "f_mask_varchar(varchar,varchar,varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA

testFunction.sh

This script will test the UDF by:

  1. Creating a temporary table containing the input parameters of the function.
  2. Loading the function's sample input data from the input.csv file.
  3. Running the function against the sample data and comparing the output to the output.csv file (a conceptual SQL sketch follows the examples below).
./testFunction.sh -t lambda-udfs -f "f_upper_python(varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA

./testFunction.sh -t python-udfs -f "f_ua_parser_family(varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA

./testFunction.sh -t sql-udfs -f "f_mask_varchar(varchar,varchar,varchar)" -c $CLUSTER -d $DB -u $USER -n $SCHEMA
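Conceptually, the test flow boils down to SQL like the following (table and column names here are illustrative; the script's actual mechanics may differ):

-- 1. temporary table matching the function's input parameters
CREATE TEMP TABLE f_upper_python_input (p1 VARCHAR);

-- 2. rows from input.csv are loaded into the temporary table here

-- 3. run the function against the sample data; the result set is then
--    compared against the contents of output.csv
SELECT f_upper_python(p1) FROM f_upper_python_input;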

Contributing / Pull Requests

We would love your contributions. See the contributing page for more details on creating a fork of the project and a pull request of your contribution.

Pull requests will be tested using a GitHub workflow which leverages the above testing scripts. Please execute these scripts prior to submitting a pull request to ensure the request is approved quickly. When executed in the test environment, the RedshiftRole will be defined as follows. You can create a similar role in your local environment for testing.

Appendix

Redshift Role

For Lambda UDFs, these privileges ensure UDFs can invoke the Lambda function as well as access the uploaded *.whl files located in S3.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:f-*",
                "arn:aws:s3:::<test bucket>/*"
            ]
        }
    ]
}

amazon-redshift-udfs's People

Contributors

bashsz, bgmello, dependabot[bot], eeesk, ericfe, hyandell, ianmeyers, inohiro, joeharris76, mmehrten, rajiv914, rjvgupta, runningjon, saeed2402, sdia, zach-data


amazon-redshift-udfs's Issues

Redshift can't open shared object file

I'm trying to install the h3 library in Redshift using Pip and the utility install script installPipModuleAsRedshiftLibrary.sh.

I am building h3 on Ubuntu 16.04. I am able to successfully upload the h3.zip file to my S3 bucket.

However, when I try to run the UDF, I get this error

OSError: /rdsdbdata/user_lib/0/0/288461.zip/h3/out/libh3.so.1: cannot open shared object file: Not a directory

Full stack trace from svl_udf_log:

File "h3.py", line 39, in <module>
File "__init__.py", line 443, in LoadLibrary
    return self._dlltype(name)
File "__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)

The UDF I'm trying to use:

CREATE OR REPLACE function get_hexbin(coord_lat double precision, coord_lon double precision, res INTEGER)
  RETURNS VARCHAR IMMUTABLE
AS $$
  from h3 import h3
  return 'hello world'
$$ LANGUAGE plpythonu; 

While installing h3 locally, I see this message about skipping wheel but then it uploads a zip file to S3 anyway:

Collecting h3
Saved /home/bhavika/Desktop/.h3/h3-3.1.0-cp36-cp36m-linux_x86_64.whl
Skipping h3, due to already being wheel.

import fails for murmur2 package

Hello,

I am trying to import the murmur2 package as a library in the Redshift database. I did the following steps:

  1. Run the module packer
    $ ./installPipModuleAsRedshiftLibrary.sh -m murmur2 -s s3://path/to/murmur2/lib

  2. Create library on redshift

CREATE OR REPLACE LIBRARY murmur2
LANGUAGE plpythonu
from 's3://path/to/murmur2/lib/murmur2.zip'
WITH CREDENTIALS AS 'aws_access_key_id=AAAAAAAAAAAAAAAAAAAA;aws_secret_access_key=SSSSSSSSSSSSSSSSS'
region 'us-east-1';

  3. Create function and query
create OR REPLACE function f_py_kafka_partitioner (s varchar, ps int)
  returns int
stable
as $$
  import murmur2
  m2 = murmur2.murmur64a(s, len(s), 0x9747b28c)
  return m2 % ps

$$ language plpythonu;

SELECT f_py_kafka_partitioner('jiimit', 100);

This gives following error :

[Amazon](500310) Invalid operation: ImportError: No module named murmur2. Please look at svl_udf_log for more information
Details: 
 -----------------------------------------------
  error:  ImportError: No module named murmur2. Please look at svl_udf_log for more information
  code:      10000
  context:   UDF
  query:     0
  location:  udf_client.cpp:366
  process:   padbmaster [pid=31381]
  -----------------------------------------------;

ua-parser not working: No module named _regexes. Please look at svl_udf_log for more information

I have created ua-parser.zip, following instructions that have been provided here.

When I extract to check, I see that it has "regexes.py" as well as "user_agent_parser.py"

After creating library in redshift, and trying to do simple user agent parsing, it fails with the following error:
"ImportError: No module named _regexes. Please look at svl_udf_log for more information".

The same code from the function I have tested in terminal using Python and it works.

So there must be some problem with how Redshift handles this.

Script fails to find wheel files or zip results.

When the script is executed on an EC2 install it fails to define "wheelFile" and execute the find operation.

wheelFile=`find . -name *.whl`

enclosing the wildcard in quotes resolves the issue in this environment:

wheelFile=`find . -name "*.whl"`

Additionally, the script does not zip the files before transporting them with an S3 copy command.

I am proposing a pull request to resolve these.

Broken link to index.js for lambda-udfs/f_upper_nodejs

Hello,

Seems like you have a broken link to the NodeJS upper functionality. I was trying to figure out how to write a UDF in Go and was looking for a reference in NodeJS as I am more familiar with it. There seem to be a couple of things missing/incomplete:

  1. The comments are referring to Python in this directory. See line 3 of function.sql:

This sample function demonstrates how to create/use lambda UDFs in python

  2. The index.js file is missing.
  3. I believe the docs say that package.json is needed, but I don't see it.

So can someone add the index.js on there? I would be happy to send a PR too for this

Questions from AWS customer

Hello Zach,

I have a customer that tried to use your script to install a Pip module and has some questions around the github script [https://github.com/aws-samples/amazon-redshift-udfs/blob/master/bin/PipLibraryInstaller/installPipModuleAsRedshiftLibrary.sh] developed by you.

Customer questions:

Q] I do have a lingering question, however, around why the installPipModuleAsRedshiftLibrary.sh script didn't work? In the future, it might be nice to use this, but it didn't seem like Redshift liked me uploading the wheel files it generated... Can you let me know if the Redshift engineers have a fix or workaround for this? Or let me know if I've used the script incorrectly?

Let me know.

thanks,
Varsha

RedShift function to import S3 file

Can AWS RedShift resource be used to access S3 objects?

I'd like to write a function that will import any given drop file from s3 directly into the cluster the UDF is running on.

Something like

CREATE FUNCTION import_s3_object(full_path_to_s3 VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
//COPY some_table FROM full_path_to_s3 
$$ LANGUAGE plpythonu
;

In doing so, I also would like the RedShift instance to not have to pass the credentials, but instead identify the role to be used that has the attached policies to access the file from RedShift.

To be explicit, I'd like my function to altogether not have to do

COPY public.some_table FROM 's3://some_bucket/drop_file.gz'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...;token=...'     delimiter '\t' gzip NULL AS '\000' TRUNCATECOLUMNS;"

I'd like the function to run the query as such:

COPY public.some_table FROM 's3://some_bucket/drop_file.gz'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
AS '\000' TRUNCATECOLUMNS;"

So, having verbalized my question, my requirement is a function to

CREATE FUNCTION import_s3_object(S3_FULL_PATH VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
//EXECUTE: COPY public.some_table FROM 'S3_FULL_PATH'
//iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
//AS '\000' TRUNCATECOLUMNS;"
$$ LANGUAGE plpythonu
;

Unify this repo with Redshift Utils

Hi,

I've previously been working out some extensions to Redshift Utils for UDFs, and it would be great to unify these two repositories. The first item is an enciphering library that would complement your encryption UDFs quite nicely, and the second is a utility which will install any Pip module into a cluster as a library. Is that something you'd be interested in getting set up?

Thx,

Ian

Request: example multi-file UDF

This and other sample sets have been really nice for understanding a one-file UDF. However, some projects are more complex. Please add a simple example of a multi-file UDF.

This request is inspired by a situation where it would be great to use the UAP User-Agent Parser project inside a Redshift UDF. This UDF would require code, a configuration file, and the UAP Python library. Walking through all of the options to make this work is confusing, as I'm not a Python run-time ninja.
