
s3-sync's Introduction

s3-sync

A streaming upload tool for Amazon S3: it takes its input from a readdirp stream and emits the uploaded files as output.

s3-sync can optionally be backed by a level database, used as a local cache for file uploads. This way you can minimize how often you have to hit S3 and speed the whole process up considerably.

You can use this to sync complete directory trees with S3 when deploying static websites. It's a work in progress, so expect occasional API changes and additional features.

Installation

npm install s3-sync

Usage

require('s3-sync').createStream([db, ]options)

Creates an upload stream. Passes its options to knox, so at a minimum you'll need:

  • key: Your AWS access key.
  • secret: Your AWS secret.
  • bucket: The bucket to upload to.

The following are also specific to s3-sync:

  • concurrency: The maximum number of files to upload concurrently.
  • retries: The maximum number of times to retry uploading a file before failing. Defaults to 7.
  • headers: Additional headers to include on each uploaded file.
  • hashKey: By default, file hashes are stored using the file's absolute path as the key. This doesn't work well with temporary files, so you can pass a function that maps the file object to a string key for the hash (see the sketch after this list).
  • acl: Use a custom ACL header. Defaults to public-read.
  • force: Force s3-sync to overwrite any existing files.
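
For example, here's a minimal sketch of passing these options, including a hashKey function. The bucket name, header and acl values, and the assumption that the file object exposes a readdirp-style path property are all illustrative, not part of the documented API:

var s3sync = require('s3-sync')

var stream = s3sync.createStream({
    key: process.env.AWS_ACCESS_KEY
  , secret: process.env.AWS_SECRET_KEY
  , bucket: 'my-bucket'                          // assumed bucket name
  , concurrency: 8
  , retries: 3
  , acl: 'private'                               // example ACL, not the default
  , headers: { 'Cache-Control': 'max-age=3600' } // example header
  , hashKey: function(file) {
      // Key the cache on the path relative to the readdirp root rather
      // than the absolute (possibly temporary) path. `file.path` is an
      // assumption based on readdirp's entry objects.
      return file.path
    }
})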

You can also store your local cache in S3, provided you pass the following options, and use getCache and putCache (see below) before/after uploading:

  • cacheDest: the path to upload your cache backup to in S3.
  • cacheSrc: the local temporary text file to stream the cache to before uploading it to S3.

If you want more control over which files are uploaded and where they end up, you can write file objects directly to the stream, e.g.:

var stream = s3sync({
    key: process.env.AWS_ACCESS_KEY
  , secret: process.env.AWS_SECRET_KEY
  , bucket: 'sync-testing'
})

stream.write({
    src: __filename
  , dest: '/uploader.js'
})

stream.end({
    src: __dirname + '/README.md'
  , dest: '/README.md'
})

Where src is the absolute local file path, and dest is the location to upload the file to on the S3 bucket.

db is an optional argument - pass it a level database and it'll keep a local cache of file hashes, keeping S3 requests to a minimum.

stream.putCache(callback)

Uploads your level cache, if available, to the S3 bucket. This means that your cache only needs to be populated once.

stream.getCache(callback)

Streams a previously uploaded cache from S3 to your local level database.
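
Putting the cache options and these two methods together, here's a hedged sketch of wrapping an upload with a remote cache. The bucket name and cache file locations are made up, and it assumes the uploader emits a standard 'end' event once the upload finishes:

var level = require('level')
  , s3sync = require('s3-sync')
  , readdirp = require('readdirp')

var db = level(__dirname + '/cache')

var files = readdirp({
    root: __dirname
  , directoryFilter: ['!.git', '!cache']
})

var uploader = s3sync(db, {
    key: process.env.AWS_ACCESS_KEY
  , secret: process.env.AWS_SECRET_KEY
  , bucket: 'sync-testing'                // assumed bucket name
  , cacheSrc: __dirname + '/cache.json'   // assumed local temporary cache file
  , cacheDest: '/.sync-cache.json'        // assumed cache location in the bucket
})

// Pull the previously uploaded cache down before syncing...
uploader.getCache(function(err) {
  if (err) throw err
  files.pipe(uploader)
})

uploader.on('data', function(file) {
  console.log(file.fullPath + ' -> ' + file.url)
}).on('end', function() {
  // ...then push the updated cache back up once the upload has finished.
  uploader.putCache(function(err) {
    if (err) throw err
  })
})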

stream.on('fail', callback)

Emitted when a file has failed to upload. This fires on every failed attempt, so with retries enabled it may be emitted more than once for the same file.
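
A minimal sketch of a failure handler; what exactly gets passed to the listener isn't documented here, so treat the error argument as an assumption and check the module source:

uploader.on('fail', function(err) {
  // Log each failed attempt. The `err` argument is an assumption; the
  // README only documents that the event fires when an upload fails.
  console.error('upload attempt failed:', err)
})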

Example

Here's an example using level and readdirp to upload a local directory to an S3 bucket:

var level = require('level')
  , s3sync = require('s3-sync')
  , readdirp = require('readdirp')

// To cache the S3 HEAD results and speed up the
// upload process. Usage is optional.
var db = level(__dirname + '/cache')

var files = readdirp({
    root: __dirname
  , directoryFilter: ['!.git', '!cache']
})

// Takes the same options arguments as `knox`,
// plus some additional options listed above
var uploader = s3sync(db, {
    key: process.env.AWS_ACCESS_KEY
  , secret: process.env.AWS_SECRET_KEY
  , bucket: 'sync-testing'
  , concurrency: 16
  , prefix: 'mysubfolder/' // optional prefix for files on S3
}).on('data', function(file) {
  console.log(file.fullPath + ' -> ' + file.url)
})

files.pipe(uploader)

You can find another example which includes remote cache storage at example.js.

s3-sync's People

Contributors

aantthony, christophercliff, hguillermo, hughsk, michalkrupa, rkmax, sampsasaarela


s3-sync's Issues

Headers should be set per file, not only in the global options

It would be more practical to be able to specify which files get which headers, rather than applying the same headers to every synced file. Am I missing something here? All I can see is the option to add extra headers to every file being synced. It would be nice if it could be:

files: [
    {
        root: __dirname,
        src: 'here/some.js',
        dest: 'there/',
        gzip: true,
        compressionLevel: 9,
        headers: {
            'Content-Encoding': 'gzip'
        }
    }
]

peerDependencies don't allow me to install the package

$ npm i --save level s3-sync readdirp
npm WARN peerDependencies The peer dependency [email protected] included from s3-sync will no
npm WARN peerDependencies longer be automatically installed to fulfill the peerDependency 
npm WARN peerDependencies in npm 3+. Your application will need to depend on it explicitly.
npm WARN deprecated [email protected]: Please update to the latest object-keys
|
> [email protected] install /home/rkmax/Development/BIX/Velo/node_modules/level/node_modules/leveldown
> prebuild --download

npm ERR! Linux 4.1.6-1-ARCH
npm ERR! argv "/home/rkmax/.nvm/versions/node/v0.12.7/bin/node" "/home/rkmax/.nvm/versions/node/v0.12.7/bin/npm" "i" "--save" "level" "s3-sync" "readdirp"
npm ERR! node v0.12.7
npm ERR! npm  v2.11.3
npm ERR! code EPEERINVALID

npm ERR! peerinvalid The package level does not satisfy its siblings' peerDependencies requirements!
npm ERR! peerinvalid Peer [email protected] wants [email protected]

npm ERR! Please include the following file with any support request:
npm ERR!     /home/rkmax/myproject/npm-debug.log

Make the cache remote

Instead of maintaining your own local cache db, have the sync module:

  1. Check for the cache db on S3. If it exists, download it and start it up
  2. Check the delta and do the sync
  3. Push the db back to S3 and remove the local copy

Does this make sense? Doing a little research today and found some other folks using this approach. It's nice because the module handles the caching for you and you don't have to deal with any local files.

Option to define my own filename/path for cache invalidation purposes

We use a system that generates temporary (gzipped) files for upload, but for cache-invalidation purposes we want them to be treated as the original files being gzipped, even if we weren't gzipping them before and have only just chosen to.

Some way to define the file name and/or path we want the upload treated as, separate from the file name and/or path that's actually being uploaded, would be a pretty awesome feature to have!

Allow a custom URL

var destination =
          protocol + '://'
        + subdomain
        + '.amazonaws.com/'
        + options.bucket
        + '/' + relative

I'm using a private S3-compatible service, so allowing a custom URL would be helpful.

Files are not recognized as changed if they are modified with something other than s3-sync

If I mix tools like s3cmd and s3-sync, I can end up PUTting a file without the "x-amz-meta-syncfilehash" header. Subsequent s3-sync runs then won't recognize the file as changed and will just glide silently past it.

I'm sorry for being a bad citizen and not providing a simple test case. I'm actually using this through grunt-s3-sync, so it's a little awkward. I do have a proposed fix, though, that works for me. It's over at my fork so I guess I'll make a pull request.

Unsure which file has the issue when I get Error: Bad status code 400

>> [uploaded] https://s3.amazonaws.com/afr-prod/projects/politics/img/maps/Perth.png
Error: Bad status code: 400
>> [uploaded] https://s3.amazonaws.com/afr-prod/projects/politics/img/maps/Sydney.png

It would be nice to know which file got the 400, and whether it's retrying or silently failing.

hash comparisons with s3 headers

Hey mate,

Just been updating some apps to use this module (thanks!) and ran into an issue where files were always being re-uploaded even if they already existed. It looks like this is because the S3 ETag header never matches the generated MD5:

if (res.statusCode === 404 || res.headers.etag !== '"' + details.md5 + '"') return uploadFile(details, next)

This is because hashFile includes the headers and destination in the hash. Commenting out these lines fixes it:

    hash.update(JSON.stringify([
        options.headers
      , destination
    ]))

It might be better to hash just the file contents first to compare against the ETag, and then update the hash with the metadata before storing it in leveldb. What do you think?

If I get some time this week I'll take a look and send a PR :)

files are uploaded as public-read-write

see: https://github.com/hughsk/s3-sync/blob/master/index.js#L114
and the corresponding docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#CannedACL

The docs say this about public-read-write:

Owner gets FULL_CONTROL. The AllUsers group gets READ and WRITE access. Granting this on a bucket is generally not recommended.

And the AllUsers group is defined as this:

Access permission to this group allows anyone to access the resource. The requests can be signed (authenticated) or unsigned (anonymous). Unsigned requests omit the Authentication header in the request.

So if I'm reading this correctly, anyone on the internet can overwrite files uploaded by this tool. That sounds like a pretty serious security issue.
