
invenio-s3's Introduction

Invenio-S3

S3 file storage support for Invenio.

The package offers integration with any S3 REST API compatible object storage.

Further documentation is available on https://invenio-s3.readthedocs.io/

invenio-s3's People

Contributors

borellim, egabancho, lnielsen, mvidalgarcia, utnapischtim, wgresshoff


invenio-s3's Issues

File upload fails for big files

When uploading big files, the total number of parts exceeds the maximum (1000) and the upload fails. This probably happens because s3fs doesn't use the file size to calculate the chunk size; it uses the default value (5 MB) if no other value is provided.

Maybe we can assume that the file size is passed as a parameter (somehow making it mandatory) and, when instantiating the s3fs object, pass the chunk size, default_block_size, which should be the max of 5 MB and file_size/1000 (5 MB is the smallest we can go). This works in my head, but I haven't tried it yet, and there might be some issues ☺️

Other alternatives are welcome.
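The chunk-size idea above can be sketched roughly as follows. This is only an illustration of the proposal, not the package's implementation: the function name `block_size_for` and the constants are hypothetical, the 1000-part limit is the figure quoted in this issue, and 5 MB is taken here as 5 * 1024 * 1024 bytes. The result would then be passed as `default_block_size` when the s3fs object is created, as suggested above.

```python
# Hypothetical helper illustrating the proposal; names are not from invenio-s3.
MIN_PART_SIZE = 5 * 1024 * 1024  # smallest chunk size mentioned in the issue
MAX_PARTS = 1000                 # maximum number of parts quoted in the issue


def block_size_for(file_size, max_parts=MAX_PARTS, min_part_size=MIN_PART_SIZE):
    """Return a chunk size that keeps the upload within max_parts parts."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Integer ceiling division, so the parts always cover the whole file.
    needed = -(-file_size // max_parts)
    return max(min_part_size, needed)


if __name__ == "__main__":
    for size in (1024 ** 3, 10 * 1024 ** 3):
        print(size, block_size_for(size))
```

For small files the minimum chunk size wins; for large files the chunk size grows with the file so the part count stays at or below the limit.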

Number of parts is not correctly calculated

Package version (if known): master

Describe the bug

When calculating the number of parts we use the default integer rounding:

size // current_app.config['S3_MAXIMUM_NUMBER_OF_PARTS']

https://github.com/inveniosoftware/invenio-s3/blob/master/invenio_s3/storage.py#L26

This can (and will) result in uploading more parts than the maximum allowed (max+1), because integer division always rounds down (3.1 will result in 3 rather than 4).

Expected behavior

The number of parts should never exceed S3_MAXIMUM_NUMBER_OF_PARTS
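A small sketch of the effect described above, assuming the expression computes the part size and that the number of parts is the file size divided by that part size, rounded up. The helper names and the sample numbers (a size of 3100 with a maximum of 1000 parts, mirroring the 3.1 example) are illustrative only.

```python
def ceil_div(a, b):
    """Integer division that rounds up instead of down."""
    return -(-a // b)


def part_count(size, part_size):
    """Number of parts needed to cover size bytes with parts of part_size bytes."""
    return ceil_div(size, part_size)


max_parts = 1000
size = 3100  # size / max_parts == 3.1

floored = size // max_parts           # 3: rounds down, as in the reported code
rounded_up = ceil_div(size, max_parts)  # 4: rounds up

print(part_count(size, floored))     # 1034, above the maximum of 1000
print(part_count(size, rounded_up))  # 775, within the maximum
```

Rounding the part size up is one way to guarantee the expected behavior; other fixes may be preferable in the actual code.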

Make the region name configurable too

If the region name is not configured, it will automatically be mapped to 'us-east-1'. This might be a problem for AWS users outside the US or users of other S3 implementations (Ceph S3 seems to work fine with 'us-east-1').
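One possible shape for this, as a sketch only: read the region from the application config and forward it to the S3 client only when it is set. The config key `S3_REGION_NAME` and the helper `s3_client_kwargs` are hypothetical names used for illustration; `region_name` is the keyword accepted by boto3/botocore clients.

```python
def s3_client_kwargs(config):
    """Build keyword arguments for the S3 client from a config mapping.

    S3_REGION_NAME is a hypothetical config key; when it is absent or empty,
    no region is passed and the client falls back to its own default.
    """
    kwargs = {}
    region = config.get("S3_REGION_NAME")
    if region:
        kwargs["region_name"] = region
    return kwargs


if __name__ == "__main__":
    print(s3_client_kwargs({}))
    print(s3_client_kwargs({"S3_REGION_NAME": "eu-central-1"}))
```

Leaving the key unset keeps today's behavior, so existing deployments (including Ceph S3 setups) would not need to change anything.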

Upload speed to S3

Hello.
Is there anything that we can do to increase the upload speed to an S3 service via invenio-s3?

I compared the upload speed obtained in our app versus a direct upload to S3 with boto3 (from the same machine that serves our app), and I am getting different results. For a 1 GB file, when uploading through our app we see first 150-200 Mbps data transfer from the browser for about 1 minute, with gunicorn sitting at 99% CPU; then for about 2 minutes we see no upload from the browser, while gunicorn sits at 10-15% CPU, until the browser finally receives a 200 response (total 3 minutes). With a direct upload to S3 via boto3, instead, it takes about 13 seconds in total.

To simplify testing, I'm using a simple Flask view, in which I have the following lines that do the job:

        f = request.files['file']
        s3fs = S3FSFileStorage('s3://test_s3/test-file-2')
        s3fs.initialize(size=0, acl='private')
        s3fs.update(f.stream, acl='private')

In the real app, we actually create a record with invenio_deposit.api.Deposit.create(), then attach the file to the record, but we see the same speed as in this simple test.

Our setup is: Apache2 acting as front line server, with a reverse proxy to gunicorn on the same machine. Setting or not DEBUG=True in config.py does not seem to make a difference for this.

We are actually using our own fork of invenio-s3, with some changes that we needed to make it work (I opened PR #8 in case you find them useful), but I don't think they are relevant to this issue.

I also found some code to profile requests to gunicorn: I'll paste below the result, but I'm not quite sure how to interpret it.

Thanks a lot in advance for the help!

[POST] URI /s3/upload
         130142856 function calls (130133607 primitive calls) in 226.399 seconds

   Ordered by: internal time, cumulative time
   List reduced from 1503 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      413   96.123    0.233   96.123    0.233 {method 'poll' of 'select.poll' objects}
   269586   15.689    0.000   15.689    0.000 {method 'read' of '_ssl._SSLSocket' objects}
  8372516   14.946    0.000   42.725    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/wsgi.py:733(_iter_basic_lines)
 16745019   13.118    0.000   28.627    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/tempfile.py:903(write)
        2   12.813    6.406   99.145   49.573 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:531(parse_parts)
 16737045   12.508    0.000   12.508    0.000 {method 'write' of '_io.BufferedRandom' objects}
 16745022   11.310    0.000   57.705    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:427(parse_lines)
     2078    7.606    0.004    7.606    0.004 {method 'update' of '_hashlib.HASH' objects}
  1048577    5.407    0.000   19.415    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/body.py:112(read)
  8372516    3.666    0.000   46.394    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/wsgi.py:687(make_line_iter)
   131285    3.255    0.000    3.255    0.000 {method 'write' of '_ssl._SSLSocket' objects}
      206    3.100    0.015    3.100    0.015 {method 'read' of '_io.BufferedRandom' objects}
 16745019    2.995    0.000    2.997    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/tempfile.py:792(_check)
  3434517    2.814    0.000    2.814    0.000 {method 'write' of '_io.BytesIO' objects}
  1312794    2.354    0.000   10.290    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/unreader.py:21(read)
   132917    2.079    0.000    2.079    0.000 {method 'read' of '_io.BytesIO' objects}
    16386    1.822    0.000   22.031    0.001 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/body.py:199(read)
    16385    1.733    0.000    1.733    0.000 {method 'splitlines' of 'bytes' objects}
  8425799    1.617    0.000    1.617    0.000 {method 'append' of 'list' objects}
  8601967    1.136    0.000    1.136    0.000 {built-in method builtins.len}
  8391318    1.126    0.000    1.185    0.000 {method 'join' of 'bytes' objects}
  1048577    1.022    0.000    1.832    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/unreader.py:53(unread)
  2362416    0.629    0.000    0.629    0.000 {method 'seek' of '_io.BytesIO' objects}
   131285    0.438    0.000    4.185    0.000 /usr/lib/python3.5/ssl.py:881(sendall)
  3716346    0.427    0.000    0.427    0.000 {method 'tell' of '_io.BytesIO' objects}
  1065394    0.409    0.000    0.409    0.000 {built-in method builtins.min}
  2113508    0.404    0.000    0.404    0.000 {method 'getvalue' of '_io.BytesIO' objects}
   269586    0.379    0.000   16.350    0.000 /usr/lib/python3.5/ssl.py:783(read)
   264250    0.362    0.000    6.925    0.000 /usr/lib/python3.5/ssl.py:907(recv)
       11    0.332    0.030    0.332    0.030 /usr/lib/python3.5/json/decoder.py:345(raw_decode)
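One way to start reading a listing like the one above is to group the internal time (tottime) by component. The sketch below does that for a handful of rows copied from the profile; the choice of rows, the grouping keywords, and the function name are assumptions made for illustration, and the result is a rough breakdown rather than a diagnosis.

```python
from collections import defaultdict

# A few rows copied verbatim from the profile above.
PROFILE_ROWS = """\
      413   96.123    0.233   96.123    0.233 {method 'poll' of 'select.poll' objects}
   269586   15.689    0.000   15.689    0.000 {method 'read' of '_ssl._SSLSocket' objects}
  8372516   14.946    0.000   42.725    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/wsgi.py:733(_iter_basic_lines)
 16745019   13.118    0.000   28.627    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/tempfile.py:903(write)
        2   12.813    6.406   99.145   49.573 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:531(parse_parts)
 16745022   11.310    0.000   57.705    0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:427(parse_lines)
"""


def tottime_by_keyword(rows, keywords):
    """Sum the tottime column for rows whose function name contains a keyword."""
    totals = defaultdict(float)
    for line in rows.splitlines():
        # Columns: ncalls, tottime, percall, cumtime, percall, function name.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        tottime, name = float(fields[1]), fields[5]
        for keyword in keywords:
            if keyword in name:
                totals[keyword] += tottime
                break  # count each row once, under the first matching keyword
    return dict(totals)


totals = tottime_by_keyword(
    PROFILE_ROWS, ["select.poll", "werkzeug", "tempfile", "_ssl"]
)
for keyword, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{keyword:12s} {seconds:8.3f} s")
```

Time in `select.poll` is time spent blocked waiting on I/O rather than doing CPU work, while the werkzeug and tempfile rows are Python-level work done while the multipart form body is parsed and spooled to a temporary file before the view code runs.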
