Multipart uploads to s3 using aws-sdk v2 for ruby…

Nov 16, 2015

—

The Ruby guys over at AWS have done a great job of explaining file uploads to S3, but they left out how to perform multipart uploads, citing reservations about “advanced use cases”.

Prerequisites:

identify an S3 bucket to upload a file to — use an existing bucket or create a new one
create or identify a user with an access key and secret access key
install version 2 of the aws-sdk gem
this example uses ruby 2.2.0; other versions may not be supported or may not work the same way

A multipart upload consists of four steps: set up a client connection to S3, call create_multipart_upload, make one or more calls to upload_part, and finally, if all works out, call complete_multipart_upload. If something goes wrong, the final step is a call to abort_multipart_upload instead.

Before starting the upload, I will first establish a constant that does double duty: it is the size of each part, and it is the file size above which the upload will be executed using multipart. Files at or below that size are sent with a single put_object call.

# 100MB
PART_SIZE=1024*1024*100
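
As a rough check on that choice, the number of parts a file needs is the file size divided by the part size, rounded up. The sketch below assumes the S3 limits of 10,000 parts per upload and a 5MB minimum for every part except the last. The part_count helper and the constant names are mine, for illustration only, and are not part of the SDK.

```ruby
# Hypothetical helper and constants, not part of aws-sdk.
PART_SIZE_BYTES = 1024 * 1024 * 100   # same 100MB value as PART_SIZE above
MIN_PART_SIZE   = 1024 * 1024 * 5     # assumed S3 minimum for all but the last part
MAX_PARTS       = 10_000              # assumed S3 maximum number of parts per upload

def part_count(file_size, part_size = PART_SIZE_BYTES)
  raise ArgumentError, 'part size below the S3 minimum' if part_size < MIN_PART_SIZE

  count = (file_size + part_size - 1) / part_size   # integer ceiling division
  raise ArgumentError, 'file would need more than 10,000 parts' if count > MAX_PARTS

  count
end

puts part_count(250 * 1024 * 1024)   # a 250MB file needs 3 parts
```

With 100MB parts, the 10,000 part ceiling is only reached for files close to 1TB, so the value above leaves plenty of headroom.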

Next, reopen the File class and set up a method to yield the file one part at a time:

class File
  def each_part(part_size=PART_SIZE)
    yield read(part_size) until eof?
  end
end
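
If you would rather not reopen a core class, the same chunking can be written as a standalone method. This is a minimal sketch of that alternative, not the approach the rest of this script uses, and read_in_parts is a name I made up for illustration.

```ruby
require 'stringio'

# Hypothetical alternative to File#each_part that works with any IO-like object.
# Yields successive chunks of at most part_size bytes until the IO is exhausted.
def read_in_parts(io, part_size)
  return enum_for(:read_in_parts, io, part_size) unless block_given?

  # IO#read(n) returns nil at end of file, which ends the loop.
  while (chunk = io.read(part_size))
    yield chunk
  end
end

# StringIO stands in for a real file here.
read_in_parts(StringIO.new('abcdefghij'), 4) { |part| puts part.bytesize }
```

Called without a block it returns an enumerator, which is handy if you want to count or index the parts.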

The rest of this script assumes that four or five parameters were passed in from the command line. The reason we pass all of these in is so that we can keep credentials out of source control and so we can use this script with any bucket and any object. The optional ‘prefix’ parameter is text that is prepended to the key, which makes it easier to organize objects into a directory structure.

(access,secret,bucket,localfile,prefix) = ARGV
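
Since the script depends on the number of arguments, it can be worth failing early when too few or too many are given. A minimal sketch, assuming you are happy to raise an error carrying a usage message. parse_args is a hypothetical helper that the original script does not define.

```ruby
# Hypothetical helper: validates the argument count and names each value.
def parse_args(argv)
  usage = 'usage: upload_file.rb ACCESS_KEY SECRET_KEY BUCKET FILE [PREFIX]'
  raise ArgumentError, usage unless (4..5).cover?(argv.length)

  access, secret, bucket, localfile, prefix = argv
  { access: access, secret: secret, bucket: bucket, localfile: localfile, prefix: prefix }
end

# Placeholder values for illustration; in the script this would be parse_args(ARGV).
args = parse_args(%w[AKIDEXAMPLE secretexample my-bucket /tmp/big.iso backups])
puts args[:prefix]   # the optional fifth argument; nil when omitted
```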

Next, we need to establish an authenticated client connection to S3:

# version 2 of the aws-sdk gem
require 'aws-sdk'

s3 = Aws::S3::Client.new(
  region: 'us-east-1',
  credentials: Aws::Credentials.new(access,secret),
)

This will establish a connection to the us-east-1 region using the access key and secret access key variables listed.

Next, I will set up the key with any path information removed and with the optional prefix prepended:

  filebasename = File.basename(localfile)

  if prefix
    key = prefix + "/" + filebasename
  else
    key = filebasename
  end
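
One thing to watch for is a prefix passed with a trailing slash, which would produce a key with a doubled slash. Below is a minimal sketch of a helper that guards against that. object_key is a hypothetical name, and treating an empty prefix the same as a missing one is my assumption.

```ruby
# Hypothetical helper: builds the S3 key from a local path and an optional prefix.
def object_key(localfile, prefix = nil)
  basename = File.basename(localfile)
  cleaned  = prefix.to_s.sub(%r{/+\z}, '')   # drop any trailing slashes
  cleaned.empty? ? basename : "#{cleaned}/#{basename}"
end

puts object_key('/var/backups/db.dump', 'nightly/')   # nightly/db.dump
puts object_key('/var/backups/db.dump')               # db.dump
```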

The next step is to open the File and determine whether or not it’s large enough to warrant a multipart upload:

  File.open(localfile, 'rb') do |file|
    if file.size > PART_SIZE
      puts "File size over #{PART_SIZE} bytes, using multipart upload..."

Next, we use create_multipart_upload to start the upload and get back the upload_id needed to manage the rest of the process. The bucket and key options are your S3 bucket and the intended name of the file on S3 (aka key).

      input_opts = {
        bucket: bucket,
        key:    key,
      }  

      mpu_create_response = s3.create_multipart_upload(input_opts)

If all worked well there, we can then upload the parts. I like to give some level of progress as the parts are uploaded so let’s add some code to do that as well:

      total_parts = file.size.to_f / PART_SIZE
      current_part = 1

Next, we can iterate through the file parts as we upload them:

      file.each_part do |part|

        part_response = s3.upload_part({
          body:        part,
          bucket:      bucket,
          key:         key,
          part_number: current_part,
          upload_id:   mpu_create_response.upload_id,
        })  

        percent_complete = (current_part.to_f / total_parts.to_f) * 100 
        percent_complete = 100 if percent_complete > 100 
        percent_complete = sprintf('%.2f', percent_complete.to_f)
        puts "percent complete: #{percent_complete}"
        current_part = current_part + 1 

      end

This loop also computes the percentage complete and prints a progress line after each part is uploaded. That part is optional, but I’ve found my clients really appreciate it.
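
An alternative is to track bytes sent rather than part numbers, which avoids the need to cap the figure at 100. This is a minimal sketch of that calculation. progress_line is a hypothetical helper and is not in the original script.

```ruby
# Hypothetical helper: formats a progress line from byte counts.
def progress_line(bytes_sent, total_bytes)
  return 'percent complete: 100.00' if total_bytes.zero?

  percent = bytes_sent.to_f / total_bytes * 100
  format('percent complete: %.2f', percent)
end

puts progress_line(50, 200)   # prints "percent complete: 25.00"
```

Inside the upload loop you would keep a running total, adding part.bytesize after each upload_part call, and pass it in along with file.size.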

Once all of the parts are uploaded, we can finish the multipart upload with a call to complete_multipart_upload:

      input_opts = input_opts.merge(upload_id: mpu_create_response.upload_id)

      parts_resp = s3.list_parts(input_opts)

      input_opts = input_opts.merge(
        multipart_upload: {
          parts: parts_resp.parts.map { |part|
            { part_number: part.part_number, etag: part.etag }
          }
        }
      )

      mpu_complete_response = s3.complete_multipart_upload(input_opts)

Note that we called list_parts to get the part number and ETag of each part, both of which complete_multipart_upload requires. This would allow us to make a multi-threaded upload client, but that would require a different approach relative to what I have done here, and we will leave that for advanced use cases (ha!).
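
Because each upload_part response carries the ETag of the part it just stored, the list can also be built as the parts go up, which skips the list_parts call. That matters for very large files, since list_parts returns at most 1,000 parts per request. The following is a minimal sketch of that variation, not what the script above does. upload_parts is a hypothetical helper, and s3 is assumed to be the Aws::S3::Client created earlier.

```ruby
# Hypothetical helper: uploads each chunk and records the ETag from every
# upload_part response, returning the array complete_multipart_upload expects.
def upload_parts(s3, bucket, key, upload_id, chunks)
  chunks.each_with_index.map do |chunk, index|
    part_number = index + 1   # S3 part numbers start at 1
    response = s3.upload_part(
      bucket:      bucket,
      key:         key,
      upload_id:   upload_id,
      part_number: part_number,
      body:        chunk
    )
    { part_number: part_number, etag: response.etag }
  end
end
```

Inside the File.open block this could be called with file.to_enum(:each_part) as the chunks argument, and the returned array passed to complete_multipart_upload as multipart_upload: { parts: parts }.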

Finally, be sure to close the if statement and provide the put_object call for files no larger than PART_SIZE:

    else
      s3.put_object(bucket: bucket, key: key, body: file)
    end 
  end

The script can then be called with the following format:

./upload_file.rb $access_key $secret_key $bucket $upload_file $optional_prefix

That completes the multipart upload. A critical piece that was not covered here: the parts of an upload that is never completed keep consuming space that you are paying for, so you need to either abort the upload or complete it to free up this space. I will share a utility that I have to clean these up in a future article.
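
In the meantime, one way to avoid leaving orphaned parts behind is to abort the upload whenever something fails partway through. A minimal sketch, assuming s3 is the Aws::S3::Client from earlier. multipart_with_cleanup is a hypothetical wrapper, and the block is expected to upload the parts and return the list of part_number and etag hashes.

```ruby
# Hypothetical wrapper: starts a multipart upload, yields the upload_id to a
# block that uploads the parts, then completes the upload. If anything raises,
# the upload is aborted so its parts stop consuming storage.
def multipart_with_cleanup(s3, bucket, key)
  upload_id = s3.create_multipart_upload(bucket: bucket, key: key).upload_id
  begin
    parts = yield upload_id
    s3.complete_multipart_upload(
      bucket: bucket, key: key, upload_id: upload_id,
      multipart_upload: { parts: parts }
    )
  rescue StandardError
    s3.abort_multipart_upload(bucket: bucket, key: key, upload_id: upload_id)
    raise   # re-raise so the caller still sees the original error
  end
end
```

This only covers failures the script lives to see. An upload orphaned by a killed process or a lost machine still has to be found and aborted separately.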

Comments

7 responses to “Multipart uploads to s3 using aws-sdk v2 for ruby…”

Phil

June 7, 2017

Thanks! Can you share the full code (on github etc)?

Josh

June 7, 2017

Hey Phil, try this out and let me know what you think:

https://github.com/itsecureadmin/cloud-utilities/blob/master/upload_file.rb

Phil

June 7, 2017

Wow, I never expected you can reply that fast for a 2015 post! Thanks so much!! 🙂

Minh

November 9, 2017

Thanks Josh,

You just save my life

Josh

November 9, 2017

Glad I could help, Minh.

vabs

June 22, 2018

can we have something for a folder sync

Casey

July 13, 2018

Just turned this into a multi-threaded script for syncing S3 buckets using concurrent-ruby. The threading was easy, but the uploading part would have been tricky to piece together without this article. The thing just flies, and I’m glad I took the risk to not just use copy_object. Super awesome!
