Explaining Multipart Upload in AWS S3
The aim of this page 📝 is to explain the concept of Multipart Upload in Amazon S3 and how Amazon EMR uses it, based on the example of analyzing S3 bucket usage statistics.
Dec 12, 2023
- Multipart Upload is a feature in Amazon S3 that allows you to upload a single object as a set of parts.
- Each part is a contiguous portion of the object’s data.
- This feature is particularly useful when dealing with large objects or in situations where network connectivity is unstable.
- You start by initiating the multipart upload. Amazon S3 returns a response with an upload ID, which is a unique identifier for your multipart upload.
- You can upload these object parts independently and in any order.
- If transmission of any part fails, you can retransmit that part without affecting other parts.
- After you have uploaded all the parts, you complete the multipart upload.
- Upon receiving the complete multipart upload request, Amazon S3 constructs the object from the uploaded parts, and you can then access the object just as you would any other object in your bucket.
- This process provides several advantages such as improved throughput, quick recovery from network issues, the ability to pause and resume object uploads, and the ability to begin an upload before you know the final object size.
- AWS recommends considering multipart upload once an object reaches roughly 100 MB, rather than uploading it in a single operation.
- When using the EMRFS (EMR File System) S3-optimized committer, multipart uploads are always performed regardless of the file size.
- This differs from the default behavior of EMRFS, where the fs.s3n.multipart.uploads.split.size property controls the file size at which multipart uploads are triggered.
- Multipart uploads must be enabled in Amazon EMR, and they are enabled by default.
- If multipart uploads have been disabled on a cluster, you can re-enable them through the cluster configuration.
- While the AMI can influence the software and libraries available on your EMR cluster, the configuration of specific services like EMRFS and its interaction with S3 is controlled through EMR settings and job parameters.
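The initiate, upload parts, complete flow described earlier can be sketched in Python. This is a minimal illustration, not a production uploader. The helper names `plan_parts` and `multipart_upload` are hypothetical. The `s3` argument is assumed to be a boto3 S3 client (`boto3.client("s3")`), whose `create_multipart_upload`, `upload_part`, `complete_multipart_upload` and `abort_multipart_upload` methods mirror the CLI commands in the links below. The 8 MiB default part size and the three retry attempts are arbitrary choices for the example.

```python
from typing import Any, Dict, List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 requires at least 5 MiB per part, except the last one


def plan_parts(total_size: int, part_size: int) -> List[Tuple[int, int, int]]:
    """Split total_size bytes into contiguous (part_number, offset, length) ranges."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    parts = []
    offset = 0
    part_number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


def multipart_upload(s3: Any, bucket: str, key: str, data: bytes,
                     part_size: int = 8 * 1024 * 1024,
                     max_attempts: int = 3) -> Dict[str, Any]:
    """Upload data as a multipart object, retrying only the part that failed."""
    if not data:
        raise ValueError("nothing to upload")

    # Step 1: initiate the upload and keep the upload ID that S3 returns.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    completed = []
    try:
        # Step 2: upload each part; a failed part is retried on its own.
        for part_number, offset, length in plan_parts(len(data), part_size):
            body = data[offset:offset + length]
            for attempt in range(1, max_attempts + 1):
                try:
                    response = s3.upload_part(Bucket=bucket, Key=key,
                                              PartNumber=part_number,
                                              UploadId=upload_id, Body=body)
                    break
                except Exception:  # simplification: real code would catch narrower errors
                    if attempt == max_attempts:
                        raise
            completed.append({"PartNumber": part_number, "ETag": response["ETag"]})

        # Step 3: ask S3 to assemble the object from the uploaded parts.
        return s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                            UploadId=upload_id,
                                            MultipartUpload={"Parts": completed})
    except Exception:
        # Abort so the already uploaded parts do not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The sketch uploads parts sequentially to stay short. Because parts are independent, they could also be sent in parallel. In everyday use, boto3's higher-level `upload_file` method handles multipart transfers for you, so a hand-written loop like this is mainly useful for understanding the protocol.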
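To make the EMR settings more concrete, here is a hedged sketch of a cluster configuration expressed as a Python structure, in the shape that EMR accepts for its `Configurations` parameter. Only `fs.s3n.multipart.uploads.split.size` comes from the text above. The `core-site` classification, the `fs.s3n.multipart.uploads.enabled` property name and the 128 MiB split size are assumptions based on the EMR documentation as I recall it, so verify them against the docs for your EMR release. The function name `emrfs_multipart_configuration` is hypothetical.

```python
import json
from typing import Any, Dict, List


def emrfs_multipart_configuration(enabled: bool = True,
                                  split_size_bytes: int = 128 * 1024 * 1024
                                  ) -> List[Dict[str, Any]]:
    """Build an EMR configuration list for the multipart upload settings."""
    if split_size_bytes < 5 * 1024 * 1024:
        raise ValueError("split size must be at least 5 MiB")
    return [
        {
            # Assumption: these properties are set under the core-site classification.
            "Classification": "core-site",
            "Properties": {
                # Assumption: property name recalled from the EMR docs, not from this page.
                "fs.s3n.multipart.uploads.enabled": "true" if enabled else "false",
                # Named in the text above; the value is in bytes and passed as a string.
                "fs.s3n.multipart.uploads.split.size": str(split_size_bytes),
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON form, which can be saved to a file and supplied at cluster creation.
    print(json.dumps(emrfs_multipart_configuration(), indent=2))
```

The same list could be passed as the `Configurations` argument when creating a cluster with boto3, or saved as JSON and supplied to the AWS CLI. Keep in mind that the split size only matters for the default EMRFS behavior, since the S3-optimized committer always uses multipart uploads.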
LINKS
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
- https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
- https://docs.aws.amazon.com/cli/latest/reference/s3api/create-multipart-upload.html
- https://docs.aws.amazon.com/cli/latest/reference/s3api/upload-part.html
- https://docs.aws.amazon.com/cli/latest/reference/s3api/complete-multipart-upload.html