How to Streamline Your Data Archival Process using the Cloud
Data archiving is the process of moving data that is no longer essential to a separate data store for long-term retention. Archived data consists of older data that may still be of value to the organization, possibly for future reference or for regulatory and compliance work.
Why do organizations need the cloud for a streamlined data archiving process? Storing archived data in the cloud is generally cheaper than doing so on-premises, where maintenance is costly and troublesome. Cloud archival eliminates the need to buy and upgrade physical disk or tape hardware, as well as the need to purchase installed software to manage and store non-primary data. Furthermore, the cloud makes it easy to create policies that streamline the data archival process.
All organizations have these three types of data: transactional data (which describes business operations), performance data, and master data. With cloud-based archival, the time taken to search for important but old data is reduced, and searches can be saved to help predict which exact result a user wants. With data archived properly, case-by-case reviews for legal and compliance procedures can also be done quickly.
With benefits like long-term storage and the ability to apply analytics to archived data, a cloud archival solution is very helpful for organizational productivity. Some archival strategies are discussed below with regard to Amazon S3, focusing on lowering storage costs and streamlining the workflow, i.e., how to retrieve data as quickly as possible without incurring gigantic bills. The exact steps in the AWS graphical user interface will be shown afterwards.
There are three major costs associated with storing data in S3 (as of 2019, for the Amazon Web Services Asia Pacific (Singapore) region): storage, API requests, and data transfer. S3 Standard-Infrequent Access storage ($0.02 per GB) is cheaper than S3 Standard storage ($0.025 per GB). For the APIs operating on files, 10,000 read requests cost around $0.005, while 10,000 write requests cost $0.05. Transferring data from the Internet into S3 buckets is free, whereas transfers out of S3 are subject to pricing once they exceed 1GB. Check out the Amazon S3 pricing page for more details.
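As a sanity check, those figures can be dropped into a small cost model. The rates below are the 2019 Singapore-region numbers quoted above, not current pricing, and the workload sizes are invented for illustration:

```python
# Rough monthly S3 cost model using the 2019 ap-southeast-1 figures above.
# These rates are illustrative; check the Amazon S3 pricing page for current numbers.

STANDARD_PER_GB = 0.025   # S3 Standard, per GB-month
IA_PER_GB = 0.02          # S3 Standard-IA, per GB-month
READS_PER_10K = 0.005     # GET requests, per 10,000
WRITES_PER_10K = 0.05     # PUT requests, per 10,000

def monthly_cost(gb, reads, writes, per_gb=STANDARD_PER_GB):
    """Storage plus request charges for one month."""
    return gb * per_gb + reads / 10_000 * READS_PER_10K + writes / 10_000 * WRITES_PER_10K

# 100 GB archived, read 1,000 times and written 100 times in a month.
# (Standard-IA also adds a per-GB retrieval fee, covered later in this article.)
print(round(monthly_cost(100, reads=1_000, writes=100), 4))                    # Standard
print(round(monthly_cost(100, reads=1_000, writes=100, per_gb=IA_PER_GB), 4))  # Standard-IA
```

Storage dominates here: the request charges on this workload add up to a tenth of a cent, which is why the later strategies focus first on picking the right storage class.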
Strategies for Amazon S3
Ensure EC2 and S3 are in the same AWS region. Data transfer between an EC2 instance and an S3 bucket in the same region is free, so processing the data in the same region eliminates the S3-to-EC2 inter-region transfer cost. If the S3 bucket is in a different region from the EC2 instance, and each file in the bucket is downloaded on average 3 times per month, every GB incurs 3 x $0.02 = $0.06 in inter-region transfer charges each month, tripling the inter-region data migration cost compared to a single download.
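The arithmetic behind that comparison is short enough to write out; the $0.02 per GB inter-region rate is the illustrative figure used above:

```python
# Monthly inter-region transfer charge per GB when each object is pulled
# 3 times a month by an EC2 instance in another region. The $0.02/GB rate
# is an illustrative assumption; same-region S3-to-EC2 transfer is free.
INTER_REGION_PER_GB = 0.02
downloads_per_month = 3

monthly_transfer_per_gb = downloads_per_month * INTER_REGION_PER_GB
print(round(monthly_transfer_per_gb, 2))   # versus 0.0 when EC2 and S3 share a region
```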
Avoid starting with Amazon Glacier right away. Glacier is typically used by advanced software developers who understand their application's storage requirements well, along with how those requirements change over the development cycle. Developers who simply do not plan to access certain objects often may begin with the Infrequent Access storage class, which is more suitable for their needs.
S3 Standard offers high durability, availability, and performance object storage for frequently accessed data. Because it delivers low latency and high throughput, S3 Standard is appropriate for a wide variety of use cases, including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and big data analytics.
S3 Standard-IA is for data that is accessed less frequently, but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a low per GB storage price and per GB retrieval fee. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery files.
S3 Glacier is a secure, durable, and low-cost storage class for data archiving. Glacier can reliably store any amount of data at costs that are competitive with or cheaper than on-premises solutions. To keep costs low yet suitable for varying needs, S3 Glacier provides three retrieval options that range from a few minutes to hours.
When using a versioned S3 bucket, the "lifecycle rules" feature lets us delete old versions that are no longer needed. By default, S3 keeps all data forever and bills for it as long as it remains in storage. In most cases, developers want to keep older versions only for a certain time, and a lifecycle rule is the right tool for that. Separately, when uploading a vast number of large objects to S3, any interruption to the upload process can leave parts of objects that are not visible to the user, yet the user still pays for them. If an upload is still incomplete after 7 days, one should either restart the upload or abort it completely; a lifecycle rule can also abort such uploads automatically.
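A lifecycle configuration covering both cases might look like the following sketch. The bucket name, rule IDs, and the 90-day retention period are illustrative assumptions; applying the configuration requires boto3 and AWS credentials, so that call is shown but commented out:

```python
# Sketch of a lifecycle configuration for a versioned bucket. Rule IDs and
# the 90-day retention are illustrative choices, not recommendations.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # empty prefix = apply to the whole bucket
            # Delete noncurrent (superseded) versions 90 days after replacement.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        },
        {
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            # Stop paying for invisible parts of uploads abandoned for 7 days.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

# Requires boto3 and credentials; bucket name is hypothetical:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-archive-bucket", LifecycleConfiguration=lifecycle)
```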
The Infrequent Access (IA) storage class uses the same API and performs as well as regular S3 storage. IA is cheaper than standard S3 storage ($0.007 per GB per month versus $0.03 per GB per month for S3 Standard; note that pricing varies by region, and these figures differ from the Singapore-region figures quoted earlier). However, retrieval costs $0.01 per GB on IA, whereas it is free on S3 Standard.
If developers have S3 objects that are downloaded on average 20% of the time in a month, it makes more sense to keep those objects in the Infrequent Access class. It is recommended to access archived objects only when an EC2 instance fails or when data migration is needed. The monthly cost saving for 1 GB of IA-stored data equals the S3 Standard cost minus the IA storage cost minus the IA access costs. The IA class requires a minimum object size of 128KB and a minimum storage duration of 30 days. Migrating data to or from the S3 Standard class uses one API call per object, which incurs request charges on the bill.
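That saving formula can be written out directly. The rates below are the Singapore-region figures quoted earlier in this article, and the 20% monthly access rate is the scenario above:

```python
# Standard vs Standard-IA break-even check, using the ap-southeast-1 figures
# quoted in this article (illustrative 2019 rates, not current pricing).
STANDARD_PER_GB = 0.025
IA_PER_GB = 0.02
IA_RETRIEVAL_PER_GB = 0.01

def ia_monthly_saving_per_gb(access_rate):
    """Saving per GB-month from using IA; positive means IA is cheaper.

    access_rate is the expected number of full retrievals per month
    (0.2 = the object is downloaded 20% of the time).
    """
    return STANDARD_PER_GB - IA_PER_GB - access_rate * IA_RETRIEVAL_PER_GB

print(round(ia_monthly_saving_per_gb(0.2), 4))   # positive: IA wins at 20% access
print(ia_monthly_saving_per_gb(0.2) > 0)
```

With these rates, IA stays cheaper as long as the access rate is below 50% per month, since the $0.005 storage saving is consumed at $0.01 per retrieval.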
IA has several advantages over Glacier. It has a more user-friendly interface. Furthermore, recovering data stored in Glacier vaults can take very long, possibly a month for a large vault, and any increase in retrieval speed is more expensive. Recovering 1TB in an hour requires a peak transfer rate of 998 GB per hour, which costs $7186; recovering it in 2 hours costs $3592.
API calls are charged per object, regardless of the object's size: uploading 1 byte costs the same as uploading 1GB. Hence it is recommended that developers do not upload a large object in numerous small parts. If 10GB is uploaded to S3 as a single file, the API cost is negligible. Uploading the same 10GB in 5MB parts costs around $0.01, and uploading it in 10KB parts drives the cost up to around $5.00. The recommended approach, therefore, is not to upload big objects in small segments.
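The part-size arithmetic can be checked in a few lines, using the $0.05 per 10,000 write requests rate quoted earlier:

```python
# PUT-request cost of uploading the same 10 GB at different part sizes,
# at the $0.05 per 10,000 write requests rate quoted earlier (illustrative).
WRITES_PER_10K = 0.05
GB = 1024 ** 3
MB = 1024 ** 2
KB = 1024

def upload_request_cost(total_bytes, part_bytes):
    """Request charges for a multipart upload split into equal-sized parts."""
    parts = -(-total_bytes // part_bytes)   # ceiling division
    return parts / 10_000 * WRITES_PER_10K

print(round(upload_request_cost(10 * GB, 10 * GB), 6))   # one part: negligible
print(round(upload_request_cost(10 * GB, 5 * MB), 4))    # 2,048 parts: ~$0.01
print(round(upload_request_cost(10 * GB, 10 * KB), 2))   # ~1 million parts: ~$5
```

The growth is linear in the number of parts, but shrinking the part size by three orders of magnitude multiplies the request bill by the same factor.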
If developers have numerous tiny files, a database like DynamoDB or MySQL is more suitable. A database is designed for grouping small records together, and the grouped data can then be uploaded to S3 as a single object. S3 key names are not a database, so it is recommended not to over-rely on S3 LIST calls. Designing and populating a database, then uploading it to S3, is the better way to use S3 for archiving many small records.
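One lightweight way to do that grouping, sketched here with Python's standard tarfile module, is to bundle the tiny records into a single compressed archive before uploading. The file names and record contents below are invented for illustration:

```python
import io
import tarfile

def bundle(records):
    """Pack a {name: payload-bytes} mapping into one gzipped tar, returned as bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in records.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

# 1,000 tiny records become one object, so S3 bills one PUT instead of 1,000.
archive = bundle({f"log-{i}.txt": b"tiny record" for i in range(1000)})

# Round-trip check: the bundle still contains every record.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    print(len(tar.getnames()))   # 1000
```

A single upload call (for example, boto3's `put_object`) can then store `archive` as one S3 object, and per-object minimums like IA's 128KB floor are easier to clear.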