Compress files saved in Google cloud storage - google-cloud-storage

Is it possible to compress a file already saved in Google cloud storage?
The files are created and populated by Google dataflow code. Dataflow cannot write to a compressed file but my requirement is to save it in compressed format.

Writing to compressed files is not supported on the standard TextIO.Sink because reading from compressed files is less scalable -- the file can't be split across multiple workers without first being decompressed.
If you want to do this (and aren't worried about potential scalability limits), you could write a custom file-based sink that compresses the files. You can look at TextIO for examples, and at the docs on how to write a file-based sink.
The key change from TextIO would be modifying the TextWriteOperation (which extends FileWriteOperation) to support compressed files.
Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.

Another option could be to change your pipeline slightly.
Instead of having your pipeline write directly to GCS, you could write to one or more tables in BigQuery and then, when the pipeline has finished, kick off a BigQuery export job to GCS with GZIP compression set.
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression

You could write an app (perhaps on App Engine or Compute Engine) to do this. Configure notifications on the bucket so that the app is notified whenever a new object is written; the app then reads the object, compresses it, overwrites the object, and sets the Content-Encoding metadata field. Because object writes are transactional, the compressed form of the object does not become visible until the write is complete. Note that with this approach any apps or services that consume the data must be able to handle both compressed and uncompressed objects.
As an alternative, you could have Dataflow write to a temporary bucket and set up notifications on that bucket to trigger your compression program. That program would then write the compressed version to your production bucket and delete the uncompressed object.
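The compression step of such an app can be sketched with the Java standard library alone. This is a minimal sketch, not a full implementation: the class name GzipSketch is hypothetical, and reading the object from the bucket, writing it back, and setting Content-Encoding are left out because they depend on the client library you use. The streams here stand in for the object's bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: only the gzip step, no cloud storage calls.
final class GzipSketch {
    private static final int BUFFER_SIZE = 8192;

    // Streams `in` through a gzip encoder into `out`, one buffer at a time,
    // so memory use stays constant regardless of the object size.
    static void gzip(InputStream in, OutputStream out) throws IOException {
        // Closing the GZIPOutputStream writes the gzip trailer (and closes `out`).
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            copy(in, gz);
        }
    }

    // Inverse of gzip(), included only to check the round trip.
    static void gunzip(InputStream in, OutputStream out) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            copy(gz, out);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[100_000]; // stand-in for the object's bytes
        Arrays.fill(original, (byte) 'a');

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        gzip(new ByteArrayInputStream(original), compressed);

        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        gunzip(new ByteArrayInputStream(compressed.toByteArray()), restored);

        System.out.println("compressed bytes: " + compressed.size());
        System.out.println("round trip ok: " + Arrays.equals(original, restored.toByteArray()));
    }
}
```

In a real app the input would be the object's read stream and the output the write stream of the replacement object, so the whole file never has to sit in memory.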

Related

Read very large files from Google cloud storage using Java

I am trying to read a very large file (several GB) from a Google Cloud Storage bucket. I read it as a Blob, and then open an InputStream from the Blob:
Blob blob = get_from_bucket("my-file");
ReadChannel channel = blob.reader();
InputStream str = Channels.newInputStream(channel);
My question is: is the entire file moved into the Blob object in one go, or is it done in chunks? In the former case it could lead to an out-of-memory error, right?
Is there a way to read the object from the bucket the way we do with FileInputStream, so that I can read files irrespective of their size?
You can use the streaming API, but be careful: no CRC check is enforced in this transfer mode. Some bits can be corrupted, and you could end up processing data with errors.
If you process audio or video, that is not too important. If you handle big files of financial data with lots of numbers, I don't recommend this approach.
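One way to keep streaming while still catching corruption is to compute a checksum yourself as the chunks arrive. The following is a minimal sketch using only the Java standard library (Java 9 or later for CRC32C); the class name ChunkedReadSketch is hypothetical. It works on any ReadableByteChannel, which is what the question's code passes to Channels.newInputStream. Comparing the result with a checksum reported by the storage service is left out and is an assumption about what your setup exposes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Hypothetical helper: reads a channel in fixed-size chunks and checksums it.
final class ChunkedReadSketch {

    // Reads `channel` to the end using a single buffer of `chunkSize` bytes,
    // so at most one chunk is held in memory at a time.
    // Returns the CRC32C of everything that was read.
    static long crc32cOf(ReadableByteChannel channel, int chunkSize) throws IOException {
        CRC32C crc = new CRC32C();
        ByteBuffer buffer = ByteBuffer.allocate(chunkSize);
        while (channel.read(buffer) != -1) {
            buffer.flip();      // switch the buffer from filling to draining
            crc.update(buffer); // consumes the bytes read in this round
            buffer.clear();     // make the buffer ready for the next chunk
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory channel stands in for the channel opened on the object.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));
        System.out.println(Long.toHexString(crc32cOf(channel, 4)));
    }
}
```

In the loop you would also hand each chunk to your own processing; the checksum is only compared once the stream has been read to the end.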

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing a large set of text files of different sizes (1 KB to 100 MB) about different subjects. One fileset could be more than 10 GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP. Since the pre-processed data is saved separately, the original files will not be accessed frequently after pre-processing, so saving all files in MongoDB seems unnecessary.
I'm considering:
- saving all files in some cloud storage,
- saving file information such as name and path in MongoDB as JSON.
The above folders turn to:
{
name: dataset_about_some_subject,
path: path_to_cloud_storage,
files: [
{
name: file1.txt
...
},
...
]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum document size of 16 MB, which is below the maximum size of the files (100 MB). This means that, unless you split the files, storing the originals as plaintext JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure is common, not very expensive, and does not require a lot of maintenance compared to running your own HDFS cluster. I/O will be best if you perform the computations on machines of the same provider and make sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend storing large files with one of the storage services below. They also support multi-regional access through a CDN to keep file access fast.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that for the metadata storage you propose in mongodb, speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), data transfer over the public network for every access (might be a performance problem)
MongoDB GridFS: already in place, operation cost varies, data transfer is just as fast as from MongoDB itself
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer in the local network (provided you build it on-premise). Specialized administration skills needed. Possibility to use the cluster for parallel calculations (i.e. this is not only storage, it is also computing power). As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.
If you are not sure about the amount of data you will cover and just want to get started, I recommend starting out with GridFS, but encapsulating it in a way that lets you easily exchange the storage.
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
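On a single machine the same idea can be sketched without Hadoop or Spark, using a Java parallel stream in their place. This is a minimal sketch under stated assumptions: the class name MetadataFanOutSketch is hypothetical, the metadata file is assumed to hold one distinct file path per line, and counting lines stands in for the real per-file NLP work.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: fans out over a metadata file on the local cores.
final class MetadataFanOutSketch {

    // Reads the metadata file (one path per line, blank lines ignored) and
    // processes the listed files in parallel. Returns path -> line count.
    // Paths are assumed to be distinct; a duplicate would make the collector throw.
    static Map<String, Long> lineCounts(Path metadataFile) throws IOException {
        List<String> paths = Files.readAllLines(metadataFile);
        return paths.parallelStream()
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toConcurrentMap(p -> p, MetadataFanOutSketch::countLines));
    }

    // Placeholder for the real per-file work: counts the lines of one file.
    private static long countLines(String path) {
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: MetadataFanOutSketch <metadata-file>");
            return;
        }
        lineCounts(Paths.get(args[0]))
                .forEach((path, count) -> System.out.println(path + "\t" + count));
    }
}
```

The parallel stream uses the common fork-join pool, which by default is sized to the number of available cores, so this mostly helps when the per-file work is CPU-heavy.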
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have a backend that receives, stores and serves 10-20 MB JSON files. Which service should I use for superfast put and get (I cannot break the files into smaller chunks)? I don't have to run queries on these files, just get them, store them and supply them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put a file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second best. DynamoDB doesn't allow such an object size. What should I use? Also, is there any other service? MongoDB is a possible solution but I don't see it on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, you want to store and serve the files, not look into their content, which both Dynamo and ES would waste time doing.
My suggestion is to use AWS Lambda + S3 to optimize for cost.
S3 does have a small delay between putting a file and the file becoming available, though (it gets bigger, even minutes, when you have millions of objects in a bucket).
If that delay matters for your operation and the total throughput at any given moment is not too large, you can create a server (preferably EC2) that serves as a temporary file stash. It will:
- Receive your file
- Try to upload it to S3
- If the file is requested before it's available on S3, serve the file from disk
- Once the file has been uploaded to S3 successfully, serve the S3 URL and delete the file on disk
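The bookkeeping behind those steps can be sketched as follows. This is a minimal sketch, not a server: the class FileStashSketch and its method names are hypothetical, the actual upload to S3 and the HTTP layer are left out, and the caller is assumed to invoke uploadSucceeded once its own upload code has finished.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping for a temporary file stash in front of S3.
final class FileStashSketch {
    private final Map<String, Path> onDisk = new ConcurrentHashMap<>();
    private final Map<String, String> uploaded = new ConcurrentHashMap<>();

    // Step 1: the file was received and written to local disk.
    void received(String key, Path localFile) {
        onDisk.put(key, localFile);
    }

    // Step 4: the upload finished. Record the URL first and only then drop the
    // local copy, so a concurrent locate() always finds one of the two.
    void uploadSucceeded(String key, String s3Url) throws IOException {
        uploaded.put(key, s3Url);
        Path local = onDisk.remove(key);
        if (local != null) {
            Files.deleteIfExists(local);
        }
    }

    // Steps 3 and 4: prefer the S3 URL, fall back to the local copy.
    Optional<String> locate(String key) {
        String url = uploaded.get(key);
        if (url != null) {
            return Optional.of(url);
        }
        Path local = onDisk.get(key);
        return local == null ? Optional.empty() : Optional.of(local.toUri().toString());
    }

    public static void main(String[] args) throws IOException {
        FileStashSketch stash = new FileStashSketch();
        Path tmp = Files.createTempFile("stash", ".json");
        stash.received("report-1", tmp);
        System.out.println("before upload: " + stash.locate("report-1").orElse("missing"));
        // The URL below is a made-up example value.
        stash.uploadSucceeded("report-1", "https://example-bucket.s3.amazonaws.com/report-1");
        System.out.println("after upload:  " + stash.locate("report-1").orElse("missing"));
    }
}
```

The in-memory maps are lost on restart, so a real stash would need to rebuild its state from disk or keep it somewhere durable.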

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror that maintains the filesystem structure but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system to do this kind of edits on a large number of files?
You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0.
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems:
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (that uses FileSystems under the hood) and another way you can use it in your project is by just copy-pasting (the necessary parts of) its code into your project, or studying its code and reimplementing something similar. This can be an option in case you're hesitant to update your Beam SDK version.
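Inside such a ParDo, the piece that preserves the directory structure is the mapping from each input file name to its output file name. A minimal sketch of that mapping in plain Java is below; the class MirrorPathSketch, the method mirrorOf, and the bucket names are hypothetical, and the Beam calls that read and write the files are left out.

```java
// Hypothetical helper: maps an input path to the same relative path under a new root.
final class MirrorPathSketch {

    // Replaces the `inputRoot` prefix of `inputPath` with `outputRoot`,
    // keeping everything below the root (for example YYYY/MM/DD/<filename>) unchanged.
    // Both roots are expected to end with "/".
    static String mirrorOf(String inputPath, String inputRoot, String outputRoot) {
        if (!inputPath.startsWith(inputRoot)) {
            throw new IllegalArgumentException(inputPath + " is not under " + inputRoot);
        }
        return outputRoot + inputPath.substring(inputRoot.length());
    }

    public static void main(String[] args) {
        // Bucket and directory names are made-up example values.
        String out = mirrorOf(
                "gs://example-bucket/reports/2017/09/01/a.json",
                "gs://example-bucket/reports/",
                "gs://example-bucket/mirror/");
        System.out.println(out);
    }
}
```

The per-file function would then open the input, drop the unwanted JSON fields, and create the output at the mirrored name, one output file per input file.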

We are trying to persist logs in S3 using Kinesis firehose. However I would like to merge each stream of data into 1 big file. How would I do that?

Should I be using Lambda or Spark Streaming to merge each incoming streaming file into 1 big file in S3?
Thanks
Sandip
You can't really append to files in S3. You would have to read in the entire file, add the new data, and then write the file back out, either with a new name or the same name.
However, I don't think you really want to do this. Sooner or later, unless you have a trivial amount of data coming in on Firehose, your S3 file is going to be too big to keep reading, appending to, and sending back to S3 in an efficient and cost-effective manner.
I would recommend setting the Firehose limits to the longest time and largest size interval (to at least cut down on the number of files you get), and then rethinking whatever processing you had in mind that makes you think you need to constantly merge everything into a single file.
You will want to use an AWS Lambda to transfer your Kinesis Stream data to the Kinesis Firehose. From there, you can use Firehose to append the data to S3.
See the AWS Big Data Blog for a real-life example. The GitHub page provides a sample KinesisToFirehose Lambda.