I am trying to read a very large file (running to GBs) from a Google Cloud Storage bucket. I read it as a Blob, and then open an InputStream from the Blob.
"Blob blob = get_from_bucket("my-file");
ReadChannel channel = blob.reader();
InputStream str = Channels.newInputStream(channel); "
My question is: is the entire file moved into the Blob object in one go, or is it fetched in chunks? In the former case it could lead to an OutOfMemoryError, right?
Is there a way to read the object from the bucket just like we do with a FileInputStream, so that I can read files irrespective of their size?
You can use the streaming API, but be careful: no CRC check is enforced in this transfer mode. Some bits can be corrupted, and you may end up processing data with errors.
If you process audio or video, that's not too important. If you handle big files of financial data with lots of numbers, I don't recommend this approach.
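For what it's worth, here is a minimal sketch of consuming the same ReadChannel as a stream, chunk by chunk, so the whole file never has to sit in memory at once. It assumes the google-cloud-storage Java client; the bucket and object names are placeholders:

import com.google.cloud.ReadChannel;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;

public class StreamGcsObject {
    public static void main(String[] args) throws IOException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get("my-bucket", "my-file");      // placeholder names

        try (ReadChannel channel = blob.reader()) {
            channel.setChunkSize(1024 * 1024);                 // read roughly 1 MB per RPC
            InputStream in = Channels.newInputStream(channel);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                // process buffer[0..read) incrementally; only this buffer is held in memory
            }
        }
    }
}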
I have a backend that receives, stores and serves 10-20 MB JSON files. Which service should I use for super-fast put and get (I cannot break the file into smaller chunks)? I don't have to run queries on these files, just get them, store them and supply them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put a file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second-best option. DynamoDB doesn't allow such object sizes. What should I use? Also, is there any other service? MongoDB is a possible solution but I don't see it on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve the file, not to dig into its content, which both Dynamo and ES would waste time doing.
My suggestion is to use AWS Lambda + S3 to optimize for cost.
S3 does have a small delay after a PUT before the object is available, though (it gets bigger, minutes even, when you have millions of objects in a bucket).
If that delay is important for your operation and total throughput at any given moment is not too huge, you can create a server (preferably EC2) that serves as a temporary file stash (see the sketch after this list). It will:
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file from disk
If the file is successfully uploaded to S3, serve the S3 URL and delete the file on disk
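A minimal sketch of such a stash, assuming the AWS SDK for Java v1; the class, bucket name and local directory are placeholders, and the upload is shown synchronously for brevity:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class FileStash {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final String bucket = "my-json-bucket";        // placeholder bucket
    private final File stashDir = new File("/tmp/stash");  // placeholder local directory

    // Receive a file: keep a local copy for immediate reads, then upload it to S3.
    public void store(String key, File incoming) {
        File local = new File(stashDir, key);
        incoming.renameTo(local);
        s3.putObject(bucket, key, local);
    }

    // Serve a file: prefer S3 once the object is visible, otherwise fall back to disk.
    public String fetch(String key) {
        File local = new File(stashDir, key);
        if (s3.doesObjectExist(bucket, key)) {
            local.delete();                             // S3 copy is live; drop the stash copy
            return s3.getUrl(bucket, key).toString();   // serve the S3 URL
        }
        return local.getAbsolutePath();                 // serve the on-disk copy
    }
}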
Should I be using Lambda, or use Spark Streaming, to merge each incoming streaming file into one big file in S3?
Thanks
Sandip
You can't really append to files in S3; you would read in the entire file, add the new data and then write the file back out, either with a new name or the same name.
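For illustration, that read-append-write cycle looks roughly like this with the AWS SDK for Java v1 (bucket and key names are placeholders):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.util.IOUtils;
import java.io.IOException;

public class S3Append {
    public static void append(String newData) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-bucket", key = "merged/all-records.txt"; // placeholders

        // 1. Read the whole existing object into memory (this is the expensive part).
        String existing = "";
        if (s3.doesObjectExist(bucket, key)) {
            existing = IOUtils.toString(s3.getObject(bucket, key).getObjectContent());
        }

        // 2. Append the new data and write the full object back under the same key.
        s3.putObject(bucket, key, existing + newData);
    }
}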
However, I don't think you really want to do this; sooner or later, unless you have a trivial amount of data coming in on Firehose, your S3 file is going to be too big to be constantly read, appended to and sent back to S3 in an efficient and cost-effective manner.
I would recommend you set the Firehose limits to the longest time/largest size interval (to at least cut down on the number of files you get), and then re-think whatever processing you had in mind that makes you think you need to constantly merge everything into a single file.
You will want to use an AWS Lambda to transfer your Kinesis Stream data to the Kinesis Firehose. From there, you can use Firehose to append the data to S3.
See the AWS Big Data Blog for a real-life example. The GitHub page provides a sample KinesisToFirehose Lambda.
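That sample is the reference to follow; purely as a rough illustration, a Kinesis-to-Firehose Lambda in Java could look like the sketch below (the delivery stream name is a placeholder, and batching/error handling are omitted):

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;

public class KinesisToFirehose implements RequestHandler<KinesisEvent, Void> {
    private static final String DELIVERY_STREAM = "my-delivery-stream"; // placeholder
    private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord rec : event.getRecords()) {
            // Forward each Kinesis record's payload to the Firehose delivery stream;
            // Firehose buffers the records and writes them to S3 in batches.
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName(DELIVERY_STREAM)
                    .withRecord(new Record().withData(rec.getKinesis().getData())));
        }
        return null;
    }
}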
Is it possible to compress a file already saved in Google Cloud Storage?
The files are created and populated by Google Dataflow code. Dataflow cannot write to a compressed file, but my requirement is to save it in compressed format.
Writing to compressed files is not supported on the standard TextIO.Sink because reading from compressed files is less scalable -- the file can't be split across multiple workers without first being decompressed.
If you want to do this (and aren't worried about potential scalability limits) you could look at writing a custom file-based sink that compresses the files. You can look at TextIO for examples and also look at the docs on how to write a file-based sink.
The key change from TextIO would be modifying the TextWriteOperation (which extends FileWriteOperation) to support compressed files.
Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.
Another option could be to change your pipeline slightly.
Instead of your pipeline writing directly to GCS, you could write to a table(s) in BigQuery, and then when your pipeline is finished simply kick off a BigQuery export job to GCS with GZIP compression set.
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression
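A rough sketch of kicking off such an export with the google-cloud-bigquery Java client (the dataset, table and destination URI are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportCompressed {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        ExtractJobConfiguration config =
                ExtractJobConfiguration.newBuilder(
                                TableId.of("my_dataset", "my_table"),       // placeholder table
                                "gs://my-bucket/output/shard-*.csv.gz")     // placeholder URI
                        .setFormat("CSV")
                        .setCompression("GZIP")  // ask BigQuery to gzip the exported shards
                        .build();

        Job job = bigquery.create(JobInfo.of(config));
        job.waitFor();  // block until the export job finishes
    }
}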
You could write an app (perhaps using App Engine or Compute Engine) to do this. You would configure notifications on the bucket so your app is notified when a new object is written; the app then runs, reads the object, compresses it, overwrites the object, and sets the Content-Encoding metadata field. Because object writes are transactional, the compressed form of your object wouldn't become visible until it's complete. Note that if you do this, any apps/services that consume the data would need to be able to handle either compressed or uncompressed formats.

As an alternative, you could change your Dataflow setup so it outputs to a temporary bucket, and set up notifications for that bucket to cause your compression program to run -- that program would then write the compressed version to your production bucket and delete the uncompressed object.
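A minimal sketch of the compress-and-overwrite step using the google-cloud-storage Java client (bucket and object names are placeholders; the notification wiring is not shown, and the object is gzipped in memory for simplicity):

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressInPlace {
    public static void main(String[] args) throws IOException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobId id = BlobId.of("my-bucket", "output/part-00000.txt");  // placeholders

        // Read the uncompressed object (fine for modest sizes; stream for very large files).
        byte[] raw = storage.readAllBytes(id);

        // Gzip it in memory.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(raw);
        }

        // Overwrite the object and record the encoding so consumers know it is gzipped.
        BlobInfo info = BlobInfo.newBuilder(id).setContentEncoding("gzip").build();
        storage.create(info, compressed.toByteArray());
    }
}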
I am storing large files using Google Cloud Storage. Sometimes I want to retrieve the full file, but often I want to retrieve just a specific range of bytes from the file.
If I know the byte offset and length that I need to retrieve, is there any way to just retrieve those bytes rather than the full file? My motivation is to reduce time and bandwidth required to load the data.
This is a feature that is offered by Amazon's S3 and that I have been using for a while. I am hoping that the same feature is offered by Google so that I can migrate from S3 to Google Cloud Storage.
Regards,
Oscar
You can specify a Range header in the GET request, such as:
Range: bytes=123-456
This works for both the XML API (https://cloud.google.com/storage/docs/xml-api-overview) and the JSON API (https://cloud.google.com/storage/docs/json_api/v1/).
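For example, with the XML API a ranged read is just an HTTP GET with a Range header. A minimal Java sketch follows, assuming a publicly readable object (a private object would also need an Authorization header); the bucket and object names are placeholders. If you use the Java client library instead, ReadChannel.seek() gives you the same effect.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeGet {
    public static void main(String[] args) throws Exception {
        // Placeholder bucket/object on the XML API endpoint.
        URL url = new URL("https://storage.googleapis.com/my-bucket/my-object");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", "bytes=123-456");  // inclusive byte range

        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = in.readAllBytes();  // only the requested 334 bytes are downloaded
            System.out.println("Got " + chunk.length + " bytes, HTTP " + conn.getResponseCode());
        }
    }
}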
Is it good to save an image in a database with type BLOB?
Or should I save only the path and copy the image to a specific directory?
Which way is best (I mean good performance for the database and the application), and why?
What are your requirements?
In the vast majority of cases saving the path will be better, simply because of the sheer size of the files compared to the rest of the data (including the images can bulge the DB by GBs). Consider adding an indirection, e.g. save the path as a name and a reference to a storage resource (e.g. a storage_id referencing a row in a storages table) with the base path attached to the 'storage'. This way you can easily move files (copy all the files, then update the storage path, rather than update 1MM individual paths).
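A tiny sketch of that indirection, shown via JDBC against an in-memory database; the table and column names, and the H2 connection string, are only illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ImageSchemaDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");  // placeholder DB
             Statement st = conn.createStatement()) {

            // One row per storage location; moving files means updating a single base_path.
            st.execute("CREATE TABLE storages (" +
                       "  storage_id INT PRIMARY KEY," +
                       "  base_path  VARCHAR(255) NOT NULL)");

            // Each image references a storage and keeps only its relative file name.
            st.execute("CREATE TABLE images (" +
                       "  image_id   INT PRIMARY KEY," +
                       "  storage_id INT NOT NULL REFERENCES storages(storage_id)," +
                       "  file_name  VARCHAR(255) NOT NULL)");

            st.execute("INSERT INTO storages VALUES (1, '/mnt/images/vol1')");
            st.execute("INSERT INTO images   VALUES (42, 1, 'cat.jpg')");

            // After copying the files to a new volume, one UPDATE re-points every image.
            st.execute("UPDATE storages SET base_path = '/mnt/images/vol2' WHERE storage_id = 1");
        }
    }
}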
However, if your requirements include consistent backup/restore and/or disaster recoverability, it is often better to store images in the DB. It's not easier, nor more convenient, but it is simply going to be required. Each DB has its own way of dealing with this problem, e.g. in SQL Server you would use a FILESTREAM type, which allows remote access via the file access API. See FILESTREAM MVC: Download and Upload images from SQL Server for an example.
Also, a somewhat dated but nonetheless interesting paper on the topic: To BLOB or Not to BLOB.