How to automatically edit over 100k files on GCS using Dataflow? - google-cloud-storage

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror that maintains the filesystem structure but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system for making this kind of edit to a large number of files?

You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0.
1) Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern.
2) Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems (a rough sketch of such a ParDo follows this list):
- Use the FileSystems.open() API to read the input file, then standard Java utilities for working with ReadableByteChannel.
- Use the FileSystems.create() API to write the output file.
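As an illustration, here is a rough sketch of that ParDo, written against the Beam Java SDK from Scala (to match the other snippets in this thread). The bucket names, the reports/ prefix, and the removeFields helper are assumptions for illustration only; in released Beam versions the matching step is FileIO.match().

import java.nio.channels.Channels
import java.nio.charset.StandardCharsets

import org.apache.beam.sdk.io.FileSystems
import org.apache.beam.sdk.io.fs.MatchResult.Metadata
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

class MirrorFileFn extends DoFn[Metadata, String] {
  @ProcessElement
  def process(c: DoFn[Metadata, String]#ProcessContext): Unit = {
    val inputResource = c.element().resourceId()

    // Read the whole input file through the FileSystems API.
    val in = Channels.newInputStream(FileSystems.open(inputResource))
    val json = scala.io.Source.fromInputStream(in, "UTF-8").mkString
    in.close()

    // Placeholder: strip the unwanted fields from the JSON content here.
    val cleaned = removeFields(json)

    // Mirror the reports/YYYY/MM/DD/<filename> structure under another bucket.
    val outputPath = inputResource.toString
      .replace("gs://source-bucket/reports/", "gs://mirror-bucket/reports/")
    val outputResource = FileSystems.matchNewResource(outputPath, /* isDirectory = */ false)

    val out = Channels.newOutputStream(FileSystems.create(outputResource, "application/json"))
    out.write(cleaned.getBytes(StandardCharsets.UTF_8))
    out.close()

    c.output(outputPath)
  }

  // Placeholder helper; the real implementation removes the fields you don't want.
  private def removeFields(json: String): String = json
}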
Note that Match is a very simple PTransform (it uses FileSystems under the hood); another option is to copy-paste the necessary parts of its code into your project, or to study its code and reimplement something similar. This can be an option if you're hesitant to update your Beam SDK version.

Related

Merging files within the Azure Data Lake container

Consider the scenario below:
I want my data to flow like the following:
import container ---> databricks (transform) ---> export container
Current situation after I am done with the transformation process:
container:
---import
--folder
--mydata.csv
---export
--folder
--part-1-transformed-mydata.csv
--part-2-transformed-mydata.csv
--part-3-transformed-mydata.csv
--initial.txt
--success.txt
--finish.txt
I want the structure below:
---import
--folder
--mydata.csv
---export
--folder
--transformed-mydata.csv
What should be the preferred way to do this (considering the data is a few GB, <10) within Databricks? I am also happy to use any functionality in Data Factory, as I am using this Databricks notebook as a step in a pipeline.
Note: I am using Apache Spark 3.0.0, Scala 2.12 in Databricks with 14 GB memory and 4 cores. The cluster type is Standard.
You will either need to repartition the data into a single partition (note that this defeats the point of using a distributed computing platform; a sketch of this single-partition approach follows at the end of this answer),
or, after the files are generated, simply run a command to concatenate them all into a single file. This can be problematic if each file has a header; you will need to account for that in your logic.
It might be better to think of the export folder as the "file", if that makes sense. That doesn't solve your problem, but unless you need to produce a single file for some reason, most consumers won't have an issue reading the data from a directory.
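For the single-file route, a minimal sketch in a Databricks Scala notebook might look like the following. The DataFrame name df, the /mnt/export mount point, and the temporary directory are assumptions, and dbutils is only available inside Databricks.

// Write everything into one partition; this serialises the write and gives up
// Spark's parallelism, which is acceptable only because the data is small (<10 GB).
val tmpDir = "/mnt/export/folder/_tmp_transformed"
df.coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv(tmpDir)

// Spark still names the single output part-*.csv, so rename it and clean up.
val partFile = dbutils.fs.ls(tmpDir)
  .map(_.path)
  .filter(_.endsWith(".csv"))
  .head
dbutils.fs.mv(partFile, "/mnt/export/folder/transformed-mydata.csv")
dbutils.fs.rm(tmpDir, true)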

How to read large files from HTTP response in Apache Beam?

Apache Beam's TextIO can be used to read JSON files in some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?
I don't think there's a generic built-in solution in Beam to do this at the moment; see the list of supported IOs.
I can think of multiple approaches to this, whichever works for you may depend on your requirements:
I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process) and then use Beam's TextIO to read from the GCS bucket;
depending on the properties of the HTTP source, you can consider:
writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately; further transforms would then parse the JSON or do other processing (a sketch follows below);
implementing your own source, which will be more complicated but will probably work better for very large (unbounded) responses.
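To illustrate the single-step ParDo option, here is a sketch using the Beam Java SDK from Scala. It assumes the pipeline starts from a PCollection[String] of request URLs (e.g. built with Create.of(...)), that each response fits into one worker's memory, and that splitIntoRecords is a placeholder for whatever splitting logic suits your payload.

import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

class FetchAndSplitFn extends DoFn[String, String] {
  @ProcessElement
  def process(c: DoFn[String, String]#ProcessContext): Unit = {
    // Read the whole HTTP response into memory in one step.
    val body = scala.io.Source.fromURL(c.element(), "UTF-8").mkString

    // Split the large JSON payload into individual records; downstream
    // transforms would then parse each record.
    splitIntoRecords(body).foreach(record => c.output(record))
  }

  // Placeholder splitter; the real logic depends on the shape of the JSON.
  private def splitIntoRecords(body: String): Seq[String] = Seq(body)
}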

How to show data from the RDF archive in Scala Flink

I am looking for an approach to load and print data from the .n3 files of a .tar.gz archive in Scala. Or should I extract it first?
If you want to download the file, it is located at
http://wiki.knoesis.org/index.php/LinkedSensorData
Could anyone describe how I can print the data from this archive on the screen using Scala?
The files that you are dealing with are large. I therefore suggest you import them into an RDF store of some sort rather than try to parse them yourself. You can use GraphDB, Blazegraph, Virtuoso, and the list goes on. A search for RDF stores should give many other options. You can then use SPARQL to query the RDF store (SPARQL is like SQL for relational databases).
For finding a Scala library that can access RDF data you can see this related SO question, though it does not look promising. I would suggest you look at Apache Jena, a Java library.
You may also want to look at the DBPedia Extraction Framework where they extract data from Wikipedia and store it as RDF data using Scala. It is certainly not exactly what you are trying to do, but it could give you insight into the tools they used for generating RDF from Scala and related issues.
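If you do want to print the .n3 data directly from the archive without extracting it, a minimal sketch using Apache Jena plus Apache Commons Compress (both Java libraries, used here from Scala) could look like the following; the archive file name is an assumption.

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.riot.{Lang, RDFDataMgr}

object PrintSensorData {
  def main(args: Array[String]): Unit = {
    // Stream the .tar.gz directly; each tar entry is exposed as a bounded InputStream.
    val tar = new TarArchiveInputStream(
      new GZIPInputStream(new BufferedInputStream(new FileInputStream("LinkedSensorData.tar.gz"))))

    Iterator.continually(tar.getNextTarEntry).takeWhile(_ != null)
      .filter(e => !e.isDirectory && e.getName.endsWith(".n3"))
      .foreach { entry =>
        println(s"=== ${entry.getName} ===")
        // Parse the current entry with Jena and dump its triples to the console.
        // These files are large, so in practice you would load them into an RDF
        // store instead of printing everything.
        val model = ModelFactory.createDefaultModel()
        RDFDataMgr.read(model, tar, Lang.N3)
        model.write(System.out, "N-TRIPLES")
      }
    tar.close()
  }
}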

Use of Spark to optimize S3 to S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below using Scala.
Scenario: Copy multiple files from one S3 bucket folder to another S3 bucket folder.
Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source locations.
- Iterate through the list, pass the source and target S3 locations from step 1, and use the S3 API copyObject to copy each of these files to the target location (configured).
This works.
However, I am trying to understand: if I have a large number of files inside multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelize the copy of the files?
The approach that I am thinking is:
1) Use the S3 SDK to get the source paths, similar to what's explained above.
2) Create an RDD of the files using sc.parallelize() - something along these lines?
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
3) Can I use sc.wholeTextFiles in some way to make this work?
I am not sure how to achieve this as of now.
Can you please help me understand whether I am thinking in the right direction and whether this approach is correct?
Thanks
I don't think AWS has made this complicated, though.
We had the same problem; we transferred around 2 TB in close to 10 minutes.
If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself.
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
AWS CLI Command Example:
aws s3 sync s3://sourcebucket s3://destinationbucket
If you want to do it programmatically, you can use any of the SDKs to invoke the same type of command. I would avoid reinventing the wheel.
Hope it helps.
I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you'd drop down to the AWS library for that operation.
But you may not need to push the work out to many machines: each of the PUT/x-copy-source calls may be slow, but it doesn't use any of your bandwidth. You could just start a process with many threads and a large HTTP client pool and run them all in that process. Take the list, sort the largest few first, and then shuffle the rest at random to reduce throttling effects. Print out counters to help profile...
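If you do want the Spark route the question describes, a rough sketch (assuming the AWS SDK for Java v1, placeholder bucket names, and a key list built with the S3 SDK as in step 1) might look like this. Since copyObject is a server-side copy, the executors only spend time on API calls, which is why, as noted above, a single well-threaded process can perform just as well.

import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Keys gathered in step 1 with the S3 SDK; listSourceKeys() is a placeholder.
def listSourceKeys(): Seq[String] = Seq.empty
val keys: Seq[String] = listSourceKeys()
val srcBucket = "source-bucket"
val dstBucket = "destination-bucket"

sc.parallelize(keys, 64)
  .foreachPartition { partition =>
    // Build the client on the executor: S3 clients are not serializable.
    val s3 = AmazonS3ClientBuilder.defaultClient()
    // Server-side copy; no data flows through the Spark workers.
    partition.foreach(key => s3.copyObject(srcBucket, key, dstBucket, key))
  }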

Compress files saved in Google Cloud Storage

Is it possible to compress a file already saved in Google Cloud Storage?
The files are created and populated by Google Dataflow code. Dataflow cannot write to a compressed file, but my requirement is to save it in compressed format.
Writing to compressed files is not supported on the standard TextIO.Sink because reading from compressed files is less scalable -- the file can't be split across multiple workers without first being decompressed.
If you want to do this (and aren't worried about potential scalability limits), you could write a custom file-based sink that compresses the files. You can look at TextIO for an example, and also at the docs on how to write a file-based sink.
The key change from TextIO would be modifying the TextWriteOperation (which extends FileWriteOperation) to support compressed files.
Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.
Another option could be to change your pipeline slightly.
Instead of your pipeline writing directly to GCS, you could write to a table(s) in BigQuery, and then when your pipeline is finished simply kick off a BigQuery export job to GCS with GZIP compression set.
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression
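For illustration, kicking off such an export with the google-cloud-bigquery client library (from Scala here, to match the other snippets) could look roughly like this; the dataset, table, and destination URI are placeholders.

import com.google.cloud.bigquery.{BigQueryOptions, ExtractJobConfiguration, JobInfo, TableId}

val bigquery = BigQueryOptions.getDefaultInstance.getService

// Export the table the pipeline wrote, as gzipped newline-delimited JSON.
val extractConfig = ExtractJobConfiguration
  .newBuilder(TableId.of("mydataset", "mytable"), "gs://my-bucket/export/records-*.json.gz")
  .setFormat("NEWLINE_DELIMITED_JSON")
  .setCompression("GZIP")
  .build()

// Start the export job and block until it completes.
val job = bigquery.create(JobInfo.of(extractConfig))
job.waitFor()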
You could write an app (perhaps using App Engine or Compute Engine) to do this. You would configure notifications on the bucket so your app is notified when a new object is written; it then reads the object, compresses it, overwrites the object, and sets the Content-Encoding metadata field. Because object writes are transactional, the compressed form of your object wouldn't become visible until it's complete. Note that if you do this, any apps/services that consume the data would need to be able to handle either compressed or uncompressed formats. As an alternative, you could change your Dataflow setup so it outputs to a temporary bucket, and set up notifications for that bucket to cause your compression program to run -- that program would then write the compressed version to your production bucket and delete the uncompressed object.
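As a minimal sketch of the read-compress-overwrite step, using the Cloud Storage client library for Java from Scala; the bucket and object names are placeholders, and the notification plumbing (e.g. Pub/Sub notifications on the bucket) is omitted.

import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

import com.google.cloud.storage.{BlobInfo, StorageOptions}

object CompressObject {
  def compress(bucket: String, objectName: String): Unit = {
    val storage = StorageOptions.getDefaultInstance.getService

    // Read the uncompressed object and gzip it in memory.
    val original = storage.readAllBytes(bucket, objectName)
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    gzip.write(original)
    gzip.close()

    // Overwrite the object with the gzipped bytes and record the encoding so
    // readers that send Accept-Encoding: gzip get transparent decompression.
    val compressedInfo = BlobInfo.newBuilder(bucket, objectName)
      .setContentEncoding("gzip")
      .setContentType("application/json")
      .build()
    storage.create(compressedInfo, buffer.toByteArray)
  }
}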