How to read large files from HTTP response in Apache Beam?

Apache Beam's TextIO can be used to read JSON files from some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?

I don't think there's a generic built-in solution in Beam for this at the moment; see the list of supported IOs.
I can think of multiple approaches; which one works for you will depend on your requirements:
I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process), and then use Beam's TextIO to read from the GCS bucket;
depending on the properties of the HTTP source, you can also consider:
writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately; further transforms can then parse the JSON or do other work (a minimal sketch follows below);
implementing your own source, which will be more complicated but will probably work better for very large (unbounded) responses.
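To make the ParDo option concrete, here is a minimal sketch in the Java SDK. It assumes the endpoint returns newline-delimited JSON so the response can be split on line boundaries; the URL and class names are placeholders, not a real API.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class HttpToPCollection {
      // Fetches one HTTP response and emits its lines as individual elements.
      static class FetchAndSplitFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String url, OutputReceiver<String> out) throws Exception {
          HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
          try (BufferedReader reader = new BufferedReader(
              new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
              out.output(line); // each element can be parsed as JSON by a later transform
            }
          } finally {
            conn.disconnect();
          }
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> records =
            p.apply(Create.of("https://example.com/big-response.json")) // placeholder URL
             .apply(ParDo.of(new FetchAndSplitFn()));
        p.run().waitUntilFinish();
      }
    }

Note that the whole response is still fetched by a single worker; this approach only parallelizes whatever happens after the split.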

Related

Beam I/O connector (python) for SOAP and/or REST

We are building a data warehouse which combines data from a lot (n) of sources. These sources are made available to us in various ways: CSV files, direct database access, SOAP, and REST.
CSV files and direct access are covered extensively in the Apache Beam documentation, but there seems to be little (if any) coverage of SOAP and REST. We have no problem fetching the REST and SOAP data first, using it to instantiate a PCollection, and letting the pipeline handle it from there. However, a pattern is emerging that we think calls for a different approach:
Query an endpoint to get a list of other endpoints that serve the actual data.
Iterate over said list to retrieve the data
In some cases, the total amount of records retrieved like this will be in the tens of millions (possibly more in the future).
The question: How can we retrieve this data efficiently using Beam, making use of parallel processing? Do we need to write a custom Python I/O connector, and if so, where can we find an example for SOAP or REST (or any HTTP request, really)? We searched extensively, but all we found was a single link, and even that was stale.
Or alternatively: the lack of documentation and of questions about this subject makes us think that Beam is in fact not the correct tool for this particular job. Is that correct?
Not sure if it's entirely relevant: we use Google Dataflow as a runner and BigQuery for storage (for now).
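For illustration, here is a minimal sketch of the fan-out pattern the question describes: one call lists the data endpoints, and a ParDo then fetches them in parallel. It uses the Java SDK for consistency with the rest of this page (the same shape works in the Python SDK); the listing call, URLs, and helper names are placeholders, not a real API.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    public class RestFanOut {
      // Step 1: call the listing endpoint once (outside the pipeline, or in a
      // one-element ParDo) and collect the data endpoints.
      static List<String> listDataEndpoints() {
        // Placeholder: in reality this would call the SOAP/REST listing service.
        return Arrays.asList(
            "https://api.example.com/data/1",
            "https://api.example.com/data/2");
      }

      // Step 2: fetch each endpoint and emit one element per record.
      static class FetchEndpointFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String endpoint, OutputReceiver<String> out) {
          // Placeholder: issue the HTTP/SOAP request here.
          out.output("record fetched from " + endpoint);
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> records =
            p.apply(Create.of(listDataEndpoints()))   // one element per data endpoint
             .apply(Reshuffle.viaRandomKey())         // break fusion so fetches spread across workers
             .apply(ParDo.of(new FetchEndpointFn()));
        p.run().waitUntilFinish();
      }
    }

The Reshuffle.viaRandomKey() step is there to break fusion, so the per-endpoint fetches can be distributed across workers instead of all running on the worker that produced the URL list.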

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean, several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right, (2) it requires some hacky synchronization since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?
TextIO should be doing the right thing from Beam's perspective, which is reading in the text file as fast as possible and emitting elements to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. Hopefully this isn't too much explanation: the DirectRunner is a test runner for small jobs, so it buffers intermediate steps in memory rather than on disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or the Google Cloud Dataflow runner, both of which will write intermediate stages to disk when needed.
In general, splitting CSV files with quoted newlines is hard, as it may require arbitrary look-back to determine whether a given newline is or is not inside a quoted segment. If you can arrange things such that the CSV has no quoted newlines, TextIO.read() works well. Otherwise:
If you're using BeamPython, consider the dataframe operation apache_beam.dataframe.io.read_csv which will handle quotation correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice. With the exception of a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
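A minimal sketch of that second approach in the Java SDK, assuming commons-csv as the CSV library of choice; the filepattern is a placeholder:

    import java.io.Reader;
    import java.nio.channels.Channels;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class LargeCsvRead {
      static class ReadCsvRowsFn extends DoFn<FileIO.ReadableFile, String> {
        @ProcessElement
        public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
            throws Exception {
          // Stream the file through commons-csv: quoted newlines are handled by the
          // parser and rows are emitted one at a time instead of loading the whole file.
          try (Reader reader = Channels.newReader(file.open(), "UTF-8");
               CSVParser parser = CSVFormat.DEFAULT.parse(reader)) {
            for (CSVRecord record : parser) {
              out.output(String.join(",", record));
            }
          }
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(FileIO.match().filepattern("gs://my-bucket/input/*.csv")) // placeholder pattern
         .apply(FileIO.readMatches())
         .apply(ParDo.of(new ReadCsvRowsFn()));
        p.run().waitUntilFinish();
      }
    }

As noted above, each file is still read by a single worker, so parallelism comes from having many files rather than from splitting within a file.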
You can use the logic in Text to Cloud Spanner for handling new lines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific files containing the logic to read CSV with newlines are in ReadFileShardFn and SplitIntoRangesFn.

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror maintaining the filesystem structure, but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system to do this kind of edits on a large number of files?
You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems:
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use the FileSystems.create() API to write the output file (a minimal sketch of such a ParDo follows at the end of this answer).
Note that Match is a very simple PTransform (that uses FileSystems under the hood) and another way you can use it in your project is by just copy-pasting (the necessary parts of) its code into your project, or studying its code and reimplementing something similar. This can be an option in case you're hesitant to update your Beam SDK version.
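A minimal sketch of the ParDo described in the steps above, written against the current FileIO.match() API (the successor of Match.filepatterns()); the bucket prefixes, MIME type, and the removeFields helper are placeholders:

    import java.io.BufferedReader;
    import java.nio.ByteBuffer;
    import java.nio.channels.Channels;
    import java.nio.channels.WritableByteChannel;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class MirrorJsonFiles {
      static final String INPUT_PREFIX = "gs://my-bucket/reports/";   // placeholders
      static final String OUTPUT_PREFIX = "gs://my-mirror/reports/";

      static class RewriteFileFn extends DoFn<MatchResult.Metadata, Void> {
        @ProcessElement
        public void processElement(@Element MatchResult.Metadata metadata) throws Exception {
          ResourceId in = metadata.resourceId();
          // Keep the same relative path under a different prefix (toString() used for brevity).
          String outSpec = in.toString().replace(INPUT_PREFIX, OUTPUT_PREFIX);
          ResourceId out = FileSystems.matchNewResource(outSpec, /* isDirectory= */ false);

          try (BufferedReader reader = new BufferedReader(
                   Channels.newReader(FileSystems.open(in), "UTF-8"));
               WritableByteChannel writer = FileSystems.create(out, "application/json")) {
            String line;
            while ((line = reader.readLine()) != null) {
              String edited = removeFields(line); // your field-stripping logic goes here
              writer.write(ByteBuffer.wrap((edited + "\n").getBytes(StandardCharsets.UTF_8)));
            }
          }
        }

        private static String removeFields(String json) {
          return json; // placeholder
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(FileIO.match().filepattern(INPUT_PREFIX + "*/*/*/*")) // reports/YYYY/MM/DD/<filename>
         .apply(ParDo.of(new RewriteFileFn()));
        p.run().waitUntilFinish();
      }
    }

Because each element is one file, the runner parallelizes across files while every output file keeps the reports/YYYY/MM/DD/<filename> path of its input.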

We are trying to persist logs in S3 using Kinesis Firehose. However, I would like to merge each stream of data into one big file. How would I do that?

Should I use Lambda or Spark Streaming to merge each incoming streaming file into one big file in S3?
You can't really append to files in S3: you would read in the entire file, add the new data, and then write the file back out, either with a new name or the same name.
However, I don't think you really want to do this. Sooner or later, unless you have a trivial amount of data coming in on Firehose, your S3 file is going to be too big to be constantly read, appended to, and sent back to S3 in an efficient and cost-effective manner.
I would recommend you set the Firehose limits to the longest time / largest size interval (to at least cut down on the number of files you get), and then rethink whatever processing you had in mind that makes you think you need to constantly merge everything into a single file.
You will want to use an AWS Lambda to transfer your Kinesis Stream data to the Kinesis Firehose. From there, you can use Firehose to append the data to S3.
See the AWS Big Data Blog for a real-life example. The GitHub page provides a sample KinesisToFirehose Lambda.
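A minimal sketch of such a Lambda handler, assuming the Java runtime, the aws-lambda-java-events library, and the AWS SDK v2 Firehose client; the delivery stream name is a placeholder:

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.services.firehose.FirehoseClient;
    import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
    import software.amazon.awssdk.services.firehose.model.Record;

    // Forwards each Kinesis Stream record to a Firehose delivery stream; Firehose
    // then buffers and writes the batched output to S3.
    public class KinesisToFirehoseHandler implements RequestHandler<KinesisEvent, Void> {
      private static final String DELIVERY_STREAM = "my-logs-to-s3"; // placeholder

      private final FirehoseClient firehose = FirehoseClient.create();

      @Override
      public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
          firehose.putRecord(PutRecordRequest.builder()
              .deliveryStreamName(DELIVERY_STREAM)
              .record(Record.builder()
                  .data(SdkBytes.fromByteBuffer(record.getKinesis().getData()))
                  .build())
              .build());
        }
        return null;
      }
    }

In practice you would batch the records with putRecordBatch rather than call putRecord once per record, but the shape is the same: Firehose buffers what it receives and flushes to S3 based on the size/time limits mentioned above.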

Is Cassandra good for storing files?

I'm developing a PHP platform that will make heavy use of images, documents, and any other file format that comes to mind, so I was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and uses auto-replication among nodes.
Thanks for the help.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities: any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and make a file correspond to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
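A minimal sketch of that manual chunking, with the Cassandra write itself left as a placeholder since it depends on the client you use; the chunk size and file path are arbitrary:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    // One row per file, one column per chunk, as the wiki suggests.
    public class FileChunker {
      static final int CHUNK_SIZE = 1 << 20; // 1 MB chunks; pick whatever size you're comfortable with

      public static void chunkAndStore(Path file) throws IOException {
        String rowKey = file.toString();
        byte[] buffer = new byte[CHUNK_SIZE];
        int chunkIndex = 0;
        try (InputStream in = Files.newInputStream(file)) {
          int read;
          while ((read = in.read(buffer)) > 0) {
            byte[] chunk = Arrays.copyOf(buffer, read);
            storeChunk(rowKey, chunkIndex++, chunk);
          }
        }
      }

      // Placeholder: write (rowKey, chunkIndex) -> chunk with your Cassandra client,
      // e.g. an INSERT into a table keyed by (file_path, chunk_index).
      static void storeChunk(String rowKey, int chunkIndex, byte[] chunk) {
        System.out.printf("would store %s chunk %d (%d bytes)%n", rowKey, chunkIndex, chunk.length);
      }

      public static void main(String[] args) throws IOException {
        chunkAndStore(Paths.get("example.pdf")); // placeholder path
      }
    }

The layout follows the wiki's suggestion: the file path acts as the row key and each chunk becomes a column value (or, in CQL terms, a row keyed by file path and chunk index).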
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra, if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way; this isn't an ad.)
As more recent information, Netflix provides utilities for their Cassandra client, astyanax, for storing files as chunked objects. A description and examples can be found here. It can be a good starting point to write some tests using astyanax and evaluate Cassandra as a file store.