How to read large CSV with Beam? - apache-beam

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean, several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right, (2) it requires some hacky synchronizing since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?

TextIO should be doing the right thing from Beam's perspective, which is reading the text file as fast as possible and emitting events to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. The DirectRunner is a test runner for small jobs, so it buffers intermediate steps in memory rather than writing them to disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or the Google Cloud Dataflow runner, both of which will write intermediate stages to disk when needed.
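For what it's worth, switching runners is just a matter of pipeline options. A minimal sketch in the Python SDK, assuming hypothetical project, region, and bucket names:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All of these values are placeholders; substitute your own project and bucket.
options = PipelineOptions(
    runner="DataflowRunner",   # or "FlinkRunner" to target a Flink cluster
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Lengths" >> beam.Map(len)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/lengths"))

The same pipeline code runs unchanged on the DirectRunner for small local tests.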

In general, splitting CSV files with quoted newlines is hard, as it may require arbitrary look-back to determine whether a given newline is or is not inside a quoted segment. If you can arrange for the CSV to have no quoted newlines, TextIO.read() works well. Otherwise:
If you're using Beam Python, consider the dataframe operation apache_beam.dataframe.io.read_csv, which handles quoting correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice. Except on a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
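To make both options concrete, here is a rough Beam Python sketch; the file pattern is a placeholder, and in practice you would keep only one of the two branches:

import csv
import io

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.io import fileio

class EmitCsvRows(beam.DoFn):
    # Stream rows out of one matched file without loading it whole.
    def process(self, readable_file):
        # ReadMatches yields ReadableFile objects; open() returns a byte stream.
        with readable_file.open() as raw:
            for row in csv.reader(io.TextIOWrapper(raw, encoding="utf-8")):
                yield row  # quoted newlines are handled by the csv module

with beam.Pipeline() as p:
    # Option 1: the dataframe API, which handles quoting and splitting for you.
    df = p | "ReadCsv" >> read_csv("gs://my-bucket/input/*.csv")  # placeholder path

    # Option 2: match files, then parse each one incrementally in a DoFn.
    rows = (
        p
        | "Match" >> fileio.MatchFiles("gs://my-bucket/input/*.csv")
        | "ReadMatches" >> fileio.ReadMatches()
        | "ParseRows" >> beam.ParDo(EmitCsvRows())
    )

With the second option, each file is read by a single worker, so parallelism comes from having many files rather than from splitting within a file.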

You can use the logic from the Text to Cloud Spanner template for handling newlines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific files containing the logic to read CSV with newlines are in ReadFileShardFn and SplitIntoRangesFn.

Related

Is there an optimal way for writing lots of tiny files with PySpark?

I have a job that requires having to write a single JSON file to s3 for each row in a Spark dataframe (which then gets picked up by another process).
from pyspark.sql.functions import col

df.repartition(col("id")).write.mode("overwrite").partitionBy("id").json(
    "s3://bucket/path/to/file"
)
These datasets often consist of 100k rows (sometimes 1M+) and take a very long time to write. I understand that large numbers of small files are not great for read performance, but is this also the case for writes? Or is there something that can be done with partitioning to speed things up?
Please don't do this, you will only suffer pain. S3 was designed to be cheap long-term storage optimized for large files. It was designed so that the 'prefix' (directory path) leads to a bucket that provides files. If you want to optimize reads and writes, you want several buckets to write to at the same time. This means you want to modify the directory path (prefix) so that the part with the most variation comes first, increasing the number of buckets that you write to.
Example of multiple files being written to the same bucket:
S3:/mydrive/mystuff/2020-12-31
S3:/mydrive/mystuff/2020-12-30
S3:/mydrive/mystuff/2020-12-29
This is because they all share the same bucket prefix --> S3:/mydrive/mystuff/
What if you instead flipped the part that changes to the front? Now you have different buckets being used, because you are writing to different buckets (the prefix is different):
S3:2020-12-31/mydrive/mystuff/
S3:2020-12-30/mydrive/mystuff/
S3:2020-12-29/mydrive/mystuff/
This change will help with read/write speed because different buckets will be used. It will not solve the problem that S3 doesn't actually use directories to direct you to files. As I said, a prefix is really just a pointer to the bucket; S3 then searches against all the files you have written to find the file that exists in your bucket. This is why tons of small files make things worse: the lookup time gets longer and longer the more files you write. Because this lookup is expensive, it's much faster to write larger files and keep the cost of lookups to a minimum.
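If it helps to see the key layout in code, here is a small boto3 sketch of putting the high-variation component at the front of the key; the bucket name, key layout, and row format are all made up, and whether this helps in practice depends on your request rates:

import json
import boto3

s3 = boto3.client("s3")

def write_row(row, bucket="my-output-bucket"):
    # Lead the key with the component that varies most (here the row id),
    # so writes spread across many prefixes instead of piling onto one.
    key = f"{row['id']}/mydrive/mystuff/{row['date']}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(row).encode("utf-8"))

write_row({"id": "42", "date": "2020-12-31", "value": 3.14})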

google dataprep (cloud dataprep by trifacta) tip: jobs will not be able to run if they are too large

During my Cloud Dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures that need to be connected through reference datasets. If a certain limit is crossed when performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will perform a Parallel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B... but if side input B is very big, it won't fit into worker memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, it's a sign of not enough memory.
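For intuition, here is a stripped-down Beam Python sketch of what a side input looks like; the data and names are made up, but the key point holds: the whole reference collection is materialized and handed to workers alongside each main-input element.

import apache_beam as beam

with beam.Pipeline() as p:
    main = p | "Main" >> beam.Create([("a", 1), ("b", 2), ("c", 3)])
    reference = p | "Reference" >> beam.Create([("a", "alpha"), ("b", "beta")])

    enriched = main | "Enrich" >> beam.Map(
        # The entire reference set is delivered to each worker as a dict,
        # which is why a very large reference set can exhaust worker memory.
        lambda kv, lookup: (kv[0], kv[1], lookup.get(kv[0])),
        lookup=beam.pvalue.AsDict(reference),
    )
    enriched | "Print" >> beam.Map(print)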
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30 MB, with hundreds of thousands of rows).
I'd like to feed the logs into PipelineDB (running on the same single machine), whose continuous views can compute exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, the machine might occasionally power off unexpectedly (e.g. due to a power outage) during the copy of a log file, which means that once it is back online there is uncertainty about how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Note that each row in the log file has a unique identifier (a serial number created by my application), but I can't find an option in the docs to define a unique field on the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows.
Nonetheless, are there any alternative solutions to this issue?
Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number up to which it has been successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at (a sketch of this follows below).
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.
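A rough Python sketch of the first approach; the tracking table shipped_progress(logfile, last_line) with logfile as its primary key, the stream metrics_stream, and the three-field, comma-separated log format are all assumptions:

import psycopg2

def ship_logfile(path, conn):
    # Resume shipping a logfile from the last line recorded for it.
    with conn.cursor() as cur:
        cur.execute("SELECT last_line FROM shipped_progress WHERE logfile = %s", (path,))
        row = cur.fetchone()
        start = row[0] if row else 0
        last = start

        with open(path) as f:
            for lineno, line in enumerate(f):
                if lineno < start:
                    continue  # this line made it into PipelineDB before the crash
                serial, metric, value = line.rstrip("\n").split(",")[:3]
                cur.execute(
                    "INSERT INTO metrics_stream (serial, metric, value) "
                    "VALUES (%s, %s, %s)",
                    (serial, metric, value),
                )
                last = lineno + 1

        # Record how far we got, in the same transaction as the inserts.
        cur.execute(
            "INSERT INTO shipped_progress (logfile, last_line) VALUES (%s, %s) "
            "ON CONFLICT (logfile) DO UPDATE SET last_line = EXCLUDED.last_line",
            (path, last),
        )
    conn.commit()

conn = psycopg2.connect("dbname=pipeline")  # placeholder connection string
ship_logfile("/var/log/app/metrics-2020-12-31-2359.log", conn)

Note that this narrows the uncertainty window rather than guaranteeing exactly-once by itself; it's worth verifying how stream inserts interact with transactions in your PipelineDB version.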

spark save simple string to text file

I have a spark job that needs to store the last time it ran to a text file.
This has to work both on HDFS but also on local fs (for testing).
However, this is not nearly as straightforward as it seems.
I have tried deleting the directory first, and got "can't delete" error messages.
I have also tried storing a simple string value in a DataFrame, writing it to Parquet, and reading it back.
This is all so convoluted that it made me take a step back.
What's the best way to just store a string (timestamp of last execution in my case) to a file by overwriting it?
EDIT:
The nasty way I use it now is as follows:
sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)
and
sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)
This sounds like storing simple application/execution metadata. As such, saving a text file shouldn't need to be done by "Spark" (i.e. it shouldn't be done by workers in distributed Spark jobs).
The ideal place to do it is in your driver code, typically after constructing your RDDs. That said, you wouldn't be using the Spark API for this; you'd be doing something as trivial as using a writer or a file output stream. The only catch is how you'll read the value back. Assuming your driver program runs on the same machine each time, there shouldn't be a problem.
If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition so that you don't end up with multiple files containing the trivial value. The same can't be said for local storage (the file gets stored on whichever machine runs the worker executing the task); managing that would simply be going overboard.
Your best option is to use the driver program and create the file on the machine running the driver (assuming it is the same one that will be used next time), or, even better, to put the value in a database. If the value is needed inside jobs, the driver can simply pass it through.
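A minimal sketch of that driver-side idea, shown in Python for brevity; it assumes the driver always runs on the same machine and that a local path is acceptable (the path itself is made up). For HDFS you would swap the plain file I/O for the Hadoop FileSystem API or an HDFS client library.

from datetime import datetime, timezone
from pathlib import Path

MARKER = Path("/var/lib/myjob/last_execution.txt")  # hypothetical location

def read_last_execution():
    # Returns the previously stored timestamp, or None on the first run.
    return MARKER.read_text().strip() if MARKER.exists() else None

def write_last_execution():
    # Overwrites the marker file with the current UTC timestamp.
    MARKER.parent.mkdir(parents=True, exist_ok=True)
    MARKER.write_text(datetime.now(timezone.utc).isoformat())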

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a long list of files, with each file processed as a single unit. To do this, the input to my streaming job is a single file listing all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partway through a large job (~36%) when Hadoop hit some problematic files and, for some reason, the entire job crashed. If the job had completed successfully, standard out would have contained a line for each file as it was completed, along with some stats from the program that processes each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures System.out data here:
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics: length, number of normal ASCII chars, number of newlines, and so on.
Then, you can have 2 counters: a counter for "good" files, and a counter for "bad" files.
It will probably be pretty easy to notice that the bad files have something in common [no data, too much data, or maybe some non-printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with System.out statements is that jobs running on various machines can't integrate their data. Counters get around this problem - they are easily integrated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningful way.
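Since this is a streaming job, counters can be bumped straight from your mapper by writing specially formatted lines to stderr (reporter:counter:<group>,<counter>,<amount>). A rough Python sketch, where the per-file processing is only a placeholder:

#!/usr/bin/env python3
import sys

def bump(group, counter, amount=1):
    # Hadoop Streaming parses this stderr format and updates the job counters.
    sys.stderr.write(f"reporter:counter:{group},{counter},{amount}\n")

def process_file(path):
    # Placeholder for the real per-file work; raise on anything suspicious.
    if not path:
        raise ValueError("empty file name")

for line in sys.stdin:
    path = line.strip()
    try:
        process_file(path)
        bump("files", "good")
        print(path)  # goes to the job output (part-* files), not the task logs
    except Exception:
        bump("files", "bad")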
WORST CASE SCENARIO: YOU REALLY NEED TEXT DEBUGGING, and you don't want it in a temp file
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs: it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text. I find, quite often, that many of my debugging print statements, when cut down to their raw information content, are basically just counters...