Why is the output data different between 2 sinks from the same PCollection? - apache-beam

I have an Apache Beam application that runs with the Spark runner on a YARN cluster. It reads multiple inputs, applies transforms, and produces two outputs: one in Parquet and the other in text file format.
In my transforms, one of the steps generates a UUID and assigns it to an attribute of my POJO, which gives me a PCollection. From that PCollection I apply further transforms to convert myPojo to String and to GenericRecord, then use TextIO and ParquetIO to save to my storage.
Just now I observed a strange issue: in the output files, the UUID attribute differs between the Parquet data and the text data for the same record!
I expected that, since both outputs come from the same PCollection and are merely written in different formats, the data would be the same, right?
The issue happens only with large input volumes; in my unit test it gives me the same value in both formats.
I assume some kind of recomputation happens when the PCollection is sunk to the different IOs, but I can't confirm it. Can anyone help explain?
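For reference, the "recomputation" hypothesis can be tested in the pipeline itself: if the branch feeding each sink re-executes the upstream ParDo, a non-deterministic step like UUID generation runs once per sink. A common mitigation is to checkpoint the collection with Reshuffle right after assigning the UUIDs. This is a hedged sketch using the Beam Java SDK from Scala; `MyPojo`, `AssignUuidFn`, and `input` are hypothetical names standing in for the real pipeline:

```scala
import org.apache.beam.sdk.transforms.{ParDo, Reshuffle}
import org.apache.beam.sdk.values.PCollection

// Without the Reshuffle, each sink branch may re-run AssignUuidFn
// and therefore observe a different UUID for the same record.
val withIds: PCollection[MyPojo] =
  input.apply("AssignUuid", ParDo.of(new AssignUuidFn()))

// Reshuffle materializes the collection once, so both the TextIO
// and the ParquetIO branches read identical UUID values downstream.
val stable: PCollection[MyPojo] =
  withIds.apply("StabilizeIds", Reshuffle.viaRandomKey[MyPojo]())
```

The unit-test observation fits this theory: with a small input everything may run in a single bundle, while at scale the runner fans out and recomputes per branch.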
Thanks

Related

Reading Multiple folders parallely

I have multiple part folders, each containing Parquet files (example given below). Across part folders the schema can differ (either in the number of columns or in the datatype of certain columns). My requirement is to read all the part folders and finally create a single df according to a predefined, passed-in schema.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure what kind of changes exist in which part folders, I am reading each part folder individually, comparing its schema with the predefined schema, and making the necessary changes, i.e., adding/dropping columns or typecasting the column datatypes. Once done, I write the result into a temp location, then move on to the next part folder and repeat the same operation. Once all the part folders are read, I read the temp location in one go to get the final output.
Now I want to do this operation in parallel, i.e., have parallel threads/processes (?) that read part folders concurrently, execute the schema-comparison logic plus any necessary changes, and write into a temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in the majority of the scenarios the schema is the same across directories, so they can use a wildcard to read the input path and create the df; that will not work in my case. The problem statement in the link below is similar to mine, but in my case the number of part folders to read is variable and sometimes over 1000. Moreover, there are operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first one transforms the existing data into the appropriate schema, and the second one reads the transformed data in a convenient way (with * wildcards). Use Airflow (or Oozie) to start one data-transformer application per directory. After all instances of the data transformer have finished successfully, run the union app.
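The first phase might be sketched as follows; `targetSchema`, `feedPath`, and `tempPath` are hypothetical names, and each run of this snippet (one per part folder, launched from Airflow/Oozie) aligns one folder to the predefined schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Project a dataframe onto a target schema: cast matching columns,
// fill missing ones with nulls, and drop any extras via select.
def alignToSchema(df: DataFrame, target: StructType): DataFrame = {
  val cols = target.fields.map { f =>
    if (df.columns.contains(f.name)) col(f.name).cast(f.dataType).as(f.name)
    else lit(null).cast(f.dataType).as(f.name)  // missing column -> nulls
  }
  df.select(cols: _*)
}

// One transformer run handles one part folder.
val part = spark.read.parquet(s"$feedPath/date=20221220")
alignToSchema(part, targetSchema)
  .write.mode("overwrite")
  .parquet(s"$tempPath/date=20221220")
```

The union app then simply reads `tempPath` with a wildcard, since every folder now shares the target schema.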

EMR-Spark is slow writing a DataFrame with an Array of Strings to S3

I'm trying to write a dataframe to S3 from EMR-Spark, and I'm seeing some really slow write times where the writing comes to dominate the total runtime (~80%) of the script. For what it's worth, I've tried both .csv and .parquet formats; it doesn't seem to make a difference.
My data can be formatted in two ways, here's the preferred format:
ID : StringType | ArrayOfIDs : ArrayType
(The number of unique IDs in the first column numbers in the low millions. ArrayOfIDs contains GUID formatted strings, and can contain anywhere from ~100 - 100,000 elements)
Writing the first form to S3 is incredibly slow. For what it's worth, I've tried setting the mapreduce.fileoutputcommitter.algorithm.version to 2 as described here: https://issues.apache.org/jira/browse/SPARK-20107 to no real effect.
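For reference, the committer setting mentioned above is typically applied through the Hadoop configuration; a minimal sketch:

```scala
// On the SparkContext's Hadoop configuration at runtime...
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// ...or equivalently as a Spark conf at submit time:
// --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
```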
However my data can also be formatted as an adjacency list, like this:
ID1 : StringType | ID2 : StringType
This appears to be much faster for writing to S3, but I am at a loss for why. Here are my specific questions:
Ultimately I'm trying to get my data into an Aurora RDS Postgres cluster (I was told firmly by those before me that the Spark JDBC connector is too slow for the job, which is why I'm currently trying to dump the data in S3 before loading it into Postgres with a COPY command). I'm not married to using S3 as an intermediate store if there are better alternatives for getting these data frames into RDS Postgres.
I don't know why the first schema with the Array of Strings is so much slower on write. The total data written is actually far less than the second schema on account of eliminating ID duplication from the first column. Would also be nice to understand this behavior.
Well, I still don't know why writing arrays directly from Spark is so much slower than the adjacency list format. But best practice seems to dictate that I avoid writing to S3 directly from Spark.
Here's what I'm doing now:
Write the data to HDFS (anecdotally, the write speed of the adjacency list vs the array now falls in line with my expectations).
From HDFS, use EMR's s3-dist-cp utility to wholesale write the data to S3 (this also seems reasonably performant with array typed data).
Bring the data into Aurora Postgres with the aws_s3.table_import_from_s3 extension.
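The three steps above might look roughly like this; bucket, path, table, and region names are hypothetical, and CSV is used because Postgres COPY-style imports cannot read Parquet:

```scala
// 1) Write to HDFS from Spark instead of directly to S3.
df.write.csv("hdfs:///staging/ids")

// 2) Copy from HDFS to S3 with EMR's s3-dist-cp (run on the cluster):
//    s3-dist-cp --src hdfs:///staging/ids --dest s3://my-bucket/ids
//
// 3) Load into Aurora Postgres with the aws_s3 extension (run in psql):
//    SELECT aws_s3.table_import_from_s3(
//      'ids_table', '', '(format csv)',
//      aws_commons.create_s3_uri('my-bucket', 'ids/part-00000', 'us-east-1'));
```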

How to add record numbers to TextIO file sources in Apache Beam or Dataflow

I am using Dataflow (and now Beam) to process legacy text files to replicate the transformations of an existing ETL tool. The current process adds a record number (the record number for each row within each file) and the filename. The reason they want to keep this additional info is so that they can tell which file and record offset the source data came from.
I want to get to a point where I have a PCollection containing the file record number and the filename as additional fields in the value, or as part of the key.
I've seen a different article where the filename can be populated into the resulting PCollection; however, I do not have a solution for adding the record numbers per row. Currently the only way I can do it is to pre-process the files before I start the Dataflow process (which is a shame, since I would want Dataflow/Beam to do it all).
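One hedged way to keep this inside the pipeline is to give up TextIO's parallel splitting and read each matched file sequentially in a DoFn, counting lines as you go — only then is a per-file record number well defined. The FileIO names below are from the Beam Java SDK (called from Scala); the rest is a sketch that assumes files are small enough to read whole:

```scala
import org.apache.beam.sdk.io.FileIO
import org.apache.beam.sdk.io.FileIO.ReadableFile
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.values.KV

// Emits "filename:recordNumber" -> line. Each file is processed by a
// single worker, which is what makes the per-file counter possible.
class NumberLinesFn extends DoFn[ReadableFile, KV[String, String]] {
  @ProcessElement
  def process(c: DoFn[ReadableFile, KV[String, String]]#ProcessContext): Unit = {
    val file  = c.element()
    val name  = file.getMetadata.resourceId.toString
    // Reads the whole file into memory; fine for modest file sizes only.
    val lines = file.readFullyAsUTF8String().split("\n", -1)
    lines.zipWithIndex.foreach { case (line, i) =>
      c.output(KV.of(s"$name:${i + 1}", line))
    }
  }
}

val numbered = pipeline
  .apply(FileIO.`match`().filepattern("gs://bucket/legacy/*.txt"))
  .apply(FileIO.readMatches())
  .apply(ParDo.of(new NumberLinesFn()))
```

The trade-off is losing intra-file parallelism, but across many legacy files the per-file parallelism may still be sufficient.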

Spark: Is it possible to load an RDD from multiple files in different formats?

I have a heterogeneously formatted input of files, in batch mode.
I want to run a batch over a number of files. These files are of different formats, and they will have different mappings to normalize data (e.g. extract fields with different schema names or positions in the records, to a standard naming).
Given the tabular nature of the data, I'm considering using Dataframes (cannot use datasets due to the Spark version I'm bound to).
In order to apply different extraction logic to each file, does each file need to be loaded into a separate dataframe, have the extraction logic applied (a process which differs per file type, configured in terms of e.g. CSV/JSON/XML, the position of fields to select (CSV), the name of the field to select (JSON), etc.), and then be joined with the other datasets?
That would force me to iterate files, and act on each dataframe separately, and join dataframes afterwards; instead of applying the same (configurable) logic.
I know I could do it with RDDs, i.e. loading all files into an RDD, emitting a PairRDD[fileId, record], and then running a map where you look up the fileId to get the configuration for that record, which tells you which logic to apply.
I'd rather use Dataframes, for all the niceties they offer over raw RDDs in terms of performance, simplicity and parsing.
Is there a better way to use Dataframes to address this problem than the one already explained? Any suggestions or misconceptions I may have?
I'm using Scala, though it should not matter to this problem.
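A hedged DataFrame-based sketch of the per-file-type configuration idea: keep a map from file type to a reader/normalizer function, produce one normalized DataFrame per input, and union them. The column names, paths, and field positions below are hypothetical placeholders for the real configuration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Each entry knows how to load one file type and project it onto the
// standard column names (positional for CSV, by name for JSON).
val readers: Map[String, String => DataFrame] = Map(
  "csv"  -> (path => spark.read.csv(path)
                       .select(col("_c0").as("id"), col("_c2").as("amount"))),
  "json" -> (path => spark.read.json(path)
                       .select(col("identifier").as("id"), col("amt").as("amount")))
)

val inputs = Seq(("csv", "/data/a.csv"), ("json", "/data/b.json"))
val normalized = inputs.map { case (kind, path) => readers(kind)(path) }
// All frames now share the standard schema, so a plain union works.
val unified = normalized.reduce(_ union _)
```

This keeps the "iterate files" loop, but confines it to driver-side plan construction; the actual reads and the union still execute as one distributed job.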

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data is written and how much I still need to write.
I am using Parquet because the partitioning substantially speeds up my querying in the future. Also, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to Parquet or another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata files are somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
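If you're on Spark 2.3 or later, one hedged alternative to append mode is dynamic partition overwrite: overwrite mode then replaces only the partitions present in the incoming batch, which makes a failed hour safe to rerun. A minimal sketch based on the save line above:

```scala
// Replace only the partitions touched by this batch, not the whole dataset.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

data
  .filter(validPartnerIds($"partnerID"))
  .write
  .mode("overwrite")
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)
```

Since each hourly run only ever touches its own hour's partitions, retrying a failed run is idempotent: it simply rewrites those partitions in place.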
If you're using unsorted partitioning, your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition