Spark binary file and Delta Table - pyspark

I receive binary files (~3 MB each) in batches of about 20,000 files at a time. These files are used downstream for further processing, but I also want to process them and store them in Delta tables.
I can do this easily:
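from delta.tables import DeltaTable
from pyspark.sql.functions import expr

df = spark.read.format("binaryFile").load(<path-to-batch>)
df = df.withColumn("id", expr("uuid()"))
dt = DeltaTable.forName(spark, "myTable")
# "a" is the existing table, "b" is the incoming batch
dt.alias("a").merge(
    df.alias("b"),
    "a.path = b.path"
).whenNotMatchedInsert(
    values={"id": "b.id", "content": "b.content"}
).execute()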
This already makes the table quite slow, but later I need to query certain IDs, collect them, and write each one back out as an individual binary file.
Questions:
Would my table benefit from a batch column and partition?
Should I partition by id? I know this is not ideal, but might it make querying individual rows easier?
Is there a better way to write the files out again than .collect()? When I select about 1,000 specific ids and write them out, about 10 minutes is spent on collect alone and less than a minute on writing. I do something like:
for row in df.collect():
    with open(row.id, "wb") as fw:
        fw.write(row.content)

Because uuid() returns random values, I'm afraid we cannot use it to match existing data against new records. (Sorry if I misunderstood the idea.)
I don't think partitioning by id will help, because the id column obviously has very high cardinality.
Instead of using collect(), which loads all records into the driver, I think it would be better to write the records in the Spark DataFrame directly and in parallel from all the worker nodes into a temporary location on ADLS first, and then gather the data files you need from that location.
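One way to picture writing from the workers instead of the driver is a per-partition writer function. The sketch below is only an illustration under assumptions: the helper name make_partition_writer and the output directory are hypothetical, and the rows are simulated locally with a namedtuple, which offers the same row.id / row.content attribute access as a pyspark Row.

```python
import os
import tempfile
from collections import namedtuple


def make_partition_writer(out_dir):
    """Return a function that writes every row of one partition into out_dir."""

    def write_partition(rows):
        # On a cluster this runs on an executor: rows is an iterator over that
        # partition only, so nothing is pulled back to the driver.
        os.makedirs(out_dir, exist_ok=True)
        for row in rows:
            with open(os.path.join(out_dir, row.id), "wb") as fw:
                fw.write(row.content)

    return write_partition


# Local stand-in for a single partition (hypothetical data).
Row = namedtuple("Row", ["id", "content"])
demo_dir = tempfile.mkdtemp()
make_partition_writer(demo_dir)(iter([Row("id-1", b"\x00\x01"), Row("id-2", b"\x02")]))
written = sorted(os.listdir(demo_dir))
```

With a real DataFrame, the same function could be passed along the lines of df.select("id", "content").foreachPartition(make_partition_writer(some_dir)), assuming some_dir is a location every executor can write to (for example a mounted ADLS path).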

Related

Modifying number of output files per write-partition with spark

I have a data source that consists of a huge number of small files. I would like to save it, partitioned by the column user_id, to another storage:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is I want another system to be able to delete only select users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).
How do I obtain this with pyspark?
You can use repartition to ensure that each partition gets one file:
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
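This makes sure that one file is created for each user_id partition, because all rows with the same user_id end up in the same task. With coalesce, on the other hand, you can run into trouble if more than one partition remains, since each remaining partition writes its own file per user_id.
Alternatively, just add coalesce with the number of files you want. Note that coalesce(1) funnels the whole write through a single task: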
sdf.coalesce(1).write.partitionBy("user_id").json("...")
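To see why repartitioning by the key leads to one file per user, it can help to simulate the file count without Spark. This is a rough model under stated assumptions: each write task produces one file per user_id it holds rows for, the data and task count are made up, and zlib.crc32 merely stands in for Spark's own hash partitioner.

```python
import zlib
from collections import defaultdict


def files_per_user(assignments):
    """Count output files per user: one file per (task, user_id) pair that holds rows."""
    tasks_by_user = defaultdict(set)
    for task, user_id in assignments:
        tasks_by_user[user_id].add(task)
    return {user: len(tasks) for user, tasks in tasks_by_user.items()}


rows = [f"user{i % 3}" for i in range(12)]  # hypothetical data: 3 users, 4 rows each
num_tasks = 4

# Without repartition: rows sit wherever the input files put them (round-robin here).
scattered = [(i % num_tasks, user) for i, user in enumerate(rows)]
# After repartition("user_id"): the task is chosen from a hash of the key.
by_key = [(zlib.crc32(user.encode()) % num_tasks, user) for user in rows]

before = files_per_user(scattered)
after = files_per_user(by_key)
```

In this toy setup every user is spread over several tasks before the repartition and lands in exactly one task afterwards, which is what yields a single file per user_id directory.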

Spark.SQL – Aggregation of separated data in parallel

My task is to aggregate data by hour (and store each hour as a row in a DB).
To aggregate one hour, there is no need to know what the other hours contain.
The input is JSON files. An important point is that these files are stored in separate folders, one folder per hour.
I have 2 questions:
What is the right way to aggregate in such a scenario? I'd like to "send" each hour's data to different node(s) and aggregate the hours separately in parallel, so that in the end I have a DataFrame that contains only the aggregated result of each hour. I understand that simple partitioning doesn't return such a DataFrame.
How could I take advantage of the separate folders? Is it worth reading each hour's data separately and then combining everything with union (while preserving the partitioning, like here)? Does that indeed save the "group-by" operation?

Scala - Write data to file with row limit

I have an RDD with 30 million rows of data. Is there a way to save it into files of 1 million rows each?
I think there is no direct way of doing it. One thing you can do is collect() your RDD, get an iterator from the result, and save it with the normal file I/O that Scala provides. Something like this:
val arrayValue = yourRdd.collect();
//Iterate the array and put it in file if it reaches the limit .
Note: This approach is not recommended if your data size is huge, because collect() will bring all the records of the RDD to the driver (master).
You can do rdd.repartition(30). This will ensure that your data is split about equally into 30 partitions, which should give you partitions of roughly 1 million rows each.
Then you simply call rdd.saveAsTextFile(<path>) and Spark will create as many files as there are partitions under <path>. Or, if you want more control over how and where your data is saved, you can do rdd.foreachPartition(f: Iterator[T] => Unit) and handle the logic of dealing with the rows and saving them as you see fit within the function f passed to foreachPartition. (Note that foreachPartition runs on each of your executor nodes and does not bring the data back to the driver, which of course is desirable.)
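If an exact row limit per file matters more than a roughly even split, the function handed to foreachPartition can roll over to a new file every N rows. The sketch below shows only that chunking logic, in Python rather than Scala, run on a plain iterator; the name write_in_chunks, the file naming scheme, and the temporary directory are assumptions made for illustration.

```python
import itertools
import os
import tempfile


def write_in_chunks(rows, out_dir, prefix, limit):
    """Write an iterator of text rows to files holding at most `limit` rows each."""
    os.makedirs(out_dir, exist_ok=True)
    rows = iter(rows)
    paths = []
    for index in itertools.count():
        # Pull at most `limit` rows, so only one chunk is held in memory at a time.
        chunk = list(itertools.islice(rows, limit))
        if not chunk:
            break
        path = os.path.join(out_dir, f"{prefix}-{index:05d}.txt")
        with open(path, "w") as fw:
            fw.writelines(f"{row}\n" for row in chunk)
        paths.append(path)
    return paths


# Local stand-in for one partition: 10 rows, at most 4 rows per file.
demo_dir = tempfile.mkdtemp()
demo_paths = write_in_chunks((str(n) for n in range(10)), demo_dir, "part", limit=4)
```

When used per partition, the prefix would need to be unique for each partition (for example by including the partition index), otherwise files written by different executors could collide.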

Spark partitionBy much slower than without it

I tested writing with:
df.write.partitionBy("id", "name")
  .mode(SaveMode.Append)
  .parquet(filePath)
However if I leave out the partitioning:
df.write
  .mode(SaveMode.Append)
  .parquet(filePath)
It executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively.
The DataFrame has 10 additional integer columns.
The first code snippet writes at least one Parquet file per partition directory to the file system (local or HDFS). With 10 distinct ids and 3000 distinct names, this code can create up to 30,000 directories, each holding at least one file, and more if several write tasks hold rows for the same id/name combination. I suspect that the overhead of creating files, writing Parquet metadata, etc. is quite large (in addition to shuffling).
Spark is not the best database engine; if your dataset fits in memory, I suggest using a relational database. It will be faster and easier to work with.
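The arithmetic behind that file count can be made explicit. This is a back-of-the-envelope sketch, assuming every id/name combination actually occurs and that each write task creates its own file in every directory it has rows for; the figure of 200 write tasks is a made-up example, not something stated in the question.

```python
def partition_write_bounds(distinct_counts, write_tasks):
    """Return (directories, worst-case files) for a partitionBy write."""
    directories = 1
    for count in distinct_counts:
        directories *= count
    # Worst case: every task holds rows for every combination of partition values.
    return directories, directories * write_tasks


dirs, worst_case_files = partition_write_bounds([10, 3000], write_tasks=200)
```

Here dirs is the 30,000 mentioned above, and the worst case grows linearly with the number of tasks doing the write, which is why reducing the number of tasks that hold each combination can matter so much.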

Using scala to dump result processed by Spark to HDFS

I'm a bit confused about the right way to save data into HDFS after processing it with Spark.
This is what I'm trying to do. I'm calculating the min, max and SD of numeric fields. My input files have millions of rows, but the output will have only around 15-20 fields. So, the output is a single (scalar) value for each statistic of each field.
For example: I will load all the rows of FIELD1 into an RDD, and at the end I will get 3 single values for FIELD1 (MIN, MAX, SD). I concatenate these three values into a temporary string. In the end, I will have 15 to 20 rows containing 4 columns in the following format:
FIELD_NAME_1 MIN MAX SD
FIELD_NAME_2 MIN MAX SD
This is a snippet of the code:
//create rdd
val data = sc.textFile("hdfs://x.x.x.x/"+args(1)).cache()
//just get the first column
val values = data.map(_.split(",",-1)(1))
val data_double = values.map(x => if (x == "") 0.0 else x.toDouble)
val min_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(true).take(1)(0)._1
val max_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(false).take(1)(0)._1
val SD = data_double.stdev
So, I have 3 variables, min_value, max_value and SD, that I want to store back to HDFS.
Question 1:
Since the output will be rather small, do I just save it locally on the server, or should I dump it to HDFS? It seems to me that saving the file locally makes more sense.
Question 2:
In Spark, I can just call the following to save an RDD into a text file:
some_RDD.saveAsTextFile("hdfs://namenode/path")
How do I accomplish the same thing in Scala for a String variable that is not an RDD? Should I parallelize my result into an RDD first and then call saveAsTextFile?
To save locally just do
some_RDD.collect()
Then save the resulting array with something like the approach from this question. And yes, if the data set is small and can easily fit in memory, you should collect it and bring it to the driver of the program. Another option, if the data is a little too large to store in memory, is just some_RDD.coalesce(numPartitionsToStoreOn). Keep in mind that coalesce also takes a boolean shuffle; if you are doing calculations on the data before coalescing, you should set this to true to get more parallelism on the calculations. Coalesce reduces the number of partitions, and therefore the number of part files written, when you call some_RDD.saveAsTextFile("hdfs://namenode/path"). If the file is very small but you need it on HDFS, call repartition(1), which is the same as coalesce(1, true); this will ensure that your data is written as a single part file.
UPDATE:
So if all you want to do is save three values in HDFS, you can do this:
sc.parallelize(List((min_value,max_value,SD)),1).saveAsTextFile("pathTofile")
Basically, you are just putting the 3 vars in a tuple, wrapping that in a List, and setting the parallelism to one, since the data is very small.
Answer 1: Since you just need several scalars, I'd suggest storing them in the local file system. You can first do val localValue = rdd.collect(), which will collect all data from the workers to the driver. Then you can use java.io to write it to disk.
Answer 2: You can do sc.parallelize(Seq(yourString)).saveAsTextFile("hdfs://host/yourFile"). Note that the string has to be wrapped in a Seq; a bare String passed to parallelize is treated as a sequence of characters. This will write the output to part-000* files. If you want everything in one file, hdfs dfs -getmerge is there to help you.
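For the local-file route, the last step is ordinary file I/O on the driver. The following is a minimal sketch in Python rather than Scala, with hypothetical, already-collected values and a made-up output file name; it uses the population standard deviation, which as far as I know is also what stdev on a Spark RDD of doubles computes.

```python
import os
import statistics
import tempfile


def summarize(name, values):
    """Return one tab-separated output line: FIELD_NAME, MIN, MAX, SD."""
    sd = statistics.pstdev(values)  # population standard deviation
    return f"{name}\t{min(values)}\t{max(values)}\t{sd}"


# Hypothetical values for two fields, standing in for the computed results.
fields = {
    "FIELD_NAME_1": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "FIELD_NAME_2": [1.0, 1.0, 1.0],
}
lines = [summarize(name, values) for name, values in fields.items()]

# Write the 15-20 summary lines as one small local file on the driver.
out_path = os.path.join(tempfile.mkdtemp(), "field_stats.tsv")
with open(out_path, "w") as fw:
    fw.write("\n".join(lines) + "\n")
```

Because the result is only a handful of lines, nothing here needs to be distributed; the file could afterwards be copied to HDFS with the usual command-line tools if it is needed there.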