Sort data within one output directory created by partitionBy - scala

I have a big geospatial dataset partitioned with partitionBy at quadkey level 5.
Each qk5 directory holds about 1-50 GB of data, so it doesn't fit into one file. I want to benefit from pushdown filters when running my geospatial queries, so I want the files within one qk5 partition to be sorted by a higher quadkey resolution (let's say quadkey level 10).
Question: is there a way to sort data within a partitionBy batch?
For example:
qk5=00001/
part1.parquet
part2.parquet
part3.parquet
part4.parquet
...
qk5=33333/
part10000.parquet
part20000.parquet
part30000.parquet
part40000.parquet
I want the data in part1.parquet, part2.parquet, part3.parquet and part4.parquet to be sorted by column 'qk10'.
Here is my current code, but it only sorts within each individual Spark partition (e.g. within part1.parquet):
// Parquet save
preExportRdd.toDF
.repartition(partitionsNumber, $"salt")
.sortWithinPartitions($"qk10")
.drop("salt")
.write
.partitionBy("qk")
.format("parquet")
.option("compression", "gzip")
.mode(SaveMode.Append)
.save(exportUrl)

The problem is that you don't sort your DataFrame globally by the qk field, so rows with the same qk value end up in different Spark partitions.
During the write phase, because of partitionBy("qk"), the data written to a specific physical partition (folder) may arrive from different Spark partitions, which leaves your output data unsorted.
Try instead the following:
preExportRdd.toDF
.repartitionByRange(partitionsNumber, $"qk", $"qk10", $"salt")
.sortWithinPartitions($"qk10")
.drop("salt")
.write
.partitionBy("qk")
.format("parquet")
.option("compression", "gzip")
.mode(SaveMode.Append)
.save(exportUrl)
repartitionByRange will sort your DataFrame by the provided columns and split the sorted DataFrame into the desired number of partitions.
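Once the files within each qk directory are sorted by qk10, range predicates on qk10 can be pushed down to Parquet, so whole row groups whose qk10 min/max range cannot match are skipped. A minimal read-side sketch, assuming the same exportUrl and using placeholder quadkey values:

import spark.implicits._

// Hypothetical query: the qk10 range filter is pushed down to Parquet and,
// because every file is sorted by qk10, it can skip entire row groups.
val tile = spark.read
  .parquet(exportUrl)
  .where($"qk10" >= "0000112000" && $"qk10" < "0000113000")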

Related

Spark Range partitioning for unbalanced dataframe

I have a dataframe with the following schema:
provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]
I need to save this dataframe partitioned by provider_id and quadkey5.
In general, the data field is quite similar in size across all providers.
However, some providers (certain provider_id values) have a data array 100x-1000x bigger than the others.
I am trying to get a balanced dataset with the following code:
df
.withColumn("qk_for_range",
  when($"provider" === high_freq_provider_id, substring($"quadkey18", 1, 14))
    .otherwise(substring($"quadkey18", 1, 10)))
.withColumn("quadkey5", substring($"quadkey18", 1, 5))
.repartitionByRange(nrPartitions, $"provider", $"qk_for_range")
.drop("qk_for_range")
.write
.partitionBy("provider", "quadkey5")
.format("parquet")
.option("compression", "gzip")
.option("maxRecordsPerFile", maxCountInPartition.toInt)
.mode(SaveMode.Overwrite)
.save(exportUrl)
However, I get really huge partition parquet files (~1 GB) when I want smaller files (~200 MB).
I can decrease the "maxRecordsPerFile" option, but in that case I would get a lot of small files for all the "light" providers (those with a small data array per record).
My question is: how do I break down the "fat" partitions?
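One possible direction (not an answer from the original thread, just a sketch that reuses the salt idea from the first question) is to range-partition on an extra salt column, so a single heavy (provider, quadkey) range gets spread over several Spark partitions and therefore several smaller files; the salt bucket count of 20 is an arbitrary assumption:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical: salt only the heavy provider so its ranges split across ~20 Spark partitions,
// while light providers keep a single bucket and still produce few, larger files.
val salted = df
  .withColumn("salt",
    when($"provider" === high_freq_provider_id, (rand() * 20).cast("int"))
      .otherwise(lit(0)))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))

salted
  .repartitionByRange(nrPartitions, $"provider", $"quadkey5", $"salt")
  .drop("salt")
  .write
  .partitionBy("provider", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", maxCountInPartition.toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)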

Eliminate duplicates (deduplication) in Streaming DataFrame

I have a Spark streaming processor.
The DataFrame dfNewExceptions has duplicates (duplicated by "ExceptionId").
Since this is a streaming dataset, the below query fails:
val dfNewUniqueExceptions = dfNewExceptions.sort(desc("LastUpdateTime"))
.coalesce(1)
.dropDuplicates("ExceptionId")
val dfNewExceptionCore = dfNewUniqueExceptions.select("ExceptionId", "LastUpdateTime")
dfNewExceptionCore.writeStream
.format("console")
// .outputMode("complete")
.option("truncate", "false")
.option("numRows",5000)
.start()
.awaitTermination(1000)
Exception in thread "main" org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
This is also documented here: https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-programming-guide.html
Any suggestions on how the duplicates can be removed from dfNewExceptions?
I recommend following the approach explained in the Structured Streaming Guide on Streaming Deduplication. There it says:
You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as de-duplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use de-duplication with or without watermarking.
With watermark - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.
An example in Scala is also given:
val dfExceptions = spark.readStream. ... // columns: ExceptionId, LastUpdateTime, ...
dfExceptions
.withWatermark("LastUpdateTime", "10 seconds")
.dropDuplicates("ExceptionId", "LastUpdateTime")
You can use watermarking to drop duplicates in a specific timeframe.
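For context, a minimal sketch of how this slots into the original query, assuming LastUpdateTime is an event-time timestamp column (the streaming source for dfNewExceptions is not shown in the question and is assumed to already exist):

// Deduplicate within a 10-second watermark instead of sort + coalesce + dropDuplicates.
val dfNewExceptionCore = dfNewExceptions
  .withWatermark("LastUpdateTime", "10 seconds")     // bounds the deduplication state
  .dropDuplicates("ExceptionId", "LastUpdateTime")   // at most one row per id and event time
  .select("ExceptionId", "LastUpdateTime")

dfNewExceptionCore.writeStream
  .format("console")
  .option("truncate", "false")
  .option("numRows", 5000)
  .start()
  .awaitTermination(1000)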

Spark Partition Dataset By Column Value

(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable partitions. Since the PKs (unique IDs) are present both in the data and in the update files, it will be easy to determine which partition will be updated. We intend to shard the data and the updates by the same criteria, and periodically rewrite "shard S + all updates accumulated for shard S => new shard S". (We know how to combine shard S + updates = new shard S.)
If this is our design, we need to (1) shard a DataFrame by one of its columns (say: column K) into |range(K)| partitions where it is guaranteed that all rows in a partition have the same value in column K and (2) be able to find the Parquet file that corresponds to column_K=k, knowing k = row.uniqueID % numShards.
Is this a good design, or does Spark offer something out of the box that makes our task much easier?
Which Spark class/method should we use for partitioning our data? We are looking at RangePartitioner, but the constructor is asking for the number of partitions. We want to specify "use column_K for partitioning, and make one partition for each distinct value k in range(K)", because we have already created column_K = uniqueID % numShards. Which partitioner is appropriate for splitting on the value of one column of a DataFrame? Do we need to create a custom partitioner, or use partitionBy, or repartitionByRange, or...?
This is what we have so far:
import org.apache.spark.sql.functions._
val df = spark.read
.option("fetchsize", 1000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc(jdbc_url, "SCHEMA.TABLE_NAME", partitions, props)
.withColumn("SHARD_ID", col("TABLE_PK") % 1024)
.write
.parquet("parquet/table_name")
Now we need to specify that this DataFrame should be partitioned by SHARD_ID before it is written out as Parquet files.
This works:
val df = spark.read
.option("fetchsize", 1000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc(jdbc.getString("url"), "SCHEMA.TABLE_NAME", partitions, props)
.withColumn("SHARD_ID", col("TABLE_PK") % 1024)
.write
.partitionBy("SHARD_ID")
.parquet("parquet/table_name")
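Reading a single shard back for the periodic rewrite then relies on partition pruning; a sketch, where somePrimaryKey stands in for the row's uniqueID (a placeholder, not from the original code):

import org.apache.spark.sql.functions.col

// Because SHARD_ID is a partition column, Spark prunes straight to the
// parquet/table_name/SHARD_ID=<k> directory instead of scanning everything.
val k = somePrimaryKey % 1024
val shard = spark.read
  .parquet("parquet/table_name")
  .where(col("SHARD_ID") === k)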

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows for a DataFrame. I first read data from S3 into 4 different DataFrames; these counts are always consistent. But after joining the DataFrames, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks,
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
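For illustration, that is the same join from the question with a cache() appended, so every later action (counts, the timestamp filter) sees one materialized result:

val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
.cache() // materialize once; without this every action re-reads and re-joins the sources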

merge multiple small files in to few larger files in Spark

I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to a partitioned table, it spits out very small files (files in KBs), so the output partitioned table folder now has 5000+ small KB files. I want to merge these into a few larger files of maybe about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and spit out files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
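In the Spark code from the question this just means issuing the modified statement through hiveContext.sql, for example:

val result8 = hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c FROM partition_json_table DISTRIBUTE BY date")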
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
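As a rough illustration of that estimate (the DataFrame name sourceDf and the average row size are assumptions, not from the original question):

// Hypothetical sizing: aim for ~200 MB files given an estimated average row size.
val targetFileBytes = 200L * 1024 * 1024
val approxRowBytes  = 500L                       // assumed average serialized row size
val numFiles = math.max(1, (sourceDf.count() * approxRowBytes / targetFileBytes).toInt)

sourceDf.coalesce(numFiles)
  .write
  .mode("append")
  .insertInto("partition_table")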
The dataframe repartition(1) method works in this case.