I have a dataset that I want to repartition evenly into 10 buckets per unique value of a column, and I want to end up with a large number of partitions so that each one is small.
col_1 is guaranteed to be one of the values in ["CREATE", "UPDATE", "DELETE"]
My code looks like the following:
df.show()
"""
+------+-----+-----+
| col_1|col_2|index|
+------+-----+-----+
|CREATE| 0| 0|
|CREATE| 0| 1|
|UPDATE| 0| 2|
|UPDATE| 0| 3|
|DELETE| 0| 4|
|DELETE| 0| 5|
|CREATE| 0| 6|
|CREATE| 0| 7|
|CREATE| 0| 8|
+------+-----+-----+
"""
df = df.withColumn(
    "partition_column",
    F.concat(
        F.col("col_1"),
        # Pick a random integer between 0 and 9
        F.floor(F.rand() * 10).cast("string")
    )
)
df = df.repartition(1000, F.col("partition_column"))
I see that most of my tasks run and finish with zero rows of data. I would have expected the data to be distributed evenly across 1000 partitions based on my partition_column. Why isn't it?
It's important to understand that the mechanism Spark uses to distribute its data is based upon the hash value of the columns you provide to the repartition() call.
In this case, you have one column with random values between 0 and 9, combined with another column that only ever has one of 3 different values in it.
Therefore, you'll have 10 * 3 unique combinations of values going into the repartition() call. When Spark hashes this column, it will only ever see 30 distinct hash values, and taking those values modulo 1000 can yield at most 30 distinct partition IDs. Therefore, the greatest number of non-empty partitions you will ever have is 30.
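You can check this directly with spark_partition_id(), which reports the partition each row ended up in. A quick sketch, assuming the df and partition_column from the code above:
from pyspark.sql import functions as F

repartitioned = df.repartition(1000, F.col("partition_column"))

# Count how many of the 1000 requested partitions actually received rows.
non_empty = repartitioned.select(F.spark_partition_id().alias("pid")).distinct().count()
print(non_empty)  # never exceeds the number of distinct partition keys (~30)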
You'll need to distribute your data into a greater number of random values if you want to go above partition counts of 30, or figure out another partitioning strategy entirely :)
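For example, here is a sketch of two options (names come from your snippet; the bucket count of 334 is just an illustration, chosen so that 3 * 334 is roughly 1000):
from pyspark.sql import functions as F

# Option 1: plain round-robin repartitioning gives the most even spread,
# if you don't actually need the rows keyed by col_1 + bucket.
df_even = df.repartition(1000)

# Option 2: keep the col_1-plus-bucket idea, but draw the bucket from a much
# larger range so there are roughly 1000 distinct keys to hash.
df = df.withColumn(
    "partition_column",
    F.concat(F.col("col_1"), F.floor(F.rand() * 334).cast("string"))
)
df = df.repartition(1000, F.col("partition_column"))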
I have a dataframe with several columns, including PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, count each combination into a column called "count", and also divide the average of total_amount by the average of trip_distance into a column called "trip_rate". The end DF should be:
+------------+------------+-----+---------+
|PULocationID|DOLocationID|count|trip_rate|
+------------+------------+-----+---------+
|         123|         422|    1|   5.2435|
|           3|          27|    4|   6.6121|
+------------+------------+-----+---------+
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate"),
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
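If you want the displayed values to match the 4-decimal figures in your expected output, you can optionally round the ratio (a cosmetic tweak, not required for the calculation):
new_df = new_df.withColumn("trip_rate", F.round(F.col("trip_rate"), 4))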
I need to use a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, adding that as a 4th column. I can do a plain count without any issues, but using a distinct count throws an exception:
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported
Is there any workaround for this?
A previous answer suggested two possible techniques: approximate counting and size(collect_set(...)). Both have problems.
If you need an exact count, which is the main reason to use COUNT(DISTINCT ...) in big data, approximate counting will not do. Also, the actual error rates of approximate counting can vary quite significantly for small data.
size(collect_set(...)) may cause a substantial slowdown in processing of big data because it uses a mutable Scala HashSet, which is a pretty slow data structure. In addition, you may occasionally get strange results, e.g., if you run the query over an empty dataframe, because size(null) produces the counterintuitive -1. Spark's native distinct counting runs faster for a number of reasons, the main one being that it doesn't have to produce all the counted data in an array.
The typical approach to solving this problem is with a self-join. You group by whatever columns you need, compute the distinct count or any other aggregate function that cannot be used as a window function, and then join back to your original data.
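As a minimal sketch of that approach (the DataFrame and column names here are made up for illustration):
import org.apache.spark.sql.functions._

// Exact distinct count per group, then joined back onto every original row,
// which is what the (unsupported) distinct window function would have produced.
val distinctCounts = orders
  .groupBy("customer", "store")
  .agg(countDistinct("product").as("distinct_products"))

val withCounts = orders.join(distinctCounts, Seq("customer", "store"))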
Use the approx_count_distinct function, or the size and collect_set functions, over a window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//| i| j| k|
//+---+---+---+
//| 1| a| c|
//| 2| b| d|
//| 1| a| c|
//| 2| b| e|
//+---+---+---+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i","j")
df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//| i| j| k|cnt|
//+---+---+---+---+
//| 2| b| d| 2|
//| 2| b| e| 2|
//| 1| a| c| 1| // c is repeated within the (1, a) partition
//| 1| a| c| 1|
//+---+---+---+---+
Trying to improve Sim's answer, if you want to do this:
//val newColName: String = ...
//val colToCount: Column = ...
//val aggregatingCols: Seq[Column] = ...
df.withColumn(newColName, countDistinct(colToCount).over(Window.partitionBy(aggregatingCols:_*)))
You must instead do this:
//val aggregatingCols: Seq[String] = ...
df.groupBy(aggregatingCols.head, aggregatingCols.tail:_*)
.agg(countDistinct(colToCount).as(newColName))
.select(newColName, aggregatingCols:_*)
.join(df, usingColumns = aggregatingCols)
This will return the number of distinct elements in the partition, using the dense_rank() function. When we sum the ascending and descending dense ranks of a value, we always get the total number of distinct elements + 1, so subtracting 1 gives the distinct count:
dense_rank().over(Window.partitionBy("i").orderBy(c.asc)) + dense_rank().over(Window.partitionBy("i").orderBy(c.desc)) - 1
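A minimal sketch of this trick, assuming the (i, j, k) DataFrame from the earlier answer and counting distinct values of k per (i, j) partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("i", "j")
df.withColumn(
  "cnt",
  dense_rank().over(w.orderBy(col("k").asc)) +
    dense_rank().over(w.orderBy(col("k").desc)) - 1
).show()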
For context, my ultimate goal is to remove nearly-duplicated rows from a very large dataframe. Here is some dummy data:
+---+--------+----------+---+-------+-------+---+-------+-------+
|key|unique_1| unique_2|...|col_125|col_126|...|col_414|col_415|
+---+--------+----------+---+-------+-------+---+-------+-------+
| 1| 123|01-01-2000|...| 1| true|...| 100| 25|
| 2| 123|01-01-2000|...| 0| false|...| 100| 25|
| 3| 321|12-12-2012|...| 3| true|...| 99| 1|
| 4| 321|12-12-2012|...| 3| false|...| 99| 5|
+---+--------+----------+---+-------+-------+---+-------+-------+
In this data, combinations of observations from unique_1 and unique_2 should be distinct, but they aren't always. When they are repeated, they have the same values for the vast majority of the columns, but have variation on a very small set of other columns. I am trying to develop a strategy to deal with the near-duplicates, but it is complicated because each set of near-duplicates has a different set of columns which contain variation.
I'm trying to see the columns that contain variation for a single set of near-duplicates at a time - like this:
+---+-------+-------+
|key|col_125|col_126|
+---+-------+-------+
| 1| 1| true|
| 2| 0| false|
+---+-------+-------+
or this:
+---+-------+-------+
|key|col_126|col_415|
+---+-------+-------+
| 3| true| 1|
| 4| false| 5|
+---+-------+-------+
I've successfully gotten this result with a few different approaches. This was my first attempt:
def findColumnsWithDiffs(df: DataFrame): DataFrame = {
  df.columns.foldLeft(df) { (a, b) =>
    a.select(b).distinct.count match {
      case 1 => a.drop(b)
      case _ => a
    }
  }
}
val smallFrame = originalData.filter(($"key" === 1) || ($"key" === 2))
val desiredOutput = findColumnsWithDiffs(smallFrame)
And this works insofar as it gives me what I want, but it is unbelievably slow. It takes approximately 10x longer for the function above to run than it takes to display all of the data in smallFrame (and I think the performance only gets worse with the size of the data, although I have not tested that hypothesis thoroughly).
I thought that using fold instead of foldLeft might yield some improvements, so I rewrote the findColumnsWithDiffs function like this:
def findColumnsWithDiffsV2(df: DataFrame): DataFrame = {
  val colsWithDiffs = df.columns.map(colName => List(colName)).toList.fold(Nil) { (a, b) =>
    df.select(col(b(0))).distinct.count match {
      case 1 => a
      case _ => a ++ b
    }
  }
  df.select(colsWithDiffs.map(colName => col(colName)): _*)
}
But performance was the same. I also tried mapping each column to the number of distinct values it has and working from there, but again performance was the same. At this point I'm out of ideas. My hunch is that the filter is being re-evaluated for each column, which is why it is so terribly slow, but I don't know how to verify that theory and/or change what I'm doing to fix it if I'm correct. Does anyone have ideas to improve the efficiency of what I'm doing?
I'm currently using Spark 2.1.0 / Scala 2.11.8.
All of your approaches to identifying the distinct values are fine; the issue is the lazy re-evaluation of the filter. To improve performance, call smallFrame.cache() before using findColumnsWithDiffs. This keeps the filtered data in memory, which is fine because it is only a few rows at a time.
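A minimal sketch of that suggestion, reusing the names from your question (and assuming the usual implicits import for the $ syntax):
val smallFrame = originalData
  .filter($"key" === 1 || $"key" === 2)
  .cache()

smallFrame.count() // optional: materializes the cache up front

val desiredOutput = findColumnsWithDiffs(smallFrame)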
I am running a Spark application that reads data from a few Hive tables (of IP addresses) and compares each element (IP address) in a dataset with all other elements (IP addresses) from the other datasets. The end result would be something like:
+---------------+--------+---------------+---------------+---------+----------+--------+----------+
| ip_address|dataset1|dataset2 |dataset3 |dataset4 |dataset5 |dataset6| date|
+---------------+--------+---------------+---------------+---------+----------+--------+----------+
| xx.xx.xx.xx.xx| 1 | 1| 0| 0| 0| 0 |2017-11-06|
| xx.xx.xx.xx.xx| 0 | 0| 1| 0| 0| 1 |2017-11-06|
| xx.xx.xx.xx.xx| 1 | 0| 0| 0| 0| 1 |2017-11-06|
| xx.xx.xx.xx.xx| 0 | 0| 1| 0| 0| 1 |2017-11-06|
| xx.xx.xx.xx.xx| 1 | 1| 0| 1| 0| 0 |2017-11-06|
+---------------+--------+---------------+---------------+---------+----------+--------+----------+
For doing the comparison, I am converting the dataframes resulting from the hiveContext.sql("query") statement into Fastutil objects. Like this:
val df= hiveContext.sql("query")
val dfBuffer = new it.unimi.dsi.fastutil.objects.ObjectArrayList[String](df.map(r => r(0).toString).collect())
Then, I am using an iterator to iterate over each collection and write the rows to a file using FileWriter.
val dfIterator = dfBuffer.iterator()
while (dfIterator.hasNext){
val p = dfIterator.next().toString
//logic
}
I am running the application with --num-executors 20 --executor-memory 16g --executor-cores 5 --driver-memory 20g
The process runs for about 18-19 hours in total for roughly 4-5 million records, with one-to-one comparisons, on a daily basis.
However, when I checked the Application Master UI, I noticed that no activity takes place after the initial conversion of dataframes to fastutil collection objects is done (this takes only a few minutes after the job is launched). I see the count and collect statements used in the code producing new jobs till the conversion is done. After that, no new jobs are launched when the comparison is running.
What does this imply? Does it mean that the distributed processing is not happening at all?
I understand that collection objects are not treated as RDDs; could this be the reason?
How is Spark executing my program without using the resources assigned?
Any help would be appreciated, Thank you!
After the line:
val dfBuffer = new it.unimi.dsi.fastutil.objects.ObjectArrayList[String](df.map(r => r(0).toString).collect())
especially this part of that line:
df.map(r => r(0).toString).collect()
where the final collect is the key thing to notice: after it, no Spark jobs are ever performed on dfBuffer (which is a regular local data structure living in a single JVM).
Does it mean that the distributed processing is not happening at all?
Correct. collect brings all the data onto the single JVM where the driver runs (and that is exactly why you should not use it unless you know what you are doing and what problems it may cause).
I think the above answers all the other questions.
A possible solution to your problem of comparing the datasets (in Spark, in a distributed fashion) would be to join each dataset with the reference dataset and use counts to check whether the number of records changed.
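For example, here is a rough sketch of building the per-dataset flags with joins instead of collecting everything to the driver (the table names are made up, and you would repeat the pattern for all six datasets):
import org.apache.spark.sql.functions._

val reference = hiveContext.sql("select distinct ip_address from reference_table")
val ds1 = hiveContext.sql("select distinct ip_address from table1").withColumn("dataset1", lit(1))
val ds2 = hiveContext.sql("select distinct ip_address from table2").withColumn("dataset2", lit(1))

val flagged = reference
  .join(ds1, Seq("ip_address"), "left_outer")
  .join(ds2, Seq("ip_address"), "left_outer")
  .withColumn("dataset1", coalesce(col("dataset1"), lit(0)))
  .withColumn("dataset2", coalesce(col("dataset2"), lit(0)))
  .withColumn("date", current_date())

flagged.show()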
Let's assume I have the following table:
time| id| value
1| 1| 1
3| 1| 1
1| 2| 2
The result of selecting a regular series would be:
time| id| value
1| 1| 1
2| 1| Null
3| 1| 1
1| 2| 2
2| 2| Null
3| 2| Null
Normally, I would either just store the NULL values or create two additional tables (one holding all times, one holding all ids) and then join.
The problem with the first approach is that the table becomes quite big, because each new id forces me to insert NULL values for all previous times, and each new time forces me to insert NULL values for all ids.
The problem with the second approach is that the join takes too long.
My idea is to implement a custom set-returning function like the crosstab example in contrib/tablefunc.
My question is whether I can expect this to be faster.