I have a dataframe with one column, arrs, holding an array of close to 100000 elements.
I now need to explode this column so that I get one row per element of the array.
The explode function from spark.sql does the job, but it is taking too much time.
Is there an alternative to explode that I can try in order to optimize the job?
dfs.printSchema()
println("Orginal DF")
dfs.show()
// Performing the explode operation
import org.apache.spark.sql.functions.{col, explode}
val opdfs = dfs.withColumn("explarrs", explode(col("arrs"))).drop("arrs")
println("Exploded DF")
opdfs.show()
The expected result is shown below; what I am looking for is an alternative to this code that runs the job more efficiently.
Original DF
+----+------+----+--------------------+
|col1| col2|col3| arrs|
+----+------+----+--------------------+
| A|DFtest| K|[1, 2, 3, 4, 5, 6...|
+----+------+----+--------------------+
Exploded DF
+----+------+----+--------+
|col1| col2|col3|explarrs|
+----+------+----+--------+
| A|DFtest| K| 1|
| A|DFtest| K| 2|
| A|DFtest| K| 3|
| A|DFtest| K| 4|
| A|DFtest| K| 5|
| A|DFtest| K| 6|
| A|DFtest| K| 7|
| A|DFtest| K| 8|
| A|DFtest| K| 9|
| A|DFtest| K| 10|
| A|DFtest| K| 11|
| A|DFtest| K| 12|
| A|DFtest| K| 13|
| A|DFtest| K| 14|
| A|DFtest| K| 15|
| A|DFtest| K| 16|
| A|DFtest| K| 17|
| A|DFtest| K| 18|
| A|DFtest| K| 19|
| A|DFtest| K| 20|
+----+------+----+--------+
only showing top 20 rows
You can do the same without explode by using the flatMap method on the DataFrame. For example, if you need to explode an array of integers, you can proceed with something like:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
val els = Seq(Row(Array(1, 2, 3)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els),
  StructType(Seq(StructField("data", ArrayType(IntegerType), false))))
df.show()
It gives:
+---------+
| data|
+---------+
|[1, 2, 3]|
+---------+
Using the DataFrame's flatMap (note the imports for WrappedArray and the implicit encoders):
import scala.collection.mutable, spark.implicits._
df.flatMap(row => row.getAs[mutable.WrappedArray[Int]](0)).show()
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
The drawback of this approach is that you need to pass the correct element type of your array to the getAs function, and it carries some memory overhead. As I said in my comment, there was a performance bug in explode that has since been fixed: https://issues.apache.org/jira/browse/SPARK-21657
But if you can't upgrade your Spark version, you can try the code above and compare.
If you want to add the other fields to your result you could do something like:
val els = Seq(Row(Array(1, 2, 3), "data1", "data2"), Row(Array(1, 2, 3, 4, 5, 6), "data10", "data20"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els),
  StructType(Seq(StructField("data", ArrayType(IntegerType), false),
    StructField("data1", StringType, false), StructField("data2", StringType, false))))
df.show()
df.flatMap { row =>
  val arr = row.getAs[mutable.WrappedArray[Int]](0)
  arr.map { el =>
    (row.getAs[String](1), row.getAs[String](2), el)
  }
}.show()
It gives:
+------+------+---+
| _1| _2| _3|
+------+------+---+
| data1| data2| 1|
| data1| data2| 2|
| data1| data2| 3|
|data10|data20| 1|
|data10|data20| 2|
|data10|data20| 3|
|data10|data20| 4|
|data10|data20| 5|
|data10|data20| 6|
+------+------+---+
Maybe it can help.
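Applied to the dfs from your question, the same pattern would look roughly like the sketch below. This is untested against your data and assumes col1/col2/col3 are strings and the arrs elements are Int (swap the types in getAs if they are Long, Double, etc.):
import scala.collection.mutable
import spark.implicits._

val opdfsFlat = dfs.flatMap { row =>
  // read the array once, then emit one tuple per element together with the other columns
  val arr = row.getAs[mutable.WrappedArray[Int]]("arrs")
  arr.map(el => (row.getAs[String]("col1"), row.getAs[String]("col2"), row.getAs[String]("col3"), el))
}.toDF("col1", "col2", "col3", "explarrs")
opdfsFlat.show()
Whether this actually beats explode depends on your Spark version: on 2.2+ (with the SPARK-21657 fix) the built-in explode is usually at least as fast, so benchmark both on your data.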
This question already has an answer here: Round double values and cast as integers (1 answer). Closed 2 years ago.
I have a df that looks like this
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql.functions import to_date

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", FloatType(), True)])
TEST_data = [('2020-08-01',1.22),('2020-08-02',1.15),('2020-08-03',5.4),('2020-08-04',2.6),('2020-08-05',3.5),
             ('2020-08-06',2.2),('2020-08-07',2.7),('2020-08-08',-1.6),('2020-08-09',1.3)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df = TEST_df.withColumn("date", to_date("date", 'yyyy-MM-dd'))
TEST_df.show()
+----------+-----+
| date|col1 |
+----------+-----+
|2020-08-01| 1.22|
|2020-08-02| 1.15|
|2020-08-03| 5.4 |
|2020-08-04| 2.6 |
|2020-08-05| 3.5 |
|2020-08-06| 2.2 |
|2020-08-07| 2.7 |
|2020-08-08|-1.6 |
|2020-08-09| 1.3 |
+----------+-----+
Logic: round col1 to the nearest integer and return it as an integer, then take max(rounded value, 0).
The resulting df looks like this:
+----------+----+----+
| date|col1|want|
+----------+----+----+
|2020-08-01| 1.2| 1|
|2020-08-02| 1.1| 1|
|2020-08-03| 5.4| 5|
|2020-08-04| 2.6| 3|
|2020-08-05| 3.5| 4|
|2020-08-06| 2.2| 2|
|2020-08-07| 2.7| 3|
|2020-08-08|-1.6| 0|
|2020-08-09| 1.3| 1|
+----------+----+----+
Check the duplicate question, which covers the rounding and casting. Note that ROUND on its own returns -2 for -1.6 (as in the output below); if you also need the floor at zero, combine it with GREATEST(..., 0) or use the when approach in the other answer.
data = [('2020-08-01',1.22),('2020-08-02',1.15),('2020-08-03',5.4),('2020-08-04',2.6),('2020-08-05',3.5),('2020-08-06',2.2),('2020-08-07',2.7),('2020-08-08',-1.6),('2020-08-09',1.3)]
df = spark.createDataFrame(data, ['date', 'col1'])
df.withColumn('want', expr('ROUND(col1, 0)').cast('int')).show()
+----------+----+----+
| date|col1|want|
+----------+----+----+
|2020-08-01|1.22| 1|
|2020-08-02|1.15| 1|
|2020-08-03| 5.4| 5|
|2020-08-04| 2.6| 3|
|2020-08-05| 3.5| 4|
|2020-08-06| 2.2| 2|
|2020-08-07| 2.7| 3|
|2020-08-08|-1.6| -2|
|2020-08-09| 1.3| 1|
+----------+----+----+
First, we check whether the value is less than zero. Using the when method from pyspark.sql.functions, we replace any value below zero with zero and otherwise keep the actual column value, then round it with bround and cast it to int.
from pyspark.sql import functions as F
TEST_df.withColumn("want", F.bround(F.when(TEST_df["col1"] < 0, 0).otherwise(TEST_df["col1"])).cast("int")).show()
+----------+----+----+
| date|col1|want|
+----------+----+----+
|2020-08-01|1.22| 1|
|2020-08-02|1.15| 1|
|2020-08-03| 5.4| 5|
|2020-08-04| 2.6| 3|
|2020-08-05| 3.5| 4|
|2020-08-06| 2.2| 2|
|2020-08-07| 2.7| 3|
|2020-08-08|-1.6| 0|
|2020-08-09| 1.3| 1|
+----------+----+----+
I have a data frame like below
data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)
]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()
I would like the null values in the 'min' columns to be replaced by the value from their equivalent 'max' column. For example, null values in the A_min column should be replaced by the A_max value.
The result should look like the data frame below.
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
I have tried the code below by defining the columns, but clearly this does not work. I'd really appreciate any help.
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
for i in min_cols
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)
Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:
from pyspark.sql.functions import coalesce
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
df.select(*[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)], *max_cols).show()
Output:
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
Is there a way to replace null values in a Spark data frame with the next non-null value from a following row? There is an additional row_count column added for window partitioning and ordering. More specifically, I'd like to achieve the following result:
+---------+-----------+ +---------+--------+
| row_count | id| |row_count | id|
+---------+-----------+ +------+-----------+
| 1| null| | 1| 109|
| 2| 109| | 2| 109|
| 3| null| | 3| 108|
| 4| null| | 4| 108|
| 5| 108| => | 5| 108|
| 6| null| | 6| 110|
| 7| 110| | 7| 110|
| 8| null| | 8| null|
| 9| null| | 9| null|
| 10| null| | 10| null|
+---------+-----------+ +---------+--------+
I tried the code below, but it is not giving the proper result.
val ss = dataframe.select($"*", sum(when(dataframe("id").isNull||dataframe("id") === "", 1).otherwise(0)).over(Window.orderBy($"row_count")) as "value")
val window1=Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
val selectList=ss.withColumn("id_fill_from_below",last("id").over(window1)).drop($"row_count").drop($"value")
Here is an approach:
Filter the non-nulls (dfNonNulls)
Filter the nulls (dfNulls)
Find the right value for each null id, using a join and a Window function
Fill the null dataframe (dfNullFills)
Union dfNonNulls and dfNullFills
data.csv
row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

var df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")
var dfNulls = df.filter(
$"id".isNull
).withColumnRenamed(
"row_count","row_count_nulls"
).withColumnRenamed(
"id","id_nulls"
)
val dfNonNulls = df.filter(
$"id".isNotNull
).withColumnRenamed(
"row_count","row_count_values"
).withColumnRenamed(
"id","id_values"
)
dfNulls = dfNulls.join(
dfNonNulls, $"row_count_nulls" lt $"row_count_values","left"
).select(
$"id_nulls",$"id_values",$"row_count_nulls",$"row_count_values"
)
val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")
val dfNullFills = dfNulls.withColumn(
"rn", row_number.over(window)
).where($"rn" === 1).drop("rn").select(
$"row_count_nulls".alias("row_count"),$"id_values".alias("id"))
dfNullFills.union(dfNonNulls).orderBy($"row_count").show()
which results in
+---------+----+
|row_count| id|
+---------+----+
| 1| 109|
| 2| 109|
| 3| 108|
| 4| 108|
| 5| 108|
| 6| 110|
| 7| 110|
| 8|null|
| 9|null|
| 10|null|
+---------+----+
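As an alternative sketch (assuming Spark 2.1+ for first with ignoreNulls and the Window frame constants), the same backfill can be done without the join by taking the first non-null id in a window that looks from the current row forward. Note that a window with no partitionBy pulls all rows into a single partition, so this only makes sense for modest data sizes:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// frame from the current row to the end, ordered by row_count
val wForward = Window.orderBy("row_count")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)

// first non-null id at or after the current row; stays null when nothing follows (rows 8-10)
val filled = df.withColumn("id", first("id", ignoreNulls = true).over(wForward))
filled.orderBy("row_count").show()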
I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (the number of unique (F1, F2) rows for each unique F3) / (the total number of unique (F1, F2) rows).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4 there are only 2 unique (F1, F2) pairs: {(t, y3), (x, a)}. Therefore, for every occurrence of F3 = 4, F4 will be 2 / (the total number of unique (F1, F2) pairs; here there are 6 such pairs).
How to achieve the above transformation in Spark Scala?
While trying to solve your problem, I just learned that you can't use distinct aggregate functions (such as countDistinct) inside a Window over DataFrames.
So what I did is create a temporary DataFrame and join it with the initial one to obtain your desired result:
import org.apache.spark.sql.functions.{col, concat, countDistinct}
import spark.implicits._

case class Dog(F1: String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
val unique_F1_F2 = df.select("F1", "F2").distinct.count
val dd = df.withColumn("X1", concat(col("F1"), col("F2")))
  .groupBy("F3")
  .agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
  .withColumn("F4", col("distinct_count") / unique_F1_F2)
  .drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.
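A join-free variant is also possible (a sketch, not benchmarked): countDistinct is rejected inside a window, but collect_set is allowed, so size(collect_set(...)) over a window partitioned by F3 yields the same distinct count in one pass. It reuses unique_F1_F2 from above and assumes the same df:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, size}

val byF3 = Window.partitionBy("F3")

// distinct (F1, F2) pairs per F3; concat_ws with a separator avoids accidental
// collisions such as ("x", "ya") vs ("xy", "a")
val final_df2 = df.withColumn("F4",
  size(collect_set(concat_ws("|", col("F1"), col("F2"))).over(byF3)) / unique_F1_F2)
final_df2.show()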
I want to take the distinct values of a column from DataFrame A and pass them into DataFrame B's explode function to create repeated rows (DataFrame B) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct()
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
Function
assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How do I pass the distinctSet values to assign_utilityId?
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take the unique values from DataFrame 1 and create a new column in DataFrame 2, like this:
+-----+------+--------+--------+
|index|status|timeSlot|utilityId
+-----+------+--------+--------+
| 0| SUN| 0|101
| 0| SUN| 1|101
| 0| SUN| 0|102
| 0| SUN| 1|102
We don't need a udf for this. I have tried it with some sample input; please check:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
Alternatively, with the two DataFrames from your update, take the distinct utilityId values and join them with no condition (a cross join; depending on your Spark version you may need df2.crossJoin(rdf) or spark.sql.crossJoin.enabled=true):
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+