Shuffle pivoted PySpark DataFrame rows throws a NullPointedException - pyspark

Suppose the next PySpark DataFrame:
+-------+----+---+---+----+
|user_id|type| d1| d2| d3|
+-------+----+---+---+----+
| c1| A|3.4|0.4| 3.5|
| c1| B|9.6|0.0| 0.0|
| c1| A|2.8|0.4| 0.3|
| c1| B|5.4|0.2|0.11|
| c2| A|0.0|9.7| 0.3|
| c2| B|9.6|8.6| 0.1|
| c2| A|7.3|9.1| 7.0|
| c2| B|0.7|6.4| 4.3|
+-------+----+---+---+----+
created with:
df = sc.parallelize([
("c1", "A", 3.4, 0.4, 3.5),
("c1", "B", 9.6, 0.0, 0.0),
("c1", "A", 2.8, 0.4, 0.3),
("c1", "B", 5.4, 0.2, 0.11),
("c2", "A", 0.0, 9.7, 0.3),
("c2", "B", 9.6, 8.6, 0.1),
("c2", "A", 7.3, 9.1, 7.0),
("c2", "B", 0.7, 6.4, 4.3)
]).toDF(["user_id", "type", "d1", "d2", "d3"])
df.show()
Then, it's pivoted by user_id to obtain:
data_wide = df.groupBy('user_id')\
.pivot('type')\
.agg(*[f.sum(x).alias(x) for x in df.columns if x not in {"user_id", "type"}])
data_wide.show()
+-------+-----------------+------------------+----+------------------+----+------------------+
|user_id| A_d1| A_d2|A_d3| B_d1|B_d2| B_d3|
+-------+-----------------+------------------+----+------------------+----+------------------+
| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
| c2| 7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
+-------+-----------------+------------------+----+------------------+----+------------------+
Now, I want to randomize its rows order:
data_wide = data_wide.orderBy(f.rand())
data_wide.show()
But this last step throws a NullPointedException:
Py4JJavaError: An error occurred while calling o101.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 161 in stage 27.0 failed 20 times, most recent failure: Lost task 161.19 in stage 27.0 (TID 1300, 192.168.192.57, executor 1): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_3$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232)
However, if the wide df is cached before the orderBy(f.rand()) step, this last step works fine:
data_wide.cache()
data_wide = data_wide.orderBy(f.rand())
data_wide.show()
+-------+-----------------+------------------+----+------------------+----+------------------+
|user_id| A_d1| A_d2|A_d3| B_d1|B_d2| B_d3|
+-------+-----------------+------------------+----+------------------+----+------------------+
| c2| 7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
+-------+-----------------+------------------+----+------------------+----+------------------+
What is the problem here? It seems that, in the orderBy step, the pivot has not been made effective and it does not plan the execution correctly, but I do not know what the actual problem is. Any ideas?
Spark version is 2.1.0 and python version is 3.5.2
Thank you in advance

Related

How to filter columns in one table based on the same columns in another table using Spark

I need to filter columns in one table (fixTablehb004_p) based on the same columns in another table (filtredTable109_p)
I first wanted to use this code:
val filtredTablehb004_p = fixTablehb004_p
.where($"servizio_rap" === filtredTable109_p.col("servizio_rap"))
.where($"filiale_rap" === filtredTable109_p.col("filiale_rap"))
.where($"codice_rap" === filtredTable109_p.col("codice_rap"))
But it gave out an error.
Then I tried the code based on this stackoverflow question, and I get this code. But the problem is that there are extra columns, I know what you can do drop(columnName), but I want to ask you if I'm doing it right and if there is another better option
val filtredTablehb004_p = sparkSession.sql("SELECT * FROM fixTablehb004_p " +
"JOIN filtredTable109_p " +
"ON fixTablehb004_p.servizio_rap = filtredTable109_p.servizio_rap AND " +
"fixTablehb004_p.filiale_rap = filtredTable109_p.filiale_rap AND " +
"fixTablehb004_p.codice_rap = filtredTable109_p.codice_rap ")
Let's take 2 sample dataframes and see how we can select required columns or avoid duplicate key column names in joined output dataframe.
USING DATAFRAME API:
val df1 = Seq(("A1", "A2", 1), ("A3", "A4", 2), ("A1", "A3", 3))
.toDF("c1", "c2", "c3")
val df2 = Seq(("A1", "A2", 10), ("A3", "A4", 11))
.toDF("c1", "c2", "c4")
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")
If column names which you used for joining condition from both dataframes are same then output dataframe will have duplicate columns. To avoid this you can pass all those columns as Seq to join().
df1.join(df2, Seq("c1", "c2")).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A1| A2| 1| 10|
| A3| A4| 2| 11|
+---+---+---+---+
To select required columns from specific dataframe you can use below syntax:
df1.join(df2, Seq("c1", "c2")).select('c1, 'c2, df1("c3")).show()
// OR
df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
.select(df1("c1"), df1("c2"), df1("c3")).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| A1| A2| 1|
| A3| A4| 2|
+---+---+---+
USING SQL API:
spark.sql(
"""
|SELECT t2.c1, t2.c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
|""".stripMargin).show()
//OR
spark.sql(
"""
|SELECT c1, c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 USING(c1, c2)
|""".stripMargin).show()
+---+---+---+
| c1| c2| c4|
+---+---+---+
| A1| A2| 10|
| A3| A4| 11|
+---+---+---+

How to move a specific column of a pyspark dataframe in the start of the dataframe

I have a pyspark dataframe as follows (this is just a simplified example, my actual dataframe has hundreds of columns):
col1,col2,......,col_with_fix_header
1,2,.......,3
4,5,.......,6
2,3,........,4
and I want to move col_with_fix_header in the start, so that the output comes as follows:
col_with_fix_header,col1,col2,............
3,1,2,..........
6,4,5,....
4,2,3,.......
I don't want to list all the columns in the solution.
In case you don't want to list all columns of your dataframe, you can use the dataframe property columns. This property gives you a python list of column names and you can simply slice it:
df = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
df.select([df.columns[-1]] + df.columns[:-1]).show()
Output:
+---+---+-------+
|age| id| name|
+---+---+-------+
| 34| a| Alice|
| 36| b| Bob|
| 30| c|Charlie|
| 29| d| David|
| 32| e| Esther|
| 36| f| Fanny|
| 60| g| Gabby|
+---+---+-------+

Find a record with max value in a group

I have the next dataset:
|month|temperature|city|
| 1| 15.0 |foo |
| 1| 20.0 |bar |
| 2| 25.0 |baz |
| 2| 30.0 |quok|
I want to find cities with highest temperatures per month:
|month|temperature|city|
| 1|20.0 |bar |
| 2|30.0 |quok|
How to do this using apache spark SQL? I tried to use window functions but failed to get the right results
Using a window function you can do it as follows:
import org.apache.spark.sql.expressions.{Window}
import org.apache.spark.sql.functions.{max}
val l = Seq((1, 15.0, "foo"), (1, 20.0, "bar"), (2, 25.0, "baz"), (2, 30.0, "quok"))
val df = l.toDF("month", "temperature", "city")
val w = Window.partitionBy("month")
df.withColumn("m", max("temperature").over(w))
.filter($"temperature" === $"m")
.select("month", "temperature", "city")
.show()
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+
Alternatively, you can do it also using groupBy + join:
val maxT = df.groupBy("month").agg(max("temperature").alias("maxT"))
df.join(maxT, Seq("month"), "left")
.filter($"temperature" === $"maxT")
.select("month", "temperature", "city")
.show()
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+
What is more efficient depends on the data. If the aggregated DataFrame can be broadcasted, the join will be more efficient.
The most efficient way is probabely to put both temperature and city in a struct in combination with max aggregation:
val df = Seq((1, 15.0, "foo"), (1, 20.0, "bar"), (2, 25.0, "baz"), (2, 30.0, "quok")).toDF("month", "temperature", "city")
df
.groupBy($"month")
.agg(max(struct($"temperature",$"city")).as("maxtemp"))
.select($"month",$"maxtemp.*")
.show()
gives :
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+

isin throws stackoverflow error in withcolumn function in spark

I am using spark 2.3 in my scala application. I have a dataframe which create from spark sql that name is sqlDF in the sample code which I shared. I have a string list that has the items below
List[] stringList items
-9,-8,-7,-6
I want to replace all values that match with this lists item in all columns in dataframe to 0.
Initial dataframe
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |-6 |1
-7 |-8 |-7
It must return to
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |0 |1
0 |0 |0
For this I am itarating the query below for all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But getting the error below, by the way if I choose only one column for iterating it works, but if I run the code above for 500 columns iteration it fails
Exception in thread "streaming-job-executor-0"
java.lang.StackOverflowError at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
at
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What is the thing that I am missing?
Here is a different approach applying left anti join between columnX and X where X is your list of items transferred into a dataframe. The left anti join will return all the items not present in X, the results we concatenate them all together through an outer join (which can be replaced with left join for better performance, this though will exclude records with all zeros i.e id == 3) based on the id assigned with monotonically_increasing_id:
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
val df = Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7))
.toDF("c1", "c2", "c3")
.withColumn("id", monotonically_increasing_id())
val exdf = Seq(-9, -8, -7, -6).toDF("x")
df.columns.map{ c =>
df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
}
.reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
.na.fill(0)
.show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works perfect for your case here as below
val df = spark.sparkContext.parallelize(Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7)
)).toDF("a", "b", "c")
val list = Seq(-7, -8, -9)
val resultDF = df.columns.foldLeft(df) { (acc, name) => {
acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}
}
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |-6 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest you to broadcast the list of String :
val stringList=sc.broadcast(<Your List of List[String]>)
After that use this :
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName also is in String Format. Comparison should be String to String

Flatten Group By in Pyspark

I have a pyspark dataframe. For example,
d= hiveContext.createDataFrame([("A", 1), ("B", 2), ("D", 3), ("D", 3), ("A", 4), ("D", 3)],["Col1", "Col2"])
+----+----+
|Col1|Col2|
+----+----+
| A| 1|
| B| 2|
| D| 3|
| D| 3|
| A| 4|
| D| 3|
+----+----+
I want to group by Col1 and then create a list of Col2. I need to flatten the groups. I do have a lot of columns.
+----+----------+
|Col1| Col2|
+----+----------+
| A| [1,4] |
| B| [2] |
| D| [3,3,3]|
+----+----------+
You can do a groupBy() and use collect_list() as your aggregate function:
import pyspark.sql.functions as f
d.groupBy('Col1').agg(f.collect_list('Col2').alias('Col2')).show()
#+----+---------+
#|Col1| Col2|
#+----+---------+
#| B| [2]|
#| D|[3, 3, 3]|
#| A| [1, 4]|
#+----+---------+
Update
If you had multiple columns to combine, you could use collect_list() on each, and the combine the resulting lists using struct() and udf(). Consider the following example:
Create Dummy Data
from operator import add
import pyspark.sql.functions as f
# create example dataframe
d = sqlcx.createDataFrame(
[
("A", 1, 10),
("B", 2, 20),
("D", 3, 30),
("D", 3, 10),
("A", 4, 20),
("D", 3, 30)
],
["Col1", "Col2", "Col3"]
)
Collect Desired Columns into lists
Suppose you had a list of columns you wanted to collect into a list. You could do the following:
cols_to_combine = ['Col2', 'Col3']
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine]).show()
#+----+---------+------------+
#|Col1| Col2| Col3|
#+----+---------+------------+
#| B| [2]| [20]|
#| D|[3, 3, 3]|[30, 10, 30]|
#| A| [4, 1]| [20, 10]|
#+----+---------+------------+
Combine Resultant Lists into one Column
Now we want to combine the list columns into one list. If we use struct(), we will get the following:
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', f.struct(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+------------------------------------------------+
#|Col1|Combined |
#+----+------------------------------------------------+
#|B |[WrappedArray(2),WrappedArray(20)] |
#|D |[WrappedArray(3, 3, 3),WrappedArray(10, 30, 30)]|
#|A |[WrappedArray(1, 4),WrappedArray(10, 20)] |
#+----+------------------------------------------------+
Flatten Wrapped Arrays
Almost there. We just need to combine the WrappedArrays. We can achieve this with a udf():
combine_wrapped_arrays = f.udf(lambda val: reduce(add, val), ArrayType(IntegerType()))
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_wrapped_arrays(f.struct(*cols_to_combine)).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
References
Pyspark Merge WrappedArrays Within a Dataframe
Update 2
A simpler way, without having to deal with WrappedArrays:
from operator import add
combine_udf = lambda cols: f.udf(
lambda *args: reduce(add, args),
ArrayType(IntegerType())
)
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_udf(cols_to_combine)(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
Note: This last step only works if the datatypes for all of the columns are the same. You can not use this function to combine wrapped arrays with mixed types.
from spark 2.4 you can use pyspark.sql.functions.flatten
import pyspark.sql.functions as f
df.groupBy('Col1').agg(f.flatten(f.collect_list('Col2')).alias('Col2')).show()