I'm new to Spark and have a question about filtering a DataFrame based on a null condition.
I have gone through many answers which have solutions like
df.filter(($"col2".isNotNull) || ($"col2" =!= "NULL") || ($"col2" =!= "null") || (trim($"col2") =!= "NULL"))
But in my case, I cannot hard-code column names because my schema is not fixed. I am reading a CSV file and, depending on the columns in it, I have to filter my DataFrame for null values and want those rows in another DataFrame. In short, if any column has a null value, that complete row should go into a different DataFrame.
For example:
Input DataFrame:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1#c1.com|[c1,1,d1]|
|  n2|   2|     null|[c1,1,d1]|
|  n3|   3|n3#c1.com|     null|
|  n4|   4|n4#c2.com|[c2,2,d2]|
|  n6|   6|n6#c2.com|[c2,2,d2]|
+----+----+---------+---------+
Output:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1#c1.com|[c1,1,d1]|
|  n2|   2|     null|[c1,1,d1]|
|  n3|   3|n3#c1.com|     null|
+----+----+---------+---------+
Thank you in advance.
Try this:
val df1 = spark.sql("select col1, col2 from values (null, 1), (2, null), (null, null), (1,2) T(col1, col2)")
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* |1 |2 |
* +----+----+
*/
df1.show(false)
df1.filter(df1.columns.map(col(_).isNull).reduce(_ || _)).show(false)
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* +----+----+
*/
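If you also need the complementary DataFrame (the rows without any nulls), the same condition can be reused; a minimal sketch, assuming the df1 built above:
import org.apache.spark.sql.functions.col

// Build the "any column is null" condition once and reuse it for both splits.
val anyNull = df1.columns.map(col(_).isNull).reduce(_ || _)

val rowsWithNulls    = df1.filter(anyNull)   // rows where at least one column is null
val rowsWithoutNulls = df1.filter(!anyNull)  // all remaining rows

rowsWithNulls.show(false)
rowsWithoutNulls.show(false)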
Thank you so much for your answers. I tried the logic below and it worked for me.
val arrayColumn = df.columns
// Seed the condition with the first column, then append one clause per remaining column.
val filterString = String.format(" %1$s is null or %1$s == '' ", arrayColumn(0))
val x = new StringBuilder(filterString)
for (i <- 1 until arrayColumn.length) {
  x ++= String.format("or %1$s is null or %1$s == '' ", arrayColumn(i))
}
val dfWithNullRows = df.filter(x.toString())
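For what it's worth, the same filter string can be built more concisely with mkString; a minimal sketch, assuming the same df:
// One "is null or empty" clause per column, joined with "or"
val nullOrEmptyCondition = df.columns
  .map(c => s"$c is null or $c == ''")
  .mkString(" or ")

val dfWithNullRows = df.filter(nullOrEmptyCondition)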
To deal with null values in DataFrames, Spark has some useful functions.
I will show some examples with DataFrames that have a different number of columns.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, DoubleType, StringType}

val schema = StructType(List(StructField("id", IntegerType, true), StructField("obj", DoubleType, true)))
val schema1 = StructType(List(StructField("id", IntegerType, true), StructField("obj", StringType, true), StructField("obj", IntegerType, true)))
val t1 = sc.parallelize(Seq((1,null),(1,1.0),(8,3.0),(2,null),(3,1.4),(3,2.5),(null,3.7))).map(t => Row(t._1,t._2))
val t2 = sc.parallelize(Seq((1,"A",null),(2,"B",null),(3,"C",36),(null,"D",15),(5,"E",25),(6,null,7),(7,"G",null))).map(t => Row(t._1,t._2,t._3))
val tt1 = spark.createDataFrame(t1, schema)
val tt2 = spark.createDataFrame(t2, schema1)
tt1.show()
tt2.show()
// To drop all rows that contain null values
val dfWithoutNull = tt1.na.drop()
dfWithoutNull.show()
val df2WithoutNull = tt2.na.drop()
df2WithoutNull.show()
// To fill null values with another value
val df1 = tt1.na.fill(-1)
df1.show()
// To get new dataframes containing only the rows that have null values
val nullValues = tt1.filter(row => row.anyNull)
nullValues.show()
val nullValues2 = tt2.filter(row => row.anyNull)
nullValues2.show()
Output:
// input dataframes
+----+----+
| id| obj|
+----+----+
| 1|null|
| 1| 1.0|
| 8| 3.0|
| 2|null|
| 3| 1.4|
| 3| 2.5|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
| 3| C| 36|
|null| D| 15|
| 5| E| 25|
| 6|null| 7|
| 7| G|null|
+----+----+----+
// Dataframes without null values
+---+---+
| id|obj|
+---+---+
| 1|1.0|
| 8|3.0|
| 3|1.4|
| 3|2.5|
+---+---+
+---+---+---+
| id|obj|obj|
+---+---+---+
| 3| C| 36|
| 5| E| 25|
+---+---+---+
// Dataframe with null values replaced
+---+----+
| id| obj|
+---+----+
| 1|-1.0|
| 1| 1.0|
| 8| 3.0|
| 2|-1.0|
| 3| 1.4|
| 3| 2.5|
| -1| 3.7|
+---+----+
// Dataframes whose rows have at least one null value
+----+----+
| id| obj|
+----+----+
| 1|null|
| 2|null|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
|null| D| 15|
| 6|null| 7|
| 7| G|null|
+----+----+----+
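One note on na.drop (an addition, not part of the example above): by default it drops a row if any column is null, while na.drop("all") drops a row only when every column is null.
// Default: drop rows that have at least one null value
tt2.na.drop().show()
// "all": drop only rows in which every column is null
tt2.na.drop("all").show()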
I have 2 dataframes like this.
scala> df1.show
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df2.show
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
How can I take the max of the Count column after a join?
Expected output:
+---+---------+
| id|    Count|
+---+---------+
|  1|20.565656|
|  2|  40.7767|
+---+---------+
You can do this:
df1.union(df2).groupBy("ID").max("Count").show()
+---+----------+
| ID|max(Count)|
+---+----------+
| 1| 20.565656|
| 2| 40.7767|
+---+----------+
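If you want the result column to keep the name Count instead of max(Count), a small variant is to aggregate through agg with an alias:
import org.apache.spark.sql.functions.max

df1.union(df2)
  .groupBy("ID")
  .agg(max("Count").as("Count"))
  .show()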
After joining both dataframes, you can create a UDF that takes the two count columns as input and returns the greater of the two; the example below achieves the same thing with when/otherwise.
A UDF can be a convenient way to derive a single column based on multiple columns, although built-in functions are generally preferred when one exists.
scala> df.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df1.show()
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
scala> df.alias("x").join(df1.alias("y"), List("ID"))
         .select(col("ID"), col("x.Count").alias("Xcount"), col("y.Count").alias("Ycount"))
         .withColumn("Count", when(col("Xcount") >= col("Ycount"), col("Xcount")).otherwise(col("Ycount")))
         .drop("Xcount", "Ycount")
.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2| 40.7767|
+---+---------+
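As an alternative to when/otherwise (or a UDF), the built-in greatest function does the same comparison directly; a minimal sketch, assuming the same df and df1 as above:
import org.apache.spark.sql.functions.greatest

// greatest() returns the larger of the two Count columns per row
df.alias("x").join(df1.alias("y"), List("ID"))
  .select(col("ID"), greatest(col("x.Count"), col("y.Count")).alias("Count"))
  .show()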
I am trying to build a dataframe of 10k records to then save to a Parquet file on Spark 2.4.3 standalone.
The following works at a small scale (up to 1,000 records) but takes forever when ramping up to 10k:
scala> import spark.implicits._
import spark.implicits._
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> for ( i <- 1 to 1000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
scala> someDF.show
+---+------+
| x| y|
+---+------+
| 0| item0|
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
+---+------+
only showing top 20 rows
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> someDF.show
+---+-----+
| x| y|
+---+-----+
| 0|item0|
+---+-----+
scala> for ( i <- 1 to 10000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
I just want to save someDF to a Parquet file to then load into Impala.
// Declare the range that you want
scala> val r = 1 to 10000
// Create a DataFrame from the range
scala> val df = sc.parallelize(r).toDF("x")
// Add the new column "y"
scala> val final_df = df.select(col("x"), concat(lit("item"), col("x")).alias("y"))
scala> final_df.show
+---+------+
| x| y|
+---+------+
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
| 20|item20|
+---+------+
scala> final_df.count
res17: Long = 10000
// Write final_df to the target path in Parquet format
scala> final_df.write.format("parquet").save(<path to write>)
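As an aside, the same DataFrame can also be built without parallelizing a collection by using spark.range (each union in the question's loop grows the query plan, which is likely why it slows down so badly); a minimal sketch:
import org.apache.spark.sql.functions.{col, concat, lit}

// spark.range produces a single LongType column named "id"
val final_df = spark.range(1, 10001)
  .select(col("id").cast("int").as("x"), concat(lit("item"), col("id")).as("y"))

final_df.write.format("parquet").save(<path to write>)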
I would like to replicate rows according to their value for a given column. For example, I got this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use the withColumn method, according to this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried explode(Array.fill($"count"){1}), but I hit the same problem.)
What do I need to change? Is there a cleaner way?
array_repeat is available from Spark 2.4 onwards. If you need a solution for lower versions, you can use a udf() or the RDD API. For the RDD approach, check this out:
import scala.collection.mutable._
val df = Seq(3,1,4).toDF("count")
val rdd1 = df.rdd.flatMap(x => { val y = x.getAs[Int]("count"); for (p <- 0 until y) yield Row(y) })
spark.createDataFrame(rdd1,df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With the DataFrame alone:
scala> df.flatMap( r=> { (0 until r.getInt(0)).map( i => r.getInt(0)) } ).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
For udf(), the below would work:
val df = Seq(3,1,4).toDF("count")
def array_repeat(x: Int): Array[Int] = {
  val y = for (p <- 0 until x) yield x
  y.toArray
}
val udf_array_repeat = udf(array_repeat(_: Int): Array[Int])
df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT :
Check #user10465355's answer below for more information about array_repeat.
You can use array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+
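If you prefer the exploded column to keep a meaningful name (e.g. count, as in the question) instead of the default col, you can alias it:
df.select(explode(array_repeat($"value", $"value")).as("count")).show()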
I have the following table:
+-----+---+----+
|type | t |code|
+-----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+-----+---+----+
And what I want:
+-----+---+----+
|t1 | t2|code|
+-----+---+----+
| 25| 88| 11|
| 114|520| 11|
+-----+---+----+
There are two types of events, A and B.
Event A is the start, event B is the end.
I want to connect each start event with the next end event that has the same code.
It's quite easy in SQL to do this:
SELECT a.t AS t1,
(SELECT b.t FROM events AS b WHERE a.code == b.code AND a.t < b.t LIMIT 1) AS t2, a.code AS code
FROM events AS a
But I have a problem implementing this in Spark, because it looks like this kind of nested query isn't supported...
I tried it with:
df.createOrReplaceTempView("events")
val sqlDF = spark.sql(/* SQL-query above */)
The error I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in:
Do you have any other ideas to solve that problem?
It's quite easy in SQL to do this
And so is in Spark SQL, luckily.
val events = ...
scala> events.show
+----+---+----+
|type| t|code|
+----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+----+---+----+
// assumed that t is int
scala> events.printSchema
root
|-- type: string (nullable = true)
|-- t: integer (nullable = true)
|-- code: integer (nullable = true)
val eventsA = events.
where($"type" === "A").
as("a")
val eventsB = events.
where($"type" === "B").
as("b")
val solution = eventsA.
join(eventsB, "code").
where($"a.t" < $"b.t").
select($"a.t" as "t1", $"b.t" as "t2", $"a.code").
orderBy($"t1".asc, $"t2".asc).
dropDuplicates("t1", "code").
orderBy($"t1".asc)
That should give you the requested output.
scala> solution.show
+---+---+----+
| t1| t2|code|
+---+---+----+
| 25| 88| 11|
|114|520| 11|
+---+---+----+
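A variant of the same join, in case orderBy + dropDuplicates feels fragile: take the minimum end time per start with a groupBy (a sketch, assuming the same eventsA and eventsB as above).
import org.apache.spark.sql.functions.min

val solution2 = eventsA
  .join(eventsB, "code")
  .where($"a.t" < $"b.t")
  .select($"a.t".as("t1"), $"b.t".as("t2"), $"code")
  .groupBy("t1", "code")
  .agg(min("t2").as("t2"))
  .select("t1", "t2", "code")
  .orderBy("t1")

solution2.show()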