Compare value in DF struct array spark - scala

I have the following problem to solve with spark/scala
I have this DF
+--------------+--------------------+
|co_tipo_arquiv| errorCodes|
+--------------+--------------------+
| 05|[10531, 20524, 10...|
this schema:
root
|-- co_tipo_arquiv: string (nullable = true)
|-- errorCodes: array (nullable = true)
| |-- element: string (containsNull = true)
I need to check whether any of the codes in my error list (list_erors) appear in the errorCodes column of the DataFrame:
val list_erors = List("10531","10144")
I tried this, but it doesn't work:
dfNire.filter(col("errorCodes").isin(list_erors)).show()

Spark 2.4+
You can use the array_intersect function with the array of errors.
val list_errors = Array("10531","10144")
df.withColumn("intersect", array_intersect(col("errors"), lit(list_errors))).show(false)
The result is as follows:
+---+---------------------+---------+
|id |errors |intersect|
+---+---------------------+---------+
|05 |[10531, 20524, 11111]|[10531] |
+---+---------------------+---------+
The column names here (id, errors) are just placeholders from my test data.
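Since the question asks for a filter rather than a new column, here is a minimal sketch of one way to do that on Spark 2.4+ with arrays_overlap. The toy rows and the SparkSession setup are assumptions made for illustration; only the column names and list_erors come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit}

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("overlap-sketch").getOrCreate()
import spark.implicits._

// Toy data mirroring the question's schema (rows are made up)
val dfNire = Seq(
  ("05", Seq("10531", "20524", "11111")),
  ("06", Seq("99999"))
).toDF("co_tipo_arquiv", "errorCodes")

val list_erors = List("10531", "10144")

// Build an array column from the Scala list, one literal per code
val wanted = array(list_erors.map(code => lit(code)): _*)

// Keep rows whose errorCodes share at least one element with the list
val matched = dfNire.filter(arrays_overlap(col("errorCodes"), wanted))
matched.show(false)
```

Building the array with array(... lit ...) sidesteps the question of which collection types lit accepts in a given Spark version.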

If you want to list only the rows whose array contains any of the codes in list_errors:
df.show()
//+------------------+
//| errorCodes|
//+------------------+
//|[10531, 20254, 10]|
//| [10]|
//+------------------+
def is_exists_any(s: Seq[String]): UserDefinedFunction = udf((c: collection.mutable.WrappedArray[String]) => c.toList.intersect(s).nonEmpty)
val list_errors = Seq("10531", "10144")
df.withColumn("is_exists",is_exists_any(list_errors)(col("errorCodes"))).filter(col("is_exists") === true).show()
//+------------------+---------+
//| errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]| true|
//+------------------+---------+
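Another way to get the rows without a UDF is to use array_intersect and keep only the rows where the resulting array is non-empty. list_errors is a Seq here, so it is passed with typedLit (Spark 2.2+), since lit may reject a Scala Seq as an unsupported literal type.
df.withColumn("is_exists", array_intersect(col("errorCodes"), typedLit(list_errors))).
filter(size(col("is_exists")) =!= 0).
show()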
//+------------------+---------+
//| errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]| [10531]|
//+------------------+---------+

Related

Is there an efficient way to return Array[Int] from a spark Dataframe without using collect()

I have a dataframe something like this.
root
|-- key1: string (nullable = true)
|-- value1: string (nullable = true)
+----+------+
|key1|value1|
+----+------+
| E1| 1|
| E3| 0|
| E4| 1|
| E2| 0|
...
+----+------+
I convert the "value1" column to an Array[Int] with collect(), as shown below. This is not efficient: it takes 10-15 seconds, because the DataFrame holds a lot of data and in each Spark Streaming cycle all of it is collected to the driver.
val data = Seq(("E1","1"),
("E3","0"),
("E4","1"),
("E2","0")
)
val columns = Seq("key1", "value1")
import spark.implicits._
val df = data.toDF(columns:_*)
val ordered_df = df.orderBy("key1").select("value1").collect().map(_(0)).toList
ordered_df.foreach(print)
Output :
1001
So, what is an efficient way to return an Array of Int from the above DataFrame without using collect()?
Thanks,

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created case class schema with the following:
package types
case class AnimalsType (
weight: Option[Double],
animal_type: Option[String],
animal_interpretation: Option[AnimalInterpretation]
)
case class AnimalInterpretation (
is_large_animal: Option[Boolean],
is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot apply a nested schema to a CSV column directly, because the CSV reader loads animal_interpretation as a plain string. You first need to convert that string column into JSON format, which the code below does with a UDF. If you can get the input data already in the format of df1, the UDF is unnecessary: continue from df1 to get the final DataFrame df2.
No case class is needed, since the CSV header already provides the column names; for the JSON data you only need to declare the schema AnimalInterpretationSch, as below.
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
//All column from Json
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration of Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+
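As a variation on the answer above, here is a hedged sketch that does the same string-to-JSON rewrite with the built-in regexp_replace instead of a UDF, and keeps animal_interpretation as a nested struct rather than flattening it. The in-memory DataFrame stands in for the one read from animals.csv, and the name interpretationSchema is introduced only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, regexp_replace}
import org.apache.spark.sql.types.{BooleanType, DoubleType, StructType}

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("csv-struct-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame read from the CSV (every column is a string)
val df = Seq(
  ("20", "dog", "{is_large_animal=true, is_mammal=true}"),
  ("3.5", "cat", "{is_large_animal=false, is_mammal=true}"),
  ("6.00E-04", "ant", "{is_large_animal=false, is_mammal=false}")
).toDF("weight", "animal_type", "animal_interpretation")

val interpretationSchema = new StructType()
  .add("is_large_animal", BooleanType)
  .add("is_mammal", BooleanType)

// Rewrite key=value as "key":value so the string parses as JSON
val asJson = regexp_replace(col("animal_interpretation"), "(\\w+)=", "\"$1\":")

// Cast weight to double and parse the JSON string into a struct column
val nested = df
  .withColumn("weight", col("weight").cast(DoubleType))
  .withColumn("animal_interpretation", from_json(asJson, interpretationSchema))

nested.printSchema()
```

With the column parsed into a struct and weight cast to double, the column types line up with the case classes from the question, which is what the .as[AnimalsType] call was missing.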

How can I split a column containing array of some struct into separate columns?

I have the following scenarios:
case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])
val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
entity("2",List(attribute("home","hyd"))))
val df = entities.toDF()
// df.show
+---+--------------------+
| id| attr|
+---+--------------------+
| 1|[[name,sasha], [d...|
| 2| [[home,hyd]]|
+---+--------------------+
//df.printSchema
root
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
what I want to produce is
+---+--------------------+-------+
| id| name | home |
+---+--------------------+-------+
| 1| sasha |del |
| 2| null |hyd |
+---+--------------------+-------+
How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.
My main motive is to do a groupBy on different attributes, so I want to bring the data into the format shown above.
I looked into the explode function. It breaks a list down into separate rows, which I don't want. I want to create more columns from the array of attributes.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns
This can easily be reduced to "PySpark converting a column of type 'map' to multiple columns in a dataframe" or "How to get keys and values from MapType column in SparkSQL DataFrame". First convert attr to map<string, string> (map_from_entries requires Spark 2.4+):
import org.apache.spark.sql.functions.{explode, first, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
| 1|sasha| del|
| 2| null| hyd|
+---+-----+----+
A less efficient but more concise variant is to explode and pivot:
val result = df
.select($"id", explode(map_from_entries($"attr")))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
| 1| del|sasha|
| 2| hyd| null|
+---+----+-----+
but in practice I'd advise against it.

create empty array-column of given schema in Spark

Because Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now that I read the table back, I want to do the opposite:
I have a DataFrame with the following schema :
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| null|
+---+-----------+
I'd like to replace the null-array (id=2) with an empty array, i.e.
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
I've tried:
val arrSchema = df.schema(1).dataType
df
.withColumn("arr",when($"arr".isNull,array().cast(arrSchema)).otherwise($"arr"))
.show()
which gives :
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$
cannot be cast to org.apache.spark.sql.types.StructType
Edit : I don't want to "hardcode" any schema of my array column (at least not the schema of the struct) because this can vary from case to case. I can only use the schema information from df at runtime
I'm using Spark 2.1 by the way, therefore I cannot use typedLit
Spark 2.2+ with known external type
In general you can use typedLit to provide empty arrays.
import org.apache.spark.sql.functions.typedLit
typedLit(Seq.empty[(Double, Double)])
To use specific names for nested objects you can use case classes:
case class Item(x: Double, y: Double)
typedLit(Seq.empty[Item])
or rename by cast:
typedLit(Seq.empty[(Double, Double)])
.cast("array<struct<x: Double, y: Double>>")
Spark 2.1+ with schema only
With schema only you can try:
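import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types._
// "arr" must be an ArrayType of structs so that the empty JSON array parses to an empty array
val schema = StructType(Seq(
StructField("arr", ArrayType(StructType(Seq(
StructField("x", DoubleType),
StructField("y", DoubleType)
))))
))
def arrayOfSchema(schema: StructType) =
from_json(lit("""{"arr": []}"""), schema)("arr")
arrayOfSchema(schema).alias("arr")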
where schema can be extracted from the existing DataFrame and wrapped with additional StructType:
StructType(Seq(
StructField("arr", df.schema("arr").dataType)
))
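Putting the two pieces together, a minimal end-to-end sketch might look like the following. It assumes a toy DataFrame built from tuples (so the struct fields are named _1 and _2 rather than x and y) and combines the from_json trick with coalesce; the names wrapper, emptyArr and filled are introduced only for this example. It was written with recent Spark versions in mind, so treat its behavior on 2.1 as unverified.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, from_json, lit}
import org.apache.spark.sql.types.{StructField, StructType}

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("empty-array-sketch").getOrCreate()
import spark.implicits._

// Toy data: the row with id 2 has a null array
val df = Seq(
  (1L, Option(Seq((1.0, 2.0)))),
  (2L, Option.empty[Seq[(Double, Double)]])
).toDF("id", "arr")

// Wrap the column's runtime type in a one-field struct so from_json can target it
val wrapper = StructType(Seq(StructField("arr", df.schema("arr").dataType)))

// Parsing an empty JSON array against that schema gives an empty array of the column's type
val emptyArr = from_json(lit("""{"arr": []}"""), wrapper)("arr")

// Replace nulls with the typed empty array, leave other rows unchanged
val filled = df.withColumn("arr", coalesce(col("arr"), emptyArr))
filled.show()
```

Nothing about the struct's fields is hardcoded here: the element type is taken from df.schema at runtime, which is the constraint stated in the question.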
One way is to use a UDF:
val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)
val emptyArr = udf(() => Seq.empty[Any],arrSchema)
df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
Another approach would be to use coalesce:
val df = Seq(
(Some(1), Some(Array((1.0, 2.0)))),
(Some(2), None)
).toDF("id", "arr")
df.withColumn("arr", coalesce($"arr", typedLit(Array.empty[(Double, Double)]))).
show
// +---+-----------+
// | id| arr|
// +---+-----------+
// | 1|[[1.0,2.0]]|
// | 2| []|
// +---+-----------+
A UDF with a case class could also be interesting:
case class Item(x: Double, y: Double)
val udf_emptyArr = udf(() => Seq[Item]())
df
.withColumn("arr",coalesce($"arr",udf_emptyArr()))
.show()

Spark: Convert column of string to an array

How to convert a column that has been read as a string into a column of arrays?
i.e. convert from below schema
scala> test.printSchema
root
|-- a: long (nullable = true)
|-- b: string (nullable = true)
+---+---+
| a| b|
+---+---+
| 1|2,3|
+---+---+
| 2|4,5|
+---+---+
To:
scala> test1.printSchema
root
|-- a: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: long (containsNull = true)
+---+-----+
| a| b |
+---+-----+
| 1|[2,3]|
+---+-----+
| 2|[4,5]|
+---+-----+
Please share both scala and python implementation if possible.
On a related note, how do I take care of it while reading from the file itself?
I have data with ~450 columns and few of them I want to specify in this format.
Currently I am reading in pyspark as below:
df = spark.read.format('com.databricks.spark.csv').options(
header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.
There are various methods.
The best way is to use the split function and cast the result to array<long>:
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple UDF to convert the values:
val tolong = udf((value : String) => value.split(",").map(_.toLong))
data.withColumn("newB", tolong(data("b"))).show
Hope this helps!
Using a UDF would give you exactly the required schema, like this:
val toArray = udf((b: String) => b.split(",").map(_.toLong))
val test1 = test.withColumn("b", toArray(col("b")))
It would give you schema as follows:
scala> test1.printSchema
root
|-- a: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: long (containsNull = true)
+---+-----+
| a| b |
+---+-----+
| 1|[2,3]|
+---+-----+
| 2|[4,5]|
+---+-----+
As for applying the schema while reading the file itself, I think that is a tough task. So, for now, you can apply the transformation after reading the file into the DataFrame test.
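With ~450 columns and only a few needing this treatment, the post-read transformation can be applied to a list of column names in one pass. The sketch below assumes a small in-memory DataFrame in place of the CSV, and arrayCols is a hypothetical list of the columns that hold comma-separated longs; it uses the built-in split and cast rather than a UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("split-columns-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame produced by the CSV reader
val test = Seq((1L, "2,3", "7"), (2L, "4,5", "8,9")).toDF("a", "b", "c")

// Hypothetical list of the columns that should become array<long>
val arrayCols = Seq("b", "c")

// Apply the same split-and-cast to each listed column, leaving the others untouched
val test1 = arrayCols.foldLeft(test) { (acc, name) =>
  acc.withColumn(name, split(col(name), ",").cast("array<long>"))
}

test1.printSchema()
```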
I hope this helps!
In python (pyspark) it would be:
from pyspark.sql.functions import col, split
# Raw string so that \s reaches the regex engine unchanged; cast to array<long> to match the target schema
test = test.withColumn(
"b",
split(col("b"), r",\s*").cast("array<long>")
)