Converting nested null values to empty strings inside dataframe spark - scala

I am looking to convert null values nested in an Array of String to empty strings in Spark. The data is in a dataframe. I plan on running a reduce function after making the dataframe null-safe; not sure if that helps in answering the question. I am using Spark 1.6.
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Example input:
+--------------------+
|carLineName |
+--------------------+
|[null,null,null] |
|[null, null] |
|[Mustang, null] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|[,,] |
|[,] |
|[Mustang,] |
|[Pilot, Jeep] |
My attempt:
val safeString: Seq[String] => Seq[String] = s => if (s == null) "" else s
val udfSafeString = udf(safeString)

The input to the UDF is a sequence of strings, not a single string. Since that is the case, you need to map over it. You can do this as follows:
val udfSafeString = udf((arr: Seq[String]) => {
  arr.map(s => if (s == null) "" else s)
})
df.withColumn("carLineName", udfSafeString($"carLineName"))
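As a quick sanity check (assuming a spark-shell session with implicits in scope; the sample data below is hypothetical, mirroring the question), the UDF above can be exercised as follows:
// hypothetical sample data with nulls nested in the arrays
val sampleDf = Seq(
  Seq[String](null, null, null),
  Seq[String]("Mustang", null),
  Seq("Pilot", "Jeep")
).toDF("carLineName")
sampleDf.withColumn("carLineName", udfSafeString($"carLineName")).show(false)
// nulls come back as empty strings, e.g. [, , ] and [Mustang, ]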

Related

Concatenate keys with first element in the values array in a MapType column

The schema of the dataframe is given below.
|-- idMap: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- linked: boolean (nullable = true)
If there are 3 keys in a row for example, I'm trying to convert this to a new string column of the format key1:id;key2:id;key3:id where id is part of the element at index 0.
What I've tried is:
1. Collect the keys to a list.
2. Create a list of columns from the list of keys:
val expr = new scala.collection.mutable.ListBuffer[org.apache.spark.sql.Column]
keyList.foldLeft(expr)((expr, key) => expr += (lit(key), lit(":"), col("idMap")(key)(0)("id"), lit(";")))
3. Add a new column with the list of columns passed to concat:
val finalDf = df.withColumn("concatColumn", concat(expr.toList:_*))
It's giving me a null column so I'm assuming this approach is flawed. Any input would be appreciated.
Edit: #mck's answer works. Using concat_ws in step 3 (with an empty separator, since the delimiters are already part of expr) works as well, because concat returns null if any of its arguments is null while concat_ws skips nulls:
val finalDf = df.withColumn("concatColumn", concat_ws("", expr.toList:_*))
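To illustrate the difference between the two (a minimal spark-shell sketch, assuming Spark 2.x or later):
// concat is null-intolerant, while concat_ws skips null inputs
spark.sql("select concat('key1', ':', null) as c1, concat_ws('', 'key1', ':', null) as c2").show()
// +----+-----+
// |  c1|   c2|
// +----+-----+
// |null|key1:|
// +----+-----+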
If you have Spark 3, you can use transform_values to transform the map column to get your desired output.
// sample dataframe
val df = spark.sql("select map('key1', array(struct('id1' id, true linked)), 'key2', array(struct('id2' id, false linked))) idMap")
val df2 = df.withColumn(
  "concatColumn",
  expr("""
    concat_ws(';',
      map_values(
        transform_values(
          idMap,
          (k, v) -> concat(k, ':', transform(v, y -> y.id)[0])
        )
      )
    )
  """)
)
df2.show(false)
+-----------------------------------------------+-----------------+
|idMap |concatColumn |
+-----------------------------------------------+-----------------+
|[key1 -> [[id1, true]], key2 -> [[id2, false]]]|key1:id1;key2:id2|
+-----------------------------------------------+-----------------+
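If you are on Spark 2.4 rather than 3, transform_values is not available, but a similar result can likely be obtained by transforming the map keys instead (a hedged sketch, not part of the original answer; verify on your version):
// Spark 2.4+: iterate over the map keys and look up the first id for each key
val df3 = df.withColumn(
  "concatColumn",
  expr("""
    concat_ws(';',
      transform(
        map_keys(idMap),
        k -> concat(k, ':', idMap[k][0].id)
      )
    )
  """)
)
df3.show(false)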

Converting Array of Strings to String with different delimiters in Spark Scala

I want to convert an array of Strings in a dataframe to a String with a delimiter other than a comma, also removing the array brackets. I want the "," to be replaced with ";#". This is to avoid clashing with elements that may contain "," inside, as it is a free-form text field. I am using Spark 1.6.
Examples below:
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
+--------------------+
|carLineName |
+--------------------+
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
You can use the native function array_join (available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
+--------------------+------------------+
| carLineName| str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
+--------------------+------------------+
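Since the asker is on Spark 1.6, where array_join does not exist, concat_ws may be worth trying instead, as it also accepts an array<string> column (a hedged suggestion; verify on 1.6):
import org.apache.spark.sql.functions.concat_ws
// joins the array elements with the given separator, skipping nulls
df.withColumn("str", concat_ws(";#", $"carLineName")).show()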
You can also create a user-defined function that concatenates the elements with the ";#" separator, as in the following example:
val df1 = Seq(
  ("1", Array("t1", "t2")),
  ("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ ";#" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formattedColumn", udfFormat(col("arr"))).show()
+---+------------+---------------+
| id|         arr|formattedColumn|
+---+------------+---------------+
|  1|    [t1, t2]|         t1;#t2|
|  2|[t1, t3, t5]|     t1;#t3;#t5|
+---+------------+---------------+
You could simply write a user-defined function (UDF) which takes an Array of String as its input parameter. Inside the UDF any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
  carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This UDF can be made more generic by passing the delimiter as a second parameter.
import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
  carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))
Since you are using Spark 1.6, we can do a simple map of each Row to a WrappedArray.
Here is how it goes.
Input:
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema:
scala> carLineDf.printSchema
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray[String] instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
+-------------------+
// An even easier alternative
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).toDF("carLineNameAsString").show(false)
That's it. Depending on your Spark version you might have to go through dataframe.rdd first; otherwise this should do.

Compare Dataframe field with Json [duplicate]

I am working on a dataframe created from JSON, and I want to apply a filter condition to it.
val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)
Schema of df:
root
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: long (nullable = true)
| | |-- value: long (nullable = true)
Now I need to filter the dataframe, which I am trying to do as:
val df1=df.where("key == 84896")
which throws error
ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)
The reason I want to use the where clause is that I have an expression string which I want to use directly,
e.g. ( (key == 999, value == 55) || (key == 1234, value == 12) )
First you should use explode to get an easy-to-work-with DataFrame. Then you can select both key and value from your given input:
val explodedDF = df.withColumn("metadata", explode($"metadata"))
.select("metadata.key", "metadata.value")
Output:
+-----+-----+
| key|value|
+-----+-----+
|84896| 54|
| 1234| 12|
+-----+-----+
This way you'll be able to perform your filtering logic as usual:
scala> explodedDF.where("key == 84896").show
+-----+-----+
| key|value|
+-----+-----+
|84896| 54|
+-----+-----+
You can combine your filtering requirements; some examples below:
explodedDF.where("key == 84896 AND value == 54")
explodedDF.where("(key == 84896 AND value == 54) OR key = 1234")
From what I have understood from your question and comment, you are trying to apply the ( (key == 999, value == 55) || (key == 1234, value == 12) ) expression to filter the dataframe rows.
First of all, the expression needs changes, as it cannot be applied to a dataframe in Spark as-is, so you need to transform it:
val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")
which should give you the new, valid expression
( (key == 999 and value == 55) or (key == 1234 and value == 12) )
Now that you have a valid expression, your dataframe needs modification too, as you can't query such an expression against a column whose schema is an array of structs.
So you need the explode function to explode the array elements into separate rows, and then the .* notation to select all the fields of the struct as separate columns.
val df1 = df.withColumn("metadata", explode($"metadata"))
  .select($"metadata.*")
which should give you a dataframe like
+-----+-----+
|key |value|
+-----+-----+
|84896|54 |
|1234 |12 |
+-----+-----+
Finally, use the valid expression on the generated dataframe:
df1.where(s"${actualExpression}")
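For the sample data in the question, the end-to-end result should look like this (an illustration; the row values are deduced from the JSON above):
df1.where(s"${actualExpression}").show(false)
// +----+-----+
// |key |value|
// +----+-----+
// |1234|12   |
// +----+-----+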
I hope the answer is helpful

How to count the elements in a column of arrays?

I'm trying to count the number of elements in FavouriteCities column in the following DataFrame.
+-----------------+
| FavouriteCities |
+-----------------+
| [NY, Canada] |
+-----------------+
The schema is as follows:
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Expected output should be something like,
+------------+-------------+
| City | Count |
+------------+-------------+
| NY | 1 |
| Canada | 1 |
+------------+-------------+
I have tried using agg() and count() like the following, but it fails to extract individual elements from the array and tries to find the most common set of elements in the column.
data.agg(count("FavouriteCities").alias("count"))
Can someone please guide me with this?
To match the schema you've shown:
scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Explode:
val counts = data
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count
and aggregate:
import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)
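If the goal is the per-city table from the question rather than the single top city, displaying counts directly should already match the expected output (deduced from the single sample row; row order may vary):
counts.show()
// +------+-----+
// |  City|count|
// +------+-----+
// |Canada|    1|
// |    NY|    1|
// +------+-----+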

How to cast Array[Struct[String,String]] column type in Hive to Array[Map[String,String]]?

I've a column in a Hive table:
Column Name: Filters
Data Type:
|-- filters: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
I want to get the value from this column by its corresponding name.
What I did so far:
val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")
val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()
Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))
Note: Casting to Seq because WrappedArray to Map conversion is not possible.
What to do next?
I want to get the value from this column by its corresponding name.
If you want a simple and reliable way to get all the values by name, you may flatten your table using explode and filter:
case class Data(name: String, value: String)
case class Filters(filters: Array[Data])
val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
| filters|
+--------------+
|[[a,b], [a,c]]|
| [[b,c]]|
+--------------+
df.withColumn("filter", explode($"filters"))
.select($"filter.name" as "name", $"filter.value" as "value")
.where($"name" === "a")
.show()
+----+-----+
|name|value|
+----+-----+
| a| b|
| a| c|
+----+-----+
You can also collect your data any way you want:
val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))
If you want to cast array[struct] to map[string, string] for future saving to some storage, it's a different story, and that case is better solved by a UDF. In any case, avoid collect() for as long as possible to keep your code scalable.
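The UDF route is not spelled out here; a minimal sketch of what it could look like (an assumption on my part, using the name/value fields from the schema above and receiving each struct element as a Row):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
// hypothetical UDF: array<struct<name,value>> -> map<string,string>
val filtersToMap = udf((filters: Seq[Row]) =>
  if (filters == null) null
  else filters.map(r => r.getAs[String]("name") -> r.getAs[String]("value")).toMap
)
val withMap = df.withColumn("filtersMap", filtersToMap($"filters"))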