Converting Array of Strings to String with different delimiters in Spark Scala

I want to convert an array of String column in a dataframe to a String with a delimiter other than a comma, also removing the array brackets. I want the "," to be replaced with ";#". This is to avoid ambiguity with elements that may themselves contain ",", since it is a free-form text field. I am using Spark 1.6.
Examples below:
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
+--------------------+
|carLineName |
+--------------------+
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;# Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))

You can use the native function array_join (available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
+--------------------+------------------+
| carLineName| str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
+--------------------+------------------+
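For versions before 2.4 (such as the asker's Spark 1.6), concat_ws also accepts array<string> columns and joins their elements with the given separator. A minimal sketch, assuming the same df as above (worth verifying on 1.6):
import org.apache.spark.sql.functions.concat_ws
// concat_ws joins the array elements with ";#" into a single string column
df.withColumn("str", concat_ws(";#", $"carLineName")).show()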

You can create a user-defined function that concatenates elements with the "#;" separator, as in the following example:
val df1 = Seq(
("1", Array("t1", "t2")),
("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ "#;" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr"))).show()
+---+------------+--------------+
| id|         arr|formatedColumn|
+---+------------+--------------+
|  1|    [t1, t2]|        t1#;t2|
|  2|[t1, t3, t5]|    t1#;t3#;t5|
+---+------------+--------------+
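Note that reduce throws on an empty array. A sketch of a safer variant of the same UDF, using mkString (which simply returns an empty string for an empty array):
import org.apache.spark.sql.functions.{col, udf}
// mkString handles empty arrays gracefully, unlike reduce
def safeFormat: Seq[String] => String = _.mkString("#;")
def udfSafeFormat = udf(safeFormat)
df1.withColumn("formatedColumn", udfSafeFormat(col("arr"))).show()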

You could simply write a user-defined function (udf), which takes an Array of String as its input parameter. Inside the udf, any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This udf could be made more generic by passing the delimiter as a second parameter.
import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))

Since you are using 1.6, we can do a simple map of Row to WrappedArray.
Here is how it goes.
Input :
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema ::
scala> carLineDf.printSchema
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
+-------------------+
// Even an easier alternative
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).show(false)
That's it. You might have to go through dataframe.rdd first (see the sketch below); otherwise this should do.
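If the direct map above complains in your 1.6 environment, here is a hedged sketch of the explicit .rdd route (assuming a SQLContext and import sqlContext.implicits._ for toDF; WrappedArray is already imported above):
// go through the underlying RDD[Row] explicitly, then back to a DataFrame
val joinedDf = carLineDf.rdd
  .map(row => row.getAs[WrappedArray[String]]("carLineName").mkString(";#"))
  .toDF("carLineNameAsString")
joinedDf.show(false)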

Related

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created case class schema with the following:
package types
case class AnimalsType (
weight: Option[Double],
animal_type: Option[String],
animal_interpretation: Option[AnimalInterpretation]
)
case class AnimalInterpretation (
is_large_animal: Option[Boolean],
is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot assign the case class schema to the CSV directly. You need to transform the CSV string column (animal_interpretation) into JSON format, as done in the code below using a UDF. If you can get the input data already in the format of df1, there is no need for the UDF; you can continue from df1 and get the final dataframe df2.
There is no need for any case class, since your header row already provides the column names; for the JSON data you only need to declare the schema AnimalInterpretationSch as below.
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions.{col, from_json, lit, udf}
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
//All column from Json
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration of the Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+
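If you still want the typed Dataset[AnimalsType] from the question, a hedged sketch (assuming the case classes from the question are on the classpath, spark.implicits._ is imported, and df1/AnimalInterpretationSch are defined as above) is to keep the parsed struct and cast weight to double:
// weight arrives as a string from the CSV, so cast it before mapping to the case class
val typedDs = df1.select(
  col("weight").cast("double").as("weight"),
  col("animal_type"),
  from_json(col("animal_interpretation"), AnimalInterpretationSch).as("animal_interpretation")
).as[AnimalsType]
typedDs.show(false)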

How to extract the elements of an Array[String] that start with a particular value in Scala?

I have a Scala dataframe which has this schema:
filter_msg.printSchema()
root
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
and data sample:
|[SD:GK, 3.16.0, OS:Linux, (x86_64), AID:176]|
I want to extract the values from this array of strings where an element starts with, say, SD, and get its value, and then likewise get the value if it is OS.
The problem is that the position within the array is not always the same; it keeps varying, so I can't use
filter_msg.select($"value".getItem(1).as("SD"))
The output should give me a dataframe:
Output=GK | Linux
Output.printSchema()
String,String
You can convert to an RDD and extract the values as below:
import scala.collection.mutable

// If you can confirm the data are always in the same order
filter_msg.rdd.map(_.getAs[mutable.WrappedArray[String]](0))
.map(row => {
val sd = row(0).split(":").tail.head
val os = row(2).split(":").tail.head
(sd, os)
} )
.toDF("sd", "os")
Or, as @SleightX mentioned, you can use:
filter_msg.rdd.map(_.getAs[mutable.WrappedArray[String]](0))
.map(row => {
val sd = row.filter(_.startsWith("SD:")).head.split(":").tail.head
val os = row.filter(_.startsWith("OS:")).head.split(":").tail.head
(sd, os)
} )
.toDF("sd", "os")
Output:
+---+-----+
|sd |os |
+---+-----+
|GK |Linux|
+---+-----+
Here is another approach using regex and the regexp_extract function:
import org.apache.spark.sql.functions.{concat_ws, regexp_extract}
val df = Seq(
Seq("SD:GK", "3.16.0", "OS:Linux", "(x86_64)", "AID:176")
).toDF
df.withColumn("to_str", concat_ws(",", $"value")) //concatenate array items into one string i.e: SD:GK,3.16.0,OS:Linux,(x86_64),AID:176
.select(
regexp_extract($"to_str", "SD:(\\w+),", 1) as "SD", //extract SD
regexp_extract($"to_str", "OS:(\\w+),", 1) as "OS" //extract OS
).show(false)
// Output
// +---+-----+
// |SD |OS |
// +---+-----+
// |GK |Linux|
// +---+-----+
A UDF can be used:
import org.apache.spark.sql.functions.{lit, udf}
val df = Seq(Array("SD:GK", "3.16.0", "OS:Linux", "(x86_64)", "AID:176")).toDF("value")
val extractArrayValues = (prefix: String, values: Seq[String]) =>
values.filter(_.startsWith(prefix + ":")).map(_.split(":")(1)).headOption
val extractUDF = udf(extractArrayValues)
val result = df.select(
extractUDF(lit("SD"), $"value").alias("SD"),
extractUDF(lit("OS"), $"value").alias("OS")
)
Result is:
+---+-----+
|SD |OS |
+---+-----+
|GK |Linux|
+---+-----+
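One nice property of this UDF is that a missing prefix does not throw: headOption yields None, which Spark renders as null. A small sketch with a hypothetical "XX" prefix that does not occur in the data:
df.select(extractUDF(lit("XX"), $"value").alias("XX")).show()
// +----+
// |  XX|
// +----+
// |null|
// +----+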

Select the last element of an Array in a DataFrame

I'm working on a project and I'm dealing with some nested JSON data with a complicated schema/data structure. Basically what I want to do is transform one of the columns in a dataframe such that I select the last element in the array. I'm totally stuck on how to do this. I hope this makes sense.
Below is an example of what I'm trying to accomplish:
val singersDF = Seq(
("beatles", "help,hey,jude"),
("romeo", "eres,mia"),
("elvis", "this,is,an,example")
).toDF("name", "hit_songs")
val actualDF = singersDF.withColumn(
"hit_songs",
split(col("hit_songs"), "\\,")
)
actualDF.show(false)
actualDF.printSchema()
+-------+-----------------------+
|name |hit_songs |
+-------+-----------------------+
|beatles|[help, hey, jude] |
|romeo |[eres, mia] |
|elvis |[this, is, an, example]|
+-------+-----------------------+
root
|-- name: string (nullable = true)
|-- hit_songs: array (nullable = true)
| |-- element: string (containsNull = true)
The end goal for the output would be the following, to select the last "string" in the hit_songs array.
I'm not worried about what the schema would look like afterwards.
+-------+---------+
|name |hit_songs|
+-------+---------+
|beatles|jude |
|romeo |mia |
|elvis |example |
+-------+---------+
You can use the size function to calculate the index of the desired item in the array, and then pass this as the argument of Column.apply (explicitly or implicitly):
import org.apache.spark.sql.functions._
import spark.implicits._
actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))
Or:
actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
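The same indexing can also be written as a SQL expression via expr; a sketch assuming the actualDF from the question:
import org.apache.spark.sql.functions.expr
// index the array with a dynamically computed position
actualDF.withColumn("hit_songs", expr("hit_songs[size(hit_songs) - 1]"))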
Since Spark 2.4, you can use element_at, which supports negative indexing, as you can see in this documentation quote:
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
With that, here's how to get the last element:
import org.apache.spark.sql.functions.element_at
actualDF.withColumn("hit_songs", element_at($"hit_songs", -1))
Reproducible example:
First let's prepare a sample dataframe with an array column:
val columns = Seq("col1")
val data = Seq((Array(1,2,3)))
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF(columns:_*)
which looks like this:
scala> df.show()
+---------+
| col1|
+---------+
|[1, 2, 3]|
+---------+
Then, apply element_at to get the last element as follows:
scala> df.withColumn("last_value", element_at($"col1", -1)).show()
+---------+----------+
| col1|last_value|
+---------+----------+
|[1, 2, 3]| 3|
+---------+----------+
Here's one approach:
val actualDF = Seq(
("beatles", Seq("help", "hey", "jude")),
("romeo", Seq("eres", "mia")),
("elvis", Seq("this", "is", "an", "example"))
).toDF("name", "hit_songs")
import org.apache.spark.sql.functions._
actualDF.withColumn("total_songs", size($"hit_songs")).
select($"name", $"hit_songs"($"total_songs" - 1).as("last_song"))
// +-------+---------+
// | name|last_song|
// +-------+---------+
// |beatles| jude|
// | romeo| mia|
// | elvis| example|
// +-------+---------+
You are looking for the Spark SQL function slice (it is also available in PySpark).
Your implementation in Scala is slice($"hit_songs", -1, 1)(0), where -1 is the starting position (the last index), 1 is the length, and (0) extracts the first string from the resulting array of exactly one element.
Full Example:
import org.apache.spark.sql.functions._
val singersDF = Seq(
("beatles", "help,hey,jude"),
("romeo", "eres,mia"),
("elvis", "this,is,an,example")
).toDF("name", "hit_songs")
val actualDF = singersDF.withColumn(
"hit_songs",
split(col("hit_songs"), "\\,")
)
val newDF = actualDF.withColumn("last_song", slice($"hit_songs", -1, 1)(0))
display(newDF)
Output:
+---------+------------------------------+-------------+
| name | hit_songs | last_song |
+---------+------------------------------+-------------+
| beatles | ["help","hey","jude"] | jude |
| romeo | ["eres","mia"] | mia |
| elvis | ["this","is","an","example"] | example |
+---------+------------------------------+-------------+
You can also use a UDF like:
val lastElementUDF = udf((array: Seq[String]) => array.lastOption)
actualDF.withColumn("hit_songs", lastElementUDF($"hit_songs"))
array.lastOption will return None or Some, and array.last will throw an exception if the array is empty.
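A small sketch of that difference, adding a hypothetical row with an empty array (assuming spark.implicits._ is in scope); lastOption becomes null in the result, whereas .last would fail:
// hypothetical extra row with an empty hit_songs array, for illustration only
val withEmpty = actualDF.union(Seq(("nobody", Seq.empty[String])).toDF("name", "hit_songs"))
withEmpty.withColumn("hit_songs", lastElementUDF($"hit_songs")).show(false)
// the "nobody" row shows null instead of throwing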

How to aggregate map columns after groupBy?

I need to union two dataframes and combine the columns by keys. The two dataframes have the same schema, for example:
root
|-- id: String (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to group by "id" and aggregate the "cMap" together to deduplicate.
I tried the code:
val df = df_a.unionAll(df_b).groupBy("id").agg(collect_list("cMap") as "cMap").
rdd.map(x => {
var map = Map[String,String]()
x.getAs[Seq[Map[String,String]]]("cMap").foreach( y =>
y.foreach( tuple =>
{
val key = tuple._1
val value = tuple._2
if(!map.contains(key))//deduplicate
map += (key -> value)
}))
Row(x.getAs[String]("id"),map)
})
But it seems collect_list cannot be used with the map type:
org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but map<string,string> was passed as parameter 1..;
Is there other solution for the problem?
You have to use the explode function on the map columns first to destructure the maps into key and value columns, union the resulting datasets, apply distinct to de-duplicate, and only then groupBy with some custom Scala code to aggregate the maps.
Stop talking and let's do some coding then...
Given the datasets:
scala> a.show(false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
scala> a.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
scala> b.show(false)
+---+-------------+
|id |cMap |
+---+-------------+
|one|Map(1 -> one)|
+---+-------------+
scala> b.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
You should first use the explode function on the map columns.
explode(e: Column): Column Creates a new row for each element in the given array or map column.
val a_keyValues = a.select('*, explode($"cMap"))
scala> a_keyValues.show(false)
+---+-----------------------+---+-----+
|id |cMap |key|value|
+---+-----------------------+---+-----+
|one|Map(1 -> one, 2 -> two)|1 |one |
|one|Map(1 -> one, 2 -> two)|2 |two |
+---+-----------------------+---+-----+
val b_keyValues = b.select('*, explode($"cMap"))
With the following you have distinct key-value pairs, which is exactly the deduplication you asked for.
val distinctKeyValues = a_keyValues.
union(b_keyValues).
select("id", "key", "value").
distinct // <-- deduplicate
scala> distinctKeyValues.show(false)
+---+---+-----+
|id |key|value|
+---+---+-----+
|one|1 |one |
|one|2 |two |
+---+---+-----+
Time to groupBy and create the final map column.
val result = distinctKeyValues.
withColumn("map", map($"key", $"value")).
groupBy("id").
agg(collect_list("map")).
as[(String, Seq[Map[String, String]])]. // <-- leave Rows for typed pairs
map { case (id, list) => (id, list.reduce(_ ++ _)) }. // <-- collect all entries under one map
toDF("id", "cMap") // <-- give the columns their names
scala> result.show(truncate = false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
Please note that as of Spark 2.0.0 unionAll has been deprecated and union is the proper union operator:
(Since version 2.0.0) use union()
Since Spark 3.0, you can:
transform your map to an array of map entries with map_entries
collect those arrays by your id using collect_set
flatten the collected array of arrays using flatten
then rebuild the map from flattened array using map_from_entries
See following code snippet where input is your input dataframe:
import org.apache.spark.sql.functions.{col, collect_set, flatten, map_entries, map_from_entries}
input
.withColumn("cMap", map_entries(col("cMap")))
.groupBy("id")
.agg(map_from_entries(flatten(collect_set("cMap"))).as("cMap"))
Example
Given the following dataframe input:
+---+--------------------+
|id |cMap |
+---+--------------------+
|1 |[k1 -> v1] |
|1 |[k2 -> v2, k3 -> v3]|
|2 |[k4 -> v4] |
|2 |[] |
|3 |[k6 -> v6, k7 -> v7]|
+---+--------------------+
The code snippet above returns the following dataframe:
+---+------------------------------+
|id |cMap |
+---+------------------------------+
|1 |[k1 -> v1, k2 -> v2, k3 -> v3]|
|3 |[k6 -> v6, k7 -> v7] |
|2 |[k4 -> v4] |
+---+------------------------------+
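One caveat, hedged because the behaviour depends on the Spark version: if the same key can appear in maps of different rows for one id, building the final map may fail with a duplicate-key error under Spark 3's default policy. The following setting keeps the last value instead:
// allow duplicate map keys by keeping the last value seen (default policy is EXCEPTION)
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")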
I agree with @Shankar. Your code seems to be flawless.
The only mistake I assume you are making is that you are importing the wrong library.
You had to import
import org.apache.spark.sql.functions.collect_list
But I guess you are importing
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList
I hope I am guessing it right.

How to cast Array[Struct[String,String]] column type in Hive to Array[Map[String,String]]?

I've a column in a Hive table:
Column Name: Filters
Data Type:
|-- filters: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
I want to get the value from this column by its corresponding name.
What I did so far:
val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")
val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()
Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))
Note: Casting to Seq because WrappedArray to Map conversion is not possible.
What to do next?
I want to get the value from this column by its corresponding name.
If you want a simple and reliable way to get all values by name, you may flatten your table using explode and filter:
case class Data(name: String, value: String)
case class Filters(filters: Array[Data])
val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
| filters|
+--------------+
|[[a,b], [a,c]]|
| [[b,c]]|
+--------------+
df.withColumn("filter", explode($"filters"))
.select($"filter.name" as "name", $"filter.value" as "value")
.where($"name" === "a")
.show()
+----+-----+
|name|value|
+----+-----+
| a| b|
| a| c|
+----+-----+
You can also collect your data any way you want:
val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))
If you want to cast array[struct] to map[string, string] for later saving to some storage, that's a different story, and that case is better solved by a UDF, as sketched below. In any case, avoid collect() as long as possible to keep your code scalable.
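A hedged sketch of such a UDF (an array of structs arrives in a Scala UDF as Seq[Row]; toMap keeps the last value when names repeat):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
// convert each [name, value] struct into a map entry
val structsToMap = udf((filters: Seq[Row]) =>
  filters.map(r => r.getAs[String]("name") -> r.getAs[String]("value")).toMap
)
df.withColumn("filters_map", structsToMap($"filters")).printSchema()
// the new column has type map<string,string>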