How to get the datatype of a column in a Spark dataframe dynamically - Scala

I have a dataframe and converted its dtypes to a map:
val dfTypesMap: Map[String, String] = df.dtypes.toMap
Output:
(PRODUCT_ID,StringType)
(PRODUCT_ID_BSTP_MAP,MapType(StringType,IntegerType,false))
(PRODUCT_ID_CAT_MAP,MapType(StringType,StringType,true))
(PRODUCT_ID_FETR_MAP_END_FR,ArrayType(StringType,true))
When I hardcode the type in row.getAs[String], there is no compilation error.
df.foreach(row => {
  val prdValue = row.getAs[String]("PRODUCT_ID")
})
I want to iterate over the map dfTypesMap above and get the corresponding value type for each column. Is there any way to convert the dtypes column types to plain Scala types, like below?
StringType --> String
MapType(StringType,IntegerType,false) ---> Map[String,Int]
MapType(StringType,StringType,true) ---> Map[String,String]
ArrayType(StringType,true) ---> List[String]
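In other words, I imagine something like the sketch below - pattern matching on the DataType values from df.schema rather than the strings from dtypes (toScalaTypeName is just a made-up name for illustration):
import org.apache.spark.sql.types._

// sketch only: map a Spark SQL DataType to a readable Scala type name
def toScalaTypeName(dt: DataType): String = dt match {
  case StringType         => "String"
  case IntegerType        => "Int"
  case MapType(k, v, _)   => s"Map[${toScalaTypeName(k)},${toScalaTypeName(v)}]"
  case ArrayType(elem, _) => s"List[${toScalaTypeName(elem)}]"
  case other              => other.simpleString
}

df.schema.fields.foreach(f => println(s"${f.name} -> ${toScalaTypeName(f.dataType)}"))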

As mentioned, Datasets make it easier to work with types.
Dataset is basically a collection of strongly-typed JVM objects.
You can map your data to case classes like so
case class Foo(PRODUCT_ID: String, PRODUCT_NAME: String)
val ds: Dataset[Foo] = df.as[Foo]
Then you can safely operate on your typed objects. In your case you could do
ds.foreach(foo => {
  val prdValue = foo.PRODUCT_ID
})
For more on Datasets, check out
https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets

Related

Is there any way to specify a type in Scala dynamically?

I'm new to Spark and Scala, so sorry for the stupid question. I have a number of tables:
table_a, table_b, ...
and a number of corresponding types for these tables:
case class classA(...), case class classB(...), ...
Then I need to write a method that reads data from these tables and creates a Dataset:
def getDataFromSource: Dataset[classA] = {
  val df: DataFrame = spark.sql("SELECT * FROM table_a")
  df.as[classA]
}
The same goes for the other tables and types. Is there any way to avoid this routine code - I mean an individual function for each table - and get by with just one? For example:
def getDataFromSource[T: Encoder](table_name: String): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}
Then create a list of pairs (table_name, type_name):
val tableTypePairs = List(("table_a", classA), ("table_b", classB), ...)
Then call it using foreach:
tableTypePairs.foreach(tupl => getDataFromSource[what should I put here?](tupl._1))
Thanks in advance!
Something like this should work
def getDataFromSource[T](table_name: String, encoder: Encoder[T]): Dataset[T] =
  spark.sql(s"SELECT * FROM $table_name").as(encoder)

val tableTypePairs = List(
  "table_a" -> implicitly[Encoder[classA]],
  "table_b" -> implicitly[Encoder[classB]]
)

tableTypePairs.foreach {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
Note that this is a case of discarding a value, which is a bit of a code smell. And since Encoder is invariant, tableTypePairs isn't going to have a very useful type, and neither would something like
tableTypePairs.map {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
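One detail to keep in mind, assuming classA and classB are plain case classes: for the implicitly[Encoder[...]] lookups to resolve, the Spark implicits need to be in scope, or the encoders can be built explicitly, e.g.
import spark.implicits._ // provides implicit Encoders for case classes ("spark" being your SparkSession)

// or, without relying on implicit resolution:
import org.apache.spark.sql.Encoders
val encA = Encoders.product[classA]
val encB = Encoders.product[classB]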
One option is to pass the Class to the method; this way the generic type T will be inferred:
def getDataFromSource[T: Encoder](table_name: String, clazz: Class[T]): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}
tableTypePairs.foreach { case (tableName, clazz) => getDataFromSource(tableName, clazz) }
But then I'm not sure how you'll be able to exploit this list of Datasets without .asInstanceOf.

Scala - how to filter a StructType with a list of StructField names?

I'm writing a method to parse a schema and want to filter the resulting StructType with a list of column names, which is a subset of the StructField names of the original schema.
As a result, if a flag isFilteringReq = true, I want to return a StructType containing only the StructFields with names from specialColumnNames, in the same order. If the flag is false, then return the original StructType.
val specialColumnNames = Seq("metric_1", "metric_2", "metric_3")
First I'm getting the original schema with pattern matching:
val customSchema: StructType = schemaType match {
  case "type_1" => getType1chema()
  case "type_2" => getType2chema()
}
There are two problems:
1 - I wasn't able to apply .filter() directly to customSchema right after the closing brace; I was getting a Cannot resolve symbol filter error. So I wrote a separate method makeCustomSchema, but I don't need a separate object. Is there a more elegant way to apply the filtering in this case?
2 - I could filter the originalStruct, but only with a single hardcoded column name. How should I pass specialColumnNames to contains()?
def makeCustomSchema(originalStruct: StructType, isFilteringReq: Boolean, updColumns: Seq[String]) =
  if (isFilteringReq) {
    originalStruct.filter(s => s.name.contains("metric_1"))
  } else {
    originalStruct
  }
val newSchema = makeCustomSchema(customSchema, isFilteringReq, specialColumnNames)
Instead of passing a Seq, pass a Set and you can filter on whether each field is in the set or not.
Also, I wouldn't use a flag; instead, you could pass an empty Set when there's no filtering, or use an Option[Set[String]].
Anyway, you could also use the copy method that comes for free with case classes.
Something like this should work.
def makeCustomSchema(originalStruct: StructType, updColumns: Set[String]): StructType = {
  updColumns match {
    case s if s.isEmpty => originalStruct
    case _ => originalStruct.copy(
      fields = originalStruct.fields.filter(
        f => updColumns.contains(f.name)))
  }
}
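A quick usage sketch with the values from the question (this is just the intended call shape):
// filtering requested: keep only the special columns
val filteredSchema = makeCustomSchema(customSchema, specialColumnNames.toSet)

// no filtering requested: an empty Set returns the original schema
val untouchedSchema = makeCustomSchema(customSchema, Set.empty[String])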
Usually you don't need to build structs like this; have you tried using the drop() method on DataFrame/Dataset?

How to use map / flatMap on a Scala Map?

I have two sequences, prices: Seq[Price] and overrides: Seq[Override]. I need to do some magic on them, but only for a subset based on a shared id.
So I grouped them both into a Map via groupBy:
val pricesById = prices.groupBy(_.someId) // Map[Int, Seq[Price]]
val overridesById = overrides.groupBy(_.someId) // Map[Int, Seq[Override]]
I expected to be able to create the sequence I want via flatMap:
val applyOverrides: (Int, Seq[Price]) => Seq[Price] = (someId, prices) => {
  val applicableOverrides = overridesById.getOrElse(someId, Seq())
  magicMethod(prices, applicableOverrides) // returns Seq[Price]
}
val myPrices: Seq[Price] = pricesById.flatMap(applyOverrides)
I expected myPrices to contain just one big Seq[Price].
Yet I get a weird type mismatch within the flatMap method, mentioning NonInferedB, that I am unable to resolve.
In Scala, iterating over a Map yields (key, value) tuples, not separate key and value arguments.
The function passed to flatMap hence expects only one parameter, namely the tuple (key, value), and not two parameters key and value.
Since you can access the first element of a tuple via _1, the second via _2 and so on, you can call your function like so:
val pricesWithMagicApplied = pricesById.flatMap(tuple =>
  applyOverrides(tuple._1, tuple._2)
)
Another approach is to use case matching:
val pricesWithMagicApplied: Seq[Price] = pricesById.flatMap {
  case (someId, prices) => applyOverrides(someId, prices)
}.toSeq
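To make the tuple point concrete, here is a self-contained toy version with simple types (all names are made up for illustration):
val pricesById    = Map(1 -> Seq(10.0, 20.0), 2 -> Seq(30.0))
val overridesById = Map(1 -> Seq(0.5))

// each Map element arrives as a (key, value) tuple and is destructured with `case`
val adjusted: Seq[Double] = pricesById.flatMap {
  case (id, prices) =>
    val factor = overridesById.getOrElse(id, Seq(1.0)).head
    prices.map(_ * factor)
}.toSeq
// adjusted contains 5.0, 10.0 and 30.0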

Spark (Scala) dataframes - Return list of words from a set that are found in a given string

I am applying a UDF to a column of strings in a Spark dataframe. The UDF iterates over a set of words and determines whether the given column string contains any of the words from the set (see below):
udf { (s: String) => words.value.exists(word => s.contains(word)) }
How would I need to alter this function so that it returns a list of all the items in the words set which are found in the string?
I have tried using when and otherwise:
udf { (s: String) => when(words.value.exists(word => s.contains(word)), word).otherwise(null) }
But I get a type mismatch, and anyway, I think this would only return the first match. I'm just learning Scala and Spark, so any suggestions are welcome.
The argument passed to the udf function you're using here should be a plain Scala function. Any use of SQL functions like when would return a Column object, which is not a valid return type for a UDF (UDFs should return types supported as data types in Spark DataFrames - primitives, arrays, maps, case classes etc.).
So, the implementation would simply be:
udf { (s: String) => words.value.filter(word => s.contains(word)) }
This creates a UDF with input type String and output type Seq[String], which means the resulting column will be an Array column.
For example:
val words = sc.broadcast(Seq("aaa", "bbb"))
val udf1 = udf { (s: String) => words.value.filter(word => s.contains(word)) }
Seq("aaabbbb", "bbb", "aabb").toDF("word").select(udf1($"word")).show()
// +----------+
// | UDF(word)|
// +----------+
// |[aaa, bbb]|
// |     [bbb]|
// |        []|
// +----------+
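One caveat, assuming the input column can contain nulls: the String argument then arrives as null inside the UDF and s.contains would throw, so a guard along these lines is probably worth adding:
val udf1 = udf { (s: String) =>
  if (s == null) Seq.empty[String]
  else words.value.filter(word => s.contains(word))
}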

Scala iterator on pattern match

I need help iterating this piece of code, written in Spark/Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The code is very simple: given a dataframe, it casts a column if its name matches a pattern, otherwise it selects the field as is.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark)

val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a dataframe, an Array[String] of columns (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same dataframe with the right cast in the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    case `colName` => df(colName).cast(typeName).as(colName) // backticks match the value of colName instead of binding a new variable
    case other => df(other)
  }
  df.select(colsCasted: _*)
}

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  // foldLeft: the first parameter is the accumulated DataFrame, the second is the current (name, type) pair
  colsWithTypes.foldLeft(df)((newDf, colAndType) => castOnce(newDf, colAndType._1, colAndType._2))
}
When you have a function that you just need to apply repeatedly to the same thing, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the dataframe each time and then applying the next pair to the resulting dataframe, etc.
Based on your edit I filled in the function above. I don't have a compiler so I'm not 100% sure it's correct. Having written it out I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the dataframe, it looks the type up in the Map.
If a type exists, we cast the column to that new type; if the column does not exist in the Map, we take the column from the DataFrame directly.
We then select these columns from the DataFrame.
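For example, usage could look something like this (the column and type names are just placeholders based on the question):
val casted = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))
casted.printSchema()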