Change schema of Spark Dataframe - scala

I have a DataFrame[SimpleType]. SimpleType is a class with 16 fields, and I need to turn it into a DataFrame[ComplexType].
I only have the schema of ComplexType (it has more than 400 fields); there is no case class for it. I know which fields need to be mapped (though not how to map them from DataFrame[SimpleType] to DataFrame[ComplexType]); the remaining fields should be left as nulls. Does anyone know the most efficient way to do this?
Thanks
Edit:
class SimpleType {
  field1
  field2
  field3
  field4
  ...
  field16
}
I have a DataFrame that contains this simple type, and I have the schema of the complex type.
I want to convert this DataFrame[SimpleType] into a DataFrame[ComplexType].

It's quite simple:
// function to get field names from a case class
import scala.reflect.runtime.universe._

def classAccessors[T: TypeTag]: List[String] = typeOf[T].members.collect {
  case m: MethodSymbol if m.isCaseAccessor => m
}.toList.map(_.name.toString)

// take existing columns from simpleDF and fill the remaining complex-type columns with nulls
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._ // spark is the active SparkSession; provides the Encoder behind .as[ComplexType]

val typeComplexFields = classAccessors[ComplexType]

val newDataFrame = simpleDF
  .select(typeComplexFields
    .map(c => if (simpleDF.columns.contains(c)) col(c) else lit(null).as(c)): _*)
  .as[ComplexType]
Credit also goes to the author of Scala. Get field names list from case class; I copied their function for getting field names, with some modifications.
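To make the moving parts concrete, here is a minimal sketch with hypothetical two- and four-field case classes standing in for the real 16- and 400-field types (spark is an active SparkSession, and the classAccessors helper is the one above):

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

case class SimpleType(field1: String, field2: Int)
case class ComplexType(field1: String, field2: Int, field3: Option[String], field4: Option[Long])

val simpleDF = Seq(SimpleType("a", 1), SimpleType("b", 2)).toDF()

val complexDS = simpleDF
  .select(classAccessors[ComplexType]
    .map(c => if (simpleDF.columns.contains(c)) col(c) else lit(null).as(c)): _*)
  .as[ComplexType]
// field1 and field2 are carried over; field3 and field4 come back as nulls (None).
// Depending on the Spark version, the missing columns may need an explicit lit(null).cast(...) to the target type.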

Related

Scala - how to filter a StructType with a list of StructField names?

I'm writing a method to parse a schema, and I want to filter the resulting StructType with a list of column names, which is a subset of the StructField names of the original schema.
As a result, if the flag isFilteringReq = true, I want to return a StructType containing only the StructFields whose names are in specialColumnNames, in the same order. If the flag is false, return the original StructType.
val specialColumnNames = Seq("metric_1", "metric_2", "metric_3")
First I'm getting an original schema with pattern-matching.
val customSchema: StructType = schemaType match {
  case "type_1" => getType1chema()
  case "type_2" => getType2chema()
}
There are two problems:
1 - I wasn't able to apply .filter() directly to customSchema right after the closing brace; I get a Cannot resolve symbol filter error. So I wrote a separate method, makeCustomSchema, but I don't really want a separate method. Is there a more elegant way to apply the filtering in this case?
2 - I could filter originalStruct, but only with a single hardcoded column name. How should I pass specialColumnNames to contains()?
def makeCustomSchema(originalStruct: StructType, isFilteringReq: Boolean, updColumns: Seq[String]) =
  if (isFilteringReq) {
    originalStruct.filter(s => s.name.contains("metric_1"))
  } else {
    originalStruct
  }
val newSchema = makeCustomSchema(customSchema, isFilteringReq, specialColumnNames)
Instead of passing a Seq, pass a Set, and then you can filter on whether the field is in the set or not.
Also, I wouldn't use a flag; instead, you could pass an empty Set when there's no filtering, or use an Option[Set[String]].
Anyway, you could also use the copy method that comes for free with case classes.
Something like this should work.
def makeCustomSchema(originalStruct: StructType, updColumns: Set[String]): StructType = {
  updColumns match {
    case s if s.isEmpty => originalStruct
    case _ => originalStruct.copy(
      fields = originalStruct.fields.filter(f => updColumns.contains(f.name)))
  }
}
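For example, with a hypothetical schema (the field names and types below are made up for illustration), the two branches behave like this:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("metric_1", DoubleType),
  StructField("metric_2", DoubleType),
  StructField("other", StringType)
))

makeCustomSchema(schema, Set.empty)
// -> the original StructType, unchanged

makeCustomSchema(schema, Set("metric_1", "metric_2", "metric_3"))
// -> keeps only metric_1 and metric_2, in the original schema order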
Usually you don't need to build structs like this; have you tried using the drop() method on DataFrame/Dataset?

How to get datatype of column in spark dataframe dynamically

I have a DataFrame and converted its dtypes to a map:
val dfTypesMap: Map[String, String] = df.dtypes.toMap
Output:
(PRODUCT_ID,StringType)
(PRODUCT_ID_BSTP_MAP,MapType(StringType,IntegerType,false))
(PRODUCT_ID_CAT_MAP,MapType(StringType,StringType,true))
(PRODUCT_ID_FETR_MAP_END_FR,ArrayType(StringType,true))
When I hardcode the type [String] in row.getAs[String], there is no compilation error.
df.foreach(row => {
  val prdValue = row.getAs[String]("PRODUCT_ID")
})
I want to iterate over the map dfTypesMap above and get each corresponding value type. Is there any way to convert the column DataTypes to general Scala types like below?
StringType --> String
MapType(StringType,IntegerType,false) ---> Map[String,Int]
MapType(StringType,StringType,true) ---> Map[String,String]
ArrayType(StringType,true) ---> List[String]
As mentioned, Datasets make it easier to work with types.
A Dataset is basically a collection of strongly typed JVM objects.
You can map your data to case classes like so:
case class Foo(PRODUCT_ID: String, PRODUCT_NAME: String)
val ds: Dataset[Foo] = df.as[Foo]
Then you can safely operate on your typed objects. In your case you could do
ds.foreach(foo => {
  val prdValue = foo.PRODUCT_ID
})
For more on Datasets, check out
https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets
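As a sketch of how the requested mappings fall out of a Dataset (the Product case class is hypothetical, df is the DataFrame from the question, and spark is the active SparkSession): StringType becomes String, MapType(StringType, IntegerType) becomes Map[String, Int], and ArrayType(StringType) comes back as a Seq[String] rather than a List[String].

case class Product(
  PRODUCT_ID: String,                       // StringType
  PRODUCT_ID_BSTP_MAP: Map[String, Int],    // MapType(StringType, IntegerType, false)
  PRODUCT_ID_CAT_MAP: Map[String, String],  // MapType(StringType, StringType, true)
  PRODUCT_ID_FETR_MAP_END_FR: Seq[String]   // ArrayType(StringType, true)
)

import spark.implicits._

val ds = df.as[Product]
ds.foreach { p =>
  val prdValue: String = p.PRODUCT_ID                       // no getAs[String] or runtime casts needed
  val bstp: Option[Int] = p.PRODUCT_ID_BSTP_MAP.get("key")  // already an ordinary Scala Map
}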

Slick: how to implement find by example i.e. findByExample generically?

I'm exploring different possibilities for implementing a generic DAO with the latest Slick 3.1.1 to boost productivity, and yes, there is a need for it: basing the service layer of my Play web application on TableQuery alone leads to a lot of boilerplate code. One of the methods I'd like to feature in my generic DAO implementation is findByExample, which is possible in JPA with the help of the Criteria API. In my case, I'm using the Slick code generator to generate the model classes from a SQL script.
To be able to access the attribute names dynamically, I need the following, taken from Scala. Get field names list from case class:
import scala.reflect.runtime.universe._
def classAccessors[T: TypeTag]: List[MethodSymbol] = typeOf[T].members.collect {
case m: MethodSymbol if m.isCaseAccessor => m
}.toList
A draft implementation for findByExample would be:
def findByExample[T, R](example: R): Future[Seq[R]] = {
  var qt = TableQuery[T].result
  val accessors = classAccessors[R]
  (0 until example.productArity).map { i =>
    example.productElement(i) match {
      case None => // ignore
      case 0 => // ignore
      // ... some more default values => // ignore
      // handle a populated case
      case Some(x) =>
        val columnName = accessors(i)
        qt = qt.filter(_.columnByName(columnName) == x)
    }
  }
  qt.result
}
But this doesn't work, because I need better Scala kung fu. T is the entity table type and R is the row type, which is generated as a case class and is therefore a valid Scala Product type.
The first problem with that code is that it would be too inefficient, because instead of doing e.g.
qt.filter(_.firstName === "Juan" && _.streetName === "Rosedale Ave." && _.streetNumber === 5)
it is doing:
// find all
var qt = TableQuery[T].result
// then filter by each column at the time
qt = qt.filter(_.firstName === "Juan")
qt = qt.filter(_.streetName === "Rosedale Ave.")
qt = qt.filter(_.streetNumber === 5)
Second I can't see how to dynamically access the column name in the filter method i.e.
qt.filter(_.firstName == "Juan")
I need to instead have
qt.filter(_.columnByName("firstName") == "Juan")
but apparently there is no such possibility while using the filter function?
Probably the best way to implement filtering and sorting by dynamically provided column names would be either plain SQL or extending the code generator to generate extension methods, something like this:
implicit class DynamicPersonQueries[C[_]](q: Query[PersonTable, PersonRow, C]) {
  def dynamicFilter(column: String, value: String) = column match {
    case "firstName"    => q.filter(_.firstName === value)
    case "streetNumber" => q.filter(_.streetNumber === value.toInt)
    ...
  }
}
You might have to fiddle with the types a bit to get it to compile (and ideally update this post afterwards :)).
You can then filter by all the provided values like this:
val examples: Map[String, String] = ...
val t = TableQuery[PersonTable]
val query = examples.foldLeft(t) { case (t, (column, value)) => t.dynamicFilter(column, value) }
query.result
Extending the code generator is explained here: http://slick.lightbend.com/doc/3.1.1/code-generation.html#customization
After further research I found the following blog post: Repository Pattern / Generic DAO Implementation.
There they declare and implement a generic filter method that works for any model entity type, and it is therefore, in my view, a valid functional replacement for JPA's findByExample.
i.e.
T <: Table[E] with IdentifyableTable[PK]
E <: Entity[PK]
PK: BaseColumnType
def filter[C <: Rep[_]](expr: T => C)(implicit wt: CanBeQueryCondition[C]) : Query[T, E, Seq] = tableQuery.filter(expr)
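As a hedged, self-contained sketch of how that generic filter can stand in for findByExample (the Person* names, the DAO class, and the H2 driver are hypothetical here, not taken from the blog post):

import slick.driver.H2Driver.api._
import slick.lifted.CanBeQueryCondition

case class PersonRow(id: Int, firstName: String, streetNumber: Int)

class PersonTable(tag: Tag) extends Table[PersonRow](tag, "PERSON") {
  def id = column[Int]("ID", O.PrimaryKey)
  def firstName = column[String]("FIRST_NAME")
  def streetNumber = column[Int]("STREET_NUMBER")
  def * = (id, firstName, streetNumber) <> (PersonRow.tupled, PersonRow.unapply)
}

// the generic DAO exposes a single filter that accepts any Slick condition
class PersonDao {
  val tableQuery = TableQuery[PersonTable]
  def filter[C <: Rep[_]](expr: PersonTable => C)(implicit wt: CanBeQueryCondition[C]): Query[PersonTable, PersonRow, Seq] =
    tableQuery.filter(expr)
}

// "find by example" then becomes an ordinary, type-safe filter call:
val dao = new PersonDao
val byExample = dao.filter(p => p.firstName === "Juan" && p.streetNumber === 5)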

Unable to use collectAsMap() in scala code

val titleMap = movies.map(line => line.split("\\|")).take(2)
// converting movie id and movie name into (key, value) pairs
val title1 = titleMap.map(array => (array(0).toInt, array(1)))
val titles = movies.map(line => line.split("\\|").take(2))
  .map(array => (array(0).toInt, array(1)))
  .collectAsMap()
What's wrong here with title1? I am unable to apply the collectAsMap function to it, while the same thing works in the case of titles.
title1 is not an RDD: take(2) on an RDD returns a plain Array on the driver, and mapping over that Array gives another Array, so there is no collectAsMap() method on it.
titles is an RDD of pairs, so it does have the method collectAsMap().
I'd advise reading up on types: https://en.wikipedia.org/wiki/Type_safety, https://en.wikipedia.org/wiki/Type_system
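A sketch that makes the difference visible in the types (assuming movies is an RDD[String] of pipe-delimited lines):

// take(2) on the RDD pulls two rows to the driver as a plain Array,
// so everything after it is a local Array, not an RDD:
val titleMap: Array[Array[String]] = movies.map(_.split("\\|")).take(2)
val title1: Array[(Int, String)] = titleMap.map(a => (a(0).toInt, a(1)))   // no collectAsMap() here

// keeping take(2) inside the map (on the split Array) leaves the data as an RDD of pairs,
// where PairRDDFunctions provides collectAsMap():
val titles: scala.collection.Map[Int, String] = movies
  .map(_.split("\\|").take(2))
  .map(a => (a(0).toInt, a(1)))
  .collectAsMap()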

Access database column names from a Table?

Let's say I have a table:
object Suppliers extends Table[(Int, String, String, String)]("SUPPLIERS") {
  def id = column[Int]("SUP_ID", O.PrimaryKey)
  def name = column[String]("SUP_NAME")
  def state = column[String]("STATE")
  def zip = column[String]("ZIP")
  def * = id ~ name ~ state ~ zip
}
Table's database name
The table's database name can be accessed by going: Suppliers.tableName
This is supported by the Scaladoc on AbstractTable.
For example, the above table's database name is "SUPPLIERS".
Columns' database names
Looking through AbstractTable, getLinearizedNodes and indexes looked promising. No column names in their string representations though.
I assume that * means "all the columns I'm usually interested in." * is a MappedProjection, which has this signature:
final case class MappedProjection[T, P <: Product](
child: Node,
f: (P) ⇒ T,
g: (T) ⇒ Option[P])(proj: Projection[P])
extends ColumnBase[T] with UnaryNode with Product with Serializable
*.getLinearizedNodes contains a huge sequence of numbers, and I realized that at this point I'm just doing a brute force inspection of everything in the API for possibly finding the column names in the String.
Has anybody also encountered this problem before, or could anybody give me a better understanding of how MappedProjection works?
It requires you to rely on Slick internals, which may change between versions, but it is possible. Here is how it works for Slick 1.0.1: you have to go via the FieldSymbol. Then you can extract the information you want, the same way columnInfo(driver: JdbcDriver, column: FieldSymbol): ColumnInfo does it.
To get a FieldSymbol from a Column you can use fieldSym(node: Node): Option[FieldSymbol] and fieldSym(column: Column[_]): FieldSymbol.
To get the (qualified) column names you can simply do the following:
Suppliers.id.toString
Suppliers.name.toString
Suppliers.state.toString
Suppliers.zip.toString
It's not explicitly stated anywhere that the toString will yield the column name, so your question is a valid one.
Now, if you want to programmatically get all the column names, then that's a bit harder. You could try using reflection to get all the methods that return a Column[_] and call toString on them, but it wouldn't be elegant. Or you could hack a bit and get a select * SQL statement from a query like this:
val selectStatement = DB withSession {
  Query(Suppliers).selectStatement
}
And then parse out the column names.
This is the best I could do. If someone knows a better way then please share - I'm interested too ;)
The code is based on the Lightbend Activator template "slick-http-app".
Slick version: 3.1.1
I added this method to the BaseDal:
def getColumns(): mutable.Map[String, Type] = {
  val columns = mutable.Map.empty[String, Type]

  def selectType(t: Any): Option[Any] = t match {
    case t: TableExpansion => Some(t.columns)
    case t: Select => Some(t.field)
    case _ => None
  }
  def selectArray(t: Any): Option[ConstArray[Node]] = t match {
    case t: TypeMapping => Some(t.child.children)
    case _ => None
  }
  def selectFieldSymbol(t: Any): Option[FieldSymbol] = t match {
    case t: FieldSymbol => Some(t)
    case _ => None
  }

  val t = selectType(tableQ.toNode)
  val c = selectArray(t.get)

  for (se <- c.get) {
    val col = selectType(se)
    val fs = selectFieldSymbol(col.get)
    columns += (fs.get.name -> fs.get.tpe)
  }
  columns
}
This method gets the column names (the real names in the DB) and their types from the TableQuery (tableQ).
The imports used are:
import scala.collection.mutable
import slick.ast._
import slick.util.ConstArray
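A hypothetical usage sketch, assuming dal is an instance of the BaseDal subclass that wraps the TableQuery (tableQ):

val columns: mutable.Map[String, Type] = dal.getColumns()
columns.foreach { case (name, tpe) =>
  println(s"$name -> $tpe")   // each real DB column name together with its Slick AST type
}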