Updating a column with a function in spark scala - scala

I have a column in my database called id, which contains INTs.
e.g.:
{id: 123456}
{id: 234567}
{id: 345678}
{id: 456789}
{id: 567890}
and I need to update these values with their encrypted values by calling a function encryptId(id). encryptId takes in a LONG and returns a STRING value.
My thought process is to use .withColumn to replace the current id column with the encrypted value.
db.withColumn("id", encryptId(col("id"))) gives me the error
type mismatch. Required: Long, found: column
db.withColumn("id", encryptId("id")) gives me the error
type mismatch. Required: Long, found: string
Am I doing this incorrectly? :(

It seems that you didn't register encryptId as a Spark UDF. Let's assume that encryptId is defined as follows:
val encryptId = (id: Long) => {
// dummy implementation for simplicity
id.toString
}
You can register encryptId as UDF:
import org.apache.spark.sql.functions.udf
val encryptIdUdf = udf(encryptId)
Now you can use encryptIdUdf as follows:
db.withColumn("id", encryptIdUdf(col("id")))

Related

How to make spark udf accept a list with different data types?

My underlying function is defined like this:
def rowToSHA1(s: Seq[Any]): String = {
  // return sha1 of sequence
}
Here is the definition of my udf:
val toSha = udf[String, Seq[Any]](rowToSHA1)
df.withColumn("shavalue",(toSha(array($"id",$"name",$"description",$"accepted")))
It works when i pass only a list of string as parameter but i get an error when there is a boolean.
org.apache.spark.sql.AnalysisException: cannot resolve 'array(`id`, `name`,
`description`, `accepted`)' due to data type mismatch: input to function
array should all be the same type, but it's [string, string, string,
boolean];;
I'm exploring the use of a generic function, is it a good idea?
FIX: converted my column to string before applying the function
df.withColumn("shavalue",(toSha(array($"id",$"name",$"description",$"accepted".cast("string)))
The best solution I know for this kind of situation is to just convert everything to String. When you read/create the DataFrame, make sure everything is a String, or convert it at some point. Later you can convert it back to any other type.
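A rough sketch of that approach, assuming df and the toSha UDF defined above:
import org.apache.spark.sql.functions.{array, col}
// cast every column to string first, then hash the whole row
val allAsString = df.columns.map(c => col(c).cast("string"))
val withSha = df.withColumn("shavalue", toSha(array(allAsString: _*)))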

Flink Scala - Extending WindowFunction

I am trying to figure out how to write my own WindowFunction, but I am having issues and cannot figure out why. The problem is with the apply function, as it does not recognize MyWindowFunction as a valid input, so I cannot compile. The data I am streaming contains (timestamp, x, y), where x and y are 0 and 1 for testing. extractTupleWithoutTs simply returns a tuple (x, y). I have been running the code with simple sum and reduce functions with success. Grateful for any help :) Using Flink 1.3.
Imports:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala._ // for StreamExecutionEnvironment and implicit TypeInformation
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time // for Time.seconds
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
Rest of the code:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val text = env.socketTextStream("localhost", 9999).assignTimestampsAndWatermarks(new TsExtractor)
val tuple = text.map( str => extractTupleWithoutTs(str))
val counts = tuple.keyBy(0).timeWindow(Time.seconds(5)).apply(new MyWindowFunction())
counts.print()
env.execute("Window Stream")
My WindowFunction, which is basically copy-pasted from an example with the types changed:
class MyWindowFunction extends WindowFunction[(Int, Int), Int, Int, TimeWindow] {
def apply(key: Int, window: TimeWindow, input: Iterable[(Int, Int)], out: Collector[Int]): Unit = {
var count = 0
for (in <- input) {
count = count + 1
}
out.collect(count)
}
}
The problem is the third type parameter of the WindowFunction, i.e., the type of the key. The key is declared with an index in the keyBy method (keyBy(0)). Therefore, the type of the key cannot be determined at compile time. The same problem arises if you declare the key as a string, i.e., keyBy("f0").
There are two options to resolve this:
Use a KeySelector function in keyBy to extract the key (something like keyBy(_._1)). The return type of the KeySelector function is known at compile time such that you can use a correctly typed WindowFunction with an Int key.
Change the type of the third type parameter of the WindowFunction to org.apache.flink.api.java.tuple.Tuple, i.e., WindowFunction[(Int, Int), Int, org.apache.flink.api.java.tuple.Tuple, TimeWindow]. Tuple is a generic holder for the keys extracted by keyBy. In your case it will be a org.apache.flink.api.java.tuple.Tuple1. In WindowFunction.apply() you can cast Tuple to Tuple1 and access the key field by Tuple1.f0.
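A minimal sketch of option 1, reusing the pipeline and helpers (TsExtractor, extractTupleWithoutTs) from the question:
// keyBy with a KeySelector, so the key type (Int) is known at compile time
val counts = text
  .map(str => extractTupleWithoutTs(str))
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .apply(new MyWindowFunction()) // WindowFunction[(Int, Int), Int, Int, TimeWindow] now matches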

Spark cast column to sql type stored in string

The simple request is: I need help adding a column to a DataFrame, but the column has to be empty, its type is from ...spark.sql.types, and the type has to be defined from a string.
I can probably do this with ifs or a case match, but I'm looking for something more elegant, something that does not require writing a case for every type in org.apache.spark.sql.types.
If I do this for example:
df = df.withColumn("col_name", lit(null).cast(org.apache.spark.sql.types.StringType))
It works as intended, but I have the type stored as a string,
var the_type = "StringType"
or
var the_type = "org.apache.spark.sql.types.StringType"
and I can't get it to work by defining the type from the string.
For those interested here are some more details: I have a set containing tuples (col_name, col_type) both as strings and I need to add columns with the correct types for a future union between 2 dataframes.
I currently have this:
for (i <- set_of_col_type_tuples) yield {
  val the_type = Class.forName("org.apache.spark.sql.types." + i._2)
  df = df.withColumn(i._1, lit(null).cast(the_type))
  df
}
if I use
val the_type = Class.forName("org.apache.spark.sql.types."+i._2)
I get
error: overloaded method value cast with alternatives: (to: String)org.apache.spark.sql.Column <and> (to: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column cannot be applied to (Class[?0])
if I use
val the_type = Class.forName("org.apache.spark.sql.types."+i._2).getName()
It's a string so I get:
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '.' expecting {<EOF>, '('}(line 1, pos 3)
== SQL == org.apache.spark.sql.types.StringType
---^^^
EDIT: So, just to be clear, the set contains tuples like this ("col1","IntegerType"), ("col2","StringType") not ("col1","int"), ("col2","string"). A simple cast(i._2) does not work.
Thank you.
You can use the overloaded cast method, which takes a String as an argument:
val stringType : String = ...
column.cast(stringType)
def cast(to: String): Column
Casts the column to a different data type, using the canonical string
representation of the type.
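For example, with the canonical (lowercase) type names the SQL parser accepts, reusing df and col_name from the question:
import org.apache.spark.sql.functions.lit
// "string", "int", "double", ... are the canonical names cast(String) understands
val withEmpty = df.withColumn("col_name", lit(null).cast("string"))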
You can also scan for all Data Types:
val types = classOf[DataTypes]
.getDeclaredFields()
.filter(f => java.lang.reflect.Modifier.isStatic(f.getModifiers()))
.map(f => f.get(new DataTypes()).asInstanceOf[DataType])
Now types is an Array[DataType]. You can translate it to a Map:
val typeMap = types.map(t => (t.getClass.getSimpleName.replace("$", ""), t)).toMap
and use it in code:
column.cast(typeMap(yourType))
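Tying this back to the question's set of ("col1", "IntegerType")-style tuples, a sketch assuming set_of_col_type_tuples and typeMap as defined above:
import org.apache.spark.sql.functions.lit
// getSimpleName yields names like "IntegerType", matching the strings stored in the tuples
for ((colName, typeName) <- set_of_col_type_tuples) {
  df = df.withColumn(colName, lit(null).cast(typeMap(typeName)))
}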

Explicit cast reading .csv with case class Spark 2.1.0

I have the following case class:
case class OrderDetails(OrderID : String, ProductID : String, UnitPrice : Double,
Qty : Int, Discount : Double)
I am trying read this csv: https://github.com/xsankar/fdps-v3/blob/master/data/NW-Order-Details.csv
This is my code:
val spark = SparkSession.builder.master(sparkMaster).appName(sparkAppName).getOrCreate()
import spark.implicits._
val orderDetails = spark.read.option("header","true").csv( inputFiles + "NW-Order-Details.csv").as[OrderDetails]
And the error is:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast `UnitPrice` from string to double as it may truncate
The type path of the target object is:
- field (class: "scala.Double", name: "UnitPrice")
- root class: "es.own3dh2so4.OrderDetails"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Why can it not be transformed if all the fields are double values? What am I not understanding?
Spark version 2.1.0, Scala version 2.11.7
You just need to explicitly cast your field to a Double:
import org.apache.spark.sql.types.DoubleType
val orderDetails = spark.read
  .option("header", "true")
  .csv(inputFiles + "NW-Order-Details.csv")
  .withColumn("UnitPrice", 'UnitPrice.cast(DoubleType))
  .as[OrderDetails]
On a side note, by Scala (and Java) convention, your case class constructor parameters should be lower camel case:
case class OrderDetails(orderID: String,
productID: String,
unitPrice: Double,
qty: Int,
discount: Double)
If we want to change the data type for multiple columns, using the withColumn option will look ugly.
The better way is to apply a schema to the data:
Get the case class schema using Encoders, as shown below:
val caseClassSchema = Encoders.product[CaseClass].schema
Apply this schema while reading the data:
val data = spark.read.schema(caseClassSchema)
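Applied to the question's OrderDetails case class, a sketch reusing inputFiles from the question:
import org.apache.spark.sql.Encoders
val orderSchema = Encoders.product[OrderDetails].schema
val orderDetails = spark.read
  .option("header", "true")
  .schema(orderSchema)
  .csv(inputFiles + "NW-Order-Details.csv")
  .as[OrderDetails]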

Spark Dataset and java.sql.Date

Let's say I have a Spark Dataset like this:
scala> import java.sql.Date
scala> case class Event(id: Int, date: Date, name: String)
scala> val ds = Seq(Event(1, Date.valueOf("2016-08-01"), "ev1"), Event(2, Date.valueOf("2018-08-02"), "ev2")).toDS
I want to create a new Dataset with only the name and date fields. As far as I can see, I can either use ds.select() with TypedColumn or I can use ds.select() with Column and then convert the DataFrame to Dataset.
However, I can't get the former option working with the Date type. For example:
scala> ds.select($"name".as[String], $"date".as[Date])
<console>:31: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
ds.select($"name".as[String], $"date".as[Date])
^
The latter option works:
scala> ds.select($"name", $"date").as[(String, Date)]
res2: org.apache.spark.sql.Dataset[(String, java.sql.Date)] = [name: string, date: date]
Is there a way to select Date fields from Dataset without going to DataFrame and back?
Been bashing my head against problems like these for the whole day. I think you can solve your problem with one line:
implicit val e: Encoder[(String, Date)] = org.apache.spark.sql.Encoders.kryo[(String,Date)]
At least that has been working for me.
EDIT
In these cases, the problem is that for most Dataset operations, Spark 2 requires an Encoder that stores schema information (presumably for optimizations). The schema information takes the form of an implicit parameter (and a bunch of Dataset operations have this sort of implicit parameter).
In this case, the OP found the correct encoder for java.sql.Date, so the following works:
implicit val e = org.apache.spark.sql.Encoders.DATE
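With that implicit in scope, the typed select from the question compiles (a sketch, assuming spark.implicits._ is imported as in the spark-shell session above):
// $"date".as[java.sql.Date] now resolves the implicit DATE encoder
val namesAndDates = ds.select($"name".as[String], $"date".as[java.sql.Date])
// namesAndDates: org.apache.spark.sql.Dataset[(String, java.sql.Date)]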