I am trying to define a udf with the function getTime in Spark Scala, but I am getting the error error: illegal start of declaration. What might be wrong in the syntax? The udf should return the date, and if there is a parse exception it should return some error string instead of null.
def getTime=udf((x:String) : java.sql.Timestamp => {
  if (x.toString() == "") return null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
    val d = format.parse(x.toString());
    val t = new Timestamp(d.getTime());
    return t
  }
})
Thank you!
The return type for the udf is derived and should not be specified. Change the first line of code to:
def getTime = udf((x: String) => {
  // your code
})
This should get rid of the error.
The following is a fully working example written in functional style, making use of Scala constructs:
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for createDataset, .as[String] and $"..."

val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

// ******** HERE is the udf completely re-written: **********
val f = udf((input: String) => {
  Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})

val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
This is the output:
+-------------------+--------------------+
| value| parsedTimestamp|
+-------------------+--------------------+
| | null|
| null| null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
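The question also asks to return some error string, rather than null, when parsing throws. A minimal sketch of that variant, building on the code above; note the column then has to be a String rather than a Timestamp, and the "PARSE ERROR" marker is my own assumption:
import scala.util.Try
import org.apache.spark.sql.functions.udf

// The column becomes String because a udf has a single return type;
// "PARSE ERROR" is an assumed placeholder, not part of the original answer.
val fWithError = udf((input: String) => {
  Option(input)
    .filter(_.nonEmpty)
    .map(str => Try(new Timestamp(fmt.parse(str).getTime).toString).getOrElse("PARSE ERROR"))
    .orNull
})

val ds3 = ds.withColumn("parsedTimestamp", fWithError($"value"))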
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
  // your code here
}
You can easily do it this way:
def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
  if (timeAsString.isEmpty)
    null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    val date = format.parse(timeAsString)
    val time = new Timestamp(date.getTime())
    time
  }
}
val getTimeUdf = udf(getTimeFunction _)
Then use getTimeUdf accordingly.
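For example (a sketch; "df" and "timeString" are illustrative names only, not from the question):
import org.apache.spark.sql.functions.col

// "timeString" stands for whatever column holds the time strings.
val withTime = df.withColumn("time", getTimeUdf(col("timeString")))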
Related
I use IntelliJ IDEA to execute the code shown below. The content of df is the following:
+------+------+
|nodeId| p_i|
+------+------+
| 26|0.6914|
| 29|0.6914|
| 474| 0.0|
| 65|0.4898|
| 191|0.4445|
| 418|0.4445|
I get Task serialization error at line result.show() when I run this code:
class MyUtils extends Serializable {
  def calculate(spark: SparkSession, df: DataFrame): DataFrame = {
    def myFunc(a: Double): String = {
      var result: String = "-"
      if (a > 1) {
        result = "A"
      }
      return result
    }

    val myFuncUdf = udf(myFunc _)
    val result = df.withColumn("role", myFuncUdf(df("a")))
    result.show()
    result
  }
}
Why do I get this error?
Update:
This is how I run the code:
object Processor extends App {
  // ...
  val mu = new MyUtils()
  var result = mu.calculate(spark, df)
}
I had to add extends Serializable to the definition of the class MyUtils.
I have a JSON like below
{"name":"method1","parameter1":"P1name","parameter2": 1.0}
I am loading my JSON file
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("C:/Users/test/Desktop/te.txt")
scala> df.show()
+-------+----------+----------+
| name|parameter1|parameter2|
+-------+----------+----------+
|method1| P1name| 1.0 |
+-------+----------+----------+
I have a function like below:
def method1(P1: String, P2: Double) = {
  print(P1)
  print(P2)
}
I am calling method1 based on the column name. After executing the code below, it should execute method1:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
But I am getting the below error:
<console>:63: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how to convert the org.apache.spark.sql.ColumnName data type to String.
When you pass arguments as
method1($"parameter1",$"parameter2")
you are passing columns to the function, not primitive datatypes. So I would suggest changing your method1 and method2 into udf functions if you want to apply primitive-datatype manipulations inside them. A udf function has to return a value for each row of the new column.
import org.apache.spark.sql.functions._

def method1 = udf((P1: String, P2: Double) => {
  print(P1)
  print(P2)
  P1 + P2
})

def method2 = udf((P1: String, P2: Double) => {
  print(P1)
  print(P2)
  P1 + P2
})
Then your withColumn call should work properly:
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
Note: udf functions perform data serialization and deserialization in order to process column values row-wise, which adds complexity and memory usage. Built-in Spark functions should be used as much as possible.
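For instance, the P1 + P2 concatenation in the udfs above could be expressed with built-in functions alone (a sketch, assuming the same df and columns):
import org.apache.spark.sql.functions.{col, concat, when}

// Same P1 + P2 result as the udf above, but computed with built-in
// expressions, so no row-wise serialization/deserialization is needed.
df.withColumn("methodCalling",
    when(col("name") === "method1",
      concat(col("parameter1"), col("parameter2").cast("string"))))
  .show(false)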
You can try it like this:
scala> def method1(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 0
| }
scala> def method2(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 1
| }
df.withColumn("methodCalling", when($"name" === "method1", method1(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head))
.otherwise(when($"name" === "method2", method2(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head)))).show
//output
P1name
1.0
+-------+----------+----------+-------------+
| name|parameter1|parameter2|methodCalling|
+-------+----------+----------+-------------+
|method1| P1name| 1.0| 0|
+-------+----------+----------+-------------+
You have to return something from your method; otherwise it will return Unit and you will get an error after it prints the result:
java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.functions$.when(functions.scala:1245)
... 50 elided
Thanks.
I think you just want to read the JSON and, based on that, call the methods.
Since you have already created a dataframe, you can do something like :
df.map(row => (row.getString(0), row.getString(1), row.getDouble(2))).collect
  .foreach { x =>
    x._1.trim.toLowerCase match {
      case "method1" => method1(x._2, x._3)
      //case "method2" => method2(x._2, x._3)
      //case _ => methodn(x._2, x._3)
    }
  }
// Output : P1name1.0
// Because you used `print` and not `println` ;)
I am new to Spark and Scala, and now I'm stuck on a problem: how to handle each field of a row by its field name, and then put the result into a new RDD.
This is my pseudo code:
val newRdd = df.rdd.map(x => {
  def Random1 => random(1, 10000)     //pseudo
  def Random2 => random(10000, 20000) //pseudo
  x.schema.map(y => {
    if (y.name == "XXX1")
      x.getAs[y.dataType](y.name)) = Random1
    else if (y.name == "XXX2")
      x.getAs[y.dataType](y.name)) = Random2
    else
      x.getAs[y.dataType](y.name)) //pseudo, keep the same
  })
})
There are at least 2 errors in the above:
in the second map, "x.getAs" is a syntax error
how to put the result into a new RDD
I have been searching for a long time on the net, but with no luck. Please help or give some ideas on how to achieve this.
Thanks Ramesh Maharjan, it works now.
def randomString(len: Int): String = {
  val rand = new scala.util.Random(System.nanoTime)
  val sb = new StringBuilder(len)
  val ab = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  for (i <- 0 until len) {
    sb.append(ab(rand.nextInt(ab.length)))
  }
  sb.toString
}

def testUdf = udf((value: String) => randomString(2))
val df = sqlContext.createDataFrame(Seq((1,"Android"), (2, "iPhone")))
df.withColumn("_2", testUdf(df("_2")))
+---+---+
| _1| _2|
+---+---+
| 1| F3|
| 2| Ag|
+---+---+
If you are intending to select only certain fields ("XXX1", "XXX2"), then the simple select function should do the trick:
df.select("XXX1", "XXX2")
and convert that to an RDD.
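For example:
// Keep only the needed fields, then drop down to an RDD[Row].
val selectedRdd = df.select("XXX1", "XXX2").rdd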
If you are intending something else, then your x.getAs should look like below:
val random1 = x.getAs(y.name)
It seems that you are trying to change values in the columns "XXX1" and "XXX2".
For that, a simple udf function and withColumn should do the trick.
A simple udf function would be as below:
def testUdf = udf((value: String) => {
  // do your logic here; what you return from here becomes the new value of the column
})
And you can call the udf function as:
df.withColumn("XXX1", testUdf(df("XXX1")))
Similarly, you can do it for "XXX2".
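For example, applying the same udf to both columns (a sketch reusing the testUdf defined above):
// Chain withColumn calls to replace the values of both columns.
val updated = df.withColumn("XXX1", testUdf(df("XXX1")))
  .withColumn("XXX2", testUdf(df("XXX2")))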
I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works, but it is not very optimized, I think. I actually do recursive joins on the initial DataFrame to replace the NUMs with ENDs. Here is the code:
case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()

  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )

  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))

  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")

    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))

    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))

    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))

    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")

    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate to ask.
I have a CSV in which one field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp, so I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString() == "")
    return null
  else {
    val d = format.parse(x.toString())
    val t = new Timestamp(d.getTime())
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this, with the Dataframe API or spark-sql? The above method requires the creation of an RDD and to give the schema for the Dataframe again.
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions which were introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
you can use unix_timestamp to parse the strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and HiveContext.
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (null usage is not considered good practice):
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.{Try, Success, Failure}

def getTimestamp(s: String): Option[Timestamp] = s match {
  case "" => None
  case _ => {
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}
Please note that I assume you know the Row element types beforehand (if you read them from a CSV file, they are all String); that's why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row(3).toString)))
I have ISO8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}

object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}
sqlContext.udf.register("formatTimeStamp", (isoTimestamp : String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your Spark SQL query.
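For example (a sketch; the table and column names below are purely illustrative assumptions):
// Hypothetical table "events" with an ISO8601 string column "event_time".
sqlContext.sql("SELECT formatTimeStamp(event_time) AS event_date FROM events").show()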
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would like to move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow among the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/csv-file")

val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
  new Iterator[Row] {
    val row = new GenericMutableRow(4)
    var current: Array[String] = _

    def hasNext = iter.hasNext

    def next() = {
      current = iter.next()
      row(0) = current(0)
      row(1) = current(1)
      row(2) = current(2)

      val ts = getTimestamp(current(3))
      if (ts != null) {
        row.update(3, ts)
      } else {
        row.setNullAt(3)
      }
      row
    }
  }
}
And you should still use the schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
The usage of GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._

val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df: DataFrame = new CsvParser()
  .withDelimiter('|')
  .withInferSchema(true)
  .withParseMode("DROPMALFORMED")
  .csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp and then casting back as a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
  val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
  newDf.withColumn(col, conversionFunc)
})
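For completeness, a hypothetical setup for the snippet above ("cols" and the column names are illustrative assumptions, not part of the original answer):
import org.apache.spark.sql.functions.to_timestamp

// Hypothetical: the caller decides which string columns should be converted.
val cols = Seq("created_at", "updated_at")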