I have the following code in PySpark which works fine.
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import udf, array
prod_cols = udf(lambda arr: float(arr[0]) * float(arr[1]), DoubleType())
finalDf = finalDf.withColumn('click_factor', prod_cols(array('rating', 'score')))
Now I tried similar code in Scala.
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf = finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
Somehow the second code doesn't give the right answers; it is always null or zero.
Can you help me get the right Scala code? Essentially I just need code to multiply two columns, considering there may be null values of score or rating.
Pass only non-null values to the UDF.
Change the code below
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
to
import org.apache.spark.sql.functions.{lit, udf, when}
import spark.implicits._ // for the $"..." column syntax

val prod_cols = udf((rating: Double, score: Double) => rating * score)

finalDf
  .withColumn("rating", $"rating".cast("double")) // ignore this line if the column is already double
  .withColumn("score", $"score".cast("double"))   // ignore this line if the column is already double
  .withColumn("cl_rate",
    when(
      $"rating".isNotNull && $"score".isNotNull,
      prod_cols($"rating", $"score")
    ).otherwise(lit(null).cast("double"))
  )
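By the way, a UDF is not strictly required here: multiplying the columns directly already yields null when either rating or score is null, because arithmetic on a null column propagates the null. A minimal sketch of that approach, using the same column names as above (finalDf2 is just a new name introduced here):
import org.apache.spark.sql.functions.col

// Null-safe by construction: if rating or score is null, the product is null.
val finalDf2 = finalDf
  .withColumn("cl_rate", col("rating").cast("double") * col("score").cast("double"))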
I created a dataset with the below schema:
org.apache.spark.sql.Dataset[Records] = [value: string, RowNo: int]
Here the value field is a fixed-length-position string which I would like to convert into individual columns, adding RowNo as the last column, using a UDF.
def ReadFixWidthFileWithRDD(SrcFileType: String, rdd: org.apache.spark.rdd.RDD[(String, String)], inputFileLength: Int = 6): DataFrame = {
  val postapendSchemaRowNo = StructType(Array(StructField("RowNo", StringType, true)))
  val inputLength = List(inputFileLength)
  val FileInfoList = FixWidth_Dictionary.get(SrcFileType).toList
  val fileSchema = FileInfoList(0)._1
  val fileColumnSize = FileInfoList(0)._2
  val fileSchemaWithFileName = StructType(fileSchema ++ postapendSchemaRowNo)
  val fileColumnSizeWithFileNameLength = fileColumnSize ::: inputLength
  val data = rdd
  val retDF = spark.createDataFrame(
    data.map { x => lsplit(fileColumnSizeWithFileNameLength, x._1 + x._2) },
    fileSchemaWithFileName)
  retDF
}
Now, in the above function, I want to use a Dataset instead of an RDD, as my RowNo is not displaying values beyond 99999.
Can someone suggest an alternative?
I found a solution.
I created a hash key with an associated sequence number in one dataframe, and associated the same hash key with the dataframe holding the split fixed-length columns.
I then joined the two on that key after splitting the fixed-length positions.
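A minimal sketch of that idea, assuming ds is the Dataset[Records] from above and that the fixed-width split is done with substring (the field names and widths below are hypothetical):
import org.apache.spark.sql.functions.{col, sha2, substring}

// Dataframe with the split fixed-width fields, keyed by a hash of the raw line.
// The widths (3 and 3) and the names field1/field2 are placeholders.
val splitDf = ds.toDF()
  .withColumn("key", sha2(col("value"), 256))
  .select(
    col("key"),
    substring(col("value"), 1, 3).as("field1"),
    substring(col("value"), 4, 3).as("field2"))

// Dataframe carrying the same hash key together with the sequence number.
val seqDf = ds.toDF()
  .withColumn("key", sha2(col("value"), 256))
  .select(col("key"), col("RowNo"))

// Join the two on the hash key to attach RowNo to the split columns.
val result = splitDf.join(seqDf, Seq("key")).drop("key")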
Description
When I try to select a column that is cast to unix_timestamp and then to timestamp from a dataframe, there is an AnalysisException. See the link below.
However, when I combine two columns, cast the combination to unix_timestamp and then to timestamp, and then select that from a df, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder()
  .appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String, time:String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13", "11:11:11")).toDF()
//solution.show()
//column modification
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
A Spark AnalysisException: can't find unix_timestamp(concat(....)) in mydf
Actual:
+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
|                                                2013-09-16 00:00:...|
+------------------------------------------------------------------+
These do not seem to be disparate cases. In the erroneous case, you had a new dataframe with a changed column name. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the select_df dataframe no longer has a column named date; its single column is named something like CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP), so year($"date") cannot be resolved.
In the case mentioned above, by contrast, you are just defining a new column when you write:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
You then use this column to select from your original dataframe, which is why it gives the expected result.
Hope this makes things clearer.
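For what it's worth, if you do want to call year() on the derived column, giving it an explicit alias avoids the resolution error. A small sketch continuing the session above (aliased_df is just a name introduced here):
import org.apache.spark.sql.functions.{unix_timestamp, year}
import org.apache.spark.sql.types.TimestampType

// Alias the derived column so later selects can refer to it as "date" again.
val aliased_df = mydf.select(
  unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).alias("date"))
aliased_df.select(year($"date")).show()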
I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column, then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) // trying to get this to work
I am trying to find a good way of doing this. I know that if the column names were strings I could do something like:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}

def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how, internally, Spark converts the head & tail into a list of Columns and calls select again.
So, in that case, if you want clearer code, I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
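As a usage example with the explode scenario from the question (the id, name, and items column names are made up for illustration):
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical schema: items is an array column to explode; id and name are kept as-is.
val columns = List(col("id"), col("name"), explode(col("items")).as("item"))
val result = tempDf.select(columns: _*)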
I have a Dataframe in which some columns are of type String and contain "NULL" as a string value (not as an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, these can be accessed by the prefix .na:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
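If only certain columns should be touched, replace() also accepts a sequence of column names instead of "*" (the column names below are placeholders):
// "colA" and "colB" stand in for the string columns that hold the literal "NULL".
val df2 = df.na.replace(Seq("colA", "colB"), Map("NULL" -> "0"))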
You could also create your own udf that replicates this behaviour:
import org.apache.spark.sql.functions.col
val nullReplacer = udf((x: String) => {
if (x == "NULL") "0"
else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this would be superfluous, since it does the same as the above at the cost of more lines of code than necessary.
I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD, and it contains only the transformed column.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
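For completeness, one possible shape of the getTimestamp function used above could be the following (only a sketch; the "yyyy-MM-dd HH:mm" input format is an assumption):
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Hypothetical implementation: parses "yyyy-MM-dd HH:mm" strings and returns null for blanks.
def getTimestamp: String => java.sql.Timestamp = { s =>
  if (s == null || s.trim.isEmpty) null
  else new Timestamp(new SimpleDateFormat("yyyy-MM-dd HH:mm").parse(s).getTime)
}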
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
.withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
.cast("timestamp"))
Note: For Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
.withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L)
.cast("timestamp"))
Edit: Added udf option