Spark Decimal Precision and Scale seems wrong when Casting - scala

Reading the documentation, a Spark DataType BigDecimal(precision, scale) means that
Precision is total number of digits and
Scale is the number of digits after the decimal point.
So when I cast a value to decimal
scala> val sss = """select cast(1.7142857343 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(1.7142857343 AS DECIMAL(9,8))|
+----------------------------------+
| 1.71428573| // It has 8 decimal digits
+----------------------------------+
But when I cast values above 10.0, I get NULL
scala> val sss = """select cast(12.345678901 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(12.345678901 AS DECIMAL(9,8))|
+----------------------------------+
| null|
+----------------------------------+
I would expect the result would be 12.3456789,
Why is it NULL?
Why is it that Precision is not being implemented?

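decimal(9,8) reserves 8 of its 9 digits for the fractional part, which leaves only one digit (precision minus scale) for the integer part. The literal 12.345678901 is typed as decimal(11,9) and needs two integer digits, so it cannot be represented as decimal(9,8). To see how Spark decides whether one decimal type can hold another without losing precision or range, have a look at org.apache.spark.sql.types.DecimalType.isWiderThan()
Because the value overflows the target type, the cast returns null (the default, non-ANSI behaviour) instead of dropping integer digits. A target type that leaves enough room for the integer part works: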
//MAX_PRECISION = 38
val sss = """select cast(12.345678901 as decimal(38,7))"""
spark.sql(sss).show(10)
+-----------------------------------+
|CAST(12.345678901 AS DECIMAL(38,7))|
+-----------------------------------+
+-----------------------------------+
| 12.3456789|
+-----------------------------------+
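The "integer digits = precision - scale" rule can be checked outside Spark with plain java.math.BigDecimal. This is only a rough sketch of what the cast does, not Spark's actual code; the helper name castToDecimal is made up for illustration. It rounds to the target scale and then checks whether the total digit count still fits the target precision.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical helper, not a Spark API: mimics a cast to decimal(precision, scale).
// Returns None where Spark (with default settings) would return null.
def castToDecimal(value: JBigDecimal, precision: Int, scale: Int): Option[JBigDecimal] = {
  // Round to the requested number of fractional digits first.
  val rounded = value.setScale(scale, RoundingMode.HALF_UP)
  // precision counts every digit of the unscaled value, integer and fractional.
  if (rounded.precision <= precision) Some(rounded) else None
}

println(castToDecimal(new JBigDecimal("1.7142857343"), 9, 8))  // fits: one integer digit
println(castToDecimal(new JBigDecimal("12.345678901"), 9, 8))  // None: needs two integer digits
println(castToDecimal(new JBigDecimal("12.345678901"), 38, 7)) // fits
```

So to keep 8 fractional digits for values up to 99.99999999, the target would need to be at least decimal(10,8).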

Related

Spark Scala - Winsorize DataFrame columns within groups

I am pre-processing data for machine learning inputs, a target value column, call it "price" has many outliers, and rather than winsorizing price over the whole set I want to winsorize within groups labeled by "product_category". There are other features, product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups. Nor does it support window partitions.
df_data
.groupBy("product_category")
.approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute say the p01 and p99 within groups of a spark dataframe, for the purpose of replacing values beyond that range, ie winsorizing?
My dataset schema can be imagined like this, and it's over 20MM rows with approx. 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+--------+---------+
| item | product_category | price | pr_winz |
+---------+------------------+--------+---------+
| I000001 | XX11 | 1.99 | 5.00 |
| I000002 | XX11 | 59.99 | 59.99 |
| I000003 | XX11 |1359.00 | 850.00 |
+---------+------------------+--------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category
Here is what I came up with, after struggling with the documentation (there are two functions approx_percentile and percentile_approx that apparently do the same thing).
I was not able to figure out how to implement this except as a Spark SQL expression, and I'm not sure exactly why grouping only works there. I suspect it's because it's part of Hive?
Spark DataFrame Winsorizor
Tested on DF in 10 to 100MM rows range
// Winsorize function, groupable by columns list
// low/hi element of [0,1]
// precision: integer in [1, 1E7-ish], in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
import org.apache.spark.sql.functions.expr // needed for expr(...) below
def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
df.createOrReplaceTempView("df_table")
spark.sql(s"""
select distinct
*
, percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
, percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
from df_table
""")
.withColumn(winz_col + "_winz", expr(s"""
case when $winz_col <= p_low then p_low
when $winz_col >= p_hi then p_hi
else $winz_col end"""))
.drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by 'product_category' column.
val df_winsorized = grouped_winzo(
df_data
, "price"
, "product_category"
, 0.01, 0.99, 1000)
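For comparison, here is a sketch of the same idea with the DataFrame API: compute both percentiles once per group with groupBy, then join the bounds back and clamp. It is an alternative under stated assumptions, not the method above. The names winsorizeByGroup, pct, p_low and p_hi are made up; the value column is assumed to be numeric (double); the inner join drops rows whose group column is null. Unlike the version above it keeps the original column and does not use select distinct, so genuinely duplicate rows are preserved.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr, greatest, least}

// Hypothetical helper: winsorize `valueCol` within each `groupCol` group.
def winsorizeByGroup(df: DataFrame, valueCol: String, groupCol: String,
                     low: Double, hi: Double, accuracy: Int): DataFrame = {
  // One row per group; percentile_approx returns both bounds as a two-element array.
  val bounds = df.groupBy(groupCol)
    .agg(expr(
      s"percentile_approx($valueCol, array(cast($low as double), cast($hi as double)), $accuracy)"
    ).as("pct"))
    .select(col(groupCol), col("pct").getItem(0).as("p_low"), col("pct").getItem(1).as("p_hi"))

  // Join the per-group bounds back and clamp each value into [p_low, p_hi].
  df.join(bounds, Seq(groupCol))
    .withColumn(valueCol + "_winz",
      least(greatest(col(valueCol), col("p_low")), col("p_hi")))
    .drop("p_low", "p_hi")
}
```

Usage would look like winsorizeByGroup(df_data, "price", "product_category", 0.01, 0.99, 1000). Whether this is faster than the window version on 20MM rows is something to measure rather than assume.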

Spark decimal type precision loss

I'm doing some testing of spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations but the example below is not reassuring of that. Can anyone tell me why this is happening with spark sql? Currently on version 2.3.0
val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show
This returns
+----------------+
| val|
+----------------+
|0.33333300000000|
+----------------+
This is a current open issue, see SPARK-27089. The suggested workaround is to adjust the setting below. I validated that the SQL statement works as expected with this setting set to false.
spark.sql.decimalOperations.allowPrecisionLoss=false
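To see where the six digits come from, here is a small sketch of the result-type arithmetic for decimal division as I understand it from Spark's DecimalPrecision rules (Spark 2.3 and later, with allowPrecisionLoss left at its default of true). The function and constant names are mine, and this covers only the default path, not the allowPrecisionLoss=false one.

```scala
// Assumed constants, taken from my reading of Spark's DecimalType.
val MaxPrecision = 38
val MinAdjustedScale = 6

// When the ideal type exceeds 38 digits, keep the integer digits and
// shrink the scale, but never below min(scale, 6).
def adjust(precision: Int, scale: Int): (Int, Int) =
  if (precision <= MaxPrecision) (precision, scale)
  else {
    val intDigits = precision - scale
    val minScale = math.min(scale, MinAdjustedScale)
    (MaxPrecision, math.max(MaxPrecision - intDigits, minScale))
  }

// Result type of decimal(p1,s1) / decimal(p2,s2).
def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(MinAdjustedScale, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  adjust(precision, scale)
}

println(divideResultType(38, 14, 38, 14)) // (38,6)
println(divideResultType(10, 2, 10, 2))   // (23,13)
```

Under these rules decimal(38,14) / decimal(38,14) is computed as decimal(38,6), so the quotient is already 0.333333 before the outer cast to decimal(38,14) pads it with zeros. Declaring the operands with a smaller precision leaves more room for the scale.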
Use BigDecimal to avoid precision loss. See Double vs. BigDecimal?
example:
scala> val df = Seq(BigDecimal("0.03"),BigDecimal("8.20"),BigDecimal("0.02")).toDS
df: org.apache.spark.sql.Dataset[scala.math.BigDecimal] = [value: decimal(38,18)]
scala> df.select($"value").show
+--------------------+
| value|
+--------------------+
|0.030000000000000000|
|8.200000000000000000|
|0.020000000000000000|
+--------------------+
Using BigDecimal:
scala> df.select($"value" + BigDecimal("0.1")).show
+-------------------+
| (value + 0.1)|
+-------------------+
|0.13000000000000000|
|8.30000000000000000|
|0.12000000000000000|
+-------------------+
If you don't use BigDecimal, there will be a loss in precision. In this case 0.1 is a double, so the whole expression is evaluated as a double:
scala> df.select($"value" + lit(0.1)).show
+-------------------+
| (value + 0.1)|
+-------------------+
| 0.13|
| 8.299999999999999|
|0.12000000000000001|
+-------------------+
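The same effect can be reproduced without Spark, which shows it comes from binary floating point and not from the DataFrame API. A minimal sketch in plain Scala (the value names are only for illustration):

```scala
// Doubles are binary fractions, so 8.2 and 0.1 are both stored inexactly.
val asDouble: Double = 8.2 + 0.1

// BigDecimal built from strings keeps the decimal digits exactly.
val asDecimal: BigDecimal = BigDecimal("8.20") + BigDecimal("0.1")

println(asDouble)  // 8.299999999999999
println(asDecimal) // 8.30
```

Note that the BigDecimal values are constructed from strings; building them from a double literal would carry the double's rounding error along.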

How to merge two or more columns into one?

I have a streaming Dataframe that I want to calculate min and avg over some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
| 1   | 2   |
+-----+-----+
| 24  | 55  |
| 20  | 51  |
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+-----------+-----------+
| result(1)| result(2)|
+-----------+-----------+
|20 ,22 | 51,53 |
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
  struct(min(col(name)), avg(col(name))) as s"result($name)") // <-- struct is the change
The power of struct can be seen when you want to reference one field in the struct and you can use the name (not index).
q.select("structCol.name")
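A sketch of what that looks like for this question, assuming a small static DataFrame in place of the streaming one. The field aliases min and avg and the column names result_1 / result_2 are my own choice (a name without parentheses avoids backtick quoting when selecting a nested field).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, min, struct}

// In spark-shell `spark` already exists and this line can be dropped.
val spark = SparkSession.builder().master("local[1]").appName("struct-sketch").getOrCreate()
import spark.implicits._

val df = Seq((24, 55), (20, 51)).toDF("1", "2")

// Name the struct fields so they can be referenced later by name.
val res = List("1", "2").map { name =>
  struct(min(col(name)).as("min"), avg(col(name)).as("avg")).as(s"result_$name")
}

val out = df.agg(res.head, res.tail: _*)

// Pull individual fields back out of the struct column by name.
out.select(col("result_1.min"), col("result_1.avg")).show()
```

With an array, both values are coerced to one element type and can only be addressed by position; a struct keeps each field's own type and name.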
What you want to do is to merge the values of multiple columns together in a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

How to calculate difference between date column and current date?

I am trying to calculate the Date Diff between a column field and current date of the system.
Here is my sample code, where I have hard-coded my column field as 20170126.
val currentDate = java.time.LocalDate.now
var datediff = spark.sqlContext.sql("""Select datediff(to_date('$currentDate'),to_date(DATE_FORMAT(CAST(unix_timestamp( cast('20170126' as String), 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'))) AS GAP
""")
datediff.show()
Output is like:
+----+
| GAP|
+----+
|null|
+----+
I need to calculate actual Gap Between the two dates but getting NULL.
You have not defined the type and format of "column field" so I assume it's a string in the (not-very-pleasant) format yyyyMMdd.
val records = Seq((0, "20170126")).toDF("id", "date")
scala> records.show
+---+--------+
| id| date|
+---+--------+
| 0|20170126|
+---+--------+
scala> records
.withColumn("year", substring($"date", 0, 4))
.withColumn("month", substring($"date", 5, 2))
.withColumn("day", substring($"date", 7, 2))
.withColumn("d", concat_ws("-", $"year", $"month", $"day"))
.select($"id", $"d" cast "date")
.withColumn("datediff", datediff(current_date(), $"d"))
.show
+---+----------+--------+
| id| d|datediff|
+---+----------+--------+
| 0|2017-01-26| 83|
+---+----------+--------+
PROTIP: Read up on functions object.
Caveats
cast
Please note that I could not convince Spark SQL to cast the column "date" to DateType given the rules in DateTimeUtils.stringToDate:
yyyy
yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d (followed by a trailing space)
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
date_format
I could not convince date_format to work either so I parsed "date" column myself using substring and concat_ws functions.
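Two further notes, offered as a sketch and not as a definitive fix. First, the query in the question uses '$currentDate' inside a plain triple-quoted string with no s prefix, so it looks like Spark receives the literal text $currentDate, and to_date of that is null; that alone would explain the null GAP. Second, on Spark 2.2 and later, to_date accepts a format pattern, which avoids the substring parsing. The example below assumes the same records DataFrame as above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// In spark-shell `spark` already exists and this line can be dropped.
val spark = SparkSession.builder().master("local[1]").appName("datediff-sketch").getOrCreate()
import spark.implicits._

val records = Seq((0, "20170126")).toDF("id", "date")

val withGap = records
  // Parse the yyyyMMdd string straight into a DateType column (Spark 2.2+).
  .withColumn("d", to_date(col("date"), "yyyyMMdd"))
  // Days between today and the parsed date.
  .withColumn("datediff", datediff(current_date(), col("d")))

withGap.show()
```

On older versions, unix_timestamp(col("date"), "yyyyMMdd").cast("timestamp").cast("date") should give the same DateType column.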

SparkSQL function require type Decimal

I designed the following function to work with arrays of any numeric type:
def array_sum[T](item:Traversable[T])(implicit n:Numeric[T]) = item.sum
// Registers a function as a UDF so it can be used in SQL statements.
sqlContext.udf.register("array_sumD", array_sum(_:Seq[Float]))
But passing an array of floats gives me the following error:
// Now we can use our function directly in SparkSQL.
sqlContext.sql("SELECT array_sumD(array(5.0,1.0,2.0)) as array_sum").show
Error:
cannot resolve 'UDF(array(5.0,1.0,2.0))' due to data type mismatch: argument 1 requires array<double> type, however, 'array(5.0,1.0,2.0)' is of array<decimal(2,1)> type;
Default data type for decimal values in Spark-SQL is, well, decimal. If you cast your literals in the query into floats, and use the same UDF, it works:
sqlContext.sql(
"""SELECT array_sumD(array(
| CAST(5.0 AS FLOAT),
| CAST(1.0 AS FLOAT),
| CAST(2.0 AS FLOAT)
|)) as array_sum""".stripMargin).show
The result, as expected:
+---------+
|array_sum|
+---------+
| 8.0|
+---------+
Alternatively, if you do want to use decimals (to avoid floating point issues), you'll still have to use casting to get the right precision, plus you won't be able to use Scala's nice Numeric and sum, as decimals are read as java.math.BigDecimal. So - your code would be:
def array_sum(item:Traversable[java.math.BigDecimal]) = item.reduce((a, b) => a.add(b))
// Registers a function as a UDF so it can be used in SQL statements.
sqlContext.udf.register("array_sumD", array_sum(_:Seq[java.math.BigDecimal]))
sqlContext.sql(
"""SELECT array_sumD(array(
| CAST(5.0 AS DECIMAL(38,18)),
| CAST(1.0 AS DECIMAL(38,18)),
| CAST(2.0 AS DECIMAL(38,18))
|)) as array_sum""".stripMargin).show
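One caveat with reduce is that it throws on an empty array. A small variation, shown here in plain Scala outside Spark, folds from zero instead; the name arraySumDecimal is only illustrative and the function can be registered as a UDF in the same way as above.

```scala
import java.math.{BigDecimal => JBigDecimal}

// Sum java.math.BigDecimal values exactly; an empty input yields 0 instead of throwing.
def arraySumDecimal(items: Seq[JBigDecimal]): JBigDecimal =
  items.foldLeft(JBigDecimal.ZERO)((acc, x) => acc.add(x))

val total = arraySumDecimal(Seq(new JBigDecimal("5.0"), new JBigDecimal("1.0"), new JBigDecimal("2.0")))
println(total) // 8.0
```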