About a loss of precision when calculating an aggregate sum with DataFrames - Scala

I have a DataFrame with this kind of data:
unit,sensitivity currency,trading desk,portfolio,issuer,bucket,underlying,delta,converted sensitivity
ES,USD,EQ DERIVATIVES,ESEQRED_LH_MIDX,5GOY,5,repo,0.00002,0.00002
ES,USD,EQ DERIVATIVES,IND_GLOBAL1,no_localizado,8,repo,-0.16962,-0.15198
ES,EUR,EQ DERIVATIVES,ESEQ_UKFLOWN,IGN2,8,repo,-0.00253,-0.00253
ES,USD,EQ DERIVATIVES,BASKETS1,9YFV,5,spot,-1003.64501,-899.24586
and I have to do an aggregation operation over this data, doing something like this:
val filteredDF = myDF.filter("unit = 'ES' AND `trading desk` = 'EQ DERIVATIVES' AND issuer = '5GOY' AND bucket = 5 AND underlying = 'repo' AND portfolio ='ESEQRED_LH_MIDX'")
.groupBy("unit","trading desk","portfolio","issuer","bucket","underlying")
.agg(sum("converted_sensitivity"))
But I am seeing that I am losing precision in the aggregated sum, so how can I be sure that every value of "converted_sensitivity" is converted to a BigDecimal(25,5) before doing the sum operation over the new aggregated column?
Thank you very much.

To be sure of the conversion you can use the DecimalType in your DataFrame.
According to Spark documentation the DecimalType is:
The data type representing java.math.BigDecimal values. A Decimal that must have fixed precision (the maximum number of digits) and scale (the number of digits on right side of dot).
The precision can be up to 38, scale can also be up to 38 (less or equal to precision).
The default precision and scale is (10, 0).
To convert the data you can use the cast function of the Column object, like this:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.DecimalType

val filteredDF = myDF.filter("unit = 'ES' AND `trading desk` = 'EQ DERIVATIVES' AND issuer = '5GOY' AND bucket = 5 AND underlying = 'repo' AND portfolio = 'ESEQRED_LH_MIDX'")
  .withColumn("new_column_big_decimal", col("converted_sensitivity").cast(DecimalType(25, 5)))
  .groupBy("unit", "trading desk", "portfolio", "issuer", "bucket", "underlying")
  .agg(sum("new_column_big_decimal"))
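If you want to confirm that the cast actually took effect before aggregating, a quick sanity check (just a sketch, reusing the names from the snippet above, filter omitted for brevity) is to build the casted column first, inspect the schema, and then aggregate:
// Sketch: cast first, verify the type, then aggregate
val casted = myDF.withColumn("new_column_big_decimal", col("converted_sensitivity").cast(DecimalType(25, 5)))
casted.printSchema()  // new_column_big_decimal should appear as decimal(25,5)
val aggregated = casted
  .groupBy("unit", "trading desk", "portfolio", "issuer", "bucket", "underlying")
  .agg(sum("new_column_big_decimal").alias("sum_converted_sensitivity"))
Note that Spark widens the result type of a sum over a decimal column (decimal(35,5) here) so the aggregate itself does not overflow.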

Related

Pyspark dataframe filter based on in between values

I have a Pyspark dataframe with below values -
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the values from the DF which have a score between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 & 37. I achieved this as follows -
from pyspark.sql.functions import substring

input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. I was wondering whether there is a better way to do this to reduce the response time.
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor
output_matched = input_df.filter(floor(input_df.score_num) == input_score_value)
print(output_matched.take(5))
It should be much faster than the substring approach. Let me know.
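As a variant (just a sketch, assuming score is stored as a string as in the sample rows), you can skip the intermediate score_num column entirely and apply floor to the score column itself after casting it to double:
from pyspark.sql.functions import col, floor

input_score_value = 36
# filter directly on floor(score); no substring or helper column needed
output_matched = my_df.filter(floor(col("score").cast("double")) == input_score_value)
print(output_matched.take(5))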

PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by distance between user vectors in PySpark. I'm running this in Databricks on a Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3).
Following the code in the docs, I am using the approxSimilarityJoin() method of the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and I'm still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure whether changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)
# returns
# threshold of 100000 is clearly overkill
# A dataframe with dfA and dfB feature vectors and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0, numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results for the second half of the example above by increasing the bucketLength parameter value to 15. The threshold could then have been lowered, because the Euclidean distance was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
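For reference, a sketch of the adjustment described above (same dfC/dfD as in the question; 15 is simply the value that worked here, not a general recommendation):
# larger bucketLength so the two vectors can land in the same hash bucket
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)
# with the distance around ~34, a much smaller threshold than 100000 is enough
model.approxSimilarityJoin(dfC, dfD, 50, distCol="EuclideanDistance").show()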

Scala: Converting a Double from Scientific Notation without losing precision?

I'd like to convert a double such as 1.1231053E7 to 11,231,053.0 in Scala. Currently the way I am converting doubles is f"$number", where number is a double value. Unfortunately this just gives me a string containing 1.1231053E7.
I can convert it out of scientific notation using NumberFormat or DecimalFormat, but these also force me to choose a predetermined precision. I want flexible precision. So...
val number1 = 1.2313215
val number2 = 100
val number4 = 3.333E2
... when converted should be...
1.2313215
100
333.3
Currently DecimalFormat makes me choose the precision during construction, like so: new DecimalFormat("##.##"). Each # after the . allows one more optional digit after the decimal point.
If I use f"$number", it treats the decimal points correctly but, like I said before, it is unable to handle the scientific notation.
Just decide how many places after the . you might ever need and write the pattern with #s so that trailing zeros are hidden:
val fmt = new java.text.DecimalFormat("#,##0.##############")
for (x <- List[Double](1.2313215, 100, 3.333E2)) println(fmt.format(x))
prints:
1.2313215
100
333.3
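Applied to the value from the question (just a sketch; note that # suppresses trailing zeros):
// 1.1231053E7 prints without scientific notation
println(fmt.format(1.1231053E7))   // 11,231,053
// If you always want at least one digit after the decimal point (11,231,053.0),
// put a 0 right after the dot in the pattern, e.g. "#,##0.0#############"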

Spark case class - decimal type encoder error "Cannot up cast from decimal"

I'm extracting data from MySQL/MariaDB, and during creation of a Dataset an error occurs with the data types:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast AMOUNT from decimal(30,6) to decimal(38,18) as it may
truncate The type path of the target object is:
- field (class: "org.apache.spark.sql.types.Decimal", name: "AMOUNT")
- root class: "com.misp.spark.Deal" You can either add an explicit cast to the input data or choose a higher precision type of the field
in the target object;
The case class is defined like this (the name Deal is taken from the root class in the error message above):
case class Deal
(
  AMOUNT: Decimal
)
Does anyone know how to fix this without touching the database?
That error says that Apache Spark can't automatically convert the BigDecimal(30,6) from the database to the BigDecimal(38,18) wanted in the Dataset (I don't know why it needs the fixed parameters 38,18, and it is even stranger that Spark can't automatically convert a type with low precision to a type with high precision).
A bug was reported for this: https://issues.apache.org/jira/browse/SPARK-20162 (maybe it was you). Anyway, I found a good workaround: read the data by casting the columns to BigDecimal(38,18) in the DataFrame and then cast the DataFrame to the Dataset.
// first read the data into a DataFrame in any way suitable for you
var df: DataFrame = ???
val dfSchema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

dfSchema.foreach { field =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // leave non-decimal columns (and decimals that already match) unchanged
  }
}

df.as[YourCaseClassWithBigDecimal]
It should solve problems with reading (but not with writing I guess)
As was previously stated, since your DB uses DecimalType(30,6), you have 30 slots total and 6 slots to the right of the decimal point, which leaves 30-6=24 for the area in front of the decimal point. I like to call it a (24 left, 6 right) big decimal. This of course does not fit into a (20 left, 18 right) (i.e. DecimalType(38,18)), since the latter does not have enough slots on the left (20 vs. the 24 needed). We only have 20 left slots in a DecimalType(38,18), but we need 24 left slots to accommodate your DecimalType(30,6).
What we can do here is down-cast the (24 left, 6 right) into a (20 left, 6 right) (i.e. DecimalType(26,6)) so that when it is auto-cast to a (20 left, 18 right) (i.e. DecimalType(38,18)) both sides will fit. Your DecimalType(26,6) will have 20 left slots, allowing it to fit inside a DecimalType(38,18), and of course the 6 right slots will fit into the 18.
The way you do that is, before converting anything to a Dataset, to run the following operation on the DataFrame:
val downCastableData =
originalData.withColumn("amount", $"amount".cast(DecimalType(26,6)))
Then converting to Dataset should work.
(Actually, you can cast to anything that's (20 left, 6 right) or less e.g. (19 left, 5 right) etc...).
While I don't have a solution here is my understanding of what is going on:
By default Spark will infer the schema of the Decimal type (or BigDecimal) in a case class to be DecimalType(38, 18) (see org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT). The 38 means the Decimal can hold 38 digits total (for both left and right of the decimal point), while the 18 means 18 of those 38 digits are reserved for the right of the decimal point. That means a Decimal(38, 18) may have 20 digits to the left of the decimal point. Your MySQL schema is decimal(30, 6), which means it may contain values with 24 digits (30 - 6) to the left of the decimal point and 6 digits to the right of the decimal point. Since 24 digits is greater than 20 digits, there could be values that are truncated when converting from your MySQL schema to that Decimal type.
Unfortunately, inferring schema from a Scala case class is considered a convenience by the Spark developers, and they have chosen not to support allowing the programmer to specify precision and scale for Decimal or BigDecimal types within the case class (see https://issues.apache.org/jira/browse/SPARK-18484).
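You can see this default for yourself; a minimal sketch (e.g. in a spark-shell session):
import org.apache.spark.sql.types.DecimalType

// the type Spark infers for Decimal/BigDecimal fields of a case class
DecimalType.SYSTEM_DEFAULT   // decimal(38,18): 38 - 18 = 20 digits left of the point
DecimalType(30, 6)           // the MySQL column: 30 - 6 = 24 digits left of the point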
Building on #user2737635's answer, you can use a foldLeft rather than foreach to avoid defining your dataset as a var and redefining it:
// first read the data into a DataFrame in any way suitable for you
val df: DataFrame = ???
val dfSchema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

dfSchema.foldLeft(df) { (dataframe, field) =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      dataframe.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => dataframe
  }
}.as[YourCaseClassWithBigDecimal]
We are working on a workaround by defining our own Encoder, which we use at the .as call site. We generate the Encoder from the StructType, which knows the correct precision and scales (see the link below for the code).
https://issues.apache.org/jira/browse/SPARK-27339
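The linked issue has the full code; as a very rough sketch of the idea (this uses Spark's RowEncoder and therefore gives you a Dataset[Row] carrying the exact precision, not a Dataset of your case class, so treat it as an illustration only):
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// a schema that carries the real precision/scale instead of the (38,18) default
val schema = StructType(Seq(StructField("AMOUNT", DecimalType(30, 6))))
// df: the DataFrame read from MySQL/MariaDB
val ds = df.as[Row](RowEncoder(schema))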
According to the PySpark docs, Decimal(38,18) is the default:
When create a DecimalType, the default precision and scale is (10, 0). When infer schema from decimal.Decimal objects, it will be DecimalType(38, 18).

Round Down Double in Spark

I have some Cassandra data of type double that I need to round down to 1 decimal place in Spark.
The problem is how to extract it from Cassandra, convert it to a decimal, round it down to 1 decimal place and then write it back to a table in Cassandra. My rounding code is as follows:
BigDecimal(number).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
This works great if the number going in is a decimal, but I don't know how to convert the double to a decimal before rounding. My double needs to be divided by 1000000 prior to rounding.
For example, 510999000 would be 510.999 before being rounded down to 510.9.
EDIT: I was able to get it to do what I wanted with the following command.
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
Not sure how good this is but it works.
Great answers, guys. Just chiming in with other ways to do the same:
1. If using Spark DataFrames ( x and y are DataFrames ):
import org.apache.spark.sql.functions.round
val y = x.withColumn("col1", round($"col1", 3))
2. If working on the underlying RDD: val y = x.rdd.map( r => (r.getDouble(0) * 1000).round / 1000.toDouble )
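Since the question asks for rounding down (Spark's round is half-up), a DataFrame-only variant is also possible with floor; a sketch, assuming the column is called value and holds positive doubles:
import org.apache.spark.sql.functions.{col, floor}

// divide by 1,000,000 and keep 1 decimal place, truncating: 510999000 -> 510.9
val rounded = x.withColumn("value", floor(col("value") / 1000000 * 10) / 10)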
The answer I was able to work with was:
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble