Handling decimal values in Spark Scala

I have data in a file as shown below:
7373743343333444.
7373743343333432.
This data should be converted to decimal values laid out as 8.7, where 8 is the number of digits before the decimal point and 7 is the number of digits after it.
I am trying to read the data file as below:
val readDataFile = Initialize.spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").schema(***SCHEMA*****).load(****DATA FILE PATH******)
I have tried this:
val changed = dataFileWithSchema.withColumn("COLUMN NAME", dataFileWithSchema.col("COLUMN NAME").cast(new DecimalType(38,3)))
changed.show(5)
but it only gives me zeros at the end of the number, like this:
7373743343333444.0000
But I want the digits formatted as described above, how can I achieve this?

A simple combination of the regexp_replace, trim and format_number built-in functions should get you what you desire:
import org.apache.spark.sql.functions._
df.withColumn("column", regexp_replace(format_number(trim(regexp_replace(col("column"), "\\.", "")).cast("long")/100000000, 7), ",", ""))

Divide the column by 10^8; this moves the decimal point 8 steps to the left. After that, cast to DecimalType to get the correct number of decimals. Since there are 16 digits to begin with, the last one is removed by the cast.
df.withColumn("col", (col("col").cast(DoubleType)/math.pow(10,8)).cast(DecimalType(38,7)))

Related

pyspark sql to_timestamp function formatting

I have a dataframe in pyspark that has columns time1 and time2. They both appear as strings, like the below:
Time1 Time2
1990-03-18 22:50:09.693159 2022-04-23 17:30:22-07:00
1990-03-19 22:57:09.433159 2022-04-23 16:11:12-06:00
1990-03-20 22:04:09.437359 2022-04-23 17:56:33-05:00
I am trying to convert these into timestamps (preferably UTC).
I am trying the below code:
Newtime1 = Function.to_timestamp(Function.col('time1'),'yyyy-MM-dd HH:mm:ss.SSSSSS')
Newtime2 = Function.to_timestamp(Function.col('time2'),'yyyy-MM-dd HH:mm:ss Z')
When applying to a dataframe like below:
mydataframe = mydataframe.withColumn('time1',Newtime1)
mydataframe = mydataframe.withColumn('time2',Newtime2)
This yields 'None' in the resulting data. How can I get the desired timestamps?
The format for the timezone is a little tricky. Read the docs carefully:
"The count of pattern letters determines the format."
And there is a difference between X, x and Z.
...
Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the hour, such as ‘+01’, unless the minute is non-zero in which case the minute is also output, such as ‘+0130’. Two letters outputs the hour and minute, without a colon, such as ‘+0130’. Three letters outputs the hour and minute, with a colon, such as ‘+01:30’. Four letters outputs the hour and minute and optional second, without a colon, such as ‘+013015’. Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’. Six or more letters will fail. Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: This formats the offset based on the number of pattern letters. One, two or three letters outputs the hour and minute, without a colon, such as ‘+0130’. The output will be ‘+0000’ when the offset is zero. Four letters outputs the full form of localized offset, equivalent to four letters of Offset-O. The output will be the corresponding localized offset text if the offset is zero. Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. Six or more letters will fail.
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.createDataFrame([
... ('1990-03-18 22:50:09.693159', '2022-04-23 17:30:22-07:00'),
... ('1990-03-19 22:57:09.433159', '2022-04-23 16:11:12Z'),
... ('1990-03-20 22:04:09.437359', '2022-04-23 17:56:33+00:00')
... ],
... ('time1', 'time2')
... )
>>>
>>> df2 = (df
... .withColumn('t1', F.to_timestamp(df.time1, 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
... .withColumn('t2_lower_xxx', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssxxx'))
... .withColumn('t2_upper_XXX', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssXXX'))
... .withColumn('t2_ZZZZZ', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssZZZZZ'))
... )
>>>
>>> df2.select('time2', 't2_lower_xxx', 't2_upper_XXX', 't2_ZZZZZ', 'time1', 't1').show(truncate=False)
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|time2                    |t2_lower_xxx       |t2_upper_XXX       |t2_ZZZZZ           |time1                     |t1                        |
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|2022-04-23 17:30:22-07:00|2022-04-23 19:30:22|2022-04-23 19:30:22|2022-04-23 19:30:22|1990-03-18 22:50:09.693159|1990-03-18 22:50:09.693159|
|2022-04-23 16:11:12Z     |null               |2022-04-23 11:11:12|2022-04-23 11:11:12|1990-03-19 22:57:09.433159|1990-03-19 22:57:09.433159|
|2022-04-23 17:56:33+00:00|2022-04-23 12:56:33|2022-04-23 12:56:33|2022-04-23 12:56:33|1990-03-20 22:04:09.437359|1990-03-20 22:04:09.437359|
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
>>>
For column 'time2' the pattern should therefore be:
yyyy-MM-dd HH:mm:ssxxx
Tested in PySpark v3.2.3; both columns parse correctly after making the above change.
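Since the main question in this thread is Scala, a rough Scala equivalent of the same fix (only a sketch; mydataframe and the column names are taken from the question above):
import org.apache.spark.sql.functions.{col, to_timestamp}

// xxx parses "+HH:MM"-style offsets; use XXX instead if some values end in a literal "Z"
val parsed = mydataframe
  .withColumn("time1", to_timestamp(col("time1"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .withColumn("time2", to_timestamp(col("time2"), "yyyy-MM-dd HH:mm:ssxxx"))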

Scala: Converting a Double from Scientific Notation without losing precision?

I'd like to convert a double such as 1.1231053E7 to 11,231,053.0 in Scala. Currently I convert doubles with f"$number", where number is a double value. Unfortunately this just gives me a string containing 1.1231053E7.
I can convert it out of scientific notation using NumberFormat or DecimalFormat but these also force me to choose a predetermined precision. I want flexible precision. So...
val number1 = 1.2313215
val number2 = 100
val number4 = 3.333E2
... when converted should be...
1.2313215
100
333.3
Currently DecimalFormat makes me choose the precision at construction, like so: new DecimalFormat("##.##"). Each # after the . stands for one decimal place.
If I use f"$number", it treats the decimal points correctly but, like I said before, it is unable to handle the scientific notation.
Just decide how many places after the . you might need, and write out the pattern so that trailing zeros are hidden:
val fmt = new java.text.DecimalFormat("#,##0.##############")
for (x <- List[Double](1.2313215, 100, 3.333E2)) println(fmt.format(x))
prints:
1.2313215
100
333.3
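Applied to the number from the first paragraph of the question, the same formatter gives the grouped form; the trailing .0 is dropped because the # placeholders hide trailing zeros:
val fmt = new java.text.DecimalFormat("#,##0.##############")
println(fmt.format(1.1231053E7)) // prints 11,231,053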

Spark case class - decimal type encoder error "Cannot up cast from decimal"

I'm extracting data from MySQL/MariaDB, and during creation of the Dataset an error occurs with the data types:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast AMOUNT from decimal(30,6) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "org.apache.spark.sql.types.Decimal", name: "AMOUNT")
- root class: "com.misp.spark.Deal"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
The case class is defined like this:
case class Deal(
  AMOUNT: Decimal
)
Anyone know how to fix it and not touch the database?
That error says that Apache Spark can't automatically convert the BigDecimal(30,6) from the database to the BigDecimal(38,18) wanted in the Dataset (I don't know why it needs the fixed parameters 38,18, and it is even stranger that Spark can't automatically convert a type with low precision to a type with high precision).
A bug was reported for this: https://issues.apache.org/jira/browse/SPARK-20162 (maybe it was by you). Anyway, I found a good workaround for reading the data: cast the columns to DecimalType(38,18) in the DataFrame and then cast the DataFrame to a Dataset.
//first read data into a dataframe in any way suitable for you
var df: DataFrame = ???
val dfSchema = df.schema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

dfSchema.foreach { field =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // leave non-decimal columns (and decimals already at 38,18) untouched
  }
}
df.as[YourCaseClassWithBigDecimal]
It should solve problems with reading (but not with writing I guess)
As was previously stated, since your DB uses DecimalType(30,6), you have 30 slots total and 6 slots past the decimal point, which leaves 30-6=24 for the area in front of the decimal point. I like to call it a (24 left, 6 right) big-decimal. This of course does not fit in a (20 left, 18 right) (i.e. DecimalType(38,18)), since the latter does not have enough slots on the left (20 vs the 24 needed). We only have 20 left-slots in a DecimalType(38,18), but we need 24 left-slots to accommodate your DecimalType(30,6).
What we can do here is down-cast the (24 left, 6 right) into a (20 left, 6 right) (i.e. DecimalType(26,6)) so that when it is auto-cast to a (20 left, 18 right) (i.e. DecimalType(38,18)) both sides will fit. Your DecimalType(26,6) will have 20 left-slots, allowing it to fit inside a DecimalType(38,18), and of course the 6 right-slots will fit into the 18.
The way you do that is before converting anything to a Dataset, run the following operation on the DataFrame:
val downCastableData =
originalData.withColumn("amount", $"amount".cast(DecimalType(26,6)))
Then converting to Dataset should work.
(Actually, you can cast to anything that's (20 left, 6 right) or less e.g. (19 left, 5 right) etc...).
While I don't have a solution, here is my understanding of what is going on:
By default Spark will infer the schema of the Decimal type (or BigDecimal) in a case class to be DecimalType(38, 18) (see org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT). The 38 means the Decimal can hold 38 digits total (for both left and right of the decimal point), while the 18 means 18 of those 38 digits are reserved for the right of the decimal point. That means a Decimal(38, 18) may have 20 digits to the left of the decimal point. Your MySQL schema is decimal(30, 6), which means it may contain values with 24 digits (30 - 6) to the left of the decimal point and 6 digits to the right. Since 24 digits is greater than 20 digits, there could be values that are truncated when converting from your MySQL schema to that Decimal type.
Unfortunately, inferring the schema from a Scala case class is considered a convenience by the Spark developers, and they have chosen not to support allowing the programmer to specify the precision and scale for Decimal or BigDecimal types within the case class (see https://issues.apache.org/jira/browse/SPARK-18484).
Building on #user2737635's answer, you can use a foldLeft rather than foreach to avoid defining your dataset as a var and redefining it:
//first read data to dataframe with any way suitable for you
val df: DataFrame = ???
val dfSchema = df.schema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

dfSchema.foldLeft(df) { (dataframe, field) =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      dataframe.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => dataframe
  }
}.as[YourCaseClassWithBigDecimal]
We are working around this by defining our own Encoder, which we pass at the .as call site. We generate the Encoder from the StructType, which knows the correct precision and scale (see the link below for the code):
https://issues.apache.org/jira/browse/SPARK-27339
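The ticket above has the actual code. As a loose sketch of the general idea (my own assumption, not necessarily the exact approach in the ticket, and relying on RowEncoder(schema) being available as in Spark 2.x up to 3.3), you can build an encoder directly from a StructType that carries the correct precision, at the price of getting a Dataset[Row] rather than the case class:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// schema matching the database column exactly, so no up-cast to (38,18) is attempted
val amountSchema = StructType(Seq(StructField("AMOUNT", DecimalType(30, 6))))
val ds = df.as[Row](RowEncoder(amountSchema))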
According to the PySpark docs, DecimalType(38, 18) is the default:
When creating a DecimalType, the default precision and scale is (10, 0). When inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).

Round Down Double in Spark

I have some Cassandra data of type double that I need to round down in Spark to 1 decimal place.
The problem is how to extract it from Cassandra, convert it to a decimal, round down to 1 decimal place and then write it back to a table in Cassandra. My rounding code is as follows:
BigDecimal(number).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
This works great if the number going in is a decimal, but I don't know how to convert the double to a decimal before rounding. My Double needs to be divided by 1000000 prior to rounding.
For example, 510999000 would become 510.999 before being rounded down to 510.9.
EDIT: I was able to get it to do what I wanted with the following command.
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
Not sure how good this is but it works.
Great answers, guys. Just chiming in with other ways to do the same:
1. If using a Spark DataFrame (x and y are DataFrames):
import org.apache.spark.sql.functions.round
val y = x.withColumn("col1", round($"col1", 3))
2. If using the RDD API (the row value has to be read as a Double first):
val y = x.rdd.map(row => (row.getDouble(0) * 1000).round / 1000.toDouble)
The answer I was able to work with was:
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
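For completeness, the divide-and-truncate can also be expressed purely with DataFrame functions, avoiding the round-trip through BigDecimal. This is only a sketch; the column name number is an assumption, and x is a DataFrame as in the answer above:
import org.apache.spark.sql.functions.{col, floor}

// 510999000 / 1000000 = 510.999 -> *10 = 5109.99 -> floor = 5109 -> /10 = 510.9
val y = x.withColumn("number", floor(col("number") / 1000000 * 10) / 10)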

Floating point in hexadecimal form

How can I represent a given floating point number in hexadecimal form? For example,
60123,124;
<sign>0x1.<mantissa>p±<exponent>
>>> (1.2).hex()
'0x1.3333333333333p+0'
>>> (1125.2).hex()
'0x1.194cccccccccdp+10'
>>> (7e85).hex()
'0x1.204362b6da56fp+285'
>>> (5e-3).hex()
'0x1.47ae147ae147bp-8'
>>> (-8.).hex()
'-0x1.0000000000000p+3'
>>> (60123.124).hex()
'0x1.d5b63f7ced917p+15'
Here (AU) we use a decimal point:
60123.124
Which my calculator converts to hex like so:
0xEADB.1FBE76C8B43958106
The principle is the same: where in base 10 the first decimal place represents 10ths, in base 16 the first decimal place represents 16ths.
See this related question.
The %a printf format specifier is described here
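On the JVM (and therefore from Scala), the same representation is available directly via java.lang.Double.toHexString; String.format with %a behaves similarly. A small sketch:
// Java omits the '+' before a positive exponent; otherwise the form matches the Python output above
println(java.lang.Double.toHexString(60123.124)) // 0x1.d5b63f7ced917p15
println(java.lang.Double.toHexString(-8.0))      // -0x1.0p3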