I have some cassandra data that is of the type double that I need to round down in spark to 1 decimal place.
The problem is how to extract it from cassandra, convert it to a decimal, round down to 1 decimal point and then write back to a table in cassandra. My rounding code is as follows:
BigDecimal(number).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
This works great if the number going in is a decimal but I dont know how to convert the double to a decimal before rouding. My Double needs to be divided by 1000000 prior to rounding.
For example 510999000 would be 510.990 before being rounded down to 510.9
EDIT: I was able to get it to do what I wanted with the following command.
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
Not sure how good this is but it works.
Great answer guys. Just chiming other ways to do the same
1. If using Spark DataFrame then ( x and y are DataFrames )
import org.apache.spark.sql.functions.round
val y = x.withColumn("col1", round($"col1", 3))
2. val y = x.rdd.map( x => (x(0)*1000).round / 1000.toDouble )
The answer I was able to work with was:
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
Related
I have a very large data frame where there is a column that is a list of numbers representing category membership.
Here is a dummy version
import pandas as pd
import numpy as np
segments = [str(i) for i in range(1_000)]
# My real data is ~500m rows
nums = np.random.choice(segments, (100_000,10))
df = pd.DataFrame({'segments': [','.join(n) for n in nums]})
userId
segments
0
885,106,49,138,295,254,26,460,0,844
1
908,709,454,966,151,922,666,886,65,708
2
664,713,272,241,301,498,630,834,702,289
3
60,880,906,471,437,383,878,369,556,876
4
817,183,365,171,23,484,934,476,273,230
...
...
Note that there is a known list of segments (0-999 in the example)
I want to cast this into dummy columns indicating membership to each segment.
I found a few ways of doing this:
In pandas:
df_one_hot_encoded = (df['segments']
.str.split(',')
.explode()
.reset_index()
.assign(__one__=1)
.pivot_table(index='index', columns='segments', values='__one__', fill_value=0)
)
(takes 8 seconds on a 100k row sample)
And polars
df2 = pl.from_pandas(df[['segments']])
df_ans = (df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(','),
pl.lit(1).alias('__one__')
])
.explode('segments')
.pivot(index='row_index', columns='segments', values='__one__')
.fill_null(0)
)
df_one_hot_encoded = df_ans.to_pandas()
(takes 1.5 seconds inclusive of the conversion to and from pandas, .9s without)
However, I hear .pivot is not efficient, and that it does not work well with lazy frames.
I tried other solutions in polars, but they were much slower:
_ = df2.lazy().with_columns(**{segment: pl.col('segments').str.contains(segment) for segment in segments}).collect()
(2 seconds)
(df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(',')
])
.explode('segments')
.to_dummies(columns=['segments'])
.groupby('row_index')
.sum()
)
(4 seconds)
Does anyone know a better solution than the .9s pivot?
This approach ends up being slower than the pivot but it's a got a different trick so I'll include it.
df2=pl.from_pandas(df)
df2_ans=(df2.with_row_count('userId').with_column(pl.col('segments').str.split(',')).explode('segments') \
.with_columns([pl.when(pl.col('segments')==pl.lit(str(i))).then(pl.lit(1,pl.Int32)).otherwise(pl.lit(0,pl.Int32)).alias(str(i)) for i in range(1000)]) \
.groupby('userId')).agg(pl.exclude('segments').sum())
df_one_hot_encoded = df2_ans.to_pandas()
A couple of other observations. I'm not sure if you checked the output of your str.contains method but I would think that wouldn't work because, for example, 15 is contained within 154 when looking at strings.
The other thing, which I guess is just a preference, is the with_row_count syntax vs the pl.arrange. I don't think the performance of either is better (at least not significantly so) but you don't have to reference the df name to get the len of it which is nice.
I tried a couple other things that were also worse including not doing the explode and just doing is_in but that was slower. I tried using bools instead of 1s and 0s and then aggregating with any but that was slower.
I searched internet for a function to find exact square root of BigInt using scala programming language. I didn't get one, But saw one Java Program and I converted that function into Scala version. It is working but I am not sure, whether it can handle very large BigInt. But it returns BigInt only. Not BigDecimal as Square Root. It shows there is some bit manipulation done in the code with some hard coding of numbers like shiftRight(5), BigInt("8") and shiftRight(1). I can understand the logic clearly, But not the hard coding of these bitshift numbers and the number 8. May be these bitshift functions are not available in scala, and thats why it is needed to convert to java BigInteger at few places. These hard coded numbers may impact the precision of the result.I just changed the java code into scala code just copying the exact algorithm. And here is the code I have written in scala:
def sqt(n:BigInt):BigInt = {
var a = BigInt(1)
var b = (n>>5)+BigInt(8)
while((b-a) >= 0) {
var mid:BigInt = (a+b)>>1
if(mid*mid-n> 0) b = mid-1
else a = mid+1
}
a-1
}
My Points are:
Can't we return a BigDecimal instead of BigInt? How can we do that?
How these hardcoded numbers shiftRight(5), shiftRight(1) and 8 are related
to precision of the result.
I tested for one number in scala REPL: The function sqt is giving exact square root of the squared number. but not for the actual number as below:
scala> sqt(BigInt("19928937494873929279191794189"))
res9: BigInt = 141169888768369
scala> res9*res9
res10: scala.math.BigInt = 19928937494873675935734920161
scala> sqt(res10)
res11: BigInt = 141169888768369
scala>
I understand shiftRight(5) means divide by 2^5 ie.by 32 in decimal and so on..but why 8 is added here after shift operation? why exactly 5 shifts? as a first guess?
Your question 1 and question 3 are actually the same question.
How [do] these bitshifts impact [the] precision of the result?
They don't.
How [are] these hardcoded numbers ... related to precision of the result?
They aren't.
There are many different methods/algorithms for estimating/calculating the square root of a number (as can be seen here). The algorithm you've posted appears to be a pretty straight forward binary search.
Pick a number a guaranteed to be smaller than the target (square root of n).
Pick a number b guaranteed to be larger than the target (square root of n).
Calculate mid, the whole number mid-point between a and b.
If mid is larger than (or equal to) the target then move b to mid (-1 because we know it's too large).
If mid is smaller than the target then move a to mid (+1 because we know it's too small).
Repeat 3,4,5 until a is no longer less than b.
Return a-1 as the square root of n rounded down to a whole number.
The bitshifts and hardcoded numbers are used in selecting the initial value of b. But b only has be greater than the target. We could have just done var b = n. Why all the bother?
It's all about efficiency. The closer b is to the target, the fewer iterations are needed to find the result. Why add 8 after the shift? Because 31>>5 is zero, which is not greater than the target. The author chose (n>>5)+8 but he/she might have chosen (n>>7)+12. There are trade-offs.
Can't we return a BigDecimal instead of BigInt? How can we do that?
Here's one way to do that.
def sqt(n:BigInt) :BigDecimal = {
val d = BigDecimal(n)
var a = BigDecimal(1.0)
var b = d
while(b-a >= 0) {
val mid = (a+b)/2
if (mid*mid-d > 0) b = mid-0.0001 //adjust down
else a = mid+0.0001 //adjust up
}
b
}
There are better algorithms for calculating floating-point square root values. In this case you get better precision by using smaller adjustment values but the efficiency gets much worse.
Can't we return a BigDecimal instead of BigInt? How can we do that?
This makes no sense if you want exact roots: if a BigInt's square root can be represented exactly by a BigDecimal, it can be represented by a BigInt. If you don't want exact roots, you'll need to specify precision and modify the algorithm (and for most cases, Double will be good enough and much much much faster than BigDecimal).
I understand shiftRight(5) means divide by 2^5 ie.by 32 in decimal and so on..but why 8 is added here after shift operation? why exactly 5 shifts? as a first guess?
These aren't the only options. The point is that for every positive n, n/32 + 8 >= sqrt(n) (where sqrt is the mathematical square root). This is easiest to show by a bit of calculus (or just by building a graph of the difference). So at the start we know a <= sqrt(n) <= b (unless n == 0 which can be checked separately), and you can verify this remains true on each step.
I'd like to convert a double such as 1.1231053E7 to 11,231,053.0 in scala. Currently the way I am converting doubles is to do this f"$number" where number is a double value. Unfortunately this just gives me a string with 1.1231053E7.
I can convert it out of scientific notation using NumberFormat or DecimalFormat but these also force me to choose a predetermined precision. I want flexible precision. So...
val number1 = 1.2313215
val number2 = 100
val number4 = 3.333E2
... when converted should be...
1.2313215
100
333.3
Currently DecimalFormat makes me choose the precision during construction like so: new DecimalFormat(##.##). Each # after . signifies a decimal point.
If I use f"$number", it treats the decimal points correctly but, like I said before, it is unable to handle the scientific notation.
Just decide how many places after the . you need, write out the number hiding the zeros:
val fmt = new java.text.DecimalFormat("#,##0.##############")
for (x <- List[Double](1.2313215, 100, 3.333E2)) println(fmt.format(x))
prints:
1.2313215
100
333.3
I have data in a file as shown below:
7373743343333444.
7373743343333432.
This data should be converted to decimal values and should be in a position of 8.7 where 8 are the digits before decimal and 7 are the digits after decimal.
I am trying to read the data file as below:
val readDataFile = Initialize.spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").schema(***SCHEMA*****).load(****DATA FILE PATH******)
I have tried this:
val changed = dataFileWithSchema.withColumn("COLUMN NAME", dataFileWithSchema.col("COLUMN NAME").cast(new DecimalType(38,3)))
println(changed.show(5))
but it only gives me zeros at the end of the number, like this:
7373743343333444.0000
But I want the digits formatted as described above, how can I achieve this?
A simple combination of regexp_replace, trim and format_number inbuilt function should get you what you desire
import org.apache.spark.sql.functions._
df.withColumn("column", regexp_replace(format_number(trim(regexp_replace(col("column"), "\\.", "")).cast("long")/100000000, 7), ",", ""))
Divide the column by 10^8, this will move the decimal point 8 steps. After that cast to DecimalType to get the correct number of decimals. Since there are 16 digits to begin with, this means the last one is removed.
df.withColumn("col", (col("col").cast(DoubleType)/math.pow(10,8)).cast(DecimalType(38,7)))
i have a Dataframe with this kind of data:
unit,sensitivity currency,trading desk ,portfolio ,issuer ,bucket ,underlying ,delta ,converted sensitivity
ES ,USD ,EQ DERIVATIVES,ESEQRED_LH_MIDX ,5GOY ,5 ,repo ,0.00002 ,0.00002
ES ,USD ,EQ DERIVATIVES,IND_GLOBAL1 ,no_localizado ,8 ,repo ,-0.16962 ,-0.15198
ES ,EUR ,EQ DERIVATIVES,ESEQ_UKFLOWN ,IGN2 ,8 ,repo ,-0.00253 ,-0.00253
ES ,USD ,EQ DERIVATIVES,BASKETS1 ,9YFV ,5 ,spot ,-1003.64501 ,-899.24586
and I have to do an aggregation operation over this data, doing something like this:
val filteredDF = myDF.filter("unit = 'ES' AND `trading desk` = 'EQ DERIVATIVES' AND issuer = '5GOY' AND bucket = 5 AND underlying = 'repo' AND portfolio ='ESEQRED_LH_MIDX'")
.groupBy("unit","trading desk","portfolio","issuer","bucket","underlying")
.agg(sum("converted_sensitivity"))
But I am seeing that I am loosing precision on the aggregated sum, so how can I be sure about that every value of "converted_sensitivity" is converted to a BigDecimal(25,5) before doing the sum operation over the new aggregated column?
Thank you very much.
To be sure of the convertion you can use the DecimalType in your DataFrame.
According to Spark documentation the DecimalType is:
The data type representing java.math.BigDecimal values. A Decimal that must have fixed precision (the maximum number of digits) and scale (the number of digits on right side of dot).
The precision can be up to 38, scale can also be up to 38 (less or equal to precision).
The default precision and scale is (10, 0).
You can see this here.
To convert the data you can use the function cast of the Column object. Like this:
import org.apache.spark.sql.types.DecimalType
val filteredDF = myDF.filter("unit = 'ES' AND `trading desk` = 'EQ DERIVATIVES' AND issuer = '5GOY' AND bucket = 5 AND underlying = 'repo' AND portfolio ='ESEQRED_LH_MIDX'")
.withColumn("new_column_big_decimal", col("converted_sensitivity").cast(DecimalType(25,5))
.groupBy("unit","trading desk","portfolio","issuer","bucket","underlying")
.agg(sum("new_column_big_decimal"))