I'm testing Spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations, but the example below is not reassuring. Can anyone tell me why this is happening with Spark SQL? Currently on version 2.3.0.
val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show
This returns
+----------------+
| val|
+----------------+
|0.33333300000000|
+----------------+
This is a currently open issue; see SPARK-27089. The suggested workaround is to adjust the setting below. I validated that the SQL statement works as expected with this setting set to false:
spark.sql.decimalOperations.allowPrecisionLoss=false
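For reference, a minimal way to apply the workaround from a Spark session (the config key comes from the answer above; the commented result reflects what "works as expected" means here, i.e. the full 14-digit scale is kept):
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show
// expected: 0.33333333333333 (all 14 digits of scale preserved)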
Use BigDecimal to avoid precision loss. See Double vs. BigDecimal?
example:
scala> val df = Seq(BigDecimal("0.03"),BigDecimal("8.20"),BigDecimal("0.02")).toDS
df: org.apache.spark.sql.Dataset[scala.math.BigDecimal] = [value: decimal(38,18)]
scala> df.select($"value").show
+--------------------+
| value|
+--------------------+
|0.030000000000000000|
|8.200000000000000000|
|0.020000000000000000|
+--------------------+
Using BigDecimal:
scala> df.select($"value" + BigDecimal("0.1")).show
+-------------------+
| (value + 0.1)|
+-------------------+
|0.13000000000000000|
|8.30000000000000000|
|0.12000000000000000|
+-------------------+
If you don't use BigDecimal, there will be a loss of precision. In this case, 0.1 is a double:
scala> df.select($"value" + lit(0.1)).show
+-------------------+
| (value + 0.1)|
+-------------------+
| 0.13|
| 8.299999999999999|
|0.12000000000000001|
+-------------------+
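The difference also shows up in the result schema; a quick check (the exact decimal precision/scale may vary by Spark version):
df.select($"value" + lit(0.1)).printSchema()
// (value + 0.1): double   -- the decimal is promoted to double, hence the rounding error
df.select($"value" + BigDecimal("0.1")).printSchema()
// (value + 0.1): decimal(38,17)   -- stays a decimal, no precision loss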
Reading the documentation, a Spark DataType BigDecimal(precision, scale) means that
Precision is total number of digits and
Scale is the number of digits after the decimal point.
So when I cast a value to decimal
scala> val sss = """select cast(1.7142857343 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(1.7142857343 AS DECIMAL(9,8))|
+----------------------------------+
| 1.71428573| // It has 8 decimal digits
+----------------------------------+
But when I cast values above 10.0, I get NULL
scala> val sss = """select cast(12.345678901 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(12.345678901 AS DECIMAL(9,8))|
+----------------------------------+
|                              null|
+----------------------------------+
I would expect the result to be 12.3456789.
Why is it NULL?
Why is it that Precision is not being implemented?
To cast a decimal, Spark internally validates that the provided type decimal(9,8) is wider than the actual type of 12.345678901, which is decimal(11,9). If it is, the number can be cast into the provided type safely without losing any precision or range. Have a look at org.apache.spark.sql.types.DecimalType.isWiderThan().
However, in the above case decimal(11,9) cannot be cast into decimal(9,8), therefore it is returning null.
//MAX_PRECISION = 38
val sss = """select cast(12.345678901 as decimal(38,7))"""
spark.sql(sss).show(10)
+-----------------------------------+
|CAST(12.345678901 AS DECIMAL(38,7))|
+-----------------------------------+
| 12.3456789|
+-----------------------------------+
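For comparison, a sketch showing that the same value casts fine once the target type leaves room for its two integer digits (precision - scale >= 2), e.g. decimal(10,8):
val sss2 = """select cast(12.345678901 as decimal(10,8))"""
spark.sql(sss2).show
// returns 12.34567890 instead of null, since the value fits into decimal(10,8)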
I am working with time data and am trying to convert the string to timestamp format.
Here is what the 'Time' column looks like
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and tried the code below, but it all ends with 'null':
df2 = df.withColumn("Time", test["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))
Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a unix timestamp. When you want to convert a unix timestamp to a datetime format, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
| Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  regexp_extract($"value", r, 1).as("id"),
  regexp_extract($"value", r, 2).as("community")
).show()
A combination of regular expressions should give you the required result:
import org.apache.spark.sql.functions._
import spark.implicits._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful
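If you want to pull out several fields without writing one regex per field, here is a hedged sketch (assuming your Spark version ships the str_to_map SQL function; the column names "inner" and "kvs" are just illustrative):
import org.apache.spark.sql.functions._

// extract the text between the braces, parse it into a map, then pick fields by name
val parsed = df
  .withColumn("inner", regexp_extract(col("value"), "\\{(.*)\\}", 1))
  .withColumn("kvs", expr("str_to_map(inner, ',', ':')"))

parsed.select(
  regexp_extract(col("value"), "^\\((\\d+),", 1).as("id"),
  col("kvs").getItem("community").as("value")
).show(false)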
I would like to transform a Spark dataframe column from its hour:minute:second value into a number of seconds.
E.g. "01:12:17.8370000"
would become 4337 s,
and "00:00:39.0390000"
would become 39 s.
I have read this question, but I am lost on how I can use its code to transform my Spark dataframe column:
Convert HH:mm:ss in seconds
Something like this
df.withColumn("duration",col("duration")....)
I am using scala 2.10.5 and spark 1.6
Thank you
Assuming the column "duration" contains the duration as a string, you can just use the unix_timestamp function from the functions package to get the number of seconds, passing the pattern:
import org.apache.spark.sql.functions._
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
val newColumn = unix_timestamp(col("duration"), "HH:mm:ss")
val result = df.withColumn("duration", newColumn)
result.show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
If you have a string column, you can write a udf to calculate this manually:
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
def str_sec = udf((s: String) => {
val Array(hour, minute, second) = s.split(":")
hour.toInt * 3600 + minute.toInt * 60 + second.toDouble.toInt
})
df.withColumn("duration", str_sec($"duration")).show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
There are built-in functions you can take advantage of, which are faster and more efficient than udf functions.
Given an input dataframe as
+----------------+
|duration |
+----------------+
|01:12:17.8370000|
|00:00:39.0390000|
+----------------+
you can do something like below:
df.withColumn("seconds", hour($"duration")*3600+minute($"duration")*60+second($"duration"))
You should get output as
+----------------+-------+
|duration |seconds|
+----------------+-------+
|01:12:17.8370000|4337 |
|00:00:39.0390000|39 |
+----------------+-------+
I'm lost on how I can calculate the average string length of any column in a dataframe using Scala. I have been able to easily do it for numeric columns by doing the following:
val avgDF = df.dtypes.filter(x => x._2 == "DoubleType").map(ct =>avg(col(ct._1))).toList
import org.apache.spark.sql.functions._
val avgDF = df.agg(mean(length(col("yourColumn"))))
val findLength = udf { (colValue: String) => colValue.size }

myData.dtypes.filter(x => x._2 == "StringType").foreach(f =>
  myData.select(avg(findLength(col(f._1)))).show()
)
Sample Data
Name|Age|email
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hasasasi|12|hary#h0otmail.in
Output
+-------------------+
|AVG(scalaUDF(Name))|
+-------------------+
| 4.8|
+-------------------+
+--------------------+
|AVG(scalaUDF(email))|
+--------------------+
| 16.8|
+--------------------+
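If you would rather avoid a udf altogether, here is a sketch that computes the average length of every string column in a single aggregation pass, using the built-in length and avg functions (assuming the DataFrame is called myData, as above):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// one avg(length(...)) expression per string column, all evaluated in one job
val avgLens = myData.schema.fields
  .filter(_.dataType == StringType)
  .map(f => avg(length(col(f.name))).as(s"avg_len_${f.name}"))

if (avgLens.nonEmpty) myData.agg(avgLens.head, avgLens.tail: _*).show()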