SparkSQL function requires type Decimal - Scala

I designed the following function to work with arrays of any numeric type:
def array_sum[T](item:Traversable[T])(implicit n:Numeric[T]) = item.sum
// Registers a function as a UDF so it can be used in SQL statements.
sqlContext.udf.register("array_sumD", array_sum(_:Seq[Float]))
But when I try to pass an array of floats, I get the following error:
// Now we can use our function directly in SparkSQL.
sqlContext.sql("SELECT array_sumD(array(5.0,1.0,2.0)) as array_sum").show
Error:
cannot resolve 'UDF(array(5.0,1.0,2.0))' due to data type mismatch: argument 1 requires array<double> type, however, 'array(5.0,1.0,2.0)' is of array<decimal(2,1)> type;

The default data type for decimal literals in Spark SQL is, well, decimal. If you cast the literals in the query to floats and use the same UDF, it works:
sqlContext.sql(
"""SELECT array_sumD(array(
| CAST(5.0 AS FLOAT),
| CAST(1.0 AS FLOAT),
| CAST(2.0 AS FLOAT)
|)) as array_sum""".stripMargin).show
The result, as expected:
+---------+
|array_sum|
+---------+
| 8.0|
+---------+
Alternatively, if you do want to use decimals (to avoid floating-point issues), you'll still have to use casting to get the right precision, and you won't be able to use Scala's nice Numeric and sum, because the decimals arrive as java.math.BigDecimal. So your code would be:
def array_sum(item:Traversable[java.math.BigDecimal]) = item.reduce((a, b) => a.add(b))
// Registers a function as a UDF so it can be used in SQL statements.
sqlContext.udf.register("array_sumD", array_sum(_:Seq[java.math.BigDecimal]))
sqlContext.sql(
"""SELECT array_sumD(array(
| CAST(5.0 AS DECIMAL(38,18)),
| CAST(1.0 AS DECIMAL(38,18)),
| CAST(2.0 AS DECIMAL(38,18))
|)) as array_sum""".stripMargin).show

Related

locate function usage on dataframe without using UDF Spark Scala

I am curious as to why this will not work in Spark Scala on a dataframe:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
It works with a UDF, but not as written above. The issue seems to be Column vs. String: how do I convert a column to a String so it can be passed to locate, which needs a String?
My understanding was that df("search_string") would produce a String.
The error I get is:
command-679436134936072:15: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
Understanding what's going wrong
I'm not sure which version of Spark you're on, but the locate method has the same signature on both Spark 3.3.1 (the latest version at the time of writing) and Spark 2.4.5 (the version running in my local Spark shell):
def locate(substr: String, str: Column, pos: Int): Column
So substr can't be a Column, it needs to be a String. In your case, you were using df("search_string"). This actually calls the apply method with the following function signature:
def apply(colName: String): Column
So it makes sense that you're having a problem since the locate function needs a String.
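As an aside, this also explains why the UDF route mentioned in the question works: inside a UDF the column values arrive as plain Strings. A rough sketch, using the column names from the question:
import org.apache.spark.sql.functions.{col, udf}
// 1-based position like SQL's locate; indexOf returns -1 when not found, so +1 yields 0.
val locateUdf = udf((substr: String, str: String) => str.indexOf(substr) + 1)
df.withColumn("answer", locateUdf(col("search_string"), col("hit_songs")))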
Trying to fix your issue
If I correctly understood, you want to be able to locate a substring from one column inside of a string in another column without UDFs. You can use a map on a Dataset to do that. Something like this:
import spark.implicits._
case class MyTest (A:String, B: String)
val df = Seq(
  MyTest("with", "potatoes with meat"),
  MyTest("with", "pasta with cream"),
  MyTest("food", "tasty food"),
  MyTest("notInThere", "don't forget some nice drinks")
).toDF("A", "B").as[MyTest]
val output = df.map {
  case MyTest(a, b) => (a, b, b.indexOf(a))
}
output.show(false)
+----------+-----------------------------+---+
|_1 |_2 |_3 |
+----------+-----------------------------+---+
|with |potatoes with meat |9 |
|with |pasta with cream |6 |
|food |tasty food |6 |
|notInThere|don't forget some nice drinks|-1 |
+----------+-----------------------------+---+
Once you're inside of a map operation of a strongly typed Dataset, you have the Scala language at your disposal.
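If you prefer named columns over _1/_2/_3, one small follow-up (not part of the original answer) is to rename the tuple columns afterwards:
// Give the tuple columns readable names.
val named = output.toDF("search_string", "hit_songs", "answer")
named.show(false)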
Hope this helps!

cast big number in human readable format

I'm working with databricks on a notebook.
I have a column with numbers like this 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them to IntegerType() returns null values, and casting them to double makes them unreadable.
How can I convert them to a human-readable format on which .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could convert it to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via the two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.

Spark Decimal Precision and Scale seems wrong when Casting

Reading the documentation, a Spark DecimalType(precision, scale) means that
precision is the total number of digits, and
scale is the number of digits after the decimal point.
So when I cast a value to decimal
scala> val sss = """select cast(1.7142857343 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(1.7142857343 AS DECIMAL(9,8))|
+----------------------------------+
| 1.71428573| // It has 8 decimal digits
+----------------------------------+
But when I cast values above 10.0, I get NULL
scala> val sss = """select cast(12.345678901 as decimal(9,8))"""
scala> spark.sql(sss).show
+----------------------------------+
|CAST(12.345678901 AS DECIMAL(9,8))|
+----------------------------------+
|                              null|
+----------------------------------+
I would expect the result to be 12.3456789.
Why is it NULL?
Why is the precision not being applied?
To cast a decimal, Spark internally validates that the provided schema decimal(9,8) is wider than the actual schema of 12.345678901, which is decimal(11,9). If it is, the value can be cast into the provided schema safely without losing any precision or range. Have a look at org.apache.spark.sql.types.DecimalType.isWiderThan().
However, in the above case decimal(11,9) cannot be cast into decimal(9,8): the target leaves only 9 - 8 = 1 digit before the decimal point, so 12.x does not fit, and the cast therefore returns null.
// MAX_PRECISION = 38
val sss = """select cast(12.345678901 as decimal(38,7))"""
spark.sql(sss).show(10)
+-----------------------------------+
|CAST(12.345678901 AS DECIMAL(38,7))|
+-----------------------------------+
|                         12.3456789|
+-----------------------------------+
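As a quick sanity check on the precision arithmetic (not from the original answer), keeping two digits in front of the decimal point is already enough:
// decimal(10,8) leaves 10 - 8 = 2 integral digits, so 12.x fits and the
// fractional part is rounded to 8 digits.
spark.sql("""select cast(12.345678901 as decimal(10,8))""").show()
// expected output: 12.34567890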

Convert dataframe String column to Array[Int]

I am new to Scala and Spark and I am trying to read a csv file locally (for testing):
val spark = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate;
val topics_df = spark.read.format("csv").option("header", "true").load("path-to-file.csv")
topics_df.show(10)
Here's what the file looks like:
+-----+--------------------+--------------------+
|topic| termindices| termweights|
+-----+--------------------+--------------------+
| 15|[21,31,51,108,101...|[0.0987100701,0.0...|
| 16|[42,25,121,132,55...|[0.0405490884,0.0...|
| 7|[1,23,38,7,63,0,1...|[0.1793091892,0.0...|
| 8|[13,40,35,104,153...|[0.0737646511,0.0...|
| 9|[2,10,93,9,158,18...|[0.1639456608,0.1...|
| 0|[28,39,71,46,123,...|[0.0867449145,0.0...|
| 1|[11,34,36,110,112...|[0.0729913664,0.0...|
| 17|[6,4,14,82,157,61...|[0.1583892199,0.1...|
| 18|[9,27,74,103,166,...|[0.0633899386,0.0...|
| 19|[15,81,289,218,34...|[0.1348582482,0.0...|
+-----+--------------------+--------------------+
with
ReadSchema: struct<topic:string,termindices:string,termweights:string>
The termindices column is supposed to be of type Array[Int], but when saved to CSV it is a String (this usually would not be a problem if I pulled from a database).
How do I convert the type and eventually cast the DataFrame to a:
case class TopicDFRow(topic: Int, termIndices: Array[Int], termWeights: Array[Double])
I have the function ready to perform the conversion:
termIndices.substring(1, termIndices.length - 1).split(",").map(_.toInt)
I have looked at udf and a few other solutions but I am convinced that there should be a much cleaner and faster way to perform said conversion. Any help is greatly appreciated!
UDFs should be avoided when it's possible to use the more efficient in-built Spark functions. To my knowledge there is no better way than the one proposed; remove the first and last characters of the string, split and convert.
Using the in-built functions, this can be done as follows:
df.withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices")-2), ",").cast("array<int>"))
.withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights")-2), ",").cast("array<double>"))
.as[TopicDFRow]
substr is 1-indexed, so to remove the first character we start from 2. The second argument is the length to take (not the end point), hence the -2 (e.g. for "[21,31,51]" this yields "21,31,51").
The last command will cast the dataframe to a dataset of type TopicDFRow.
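Once it is a Dataset[TopicDFRow], the typed fields are available directly in further typed operations; for instance (assigning the chain above to a value named topics, a name chosen here just for illustration):
// topics is the Dataset[TopicDFRow] produced by the chain above.
topics.map(t => (t.topic, t.termIndices.length, t.termWeights.sum)).show()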

How to filter nullable Array-Elements in Spark 1.6 UDF

Consider the following DataFrame
root
|-- values: array (nullable = true)
| |-- element: double (containsNull = true)
with content:
+-----------+
| values|
+-----------+
|[1.0, null]|
+-----------+
Now I want to pass this values column to a UDF:
import org.apache.spark.sql.functions.udf

val inspect = udf((data: Seq[Double]) => {
  data.foreach(println)
  println()
  data.foreach(d => println(d))
  println()
  data.foreach(d => println(d == null))
  ""
})
df.withColumn("dummy", inspect($"values"))
I'm really confused by the output of the above println statements:
1.0
null
1.0
0.0
false
false
My questions:
Why does foreach(println) not give the same output as foreach(d => println(d))?
How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
How can I filter null values in my Seq other than filtering 0.0, which isn't really safe? Should I use Seq[java.lang.Double] as the input type in the UDF and then filter nulls? (This works, but I'm unsure whether that is the way to go.)
Note that I'm aware of this Question, but my question is specific to array-type columns.
Why does foreach(println) not give the same output as foreach(d => println(d))?
In a context where Any is expected, the cast to Double is skipped completely, so the null is printed as-is. This is explained in detail in "If an Int can't be null, what does null.asInstanceOf[Int] mean?"
How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
The internal binary representation doesn't use Scala types at all. Once the array data is decoded, it is represented as an Array[Any], and the elements are coerced to the declared type with a simple asInstanceOf.
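A tiny plain-Scala illustration of that coercion, independent of Spark:
// asInstanceOf on a null element silently yields the primitive default instead of failing.
val any: Any = null
val d = any.asInstanceOf[Double]
println(d)           // prints 0.0 -- matches the second block of output above
println(d == null)   // prints false -- d is an unboxed 0.0, which explains the third block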
Should I use Seq[java.lang.Double] as the input type in the UDF and then filter nulls?
In general, if the values are nullable, you should use an external type which is nullable as well, or an Option. Unfortunately, only the first of these is applicable for UDFs.
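So declaring the input as Seq[java.lang.Double] and filtering the nulls, as you already tried, is the way to go. A minimal sketch (the column name is the one from the example above):
import org.apache.spark.sql.functions.udf
// Nulls survive as java.lang.Double references and can be filtered out explicitly.
val sumNonNull = udf((data: Seq[java.lang.Double]) =>
  data.filter(_ != null).map(_.doubleValue).sum)
df.withColumn("sum", sumNonNull($"values"))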