Saving double datatype in Spark DataFrame - Scala

In the Spark-Scala code below, a double value is stored differently (in scientific notation), even though all columns in the table are of string type. The same result shows up in Impala as well.
Does someone know how to make sure the exact value gets saved and retrieved?
Thanks
val df = Seq(("one", 1324235345435.4546)).toDF("a", "b")
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)
+---+---------------------+
|a |b |
+---+---------------------+
|one|1.3242353454354546E12|
+---+---------------------+

Try casting to Decimal type and then inserting into the Hive table.
val df = Seq(("one", 1324235345435.4546))
  .toDF("a", "b")
  .select('a, 'b.cast("Decimal(36,4)"))
df.show(false)
+---+------------------+
|a |b |
+---+------------------+
|one|1324235345435.4546|
+---+------------------+

scala> df.select(format_number(col("b"),4)).show(false)
+----------------------+
|format_number(b, 4) |
+----------------------+
|1,324,235,345,435.4546|
+----------------------+
You could use the format_number function on the column so that it automatically converts the value to a string with the required precision.
Hope this helps with generalizing. A small sketch of the full write path is shown below.
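If the goal is to keep the table's string columns while storing the full value, one possible sketch (reusing the df and the "test" table from the question, which are assumptions from this thread) is to cast through decimal and then to string before writing:

import org.apache.spark.sql.functions._

// Sketch only: df and the "test" table come from the question above.
// Casting through decimal keeps the exact digits, and the final cast to string
// matches the all-string schema of the target table before the insert.
val exact = df.select($"a", $"b".cast("decimal(36,4)").cast("string").as("b"))
exact.write.mode("append").insertInto("test")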

Related

How do I string concat two columns in Scala but order the resulting column alphabetically?

I have a dataframe like this...
val new_df =Seq(("a","b"),("b","a"),("a","c")).toDF("col1","col2")
and I want to create "col3" which is a string concatenation of "col1" and "col2". However, I want the concatenation of "ab" and "ba" to be treated the same, sorted alphabetically so that it's only "ab".
The resulting dataframe I would like to look like this:
val new_df =Seq(("a","b","ab"),("b","a","ab"),("a","c","ac")).toDF("col1","col2","col3")
thanks and have a great day!
Use Spark SQL functions to take advantage of the Spark SQL optimizations:
import org.apache.spark.sql.functions.{sort_array, array, concat_ws, col}

new_df.withColumn("col3",
  concat_ws("",
    sort_array(array(col("col1"), col("col2")))))
You can also just create a UDF that builds a sorted string:
val concatColumns = udf((c1: String, c2: String) => {
  List(c1, c2).sorted.mkString
})
And then use it in a withColumn call, passing the columns you want to concatenate:
new_df.withColumn("col3", concatColumns($"col1", $"col2")).show(false)
Result
+----+----+----+
|col1|col2|col3|
+----+----+----+
|a |b |ab |
|b |a |ab |
|a |c |ac |
+----+----+----+

Cast to decimal in Spark Scala

I have input data in text format like below. I need to convert it to Decimal in Spark Scala. Please help me with the cast(DecimalType) statement.
+0000025.42
I have tried .cast(DecimalType(11,2)) and it displays null.
There might be some non-numeric text in that column; because of that, Spark cannot convert the data to decimal(11,2) and adds null in that column instead.
Check the data in that column.
scala> val df = Seq(("+0000025.42"),("sample")).toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show(false)
+-----------+
|value |
+-----------+
|+0000025.42|
|sample |
+-----------+
scala> df.select($"value".cast("decimal(11,2)").as("value")).show(false)
+-----+
|value|
+-----+
|25.42|
|null |
+-----+
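If you want to see exactly which rows are producing the nulls before deciding how to clean them, a quick sketch (reusing the df above) is to filter on the failed cast:

// Rows whose value cannot be parsed as decimal(11,2) come back null after the cast
df.filter($"value".isNotNull && $"value".cast("decimal(11,2)").isNull).show(false)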

How to remove the fractional part from a dataframe column?

Input dataframe:
val ds = Seq((1, 34.44),
  (2, 76.788),
  (3, 54.822)).toDF("id", "mark")
Expected output:
val ds = Seq((1, 34),
  (2, 76),
  (3, 54)).toDF("id", "mark")
I want to remove the fractional part from the mark column as above. I have searched for a built-in function but did not find anything. What should a UDF look like to achieve the above result?
You can just cast to integer, as in
import org.apache.spark.sql.functions._
ds.withColumn("mark", $"mark".cast("integer")).show(false)
which should give you
+---+----+
|id |mark|
+---+----+
|1 |34 |
|2 |76 |
|3 |54 |
+---+----+
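If you would rather keep the column numeric while dropping the fraction, floor is another option; this is only a sketch, and note it differs from an integer cast for negative values (floor(-1.5) is -2, while the cast gives -1):

import org.apache.spark.sql.functions.floor

// floor returns a bigint column: 34, 76, 54 for the sample data above
ds.withColumn("mark", floor($"mark")).show(false)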
I hope the answer is helpful
Update
You commented:
But if there are any string values in the column, they become null since we are casting to integer. I don't want that kind of behaviour.
So I guess your mark column must be of StringType, and you can use regexp_replace:
import org.apache.spark.sql.functions._
ds.withColumn("mark", regexp_replace($"mark", "(\\.\\d+)", "")).show(false)

How to add days (as values of a column) to date?

I have a problem with adding days (numbers) to date columns in Spark. I know there is a date_add function that takes two arguments, a date column and an integer:
date_add(date startdate, tinyint/smallint/int days)
I'd like to use a column of integer type instead (not an integer literal).
Say I have the following dataframe:
val data = Seq(
  (0, "2016-01-1"),
  (1, "2016-02-2"),
  (2, "2016-03-22"),
  (3, "2016-04-25"),
  (4, "2016-05-21"),
  (5, "2016-06-1"),
  (6, "2016-03-21")
).toDF("id", "date")
I can simply add integers to dates:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", 1)
  )
But I cannot use a column expression that contains the values:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", $"id")
  )
It gives error:
<console>:60: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Int
date_add($"date", $"id")
Does anyone know if it is possible to use a column in the date_add function? Or what is the workaround?
You can use expr:
import org.apache.spark.sql.functions.expr
data.withColumn("future", expr("date_add(date, id)")).show
// +---+----------+----------+
// | id| date| future|
// +---+----------+----------+
// | 0| 2016-01-1|2016-01-01|
// | 1| 2016-02-2|2016-02-03|
// | 2|2016-03-22|2016-03-24|
// | 3|2016-04-25|2016-04-28|
// | 4|2016-05-21|2016-05-25|
// | 5| 2016-06-1|2016-06-06|
// | 6|2016-03-21|2016-03-27|
// +---+----------+----------+
selectExpr could be used in a similar way:
data.selectExpr("*", "date_add(date, id) as future").show
The other answers work but aren't a drop-in replacement for the existing date_add function.
I had a case where expr wouldn't work for me, so here is a drop-in replacement:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.DateAdd

def date_add(date: Column, days: Column) = {
  new Column(DateAdd(date.expr, days.expr))
}
Basically, all the machinery to do this is already there in Spark; the signature of the built-in date_add just forces the days argument to be a literal.
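A quick usage sketch with the data frame from the question, assuming the analyzer applies the same implicit string-to-date cast as in the expr example above:

// The custom date_add accepts a Column for the days argument
data.select($"id", $"date", date_add($"date", $"id").as("future")).show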
You can use a SQL expression as well:
data.createOrReplaceTempView("table")
sqlContext.sql("select id, date, date_add(`date`, `id`) as added_date from table").show(false)
which would give you
+---+----------+----------+
|id |date |added_date|
+---+----------+----------+
|0 |2016-01-1 |2016-01-01|
|1 |2016-02-2 |2016-02-03|
|2 |2016-03-22|2016-03-24|
|3 |2016-04-25|2016-04-28|
|4 |2016-05-21|2016-05-25|
|5 |2016-06-1 |2016-06-06|
|6 |2016-03-21|2016-03-27|
+---+----------+----------+
For the Python developers who are here, you can simply add a date column and an offset column together using +:
import pyspark.sql.functions as F
new_df = df.withColumn("new_date", F.col("date") + F.col("offset"))
Just make sure that the offset column is int/smallint/tinyint.

How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
| a| [1]|
| b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|nums |
+---+-------+
| a| [1,5] |
| b| [1,5] |
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
import org.apache.spark.sql.functions.{lit, array, array_union}
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id| nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, 7 months after you asked the question :) See https://spark.apache.org/news/index.html
You can do it using a udf function as follows:
import org.apache.spark.sql.functions.{col, udf}

def addValue = udf((array: Seq[Int]) => array ++ Array(5))

df1.withColumn("nums", addValue(col("nums")))
  .show(false)
and you should get
+---+------+
|id |nums |
+---+------+
|a |[1, 5]|
|b |[1, 5]|
+---+------+
Update
An alternative is to go the Dataset way and use map:
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums") ++ Seq(5)))
  .show(false)
where add is a case class
case class add(id: String, nums: Seq[Int])
I hope the answer is helpful
If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
Be careful when using Spark's array_union: it removes duplicates, so you will not get the expected results if there are duplicated entries in your array. It also costs at least O(N), so when I used it inside an array aggregate it became an O(N^2) operation and took forever for some large arrays.
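If duplicates have to be preserved, a hedged alternative (Spark 2.4+) is concat on array columns, which simply appends without de-duplicating; a minimal sketch reusing df1 from above:

import org.apache.spark.sql.functions.{array, concat, lit}

// concat on arrays keeps duplicates, unlike array_union which is a set union
df1.withColumn("nums", concat($"nums", array(lit(5)))).show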