Cast to decimal in spark scala - scala

I have input data in text format like below. I need to convert it to a Decimal in Spark Scala. Please help me with the cast(DecimalType) statement.
+0000025.42
I have tried .cast(DecimalType(11,2)), but it displays null.

That column probably contains some non-numeric text values; because of that, Spark cannot convert the data to decimal(11,2) and puts null in that column instead.
Check the data in that column:
scala> val df = Seq(("+0000025.42"),("sample")).toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show(false)
+-----------+
|value      |
+-----------+
|+0000025.42|
|sample     |
+-----------+
scala> df.select($"value".cast("decimal(11,2)").as("value")).show(false)
+-----+
|value|
+-----+
|25.42|
|null |
+-----+
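If you want to see which rows will break the cast before converting, a quick sketch (reusing the same df and the decimal(11,2) target) is to keep only the rows where the cast comes back null:
// rows whose value cannot be parsed as decimal(11,2); these become null after the cast
val badRows = df.filter($"value".cast("decimal(11,2)").isNull && $"value".isNotNull)
badRows.show(false)
// +------+
// |value |
// +------+
// |sample|
// +------+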

Related

Saving double datatype in spark dataframe

In the Spark Scala code below, a double value is stored differently from the value that was written, even though all columns in the table are of string type. The same result shows up in Impala as well.
Does anyone know how to make sure the exact value gets saved and retrieved?
Thanks
val df = Seq(("one", 1324235345435.4546)).toDF("a", "b")
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)
+---+---------------------+
|a  |b                    |
+---+---------------------+
|one|1.3242353454354546E12|
+---+---------------------+
Try casting to the Decimal type before inserting into the Hive table.
val df = Seq(("one", 1324235345435.4546))
  .toDF("a", "b")
  .select('a, 'b.cast("Decimal(36,4)"))
df.show(false)
+---+------------------+
|a  |b                 |
+---+------------------+
|one|1324235345435.4546|
+---+------------------+
scala> df.select(format_number(col("b"),4)).show(false)
+----------------------+
|format_number(b, 4)   |
+----------------------+
|1,324,235,345,435.4546|
+----------------------+
You could also use the format_number function on the column so that it automatically converts the value to a string with the required precision.
Hope this helps.
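If you control the Hive table definition, it can also help to declare the column as decimal there, so the exact value survives the write and the read back. A rough sketch (assuming a Hive-enabled session; the CREATE TABLE definition for test is an assumption, not from the original question):
// Declare b as DECIMAL(36,4) in the table so the exact value round-trips.
spark.sql("CREATE TABLE IF NOT EXISTS test (a STRING, b DECIMAL(36,4)) STORED AS PARQUET")
Seq(("one", 1324235345435.4546)).toDF("a", "b")
  .select($"a", $"b".cast("decimal(36,4)"))
  .write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)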

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from' as below.
This column contains a sentence, and I want to transform it into an Array[String]. I have tried something like this, but it does not work.
val newDF = df.select("title_from").map(x => x.split("\\s+"))
How can I achieve this? How can I transform a dataframe of strings into a dataframe of Array[String]? I want every line of newDF to be an array of words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
  .select("split_title_from")
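For reference, here is a small self-contained run of the above; the sample sentences are made up purely to illustrate, only the title_from column name comes from the question:
import org.apache.spark.sql.functions._

val df = Seq("the quick brown fox", "hello world").toDF("title_from")  // hypothetical input
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
  .select("split_title_from")
newDF.printSchema()  // split_title_from: array<string>
newDF.show(false)
// +------------------------+
// |split_title_from        |
// +------------------------+
// |[the, quick, brown, fox]|
// |[hello, world]          |
// +------------------------+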
You can try the following to get the list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id|   author|
+---+---------+
|  1| a1,a2,a3|
|  2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
|   author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
|   a1|
|   a2|
|   a3|
|   a1|
|   a4|
|  a10|
+-----+
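If you also need to keep the id next to each author, split plus explode is a common alternative to the typed flatMap above (a sketch reusing the same df):
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> df.select($"id", explode(split($"author", ",")).as("author")).show()
+---+------+
| id|author|
+---+------+
|  1|    a1|
|  1|    a2|
|  1|    a3|
|  2|    a1|
|  2|    a4|
|  2|   a10|
+---+------+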

How to rename column headers in a scala dataframe

How can I do string.replace("fromstr", "tostr") on a Scala dataframe?
As far as I can see, withColumnRenamed performs the replace on all columns and not just the headers.
withColumnRenamed renames column names only; the data remains the same. If you need to change the row values themselves, you can use one of the following:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val inputDf = Seq("to_be", "misc").toDF("c1")
val resultd1Df = inputDf
  .withColumn("c2", regexp_replace($"c1", "^to_be$", "not_to_be"))
  .select($"c2".as("c1"))
resultd1Df.show()

val resultd2Df = inputDf
  .withColumn("c2", when($"c1" === "to_be", "not_to_be").otherwise($"c1"))
  .select($"c2".as("c1"))
resultd2Df.show()

def replace(mapping: Map[String, String]) = udf(
  (from: String) => mapping.get(from).orElse(Some(from))
)

val resultd3Df = inputDf
  .withColumn("c2", replace(Map("to_be" -> "not_to_be"))($"c1"))
  .select($"c2".as("c1"))
resultd3Df.show()
Input dataframe:
+-----+
|   c1|
+-----+
|to_be|
| misc|
+-----+
Result dataframe:
+---------+
|       c1|
+---------+
|not_to_be|
|     misc|
+---------+
You can find the list of available Spark functions in the org.apache.spark.sql.functions documentation.
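If the goal really is to apply replace("fromstr", "tostr") to the headers themselves rather than to the data, one way is to fold over the column names; a sketch ("fromstr"/"tostr" are just the placeholder substrings from the question):
// Rename every column by applying the string replacement to its name.
val renamedDf = inputDf.columns.foldLeft(inputDf) { (df, name) =>
  df.withColumnRenamed(name, name.replace("fromstr", "tostr"))
}
renamedDf.printSchema()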

How to extract week day as a number from a Spark dataframe with the Scala API

I have a date column in my dataframe which is a string in the 2017-01-01 12:15:43 timestamp format.
Now I want to get the weekday number (1 to 7) from that column using the dataframe API and not Spark SQL.
Like below
df.select(weekday(col("colname")))
I found examples in Python and SQL, but not in Scala. Can anybody help me with this?
In sqlContext:
sqlContext.sql("select date_format(to_date('2017-01-01'),'W') as week")
This works the same way in Scala:
scala> spark.version
res1: String = 2.3.0
scala> spark.sql("select date_format(to_date('2017-01-01'),'W') as week").show
// +----+
// |week|
// +----+
// |   1|
// +----+
or
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq("2017-01-01").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.select(date_format(to_date(col("date")), "W")).show
+-------------------------------+
|date_format(to_date(`date`), W)|
+-------------------------------+
|                              1|
+-------------------------------+
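Note that the 'W' pattern above gives the week of the month. If what you actually want is the weekday as a number from 1 to 7, Spark 2.3+ also exposes a dayofweek function (1 = Sunday, ..., 7 = Saturday) that fits the dataframe-only requirement; a sketch reusing the same df:
scala> df.select(dayofweek(to_date(col("date"))).as("weekday")).show
+-------+
|weekday|
+-------+
|      1|
+-------+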

Pass Array[Seq[String]] to a UDF in spark scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to find pattern matching from a dataframe col.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
  (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|                   z|
|  2|         abs,abc,dfg|
|  3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
// returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in the text column and add a new column called count.
Result:
+---+--------------------+-----+
| id|                text|count|
+---+--------------------+-----+
|  1|                   z|    1|
|  2|         abs,abc,dfg|    2|
|  3|a,b,c,d,e,f,abs,a...|    1|
+---+--------------------+-----+
I tried to define a UDF, passing the text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => { txt.count(_ == pattern) })
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done by using collect_list to group all the rows of your dataframe into one array. Then just count how many elements of the collected array contain the current row's text value, as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.mutable

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", valsum($"text", $"array"))
  .drop("grouping", "array")
  .show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text                   |count|
+---+-----------------------+-----+
|1  |z                      |1    |
|2  |abs,abc,dfg            |2    |
|3  |a,b,c,d,e,f,abs,abc,dfg|1    |
+---+-----------------------+-----+
I hope this is helpful.
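On Spark 2.4 or later, the same count can also be written without a named UDF, using the SQL higher-order function filter plus size on the collected array; a sketch (same df and the same constant-grouping trick as above, relying on the lambda being able to reference the outer text column):
// count the elements of the collected array that contain this row's text value
df.withColumn("grouping", lit("g"))
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", expr("size(filter(array, element -> instr(element, text) > 0))"))
  .drop("grouping", "array")
  .show(false)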