Replacing strings inside a df using a dictionary - scala

I'm new to Scala. I'm trying to replace parts of strings using a dictionary.
My dictionary would be:
val dict = Seq(("fruits", "apples"), ("color", "red"), ("city", "paris"))
  .toDF("old", "new")
+------+------+
| old| new|
+------+------+
|fruits|apples|
| color| red|
| city| paris|
+------+------+
I would then like to translate values in a column of another df, which is:
+--------------------------+
|oldCol |
+--------------------------+
|I really like fruits |
|they are colored brightly |
|i live in city!! |
+--------------------------+
the desired output:
+------------------------+
|newCol |
+------------------------+
|I really like apples |
|they are reded brightly |
|i live in paris!! |
+------------------------+
Please help! I've tried to convert dict to a Map and then use the replaceAllIn() function, but I really can't solve this one.
I've also tried foldLeft, following this answer: Scala replace an String with a List of Key/Values.
Thanks

Create a Map from the dict dataframe, and then you can easily do this using a udf like below:
import org.apache.spark.sql.functions._
//Creating a Map from the dict dataframe
val oldNewMap = dict.map(row => row.getString(0) -> row.getString(1)).collect.toMap
//Creating the udf
val replaceUdf = udf((str: String) => oldNewMap.foldLeft(str) {
  case (acc, (key, value)) => acc.replaceAll(key + ".", value).replaceAll(key, value)
})
//Select the old column from oldDf and apply the udf
oldDf.withColumn("newCol", replaceUdf(oldDf.col("oldCol"))).drop("oldCol").show
//Output:
+--------------------+
| newCol|
+--------------------+
|I really like apples|
|they are reded br...|
| i live in paris!!|
+--------------------+
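As an alternative that avoids the udf entirely, you can fold the same oldNewMap into a chain of regexp_replace calls (a minimal sketch reusing oldNewMap and oldDf from above; note it does plain word-for-word replacement rather than the key + "." trick, so the output can differ slightly around punctuation):
import org.apache.spark.sql.functions.regexp_replace
//Fold every (old, new) pair into a single Column expression, no udf involved
val translated = oldNewMap.foldLeft(oldDf.col("oldCol")) {
  case (col, (oldWord, newWord)) => regexp_replace(col, oldWord, newWord)
}
oldDf.withColumn("newCol", translated).drop("oldCol").show(false)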
I hope this will help you

Related

How to extract the numeric part from a string column in spark?

I am new to Spark and trying to play with data to get practice. I am using Databricks with Scala, and for the dataset I am using the FIFA 19 complete player dataset from Kaggle. One of the columns, named "Weight", contains data that looks like:
+------+
|Weight|
+------+
|136lbs|
|156lbs|
|136lbs|
|... |
|... |
+------+
I want to change the column so that it looks like this:
+------+
|Weight|
+------+
|136 |
|156 |
|136 |
|... |
|... |
+------+
Can anyone help me change the column values in Spark SQL?
Here is another way, using regex and the regexp_extract built-in function:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // for toDF and the $ column syntax (already in scope in spark-shell/Databricks)
val df = Seq(
  "136lbs",
  "150lbs",
  "12lbs",
  "30kg",
  "500kg"
).toDF("weight")
df.withColumn("weight_num", regexp_extract($"weight", "\\d+", 0))
  .withColumn("weight_unit", regexp_extract($"weight", "[a-z]+", 0))
  .show
//Output
+------+----------+-----------+
|weight|weight_num|weight_unit|
+------+----------+-----------+
|136lbs| 136| lbs|
|150lbs| 150| lbs|
| 12lbs| 12| lbs|
| 30kg| 30| kg|
| 500kg| 500| kg|
+------+----------+-----------+
You can create a new column and use regexp_replace:
dataFrame.withColumn("Weight2", regexp_replace($"Weight", lit("lbs"), lit("")))

Why "withColumn" transformation on spark dataframe is not checking records from an external list?

I am using Spark and Scala for learning purposes. I came across a situation wherein I need to check the validity of records present in one of the columns of a Spark dataframe.
This is how I created one dataframe, "dataframe1":
import sparkSession.implicits._
val dataframe1 = Seq("AB","BC","CD","DA","AB","BC").toDF("col1")
dataframe1:
+----+
|col1|
+----+
| AB|
| BC|
| CD|
| DA|
| AB|
| BC|
+----+
The validity of records depends on the condition if the record is "AB" or "BC". Here is my first attempt:
val dataframe2 = dataframe1.withColumn("col2", when('col1.contains("AB") or 'col1.contains("BC"), "valid").otherwise("invalid"))
dataframe2:
+----+-------+
|col1| col2|
+----+-------+
| AB| valid|
| BC| valid|
| CD|invalid|
| DA|invalid|
| AB| valid|
| BC| valid|
+----+-------+
But I don't think this is a good way of doing it, because if I need to add more valid records then I have to add more conditions to the "when" clause, which increases the code length and hurts readability.
So I tried to put all the valid records in one list and check if the record string is present in the list. If it is present, then it is a valid record; otherwise it is not. Here is the code snippet for this attempt:
val validRecList = Seq("AB", "BC").toList
val dataframe3 = dataframe1.withColumn("col2", if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid"))
But somehow it is not working as expected, as the result of this is:
+----+-------+
|col1| col2|
+----+-------+
| AB|invalid|
| BC|invalid|
| CD|invalid|
| DA|invalid|
| AB|invalid|
| BC|invalid|
+----+-------+
Can anybody tell me what mistake I am making here? Any other generic suggestions for such a scenario are also welcome.
Thank you.
Try this:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataframe1 = Seq("AB","BC","CD","DA","AB","BC", "XX").toDF("col1").as[(String)]
val validRecList = List("AB", "BC")
val dataframe2 = dataframe1.withColumn("col2", when($"col1".isin(validRecList: _*), lit("valid")).otherwise(lit("invalid")))
dataframe2.show(false)
returns:
+----+-------+
|col1|col2 |
+----+-------+
|AB |valid |
|BC |valid |
|CD |invalid|
|DA |invalid|
|AB |valid |
|BC |valid |
|XX |invalid|
+----+-------+
The dataframe3 code is not working because, if we look at the documentation for the "withColumn" function on Dataset (https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset),
we see that withColumn takes a "String" and a "Column" as parameter types.
So this code
val dataframe3 = dataframe1.withColumn("col2", if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid"))
passes col2 as the new column name, but passes the result of lit("valid") or lit("invalid") as the Column. The if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid") is evaluated as plain Scala code on the driver, not as a Dataset or Column operation.
In other words, if(validRecList.contains('col1.toString())) is evaluated by Scala, not by Spark: the "invalid" result comes from the fact that validRecList does not contain the symbol 'col1. But if you defined val validRecList = Seq('col1, "AB", "BC"), then validRecList.contains('col1) would return true.
Also, the if operator is not supported on Dataset or on Column.
If you want a condition inside withColumn, you need to express it as a Column-typed expression, like this:
dataframe3.withColumn("isContainRecList", $"col1".isin(validRecList: _*))
this $"col1".isin(validRecList: _*) is a Column type expression because it will return Column (based on the documentation) or you can use when(the_condition, value_if_true, value_if_false).
So, I think it is important to understand the types that the spark engine will work with our data, if we are not give the Column type expression, it will not refer to the 'col1 data but it will refer to 'col1 as a scala symbol.
Also, when you want to use IF, maybe you could create a User Defined Functions.
import org.apache.spark.sql.functions.udf
def checkValidRecList(needle: String): String = if(validRecList.contains(needle)) "valid" else "invalid"
val checkUdf = udf[String, String](checkValidRecList)
val dataframe3 = dataframe1.withColumn("col2", checkUdf('col1))
the result is:
scala> dataframe3.show(false)
+----+-------+
|col1|col2 |
+----+-------+
|AB |valid |
|BC |valid |
|CD |invalid|
|DA |invalid|
|AB |valid |
|BC |valid |
+----+-------+
But remember that UDFs are not always recommended; when a built-in Column expression like isin can do the job, prefer that.

How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:
scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]
scala> df1.show()
+---+----+
| id|nums|
+---+----+
| a| [1]|
| b| [1]|
+---+----+
And I want to add elements to the array in the nums column, so that I get something like the following:
+---+-------+
| id|nums |
+---+-------+
| a| [1,5] |
| b| [1,5] |
+---+-------+
Is there a way to do this using the .withColumn() method of the DataFrame? E.g.
val df2 = df1.withColumn("nums", append(col("nums"), lit(5)))
I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
import org.apache.spark.sql.functions.{lit, array, array_union}
val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show
+---+------+
| id| nums|
+---+------+
| a|[1, 5]|
| b|[1, 5]|
+---+------+
array_union() was added in the Spark 2.4.0 release on 11/2/2018, 7 months after you asked the question :) See https://spark.apache.org/news/index.html
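As a side note, the same 2.4+ releases also let concat work on array columns, and unlike array_union it keeps duplicate values (a minimal sketch using the df1 from the question):
import org.apache.spark.sql.functions.{array, concat, lit}
//concat appends without de-duplicating, so repeated values survive
val df3 = df1.withColumn("nums", concat($"nums", array(lit(5))))
df3.show()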
You can do it using a udf function as follows:
def addValue = udf((array: Seq[Int]) => array ++ Array(5))
df1.withColumn("nums", addValue(col("nums")))
.show(false)
and you should get
+---+------+
|id |nums |
+---+------+
|a |[1, 5]|
|b |[1, 5]|
+---+------+
Updated
An alternative is to go the Dataset way and use map. Given a case class
case class add(id: String, nums: Seq[Int])
you can write
df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums") ++ Seq(5)))
  .show(false)
I hope the answer is helpful
If you are, like me, searching for how to do this in a Spark SQL statement, here's how:
%sql
select array_union(array("value 1"), array("value 2"))
You can use array_union to join up two arrays. To be able to use this, you have to turn your value-to-append into an array. Do this by using the array() function.
You can enter a value like array("a string") or array(yourColumn).
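For the DataFrame from the original question, the same statement can be run from Scala through a temp view (a minimal sketch; the view name items is just an example):
//Register the question's df1 as a view and append 5 to each array via SQL
df1.createOrReplaceTempView("items")
spark.sql("select id, array_union(nums, array(5)) as nums from items").show()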
Be careful with using Spark's array_union: it removes duplicates, so you will not get the expected results if you have duplicate entries in your array. It also costs at least O(N), so when I used it inside an array aggregate it became an O(N^2) operation and took forever for some large arrays.

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
+---+---------------------------+
|ID |words                      |
+---+---------------------------+
|1  |['apple','ball','ballon']  |
|2  |['cat','camel','james']    |
+---+---------------------------+
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .map(lambda x: (x[1], x[0])) \
    .toDF(["id", "words"]).show(truncate=False)
+---+---------------------+
|id |words                |
+---+---------------------+
|0  |[apple, ball, ballon]|
|1  |[cat, camel, james]  |
|2  |[none, focus, cake]  |
+---+---------------------+

Pass Array[Seq[String]] to UDF in Spark Scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in the $text column and add a new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf, passing the $text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) //convert column to Array[Seq[String]
val valsum = udf((txt:Array[Seq[String],pattern:String)=> {txt.count(_ == pattern) } )
df.withColumn("newCol", valsum( lit(txt) ,df(text)) )).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done using collect_list by grouping all the rows of your dataframe into a single group. Then, for each row, count how many elements of the collected array contain the value in the text column, as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.mutable
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.count(element => element.contains(txt)))
df.withColumn("grouping", lit("g"))
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", valsum($"text", $"array"))
  .drop("grouping", "array")
  .show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
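If you would rather avoid the single-partition Window, another option is to collect the text values to the driver once and close over them in the udf (a minimal sketch reusing the df and imports above; only sensible when the column is small enough to collect):
//Pull every text value to the driver once
val allTexts: Seq[String] = df.select("text").as[String].collect().toSeq
//For each row, count how many of the collected values contain its text
val countMatches = udf((pattern: String) => allTexts.count(_.contains(pattern)))
df.withColumn("count", countMatches($"text")).show(false)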
I hope this is helpful.