Replacing Strings with numbers in a pyspark dataframe - pyspark

I am new to pyspark and I want to replace names with numbers in a pyspark dataframe column dynamically, because I have more than 500,000 names in my dataframe. How should I proceed?
+---------+
| Name    |
+---------+
| nameone |
| nametwo |
+---------+
should become
+------+
| Name |
+------+
| 1    |
| 2    |
+------+

Well, you have two options I can think of. In case you have only unique names, you can simply apply the monotonically_increasing_id function. This will create a unique but not consecutive id for each row.
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

l = [
    ('nameone', ),
    ('nametwo', ),
    ('nameone', )
]
columns = ['Name']
df = spark.createDataFrame(l, columns)

# use 'Name' instead of 'uniqueId' as the output column name to overwrite the column
df = df.withColumn('uniqueId', F.monotonically_increasing_id())
df.show()
Output:
+-------+----------+
| Name| uniqueId|
+-------+----------+
|nameone| 0|
|nametwo|8589934592|
|nameone|8589934593|
+-------+----------+
In case you want to assign the same id to rows which have the same value for Name, you have to use a StringIndexer:
indexer = StringIndexer(inputCol="Name", outputCol="StringINdex")
df = indexer.fit(df).transform(df)
df.show()
Output:
+-------+----------+-----------+
| Name| uniqueId|StringINdex|
+-------+----------+-----------+
|nameone| 0| 0.0|
|nametwo|8589934592| 1.0|
|nameone|8589934593| 0.0|
+-------+----------+-----------+
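If you want the Name column itself to end up holding integers (as in the expected output), a small follow-up sketch, reusing the dataframe and the F import from above, is to cast the StringINdex column and overwrite Name:
# A sketch building on the StringIndexer result above; the +1 is only there
# so the ids start at 1 like the expected output.
df = (df
      .withColumn('Name', F.col('StringINdex').cast('int') + 1)
      .drop('uniqueId', 'StringINdex'))
df.show()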

Related

Scala Spark : Convert struct columns type to decimal type

I have a csv stored in an s3 location which has data like this:
column1 | column2
--------+--------
adsf    | 2000.0
fff     | 232.34
I have an AWS Glue job in Scala which reads this file into a dataframe:
var srcDF = glueContext.getCatalogSource(database = "",
    tableName = "",
    redshiftTmpDir = "",
    transformationContext = "").getDynamicFrame().toDF()
When I print the schema, it is inferred like this:
srcDF.printSchema()
|-- column1: string
|-- column2: struct (double, string)
And the dataframe looks like this:
column1 | column2
--------+-----------
adsf    | [2000.0,]
fff     | [232.34,]
When I try to save the dataframe to csv, it complains:
org.apache.spark.sql.AnalysisException CSV data source does not support struct<double:double,string:string> data type.
How do I convert the dataframe so that only the struct-type columns (if any exist) are converted to decimal type? The output should look like this:
column1 | column2
--------+--------
adsf    | 2000.0
fff     | 232.34
Edit:
Thanks for the response. I have tried using following code
df.select($"column2._1".alias("column2")).show()
But I got the same error for both:
org.apache.spark.sql.AnalysisException No such struct field _1 in double, string;
Edit 2:
It seems Spark flattened the struct and named its fields "double" and "string", so this solution worked for me:
df.select($"column2.double").show()
You can extract fields from a struct using getItem. The code can look something like this:
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("A", "B")
df.show()
df.select(col("A"), col("B").getItem("_1").as("B")).show()
it will print:
before select:
+----+----------+
| A| B|
+----+----------+
|adsf|[2000.0, ]|
| fff|[232.34, ]|
+----+----------+
after select:
+----+------+
| A| B|
+----+------+
|adsf|2000.0|
| fff|232.34|
+----+------+
You can also use the dot notation column2._1 to get the struct field by name:
val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("column1", "column2")
df.show
+-------+----------+
|column1| column2|
+-------+----------+
| adsf|[2000.0, ]|
| fff|[232.34, ]|
+-------+----------+
val df2 = df.select($"column1", $"column2._1".alias("column2"))
df2.show
+-------+-------+
|column1|column2|
+-------+-------+
| adsf| 2000.0|
| fff| 232.34|
+-------+-------+
df2.coalesce(1).write.option("header", "true").csv("output")
and your csv file will be in the output/ folder:
column1,column2
adsf,2000.0
fff,232.34
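If you need the extracted column to be an actual decimal type rather than a double (the question title asks for decimal), here is, for reference, a PySpark sketch of the same select with an explicit cast; the _1 field name and the 10,2 precision are assumptions:
from pyspark.sql import functions as F

# Hedged sketch: pull the struct's first field and cast it to a decimal so the
# CSV writer accepts it; adjust precision/scale to your data.
out = df.select(
    F.col("column1"),
    F.col("column2._1").cast("decimal(10,2)").alias("column2"))
out.coalesce(1).write.option("header", "true").csv("output")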

How to replace pyspark dataframe column values with a dict

I have a dataframe as shown below:
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 123|   3|   0|
| 222|   0|   1|
| 200|   0|   2|
+----+----+----+
I want to replace the values in colB using a dict d to get a result like this:
d = {3: 'a', 0: 'b'}
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 123|   a|   0|
| 222|   b|   1|
| 200|   b|   2|
+----+----+----+
You can simply use the dataframe method replace, whose documentation does not clearly explain this use case.
To use a dictionary, pass the dict as the first argument, any placeholder value as the second argument, and the name of the column as the third argument.
At least in Spark 2.2, a warning is raised explaining that, since the first argument is a dictionary, the second argument is ignored.
data = [
    (123, 3, 0),
    (222, 0, 1),
    (200, 0, 2)]
df = spark.createDataFrame(data, ['colA', 'colB', 'colC'])
d = {3: 'a', 0: 'b'}
df_renamed = df.replace(d, 1, 'colB')
df_renamed.show()
# +----+----+----+
# |colA|colB|colC|
# +----+----+----+
# | 123|   a|   0|
# | 222|   b|   1|
# | 200|   b|   2|
# +----+----+----+
Please also note that, as the docs state, "When replacing, the new value will be cast to the type of the existing column". As a consequence, your column will be cast to string.
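If you would rather make the mapping and the resulting type explicit instead of relying on replace()'s implicit cast, a common alternative sketch (not part of the replace API; values missing from the dict become null) is to build a literal map column and index it with colB:
from itertools import chain
import pyspark.sql.functions as F

d = {3: 'a', 0: 'b'}
# Build a MapType literal from the dict and look each colB value up in it.
mapping = F.create_map([F.lit(x) for x in chain(*d.items())])
df_mapped = df.withColumn('colB', mapping[F.col('colB')])
df_mapped.show()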

Spark (scala) dataframes - Check whether strings in column exist in a column of another dataframe

I have a spark dataframe, and I wish to check whether each string in a particular column exists in a pre-defined column of another dataframe.
I have found a similar problem in Spark (scala) dataframes - Check whether strings in column contain any items from a set,
but I want to check whether the strings exist in a column of another dataframe, not in a List or a Set as in that question. I don't know how to convert a column to a set or a list, and I don't know of an "exists" method on a dataframe. Can anyone help?
My data is similar to this
df1:
+---+-----------------+
| id| url |
+---+-----------------+
| 1|google.com |
| 2|facebook.com |
| 3|github.com |
| 4|stackoverflow.com|
+---+-----------------+
df2:
+-----+------------+
| id | urldetail |
+-----+------------+
| 11 |google.com |
| 12 |yahoo.com |
| 13 |facebook.com|
| 14 |twitter.com |
| 15 |youtube.com |
+-----+------------+
Now I am trying to create a third column with the result of a comparison, checking whether each string in the $"urldetail" column exists in $"url":
+---+------------+-------------+
| id| urldetail | check |
+---+------------+-------------+
| 11|google.com | 1 |
| 12|yahoo.com | 0 |
| 13|facebook.com| 1 |
| 14|twitter.com | 0 |
| 15|youtube.com | 0 |
+---+------------+-------------+
I want to use a UDF, but I don't know how to check whether a string exists in a column of another dataframe. Please help!
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined column of another dataframe.
Here is one way, using = or like:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CompareColumns extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local").getOrCreate()

  import spark.implicits._

  val df1 = Seq(
    (1, "google.com"),
    (2, "facebook.com"),
    (3, "github.com"),
    (4, "stackoverflow.com")).toDF("id", "url").as("first")
  df1.show

  val df2 = Seq(
    (11, "google.com"),
    (12, "yahoo.com"),
    (13, "facebook.com"),
    (14, "twitter.com")).toDF("id", "url").as("second")
  df2.show

  val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer").select(
    col("first.url"),
    col("first.url").contains(col("second.url")).as("check")).filter("url is not null")

  df3.na.fill(Map("check" -> false))
    .show
}
Result :
+---+-----------------+
| id| url|
+---+-----------------+
| 1| google.com|
| 2| facebook.com|
| 3| github.com|
| 4|stackoverflow.com|
+---+-----------------+
+---+------------+
| id| url|
+---+------------+
| 11| google.com|
| 12| yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+
+-----------------+-----+
| url|check|
+-----------------+-----+
| google.com| true|
| facebook.com| true|
| github.com|false|
|stackoverflow.com|false|
+-----------------+-----+
With a full outer join we can achieve this.
For more details on all of the join types, see my LinkedIn post.
Note: instead of the 1/0 from the question I have used boolean values for the check column here; you can translate them into whatever you want, for example the 1/0 output, as in the sketch below.
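For reference, a PySpark sketch (column names assumed from the question) that produces the 1/0 check column directly with a left join and a cast:
from pyspark.sql import functions as F

# Left-join df2 against the distinct urls of df1; a match means check = 1.
urls = df1.select(F.col("url").alias("u")).distinct()
result = (df2.join(urls, F.col("urldetail") == F.col("u"), "left")
             .withColumn("check", F.col("u").isNotNull().cast("int"))
             .drop("u"))
result.show()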
UPDATE: if the rows in the second dataframe keep increasing, you can use this instead; it won't miss any rows from the second dataframe:
val df3 = df2.join(df1, expr("first.url like second.url"), "full").select(
    col("second.*"),
    col("first.url").contains(col("second.url")).as("check"))
  .filter("url is not null")
df3.na.fill(Map("check" -> false))
  .show
One more thing: you can also try regexp_extract as shown in this post:
https://stackoverflow.com/a/53880542/647053
Read in your data and use the trim operation, just to be conservative, to remove whitespace before joining on the strings.
val df = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com "), (4, "stackoverflow.com"))
  .toDF("id", "url").select($"id", trim($"url").as("url"))
val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
  .toDF("id", "urldetail").select($"id", trim($"urldetail").as("urldetail"))
df.join(df2.withColumn("flag", lit(1)).drop("id"), df("url") === df2("urldetail"), "left_outer")
  .withColumn("contains_bool", when($"flag" === 1, true).otherwise(false))
  .drop("flag", "urldetail").show
+---+-----------------+-------------+
| id| url|contains_bool|
+---+-----------------+-------------+
| 1| google.com| true|
| 2| facebook.com| true|
| 3| github.com| false|
| 4|stackoverflow.com| false|
+---+-----------------+-------------+

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes, let's say df1 and df2, in Spark Scala.
df1 has two fields, 'ID' and 'Text', where 'Text' holds some description (multiple words). I have already removed all special and numeric characters from 'Text', leaving only letters and spaces.
df1 Sample
+---+------------------+
|ID |Text              |
+---+------------------+
| 1 |helo how are you  |
| 2 |hai haiden        |
| 3 |hw are u uma      |
+---+------------------+
df2 contains a list of words and corresponding replacement words
df2 Sample
+------+---------+
|Word  |Replace  |
+------+---------+
| helo |hello    |
| hai  |hi       |
| hw   |how      |
| u    |you      |
+------+---------+
I would need to find all occurrences of the words in df2("Word") within df1("Text") and replace them with df2("Replace").
With the sample dataframes above, I would expect a resulting dataframe df3 as given below.
df3 Sample
+---+------------------+
|ID |Text              |
+---+------------------+
| 1 |hello how are you |
| 2 |hi haiden         |
| 3 |how are you uma   |
+---+------------------+
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following:
val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap
This will give you a Map to refer to :
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF to create a function that will use the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
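For completeness, the same map-plus-UDF idea as a PySpark sketch, with the lookup table broadcast explicitly so every executor gets one copy (column names are assumed from the question):
import pyspark.sql.functions as F

# Collect df2 into a plain dict and broadcast it to the executors.
keyVal = {row['Word']: row['Replace'] for row in df2.collect()}
bkv = spark.sparkContext.broadcast(keyVal)

@F.udf('string')
def replace_words(text):
    # Replace each word if it has a mapping, otherwise keep it as-is.
    return ' '.join(bkv.value.get(w, w) for w in text.split(' '))

df1.withColumn('Text', replace_words('Text')).show(truncate=False)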
I will demonstrate only for the first id and assume that you cannot do a collect action on your df2. First you need to make sure that the text column of df1 has an array type in its schema:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column (the resN names below are intermediate spark-shell results):
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select, coalescing the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements; collect_list does not guarantee any ordering, so you need to handle that explicitly, as in the sketch below.
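One way to make that ordering explicit, written as a PySpark sketch with the column names from the question, is to carry each word's position through the pipeline and sort on it before stitching the text back together:
from pyspark.sql import functions as F

# Keep each word's position so the original order can be restored later.
words = df1.select("ID", F.posexplode(F.split("Text", " ")).alias("pos", "word"))

# Replace where a match exists, otherwise keep the original word.
replaced = (words.join(df2, words["word"] == df2["Word"], "left")
                 .select(words["ID"], words["pos"],
                         F.coalesce(df2["Replace"], words["word"]).alias("word")))

# collect_list gives no ordering guarantee, so collect (pos, word) structs,
# sort the array by pos, then concatenate the words back into one string.
result = (replaced.groupBy("ID")
          .agg(F.array_sort(F.collect_list(F.struct("pos", "word"))).alias("arr"))
          .select("ID", F.concat_ws(" ", F.col("arr.word")).alias("Text")))
result.show(truncate=False)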
I suppose the code below should solve your problem. I have solved this using RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map { word => Row.fromTuple((row.getAs[Int]("ID"), word)) }
}.zipWithIndex()

val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))

val opRdd = wordDf.join(df2, wordDf("word") === df2("Word"), "left_outer")
  .drop(df2("Word"))
  .rdd
  .groupBy(_.getAs[Int]("id"))
  .map(x => Row.fromTuple((x._1, x._2.toList.sortBy(_.getAs[Long]("index"))
    .map(row => if (row.getAs[String]("Replace") != null) row.getAs[String]("Replace") else row.getAs[String]("word"))
    .mkString(" "))))

val opDF = sqlContext.createDataFrame(opRdd,
  StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

How to perform merge operation on spark Dataframe?

I have Spark dataframes mainDF and deltaDF, both with a matching schema.
Content of the mainDF is as follows:
id | name | age
1 | abc | 23
2 | xyz | 34
3 | pqr | 45
Content of deltaDF is as follows:
id | name | age
1 | lmn | 56
4 | efg | 37
I want to merge deltaDF with mainDF based on value of id. So if my id already exists in mainDF then the record should be updated and if id doesn't exist then the new record should be added. So the resulting data frame should be like this:
id | name | age
1 | lmn | 56
2 | xyz | 34
3 | pqr | 45
4 | efg | 37
This is my current code and it is working:
val updatedDF = mainDF.as("main").join(deltaDF.as("delta"), $"main.id" === $"delta.id", "inner").select($"main.id", $"main.name", $"main.age")
mainDF = mainDF.except(updatedDF).unionAll(deltaDF)
However, here I need to explicitly provide the list of columns again in the select function, which feels like overhead to me. Is there any other better/cleaner approach to achieve the same?
If you don't want to provide the list of columns explicitly, you can map over the original DF's columns, something like:
.select(mainDF.columns.map(c => $"main.$c" as c): _*)
BTW you can do this without a union after the join: you can use an outer join to also get the records that exist in only one of the DFs, and then use coalesce to "choose" the non-null value, preferring deltaDF's values. So the complete solution would be something like:
val updatedDF = mainDF.as("main")
.join(deltaDF.as("delta"), $"main.id" === $"delta.id", "outer")
.select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
updatedDF.show
// +---+----+---+
// | id|name|age|
// +---+----+---+
// | 1| lmn| 56|
// | 3| pqr| 45|
// | 4| efg| 37|
// | 2| xyz| 34|
// +---+----+---+
You can achieve this by using dropDuplicates and specifying which column you don't want any duplicates on.
Here's working code:
val a = (1,"lmn",56)::(2,"abc",23)::(3,"pqr",45)::Nil
val b = (1,"opq",12)::(5,"dfg",78)::Nil
val df1 = sc.parallelize(a).toDF
val df2 = sc.parallelize(b).toDF
df1.unionAll(df2).dropDuplicates("_1"::Nil).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1|lmn| 56|
| 2|abc| 23|
| 3|pqr| 45|
| 5|dfg| 78|
+---+---+---+
Another way of doing so: a pyspark implementation
from pyspark.sql.functions import col

updatedDF = mainDF.alias("main").join(deltaDF.alias("delta"), col("main.id") == col("delta.id"), "full_outer")
# rows whose id exists in deltaDF take the delta values (updated and brand-new records)
upsertDF = updatedDF.where("delta.id IS NOT NULL").select("delta.*")
# rows whose id is absent from deltaDF keep the original main values
unchangedDF = updatedDF.where("delta.id IS NULL").select("main.*")
finalDF = upsertDF.union(unchangedDF)