How to replace PySpark dataframe column values with a dict - pyspark

I have a dataframe as shown below:
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 123|   3|   0|
| 222|   0|   1|
| 200|   0|   2|
+----+----+----+
I want to replace the values in colB using a dict d to get a result like this:
d = {3: 'a', 0: 'b'}
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 123|   a|   0|
| 222|   b|   1|
| 200|   b|   2|
+----+----+----+

You can simply use the DataFrame method replace, whose documentation does not clearly explain this use case.
To use a dictionary, you simply pass the dict as the first argument, an arbitrary value as the second argument, and the name of the column as the third argument.
At least in Spark 2.2, a warning is raised stating that, since the first argument is a dictionary, the second argument will not be taken into account.
data = [
    (123, 3, 0),
    (222, 0, 1),
    (200, 0, 2)]
df = spark.createDataFrame(data, ['colA', 'colB', 'colC'])
d = {3: 'a', 0: 'b'}
df_renamed = df.replace(d, 1, 'colB')
df_renamed.show()
# +----+----+----+
# |colA|colB|colC|
# +----+----+----+
# | 123|   a|   0|
# | 222|   b|   1|
# | 200|   b|   2|
# +----+----+----+
Please also note that "when replacing, the new value will be cast to the type of the existing column", as reported in the docs. As a consequence, your column will be cast to string.
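If you want to double-check the resulting type yourself, a quick look at the schema of the df_renamed built above will show what colB ended up as:
# Prints the schema of the replaced dataframe; colB should now appear as a
# string column if the cast described above applied.
df_renamed.printSchema()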

Related

PySpark Return Exact Match from list of strings

I have a dataset as follows:
| id | text               |
-----------------------------
| 01 | hello world        |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
With the following solution using .rlike() (or .contains()), sentences with either partial or exact matches to the list of words are returned as true. I would like only exact matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
Current output:
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 1             |
| 02 | this place is hell | 1             |
Expected output:
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
Try the code below; I have only changed the KEYWORDS pattern:
from pyspark.sql.functions import col, when

data = [["01", "hello world"], ["02", "this place is hell"]]
schema = ["id", "text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
    df2
    .select(
        col('id'),
        col('text'),
        when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work:
Keywords = 'hell|horrible|sucks'
df = (
    df.select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike('(' + Keywords + r')(\s|$)'), 1)
         .otherwise(0)
         .alias('keyword_found')
    )
)
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
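As a side note, a slightly more general sketch (not part of the answers above, and reusing the df2 built earlier): rlike uses Java regular expressions, so \b word boundaries can be used to require whole-word matches even when the keyword is followed by punctuation rather than whitespace or the end of the string.
from pyspark.sql import functions as F

# \b on both sides means "hell" matches only as a whole word, so "hello" is not flagged.
KEYWORDS = r'\b(hell|horrible|sucks)\b'
df = df2.select(
    F.col('id'),
    F.col('text'),
    F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
df.show()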

Conditional string manipulation in Pyspark

I have a PySpark dataframe with, among other columns, a column of MSNs (of string type) like the following:
+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+
As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.
I want to get rid of that - character, so that I end up with something like:
+------+
| Col2 |
+------+
| 654 |
| 1859 |
| 5875 |
| 784 |
| 596 |
| 668 |
| 1075 |
+------+
I have tried the following code (where df is the dataframe containing the column), but it does not appear to work:
if df.Col1[3] == "-":
    df = df.withColumn('Col2', df.series.substr(1, 3))
    return df
else:
    return df
Does anyone know how to do it?
You can replace - in the column with an empty string ("") using F.regexp_replace.
See the code below:
df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()
+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+
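If the dash can only occur at the end of the value, as in the sample data, you could also anchor the pattern so that any dashes elsewhere in the string are left untouched (a small variation on the answer above, not part of it):
# "-$" only matches a trailing dash, so dashes in the middle of a value survive.
df.withColumn("Col2", F.regexp_replace("Col1", "-$", "")).show()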
Here is a solution using the .substr() method:
df.withColumn("Col2", F.when(F.col("Col1").substr(4, 1) == "-",
F.col("Col1").substr(1, 3)
).otherwise(
F.col("Col1"))).show()
+----+----+
|Col1|Col2|
+----+----+
|654-| 654|
|1859|1859|
|5875|5875|
|784-| 784|
|596-| 596|
|668-| 668|
|1075|1075|
+----+----+

Apache Spark calculating column value on the basis of distinct value of columns

I am processing the following table and I would like to compute a new column (outcome) based on the distinct values of 2 other columns.
| id1 | id2 | outcome |
| 1   | 1   | 1       |
| 1   | 1   | 1       |
| 1   | 3   | 2       |
| 2   | 5   | 1       |
| 3   | 1   | 1       |
| 3   | 2   | 2       |
| 3   | 3   | 3       |
The outcome should be numbered incrementally starting from 1, based on the combined values of id1 and id2. Any hints on how this can be accomplished in Scala? row_number doesn't seem to be useful in this case.
The logic here is that, for each unique value of id1, the numbering of outcome starts at 1, with the minimum id2 for that id1 being assigned the value 1.
You could try dense_rank()
With your example:
val df = sqlContext
  .read
  .option("sep", "|")
  .option("header", true)
  .option("inferSchema", true)
  .csv("/home/cloudera/files/tests/ids.csv") // here we read the .csv file
  .cache()

df.show()
df.printSchema()

df.createOrReplaceTempView("table")
sqlContext.sql(
  """
    |SELECT id1, id2, DENSE_RANK() OVER(PARTITION BY id1 ORDER BY id2) AS outcome
    |FROM table
    |""".stripMargin).show()
output
+---+---+-------+
|id1|id2|outcome|
+---+---+-------+
| 2| 5| 1|
| 1| 1| 1|
| 1| 1| 1|
| 1| 3| 2|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
+---+---+-------+
Use a Window function to club (partition) the rows by the first id and then order each partition based on the second id.
Now you just need to assign a rank (dense_rank) over each Window partition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

df.withColumn("outcome", dense_rank().over(Window.partitionBy("id1").orderBy("id2")))

How to ignore a special row in collect_list

I have a table like below.
| COLUMN A| COLUMN b|
| Case| 1111111111|
| Rectype| ABCD|
| Key| UMUM_REF_ID=A1234|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| GGGGGGG Value|
| Case| 2222222222|
| Rectype| ABCD|
| Key| UMUM_REF_ID=B8765|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| Invalid Value|
I want to add a new column "C".
C is the collect_list of the column B values for the "Case", "Rectype", "Key", "UMSV ERROR" and "NTNB ERROR" rows in column A.
My code is:
val window = Window.rowsBetween(0, 4)
val begin = rddDF.withColumn("C", collect_list($"value").over(window)).where($"A" like "Case")
begin.show()
It works well.
Now, I want to get the collect_list again, but ignore the "NTNB ERROR" rows whose value in column B is "Invalid Value".
What should I do?
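One possible approach, sketched in PySpark (the question's code is Scala, and the column names A/B, the dataframe name rddDF, and the 5-row frame below just mirror the question, so adjust them to your actual names): collect_list skips nulls, so nulling out the rows you want to ignore before collecting drops them from C.
from pyspark.sql import functions as F, Window

w = Window.rowsBetween(0, 4)

# Null out the value for the rows that should be ignored; collect_list drops nulls.
masked = F.when(
    (F.col("A") == "NTNB ERROR") & (F.col("B") == "Invalid Value"), F.lit(None)
).otherwise(F.col("B"))

begin = (rddDF
         .withColumn("C", F.collect_list(masked).over(w))
         .where(F.col("A") == "Case"))
begin.show(truncate=False)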

How to perform merge operation on spark Dataframe?

I have spark dataframe mainDF and deltaDF both with a matching schema.
Content of the mainDF is as follows:
id | name | age
1 | abc | 23
2 | xyz | 34
3 | pqr | 45
Content of deltaDF is as follows:
id | name | age
1 | lmn | 56
4 | efg | 37
I want to merge deltaDF with mainDF based on value of id. So if my id already exists in mainDF then the record should be updated and if id doesn't exist then the new record should be added. So the resulting data frame should be like this:
id | name | age
1 | lmn | 56
2 | xyz | 34
3 | pqr | 45
4 | efg | 37
This is my current code and it is working:
val updatedDF = mainDF.as("main").join(deltaDF.as("delta"),$"main.id" === $"delta.id","inner").select($"main.id",$"main.name",$"main.age")
mainDF= mainDF.except(updateDF).unionAll(deltaDF)
However, here I need to explicitly provide the list of columns again in the select, which feels like overhead to me. Is there any better/cleaner approach to achieve the same?
If you don't want to provide the list of columns explicitly, you can map over the original DF's columns, something like:
.select(mainDF.columns.map(c => $"main.$c" as c): _*)
BTW you can do this without a union after the join: you can use an outer join to also keep the records that exist in only one of the DFs, and then use coalesce to "choose" the non-null value, preferring deltaDF's values. So the complete solution would be something like:
val updatedDF = mainDF.as("main")
.join(deltaDF.as("delta"), $"main.id" === $"delta.id", "outer")
.select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
updatedDF.show
// +---+----+---+
// | id|name|age|
// +---+----+---+
// | 1| lmn| 56|
// | 3| pqr| 45|
// | 4| efg| 37|
// | 2| xyz| 34|
// +---+----+---+
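In case it helps, a PySpark rendering of the same outer-join-plus-coalesce idea (a sketch, assuming mainDF and deltaDF as defined in the question):
from pyspark.sql import functions as F

# Outer join keeps ids from both sides; coalesce prefers delta's value when present.
updatedDF = (
    mainDF.alias("main")
    .join(deltaDF.alias("delta"), F.col("main.id") == F.col("delta.id"), "outer")
    .select([F.coalesce(F.col("delta." + c), F.col("main." + c)).alias(c)
             for c in mainDF.columns])
)
updatedDF.show()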
You can achieve this by using dropDuplicates and specifying which column(s) you don't want any duplicates on.
Here's working code:
val a = (1,"lmn",56)::(2,"abc",23)::(3,"pqr",45)::Nil
val b = (1,"opq",12)::(5,"dfg",78)::Nil
val df1 = sc.parallelize(a).toDF
val df2 = sc.parallelize(b).toDF
df1.unionAll(df2).dropDuplicates("_1"::Nil).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1|lmn| 56|
| 2|abc| 23|
| 3|pqr| 45|
| 5|dfg| 78|
+---+---+---+
Another way of doing it, a PySpark implementation (using an outer join so that ids that only exist in deltaDF are kept as well):
from pyspark.sql import functions as F

updatedDF = mainDF.alias("main").join(deltaDF.alias("delta"), F.col("main.id") == F.col("delta.id"), "outer")
# Rows present in deltaDF (updates and brand-new ids) take delta's values...
upsertDF = updatedDF.where("delta.id IS NOT NULL").select("delta.*")
# ...while rows that only exist in mainDF are kept unchanged.
unchangedDF = updatedDF.where("delta.id IS NULL").select("main.*")
finalDF = upsertDF.union(unchangedDF)