I am trying to update an existing column in PySpark, but it seems like the old values in the column are getting overwritten as well, despite there being no otherwise condition.
+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"], ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
df = df.withColumn('FLAG', F.when( (df.cntry == "MY") & ( (df.cde_1.isin("G") ) | (df.cde_2.isin("G") ) ) & ( (df.rsn_1 == "1") | (df.rsn_2 == "1") ) , 1))
df = df.withColumn('FLAG', F.when( (df.cntry == "TH") & ( (df.cde_1.isin("B", "W") ) | (df.cde_2.isin("B", "W") ) ) & ( (df.rsn_1 == "16") | (df.rsn_2 == "16") ) , 1))
You need to combine your conditions using the Boolean OR. Like this:
df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"], ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
cond1 = (df.cntry == "MY") & ( (df.cde_1.isin("G") ) | (df.cde_2.isin("G") ) ) & ( (df.rsn_1 == "1") | (df.rsn_2 == "1") )
cond2 = (df.cntry == "TH") & ( (df.cde_1.isin("B", "W") ) | (df.cde_2.isin("B", "W") ) ) & ( (df.rsn_1 == "16") | (df.rsn_2 == "16") )
df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()
In your last line, you overwrite the FLAG column, as you’re not referencing its previous state. That’s why the previous values are not preserved.
Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)). That’s a stylistic choice.
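For illustration, a minimal sketch of that nested form, reusing the cond1 and cond2 expressions defined above (rows matching neither condition still end up with a null FLAG):
df.withColumn("FLAG", F.when(cond1, 1).otherwise(F.when(cond2, 1))).show()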
I have a dataset that has null values
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   0|null|
|   1|null|   0|
|null|   1|   0|
|   1|   0|   0|
|   1|   0|   0|
|null|   0|   1|
|   1|   1|   0|
|   1|   1|null|
|null|   1|   0|
+----+----+----+
I wrote a function to count the percentage of null values in each column of the dataset and remove those columns from the dataset. Below is the function:
import pyspark.sql.functions as F

def calc_null_percent(df, strength=None):
    if strength is None:
        strength = 80
    total_count = df.count()
    null_cols = []
    df2 = df.select([F.count(F.when(F.col(c).contains('None') |
                                    F.col(c).contains('NULL') |
                                    (F.col(c) == '') |
                                    F.col(c).isNull() |
                                    F.isnan(c), c
                                    )).alias(c)
                     for c in df.columns])
    for i in df2.columns:
        get_null_val = df2.first()[i]
        if (get_null_val / total_count) * 100 > strength:
            null_cols.append(i)
    df = df.drop(*null_cols)
    return df
I am using a for loop to get the columns based on the condition. Can we use map, or is there any other way to optimise the for loop in PySpark?
Here's a way to do it with list comprehension.
import pyspark.sql.functions as func

data_ls = [
    (1, 0, 'blah'),
    (0, None, 'None'),
    (None, 1, 'NULL'),
    (1, None, None)
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id1', 'id2', 'id3'])
# +----+----+----+
# | id1| id2| id3|
# +----+----+----+
# |   1|   0|blah|
# |   0|null|None|
# |null|   1|NULL|
# |   1|null|null|
# +----+----+----+
Now, calculate the percentage of nulls in a dataframe and collect() it for further use.
# total row count
tot_count = data_sdf.count()
# percentage of null records per column
data_null_perc_sdf = data_sdf. \
select(*[(func.sum((func.col(k).isNull() | (func.upper(k).isin(['NONE', 'NULL']))).cast('int')) / tot_count).alias(k+'_nulls_perc') for k in data_sdf.columns])
# +--------------+--------------+--------------+
# |id1_nulls_perc|id2_nulls_perc|id3_nulls_perc|
# +--------------+--------------+--------------+
# |          0.25|           0.5|          0.75|
# +--------------+--------------+--------------+
# collection of the dataframe for list comprehension
data_null_perc = data_null_perc_sdf.collect()
# [Row(id1_nulls_perc=0.25, id2_nulls_perc=0.5, id3_nulls_perc=0.75)]
threshold = 0.5
# columns of `data_sdf` whose share of null records is at or above the aforementioned threshold; these will be dropped
cols2drop = [k for k in data_sdf.columns if data_null_perc[0][k+'_nulls_perc'] >= threshold]
# ['id2', 'id3']
Use the cols2drop variable to drop the columns from data_sdf in the next step:
new_data_sdf = data_sdf.drop(*cols2drop)
# +----+
# | id1|
# +----+
# |   1|
# |   0|
# |null|
# |   1|
# +----+
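If you prefer to keep the question's function-style interface, here's a rough sketch that wraps the same single-select idea behind a strength threshold; the helper name drop_mostly_null_cols is made up for this sketch, and it keeps the question's strict > comparison:
import pyspark.sql.functions as func

def drop_mostly_null_cols(sdf, strength=80):
    # fraction of null-like records per column, computed in one select
    tot_count = sdf.count()
    null_perc = sdf.select(
        *[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(k)
          for k in sdf.columns]
    ).collect()[0]
    # drop every column whose null percentage exceeds the threshold
    cols2drop = [k for k in sdf.columns if null_perc[k] * 100 > strength]
    return sdf.drop(*cols2drop)

new_data_sdf = drop_mostly_null_cols(data_sdf, strength=40)  # drops id2 and id3 for the sample data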
There is a dataframe that contains two columns: one is the key and the other is the value, like below:
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|                   a|               abcde|
+--------------------+--------------------+
I want to slice the value into multiple values with their positions and generate a new dataframe keyed by the key, like below:
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|                   a|              [a, 0]|
|                   a|              [b, 1]|
|                   a|              [c, 2]|
|                   a|              [d, 3]|
|                   a|              [e, 4]|
+--------------------+--------------------+
I have tried to use join() and StructType() but failed. Is there any possible method to do that? Thanks!
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a","abcd",3000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (value, index)") , f.col("Name")).show()
+-----+-----+----+
|value|index|Name|
+-----+-----+----+
|    0|    a|   a|
|    1|    b|   a|
|    2|    c|   a|
|    3|    d|   a|
+-----+-----+----+
# to explain the above functions more clearly:
f.expr( "..." )                  # --> write a SQL expression
"posexplode( [..] )"             # for an array, turn its elements into rows together with their index
"filter( [..] , x -> x != '' )"  # for an array, filter out '' elements
"split( Department, '' )"        # split the column on the empty string (extracting the characters), which adds an empty string to our array that we then need to filter out
Here's the update to fit your exact request; just a little more manipulation to put it into your required format:
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (myvalue, index)") , f.col("Name"),f.expr( "array(myvalue,index) as value ")).drop("index","myvalue").show()
+----+------+
|Name| value|
+----+------+
|   a|[0, a]|
|   a|[1, b]|
|   a|[2, c]|
|   a|[3, d]|
+----+------+
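If you'd rather have the character first and its position second, matching the [a, 0] layout asked for, I believe you can simply swap the arguments to array (remember that, with the aliases above, myvalue holds the position and index holds the character):
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (myvalue, index)") , f.col("Name"), f.expr( "array(index, myvalue) as value ")).drop("index","myvalue").show()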
The following snippet transforms your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_split.show()
df_exploded = df_split.select("key", F.posexplode("split"))
df_exploded.show()
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
df_array.show()
Output:
+---+-----+---------------+
|key|value|          split|
+---+-----+---------------+
|  a|abcde|[a, b, c, d, e]|
+---+-----+---------------+
+---+---+---+
|key|pos|col|
+---+---+---+
|  a|  0|  a|
|  a|  1|  b|
|  a|  2|  c|
|  a|  3|  d|
|  a|  4|  e|
+---+---+---+
+---+------+
|key| value|
+---+------+
|  a|[a, 0]|
|  a|[b, 1]|
|  a|[c, 2]|
|  a|[d, 3]|
|  a|[e, 4]|
+---+------+
First, the string is split into an array, where the split pattern is the empty string; this leaves a trailing empty element that needs to be removed.
Then each element of the split column array is turned into a row, with its position in the array in the column pos.
Lastly, the columns are combined into an array.
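The intermediate dataframes are only there to show each step; a sketch of the same logic chained into a single expression, assuming the same df as above:
df_array = (
    df
    .withColumn("split", F.array_remove(F.split("value", ""), ""))
    .select("key", F.posexplode("split"))
    .select("key", F.array("col", "pos").alias("value"))
)
df_array.show()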
I have the following simple join:
df_join = (df1.join(df2, on=['key'], how='left').select(df1['key'], df2['key']))
It returns the following error: Attribute(s) with the same name appear in the operation.
I understand I can't display two columns with the same name, but how do I do it in this case?
You can give an alias to either one of them:
df_join = df1.join(df2, on=['key'], how='left')\
.select(
df1['key'], df2['key'].alias('right_key')
)
Example -
input_str1 = """
|a|100
|b|100
|c|100
""".split("|")
input_values1 = list(map(lambda x:x.strip(),input_str1))[1:]
input_list1 = [(x,y) for x,y in zip(input_values1[0::2],input_values1[1::2])]
sparkDF1 = sql.createDataFrame(input_list1,['id','value'])
input_str2 = """
|a|20 |2020-01-02 01:30
|a|50 |2020-01-02 05:30
|b|50 |2020-01-15 07:30
|b|80 |2020-02-01 09:30
|c|50 |2020-02-01 09:30
""".split("|")
input_values2 = list(map(lambda x:x.strip(),input_str2))[1:]
input_list2 = [(x,y,z) for x,y,z in zip(input_values2[0::3],input_values2[1::3],input_values2[2::3])]
sparkDF2 = sql.createDataFrame(input_list2,['id','value','timestamp'])
finalDF = (sparkDF1.join(sparkDF2
                         ,sparkDF1['id'] == sparkDF2['id']
                         ,'inner'
                        ).select(sparkDF1["*"],sparkDF2['id'].alias('id_right')))
finalDF.show()
+---+-----+--------+
| id|value|id_right|
+---+-----+--------+
|  c|  100|       c|
|  b|  100|       b|
|  b|  100|       b|
|  a|  100|       a|
|  a|  100|       a|
+---+-----+--------+
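Another option, if you'd rather not keep references to both DataFrames in the select, is to rename one of the key columns before joining; a rough sketch (the name right_key is just an example):
df2_renamed = df2.withColumnRenamed('key', 'right_key')
df_join = df1.join(df2_renamed, df1['key'] == df2_renamed['right_key'], 'left') \
             .select(df1['key'], df2_renamed['right_key'])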
I want to replace existing columns with new values on a condition: if the value of another column equals ABC, the column stays the same; otherwise it should give null or blank.
It's giving results as per the logic, but only for the last column it encounters in the loop.
The query I'm using:
import pyspark.sql.functions as F

for i in df.columns:
    if i[4:] != 'ff':
        new_df = df.withColumn(i, F.when(df.col_ff == "abc", df[i])
                                   .otherwise(None))
df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a    | a  | d   | abc   |
| a    | b  | c   | def   |
| b    | c  | b   | abc   |
| c    | d  | a   | def   |
+------+----+-----+-------+
required output:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a    | a  | d   | abc   |
| null |null|null | def   |
| b    | c  | b   | abc   |
| null |null|null | def   |
+------+----+-----+-------+
The problem in your code is that you're overwriting new_df with the original DataFrame df in each iteration of the loop. You can fix it by first setting new_df = df outside of the loop, and then performing the withColumn operations on new_df inside the loop.
For example, if df were the following:
df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#|   a|   a|   d|   abc|
#|   a|   b|   c|   def|
#|   b|   c|   b|   abc|
#|   c|   d|   a|   def|
#+----+----+----+------+
Change your code to:
import pyspark.sql.functions as F

new_df = df
for i in df.columns:
    if i[4:] != 'ff':
        new_df = new_df.withColumn(i, F.when(F.col("col_ff") == "abc", F.col(i)))
Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.
You could also do the same using functools.reduce:
from functools import reduce # for python3
new_df = reduce(
lambda df, i: df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i))),
[i for i in df.columns if i[4:] != "ff"],
df
)
In both cases the result is the same:
new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#|   a|   a|   d|   abc|
#|null|null|null|   def|
#|   b|   c|   b|   abc|
#|null|null|null|   def|
#+----+----+----+------+
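A third variant, if you'd rather avoid both the loop and reduce, is a single select with a list comprehension over the columns; a minimal sketch under the same naming assumption (any column not ending in ff is nulled out when col_ff != "abc"):
new_df = df.select([
    F.col(i) if i[4:] == "ff"
    else F.when(F.col("col_ff") == "abc", F.col(i)).alias(i)
    for i in df.columns
])
new_df.show()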
I have imported data into a Spark dataframe in spark-shell. The data in it looks like:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
Here in Col4, the values are of different kinds and I want to separate them (suppose "a|b" is of type alphabets, "1" or "2" is of type digits, and "0xFFFFFF" or "0xFFF45B" is of type hexadecimal number).
So, the output should be :
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
Hope I've made my query clear to you; I am using spark-shell. Thanks in advance.
Edit, after getting this answer about how to use a backreference in regexp_replace:
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution.
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
.withColumn("backrefReplace",
split(regexp_replace('Col4,regExStr,"$1;$2;$3"),";"))
.select('Col1,'Col2,'Col3,
explode(split('backrefReplace(0),"\\|")).as("letter"),
'backrefReplace(1) .as("digits"),
'backrefReplace(2) .as("hexadecimal")
)
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
|  A1|  11|  B2|     a|     1|   0xFFFFFF|
|  A1|  11|  B2|     b|     1|   0xFFFFFF|
|  A1|  12|  B1|      |     2|           |
|  A2|  12|  B2|      |      |   0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace empty strings with null, though...
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
val res = df
.withColumn("alphabets", regexp_extract('Col4,"(^[A-z|]+)?",1))
.withColumn("digits", regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",2))
.withColumn("hexadecimal",regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",3))
.withColumn("letter",
explode(
split(
coalesce('alphabets,lit("")),
"\\|"
)
)
)
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3|          Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
|  A1|  11|  B2|a|b;1;0xFFFFFF|      a|b|     1|   0xFFFFFF|     a|
|  A1|  11|  B2|a|b;1;0xFFFFFF|      a|b|     1|   0xFFFFFF|     b|
|  A1|  12|  B1|             2|     null|     2|       null|      |
|  A2|  12|  B2|      0xFFF45B|     null|  null|   0xFFF45B|      |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
Not sure this is doable while staying 100% with DataFrames; here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
val indexToValue = r.getAs[String]("Col4").split(';').map {
case s if s.startsWith("0x") => 2 -> s
case s if s.matches("\\d+") => 1 -> s
case s => 0 -> s
}
val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
case (arr, (index, value)) => arr.updated(index, value)
}
(r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1  |11  |B2  |a        |1     |0xFFFFFF   |
// |A1  |11  |B2  |b        |1     |0xFFFFFF   |
// |A1  |12  |B1  |         |2     |           |
// |A2  |12  |B2  |         |      |0xFFF45B   |
// +----+----+----+---------+------+-----------+