Modify a string column and replace a substring - pyspark

I have a pyspark dataframe with a Name column with sample values as follows:
id NAME
---+-------
1 aaa bb c
2 xx yy z
3 abc def
4 qw er
5 jon lee ls G
I have to take the right-most part (split on spaces), move it to the front followed by a comma, and remove it from the end.
Expected output
id NAME
---+-------
1 c, aaa bb
2 z, xx yy
3 def, abc
4 er, qw
5 G, jon lee ls
I was able to extract the right-most part (to later prepend it with a comma) by using the code below:
split_col = F.split(df['NAME'], ' ')
df2 = df.withColumn('NAME_RIGHT', split_col.getItem(F.size(split_col) - 1))
the above line gives
NAME_RIGHT
c
z
def
er
I want to remove the NAME_RIGHT values (i.e. the right-most part) from the NAME column. I tried the code below, but it replaces nothing. How can this be achieved?
df3 = df2.withColumn('NEW_NAME', regexp_replace(F.col("NAME"), str(df2.NAME_RIGHT),""))

Regex would be a bit cumbersome here; I'd suggest using split and concat instead.
from pyspark.sql import functions as F

(df
    .withColumn('n1', F.split('name', ' '))
    .withColumn('n2', F.reverse('n1')[0])
    .withColumn('n3', F.concat_ws(' ', F.array_except('n1', F.array('n2'))))
    .withColumn('n4', F.concat_ws(', ', F.array('n2', 'n3')))
    .show()
)
# +---+------------+-----------------+---+----------+-------------+
# | id| name| n1| n2| n3| n4|
# +---+------------+-----------------+---+----------+-------------+
# | 1| aaa bb c| [aaa, bb, c]| c| aaa bb| c, aaa bb|
# | 2| xx yy z| [xx, yy, z]| z| xx yy| z, xx yy|
# | 3| abc def | [abc, def, ]| | abc def| , abc def|
# | 4| qw er| [qw, er]| er| qw| er, qw|
# | 5|jon lee ls G|[jon, lee, ls, G]| G|jon lee ls|G, jon lee ls|
# +---+------------+-----------------+---+----------+-------------+
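As for why the original regexp_replace attempt replaces nothing: str(df2.NAME_RIGHT) is just the textual representation of a Column object (something like Column<'NAME_RIGHT'>), not the per-row value, so the pattern never matches anything. If you do prefer the regex route, one possible fix (a sketch, assuming df2 already has the NAME and NAME_RIGHT columns built above) is to build the pattern per row inside a SQL expression:
from pyspark.sql import functions as F

# Sketch: use the NAME_RIGHT column itself as the regex pattern via expr();
# anchoring with '$' strips only the trailing occurrence of the last token.
df3 = df2.withColumn(
    'NEW_NAME',
    F.concat_ws(
        ', ',
        F.col('NAME_RIGHT'),
        F.trim(F.expr("regexp_replace(NAME, concat(NAME_RIGHT, '$'), '')"))
    )
)
Keep in mind that the last token is treated as a regular expression here, so it would need escaping if it could contain special characters; the split/concat approach above avoids that concern.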

Related

Assign values to a new column depending on old column values in a dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried the lit function in .withColumn with conf.getInt($name), but lit does not accept a Column; it requires a string, so I would have to hardcode the variable names in lit. Is there any way to dynamically resolve those conf variable names in lit so it can automatically assign the values to another column in Spark Scala?
At the moment I don't have an idea how to do it as you intended, with dynamic use of the val names.
My proposal is to use a Seq of tuples instead of multiple vals. In that case you could create a UDF and map the value for each row, but you can also use a join, which I show in the example below:
import org.apache.spark.sql.functions.broadcast

val data = Seq(("A"), ("C"), ("B"), ("D"), ("B"))
val df = data.toDF("name")

val mappings = Seq(("A", 1), ("B", 2), ("C", 3), ("D", 4))
val mappingsDf = mappings.toDF("name", "value")

df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
  .select(
    df("name"),
    mappingsDf("value")
  ).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic: the mappings are a DataFrame here, so you can hardcode them as shown in my example or easily load them from a CSV or JSON file with the Spark API.
Thanks to the broadcast join it should be quite efficient (remove this hint if you have a very large number of mappings!).
I think it is easy to understand and maintain, as it uses only the Spark API and no UDF.

How can I search in a list of items and extract some keywords based on elements of 3 lists?

Suppose I have 3 lists and a column which is an array, and I want to search it to extract the elements of those 3 lists.
Dataframe:
id desxription
1 ['this is bad', 'summerfull']
2 ['city tehran, country iran']
3 ['uA is a country', 'winternice']
5 ['this, is, summer']
6 ['this is winter','uAsal']
7 ['this is canada' ,'great']
8 ['this is toronto']
Lists:
L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']
Now I want to make a new column for each list (L1, L2, L3), then search for each element of the list in the desxription column. If the row contains an element of the list, extract it into the new column, otherwise NA:
Note: I want only exact matches to be extracted; for example, summerfull should not be matched by summer.
id desxription L1 L2 L3
1 ['this is bad', 'summerfull'] NA NA NA
2 ['city tehran, country iran'] NA iran tehran
3 ['uA is a country', 'winternice'] NA uA NA
5 ['this, is, summer'] summer NA NA
6 ['this is winter','uAsal'] winter NA NA
7 ['this is canada' ,'great'] NA NA canada
8 ['this is toronto'] NA NA toronto
Annotated code
from pyspark.sql import functions as F

# Create a dictionary of the lists (df, L1, L2, L3 as defined in the question)
L = {'L1': L1, 'L2': L2, 'L3': L3}

for key, vals in L.items():
    # regex pattern for extracting vals (word boundaries enforce exact matches)
    pat = r'\\b(%s)\\b' % '|'.join(vals)
    # extract matching occurrences
    col = F.expr("regexp_extract_all(array_join(desxription, ' '), '%s')" % pat)
    # Mask the rows with null when there are no matches
    df = df.withColumn(key, F.when(F.size(col) == 0, None).otherwise(col))
>>> df.show()
+---+--------------------+--------+------+---------+
| id| desxription| L1| L2| L3|
+---+--------------------+--------+------+---------+
| 1|[this is bad, sum...| null| null| null|
| 2|[city tehran, cou...| null|[iran]| [tehran]|
| 3|[uA is a country,...| null| [uA]| null|
| 5| [this, is, summer]|[summer]| null| null|
| 6|[this is winter, ...|[winter]| null| null|
| 7|[this is canada, ...| null| null| [canada]|
| 8| [this is toronto]| null| null|[toronto]|
+---+--------------------+--------+------+---------+
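The expected output shows scalar values (e.g. iran rather than [iran]). If that's what you need, a small follow-up is to take the first element of each result array (element_at leaves the null rows as null):
# keep only the first match per list column
for key in L:
    df = df.withColumn(key, F.element_at(F.col(key), 1))
Also note that regexp_extract_all is a Spark SQL function that, as far as I recall, requires Spark 3.1 or later.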

Advanced join of two dataframes in Spark Scala

I have to join two Dataframes.
Sample:
Dataframe1 looks like this
df1_col1 df1_col2
a ex1
b ex4
c ex2
d ex6
e ex3
Dataframe2
df2_col1 df2_col2
1 a,b,c
2 d,c,e
3 a,e,c
In result Dataframe I would like to get result like this
res_col1 res_col2 res_col3
a ex1 1
a ex1 3
b ex4 1
c ex2 1
c ex2 2
c ex2 3
d ex6 2
e ex3 2
e ex3 3
What would be the best way to achieve this join?
I have updated the question with my attempt below:
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF
val df2 = sc.parallelize(Seq(("1","a,b,c"),("2","d,c,e"),("3","a,e,c"))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ",")))
  .select($"_1".as("df2_col1"), $"df2_col2_explode")
  .join(df1.select($"_1".as("df1_col1"), $"_2".as("df1_col2")), $"df1_col1" === $"df2_col2_explode", "inner")
  .show
You just need to split the values, generate multiple rows by exploding them, and then join with the other dataframe.
You can refer to this question: How to split pipe-separated column into multiple rows?
I used Spark SQL for this join; here is part of the code:
df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select
| b.df1_col1 as res_col1,
| b.df1_col2 as res_col2,
| a.df2_col1 as res_col3
| from (select df2_col1, exp_col
| from temp_v_df2
| lateral view explode(split(df2_col2,",")) dummy as exp_col) a
| join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)
I used the Spark Scala DataFrame API to achieve your desired output.
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2")
val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2")
df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp")
  .join(df1, $"temp._tmp" === df1("df1_col1"), "inner")
  .drop("_tmp", "df2_col2")
  .show
Desired output:
+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
| 2| e| ex3|
| 3| e| ex3|
| 2| d| ex6|
| 1| c| ex2|
| 2| c| ex2|
| 3| c| ex2|
| 1| b| ex4|
| 1| a| ex1|
| 3| a| ex1|
+--------+--------+--------+
Rename the columns according to your requirements.
Happy Hadoooooooooooooooppppppppppppppppppp

Update data from two Data Frames Scala-Spark

I have two Data Frames:
DF1:
ID | Col1 | Col2
1 a aa
2 b bb
3 c cc
DF2:
ID | Col1 | Col2
1 ab aa
2 b bba
4 d dd
How can I join these two DFs so that the result is:
Result:
1 ab aa
2 b bba
3 c cc
4 d dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
  .select($"ID",
    when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
    when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
  .orderBy("ID")
And it works, but I don't want to specify each column, because I have large files.
So, is there any way to update the dataframe (and to add records if the second DF has new ones) without specifying each column?
A simple leftanti join of df1 with df2, followed by a union of the result into df2, should get you the desired output:
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1 |ab |aa |
|2 |b |bba |
|3 |c |cc |
|4 |d |dd |
+---+----+----+
The solution doesn't match the logic you have in your code, but it generates the expected result.

Merge two tables in Scala/Spark

I have two tab-separated data files like below:
file 1:
number type data_present
1 a yes
2 b no
file 2:
type group number recorded
d aa 10 true
c cc 20 false
I want to merge these two files so that output file looks like below:
number type data_present group recorded
1 a yes NULL NULL
2 b no NULL NULL
10 d NULL aa true
20 c NULL cc false
As you can see, for columns which are not present in the other file, I'm filling those places with NULL.
Any ideas on how to do this in Scala/Spark?
Create two files for your data set:
$ cat file1.csv
number type data_present
1 a yes
2 b no
$ cat file2.csv
type group number recorded
d aa 10 true
c cc 20 false
Convert them to CSV:
$ sed -e 's/^[ \t]*//' file1.csv | tr -s ' ' | tr ' ' ',' > f1.csv
$ sed -e 's/^[ ]*//' file2.csv | tr -s ' ' | tr ' ' ',' > f2.csv
Use the spark-csv module to load the CSV files as dataframes:
$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f1.csv", "header" -> "true"))
val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f2.csv", "header" -> "true"))
Now perform joins:
scala> df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer").show()
+------+----+------------+----+-----+------+--------+
|number|type|data_present|type|group|number|recorded|
+------+----+------------+----+-----+------+--------+
| 1| a| yes|null| null| null| null|
| 2| b| no|null| null| null| null|
| null|null| null| d| aa| 10| true|
| null|null| null| c| cc| 20| false|
+------+----+------------+----+-----+------+--------+
This will give you the desired output:
val output = file1.join(file2, Seq("number","type"), "outer")
Simply convert all columns to String, then do a union of the two DFs.