Text file comparison using Spark data frames

Text file comparison using Spark data frames - pyspark

I would like to implement below requirement using Spark dataframes to compare 2 text/csv
List item
files. Ideally, File1.txt should compare with File2.txt and result should be in other txt file with flag as (SAME/UPDATE/INSERT/DELETE).
UPDATE - if any record values are updated in file2 when compared to file1
INSERT - if a new record exist in file2
DELETE - only if the record exist in file1 (not in file2)
SAME - if same record exist in both files
File1.txt
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 600
3 HR GOPI 1500
5 HW MAHI 700
File2.txt
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 900
4 MT SUMP 1200
5 HW MAHI 700
Outputfile.txt
NO DEPT NAME SAL FLAG
1 IT RAM 1000 S
2 IT SRI 900 U
4 MT SUMP 1200 I
5 HW MAHI 700 S
3 HR GOPI 1500 D
So far, i did below coding. But not able to proceed further. Pls help.
from pyspark.shell import spark
sc = spark.sparkContext
df1 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\file1.csv")
df2 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\file2.csv")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
sqlDF1 = spark.sql( "select * from table1" )
sqlDF2 = spark.sql( "select * from table2" )
leftJoinDF = sqlDF1.join(sqlDF2, 'id', how='left')
rightJoinDF = sqlDF1.join(sqlDF2, 'id', how='right')
innerJoinDF = sqlDF1.join(sqlDF2, 'id')
Is there any way if we merge the data, after performing leftJoin, rightJoin, innerJoin. With this whether i could get desired output or any other way.
Thanks,

You can find my solution below. I create 4 dataframe for SAME/UPDATE/INSERT/DELETE cases and then union them
>>> from functools import reduce
>>> from pyspark.sql import DataFrame
>>> import pyspark.sql.functions as F
>>> df1 = sc.parallelize([
... (1,'IT','RAM',1000),
... (2,'IT','SRI',600),
... (3,'HR','GOPI',1500),
... (5,'HW','MAHI',700)
... ]).toDF(['NO','DEPT','NAME','SAL'])
>>> df1.show()
+---+----+----+----+
| NO|DEPT|NAME| SAL|
+---+----+----+----+
| 1| IT| RAM|1000|
| 2| IT| SRI| 600|
| 3| HR|GOPI|1500|
| 5| HW|MAHI| 700|
+---+----+----+----+
>>> df2 = sc.parallelize([
... (1,'IT','RAM',1000),
... (2,'IT','SRI',900),
... (4,'MT','SUMP',1200),
... (5,'HW','MAHI',700)
... ]).toDF(['NO','DEPT','NAME','SAL'])
>>> df2.show()
+---+----+----+----+
| NO|DEPT|NAME| SAL|
+---+----+----+----+
| 1| IT| RAM|1000|
| 2| IT| SRI| 900|
| 4| MT|SUMP|1200|
| 5| HW|MAHI| 700|
+---+----+----+----+
#DELETE
>>> df_d = df1.join(df2, df1.NO == df2.NO, 'left').filter(F.isnull(df2.NO)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('D').alias('FLAG'))
#INSERT
>>> df_i = df1.join(df2, df1.NO == df2.NO, 'right').filter(F.isnull(df1.NO)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('I').alias('FLAG'))
#SAME/
>>> df_s = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) == F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).\
... select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('S').alias('FLAG'))
#UPDATE
>>> df_u = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) != F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).\
... select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('U').alias('FLAG'))
>>> dfs = [df_s,df_u,df_u,df_i]
>>> df = reduce(DataFrame.unionAll, dfs)
>>>
>>> df.show()
+---+----+----+----+----+
| NO|DEPT|NAME| SAL|FLAG|
+---+----+----+----+----+
| 5| HW|MAHI| 700| S|
| 1| IT| RAM|1000| S|
| 2| IT| SRI| 900| U|
| 2| IT| SRI| 900| U|
| 4| MT|SUMP|1200| I|
+---+----+----+----+----+

You can use 'outer' join after concatenating all columns first. Then create an udf for flags.
import pyspark.sql.functions as F
df = sql.createDataFrame([
(1,'IT','RAM',1000),
(2,'IT','SRI',600),
(3,'HR','GOPI',1500),
(5,'HW','MAHI',700)],
['NO' ,'DEPT', 'NAME', 'SAL' ])
df1 = sql.createDataFrame([
(1,'IT','RAM',1000),
(2,'IT','SRI',900),
(4,'MT','SUMP',1200 ),
(5,'HW','MAHI',700)],
['NO' ,'DEPT', 'NAME', 'SAL' ])
def flags(x,y):
if not x:
return y+'-I'
if not y:
return x+'-D'
if x == y:
return x+'-S'
return y+'-U'
_cols = df.columns
flag_udf = F.udf(lambda x,y: flags(x,y),StringType())
df = df.select(['NO']+ [F.concat_ws('-', *[F.col(_c) for _c in df.columns]).alias('f1')])\
.join(df1.select(['NO']+ [F.concat_ws('-', *[F.col(_c1) for _c1 in df1.columns]).alias('f2')]), 'NO', 'outer')\
.select(flag_udf('f1','f2').alias('combined'))
df.show()
The result will be,
+----------------+
| combined|
+----------------+
| 5-HW-MAHI-700-S|
| 1-IT-RAM-1000-S|
|3-HR-GOPI-1500-D|
| 2-IT-SRI-900-U|
|4-MT-SUMP-1200-I|
+----------------+
Finally, split the combined column.
split_col = F.split(df['combined'], '-')
df = df.select([split_col.getItem(i).alias(s) for i,s in enumerate(_cols+['FLAG'])])
df.show()
You get the desired output,
+---+----+----+----+----+
| NO|DEPT|NAME| SAL|FLAG|
+---+----+----+----+----+
| 5| HW|MAHI| 700| S|
| 1| IT| RAM|1000| S|
| 3| HR|GOPI|1500| D|
| 2| IT| SRI| 900| U|
| 4| MT|SUMP|1200| I|
+---+----+----+----+----+

Related

filter record in dataframe base on list of value

I have below scenario.
li = ['g1','g2','g3']
df1 = id name goal
1 raj g1
2 harsh g3/g1
3 ramu g1
Above as you can see dataframe df1 and list li
i wanted to filter record in df1 base on list values of li but you can see in goal column first we need to split value base of / del but getting error
df1 = df1.filter(~df1.goal.isin(li))
but this is not returning any record...
is there any way to get record

Using this exemple:
from pyspark.sql import functions as F
from pyspark.sql.types import *
li = ['g1','g2','g3']
df1 = spark.createDataFrame(
[
('1','raj','g1'),
('2','harsh','g3/g1'),
('3','ramu','g1'),
('4','luiz','g2/g4')
],
["id", "name", "goal"]
)
df1.show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# | 4| luiz|g2/g4|
# +---+-----+-----+
You can use split to split the goal column and then array_except to find which records are not in your list:
result = df1\
.withColumn('goal_split', F.split(F.col('goal'), "/"))\
.withColumn('li', F.array([F.lit(x) for x in li]))\
.withColumn("test",F.array_except('goal_split','li'))\
.filter(F.col('test') == F.array([]))\
result.show()
# +---+-----+-----+----------+------------+----+
# | id| name| goal|goal_split| li|test|
# +---+-----+-----+----------+------------+----+
# | 1| raj| g1| [g1]|[g1, g2, g3]| []|
# | 2|harsh|g3/g1| [g3, g1]|[g1, g2, g3]| []|
# | 3| ramu| g1| [g1]|[g1, g2, g3]| []|
# +---+-----+-----+----------+------------+----+
Than, select the columns you want for the result:
result.select('id', 'name', 'goal').show().
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# +---+-----+-----+

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.

You can find the maximum value from an array of columns and loop your dataframe to replace the normalized column value.
cols = df.columns[1:]
import builtins as p
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols]))) \
for c in cols:
df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+

Merge many dataframes into one in Pyspark [non pandas df]

I will be getting dataframes generated one by one through a process. I have to merge them into one.
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Earl | 32|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Jane | 15|
+--------+----------+
Finally:
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
|Earl | 32|
+--------+----------+
|Jane | 15|
+--------+----------+
Tried many options like concat, merge, append but all are I guess pandas libraries. I am not using pandas. Using version python 2.7 and Spark 2.2
Edited to cover final scenario with foreachpartition:
l = [('Alex', 30)]
k = [('Earl', 32)]
ldf = spark.createDataFrame(l, ('Name', 'Age'))
ldf = spark.createDataFrame(k, ('Name', 'Age'))
# option 1:
union_df(ldf).show()
#option 2:
uxdf = union_df(ldf)
uxdf.show()
output in both cases:
+-------+---+
| Name|Age|
+-------+---+
|Earl | 32|
+-------+---+

You can use unionAll() for dataframes:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
return reduce(DataFrame.union, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## | k| v|
## +---+----+
## | 1|foo1|
## | 2|bar1|
## | 3|foo2|
## | 4|bar2|
## | 5|foo3|
## | 6|bar3|
## +---+----+
EDIT:
You can create an empty dataframe, and keep doing a union to it:
# Create first dataframe
ldf = spark.createDataFrame(l, ["Name", "Age"])
ldf.show()
# Save it's schema
schema = ldf.schema
# Create an empty DF with the same schema, (you need to provide schema to create empty dataframe)
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
empty_df.show()
# Union the first DF with the empty df
empty_df = empty_df.union(ldf)
empty_df.show()
# New dataframe after some operations
ldf = spark.createDataFrame(k, schema)
# Union with the empty_df again
empty_df = empty_df.union(ldf)
empty_df.show()
# First DF ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# Empty dataframe empty_df
+----+---+
|Name|Age|
+----+---+
+----+---+
# After first union empty_df.union(ldf)
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# After second union with new ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
|Earl| 32|
+----+---+

HI,Could you please help me resolving Issue while creating new column in Pyspark: I explained the issue as below:

query I'm using:
I want to replace existing columns with new values on condition, if value of another col = ABC then column remain same otherwise should give null or blank.
It's giving result as per logic but only for last column it encounters in loop.
import pyspark.sql.functions as F
for i in df.columns:
if i[4:]!='ff':
new_df=df.withColumn(i,F.when(df.col_ff=="abc",df[i])\
.otherwise(None))
df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| a | b | c | def |
| b | c | b | abc |
| c | d | a | def |
+------+----+-----+-------+
required output:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| null |null|null | def |
| b | c | b | abc |
| null |null|null | def |
+------+----+-----+-------+

The problem in your code is that you're overwriting new_df with the original DataFrame df in each iteration of the loop. You can fix it by first setting new_df = df outside of the loop, and then performing the withColumn operations on new_df inside the loop.
For example, if df were the following:
df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#| a| b| c| def|
#| b| c| b| abc|
#| c| d| a| def|
#+----+----+----+------+
Change your code to:
import pyspark.sql.functions as F
new_df = df
for i in df.columns:
if i[4:]!='ff':
new_df = new_df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i)))
Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.
You could also do the same using functools.reduce:
from functools import reduce # for python3
new_df = reduce(
lambda df, i: df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i))),
[i for i in df.columns if i[4:] != "ff"],
df
)
In both cases the result is the same:
new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#|null|null|null| def|
#| b| c| b| abc|
#|null|null|null| def|
#+----+----+----+------+

how to apply joins in spark scala when we have multiple values in the join column

I have data in two text files as
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file2(diagnosis code,diagnosis description) Time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
data in file 2 is not fixed and keeps on changing, means at any given point of time diagnosis code y can have diagnosis description as yen and at other point of time it can have diagnosis description as ten. For example below
file2 at Time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read these two files data in spark and want only those patients id who are diagnosed with uen.
it can be done using spark sql or scala both.
I tried to read the file1 in spark-shell. The two columns in file1 are pipe delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But as the diagnosis code related to a diagnosis description is not constant in file2 so will have to use the join condition. But I dont know how to apply joins when the diag_cd column in file1 has multiple values.
any help would be appreciated.

Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show

Convert the table t1 from format1 to format2 using explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we are exploding the particular column above using
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using -
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "ten").select(df1.col("patient_id")).collect()

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Text file comparison using Spark data frames - pyspark

Related

filter record in dataframe base on list of value

Unsure how to apply row-wise normalization on pyspark dataframe

Merge many dataframes into one in Pyspark [non pandas df]

HI,Could you please help me resolving Issue while creating new column in Pyspark: I explained the issue as below:

how to apply joins in spark scala when we have multiple values in the join column

Categories

Resources