Spark For loop for Large Tables using inner join - pyspark

Create a new table from two existing tables A and B. A has one year of history data and B has the data of IDs. I need to join these two tables in Spark so that performance is fine, and also loop over the data for each day or month, since business_day is the partition. I cannot process the entire tables at once, as every business day has about 30 million rows.
Table A has n columns, such as ID, Business_Day, Name.
Table B has n columns, such as ID, ID_Code.
Table A should be joined to table B on ID = ID to get ID_Code along with the other columns of A:
insert into output_table
select ID, ID_CODE, Business_Day, Name
from A, B
where A.ID = B.ID
I am not sure how to write a for loop for the above. The insert script works, but a single day takes 2 hours, and I would need to change the business day manually for a whole year, which is impossible. A loop plus other performance steps would make it run much faster.

Spark SQL Query with Python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import pandas as pd
sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)
# Table A read and spark create dataframe --> df_A
# df_A = sqlContext.createDataFrame(...)
# Table B read and spark create dataframe --> df_B
# df_B = sqlContext.createDataFrame(...)
# Example:
df1 = sqlContext.createDataFrame(
    pd.DataFrame.from_records(
        [
            [1, 12, 'Test'],
            [2, 22, 'RD']
        ],
        columns=['ID', 'ID_CODE', 'Departman']
    ))
df2 = sqlContext.createDataFrame(
    pd.DataFrame.from_records(
        [
            [1, 'friday', 'Shan'],
            [2, 'friday', 'ramazan'],
            [3, 'friday', 'bozkir']
        ],
        columns=['ID', 'Business_Day', 'Name']))
### PySpark DataFrame method
df = df_A.join(df_B, df_B.ID == df_A.ID) \
    .select('ID_CODE', 'Business_Day', 'Name')
### Spark SQL method
df1.registerTempTable('df_A')
df2.registerTempTable('df_B')
df = sqlContext.sql("""
SELECT ID_CODE, Business_Day, Name
FROM (
    SELECT *
    FROM df_A A LEFT JOIN df_B B ON B.ID = A.ID
) df
""")
[In]: df.show()
[Out]:
+-------+------------+-------+
|ID_CODE|Business_Day| Name|
+-------+------------+-------+
| 12| friday| Shan|
| 22| friday|ramazan|
+-------+------------+-------+
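The loop the question asks for can be layered on top of either method. Below is a minimal sketch, under the assumptions that df_A and df_B are the full source DataFrames read above, Business_Day is the partition column, and output_table already exists and is partitioned the same way; it processes one business day per iteration:
# Collect the distinct partition values once, then join one day at a time.
days = [r.Business_Day for r in df_A.select('Business_Day').distinct().collect()]

for day in days:
    daily = (df_A.filter(df_A.Business_Day == day)
                 .join(df_B, df_A.ID == df_B.ID)
                 .select(df_A.ID, df_B.ID_Code, df_A.Business_Day, df_A.Name))
    # Append this day's rows into the existing partitioned table.
    daily.write.insertInto('output_table')
Whether the loop is actually faster than one join over the whole year depends on the cluster; the loop mainly keeps each job's shuffle small and lets a failed day be rerun on its own.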

Related

String aggregation and group by in PySpark

I have a dataset that has Id, Value and Timestamp columns. Id and Value columns are strings. Sample:
Id    Value   Timestamp
Id1   100     1658919600
Id1   200     1658919602
Id1   300     1658919601
Id2   433     1658919677
I want to concatenate Values that belong to the same Id, and order them by Timestamp. E.g. for rows with Id1 the result would look like:
Id    Values
Id1   100;300;200
Some pseudo code would be:
res = SELECT Id,
             STRING_AGG(Value, ";") WITHIN GROUP (ORDER BY Timestamp) AS Values
      FROM table
      GROUP BY Id
Can someone help me write this in Databricks? PySpark and SQL are both fine.
You can collect lists of structs of Timestamp and Value (in that order) for each Id, sort them (sort_array will sort by the first field of the struct, i.e. Timestamp) and combine the Value values into a string using concat_ws.
PySpark (Spark 3.1.2)
import pyspark.sql.functions as F
(df
.groupBy("Id")
.agg(F.expr("concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values"))
).show(truncate=False)
# +---+-----------+
# |Id |Values |
# +---+-----------+
# |Id1|100;300;200|
# |Id2|433 |
# +---+-----------+
In Spark SQL:
SELECT Id, concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values
FROM table
GROUP BY Id
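Both variants above assume a DataFrame df holding the question's sample data, and the SQL version additionally assumes a temp view with the name used in the query. A minimal setup sketch (Value kept as a string, as in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question.
df = spark.createDataFrame(
    [("Id1", "100", 1658919600),
     ("Id1", "200", 1658919602),
     ("Id1", "300", 1658919601),
     ("Id2", "433", 1658919677)],
    ["Id", "Value", "Timestamp"])

# Expose it to the SQL variant under the name it uses.
df.createOrReplaceTempView("table")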
This is a beautiful question!! This is a perfect use case for Fugue which can port Python and Pandas code to PySpark. I think this is something that is hard to express in Spark but easy to express in native Python or Pandas.
Let's just concern ourselves with 1 ID first. For one ID, using pure native Python, it would look like below. Assume the Timestamps are already sorted when this is applied.
import pandas as pd

df = pd.DataFrame({"Id": ["Id1", "Id1", "Id1", "Id2", "Id2", "Id2"],
                   "Value": [100, 200, 300, 433, 500, 600],
                   "Timestamp": [1658919600, 1658919602, 1658919601, 1658919677, 1658919670, 1658919672]})
from typing import Iterable, List, Dict, Any

def logic(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    # one Id's rows arrive as a list of dicts, already presorted by Timestamp
    _id = df[0]['Id']
    items = []
    for row in df:
        items.append(row['Value'])
    yield {"Id": _id, "Values": items}
Now we can call Fugue with one line of code to run this on Pandas. Fugue uses the type annotation from the logic function to handle conversions for you as it enters the function. We can run this for 1 ID (not sorted yet).
from fugue import transform
transform(df.loc[df["Id"] == "Id1"], logic, schema="Id:str,Values:[int]")
and that generates this:
Id Values
0 Id1 [100, 200, 300]
Now we are ready to bring it to Spark. All we need to do is add the engine and partitioning strategy to the transform call.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = transform(df,
                logic,
                schema="Id:str,Values:[int]",
                partition={"by": "Id", "presort": "Timestamp asc"},
                engine=spark)
sdf.show()
Because we passed in the SparkSession, this code will run on Spark. sdf is a Spark DataFrame, so we need .show() because it evaluates lazily. Schema is a requirement for Spark, so we need it on Fugue too, but it is significantly simplified. The partitioning strategy will run logic on each Id, and will sort the items by Timestamp within each partition.
For the FugueSQL version, you can do:
from fugue_sql import fsql
fsql(
"""
SELECT *
FROM df
TRANSFORM PREPARTITION BY Id PRESORT Timestamp ASC USING logic SCHEMA Id:str,Values:[int]
PRINT
"""
).run(spark)
Easiest solution:
from pyspark.sql.functions import asc, col, collect_list, concat_ws

df1 = df.sort(asc('Timestamp')).groupBy("id").agg(collect_list('Value').alias('newcol'))
df1.show()
+---+---------------+
| id| newcol|
+---+---------------+
|Id1|[100, 300, 200]|
|Id2| [433]|
+---+---------------+
df1.withColumn('newcol',concat_ws(";",col("newcol"))).show()
+---+-----------+
| id| newcol|
+---+-----------+
|Id1|100;300;200|
|Id2| 433|
+---+-----------+

spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2

There are two dataframes, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This could be done with a union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join and use coalesce, but I do not prefer this way, because in reality there are about 40 columns instead of the 3 in the above example.
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 & df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
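The same union plus left_anti idea in PySpark (a sketch, not part of the original Scala answer):
# Rows of df2 whose ID has no match in df1, appended to df1.
result = df1.union(df2.join(df1, ["ID"], "left_anti"))
result.show()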
One way to do this is to union the dataframes with an identifier column that records which dataframe each row came from, and then use it to prioritize rows from df1 with a function like row_number.
A PySpark solution is shown here.
from pyspark.sql.functions import lit,row_number,when
from pyspark.sql import Window
df1_with_identifier = df1.withColumn('identifier',lit('df1'))
df2_with_identifier = df2.withColumn('identifier',lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
#Define the Window with the desired ordering
w = Window.partitionBy(merged_df.id).orderBy(when(merged_df.identifier == 'df1',1).otherwise(2))
result = merged_df.withColumn('rownum',row_number().over(w))
result.filter(result.rownum == 1).drop('identifier', 'rownum').show()
A solution with a left join on df1 could be a lot simpler, except that you have to write multiple coalesces.
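For reference, the coalesce-based variant mentioned in the question does not have to be written out column by column; the coalesces can be generated from the column list. A sketch, assuming both dataframes share the same schema and ID is the key:
from pyspark.sql import functions as F

joined = df1.alias("a").join(df2.alias("b"),
                             F.col("a.ID") == F.col("b.ID"),
                             "full_outer")

# Prefer df1's value whenever the ID exists in both dataframes.
merged = joined.select([F.coalesce(F.col("a." + c), F.col("b." + c)).alias(c)
                        for c in df1.columns])
merged.show()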

cannot select columns in a table as one of the column name is limit

The code below results in issues because one of the columns is named with the keyword limit. If I remove the column 'limit' from the select list, the script works fine.
There are two tables, A and B. Table A has the following contents:
**** Table A ****
ID Day Name Description limit
1 2016-09-01 Sam Retail 100
2 2016-01-28 Chris Retail 200
3 2016-02-06 ChrisTY Retail 50
4 2016-02-26 Christa Retail 10
3 2016-12-06 ChrisTu Retail 200
4 2016-12-31 Christi Retail 500
Table B has the following contents:
**** Table B ****
ID SkEY
1 1.1
2 1.2
3 1.3
**** Tried Code ****
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

# sc, spark, Tab2 and i are assumed to be defined earlier in the job
hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select * From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    ABC2.ID == Tab2.ID
).select(
    Tab2.skey,
    ABC2.Day,
    ABC2.Name,
    ABC2.limit,   # this is what breaks: 'limit' clashes with the DataFrame.limit method
).withColumn('newcol1', lit('')) \
 .withColumn('newcol2', lit('A'))

ABC2.show()

ABC = spark.sql(
    "select distinct day from A where day= '2016-01-01' "
)
**** Expected Result ****
How can we amend the code so that the limit column is also selected?
It worked this way. I am not sure of the functional reason, but it was successful: rename limit to an alias beforehand, and afterwards get it back.
**** Working Code ****
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

# alias 'limit' to 'liu' in the SQL so the DataFrame API never has to reference it by that name
ABC2 = spark.sql(
    "select ID, Day, Name, Description, limit as liu From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    ABC2.ID == Tab2.ID
).selectExpr(
    "skey as skey",
    "Day as Day",
    "Name as Name",
    "liu as limit",   # rename the alias back to limit in the output
).withColumn('newcol1', lit('')) \
 .withColumn('newcol2', lit('A'))

ABC2.show()
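For reference (not part of the original answer), the rename round-trip can usually be avoided. The clash only happens with attribute access, because DataFrame.limit is a method, so bracket or col() access works in the DataFrame API, and backticks quote the identifier in Spark SQL. A sketch reusing the ABC2/Tab2 names from the question:
from pyspark.sql import functions as F

Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID).select(
    Tab2.skey,
    ABC2.Day,
    ABC2.Name,
    ABC2['limit'],   # bracket access returns the Column; F.col('limit') also works
)

# In Spark SQL, backticks quote the column name.
ABC3 = spark.sql("select ID, Day, Name, Description, `limit` from A")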

Rename column names when select from dataframe

I have 2 dataframes, df1 and df2, and I am left joining them on the id column and saving the result to another dataframe named df3. Below is the code that I am using, which works fine as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a Seq val like below and use the toDF method:
val columnsRenamed = Seq("id", "empl_name", "name","dept","dept_full_description") ;
df4 = df3.toDF(columnsRenamed: _*);
Is there any other way to do the aliasing in the first statement itself? My end goal is to not list about 30-40 columns explicitly.
I'd rename before join:
df1.alias("tab1").join(
df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
Seq("id"), "left_outer")

How to use NOT IN from a CSV file in Spark

I use Spark sql to load data into a val like this
val customers = sqlContext.sql("SELECT * FROM customers")
But I have a separate txt file that contains one column CUST_ID and 50,00 rows. i.e.
CUST_ID
1
2
3
I want my customers val to have all the customers in the customers table that are not in the txt file.
Using SQL I would do this with SELECT * FROM customers WHERE cust_id NOT IN ('1','2','3')
How can I do this using Spark?
I've read the textFile and I can print rows of it but I'm not sure how to match this with my sql query
scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3
You can import your text file as a dataframe and do a left outer join:
val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")
customers.join(custId,'id === 'custId,"leftOuter")
.where('custId.isNull)
.drop("custId")
.show()
+---+----+--------------+
| id|name| status|
+---+----+--------------+
| 3| FGA|never received|
+---+----+--------------+
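In PySpark (a sketch, not part of the original Scala answer), the same NOT IN filter can be written as a left_anti join. It assumes a SparkSession named spark, that cust_ids.txt keeps its CUST_ID header line, and that the customers table's key column is cust_id; adjust the names to the real schema:
# Read the id list with its header row, then keep only customers with no match in it.
cust_ids = spark.read.option("header", "true").csv("cust_ids.txt")

customers = spark.table("customers")
result = customers.join(cust_ids, customers["cust_id"] == cust_ids["CUST_ID"], "left_anti")
result.show()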