PySpark best way to filter df based on columns from different df's

PySpark best way to filter df based on columns from different df's - pyspark

I have a DF A_DF which has among others two columns say COND_B and COND_C. Then I have 2 different df's B_DF with COND_B column and C_DF with COND_C column.
Now I would like to filter A_DF where the value match in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; looks like I can filter only on same DF because of the #number added on the fly..
So I first tried to do a list from B_DF and C_DF and use filter based on that but it was too expensive to use collect() on 100m of records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
dropDuplicates() I used to removed duplicate records where both conditions where true. But even with that I got some unexpected results.
Is there some other - smoother solution to do it simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on #mck response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!

Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

Related

Is there any pyspark function or substitute for like in sql

I have a python list of all the columns of the dataframe as below.
['Timestamp',
'ScheduleCode__VALUE',
'ScheduleCode__i:nil',
'ProductionCode__VALUE',
'ProductionCode__i:nil',
'ProductCode__VALUE',
'ProductCode__i:nil',
'ProductCategory__VALUE',
'ProductCategory__i:nil']
I need to drop all the columns from the above list which ends with __i:nil and rename all the columns with __value to only it's prefix like ProductCode__VALUE should be renamed to ProductCode.

Try this:
column_list = ['Timestamp',
'ScheduleCode__VALUE',
'ScheduleCode__i:nil',
'ProductionCode__VALUE',
'ProductionCode__i:nil',
'ProductCode__VALUE',
'ProductCode__i:nil',
'ProductCategory__VALUE',
'ProductCategory__i:nil']
for element in column_list:
if(element.endswith('__Value')):
df = (
df.withColumnRenamed(element, element.split('__')[0])
)
df = df.drop(*[element for element in column_list if element.endswith('__i:nil')])

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join more than 1 column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")

Try using as below -
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (["id"] and ["cpid"]), how = "inner")

Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')

AnalysisException: cannot resolve given input columns:

I am running in to this error when I am trying to select a couple of columns from the temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried below queries but they fail as well with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in datbricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks

The column name does not exist in the table. select * from cars_tmp works because you do not specify the column name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.

I resolved the issue by add each column in the panda select query. So something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)

Counting how many times each distinct value occurs in a column in PySparkSQL Join

I have used PySpark SQL to join together two tables, one containing crime location data with longitude and latitude and the other containing postcodes with their corresponding longitude and latitude.
What I am trying to work out is how to tally up how many crimes have occurred within each postcode. I am new to PySpark and my SQL is rusty so I am unsure where I am going wrong.
I have tried to use COUNT(DISTINCT) but that is simply giving me the total number of distinct postcodes.
mySchema = StructType([StructField("Longitude", StringType(),True), StructField("Latitude", StringType(),True)])
bgl_df = spark.createDataFrame(burglary_rdd, mySchema)
bgl_df.registerTempTable("bgl")
rdd2 = spark.sparkContext.textFile("posttrans.csv")
mySchema2 = StructType([StructField("Postcode", StringType(),True), StructField("Lon", StringType(),True), StructField("Lat", StringType(),True)])
pcode_df = spark.createDataFrame(pcode_rdd, mySchema2)
pcode_df.registerTempTable("pcode")
count = spark.sql("SELECT COUNT(DISTINCT pcode.Postcode)
FROM pcode RIGHT JOIN bgl
ON (bgl.Longitude = pcode.Lon
AND bgl.Latitude = pcode.Lat)")
+------------------------+
|count(DISTINCT Postcode)|
+------------------------+
| 523371|
+------------------------+
Instead I want something like:
+--------+---+
|Postcode|Num|
+--------+---+
|LN11 9DA| 2 |
|BN10 8JX| 5 |
| EN9 3YF| 9 |
|EN10 6SS| 1 |
+--------+---+

You can do a groupby count to get a distinct count of values for a column:
group_df = df.groupby("Postcode").count()
You will get the ouput you want.
For an SQL query:
query = """
SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
FROM pcode
RIGHT JOIN bgl
ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
GROUP BY pcode.Postcode
"""
count = spark.sql(query)
Also, I have copied in from your FROM and JOIN clause to make the query more relevant for copy-pasta.

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names and they are as expected, but then when do a SQL join I get an error that cannot resolve column name given the inputs. I have simplified the merge just to get it to work, but I will need to add in more join conditions which is why I'm using SQL (will be adding in: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!

I would recommend renaming the name of the field 'device_id' in at least one of the data frame. I modified your query just a bit and tested it(in scala). Below query works
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you are doing a 'select * ' in above statement, it will work. But if you try to select 'device_id', you will get an error "Reference 'device_id' is ambiguous" . As you can see in the above 'test' data frame definition, it has two fields with the same name(device_id). So to avoid this, I recommend changing field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
.withColumnRenamned("device_id","device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use dataframes or sqlContext
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
Above queries will work for your problem. And if your data set is too large, I would recommend to not use '>' or '<' operators in Join condition as it causes cross join which is a costly operation if data set is large. Instead use them in WHERE condition.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

PySpark best way to filter df based on columns from different df's - pyspark

Related

Is there any pyspark function or substitute for like in sql

Pyspark join on multiple aliased table columns

AnalysisException: cannot resolve given input columns:

Counting how many times each distinct value occurs in a column in PySparkSQL Join

Column name cannot be resolved in SparkSQL join

Categories

Resources