'tuple' object has no attribute 'select'

'tuple' object has no attribute 'select' - pyspark

I am trying to rewrite this SQL query into Pyspark
SELECT b.BrandId,a.PaymentIdentifier,a.PaymentName
FROM checkpaymentdetailmca a
INNER JOIN dimension b
ON a.Brand = b.Brand
AND a.StoreNumber = b.Storenumber
Here is my attempt at writing this in pyspark:
df = spark.read.parquet('abfss://container_name#xyz.dfs.core.windows.net/folder_name/checkpaymentdetailmca/parquet/**/*')
df1= spark.read.parquet('abfss://container_name#xyz.dfs.core.windows.net/folder_name/dimension//parquet/**/*')
df_final = df.join(df1,df['sbr_brand']==df1['brand'])\
&(df1,df['sbr_number']==df1['store_number'])\
.select(df['payment_number','payment_type_name','payment_identifier'],df1['brand_id'])
display(df_final)
Here is the types of error I get:
'tuple' object has no attribute 'select'
'Column' object is not callable

Related

PySpark best way to filter df based on columns from different df's

I have a DF A_DF which has among others two columns say COND_B and COND_C. Then I have 2 different df's B_DF with COND_B column and C_DF with COND_C column.
Now I would like to filter A_DF where the value match in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; looks like I can filter only on same DF because of the #number added on the fly..
So I first tried to do a list from B_DF and C_DF and use filter based on that but it was too expensive to use collect() on 100m of records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
dropDuplicates() I used to removed duplicate records where both conditions where true. But even with that I got some unexpected results.
Is there some other - smoother solution to do it simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on #mck response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!

Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

replace column and get ltrim of the column value

I want to replace an column in an dataframe. need to get the scala
syntax code for this
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Need to write as : HIDENE
ie: remove the Controlling_Area present in Hierarchy_Name .
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
Expecting to get the LTRIM and replace code in scala

You can use withColumnRenamed to achieve that:
import org.apache.spark.sql.functions
val dfPC = ReadLatest("/Full", "parquet")
.withColumnRenamed("Hierarchy_Name","Controlling_Area")
.withColumn("Controlling_Area",ltrim(col("Controlling_Area")))

How to execute hql file in spark with arguments

I have a hql file which accepts several arguments and I then in stand alone spark application, I am calling this hql script to create a dataframe.
This is a sample hql code from my script:
select id , name, age, country , created_date
from ${db1}.${table1} a
inner join ${db2}.${table2} b
on a.id = b.id
And in this is how I am calling it in my Spark script:
import scala.io.Source
val queryFile = `path/to/my/file`
val db1 = 'cust_db'
val db2 = 'cust_db2'
val table1 = 'customer'
val table2 = 'products'
val query = Source.fromFile(queryFile).mkString
val df = spark.sql(query)
When I am using this way, I am getting:
org.apache.spark.sql.catylyst.parser.ParserException
Is there a way to pass arguments directly to my hql file and then create a df out of the hive code.

Parameters can be injected with such code:
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))

pyspark sql : AttributeError: 'NoneType' object has no attribute 'join'

def main(inputs, output):
sdf = spark.read.csv(inputs, schema=observation_schema)
sdf.registerTempTable('filtertable')
result = spark.sql("""
SELECT * FROM filtertable WHERE qflag IS NULL
""").show()
temp_max = spark.sql(""" SELECT date, station, value FROM filtertable WHERE (observation = 'TMAX')""").show()
temp_min = spark.sql(""" SELECT date, station, value FROM filtertable WHERE (observation = 'TMIN')""").show()
result = temp_max.join(temp_min, condition1).select(temp_max('date'), temp_max('station'), ((temp_max('TMAX')-temp_min('TMIN'))/10)).alias('Range'))
Error:
Traceback (most recent call last):
File "/Users/syedikram/Documents/temp_range_sql.py", line 96, in <module>
main(inputs, output)
File "/Users/syedikram/Documents/temp_range_sql.py", line 52, in main
result = temp_max.join(temp_min, condition1).select(temp_max('date'), temp_max('station'), ((temp_max('TMAX')-temp_min('TMIN')/10)).alias('Range'))
AttributeError: 'NoneType' object has no attribute 'join'
Performing on join operation gives me Nonetype object error. Looking online didn't help as there is little documentation online for pyspark sql.
What am I doing wrong here?

Remove the .show() from temp_max and temp_min because show only prints a string and does not return anything (hence you get AttributeError: 'NoneType' object has no attribute 'join').

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names and they are as expected, but then when do a SQL join I get an error that cannot resolve column name given the inputs. I have simplified the merge just to get it to work, but I will need to add in more join conditions which is why I'm using SQL (will be adding in: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!

I would recommend renaming the name of the field 'device_id' in at least one of the data frame. I modified your query just a bit and tested it(in scala). Below query works
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you are doing a 'select * ' in above statement, it will work. But if you try to select 'device_id', you will get an error "Reference 'device_id' is ambiguous" . As you can see in the above 'test' data frame definition, it has two fields with the same name(device_id). So to avoid this, I recommend changing field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
.withColumnRenamned("device_id","device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use dataframes or sqlContext
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
Above queries will work for your problem. And if your data set is too large, I would recommend to not use '>' or '<' operators in Join condition as it causes cross join which is a costly operation if data set is large. Instead use them in WHERE condition.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

'tuple' object has no attribute 'select' - pyspark

Related

PySpark best way to filter df based on columns from different df's

replace column and get ltrim of the column value

How to execute hql file in spark with arguments

pyspark sql : AttributeError: 'NoneType' object has no attribute 'join'

Column name cannot be resolved in SparkSQL join

Categories

Resources