I am running into this error when I try to select a couple of columns from a temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column name location_id does not exist in the table under that exact name. select * from cars_tmp works because it does not reference any column by name.
Please see this answer, which deals with the same error: https://stackoverflow.com/a/64042756/8913402
I resolved the issue by listing each column explicitly in the pandas select query. So something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
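The error message lists the columns as cars_tmp.abc.location_id, which suggests the pandas column names themselves carried an "abc." prefix. If that is the case, another option might be to rename the columns before creating the Spark DataFrame. A minimal sketch, assuming that prefix (strip_prefix is a hypothetical helper, not part of pandas or Spark):

```python
def strip_prefix(columns, prefix="abc."):
    """Return the column names with a leading prefix removed, if present."""
    return [c[len(prefix):] if c.startswith(prefix) else c for c in columns]

# Column names as they appear in the error message, without the view name.
columns = ["abc.product_id", "abc.location_id"]
cleaned = strip_prefix(columns)
print(cleaned)  # ['product_id', 'location_id']

# Hypothetical usage with the pandas DataFrame from the question:
# pd_df.columns = strip_prefix(pd_df.columns)
# df = spark.createDataFrame(pd_df)
```

Quoting the full name in backticks, as in select `abc.location_id` from cars_tmp, may also work, since backticks let Spark SQL treat a dotted name as a single column.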
I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job to delete rows, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have 2 data sources: the first is the old Athena table in which data has to be updated or deleted, and the second is the table that receives the new updated or deleted data.
In ds I have selected all rows that have to be deleted in the old table.
op is for operation; 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are upsert, insert and bulk_insert. Check the Hudi docs.
Also, what is your intention for deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
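To make this concrete, here is a sketch of how the options from the question might look for a hard delete. It assumes the payload class is combined with the upsert operation, which is my reading of the suggestion above and not something it states, so check it against the documentation for your Hudi version:

```python
# Options from the question, unchanged.
hudi_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'name',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1,
}

# Assumption: write with 'upsert' and let the empty payload mark the keys as deleted.
hard_delete_options = {
    **hudi_delete_options,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.payload.class':
        'org.apache.hudi.common.model.EmptyHoodieRecordPayload',
}

print(hard_delete_options['hoodie.datasource.write.operation'])  # upsert
```

The write call itself would stay as in the question, with options(**hard_delete_options) in place of options(**hudi_delete_options).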
I have a DataFrame A_DF which has, among others, two columns COND_B and COND_C. I also have two other DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF to the rows where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible like this.
EDIT
error: Attribute COND_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.;
It looks like I can only filter on columns of the same DF, because of the #number that is added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but using collect() on 100m records was too expensive.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true, but even with that I got some unexpected results.
Is there a smoother way to do this? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")
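To check what this query shape returns without a Spark session, the same WHERE clause can be tried against an in-memory SQLite database. This is only a stand-in for Spark SQL, and the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, COND_B TEXT, COND_C TEXT);
    CREATE TABLE B (COND_B TEXT);
    CREATE TABLE C (COND_C TEXT);
    -- Row 1 matches B only, row 2 matches C only, row 3 matches both, row 4 matches neither.
    INSERT INTO A VALUES (1, 'b1', 'c9'), (2, 'b9', 'c1'), (3, 'b1', 'c1'), (4, 'b9', 'c9');
    -- B deliberately contains a duplicate key.
    INSERT INTO B VALUES ('b1'), ('b1');
    INSERT INTO C VALUES ('c1');
""")

rows = conn.execute("""
    SELECT id FROM A
    WHERE EXISTS (SELECT 1 FROM B WHERE A.COND_B = B.COND_B)
       OR EXISTS (SELECT 1 FROM C WHERE A.COND_C = C.COND_C)
    ORDER BY id
""").fetchall()
matched_ids = [r[0] for r in rows]
print(matched_ids)  # [1, 2, 3]
conn.close()
```

Each row of A comes back at most once, even when both conditions hold or when B contains duplicate keys, so no dropDuplicates() step is needed afterwards.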
I have used PySpark SQL to join together two tables, one containing crime location data with longitude and latitude and the other containing postcodes with their corresponding longitude and latitude.
What I am trying to work out is how to tally up how many crimes have occurred within each postcode. I am new to PySpark and my SQL is rusty so I am unsure where I am going wrong.
I have tried to use COUNT(DISTINCT) but that is simply giving me the total number of distinct postcodes.
mySchema = StructType([StructField("Longitude", StringType(),True), StructField("Latitude", StringType(),True)])
bgl_df = spark.createDataFrame(burglary_rdd, mySchema)
bgl_df.registerTempTable("bgl")
rdd2 = spark.sparkContext.textFile("posttrans.csv")
mySchema2 = StructType([StructField("Postcode", StringType(),True), StructField("Lon", StringType(),True), StructField("Lat", StringType(),True)])
pcode_df = spark.createDataFrame(pcode_rdd, mySchema2)
pcode_df.registerTempTable("pcode")
count = spark.sql("""
SELECT COUNT(DISTINCT pcode.Postcode)
FROM pcode RIGHT JOIN bgl
ON (bgl.Longitude = pcode.Lon
AND bgl.Latitude = pcode.Lat)
""")
+------------------------+
|count(DISTINCT Postcode)|
+------------------------+
| 523371|
+------------------------+
Instead I want something like:
+--------+---+
|Postcode|Num|
+--------+---+
|LN11 9DA| 2 |
|BN10 8JX| 5 |
| EN9 3YF| 9 |
|EN10 6SS| 1 |
+--------+---+
You can do a groupby count to get a distinct count of values for a column:
group_df = df.groupby("Postcode").count()
You will get the output you want.
For an SQL query:
query = """
SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
FROM pcode
RIGHT JOIN bgl
ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
GROUP BY pcode.Postcode
"""
count = spark.sql(query)
I have also copied in your FROM and JOIN clauses so that the query can be copy-pasted as is.
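To see the shape of the result, the GROUP BY query can be run on a few made-up rows in an in-memory SQLite database. This is a stand-in for Spark SQL, and the coordinates are invented. Because older SQLite versions have no RIGHT JOIN, the sketch uses the equivalent bgl LEFT JOIN pcode:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pcode (Postcode TEXT, Lon TEXT, Lat TEXT);
    CREATE TABLE bgl (Longitude TEXT, Latitude TEXT);
    INSERT INTO pcode VALUES ('LN11 9DA', '0.1', '53.1'), ('BN10 8JX', '0.2', '50.8');
    -- Two burglaries at the first postcode, one at the second, one with no matching postcode.
    INSERT INTO bgl VALUES ('0.1', '53.1'), ('0.1', '53.1'), ('0.2', '50.8'), ('9.9', '9.9');
""")

rows = conn.execute("""
    SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
    FROM bgl
    LEFT JOIN pcode
      ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
    GROUP BY pcode.Postcode
    ORDER BY pcode.Postcode
""").fetchall()
print(rows)  # [(None, 0), ('BN10 8JX', 1), ('LN11 9DA', 2)]
conn.close()
```

Burglaries without a matching postcode end up in a NULL group with a count of 0, because COUNT(pcode.Postcode) skips NULLs. An inner join would drop that group.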
val scc = spark.read.jdbc(url,table,properties)
val d = scc.createOrReplaceTempView("k")
spark.sql("select * from k").show()
If you look at the code above, line 1 reads the complete table and line 3 then fetches the results for the desired query. Reading the complete table and then querying it takes a lot of time. Can't we execute our query while establishing the connection? Please help me if you have any prior knowledge about this.
Check this out.
var dbTable =
"(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name";
Dataset<Row> jdbcDF =
sparkSession.read().jdbc(CONNECTION_URL, dbTable,connectionProperties);
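The same idea in PySpark: pass a parenthesised subquery with an alias where the table name is expected, so the database only returns the rows and columns you asked for. A minimal sketch; as_jdbc_subquery is a hypothetical helper, and it assumes your database accepts AS before a table alias:

```python
def as_jdbc_subquery(query: str, alias: str) -> str:
    """Wrap a SELECT statement so it can be passed in place of a table name."""
    return f"({query.strip().rstrip(';')}) AS {alias}"

db_table = as_jdbc_subquery(
    "select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees",
    "employees_name",
)
print(db_table)

# Hypothetical usage; url and properties are placeholders, not values from this thread:
# jdbc_df = spark.read.jdbc(url, db_table, properties=properties)
```

Since the subquery is sent to the database as part of the FROM clause, filtering and projection happen there and not in Spark.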
I want to generate a query from a list in PySpark.
list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM table WHERE email IN (" + list + ")"
This is my desired output:
query
SELECT * FROM table WHERE email IN ("hi#gmail.com", "goodbye#gmail.com")
Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects
Can anyone help me achieve this? Thanks
If someone's having the same issue, I found that you can use the following code:
"'"+"','".join(map(str, emails))+"'"
and you will have the following output:
SELECT * FROM table WHERE email IN ('hi#gmail.com', 'goodbye#gmail.com')
Try this:
Dataframe based approach -
from pyspark.sql.functions import col
df = spark.createDataFrame([(1, "hi#gmail.com"), (2, "goodbye#gmail.com"), (3, "abc#gmail.com"), (4, "xyz#gmail.com")], ['id', 'email_id'])
email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
df.where(col('email_id').isin(email_filter_list)).show()
Spark SQL based approach -
df = spark.createDataFrame([(1,"hi#gmail.com") ,(2,"goodbye#gmail.com",),(3,"abc#gmail.com",),(4,"xyz#gmail.com")], ['id','email_id'])
df.createOrReplaceTempView('t1')
sql_filter = ','.join(["'" +i + "'" for i in email_filter_list])
spark.sql("SELECT * FROM t1 WHERE email_id IN ({})".format(sql_filter)).show()
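If the list can contain arbitrary user input, building the IN list by plain string joining can break the query. Here is a sketch of a small helper that refuses values containing quotes or backslashes and does not try to escape them (sql_in_list is a hypothetical name, not a PySpark function):

```python
def sql_in_list(values):
    """Build the comma-separated, single-quoted list for an IN (...) clause."""
    values = list(values)
    if not values:
        raise ValueError("an empty IN () list is not valid SQL")
    for v in values:
        # Reject characters that could end the string literal early.
        if "'" in v or "\\" in v:
            raise ValueError(f"refusing to quote unsafe value: {v!r}")
    return ", ".join(f"'{v}'" for v in values)

email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM t1 WHERE email_id IN ({})".format(sql_in_list(email_filter_list))
print(query)
# SELECT * FROM t1 WHERE email_id IN ('hi#gmail.com', 'goodbye#gmail.com')
```

The DataFrame approach with isin() avoids the problem entirely, because the values never pass through SQL text.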