In the code below I need to use groupBy and a count-distinct aggregation on a single dataframe, but I'm getting an error when running it:
df_retail_sales_transaction = df_retail_sales_transaction.join(df_transaction_line_item,['transaction_key','transaction_id'] , 'left_outer')
df_retail_sales_transaction = df_retail_sales_transaction.join(df_transaction_payment, ['transaction_id'], 'left_outer')
df1 = df_retail_sales_transaction.groupby('business_week', 'personnel_key', 'location_id','country_code') \
.agg(df_retail_sales_transaction.select(countDistinct(df_transaction_line_item.transaction_id) , df_retail_sales_transaction.business_week, df_retail_sales_transaction.country_code, df_retail_sales_transaction.location_id , df_retail_sales_transaction.personnel_key)) \
.filter((df_product_hierarchy_type.product_hierarchy_level == "1") & (df_product_hierarchy_type.product_hierarchy_code == "1") \
& (df_transaction_payment.method_of_payment_name.isin( \
['VISA', 'Bank Debit', 'Cash', 'Mastercard', 'Split', 'American Express', 'BPme', 'BP Prepaid Card', 'BPrewards', \
'Diners Club', 'Manual JCB', 'Union Pay', 'Manual MASTERCARD', 'Manual VISA', 'Manual BANKCARD', 'JCB', \
'Manual AMEX', 'Manual DINERS CLUB'])))
The error I'm getting is:
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and '`business_week`' is not an aggregate function. Wrap '(count(DISTINCT `transaction_id`) AS `count(DISTINCT transaction_id)`)' in windowing function(s) or wrap '`business_week`' in first() (or first_value) if you don't care which value you get.;;
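For reference, a minimal sketch of the aggregation pattern the error message points toward (the grouping columns are taken from the question; this is an assumed reconstruction, not the original author's final code):
# Assumed sketch: group by the desired keys and aggregate directly,
# instead of nesting a select() inside agg().
from pyspark.sql.functions import countDistinct
df1 = df_retail_sales_transaction \
    .groupBy('business_week', 'personnel_key', 'location_id', 'country_code') \
    .agg(countDistinct('transaction_id').alias('distinct_transactions'))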
I am removing some words from a column. At the end of the day some rows will be empty because their entire string has been removed; what remains might be a space, other whitespace, or nothing. How can I remove these rows?
I tried this, but for some reason it does not work for all kinds of rows:
df = df.withColumn('col1', trim(regexp_replace('col1', '\n', '')))
df = df.filter(df.col1 != '')
The filter you've applied will work for empty strings, but not for values that contain only whitespace.
Try trim(<column>) != ''.
Example
import pyspark.sql.functions as func

spark.sparkContext.parallelize([('',), (' ',), ('  ',)]).toDF(['foo']). \
    filter(func.col('foo') != ''). \
    count()
# 2

spark.sparkContext.parallelize([('',), (' ',), ('  ',)]).toDF(['foo']). \
    filter(func.trim(func.col('foo')) != ''). \
    count()
# 0
I am trying to apply a join between two dataframes.
df_temp_5=df_temp_4 \
.join(df_position_g, cond, "left")
Where
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1 & cond2]
and
df_temp_4 =df_key_hub_instrument\
.join(df_lnk_position,["instrument_pk"], "outer") \
.join(df_key_hub_portefeuille,["portefeuille_pk"], "outer") \
.join(df_lnk_tiers_instrument,["instrument_pk"], "outer") \
.join(df_keys_position_hors_bilan,["position_pk"], "outer") \
.join(df_keys_portefeuille_sigma,["portefeuille_pk"], "outer") \
.join(df_keys_instrument_sigma,["instrument_pk"], "outer") \
.join(df_keys_cotation_sigma,["instrument_pk"], "outer")
Note that df_temp_4 is fine and there is no problem there.
But I run into an issue when building df_temp_5 with the join above.
The error is:
Py4JJavaError: An error occurred while calling o466.join. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
max_dt#238 missing from valeur_actuelle#88,montant_coupon_couru_acha
Any help, please? Thanks.
You are trying to join df_temp_4 with df_position_g, but the join condition cond references df_keys_position_g and df_position_g. This seems to be a mistake.
Also, cond1 and cond2 must have parentheses around them, because the & operator has higher precedence than ==.
# This will work as expected
cond1=(df_position_g.position_pk==df_keys_position_g.position_pk)
cond2=(df_position_g.dt_deb_val==df_keys_position_g.max_dt)
cond = [cond1 & cond2]
Alternatively, the parentheses can be omitted if & is not used explicitly and the conditions are passed as a list instead.
# This will also work as expected
cond1=df_position_g.position_pk==df_keys_position_g.position_pk
cond2=df_position_g.dt_deb_val==df_keys_position_g.max_dt
cond = [cond1, cond2]
Finally, based on the error message it looks like there is more than one column called max_dt among the dataframes being joined, so Spark cannot identify the right one. It could also be due to a known Spark bug. So try renaming the columns used in the join condition, just to be safe.
df_keys_position_g = df_keys_position_g.withColumnRenamed("position_pk", "keys_position_pk")\
.withColumnRenamed("max_dt", "keys_position_max_dt")
The condition should be
cond = cond1 & cond2
Instead of
cond = [cond1 & cond2]
This is because wrapping cond in square brackets turns it into a list.
Looking at the example in the source code of DataFrame.join:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]
I am trying to use Time Travel just to see how to use it, but when I run:
spark.sql("SELECT * FROM RIDES.YELLOW_TAXI VERSION AS OF 0")
It throws the error:
ParseException:
extraneous input '0' expecting {<EOF>, ';'}(line 1, pos 38)
== SQL ==
SELECT * FROM RIDES.YELLOW_TAXI AS OF 0
--------------------------------------^^^
I've tried changing the query to double quotes and placing the version number in single quotes. The really odd thing is that the query returns results when I simply run:
spark.sql("SELECT * FROM RIDES.YELLOW_TAXI AS OF")
That's just a side note though, as I thought it would fail to parse. The data I'm using is:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-02.parquet
Full code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName('test') \
.config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config('spark.ui.port', '4050') \
.getOrCreate()
delta_dir = './output/delta'
spark.sql('CREATE DATABASE IF NOT EXISTS RIDES')
spark.sql('''
CREATE TABLE IF NOT EXISTS RIDES.YELLOW_TAXI(
VendorID long,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp,
passenger_count double,
trip_distance double,
RatecodeID double,
store_and_fwd_flag string,
PULocationID long,
DOLocationID long,
payment_type long,
fare_amount double,
extra double,
mta_tax double,
tip_amount double,
tolls_amount double,
improvement_surcharge double,
total_amount double,
congestion_surcharge double,
airport_fee double
) USING DELTA
LOCATION "{0}"
'''.format(delta_dir)
)
df = spark.read.parquet('yellow_tripdata_2021-01.parquet')
def update():
spark.sql("""MERGE INTO RIDES.YELLOW_TAXI
USING load_table
ON RIDES.YELLOW_TAXI.VendorID = load_table.VendorID and
RIDES.YELLOW_TAXI.tpep_pickup_datetime = load_table.tpep_pickup_datetime and
RIDES.YELLOW_TAXI.tpep_dropoff_datetime = load_table.tpep_dropoff_datetime and
RIDES.YELLOW_TAXI.PULocationID = load_table.PULocationID and
RIDES.YELLOW_TAXI.DOLocationID = load_table.DOLocationID
WHEN NOT MATCHED THEN
INSERT (VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
congestion_surcharge,
airport_fee) VALUES (VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
congestion_surcharge,
airport_fee)
""")
And then I load the files into load_table and run the update.
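A hypothetical sketch of that loading step, which is not shown in the question (the view name load_table matches the MERGE statement above):
# Assumed reconstruction: register the parquet DataFrame as the temporary
# view that the MERGE statement reads from, then run the merge.
df.createOrReplaceTempView("load_table")
update()
When I run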
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, './output/delta')
fullHistoryDF = deltaTable.history()
fullHistoryDF.show()
I can see all the versions there, but just can't figure out how to query a specific version.
The syntax is incorrect - if you're fetching a specific version, you need to use the syntax VERSION AS OF NNNN, but you're simply using AS OF NNNN. See the docs for details.
Update: this syntax will be supported only starting with Spark 3.3 (see SPARK-37219 for details), which will be released in the near future.
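In the meantime, a minimal sketch of the DataFrame-reader form of time travel, which Delta Lake already supports on current releases (the path is taken from the question):
# Read a specific version of the Delta table through the DataFrame reader
# instead of SQL time travel.
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("./output/delta")
df_v0.show()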
I have the following query written in sqlalchemy
_stats = db.session.query(
func.count(a.status==s.STATUS_IN_PROGRESS).label("in_progress_"),
func.count(a.status==s.STATUS_COMPLETED).label("completed_"),
) \
.filter(a.uuid==some_uuid) \
.first()
This returns (7, 7), which is incorrect; it should return (7, 0), i.e. in_progress_ = 7, completed_ = 0.
When I do this in two queries I get the correct values
_stats_in_progress = db.session.query(
func.count(a.status==s.STATUS_IN_PROGRESS).label("in_progress_"),
) \
.filter(a.uuid==some_uuid) \
.first()
_stats_in_complete = db.session.query(
func.count(a.status==s.STATUS_COMPLETED).label("completed_"),
) \
.filter(a.uuid==some_uuid) \
.first()
The corresponding SQL also does not work when using the two counts
SELECT count(a.status = 'IN_PROGRESS') AS in_progress_,
count(a.status = 'STATUS_COMPLETED') AS completed_
FROM a
WHERE a.uuid = '9a353554a6874ebcbf0fe88eb8223d33'
This returns (7, 7) too, while if I run the query with just one count I get the correct values.
Does anyone know what I'm doing wrong?
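A likely explanation: COUNT(expr) counts non-NULL values, and a false comparison such as a.status = 'STATUS_COMPLETED' is still non-NULL, so both counts see every matching row. A minimal sketch of the usual conditional-aggregate workaround (assuming SQLAlchemy 1.4+ for the case() signature; names taken from the question):
# Hypothetical sketch: sum 1 only for rows where the comparison is true.
from sqlalchemy import case, func
_stats = db.session.query(
    func.sum(case((a.status == s.STATUS_IN_PROGRESS, 1), else_=0)).label("in_progress_"),
    func.sum(case((a.status == s.STATUS_COMPLETED, 1), else_=0)).label("completed_"),
).filter(a.uuid == some_uuid).first()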
I am getting this PostgreSQL exception. This is my query:
select * from (\
select \
st.station_type_id as "id", \
case when \
( \
mast.station_type_desc is null \
) \
then st.station_type_desc else mast.station_type_desc end as "description" \
from station_type st left outer \
join master_agent_station_type mast on st.station_type_id = mast.station_type_id \
and mast.party_id='%s' where st.is_active='Y'\
) as result
I am new to PostgreSQL.