Getting Error: Using PythonUDF in join condition of join type LeftSemi is not supported - pyspark

I have a pyspark.sql DataFrame that was created by an inner join of two DataFrames. After the join I also added a column that provides the week start date based on the date.
Joined_data = Joined_data.withColumn("Week_start_date", date_sub(next_day('AsOfDate', 'Sun'), 7))
Now I want to create a list (collection) of all the week start dates, so I am using the code below.
DateList=Joined_data.select('Week_start_date').dropDuplicates()
I am getting the error: "Using PythonUDF in join condition of join type LeftSemi is not supported."
If I remove the dropDuplicates() method from the line above, it runs fine without any error.
Does anyone have any idea why I am getting this error with the dropDuplicates() method?
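For what it's worth, one workaround that is sometimes suggested for this kind of analyzer error is to cut the lineage between the UDF-based join and the deduplication, for example by checkpointing the joined DataFrame before calling dropDuplicates(). The sketch below only illustrates that idea with the names from the question; the checkpoint directory is a placeholder and whether this applies to your exact plan is an assumption.

from pyspark.sql.functions import date_sub, next_day

# Assumption: `spark` is the active SparkSession and Spark >= 2.1.
# checkpoint() materialises Joined_data, so the deduplication no longer
# shares a plan with the UDF-based join.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # example path

Joined_data = Joined_data.withColumn(
    "Week_start_date", date_sub(next_day('AsOfDate', 'Sun'), 7)
).checkpoint()

DateList = Joined_data.select('Week_start_date').dropDuplicates()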

Related

Error while extracting the data from two dataframes using SQL

I'm trying to extract data by joining two tables in pyspark. My join query looks like:
SELECT COUNT(DISTINCT m.ticker),to_date(m.date) FROM extractalpha_cam2 m LEFT OUTER JOIN TOP1000 u ON u.date = to_date(m.date) GROUP BY m.date ORDER BY m.date
It is throwing the error:
Error:Py4JJavaError: An error occurred while calling
z:org.apache.zeppelin.spark.ZeppelinContext.showDF
But when I tried extracting the data from each table on its own, it worked fine. My single-table queries are:
SELECT to_date(date) FROM extractalpha_cam2
SELECT date from TOP1000
These two queries work fine. Can anyone help me extract the data from both tables with a join?
It would also be really helpful if anyone could share a link that can guide me in writing efficient queries in pyspark.
I checked and found that this error comes when the job you are running takes longer than the timeout you have set. In my case it was 300 seconds.
Let me know if anyone has a better answer than this. Thanks.
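Since the question also asks for guidance on writing the query in pyspark itself, here is a rough DataFrame-API equivalent of the SQL above. Table and column names are taken from the question; that `spark` is the active session and the exact schemas are assumptions.

from pyspark.sql import functions as F

# Read both tables and convert the date column once up front (schema assumed).
m = spark.table("extractalpha_cam2").withColumn("m_date", F.to_date("date"))
u = spark.table("TOP1000").select(F.col("date").alias("u_date"))

result = (
    m.join(u, m["m_date"] == u["u_date"], "left_outer")
     .groupBy("m_date")
     .agg(F.countDistinct("ticker").alias("distinct_tickers"))
     .orderBy("m_date")
)
result.show()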

Column reference is ambiguous

I'm getting the error
"org.postgresql.util.PSQLException: ERROR: column reference "date_created" is ambiguous"
I have a Base class that defines the date_created field and then all the other classes extend it.
I'm making a set of REST controllers. All of them use
"sqlRestriction("GREATEST(date_created, last_updated) >= ?", [fromLastUpdated])"
All of them use the same piece of code. In the other 10 cases it works, but in the 11th case it does not, and I don't get why. It is nearly identical to all the other cases (the difference is just the other columns).
Where could this issue come from?
SOLUTION
Grails domain classes allow you to have references to other tables
like
Table2 table
within your domain class.
This causes Hibernate to create a join clause between table1 and table2.
So I printed out the generated criteria and made small modifications to fix the ambiguity:
"sqlRestriction("GREATEST(this_.date_created, this_.last_updated) >= ?", [fromLastUpdated])"
this_ is the alias given to the domain on which you create the criteria.

spark join raises "Detected cartesian product for INNER join"

I have a dataframe and I want to add, for each row, new_col = max(some_column0) grouped by some other column1:
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between
logical plans\nProject ... Use the CROSS JOIN syntax to allow
cartesian products between these relations.;'
What am I not understanding: why does Spark detect a cartesian product here?
A possible way to work around this error: I save the DF to a Hive table, then init the DF again as a select from that table. Or I replace these two lines with a Hive query - it makes no difference. But I don't want to save the DF.
As described in Why does spark think this is a cross/cartesian join, it may be caused by:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product was generated, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
Try persisting the dataframes before joining them. It worked for me.
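For reference, a sketch of that persist suggestion applied to the code from the question (names reused from above; the claim that this avoids the error is the answer's, not guaranteed):

from pyspark.sql import functions as F

# Persist (and materialise) the aggregated side before the join; the answer
# above reports that this avoided the cartesian-product error.
maxs = (df0.groupBy("catalog")
           .agg(F.max("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid")
           .persist())
maxs.count()  # force materialisation up front

df0.join(maxs, df0.catalog == maxs.catalogid).take(4)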
I've faced the same cartesian product problem with my join.
To overcome it, I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))

Transact-SQL Ambiguous column name

I'm having trouble with the 'Ambiguous column name' issue in Transact-SQL, using Microsoft SQL Server 2012 Management Studio.
I've been looking through some of the answers already posted on Stack Overflow, but they don't seem to work for me, and parts of them I simply don't understand or lose the overview of.
Executing the following script:
USE CDD
SELECT Artist, Album_title, track_title, track_number, Release_Year, EAN_code
FROM Artists AS a INNER JOIN CD_Albumtitles AS c
ON a.artist_id = c.artist_id
INNER JOIN Track_lists AS t
ON c.title_id = t.title_id
WHERE track_title = 'bohemian rhapsody'
triggers the following error message:
Msg 209, Level 16, State 1, Line 3
Ambiguous column name 'EAN_code'.
Note that this is a CD database with artist names, album titles and track lists. Both the tables 'CD_Albumtitles' and 'Track_lists' have a column with identical EAN codes. The EAN code is an important international code used to uniquely identify CD albums, which is why I would like to keep using it.
You need to put the alias in front of all the columns in your select list and your where clause. You're getting that error because one of the columns you have currently is coming from multiple tables in your join. If you alias the columns, it will essentially pick one or the other of the tables.
SELECT a.Artist,c.Album_title,t.track_title,t.track_number,c.Release_Year,t.EAN_code
FROM Artists AS a INNER JOIN CD_Albumtitles AS c
ON a.artist_id = c.artist_id
INNER JOIN Track_lists AS t
ON c.title_id = t.title_id
WHERE t.track_title = 'bohemian rhapsody'
So choose one of the source tables, prefixing the field with its alias (or table name):
SELECT Artist,Album_title,track_title,track_number,Release_Year,
c.EAN_code -- or t.EAN_code, which should retrieve the same value
By the way, try to prefix all the fields (in the select, the join, the group by, etc.), it's easier for maintenance.

Aggregate (sum, max, avg) functions in Criteria API JPA

In my Criteria API code, the following query, which selects three columns of my table, works:
cq.multiselect(root.get("point").get("id"), root.get("player").get("userid"), root.get("amount"));
But when I want the sum of the amount column using the following query, it gives an SQL error. The query is:
cq.multiselect(root.get("point").get("id"), root.get("player").get("userid"), cb.sum(root.get("amount")) );
The error that I am getting is:
{"id":"6","result":null,"error":"\r\nInternal Exception: com.sap.dbtech.jdbc.exceptions.jdbc40.SQLSyntaxErrorException: [-8017] (at 8): Column must be group column:ID\r\nError Code: -8017\r\n
Please help me with this, as I have been stuck on this for hours now. Thanks
The message is telling you that you need a group by clause in your query. Every column in the select clause (except the ones which are the result of an aggregate function) must be in the group by clause:
criteriaQuery.groupBy(root.get("point").get("id"),
                      root.get("player").get("userid"));