I am trying to migrate an Alteryx workflow to PySpark dataframes. As part of this, I came across a right outer self join on different columns (ph_id_1 and ph_id_2). When I do the same in PySpark, I do not get the correct output; I have tried anti and left anti joins, and all give the same result. Any suggestion on how to do this the PySpark way or the SQL way?
Tried:
df_new = df_1.join(df_2,[df_1['ph_id_1'] == df_2['ph_id_2']],how='left_anti')
and
df_new = df_1.filter(df_1['ph_id_1'] != df_2['ph_id_2'])
Both give the same result, which differs from the expected one.
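For reference, the right outer self join on the two different columns would typically be written like this in PySpark (a minimal sketch, assuming df_1 and df_2 are the two copies of the dataframe from the question; the aliases are only there to keep the duplicate columns unambiguous):

from pyspark.sql import functions as F

# Right outer join keeps every row of df_2 and matches on the two different id columns;
# columns coming from df_1 are null where no match exists.
df_new = (
    df_1.alias("a")
        .join(df_2.alias("b"), F.col("a.ph_id_1") == F.col("b.ph_id_2"), how="right")
)

# Rows that exist only on the right side can then be isolated with a null check on the left key.
right_only = df_new.filter(F.col("a.ph_id_1").isNull())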
I have a requirement where I need to join dataframes A and B, calculate a column, and then use that calculated value in another join between the same two dataframes with different join conditions.
e.g.:
DF_Combined = A_DF.join(B_DF,'Join-Condition',"left_outer").withColumn(col1,'value')
After doing the above, I need to do the same join but use the value calculated in the previous join:
DF_Final = A_DF.join(B_DF,'New-Join-Condition',"left_outer").withColumn(col2, DF_Combined.col1*vol1*10)
When I try to do this I get a cartesian product error.
You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ...), the resulting dataframe only has columns from A_DF and B_DF. If you want the new column, you need to use DF_Combined.
From your question I believe you don't need another join; you have two possible options:
1. When you do the first join, calculate vol1*10 at that point.
2. After the join, do DF_Combined.withColumn(...) (see the sketch below).
But please remember: withColumn(name, expr) creates a column with the given name, setting its value to the result of expr. So .withColumn(DF_Combined.col1, vol1*10) does not make sense.
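A minimal sketch of both options (A_DF, B_DF, col1 and vol1 are taken from the question; the join key "id" and the literal values are placeholders, not a tested implementation):

from pyspark.sql import functions as F

vol1 = 2.5                                   # placeholder scalar
join_condition = A_DF["id"] == B_DF["id"]    # placeholder join condition

# Option 1: compute the derived value inside the first (and only) join.
DF_Combined = (
    A_DF.join(B_DF, join_condition, "left_outer")
        .withColumn("col1", F.lit(1.0))      # stand-in for the calculated value
        .withColumn("col2", F.col("col1") * vol1 * 10)
)

# Option 2: keep the first join as it is and derive col2 afterwards on DF_Combined.
DF_Final = DF_Combined.withColumn("col2", F.col("col1") * vol1 * 10)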
So I'm trying to convert my Python algorithm to Spark-friendly code and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1, df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load("*.csv")
df2 = spark.read.load("*.csv")
func_udf = udf(index.indexer)  # ????
df = df.withColumn('column1', func_udf(df1.column1, df2.column2))  # ???
I've been using UDFs for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments: one column from one dataframe and another from a different dataframe? I can't join both dataframes as they have different lengths.
That's not how UDFs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
A standard udf works on data from a single row.
A pandas_udf works on data from a single partition or a single group.
I can't join both dataframes as they have different lengths.
A join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be of the same length; a Spark join is a relational join, not a row-wise merge (a sketch follows after the links below).
For similarity joins you can use the built-in approximate join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
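As a small illustration of the relational-join point above (the dataframes, column names and the similarity UDF are all hypothetical stand-ins for the record-linkage logic, not the recordlinkage library itself):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df1 = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["column1"])
df2 = spark.createDataFrame([("alicia",), ("bob",)], ["column2"])

# Crude stand-in similarity measure; real record linkage would use something stronger.
@F.udf(DoubleType())
def similarity(a, b):
    return float(a == b) if a and b else 0.0

# A (cross) join first puts one column from each dataframe into the same row,
# and only then can a UDF compare them. The inputs do not need equal lengths.
pairs = (
    df1.crossJoin(df2)
       .withColumn("score", similarity(F.col("column1"), F.col("column2")))
       .filter(F.col("score") > 0.5)
)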
I'm trying to extract data by joining two tables in PySpark. My join query looks like:
SELECT COUNT(DISTINCT m.ticker), to_date(m.date)
FROM extractalpha_cam2 m
LEFT OUTER JOIN TOP1000 u ON u.date = to_date(m.date)
GROUP BY m.date
ORDER BY m.date
It is throwing the error:
Error:Py4JJavaError: An error occurred while calling
z:org.apache.zeppelin.spark.ZeppelinContext.showDF
But when I tried extracting the data from each table individually, it worked fine. My single-table queries are:
SELECT to_date(date) FROM extractalpha_cam2
SELECT date from TOP1000
These two queries work fine. Can anyone help me extract the data from both tables by joining them?
It would be really helpful if anyone could share a link that guides me in writing efficient queries in PySpark.
I checked and found that this error occurs when the job you are running takes longer than the timeout you set. In my case it was 300 seconds.
Let me know if anyone has a more valuable answer than this. Thanks.
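Separately, for reference, roughly the same aggregation as the join query above expressed with the DataFrame API (a sketch assuming both tables are registered as tables or temp views; grouping is done on the casted date, which is usually what is intended):

from pyspark.sql import functions as F

m = spark.table("extractalpha_cam2")
u = spark.table("TOP1000")

result = (
    m.join(u, u["date"] == F.to_date(m["date"]), "left_outer")
     .groupBy(F.to_date(m["date"]).alias("date"))
     .agg(F.countDistinct(m["ticker"]).alias("distinct_tickers"))
     .orderBy("date")
)
result.show()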
When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
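Sketched in PySpark for consistency with the rest of this page (the Scala Dataset API has the same cache call; the dataframe names are illustrative):

from pyspark.sql import functions as F

joined = (
    impressions.join(real_urls, ["iid", "globalVisitorKey", "date"], "outer")
               .cache()
)

# Materialize the cache with one action; later counts and filters then reuse
# this snapshot instead of re-reading source data that may have changed.
joined.count()

filtered = joined.where(F.col("timestamp") >= F.lit("2021-01-01").cast("timestamp"))
filtered.count()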
I have a dataframe and I want to add, for each row, new_col = max(some_column0) grouped by some other column1:
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between
logical plans\nProject ... Use the CROSS JOIN syntax to allow
cartesian products between these relations.;'
What I do not understand is why Spark finds a cartesian product here.
A possible way to work around this error: I save the DF to a Hive table, then initialize the DF again as a select from that table, or replace these two lines with a Hive query; either way works. But I don't want to save the DF.
As described in Why does spark think this is a cross/cartesian join, it may be caused by the following:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product is detected, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
Try to persist the dataframes before joining them. It worked for me.
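A minimal sketch of that suggestion applied to the dataframes from the question (persist plus one action to materialize; whether this is strictly needed depends on the Spark version):

from pyspark.sql.functions import max as max_

maxs = (
    df0.groupBy("catalog")
       .agg(max_("row_num").alias("max_num"))
       .withColumnRenamed("catalog", "catalogid")
       .persist()
)
maxs.count()  # materialize so the join reads the persisted result rather than the shared lineage

df0.join(maxs, df0["catalog"] == maxs["catalogid"]).take(4)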
I've faced the same problem with a cartesian product in my join.
In order to overcome it I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))