I have two dataframes, df1 and ip2Country.
df1 contains IP addresses, and I am trying to map each IP address to geolocation information such as longitude and latitude, which are columns in ip2Country.
I am running it as a spark-submit job, but the operation takes a very long time even though df1 has fewer than 2,500 rows.
My code:
val agg = df1.join(ip2Country, ip2Country("network_start_int") === df1("sint"), "inner")
  .select($"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)

val agg1 = agg.join(ip2Country, ip2Country("network_start_int") === agg("dint"), "inner")
  .select($"src_ip", $"scountry",
    $"scode", $"slong",
    $"slat", $"dst_ip",
    $"country_name".alias("dcountry"),
    $"iso_3".alias("dcode"),
    $"longitude".alias("dlong"),
    $"latitude".alias("dlat"), $"count")
  .filter($"dlong".isNotNull)
Is there any other way to join the two tables? Or am I doing it the wrong way?
If you have a big dataframe that needs to be joined with a small one, broadcast joins are very effective. Read here: Broadcast Joins (aka Map-Side Joins)
bigdf.join(broadcast(smalldf))
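Applied to the query above, a minimal sketch could look like this, assuming ip2Country is small enough to fit in executor memory (GeoIP lookup tables usually are); the second join on dint would get the same broadcast(ip2Country) treatment:
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship ip2Country to every executor so the join runs map-side, without shuffling df1.
val agg = df1.join(broadcast(ip2Country), ip2Country("network_start_int") === df1("sint"), "inner")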
I would like to perform a union operation on multiple Structured Streaming dataframes, each connected to a Kafka topic, in order to watermark them all at the same moment.
For instance:
df1=socket_streamer(spark,topic1)
df2=socket_streamer(spark,topic2)
where spark is the SparkSession and socket_streamer is a wrapper around spark.readStream.
Then I'll do:
Dataframe=df1.union(df2)
Dataframe=Dataframe.withWatermark("timestamp","5 minutes")
Then I try to writeStream the Dataframe.
The issue is that the union only shows rows from the first dataframe that receives data.
Do you have any idea how to get all of my data through the union, or how I can apply the same watermark to multiple dataframes?
Thank you!
Do df1 and df2 have the same structure? The union function in Spark resolves columns by position (not by name).
To union by name, use:
df1.unionByName(df2, allowMissingColumns=True)
(available from Spark 3.1.X)
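A small illustration of the positional behaviour, shown with batch dataframes and Scala for brevity (the names a and b are made up; the same behaviour applies to streaming dataframes and to the PySpark API):
import spark.implicits._  // assuming a SparkSession named spark

val a = Seq(("k1", "v1")).toDF("key", "value")
val b = Seq(("v2", "k2")).toDF("value", "key")

// union matches by position: b's "k2" ends up under the "value" column of a's schema
a.union(b).show()
// unionByName matches by column name, so "key" lines up with "key"
a.unionByName(b).show()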
I have two physical nodes that are not synchronised.
Both nodes produce captured data. (The two-node setup was put in place for resilience.)
I am facing the following challenge:
The nodes produce two identical files (timestamps may not be the same, and there is no unique identifier to remove duplicates by). Both dataframes share the same schema.
Is there a way to write something like this for dataframes using PySpark:
df3 = case
  when df1.count() < df2.count() then df2
  when df1.count() > df2.count() then df1
  else df1
I resolved this case by defining a "comparison" function.
def compare(df1, df2):
    if df1.count() > df2.count():
        return df1
    elif df1.count() < df2.count():
        return df2
    else:
        return df1
It seems that working with dataframes as objects can be achieved via functions.
So I'm trying to convert my Python algorithm to Spark-friendly code, and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex \
(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1,df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load("*.csv")
df2 = spark.read.load("*.csv")
func_udf = udf(indexer.index)  # ????
df = df.withColumn('column1', func_udf(df1.column1, df2.column2))  # ???
I've been using udfs for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments: one column from one dataframe and another from a different dataframe? I can't join the two dataframes as they have different lengths.
That's not how UDFs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
A standard udf operates on data from a single row.
A pandas_udf operates on data from a single partition or a single group.
I can't join both dataframes as they have different lengths.
A join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be of the same length: a Spark join is a relational join, not a row-wise merge.
For similarity joins you can use the built-in approximate join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
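As a rough sketch of the approximate-join idea from the first link, character n-grams plus MinHash LSH can produce candidate pairs without comparing every row against every other row. This is written in Scala (the same classes exist in pyspark.ml), and the column names text/column1/column2 follow the question; treat it as an outline under those assumptions, not a drop-in solution:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, MinHashLSHModel, NGram, RegexTokenizer}

// Rename both sides to a common column so one pipeline can transform them.
val left  = df1.selectExpr("column1 as text")
val right = df2.selectExpr("column2 as text")

val pipeline = new Pipeline().setStages(Array(
  new RegexTokenizer().setInputCol("text").setOutputCol("chars").setPattern("").setMinTokenLength(1),
  new NGram().setN(3).setInputCol("chars").setOutputCol("ngrams"),
  new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
  new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
))

val model = pipeline.fit(left)
val lsh = model.stages.last.asInstanceOf[MinHashLSHModel]

// Candidate pairs whose approximate Jaccard distance on character 3-grams is below 0.6.
// Note: very short strings that yield no n-grams must be filtered out first, because MinHashLSH rejects empty vectors.
val pairs = lsh.approxSimilarityJoin(model.transform(left), model.transform(right), 0.6, "jaccardDist")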
When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a difference of 1-5 rows, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
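A minimal sketch of that suggestion, reusing the names from the question (impressionsJoinedCached is just an illustrative name):
val impressionsJoinedCached = impressionsJoined.cache()
// Materialize the cache once so the three joins (and the reads behind them) are evaluated a single time;
// the later filter and count steps then reuse the cached result instead of recomputing it.
impressionsJoinedCached.count()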
We have customer data in one Hive table and sales data in another Hive table, which holds terabytes of data. We are trying to pull the sales data for multiple customers and save it to a file.
What we have tried so far:
We tried a left outer join between the customer and sales tables, but because of the huge sales data it is not working.
val data = customer.join(sales, customer("id") === sales("customerID"), "left_outer")
So the alternative is to pull the data from the sales table based on a list of specific customer regions, check whether each region's data contains the customer data, and if it does, save it to another dataframe, appending the data for all the regions into that same dataframe.
My question here is whether appending data to a dataframe multiple times is supported in Spark.
If the sales dataframe is larger than the customer dataframe, you could simply switch the order of the dataframes in the join operation:
val data = sales.join(customer, sales("customerID") === customer("id"), "left_outer")
You could also add a hint for Spark to broadcast the smaller dataframe, though I believe it needs to be smaller than 2 GB:
import org.apache.spark.sql.functions.broadcast
val data = sales.join(broadcast(customer), sales("customerID") === customer("id"), "left_outer")
Using the other approach and iteratively merging dataframes is also possible. For this purpose you can use the union method (Spark 2.0+) or unionAll (older versions). This method appends one dataframe to another. In the case where you have a list of dataframes that you want to merge with each other, you can use union together with reduce:
val dataframes = Seq(df1, df2, df3)
dataframes.reduce(_ union _)