spark structured streaming joining aggregate dataframe to dataframe - scala

I have a streaming dataframe that could look at some point like:
+-----+-------+
|owner| fruits|
+-----+-------+
|Brian|  apple|
|Brian|   pear|
|Brian|   date|
|Brian|avocado|
|  Bob|avocado|
|  Bob|  apple|
|  ...|    ...|
+-----+-------+
I performed a groupBy with agg(collect_list) to clean things up:
val myFarmDF = farmDF
  .withWatermark("timeStamp", "1 seconds")
  .groupBy("owner")
  .agg(collect_list(col("fruits")) as "fruitsA")
The output is a single row for each owner with an array of all of their fruits.
I would now like to join this cleaned-up array back to the original streaming dataframe, dropping the fruits column and keeping just the fruitsA column:
val joinedDF = farmDF.join(myFarmDF, "owner").drop("fruits")
This seems to work in my head, but Spark doesn't seem to agree.
I get a
Failure when resolving conflicting references in Join:
'Join Inner
...
+- AnalysisBarrier
+- Aggregate [name#17], [name#17, collect_list(fruits#61, 0, 0) AS fruitA#142]
When I turn everything into a static dataframe, it works just fine. Is this not possible in a streaming context?

Have you tried renaming the column? There is a similar problem: https://issues.apache.org/jira/browse/SPARK-19860
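A hedged sketch of that workaround, aliasing the grouping key on the aggregated side so the self-join no longer carries two attributes with the same lineage (the alias ownerA is made up for illustration; whether the streaming join itself is allowed still depends on your Spark version and output mode):
import org.apache.spark.sql.functions.{col, collect_list}

val myFarmDF = farmDF
  .withWatermark("timeStamp", "1 seconds")
  .groupBy(col("owner").as("ownerA"))              // renamed key avoids the duplicate reference
  .agg(collect_list(col("fruits")).as("fruitsA"))

val joinedDF = farmDF
  .join(myFarmDF, farmDF("owner") === myFarmDF("ownerA"))
  .drop("fruits", "ownerA")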

Related

Scala/Spark join on a second key if the first key doesn't exist in one of the dataframes

I have two dataframes:
RegionValues:
+-----------+----------+----------------------+
|marketplace|primary_id|values |
+-----------+----------+----------------------+
|xyz |0000000001|[cat, dog, cow] |
|reg |PRT0000001|[hippo, dragon, moose]|
|asz |0000001333|[mouse, rhino, lion] |
+-----------+----------+----------------------+
Marketplace:
+----------+-----------+----------+
|primary_id|marketplace|parent_id |
+----------+-----------+----------+
|0000000001|xyz |PRT0000001|
|0000000002|wrt |PRT0000001|
|PRT0000001|reg |PRT0000001|
|PRT00MISS0|asz |PRT00MISS0|
|000000000B|823 |PRT0000002|
+----------+-----------+----------+
When I join the dataframes together I want to join them based on the primary_id value, but if the primary_id field is not present in the RegionValues dataframe, then I want to fall back to joining on parent_id === primary_id. So my desired output would be:
+----------+--------------+-----------+-------------------------------------+
|primary_id|marketplace |parent_id |values |
+----------+--------------+-----------+-------------------------------------+
|0000000001|... |PRT0000001 |[cat, dog, cow] |
|0000000002|... |PRT0000001 |[hippo, dragon, moose] |
|PRT0000001|... |PRT0000001 |[hippo, dragon, moose] |
|PRT00MISS0| |PRT00MISS0 |null |
|0000001333| |0000001333 |[mouse, rhino, lion] |
|000000000B| |PRT0000002 |null |
+----------+--------------+-----------+-------------------------------------+
Note that 0000000001 maintained its original values, but 0000000002 took on its parent_id's values since it's not present in RegionValues. Is it possible to accomplish this logic within a join statement? I am using Scala and Spark.
I have tried to use a join statement like this, but it results in a null value for the 0000000002 values:
val parentIdJoinCondition = when(
  (regionValuesDf.col("primary_id") === marketplaceDf.col("primary_id")).isNull,
  marketplaceDf.col("parent_id") === regionValuesDf.col("primary_id")
).otherwise(regionValuesDf.col("primary_id") === marketplaceDf.col("primary_id"))

val joinedDf = regionDf.join(
  marketplaceDf,
  parentIdJoinCondition,
  "outer"
)
I think I could get my desired result by using 3 distinct joins but this seems unnecessary and harder to read.
Creating custom conditions like this will result in Spark performing a cross join, which is a very inefficient way to join. Moreover, there is no way for Spark to know that a row has no match before actually performing the join, so your condition (regionValuesDf.col("primary_id") === marketplaceDf.col("primary_id")).isNull will always return false.
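For illustration, a minimal sketch (assuming an existing SparkSession named spark; the literal values are made up) of why the isNull check cannot detect a missing match: the equality only evaluates to NULL when one of its operands is NULL, never because the other dataframe lacks a matching row.
import org.apache.spark.sql.functions.lit

spark.range(1).select(
  (lit("0000000002") === lit("PRT0000001")).isNull.as("isnull_on_mismatch"),        // false: a mismatch is just false, not NULL
  (lit(null).cast("string") === lit("PRT0000001")).isNull.as("isnull_on_null_key")  // true: only a NULL operand yields NULL
).show()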
So, as you correctly guessed, the best solution is to perform several joins, and you can get away with two: a first join to determine whether the primary_id or the parent_id value should be used as the key for the outer join, and then the actual outer join. Finally, you can merge the primary_id, marketplace, and parent_id columns and drop the helper columns.
So the code would be:
import org.apache.spark.sql.functions.{coalesce, col, when}

val joinedDf = marketplaceDf.join(regionDf.drop("marketPlace"), Seq("primary_id"), "left")
  .withColumn("join_key", when(col("values").isNotNull, col("primary_id")).otherwise(col("parent_id")))
  .drop("values")
  .join(
    regionDf
      .withColumnRenamed("primary_id", "join_key")
      .withColumnRenamed("marketplace", "region_marketplace"),
    Seq("join_key"),
    "outer"
  )
  .withColumn("primary_id", coalesce(col("primary_id"), col("join_key")))
  .withColumn("parent_id", coalesce(col("parent_id"), col("join_key")))
  .withColumn("marketplace", coalesce(col("marketplace"), col("region_marketplace")))
  .drop("join_key", "region_marketplace")
That gives you the following joinedDf dataframe:
+----------+-----------+----------+----------------------+
|primary_id|marketplace|parent_id |values |
+----------+-----------+----------+----------------------+
|0000000001|xyz |PRT0000001|[cat, dog, cow] |
|0000001333|asz |0000001333|[mouse, rhino, lion] |
|0000000002|wrt |PRT0000001|[hippo, dragon, moose]|
|PRT0000001|reg |PRT0000001|[hippo, dragon, moose]|
|000000000B|823 |PRT0000002|null |
|PRT00MISS0|asz |PRT00MISS0|null |
+----------+-----------+----------+----------------------+
Shouldn't using regionValuesDf.col("primary_id") =!= marketplaceDf.col("primary_id") instead of (regionValuesDf.col("primary_id") === marketplaceDf.col("primary_id")).isNull in your join statement help?

spark merge datasets based on the same input of one column and concat the others

Currently I have several Dataset[UserRecord]s, and they look like this:
case class UserRecord(
  Id: String,
  ts: Timestamp,
  detail: String
)
Let's call the several datasets datasets.
Previously I tried this
datasets.reduce(_ union _)
  .groupBy("Id")
  .agg(collect_list(struct("ts", "detail")))
  .as[(String, Seq[DetailRecord])]
but this code gives me an OOM error. I think the root cause is collect_list.
Now I'm wondering if I can do the groupBy and agg for each of the datasets first and then join them together to solve the OOM issue. Any other good advice is welcome too :)
I have an IndexedSeq of datasets that look like this:
|name| lists        |
| x  |[[1,2], [3,4]]|

|name| lists        |
| y  |[[5,6], [7,8]]|

|name| lists           |
| x  |[[9,10], [11,12]]|
How can I combine them to get a Dataset that looks like
|name| lists |
| x |[[1,2], [3,4],[9,10], [11,12]]|
| y |[[5,6], [7,8]] |
I tried ds.reduce(_ union _) but it didn't seem to work
You can aggregate after union:
val ds2 = ds.reduce(_ unionAll _).groupBy("name").agg(flatten(collect_list("lists")).as("lists"))
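A slightly fuller sketch of the same idea with the imports spelled out (assuming ds is the IndexedSeq of datasets from the question, each with a "name" column and a "lists" array column, and Spark 2.4+ for flatten; union is the non-deprecated equivalent of unionAll here):
import org.apache.spark.sql.functions.{collect_list, flatten}

val combined = ds
  .reduce(_ union _)                                 // stack all datasets row-wise
  .groupBy("name")
  .agg(flatten(collect_list("lists")).as("lists"))   // concatenate the collected arrays per name
combined.show(false)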

Spark Column merging all list into 1 single list

I want to merge the column below into a single list for n-gram calculation. I am not sure how I can merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info: I would like this as a Spark dataframe column, with all the words, including the repeated ones, in a single list. The data is kind of big, so I want to avoid methods like collect.
The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = sqlContext.createDataFrame(values, ['author'])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
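Since the rest of the thread is tagged Scala, here is a hedged Scala sketch of the same idea that yields one flat list of words (repeats kept) rather than a list of nested arrays, assuming a DataFrame df with an array<string> column named author: explode each array into rows, then collect everything back into a single list.
import org.apache.spark.sql.functions.{col, collect_list, explode}

val flatDF = df
  .select(explode(col("author")).as("word"))             // one row per word
  .agg(collect_list(col("word")).as("list_of_authors"))  // single flat list of all words
flatDF.show(truncate = false)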
DataFrames, like other distributed data structures, are not iterable; they can only be accessed using dedicated higher-order functions and/or SQL methods.
Suppose your DataFrame is DF1 and the output is DF2. You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author'])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if it works.

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts, where x is the number I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany), and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe, add each of these results to it one by one, and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val howMany = 2

val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
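As a short follow-up (a sketch reusing the newDF from above): dropping the helper count column is then a one-liner. One caveat worth noting: collect_list does not guarantee element order, so the Ids inside each array are not necessarily sorted by descending count.
val result = newDF.drop("count")   // keep only query and Ids
result.orderBy($"query").show(false)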

How to split column into multiple columns in Spark 2?

I am reading data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result:
import org.apache.spark.sql.functions.{explode, regexp_extract, split}
import spark.implicits._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful