I am using Spark 2.3 and am trying to join two streams of data. My left and right streams both contain an array column. I want to join the two streams only when the right stream's array is a subset of the left stream's array.
For instance, my streamA looks like this:
StreamA:
|----|-------|---------------------|-----------|
| id | dept  | employeesInMeetings | DateTime  |
|----|-------|---------------------|-----------|
| 1  | sales | [John]              | 7/2 14:00 |
| 2  | mktg  | [Adam, Mike]        | 7/2 12:30 |
| 3  | hr    | [Rick, Jill, Andy]  | 7/2 14:00 |
|----|-------|---------------------|-----------|
and my streamB looks as follows:
StreamB:
|--------------|-----------|-----------|
| employees    | confRooms | DateTime  |
|--------------|-----------|-----------|
| [John, Jane] | A         | 7/2 14:00 |
| [Adam, Mike] | C         | 7/2 12:30 |
| [Jill, Andy] | B         | 7/2 14:00 |
|--------------|-----------|-----------|
I only care about employees from the same department that are in the same meeting. Hence, as a result of the intersection, my resulting stream needs to look like:
|----|------|---------------------|-----------|----------|
| id | dept | employeesInMeetings | DateTime  | confRoom |
|----|------|---------------------|-----------|----------|
| 2  | mktg | [Adam, Mike]        | 7/2 12:30 | C        |
| 3  | hr   | [Rick, Jill, Andy]  | 7/2 14:00 | B        |
|----|------|---------------------|-----------|----------|
I created a UDF to do the intersect:
// Spark passes array columns to Scala UDFs as Seq (WrappedArray), not Array.
val arrayIntersect = udf { (leftArr: Seq[String], rightArr: Seq[String]) =>
  leftArr.intersect(rightArr).length == rightArr.size
}
And tried to use it as follows:
streamA.joinWith(streamB, expr("arrayIntersect(leftArr, rightArr) AND streamA.DateTime BETWEEN streamB.DateTime and streamB.DateTime + INTERVAL 12 hours"))
However, I get the error:
org.apache.spark.sql.AnalysisException: Stream stream joins without equality predicate is not supported;
Does anybody know if there is a workaround here? Any help will be appreciated! Thanks!
Unfortunately, there is no workaround for this in stream-stream joins :(
We really need an equality predicate because we use it to perform the join with a streaming symmetric hash join algorithm: both streams are partitioned by the common key so that the related records from both streams end up in the same partition.
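For illustration only (this is not a way around the limitation): Spark 2.3 will accept a stream-stream join once the condition contains at least one equality predicate. The sketch below assumes the shared DateTime column could act as that key; that assumption is mine and not part of the original question.
import org.apache.spark.sql.functions._
// Register the subset-check UDF so it can be referenced by name inside expr().
spark.udf.register("arrayIntersect",
  (leftArr: Seq[String], rightArr: Seq[String]) =>
    leftArr.intersect(rightArr).length == rightArr.size)
// The equality on DateTime (assumed) gives the streaming symmetric hash join a
// common key to partition both streams by; the UDF runs as an extra condition.
val joined = streamA.as("a").join(
  streamB.as("b"),
  expr("a.DateTime = b.DateTime AND arrayIntersect(a.employeesInMeetings, b.employees)"))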
First convert your arrays into strings, then search for the right array string inside the left array string.
val arrayToString = udf { arr: Seq[String] =>
  arr.sorted.map(_.trim.toLowerCase).mkString(",")
}
streamA.withColumn("leftArrStr", arrayToString(col("leftArr")))
  .joinWith(
    streamB.withColumn("rightArrStr", arrayToString(col("rightArr"))),
    expr("instr(leftArrStr, rightArrStr) != 0 " +
      "AND streamA.DateTime BETWEEN streamB.DateTime AND streamB.DateTime + INTERVAL 12 hours"))
Related
I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session id. There are multiple keywords that should be part of the output for each session id; the expected output looks like Table 2.
Input dataframe:
(Table 1)
|------------+--------+-----------------------------------|
| session_id | value  | Timestamp                         |
|------------+--------+-----------------------------------|
| 1          | cat    | 2021-01-11T13:48:54.2514887-05:00 |
| 1          | catc   | 2021-01-11T13:48:54.3514887-05:00 |
| 1          | catch  | 2021-01-11T13:48:54.4514887-05:00 |
| 1          | par    | 2021-01-11T13:48:55.2514887-05:00 |
| 1          | part   | 2021-01-11T13:48:56.5514887-05:00 |
| 1          | party  | 2021-01-11T13:48:57.7514887-05:00 |
| 1          | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2          | fal    | 2021-01-11T13:49:54.2514887-05:00 |
| 2          | fall   | 2021-01-11T13:49:54.3514887-05:00 |
| 2          | falle  | 2021-01-11T13:49:54.4514887-05:00 |
| 2          | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2          | Tem    | 2021-01-11T13:49:56.5514887-05:00 |
| 2          | Temp   | 2021-01-11T13:49:56.7514887-05:00 |
|------------+--------+-----------------------------------|
Expected Output:
(Table 2)
|------------+--------|
| session_id | value  |
|------------+--------|
| 1          | catch  |
| 1          | partyy |
| 2          | fallen |
| 2          | Temp   |
|------------+--------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partly.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|------------+--------|
| session_id | value  |
|------------+--------|
| 1          | catch  |
| 2          | fallen |
|------------+--------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one output row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function: compare each value with the next value in the same session, and keep a row when the next value is shorter (or missing), i.e. the word was fully typed.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Order by Timestamp within each session so that lead() looks at the next keystroke.
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show()
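For completeness, here is a self-contained way to try the snippet above. The rows are copied from Table 1; the toDF construction is only for illustration, since the real data comes from parquet as in the question.
import spark.implicits._
val dfM = Seq(
  (1, "cat",    "2021-01-11T13:48:54.2514887-05:00"),
  (1, "catc",   "2021-01-11T13:48:54.3514887-05:00"),
  (1, "catch",  "2021-01-11T13:48:54.4514887-05:00"),
  (1, "par",    "2021-01-11T13:48:55.2514887-05:00"),
  (1, "part",   "2021-01-11T13:48:56.5514887-05:00"),
  (1, "party",  "2021-01-11T13:48:57.7514887-05:00"),
  (1, "partyy", "2021-01-11T13:48:58.7514887-05:00"),
  (2, "fal",    "2021-01-11T13:49:54.2514887-05:00"),
  (2, "fall",   "2021-01-11T13:49:54.3514887-05:00"),
  (2, "falle",  "2021-01-11T13:49:54.4514887-05:00"),
  (2, "fallen", "2021-01-11T13:49:54.8514887-05:00"),
  (2, "Tem",    "2021-01-11T13:49:56.5514887-05:00"),
  (2, "Temp",   "2021-01-11T13:49:56.7514887-05:00")
).toDF("session_id", "value", "Timestamp")
// Running the lead-based filter above on this dfM returns catch, partyy,
// fallen and Temp, matching Table 2.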
I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AB      | 2    | 1    |
| AA      | 2    | 3    |
| AC      | 1    | 2    |
| AA      | 3    | 2    |
| AC      | 5    | 3    |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each split part in a separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AA      | 2    | 3    |
| AA      | 3    | 2    |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC      | 1    | 2    |
| AC      | 5    | 3    |
-------------------------
and so on.
I need to save every resulting dataframe as a different csv file.
The number of keys is not big (20-30), but the total amount of data is large (~200 million records).
I have a solution where, in a loop, each part of the data is selected and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
val dfi = df.where($"col_key" === lit(k))
SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the problem with this solution is that every selection by key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and, during this pass, put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You need to partition by the column when writing the csv output; each key's data is then written to its own subdirectory.
yourDF
.write
.partitionBy("col_key")
.csv("/path/to/save")
Why don't you try this?
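If the goal is to end up with roughly one csv file per key, a common refinement (my suggestion, not part of the answer above) is to repartition by the same column before writing:
import org.apache.spark.sql.functions.col
// All rows of a given col_key land in a single task, so each key's
// directory ends up with (roughly) one csv file.
yourDF
  .repartition(col("col_key"))
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")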
I have a 2-column Spark Scala DataFrame. The first column holds a single value, the second is an array of letters. What I am trying to do is find a way to code a tally (without using a for loop) of the values inside the arrays.
For example, this is what I have (sorry it's not that neat, this is my first Stack post). There are 5 computers, and each person is represented by a letter. I want to find out how many computers each person (A, B, C, D, E) has used.
+------------+-------------+
| id         | [person]    |
+------------+-------------+
| Computer 1 | [A,B,C,D]   |
| Computer 2 | [A,B]       |
| Computer 3 | [A,B,E]     |
| Computer 4 | [A,C,D]     |
| Computer 5 | [A,B,C,D,E] |
+------------+-------------+
What I would like to code up, or what I am asking if anyone has a solution for, is something like this:
+--------+---------+
| Person | [Count] |
+--------+---------+
| A      | 5       |
| B      | 4       |
| C      | 3       |
| D      | 3       |
| E      | 2       |
+--------+---------+
Somehow I need to count how often each person appears in the arrays within the dataframe.
There's a function called explode which will expand the arrays into one row for each item:
+------------+--------+
| id         | person |
+------------+--------+
| Computer 1 | A      |
| Computer 1 | B      |
| Computer 1 | C      |
| Computer 1 | D      |
| ...        | ...    |
+------------+--------+
Then you can group by the person and count. Something like:
import org.apache.spark.sql.functions.explode
import spark.implicits._
val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count
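If you also want the tallies in the order shown in the desired table, a small optional step (my addition, not part of the original answer):
import org.apache.spark.sql.functions.desc
// groupBy(...).count names the new column "count"; sort it descending.
result.orderBy(desc("count")).show()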
I have this dataframe
+------------+-----------------------+--------+-----+------+
| customerid | event                 | A      | B   | C    |
+------------+-----------------------+--------+-----+------+
| 1222222    | 2019-02-07 06:50:40.0 | aaaaaa | 25  | 5025 |
| 1222222    | 2019-02-07 06:50:42.0 | aaaaaa | 35  | 5000 |
| 1222222    | 2019-02-07 06:51:56.0 | aaaaaa | 100 | 4965 |
+------------+-----------------------+--------+-----+------+
I want to update the value of column C by event (timestamp) and keep only the row with the latest update in a new dataframe, like this:
+------------+-----------------------+--------+-----+------+
| customerid | event                 | A      | B   | C    |
+------------+-----------------------+--------+-----+------+
| 1222222    | 2019-02-07 06:51:56.0 | aaaaaa | 100 | 4965 |
+------------+-----------------------+--------+-----+------+
The data is arriving in streaming mode with Spark Streaming.
You can try creating a row number partitioned by customerid and ordered by event descending, then take the rows where rownum is 1. I hope this helps.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
  .filter(col("rownum") === 1)
  .drop("rownum")
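To sanity-check the snippet, the question's rows can be rebuilt as a small static dataframe; the toDF construction below is only for illustration, since in the real job the data arrives from the stream.
import spark.implicits._
val df = Seq(
  ("1222222", "2019-02-07 06:50:40.0", "aaaaaa", 25, 5025),
  ("1222222", "2019-02-07 06:50:42.0", "aaaaaa", 35, 5000),
  ("1222222", "2019-02-07 06:51:56.0", "aaaaaa", 100, 4965)
).toDF("customerid", "event", "A", "B", "C")
// Applying the row_number filter above keeps only the 06:51:56.0 row.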
I have a data set like this, which I am reading from a csv file and converting into an RDD using Scala.
+-----------+-----------+----------+
| recent    | Freq      | Monitor  |
+-----------+-----------+----------+
| 1         | 1234      | 199090   |
| 4         | 2553      | 198613   |
| 6         | 3232      | 199090   |
| 1         | 8823      | 498831   |
| 7         | 2902      | 890000   |
| 8         | 7991      | 081097   |
| 9         | 7391      | 432370   |
| 12        | 6138      | 864981   |
| 7         | 6812      | 749821   |
+-----------+-----------+----------+
How can I sort the data on all columns?
Thanks
Suppose your input RDD/DataFrame is called df.
To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:
import org.apache.spark.sql.functions._
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))
You can use df.orderBy(...) as well; it's an alias of sort().
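If the goal is literally to sort on every column without listing them by hand, one possible sketch (ascending on all columns):
import org.apache.spark.sql.functions.col
// df.columns gives the column names in order; turn them into sort expressions.
val sortedAll = df.sort(df.columns.map(col): _*)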
csv.sortBy(r => (r.recent, r.freq)) or equivalent should do it
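For the RDD route, a minimal sketch; the case class and field names below are assumptions for illustration, since the question does not show how the csv lines are parsed:
// Hypothetical row type; adjust to however the csv lines are actually parsed.
case class Record(recent: Int, freq: Int, monitor: String)
// Sort by all three columns: recent first, then Freq, then Monitor.
val sorted = csv.sortBy(r => (r.recent, r.freq, r.monitor))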