Spark Scala: finding values in another dataframe

Hello, I'm fairly new to Spark and I need help with this little exercise. I want to find certain values in another dataframe, but if a value isn't present I want to reduce its length until I find a match. I have these dataframes:
----------------
|values_to_find|
----------------
| ABCDE        |
| CBDEA        |
| ACDEA        |
| EACBA        |
----------------

--------------
| list  | Id |
--------------
| EAC   | 1  |
| ACDE  | 2  |
| CBDEA | 3  |
| ABC   | 4  |
--------------
And I expect the following output:
-------------------------------
| Id | list  | values_to_find |
-------------------------------
| 4  | ABC   | ABCDE          |
| 3  | CBDEA | CBDEA          |
| 2  | ACDE  | ACDEA          |
| 1  | EAC   | EACBA          |
-------------------------------
For example, ABCDE isn't present, so I reduce its length by one (ABCD); that doesn't match anything either, so I reduce it again and this time I get ABC, which matches, so I use that value to join and form a new dataframe. There is no need to worry about duplicate values when reducing the length, but I need to find the exact match. Also, I would like to avoid using a UDF if possible.
I'm using a foreach to get every value in the first dataframe, and I can do a substring there (if there is no match), but I'm not sure how to look up these values in the second dataframe. What's the best way to do it? I've seen tons of UDFs that could do the trick, but I want to avoid that, as stated before.
df1.foreach { values_to_find =>
  values_to_find.getString(0).substring(0, 4)
}
Edit: those dataframes are just examples; I have many more values. The solution should be dynamic: iterate over the values and find their match in the other dataframe, with the catch that I need to reduce their length if no match is present.
Thanks for the help!

You can register the dataframes as temporary views and write SQL. Is this scenario something you are implementing for the first time in Spark, or had you already implemented it before Spark in a legacy system? With Spark you have the freedom to write a UDF in Scala or to use SQL. Sorry, I don't have a solution handy, so I'm just giving a pointer.

The following should help you:
val dataDF1 = Seq((4,"ABC"),(3,"CBDEA"),(2,"ACDE"),(1,"EAC")).toDF("Id","list")
val dataDF2 = Seq(("ABCDE"),("CBDEA"),("ACDEA"),("EACBA")).toDF("compare")
dataDF1.createOrReplaceTempView("table1")
dataDF2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 on table1.list like concat('%',SUBSTRING(table2.compare,1,3),'%')").show()
Output:
+---+-----+-------+
| Id| list|compare|
+---+-----+-------+
| 4| ABC| ABCDE|
| 3|CBDEA| CBDEA|
| 2| ACDE| ACDEA|
| 1| EAC| EACBA|
+---+-----+-------+
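The query above fixes the prefix length at 3 characters. If you need the fully dynamic shortening described in the question (keep reducing each value until it equals an entry in the other dataframe) and still want to avoid a UDF, here is one possible sketch. It relies on the observation that the first match found while shortening a value is simply the longest list entry that is a prefix of that value, so a prefix join plus a window function is enough. The column names (values_to_find, list, Id) are taken from the example, and a SparkSession named spark (as in spark-shell) is assumed:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Example data mirroring the question.
val toFind = Seq("ABCDE", "CBDEA", "ACDEA", "EACBA").toDF("values_to_find")
val lookup = Seq((1, "EAC"), (2, "ACDE"), (3, "CBDEA"), (4, "ABC")).toDF("Id", "list")

// Keep only pairs where `list` is a prefix of `values_to_find` ...
val joined = toFind.join(lookup, expr("list = substring(values_to_find, 1, length(list))"))

// ... and, per value, keep the longest such prefix: that is exactly the first
// match you would hit while shortening the value one character at a time.
val w = Window.partitionBy("values_to_find").orderBy(length($"list").desc)
val result = joined
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
  .select("Id", "list", "values_to_find")

result.show()
Note that the prefix join is a non-equi join, so it can be expensive on large inputs, but it avoids both the UDF and a per-value loop.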

Related

Include null values in collect_list in pyspark

I am trying to include null values in collect_list while using pyspark; however, the collect_list operation excludes nulls. I have looked into the post Pypsark - Retain null values when using collect_list, but the answer given is not what I am looking for.
I have a dataframe df like this.
| id | family | date       |
-----------------------------
| 1  | Prod   | null       |
| 2  | Dev    | 2019-02-02 |
| 3  | Prod   | 2017-03-08 |
Here's my code so far:
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
This gives me an output like this:
| family | date         |
-------------------------
| Prod   | [2017-03-08] |
| Dev    | [2019-02-02] |
What I really want is as follows:
| family | date               |
-------------------------------
| Prod   | [null, 2017-03-08] |
| Dev    | [2019-02-02]       |
Can someone please help me with this? Thank you!
A possible workaround for this could be to replace all null values with another value. (Perhaps not the best way to do it, but it's a solution nonetheless.)
df = df.na.fill("my_null") # Replace null with "my_null"
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
Should give you:
| family | date                  |
----------------------------------
| Prod   | [my_null, 2017-03-08] |
| Dev    | [2019-02-02]          |

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AB      | 2    | 1    |
| AA      | 2    | 3    |
| AC      | 1    | 2    |
| AA      | 3    | 2    |
| AC      | 5    | 3    |
-------------------------
I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AA      | 2    | 3    |
| AA      | 3    | 2    |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC      | 1    | 2    |
| AC      | 5    | 3    |
-------------------------
and so on.
I need to save every resulting dataframe to a different CSV file.
The number of keys is not big (20-30), but the total amount of data is large (~200 million records).
I have a solution where every part of the data is selected in a loop and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the bad part of this solution is that every selection by key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and, during that pass, put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You need to partition by the column and save as CSV. Each key's partition is saved to its own subdirectory.
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this ?
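If you also want each key's data written in a single pass over the data and (usually) as a single file per key, a possible refinement of the same idea, assuming the dataframe is called df as in the question, is to shuffle by col_key once before writing:
import org.apache.spark.sql.functions.col

df
  .repartition(col("col_key"))   // one shuffle: group each key's rows into the same partition
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Note that with partitionBy the key is encoded in the directory name (col_key=AA/, col_key=AB/, ...) rather than stored inside the CSV rows themselves.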

Tallying in Scala DataFrame Array

I have a 2-column Spark Scala DataFrame. The first column is a single variable; the second one is an array of letters. What I am trying to do is find a way to tally (without using a for loop) the values in the arrays.
For example, this is what I have (I am sorry it's not that neat, this is my first Stack post). There are 5 computers, and each person is represented by a letter. I want to find out how many computers each person (A, B, C, D, E) has used.
+-----------------+--------------+
| id | [person] |
+-----------------+--------------+
| Computer 1 | [A,B,C,D] |
| Computer 2 | [A,B] |
| Computer 3 | [A,B,E] |
| Computer 4 | [A,C,D] |
| Computer 5 | [A,B,C,D,E] |
+-----------------+--------------+
What I would like to code up, or what I am asking if anyone has a solution for, is something like this:
+---------+-----------+
| Person | [Count] |
+---------+-----------+
| A | 5 |
| B | 4 |
| C | 3 |
| D | 3 |
| E | 2 |
+---------+-----------+
Somehow count the people who are in arrays within the dataframe.
There's a function called explode which will expand the arrays into one row for each item:
+----------+------+
|        id|person|
+----------+------+
|Computer 1|     A|
|Computer 1|     B|
|Computer 1|     C|
|Computer 1|     D|
|       ...|   ...|
+----------+------+
Then you can group by the person and count. Something like:
val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count
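A minimal end-to-end sketch of that approach, assuming the column names id and person from the example and a SparkSession named spark (as in spark-shell); the orderBy is only there so the result matches the table above:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Computer 1", Seq("A", "B", "C", "D")),
  ("Computer 2", Seq("A", "B")),
  ("Computer 3", Seq("A", "B", "E")),
  ("Computer 4", Seq("A", "C", "D")),
  ("Computer 5", Seq("A", "B", "C", "D", "E"))
).toDF("id", "person")

val result = df
  .select(explode($"person").as("person"))  // one row per (computer, person) pair
  .groupBy($"person")
  .count()                                  // number of computers each person used
  .orderBy($"count".desc, $"person")

result.show()
// +------+-----+
// |person|count|
// +------+-----+
// |     A|    5|
// |     B|    4|
// |     C|    3|
// |     D|    3|
// |     E|    2|
// +------+-----+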

Stream stream joins without equality predicate is not supported

I am using Spark 2.3 and am trying to join two streams of data. My left and my right stream both have an array. I want to join the two streams only when the right stream array is a subset of the left stream array.
For instance, my streamA looks like this:
StreamA:
|---|------|---------------------|-----------|
|id | dept | employeesInMeetings | DateTime |
|---|------|---------------------|-----------|
| 1 | sales| [John] | 7/2 14:00 |
| 2 | mktg | [Adam, Mike] | 7/2 12:30 |
| 3 | hr | [Rick, Jill, Andy] | 7/2 14:00 |
|---|------|---------------------|-----------|
and my streamB looks as follows:
StreamB:
|--------------|--------------|----------|
|employees | confRooms | DateTime |
|--------------|--------------|----------|
| [John, Jane] | A | 7/2 14:00|
| [Adam, Mike] | C | 7/2 12:30|
| [Jill, Andy] | B | 7/2 14:00|
|--------------|--------------|----------|
I only care about employees from the same department that are in the same meeting. Hence, as a result of the intersection, my resulting stream needs to look like:
|---|------|---------------------|-----------|----------|
|id | dept | employeesInMeetings | DateTime | confRoom |
|---|------|---------------------|-----------|----------|
| 2 | mktg | [Adam, Mike] | 7/2 12:30 | C |
| 3 | hr | [Rick, Jill, Andy] | 7/2 14:00 | B |
|---|------|---------------------|-----------|----------|
I created a UDF to do the intersect:
val arrayIntersect = udf((leftArr: Array[String], rightArr: Array[String]) => {
  import spark.implicits._
  if (leftArr.intersect(rightArr.toSeq).length == rightArr.size) {
    true
  } else {
    false
  }
})
And tried to use it as follows:
streamA.joinWith(streamB, expr("arrayIntersect(leftArr, rightArr) AND streamA.DateTime BETWEEN streamB.DateTime and streamB.DateTime + INTERVAL 12 hours"))
However, I get the error:
org.apache.spark.sql.AnalysisException: Stream stream joins without equality predicate is not supported;
Does anybody know if there is a workaround here? Any help will be appreciated! Thanks!
Unfortunately, there is no workaround for this in stream-stream joins :(
We really need an equality predicate because we use that to perform the join using a streaming symmetric hash join algorithm -- both the streams are partitioned using the common key so that the related records from both streams end up in the same partition.
First convert your arrays into strings, and then search for the right array's string inside the left array's string.
val arrayToString = udf { arr: Seq[String] => arr.sorted.map(_.trim.toLowerCase).mkString(",") }

streamA.withColumn("leftArrStr", arrayToString(col("leftArr")))
  .joinWith(
    streamB.withColumn("rightArrStr", arrayToString(col("rightArr"))),
    expr("instr(leftArrStr, rightArrStr) != 0 " +
      "AND streamA.DateTime BETWEEN streamB.DateTime and streamB.DateTime + INTERVAL 12 hours"))

LibreOffice - RANDBETWEEN return a name

I have a two-column list like this:
+----+-------+
| Nr | Name |
+----+-------+
| 1 | Alice |
| 2 | Bob |
| 3 | Joe |
| 4 | Ann |
| 5 | Jane |
+----+-------+
And I would like to generate a random name from this list.
For now I am only able to randomly select a number and then manually pick out the corresponding name, using this function: =RANDBETWEEN(A2;A10). How can I pick out the name instead?
Assuming that the data of your table is in cells E7:F11, the following formula can do what you need:
=VLOOKUP(RANDBETWEEN(1;5);E7:F11;2)
Further, in case you need to create a random permutation of the names you may also use the Calc extension Permutate at https://sourceforge.net/projects/permutate/.
Hope that helps.
Assuming your data starts with Nr in A1, I suggest:
=INDEX(B$2:B$6;RANDBETWEEN(1;5))
then there is no need for the Nr column in making the selection.