How to join and reduce two datasets with arrays? - scala

I need an idea for how to join two datasets with millions of arrays. Each dataset contains Longs numbered 1-10,000,000, but grouped differently in each one; for example, [1,2], [3,4] in one and [1], [2,3], [4] in the other should produce the output [1,2,3,4].
I need some way to join these sets efficiently.
I have tried an approach where I explode and group by multiple times, finally sorting and distincting the arrays. This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
Any ideas on how to use another approach, like a reducer or aggregation, to solve this problem more efficiently?
The following is a Scala code example. However, I would need an approach that works in Java as well.
val rdd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11]}"""))
val rdd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}"""))
val srdd1 = spark.read.json(rdd1)
val srdd2 = spark.read.json(rdd2)
Dataset 1:
+---------+
|groupings|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
| [10]|
| [11]|
+---------+
Dataset 2:
+---------+
|groupings|
+---------+
| [1]|
|[2, 3, 4]|
| [7, 8]|
| [9]|
| [10, 11]|
+---------+
Output should be
+------------------+
| groupings|
+------------------+
|[1, 2, 3, 4, 5, 6]|
| [7, 8, 9]|
| [10, 11]|
+------------------+
Update:
This was my original code, which I had problems running. @AyanGuha had me thinking that perhaps it would be simpler to just use a series of joins instead; I am testing that now and will post a solution if it works out.
import org.apache.spark.sql.functions.{callUDF, col, collect_list, explode}

srdd1.union(srdd2)
  // pass 1: explode, regroup by element, then flatten the collected lists into one distinct array
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  // pass 2: same explode / regroup / flatten
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  // pass 3: same explode / regroup / flatten
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  // finally sort each array and drop duplicate rows
  .select(callUDF("sortLongArray", col("groupings")).alias("groupings"))
  .distinct()
What this code showed was that after 3 iterations the data coalesced; ideally, 3 joins would then do the same.
Update 2:
Looks like I have a new working version. It still seems inefficient, but I think this will be handled better by Spark.
val ardd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11,12]}""", """{"groupings":[13,14]}"""))
val ardd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}""", """{"groupings":[12,13]}""", """{"groupings":[14, 15]}"""))
var srdd1 = spark.read.json(ardd1)
var srdd2 = spark.read.json(ardd2)
import org.apache.spark.sql.functions.{col, udf}

// merge two arrays into one distinct, sorted array (y may be null after a left join)
val addUDF = udf((x: Seq[Long], y: Seq[Long]) => if (y == null) x else (x ++ y).distinct.sorted)
// true when every element of x is in y and y is strictly larger (x is a proper subset of y)
val encompassUDF = udf((x: Seq[Long], y: Seq[Long]) => if (x.size == y.size) false else (x diff y).size == 0)
// true when the arrays overlap but y has elements that x does not
val arrayContainsAndDiffUDF = udf((x: Seq[Long], y: Seq[Long]) => (x.intersect(y).size > 0) && (y diff x).size > 0)

var rdd1 = srdd1
var rdd2 = srdd2.withColumnRenamed("groupings", "groupings2")

for (i <- 1 to 3) {
  rdd1 = rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2")), "left")
    .select(addUDF(col("groupings"), col("groupings2")).alias("groupings"))
    .distinct
    .alias("rdd1")
  rdd2 = rdd1.select(col("groupings").alias("groupings2")).alias("rdd2")
}

// drop every grouping that is strictly contained in another grouping
rdd1.join(rdd2, encompassUDF(col("groupings"), col("groupings2")), "leftanti")
  .show(10, false)
Outputs:
+------------------------+
|groupings |
+------------------------+
|[10, 11, 12, 13, 14, 15]|
|[1, 2, 3, 4, 5, 6] |
|[7, 8, 9] |
+------------------------+
I will try this at scale and see what I get.

This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
I don't think you have options other than exploding the arrays, then a join followed by distinct. Spark is fairly good at such computations and tries to do as much of them as possible using its internal binary rows. The datasets are compressed, and comparisons are often done at the byte level (outside the JVM).
It's then just a matter of having enough memory to hold all the elements, which may not be that big a deal.
I'd recommend giving your solution a try and checking out the physical plan and the stats. It could in the end turn out to be the only available solution.
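As a rough sketch of what that explode-then-join pattern looks like, and of how to pull up the plan (this reuses srdd1/srdd2 from the question and only illustrates the shape of the query, not the full merge logic):
import org.apache.spark.sql.functions.{col, explode}

// Explode each side down to one Long per row, join on that Long, then dedupe.
val e1 = srdd1.select(explode(col("groupings")).as("member"))
val e2 = srdd2.select(explode(col("groupings")).as("member"))

val joined = e1.join(e2, Seq("member")).distinct()  // plain equi-join on a Long column

joined.explain(true)  // print the physical plan so you can judge the joins/exchanges before scaling up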

Here is an alternate solution using the ARRAY data type that is supported as part of HiveQL. This will at least make your coding simple [i.e. building out the logic]. The code below assumes that the raw data is in a text file.
Step 1. Create table
CREATE TABLE array_table1(array_col1 array<int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Step 2: Load data into both tables
LOAD DATA INPATH '/path/to/file' OVERWRITE INTO TABLE array_table1;
Step 3: Apply sql functions to get results
select distinct(explode(array_col1)) from array_table1 union
select distinct(explode(array_col2)) from array_table2
I am not clear from the example on what final output you are looking for. Is it just the union of all distinct numbers, or are they supposed to keep a grouping? But anyway, with the tables created you can use a combination of distinct, explode(), left anti join and union to get the expected results.
You may want to optimize this code to filter the final data set again for duplicates.
Hope that helps!

OK I finally figured it out.
First of all with my array joins I was doing something very wrong, which I overlooked initially.
When joining two arrays on equivalence (e.g. does [1,2,3] equal [1,2,3]?), the arrays can be hashed. I was instead doing an intersection match using a UDF: given x in [1,2,3], is any x in [1, 2, 3, 4, 5]? That cannot be hashed, and it therefore requires a plan which checks every row against every row.
So to do this you have to explode both arrays first, then join them.
You can then apply other criteria. For example, I saved time by only joining arrays which were not equal and whose sum was less than the other's.
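One way to see that planning difference is to compare the two kinds of join condition directly (a quick sketch, reusing rdd1, rdd2 and arrayContainsAndDiffUDF from Update 2 in the question):
import org.apache.spark.sql.functions.col

// Equality on the array column: the keys can be hashed/sorted, so you should see a normal equi-join plan
rdd1.join(rdd2, col("groupings") === col("groupings2")).explain()

// UDF condition: nothing to hash, so the plan should fall back to a nested-loop / cartesian style join
rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2"))).explain()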
Example with a self join:
rdd2 = rdd2.withColumn("single", explode(col("grouping"))) // Explode the grouping
temp = rdd2.withColumnRenamed("grouping", "grouping2").alias("temp") // Alias for self join
rdd2 = rdd2.join(temp, rdd2.col("single").equalTo(temp.col("single")) // Compare singles which will be hashed
.and(col("grouping").notEqual(col("grouping2"))) // Apply further conditions
.and(callUDF("lessThanArray", col("grouping"), col("grouping2"))) // Make it so only [1,2,3] [4,5,6] is joined and not the duplicate [4,5,6] [1,2,3], for efficiency
, "left") // Left so that the efficiency criteria do not drop rows
I then grouped by the grouping which was joined against and aggregated the groupings from the self join.
rdd2.groupBy("grouping")
.agg(callUDF("list_agg",col('grouping2')).alias('grouping2')) // List agg is a UserDefinedAggregateFunction which aggregates lists into a distinct list
.select(callUDF("addArray", col("grouping"), col("grouping2")).alias("grouping")) // AddArray is a udf which concats and distincts 2 arrays
grouping grouping2
[1,2,3] [3,4,5]
[1,2,3] [2,6,7]
[1,2,3] [2,8,9]
becomes just
[1,2,3] [2,3,4,5,6,7,8,9]
after addArray
[1,2,3,4,5,6,7,8,9]
I then iterated that code 3 times, which seems to make everything coalesce, and threw in a distinct for good measure.
Notice that in the original question I had two datasets. For my specific problem I discovered I could make some assumptions about them: the first set was a master list and could be assumed to have no duplicates, while the second set did have duplicates, so I only needed to apply the above code to the second set and then join it with the first. I would assume that if both sets had duplicates they could be unioned together first.

Related

Alternative to GroupBy for Pyspark Dataframe?

I have a dataset like this:
timestamp vars
2 [1,2]
2 [1,2]
3 [1,2,3]
3 [1,2]
And I want a dataframe like this. Basically each value in the above dataframe is an index and the frequency of that value is the value at that index. This computation is done over every unique timestamp.
timestamp vars
2 [0, 2, 2]
3 [0,2,2,1]
Right now, I'm grouping by timestamp and aggregating/flattening vars (to get something like 1,2,1,2 for timestamp 2 or 1,2,3,1,2 for timestamp 3), and then I have a udf that uses collections.Counter to get a key->value dict. I then turn this dict into the format I want.
The groupBy/agg can get arbitrarily large (array sizes can be in the millions) and this seems like a good use case for the Window function, but I'm not sure how to put it all together.
I think it's also worth mentioning that I've tried repartitioning, and converting to an RDD and using groupByKey. Both are arbitrarily slow (>24 hours) on large datasets.
Edit: As discussed in the comments, the issue with the original methods could be that the count uses filter or aggregate functions, which triggers unnecessary data scans. Below we explode the arrays and do the aggregation (count) before creating the final array column:
from pyspark.sql.functions import collect_list, struct
df = spark.createDataFrame([(2,[1,2]), (2,[1,2]), (3,[1,2,3]), (3,[1,2])],['timestamp', 'vars'])
df.selectExpr("timestamp", "explode(vars) as var") \
.groupby('timestamp','var') \
.count() \
.groupby("timestamp") \
.agg(collect_list(struct("var","count")).alias("data")) \
.selectExpr(
"timestamp",
"transform(data, x -> x.var) as indices",
"transform(data, x -> x.count) as values"
).selectExpr(
"timestamp",
"transform(sequence(0, array_max(indices)), i -> IFNULL(values[array_position(indices,i)-1],0)) as new_vars"
).show(truncate=False)
+---------+------------+
|timestamp|new_vars |
+---------+------------+
|3 |[0, 2, 2, 1]|
|2 |[0, 2, 2] |
+---------+------------+
Where:
(1) we explode the array and do count() for each timestamp + var
(2) groupby timestamp and create an array of structs containing two fields var and count
(3) convert the array of structs into two arrays: indices and values (similar to what we define SparseVector)
(4) transform the sequence sequence(0, array_max(indices)), for each i in the sequence, use array_position to find the index of i in indices array and then retrieve the value from values array at the same position, see below:
IFNULL(values[array_position(indices,i)-1],0)
Notice that the function array_position uses a 1-based index while array indexing is 0-based, thus we have a -1 in the above expression.
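A quick way to sanity-check that difference from any Spark shell (the statement below is the same whichever API you use, since the expression is plain Spark SQL):
spark.sql("SELECT array_position(array(10, 20, 30), 20) AS pos, array(10, 20, 30)[1] AS second").show()
+---+------+
|pos|second|
+---+------+
|  2|    20|
+---+------+
array_position returns 2 (1-based), while the [] lookup at index 1 returns the second element (0-based), which is exactly why the -1 is needed.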
Old methods:
(1) Use transform + filter/size
from pyspark.sql.functions import flatten, collect_list
df.groupby('timestamp').agg(flatten(collect_list('vars')).alias('data')) \
  .selectExpr(
    "timestamp",
    "transform(sequence(0, array_max(data)), x -> size(filter(data, y -> y = x))) as vars"
  ).show(truncate=False)
+---------+------------+
|timestamp|vars |
+---------+------------+
|3 |[0, 2, 2, 1]|
|2 |[0, 2, 2] |
+---------+------------+
(2) Use aggregate function:
df.groupby('timestamp').agg(flatten(collect_list('vars')).alias('data')) \
  .selectExpr("timestamp", """
    aggregate(
      data,
      /* use an array as zero_value, size = array_max(data)+1 and all values are zero */
      array_repeat(0, int(array_max(data))+1),
      /* increment the ith value of the array by 1 if i == y */
      (acc, y) -> transform(acc, (x, i) -> IF(i = y, x + 1, x))
    ) as vars
  """).show(truncate=False)

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one, df2, also has one column, called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now that you've made clear that you just want two columns, with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")
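Since the linked answer isn't reproduced here, a rough sketch of the zipWithIndex route could look like the following (the helper name withRowIndex is just for illustration):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

// Append a positional index column to a DataFrame using RDD.zipWithIndex
def withRowIndex(df: DataFrame, idxName: String): DataFrame = {
  val rows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(rows, df.schema.add(StructField(idxName, LongType, nullable = false)))
}

val result = withRowIndex(df1, "row_idx")
  .join(withRowIndex(df2, "row_idx"), Seq("row_idx"))
  .drop("row_idx")
Unlike monotonically_increasing_id, zipWithIndex gives consecutive 0-based positions, at the cost of an extra pass over the data.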
It depends on what you want to do.
If you want to merge two DataFrames you should use a join. The same join types exist as in relational algebra (or any DBMS).
You are saying that your DataFrames have just one column each.
In that case you might want to do a cross join (cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all elements.
I think all other join types' use cases can be handled with specialized operations in Spark (in this case, two DataFrames with one column each).
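For reference, the cross join mentioned above is a one-liner with the explicit operator (available since Spark 2.1, so fine on the 2.2.0 used in the question):
// Every combination of the two single-column DataFrames: 3 x 3 = 9 rows for the example data
val crossed = df1.crossJoin(df2)
crossed.show()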

Converting a Dataframe to a scala Mutable map doesn't produce equal number of records

I am new to Scala/Spark. I am working on a Scala/Spark application that selects a couple of columns from a Hive table and then converts them into a mutable Map, with the first column being the keys and the second column being the values. For example:
+--------+--+
| c1 |c2|
+--------+--+
|Newyork |1 |
| LA |0 |
|Chicago |1 |
+--------+--+
will be converted to scala.collection.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the DataFrames to collections and perform some comparisons. I am also picking up data from other tables, but those are just single columns, and I have no problem converting them to ArrayBuffer[String]; the counts match.
I don't understand why I am having a problem with the testMap. Generally, the count of rows in the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records in the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by the elimination of duplicate keys (i.e. city names) in the Map. By design, Map maintains unique keys by removing all duplicates. For example:
val testDF = Seq(
  ("Newyork", 1),
  ("LA", 0),
  ("Chicago", 1),
  ("Newyork", 99)
).toDF("city", "value")

val testMap = scala.collection.mutable.Map(
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).
    collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
//   Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate the data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
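If you do need every row's value in the driver, one alternative (a sketch, not the only option) is to collect a list per key instead of a single value, so nothing is silently dropped:
import org.apache.spark.sql.functions.collect_list

// One entry per city, with all of its values kept in a list
val grouped = testDF.groupBy("city").agg(collect_list("value").as("values"))

val cityToValues: Map[String, Seq[Int]] =
  grouped.rdd
    .map(r => r.getAs[String]("city") -> r.getAs[Seq[Int]]("values"))
    .collectAsMap()
    .toMap
// e.g. Map(Newyork -> List(1, 99), LA -> List(0), Chicago -> List(1))
// (the order of values inside each list is not guaranteed)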
If you add entries with a duplicate key to your map, the duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1; there are also other columns like a, b, c and so on, whose names are contained in a schema array T.
What I want is the inner product of the columns in schema array T with the column productMe, i.e. the row-wise product of each such column with productMe (Table 2), and then the sum of each column of Table 2 to get Table 3.
Table 2 is not necessary if you have a good idea for getting Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) to get (0.6, 3.6, 2)
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is DataFrame of Spark, how can I get Table 3?
You can try something like the following:
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)).toDF("productMe", "a", "b", "c")

import org.apache.spark.sql.functions.{col, round, sum}

val columnsToSum = df.
  columns.   // <-- grab all the columns by their name
  tail.      // <-- skip productMe
  map(col).  // <-- create Column objects
  map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))

val df2 = df.select(columnsToSum: _*)
df2.show()
df2.show()
# +---------------+---------------+---------------+
# |sum_a_productMe|sum_b_productMe|sum_c_productMe|
# +---------------+---------------+---------------+
# | 6.2| 6.3| 4.3|
# +---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*), which means that you want to select all the columns on which we did the sum of the column times the productMe column. The : _* is Scala-specific syntax to specify that we are passing repeated arguments, because we don't have a fixed number of arguments.
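As a tiny standalone illustration of that : _* expansion (plain Scala, nothing Spark-specific):
// Passing a Seq where a varargs (repeated) parameter is expected
def sumAll(xs: Int*): Int = xs.sum

val nums = Seq(1, 2, 3)
sumAll(nums: _*)  // expands the Seq into repeated arguments, just like select(columnsToSum: _*); returns 6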
We can do it with plain Spark SQL:
val table1 = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)
).toDF("productMe", "a", "b", "c")
table1.show
table1.createOrReplaceTempView("table1")
val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") //spark is sparkSession here
table2.show
val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use a sum aggregation, which uses groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance, as you can see in their physical plans and in the details of their single job.
CAUTION
Either solution uses one single partition to do the calculation, which in turn makes them unsuitable for large datasets, as their combined size may easily exceed the memory of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)).toDF("productMe", "a", "b", "c")

// yes, I did borrow this trick with columns from @eliasah's answer
import org.apache.spark.sql.functions.{col, sum}
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()

val multipliesCols = multiplies.
  columns.
  map(c => sumOverRows(c) as s"sum_${c}")

val answer = multiplies.
  select(multipliesCols: _*).
  limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?)
(The details for the only job, job 0, were shown as a screenshot in the original answer and are not reproduced here.)
If I understand your question correctly, then the following can be your solution:
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)
).toDF("productMe", "a", "b", "c")
This gives the input dataframe as you have it (you can add more rows):
+---------+---+---+---+
|productMe|a |b |c |
+---------+---+---+---+
|3 |0.2|0.5|0.4|
|6 |0.6|0.3|0.1|
|5 |0.4|0.6|0.5|
+---------+---+---+---+
And
import org.apache.spark.sql.functions.{col, sum}

val productMe = df.columns.head
val colNames = df.columns.tail

var tempdf = df
for (column <- colNames) {
  tempdf = tempdf.withColumn(column, col(column) * col(productMe))
}
The above steps should give you Table 2:
+---------+------------------+------------------+------------------+
|productMe|a |b |c |
+---------+------------------+------------------+------------------+
|3 |0.6000000000000001|1.5 |1.2000000000000002|
|6 |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5 |2.0 |3.0 |2.5 |
+---------+------------------+------------------+------------------+
Table 3 can be achieved as follows:
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table3 is
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3 |4.300000000000001|
+-----------------+----------------+-----------------+
Table 2 can be achieved for any number of columns, but Table 3 would require you to define the columns explicitly.
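If you would rather not list the sums by hand, here is a small sketch that builds them from colNames (reusing the variables defined above), so Table 3 also works for any number of columns:
import org.apache.spark.sql.functions.sum

// One sum(...) column per product column, named like the explicit version above
val sumCols = colNames.map(c => sum(c).as(s"sum($c.productMe)"))
tempdf.select(sumCols: _*).show(false)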

How to split the values in a map in Scala?

I have a map whose values come from multiple different columns in the database. The values have underscores in between. For example:
newMap("A", 23_null_12_09asfA)
Here, 23 comes from column A, null from column B, and so on. Now, consider a map that has 20 values. I want to know how to split these values into arrays, or how to split and store them.
val baseRDD = sc.parallelize(List(("john", "1_abc_2"), ("jack", "3_xyz_4")))
val sRDD = baseRDD.map(x => x._2.split("_"))
val resultDF = sRDD.toDF
resultDF.show
+-----------+
|      value|
+-----------+
|[1, abc, 2]|
|[3, xyz, 4]|
+-----------+
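If you are working with DataFrames rather than raw RDDs, the built-in split function does the same thing; a small sketch (column names here are just for illustration):
import org.apache.spark.sql.functions.split
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = List(("john", "1_abc_2"), ("jack", "3_xyz_4")).toDF("name", "value")
df.withColumn("parts", split($"value", "_")).show(false)
+----+-------+-----------+
|name|value  |parts      |
+----+-------+-----------+
|john|1_abc_2|[1, abc, 2]|
|jack|3_xyz_4|[3, xyz, 4]|
+----+-------+-----------+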