I have some columns like
age | company | country | gender
----+---------+---------+-------
 1  |    1    |    1    |   1
I want to create all combinations of two or more columns, like
(age,company)
(company,country)
(country,gender)
(company,gender)
(age,gender)
(age,country)
(age,company,country)
(company,country,gender)
(age,country,gender)
(age,company,gender)
(age,company,country,gender)
An idiomatic approach is to generate the power set with the Set collection method subsets:
implicit class GroupCols[A](val cols: List[A]) extends AnyVal {
  // all subsets with at least two elements, as a List
  def grouping: List[Set[A]] = cols.toSet.subsets().filter(_.size > 1).toList
}
Then
List("age","company","country","gender").grouping
delivers
List(Set(age, company),
     Set(age, country),
     Set(age, gender),
     Set(company, country),
     Set(company, gender),
     Set(country, gender),
     Set(age, company, country),
     Set(age, company, gender),
     Set(age, country, gender),
     Set(company, country, gender),
     Set(age, company, country, gender))
Note that the power set includes the empty set and a singleton set for each element of the original set; the filter on size removes them.
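As a side note, if you only ever need the subsets of one particular size, subsets also takes a length argument (a small sketch):
// subsets(n) iterates over only the subsets of exactly n elements
List("age", "company", "country", "gender").toSet.subsets(2).toList
// List(Set(age, company), Set(age, country), ..., Set(country, gender))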
I doubt you can achieve this with tuples (and this topic confirms it).
But what you are looking for is called a power set.
Consider this piece of code:
object PowerSetTest extends App {
  val ls = List(1, 2, 3, 4)
  println(power(ls.toSet).filter(_.size > 1))

  // Build the power set by folding each element into every subset seen so far
  def power[A](t: Set[A]): Set[Set[A]] = {
    @annotation.tailrec
    def pwr(t: Set[A], ps: Set[Set[A]]): Set[Set[A]] =
      if (t.isEmpty) ps
      else pwr(t.tail, ps ++ (ps map (_ + t.head)))
    pwr(t, Set(Set.empty[A]))
  }
}
Running this gives you:
Set(Set(1, 3), Set(1, 2), Set(2, 3), Set(1, 2, 3, 4), Set(3, 4), Set(2, 4), Set(1, 2, 4), Set(1, 4), Set(1, 2, 3), Set(2, 3, 4), Set(1, 3, 4))
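This hand-rolled fold builds the same eleven sets as the subsets-based approach above; its value is that it makes the construction explicit (each element is added to every subset accumulated so far, which is why the full power set has 2^n elements). A quick cross-check, assuming the standard library is available:
// same result via the standard library (iteration order may differ)
List(1, 2, 3, 4).toSet.subsets().filter(_.size > 1).toSet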
Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and columns value_1 and value_2 that each contain an integer value. For example, with k = 3 (the value columns are omitted below):
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column, combinations, that contains, for each pair of sets elements_1 and elements_2, a list of the sets from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1), (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because its length is not 3+1 = 4, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
You need to write a custom function to perform the transformation, wrap it in a Spark-compatible UserDefinedFunction, and then apply it with withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to add a new column to a DataFrame using a user-defined function.
Here's a first shot at the set logic; let me know if it does what you're looking for:
// Elements of one set that are missing from the other, each added to the other
// set: this yields exactly the size-(k+1) supersets containing one of the inputs.
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
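A quick check against the example from the question (a hypothetical REPL session; the sets follow directly from the definition, though Set printing order may vary):
combo(Set(1, 2, 3), Set(3, 4, 5))
// Set(Set(3, 4, 5, 1), Set(3, 4, 5, 2), Set(1, 2, 3, 4), Set(1, 2, 3, 5))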
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle that. There is probably a more elegant way to deal with it by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

// adapt combo to the WrappedArray representation Spark hands to UDFs
val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray
val comboUDF = udf(comboWrap)
Finally, apply it to the DataFrame by creating a new column:
import org.apache.spark.sql.functions.col
import spark.implicits._ // for toDF; assumes a SparkSession named spark

val data = Seq((Set(1, 2, 3), Set(3, 4, 5))).toDF("elements_1", "elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show
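For the example row, the result column should hold the same four combinations computed above, roughly like this when collected (a sketch; the exact nesting and ordering in the printed output depend on the Spark version):
result.select("result").collect()
// e.g. Array([WrappedArray(WrappedArray(1, 2, 3, 4), WrappedArray(1, 2, 3, 5), ...)])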
I have 2 lists:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
I want to replace each key in the second list with its value from the first list, resulting in a new list that looks like:
List ((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
This is my attempt:
object UniqueTest {
  def main(args: Array[String]) {
    val l_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
    val l_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
    val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
    print(l_3)
  }

  def f(i: Int, list: List[(Int, Int)]): Int = {
    for (pair <- list) {
      if (i == pair._1) {
        return pair._2
      }
    }
    return 0
  }
}
This results in:
List((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
Is the program above a good way to do this? Are there built-in functions in Scala to handle this need, or another way to do this manipulation?
The only real over-complication you make is this line:
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
Your f function uses an imperative style to loop over a list to find a key. Any time you find yourself doing this, it's a good indication that what you want is a Map. Looping over the whole list for every lookup inflates the computational complexity: a Map lets you fetch the value for a given key in (effectively) O(1). With a Map, you first convert your list of key-value pairs into a data structure that makes the key-value relationship explicit.
Thus, the first thing you should do is build your map. Scala provides a really easy way to do this with toMap:
val map_1 = list_1.toMap
Then it is just a matter of 'mapping':
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
This takes each pair in your list_2, matches the first element (the key) against a key in your map_1, retrieves the corresponding value (or the default 0), and puts it as the first element of a new key-value tuple.
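Putting it together, a minimal sketch using the lists from the question:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))

val map_1 = list_1.toMap
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
// result: List((11,111), (12,122), (13,133), (14,144), (11,123), (12,234))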
You can do:
val map = l_1.toMap // transform l_1 to a Map[Int, Int]
// for each (a, b) in l_2, retrieve the new value v of a and return (v, b)
val res = l_2.map { case (a, b) => (map.getOrElse(a, 0), b) }
Another approach is zipping them together and then transforming according to your needs:
(list_1 zip list_2) map { case ((k1, v1), (k2, v2)) => (v1, v2) }
Be aware, though, that zip pairs elements purely by position: this only produces the desired result when the two lists line up one-to-one, and with the lists above it would drop the trailing (1, 123) and (2, 234) entries.
Original data
ID, NAME, SEQ, NUMBER
A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5
To make the per-ID group list, I did the following:
val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
My goal is to make this RDD (the purpose is to find, for each row within an ID group, the previous SEQ's NAME and the NUMBER difference):
ID, NAME, SEQ, NUMBER, PRE_NAME, DIFF
A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2
Currently ListRDD looks like this:
A, ([A,John,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)
This is the code I tried in order to build my goal RDD from ListRDD (it does not work as I want):
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
  var rows: List[Row] = Nil
  ListRDD.foreach(row => {
    rows ::: make(row._2) // ::: returns a new list; rows itself is never updated,
                          // and foreach runs on the executors, not the driver
  })
  // rows has nothing in it, and it's not an RDD
}

def make(eachList: List[Row]): List[Row] = {
  eachList.foreach { x => // ... make PRE_NAME and DIFF in a new List
  }
  // ...
}
My final goal is to save this RDD as CSV (RDD.saveAsTextFile...). How can I build this RDD (not a List) from the data?
Window functions look like a good fit here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// toDF and the $"..." syntax assume the SQL implicits are in scope,
// e.g. import spark.implicits._ (sqlContext.implicits._ on older versions)
val df = sc.parallelize(Seq(
("A", "John", 1, 3),
("A", "Bob", 2, 5),
("A", "Sam", 3, 1),
("B", "Kim", 1, 4),
("B", "John", 2, 3),
("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")
val w = Window.partitionBy($"ID").orderBy($"SEQ")
df.select($"*",
lag($"NAME", 1).over(w).alias("PREV_NAME"),
($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))
I have an Array[(List[String], (Int, Int))] like this
((123, 456, 789), (1, 24))
((89, 284), (2, 6))
((125, 173, 88, 222), (3, 4))
I would like to distribute each element of the first list to the second list, like this
(123, (1, 24))
(456, (1, 24))
(789, (1, 24))
(89, (2, 6))
(284, (2, 6))
(125, (3, 4))
(173, (3, 4))
(88, (3, 4))
(222, (3, 4))
Can anyone help me with this? Thank you very much.
For input data defined as follows:
val data = Array((List("123", "456", "789"), (1, 24)), (List("89", "284"), (2, 6)), (List("125", "173", "88", "222"), (3, 4)))
you can use:
data.flatMap { case (l, ii) => l.map((_, ii)) }
which yields:
Array[(String, (Int, Int))] = Array(("123", (1, 24)), ("456", (1, 24)), ("789", (1, 24)), ("89", (2, 6)), ("284", (2, 6)), ("125", (3, 4)), ("173", (3, 4)), ("88", (3, 4)), ("222", (3, 4)))
which I believe matches what you are looking for.
Based on your example, it seemed to me that you were using a single type.
scala> val xs: List[(List[Int], (Int, Int))] =
| List( ( List(123, 456, 789), (1, 24) ),
| ( List(89, 284), (2,6)),
| ( List(125, 173, 88, 222), (3, 4)) )
xs: List[(List[Int], (Int, Int))] = List((List(123, 456, 789), (1,24)),
(List(89, 284),(2,6)),
(List(125, 173, 88, 222),(3,4)))
Then I wrote this function:
scala> def f[A](xs: List[(List[A], (A, A))]): List[(A, (A, A))] =
| for {
| x <- xs
| head <- x._1
| } yield (head, x._2)
f: [A](xs: List[(List[A], (A, A))])List[(A, (A, A))]
Apply f to xs.
scala> f(xs)
res9: List[(Int, (Int, Int))] = List((123,(1,24)), (456,(1,24)),
(789,(1,24)), (89,(2,6)), (284,(2,6)), (125,(3,4)),
(173,(3,4)), (88,(3,4)), (222,(3,4)))
I have two arrays of different types, like this:
array one,
Array(productId, categoryId)
(2, 423)
(6, 859)
(3, 423)
(5, 859)
and another array Array((productId1, productId2), count)
((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1)
I would like to filter the second array based on the first array: for each element of array 2, I want to check whether productId1 and productId2 have the same category; if they do, the element is kept, otherwise it is filtered out.
So the list above will be filtered to remain:
( ((2, 3), 1), ((6, 5), 1) )
Can anybody help me with this? Thank you very much.
If you don't mind working with the first array as a map, i.e.:
scala> val categ_info = Array((2, 423), (6, 859), (3, 423), (5, 859)).toMap
categ_info: Map[Int, Int] = Map(2 -> 423, 6 -> 859, 3 -> 423, 5 -> 859)
then we have (setting up example data as simple Ints for convenience):
val data = Array(((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1))
data.filter { case ((prod1_id, prod2_id), _) =>
categ_info(prod1_id) == categ_info(prod2_id)
}
producing:
res2: Array[((Int, Int), Int)] = Array(((2, 3), 1), ((6, 5), 1))
as requested.
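One caveat: categ_info(prod1_id) uses Map.apply, which throws NoSuchElementException if a product id is missing from the first array. If that can happen, a safer variant along the same lines (a sketch):
// keep a pair only when both product ids have a known, matching category
data.filter { case ((p1, p2), _) =>
  (categ_info.get(p1), categ_info.get(p2)) match {
    case (Some(c1), Some(c2)) => c1 == c2
    case _                    => false // unknown category: drop the pair
  }
}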