Get all key combinations from nested maps - scala

I have a nested map like so:
val m: Map[Int, Map[String, Seq[Int]]] =
  Map(
    1 -> Map(
      "A" -> Seq(1, 2, 3),
      "B" -> Seq(4, 5, 6)
    ),
    2 -> Map(
      "C" -> Seq(7, 8, 9),
      "D" -> Seq(10, 11, 12),
      "E" -> Seq(13, 14, 15)
    ),
    3 -> Map(
      "F" -> Seq(16, 17, 18)
    )
  )
I want the output to show every possible combination of the integers in the Seqs. For example:
List((1, "A", 1),
(1, "A", 2),
(1, "A", 3),
(1, "B", 4),
(1, "B", 5),
(1, "B", 6),
(2, "C", 7),
(2, "C", 8),
(2, "C", 9),
(2, "D", 10),
(2, "D", 11),
(2, "D", 12),
(2, "E", 13),
(2, "E", 14),
(2, "E", 15),
(3, "F", 16),
(3, "F", 17),
(3, "F", 18))
I have been trying different combinations of map and flatMap, but nothing has been working. Any ideas?

Here is a possibility using a for comprehension:
for {
  (k1, v1) <- m
  (k2, v2) <- v1
  v3       <- v2
} yield (k1, k2, v3)
This goes through all top-level key/value pairs of m. For each of these top-level values, it goes through all nested key/value pairs. And finally, for each of the nested values (which are the sequences), it goes through each element and yields the requested tuple.
A for comprehension is equivalent to nested flatMaps followed by a final map, such as:
m.flatMap {
  case (k1, v1) => v1.flatMap {
    case (k2, v2) => v2.map(v3 => (k1, k2, v3))
  }
}
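Note that because m is a Map, the comprehension yields an Iterable[(Int, String, Int)] rather than a List, and Map iteration order is not guaranteed. A minimal sketch of getting exactly the requested shape (the .sorted is only needed if the order matters):
val combos: List[(Int, String, Int)] =
  (for {
    (k1, v1) <- m
    (k2, v2) <- v1
    v3       <- v2
  } yield (k1, k2, v3)).toList.sorted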

Related

Using map in spark to make dictionary format

I executed the following code:
temp = rdd.map( lambda p: ( p[0], (p[1],p[2],p[3],p[4],p[5]) ) ).groupByKey().mapValues(list).collect()
print(temp)
and got this data:
[ ("A", [("a", 1, 2, 3, 4), ("b", 2, 3, 4, 5), ("c", 4, 5, 6, 7)]) ]
I'm trying to make a dictionary out of the second element (the list of tuples).
For example, I want to reconstruct temp into this format:
("A", {"a": [1, 2, 3, 4], "b":[2, 3, 4, 5], "c":[4, 5, 6, 7]})
Is there any clear way to do this?
If I understood you correctly, you need something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ["A", "a", 1, 2, 5, 6],
    ["A", "b", 3, 4, 6, 9],
    ["A", "c", 7, 5, 6, 0],
]
rdd = spark.sparkContext.parallelize(data)
temp = (
    rdd.map(lambda x: (x[0], {x[1]: [x[2], x[3], x[4], x[5]]}))
    .groupByKey()
    .mapValues(list)
    .mapValues(lambda x: {k: v for y in x for k, v in y.items()})
)
print(temp.collect())
# [('A', {'a': [1, 2, 5, 6], 'b': [3, 4, 6, 9], 'c': [7, 5, 6, 0]})]
This is easily doable with a custom Python function once you obtain the temp object. You just need to use tuple, list and dict manipulation.
def my_format(l):
    # get tuple inside list
    tup = l[0]
    # create dictionary with key equal to first value of each sub-tuple
    dct = {}
    for e in tup[1]:
        dct2 = {e[0]: list(e[1:])}
        dct.update(dct2)
    # combine first element of list with dictionary
    return (tup[0], dct)

my_format(temp)
# ('A', {'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5], 'c': [4, 5, 6, 7]})

Depth First Search Algorithm in Dataframe(GraphFrame) in spark

I have two dataframes, one containing vertices
val v = sqlContext.createDataFrame(scala.List(
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 1),
("f", "Fanny", 36),
("g", "Gabby", 60),
("h", "harry", 45),
("i", "ishwar", 37),
("j", "James", 65),
("z", "James", 65),
("k", "Kamla", 43),
("l", "laila", 54)
)).toDF("id", "name", "age")
and another one containing edges
val e = sqlContext.createDataFrame(scala.List(
("a", "b", "follow", 193, 231),
("b", "c", "friend", 113, 211),
("c", "d", "follow", 124, 222),
("d", "e", "follow", 135, 233),
("f", "c", "follow", 146, 243),
("b", "f", "follow", 146, 243),
("h", "i", "friend", 123, 265),
("i", "h", "friend", 123, 265),
("i", "j", "friend", 126, 223),
("j", "h", "friend", 126, 223),
("f", "g", "friend", 157, 243),
("i", "a", "friend", 157, 243)
)).toDF("src", "dst", "relationship", "SNO", "Salary")
I need a dataframe containing all possible paths between two vertices, say from 'a' to 'e', just like a DFS algorithm would produce.
It should give output like
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|n1            |e1                      |n2          |e2                      |n3              |e3                      |n4            |e4                      |n5            |
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|[a, Alice, 34]|[a, b, follow, 193, 231]|[b, Bob, 36]|[b, c, friend, 113, 211]|[c, Charlie, 30]|[c, d, follow, 124, 222]|[d, David, 29]|[d, e, follow, 135, 233]|[e, Esther, 1]|
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|n1            |e1                      |n2          |e2                      |n3            |e3                      |n4              |e4                      |n5            |e5                      |n6            |
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|[a, Alice, 34]|[a, b, follow, 193, 231]|[b, Bob, 36]|[b, f, follow, 146, 243]|[f, Fanny, 36]|[f, c, follow, 146, 243]|[c, Charlie, 30]|[c, d, follow, 124, 222]|[d, David, 29]|[d, e, follow, 135, 233]|[e, Esther, 1]|
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
I want a DFS algorithm over a dataframe so that I can run further tasks on the derived dataframe or GraphFrame.
Any help or suggestion would be great.
Thanks
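For a fixed path length, GraphFrames motif finding produces exactly this n1/e1/n2/... column layout. Here is a minimal sketch, assuming the graphframes package is available (a full DFS would union one such query per path length and filter out paths that revisit a vertex):
import org.graphframes.GraphFrame

val g = GraphFrame(v, e)

// all 4-edge paths from 'a' to 'e'
val paths4 = g
  .find("(n1)-[e1]->(n2); (n2)-[e2]->(n3); (n3)-[e3]->(n4); (n4)-[e4]->(n5)")
  .filter("n1.id = 'a' AND n5.id = 'e'")

paths4.show(false)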

removing duplicate cycles of directed graph from a list in scala

I have a collection of lists, shown below.
List(4, 0, 1, 2, 4)
List(4, 0, 1, 3, 4)
List(4, 0, 2, 3, 4)
List(4, 3, 2, 3, 4)
List(4, 3, 4, 3, 4)
List(0, 1, 2, 4, 0)
List(0, 1, 3, 4, 0)
List(0, 2, 3, 4, 0)
List(1, 2, 4, 0, 1)
List(1, 3, 4, 0, 1)
List(3, 4, 0, 1, 3)
List(3, 4, 0, 2, 3)
List(3, 2, 3, 2, 3)
List(3, 4, 3, 2, 3)
List(3, 2, 3, 4, 3)
List(3, 4, 3, 4, 3)
List(2, 3, 4, 0, 2)
List(2, 4, 0, 1, 2)
List(2, 3, 2, 3, 2)
List(2, 3, 4, 3, 2)
These lists are the individual cycles in a directed graph, with a cycle length of 4. I want to count the unique cycles among the given lists that do not contain any smaller cycle inside them. For example, List(4, 0, 1, 2, 4) and List(0, 1, 2, 4, 0) form the same cycle. Another example: List(2, 3, 2, 3, 2) iterates over 2 and 3 only and does not form a cycle of length 4.
From this collection we can say that List(0, 1, 2, 4, 0), List(0, 1, 3, 4, 0) and List(0, 2, 3, 4, 0) are the unique paths, so the total number would be 3.
List(0, 1, 2, 4, 0) and List(4, 0, 1, 2, 4) are the same cycle, so we take only one of them.
I tried to use filter but was unable to find any logic to do this.
The following should work:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val input = List(List(4, 0, 1, 2, 4), List(4, 0, 1, 3, 4), List(4, 0, 2, 3, 4), List(4, 3, 2, 3, 4), List(4, 3, 4, 3, 4),
  List(0, 1, 2, 4, 0), List(0, 1, 3, 4, 0), List(0, 2, 3, 4, 0), List(1, 2, 4, 0, 1), List(1, 3, 4, 0, 1), List(3, 4, 0, 1, 3),
  List(3, 4, 0, 2, 3), List(3, 2, 3, 2, 3), List(3, 4, 3, 2, 3), List(3, 2, 3, 4, 3), List(3, 4, 3, 4, 3),
  List(2, 3, 4, 0, 2), List(2, 4, 0, 1, 2), List(2, 3, 2, 3, 2), List(2, 3, 4, 3, 2))

// rotate a cycle body so that it starts at its smallest vertex
def rotateArray(xs: List[Int]): List[Int] =
  xs.splitAt(xs.indexOf(xs.min)) match { case (x, y) => List(y, x).flatten }

val uniquePaths: mutable.Set[List[Int]] = mutable.Set[List[Int]]()
val indexes: ListBuffer[Int] = mutable.ListBuffer[Int]()
input.zipWithIndex.foreach { x =>
  val (list, index) = (x._1, x._2)
  if (list.head == list.last) {
    val list1 = rotateArray(list.tail)
    if (list1.toSet.size == 4) {
      if (!uniquePaths.contains(list1))
        indexes.append(index)
      uniquePaths.add(list1)
    }
  }
}
indexes foreach { x => println(input(x)) }
Freehand red cycles to the rescue (sketch omitted): here are two different cycles on the same four vertices, which show that sorting alone is insufficient.
The sketch assumes that all the points are vertices of a fully connected graph (edges omitted), and shows that the cycles [0, 1, 2, 3, 0] and [0, 2, 1, 3, 0] are not the same, despite the fact that sorting the vertex sets yields [0, 1, 2, 3] in both cases.
Here is what might work instead:
Throw away all the paths which go through the same vertex more than once, by filtering out all the paths that do not consist of four distinct elements.
Rotate the path representation into a canonical form (e.g. starting at the vertex with the minimum id).
Compute the set of canonical representations, retaining only the unique paths.
Here is what the implementation might look like:
def canonicalize(cycle: List[Int]) = {
  val t = cycle.tail
  val (b, a) = t.splitAt(t.zipWithIndex.minBy(_._1)._2)
  val ab = a ++ b
  ab :+ ab.head
}
val cycles = List(
List(4, 0, 1, 2, 4),
List(4, 0, 1, 3, 4),
List(4, 0, 2, 3, 4),
List(4, 3, 2, 3, 4),
List(4, 3, 4, 3, 4),
List(0, 1, 2, 4, 0),
List(0, 1, 3, 4, 0),
List(0, 2, 3, 4, 0),
List(1, 2, 4, 0, 1),
List(1, 3, 4, 0, 1),
List(3, 4, 0, 1, 3),
List(3, 4, 0, 2, 3),
List(3, 2, 3, 2, 3),
List(3, 4, 3, 2, 3),
List(3, 2, 3, 4, 3),
List(3, 4, 3, 4, 3),
List(2, 3, 4, 0, 2),
List(2, 4, 0, 1, 2),
List(2, 3, 2, 3, 2),
List(2, 3, 4, 3, 2)
)
val unique = cycles.filter(_.toSet.size == 4).map(canonicalize).toSet
unique foreach println
Output:
List(0, 1, 2, 4, 0)
List(0, 1, 3, 4, 0)
List(0, 2, 3, 4, 0)
Line-by-line example of what canonicalize does:
tail removes the duplicate vertex: [2, 1, 0, 4, 2] -> [1, 0, 4, 2]
splitAt finds the minimum vertex and cuts the list: [1, 0, 4, 2] -> ([1], [0, 4, 2])
a ++ b rebuilds the rotated list: [0, 4, 2, 1]
:+ appends the minimum vertex to the end: [0, 4, 2, 1, 0]
Alternatively, a shorter recipe (a rough sketch follows below):
Drop the last element from each list (it's redundant).
Rotate the lists to start from the smallest ID.
Sort the loops by length, shortest first.
Now you can use lexical matching (if loop[i] contains any of loop[0..i-1], drop it).
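A rough sketch of that recipe, assuming "contains" means "occurs as a contiguous sub-cycle" (the helper names are illustrative; on the list above this only collapses rotations, since no loops shorter than four vertices are present):
// drop the trailing duplicate and rotate so the smallest ID comes first
def canonical(loop: List[Int]): List[Int] = {
  val body = loop.dropRight(1)
  val i = body.indexOf(body.min)
  body.drop(i) ++ body.take(i)
}

// shortest loops first; doubling the candidate makes the containment check
// robust against rotation
val kept = scala.collection.mutable.ListBuffer.empty[List[Int]]
for (loop <- cycles.map(canonical).distinct.sortBy(_.length))
  if (!kept.exists(short => (loop ++ loop).containsSlice(short)))
    kept += loop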

Finding values greater than * in a map list

My current system is a Map[String, List[Int]], the String being a key like "SK1", "SK2" etc., and the value being a list of numbers from 0-9.
Here is my current method to find all of the lists. How do I edit it to find only the "SK*"s greater than the selected "SK*"? The value of a list is its last element, which I already have a function to find (it is the handleFive menu option). To clarify: I need to find the last element (I already have that function) and then display only the stocks greater than the selected stock.
Handler for the menu options
def handleFive(): Boolean = {
  mnuShowSingleDataStock(currentStockLevel)
  true
}
def handleSeven(): Boolean = {
  mnuShowPointsForStock(allStockLevel)
  true
}
Functions that invoke and interact with the user
// Returns a single result, not a list
def mnuShowSingleDataStock(f: (String) => (String, Int)) = {
  print("Stock > ")
  val data = f(readLine)
  println(s"${data._1}: ${data._2}")
}

// Returns a list value
def mnuShowPointsForStock(f: (String) => (String, List[Int])) = {
  print("Stock > ")
  val data = f(readLine)
  println(s"${data._1}: ${data._2}")
}
Not sure how to edit this; currently it shows ALL of the values in the list, but I only want to return values greater than the selected value.
// Show last element in the list, the most current
def currentStockLevel(stock: String): (String, Int) = {
  (stock, mapdata.get(stock).map(findLast(_)).getOrElse(0))
}

// Unsure how to change this to only return values greater than the selected one, not everything
def currentStockLevel(stock: String): (String, List[Int]) = {
  (stock, mapdata.get(stock).map(findLast(_)).getOrElse(0))
}
My current mapped list (this is mapdata):
val mapdata = Map(
"SK1" -> List(9, 7, 2, 0, 7, 3, 7, 9, 1, 2, 8, 1, 9, 6, 5, 3, 2, 2, 7, 2, 8, 5, 4, 5, 1, 6, 5, 2, 4, 1),
"SK2" -> List(0, 7, 6, 3, 3, 3, 1, 6, 9, 2, 9, 7, 8, 7, 3, 6, 3, 5, 5, 2, 9, 7, 3, 4, 6, 3, 4, 3, 4, 1),
"SK3" -> List(8, 7, 1, 8, 0, 5, 8, 3, 5, 9, 7, 5, 4, 7, 9, 8, 1, 4, 6, 5, 6, 6, 3, 6, 8, 8, 7, 4, 0, 6),
"SK4" -> List(2, 9, 5, 7, 0, 8, 6, 6, 7, 9, 0, 1, 3, 1, 6, 0, 0, 1, 3, 8, 5, 4, 0, 9, 7, 1, 4, 5, 2, 8),
"SK5" -> List(2, 6, 8, 0, 3, 5, 5, 2, 5, 9, 4, 5, 3, 5, 7, 8, 8, 2, 5, 9, 3, 8, 6, 7, 8, 7, 4, 1, 2, 3),
"SK6" -> List(2, 7, 5, 9, 1, 9, 8, 4, 1, 7, 3, 7, 0, 8, 4, 5, 9, 2, 4, 4, 8, 7, 9, 2, 2, 7, 9, 1, 6, 9),
"SK7" -> List(6, 9, 5, 0, 0, 0, 0, 5, 8, 3, 8, 7, 1, 9, 6, 1, 5, 3, 4, 7, 9, 5, 5, 9, 1, 4, 4, 0, 2, 0),
"SK8" -> List(2, 8, 8, 3, 1, 1, 0, 8, 5, 9, 0, 3, 1, 6, 8, 7, 9, 6, 7, 7, 0, 9, 5, 2, 5, 0, 2, 1, 8, 6),
"SK9" -> List(7, 1, 8, 8, 4, 4, 2, 2, 7, 4, 0, 6, 9, 5, 5, 4, 9, 1, 8, 6, 3, 4, 8, 2, 7, 9, 7, 2, 6, 6)
)
The Map[String, List[Int]] type has a filterKeys(f: String => Boolean) method that keeps only the keys satisfying a given predicate.
A possible solution would be
import scala.util.Try

// get the int value from a stock key of the form "SK<int>"
def stockInt(stock: String): Option[Int] =
  Try(stock.drop(2).toInt).filter(_ => stock.startsWith("SK")).toOption

// we keep the keys in the return, so that the results stay identifiable
// (ordering is not assured by Map)
def currentStockLevel(stock: String): (String, Map[String, Int]) = {
  val maybeN = stockInt(stock)
  // if a key is not of the form "SK*", assume it is greater than the original stock
  def isGreater(other: String) = (for {
    o <- stockInt(other)
    n <- maybeN
  } yield o > n).getOrElse(true)
  (
    stock,
    mapdata.filterKeys(isGreater(_)).mapValues(findLast(_))
  )
}
Another possibility, if you are sure to have only "SK" keys, is to use SortedMap, which uses a SortedSet for its keys, so that you are sure to have key-value pairs ordered as you want them to be.
In that case, a solution would be
import scala.collection.immutable.SortedMap

// put all values of mapdata in a SortedMap
val sortedMap = SortedMap[String, List[Int]]() ++ mapdata

def currentStockLevel(stock: String): (String, List[Int]) = {
  (
    stock,
    sortedMap.dropWhile(_._1 <= stock).toList.map(_._2).map(findLast(_))
  )
}
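For example, with the mapdata above and assuming findLast returns the last element of a list, selecting "SK3" yields the last elements of SK4 through SK9 in key order:
currentStockLevel("SK3")
// ("SK3", List(8, 3, 9, 0, 6, 6))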
EDIT (after comments on what is expected as a return):
If I understand well what you are trying to do, you want to filter on the values rather than the keys. This is not a problem: Map also has a filter(p: ((K, V)) => Boolean): Map[K, V] method to do just that:
def currentHigherStockLevel(stock: String): Map[String, Int] = {
  // if stock is not in the keySet, we keep all keys, by keeping those greater than 0
  val current = mapdata.get(stock).map(findLast).getOrElse(0)
  mapdata.mapValues(findLast).filter {
    case (sk, v) => v > current
  }
}
This returns a Map[String, Int] whose values are the last elements that are greater than the one given as a parameter (we keep their keys because they will probably be useful).
If the key strings are things like "SK9" and "SK10", then you have to cut the digits out, convert them to Int, and compare/filter on those; but if your keys are kept in a completely consistent format: "SK001", "SK002" ... "SK009", "SK010" ... "SK099", "SK100", etc., then you can use simple string comparisons to filter for just what you want:
mapdata.filterKeys(_ > stock).values // an Iterable[List[Int]]
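As a quick check against the mapdata above, this agrees with the SortedMap example earlier (though plain Map iteration order is not guaranteed):
mapdata.filterKeys(_ > "SK3").mapValues(_.last)
// Map(SK4 -> 8, SK5 -> 3, SK6 -> 9, SK7 -> 0, SK8 -> 6, SK9 -> 6)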

SparkSQL:Avg based on a column after GroupBy

I have an RDD of student grades and I need to first group them by the first column, which is university, and then show the average student count per course, like this. What is the easiest way to do this query?
+----------+---------------+
|university|avg of students|
+----------+---------------+
|       MIT|            3.0|
| Cambridge|           2.66|
+----------+---------------+
Here is the dataset.
case class grade(university: String, courseId: Int, studentId: Int, grade: Double)
val grades = List(
  grade("Cambridge", 1, 1001, 4),
  grade("Cambridge", 1, 1004, 4),
  grade("Cambridge", 2, 1006, 3.5),
  grade("Cambridge", 2, 1004, 3.5),
  grade("Cambridge", 2, 1002, 3.5),
  grade("Cambridge", 3, 1006, 3.5),
  grade("Cambridge", 3, 1007, 5),
  grade("Cambridge", 3, 1008, 4.5),
  grade("MIT", 1, 1001, 4),
  grade("MIT", 1, 1002, 4),
  grade("MIT", 1, 1003, 4),
  grade("MIT", 1, 1004, 4),
  grade("MIT", 1, 1005, 3.5),
  grade("MIT", 2, 1009, 2))
1) First groupBy university
2) then get course count per university
3) then groupBy courseId
4) then get student count per course
grades.groupBy(_.university).map { case (k, v) =>
  val courseCount = v.map(_.courseId).distinct.length
  val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
  k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
}
Scala REPL
scala> val grades = List(
grade("Cambridge", 1, 1001, 4),
grade("Cambridge", 1, 1004, 4),
grade("Cambridge", 2, 1006, 3.5),
grade("Cambridge", 2, 1004, 3.5),
grade("Cambridge", 2, 1002, 3.5),
grade("Cambridge", 3, 1006, 3.5),
grade("Cambridge", 3, 1007, 5),
grade("Cambridge", 3, 1008, 4.5),
grade("MIT", 1, 1001, 4),
grade("MIT", 1, 1002, 4),
grade("MIT", 1, 1003, 4),
grade("MIT", 1, 1004, 4),
grade("MIT", 1, 1005, 3.5),
grade("MIT", 2, 1009, 2))
// grades: List[grade] = List(...)
scala> grades.groupBy(_.university).map { case (k, v) =>
val courseCount = v.map(_.courseId).distinct.length
val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
}
// res2: Map[String, Double] = Map("MIT" -> 3.0, "Cambridge" -> 2.6666666666666665)
An alternative using RDDs (this assumes gradesRdd = sc.parallelize(grades)); note that it averages the courseId values per university, which is why its numbers differ from the expected output above:
gradesRdd.map { case grade(university, courseId, _, _) => (university, courseId) }
  .mapValues(x => (x, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues(y => 1.0 * y._1 / y._2).collect
// res73: Array[(String, Double)] = Array((Cambridge,2.125), (MIT,1.1666666666666667))
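If the goal is the expected table above (the average number of students per course), here is one way to compute it: a sketch that first counts students per (university, courseId) pair and then averages those counts per university, again assuming gradesRdd = sc.parallelize(grades):
val avgStudents = gradesRdd
  .map(g => ((g.university, g.courseId), 1))
  .reduceByKey(_ + _)                                // student count per course
  .map { case ((uni, _), n) => (uni, (n, 1)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (total students, course count)
  .mapValues { case (students, courses) => students.toDouble / courses }

avgStudents.collect()
// Array((MIT,3.0), (Cambridge,2.6666666666666665))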