Having the following RDD:
BBBBBBAAAAAAABAABBBBBBBB
AAAAABBAAAABBBAAABAAAAAB
I need to calculate the number of consecutive groups per event, so for this example the expected output should be:
BBBBBBAAAAAAABAABBBBBBBB    A -> 2, B -> 3
AAAAABBAAAABBBAAABBCCCCC    A -> 3, B -> 4, C -> 1
Final output: A -> 5, B -> 7, C -> 1
I have implemented the splitting and then a sliding window over each character to try to obtain the values, but I cannot get the expected result.
Thanks,
val baseRDD = sc.parallelize(Seq("BBBBBBAAAAAAABAABBBBBBBB", "AAAAABBAAAABBBAAABBCCCC"))

baseRDD
  .flatMap(line => "(\\w)\\1*".r.findAllMatchIn(line).map(m => (m.matched.charAt(0), 1)).toList)
  .reduceByKey((accum, current) => accum + current)
  .foreach(println(_))
Result
(C,1)
(B,6)
(A,5)
Hope this is what you wanted.
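If you also need the per-line breakdown shown in the question, here is a minimal sketch along the same lines (same regex, but the counts are kept per input string):

baseRDD
  .map { line =>
    // Group the consecutive runs by their character and count the runs per character.
    val counts = "(\\w)\\1*".r
      .findAllMatchIn(line)
      .toList
      .groupBy(_.matched.charAt(0))
      .map { case (c, groups) => (c, groups.size) }
    (line, counts)
  }
  .collect()
  .foreach(println(_))
// e.g. (BBBBBBAAAAAAABAABBBBBBBB,Map(B -> 3, A -> 2))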
Recently, I started working with Spark window functions and am trying to understand what happens under the hood in the Spark executors when these windowing functions are applied.
My question is: since every window must be created with a partitionBy, which means shuffling data across the cluster, is it normal to use multiple windows?
For example, I have this DataFrame:
cli date item1 item2
--------- ---------- ------- -------
1234567 20191030 A D
1234567 20191030 B D
1234567 20191029 A E
1234567 20191026 A F
7456123 20191026 C D
7456123 20191025 C F
The aim here is to calculate the frequency of each item for each client at every date, based on history.
For example, client 1234567 at 20191030 used 4 item1 values from 20191030 backwards, so the frequency of A is 3/4 and B is 1/4.
I chose to calculate these frequencies per day using windows, since a window computes a value for each row, but I need three windows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

// This will give me the number of items used by a client
// on that day and over all history.
val lByCliWindow = Window.partitionBy(col("cli"))

// This will give me how many times a client used this exact item_1 A on
// that day and back in history (here my history is 120 days).
val lByCliItem1Window = Window
  .partitionBy(col("cli"), col("item_1"))
  .orderBy(to_timestamp(col("date"), "yyyyMMdd").cast(LongType))
  .rangeBetween(-86400 * 120, 0)

// This will give me how many times a client used this exact item_2 F on
// that day and back in history (here my history is 120 days).
val lByCliItem2Window = Window
  .partitionBy(col("cli"), col("item_2"))
  .orderBy(to_timestamp(col("date"), "yyyyMMdd").cast(LongType))
  .rangeBetween(-86400 * 120, 0)
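For illustration, a per-row frequency can then be derived from such window counts roughly as follows (a sketch only, assuming the DataFrame is named df and using the column names from the window definitions above):

// Sketch: count over the item window divided by the count over the client window.
val withItem1Freq = df
  .withColumn("item1_count", count(lit(1)).over(lByCliItem1Window))
  .withColumn("total_count", count(lit(1)).over(lByCliWindow))
  .withColumn("frequency_item1", col("item1_count") / col("total_count"))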
The expected output is:
cli date frequency_item1 frequency_item2
--------- ---------- ------------------------- --------------------------------
1234567   20191030     Map(A -> 3/4, B -> 1/4)    Map(D -> 2/4, E -> 1/4, F -> 1/4)
1234567 20191029 Map(A -> 2/2) Map(E -> 1/2, F -> 1/2)
1234567 20191026 Map(A -> 1/1) Map(F -> 1/1)
7456123 20191026 Map(C -> 2/2) Map(D -> 1/2, F -> 1/2)
7456123 20191025 Map(C -> 1/1) Map(F -> 1/1)
When I call explain() on this approach, I can see many exchange (hash partitioning) steps in the plan, which is expected since we do a partitionBy every time.
Given that I have almost 30 variables, this means partitioning the data 30 times, which is a lot of shuffling.
What I want to understand is: is this approach normal? Will Spark perform these partitionings in parallel (creating multiple windows at the same time and therefore partitioning the DataFrame in multiple different ways at once), or sequentially?
Can we use multiple windows? Which is more costly: the groupBy shuffle or the window partitionBy shuffle?
Thank you for your replies, and don't hesitate to propose a different approach for calculating the frequencies using windows.
I have a solution that involves only one window. I'll explain with comments.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ column syntax (already imported in spark-shell)

// The columns you are interested in.
val items = df.columns.filter(_ startsWith "item")

// collect_list aggregations (collect_list keeps duplicates, which we need
// for the frequencies). We will group by cli and date.
val aggs = items.map(c => collect_list(col(c)) as c)

// A window over "cli", ordered by date.
val win = Window.partitionBy("cli").orderBy("date")

// A UDF that computes the frequencies you want.
// It takes a seq of seqs as input because of the first aggregation we do.
val compute_freqs = udf((s: Seq[Seq[String]]) => {
  val flat_s = s.flatten
  val total = flat_s.size
  flat_s.groupBy(identity).mapValues(_.size.toDouble / total)
})

// For each item, we collect the values over the window and compute the frequency.
val frequency_columns = items
  .map(item => compute_freqs(collect_list(col(item)) over win)
    .alias(s"frequency_$item"))

// Then we put everything together.
val result = df
  .groupBy("cli", "date")
  .agg(aggs.head, aggs.tail: _*)
  .select((Seq("cli", "date").map(col) ++ frequency_columns): _*)
  .orderBy($"cli", $"date".desc)
And here is the result:
scala> result.show(false)
+-------+--------+----------------------+--------------------------------+
|cli |date |frequency_item1 |frequency_item2 |
+-------+--------+----------------------+--------------------------------+
|1234567|20191030|[A -> 0.75, B -> 0.25]|[D -> 0.5, F -> 0.25, E -> 0.25]|
|1234567|20191029|[A -> 1.0] |[F -> 0.5, E -> 0.5] |
|1234567|20191026|[A -> 1.0] |[F -> 1.0] |
|7456123|20191026|[C -> 1.0] |[D -> 0.5, F -> 0.5] |
|7456123|20191025|[C -> 1.0] |[F -> 1.0] |
+-------+--------+----------------------+--------------------------------+
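For reference, here is a minimal sketch of how the sample DataFrame df from the question can be built to reproduce this result (column types are assumed):

import spark.implicits._ // already available in spark-shell

// Sample data taken from the question's table.
val df = Seq(
  (1234567, 20191030, "A", "D"),
  (1234567, 20191030, "B", "D"),
  (1234567, 20191029, "A", "E"),
  (1234567, 20191026, "A", "F"),
  (7456123, 20191026, "C", "D"),
  (7456123, 20191025, "C", "F")
).toDF("cli", "date", "item1", "item2")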
The following code gives me the values per filter key. How do I sum and average all of the values together, i.e. combine the results across all filtered keys?
val f = p.groupBy(d => d.Id)
  .mapValues(totavg =>
    totavg.groupBy(_.Day).filterKeys(Set(2, 3, 4)).mapValues(_.map(_.Amount)))
Sample output:
Map(A9 -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
I would like to add all the values together and compute the overall average, i.e. (473.3 + 676.48 + 685.45 + 812.73) / 4.
For the given Map, you can apply flatMap twice to get the sequence of values first, then calculate the average:
val m = Map("A9" -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
val s = m.flatMap(_._2.flatMap(_._2))
// s: scala.collection.immutable.Iterable[Double] = List(473.3, 676.48, 685.45, 812.73)
s.sum/s.size
// res14: Double = 661.99
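If you also want the average per outer key, here is a small sketch assuming the same nested Map shape:

// Average the amounts separately for each outer key (e.g. "A9").
val perKey = m.map { case (k, inner) =>
  val vs = inner.values.flatten
  k -> vs.sum / vs.size
}
// perKey: scala.collection.immutable.Map[String,Double] = Map(A9 -> 661.99)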
I am new to Scala and trying out the map function on a Map.
Here is my Map:
scala> val map1 = Map ("abc" -> 1, "efg" -> 2, "hij" -> 3)
map1: scala.collection.immutable.Map[String,Int] =
Map(abc -> 1, efg -> 2, hij -> 3)
Here is a map function and the result:
scala> val result1 = map1.map(kv => (kv._1.toUpperCase, kv._2))
result1: scala.collection.immutable.Map[String,Int] =
Map(ABC -> 1, EFG -> 2, HIJ -> 3)
Here is another map function and the result:
scala> val result1 = map1.map(kv => (kv._1.length, kv._2))
result1: scala.collection.immutable.Map[Int,Int] = Map(3 -> 3)
The first map function returns all the members as expected; however, the second map function returns only the last member of the Map. Can someone explain why this is happening?
Thanks in advance!
In Scala, a Map cannot have duplicate keys. When you add a new key -> value pair to a Map, if that key already exists, you overwrite the previous value. If you're creating maps from functional operations on collections, then you're going to end up with the value corresponding to the last instance of each unique key. In the example you wrote, each string key of the original map map1 has the same length, and so all your string keys produce the same integer key 3 for result1. What's happening under the hood to calculate result1 is:
A new, empty map is created
You map "abc" -> 1 to 3 -> 1 and add it to the map. The result now contains 3 -> 1.
You map "efg" -> 2 to 3 -> 2 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. The result now contains 3 -> 2.
You map "hij" -> 3 to 3 -> 3 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. The result now contains 3 -> 3.
Return the result, which is Map(3 -> 3).
Note: I made a simplifying assumption that the order of the elements in the map iterator is the same as the order you wrote in the declaration. The order is determined by hash bin and will probably not match the order you added elements, so don't build anything that relies on this assumption.
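If you instead want to keep every value when keys collide, one option is to group first rather than building the Map directly; a small sketch using the same map1 (the order of the elements inside the lists may vary):

// Group the key/value pairs by key length, keeping all colliding values.
val grouped: Map[Int, List[Int]] =
  map1.toList.groupBy(_._1.length).map { case (k, kvs) => k -> kvs.map(_._2) }
// grouped: Map[Int,List[Int]] = Map(3 -> List(1, 2, 3))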
I am new to Scala and I am having trouble constructing a Map from inputs.
Here is my problem:
I am getting elevator information as input. It consists of n lines, each one containing the elevatorFloor number and the elevatorPosition on that floor.
Example:
0 5
1 3
4 5
So here I have 3 elevators: the first one is on floor 0 at position 5, the second one on floor 1 at position 3, etc.
Is there a way in Scala to put this into a Map without using var?
What I have so far is a Vector of all the elevators' information:
val elevators = {
  for {
    i <- 0 until n
    j <- readLine split " "
  } yield j.toInt
}
I would like to be able to split each line into two variables, elevatorFloor and elevatorPos, and group them in a data structure (my guess is that a Map would be the appropriate choice). I would like to get something like:
elevators: SomeDataStructure[Int,Int] = (0 -> 5, 1 -> 3, 4 -> 5)
I would like to clarify that I know I could write Java-ish code, initialise a Map and then add values to it, but I am trying to stay as close to functional programming as possible.
Thanks for any help or comments.
You can do:
import scala.io.Source

val res: Map[Int, Int] =
  Source.fromFile("myfile.txt")
    .getLines
    .map { line =>
      val Array(floor, position) = line.split(' ')
      floor.toInt -> position.toInt
    }
    .toMap
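If the lines come from standard input, as in the original readLine snippet, the same pattern applies (a sketch, assuming n has already been read):

val elevators: Map[Int, Int] =
  (0 until n).map { _ =>
    // Each line holds "floor position"; split it and build a key -> value pair.
    val Array(floor, position) = scala.io.StdIn.readLine().split(' ')
    floor.toInt -> position.toInt
  }.toMap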
This question is related to the question Transforming matrix format, scalding.
But now I want to perform the reverse operation, which I can do like this:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> ('c)) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there we run into a problem with zeros. As we know, Scalding skips zero-valued fields. For example, given the matrix:
1 0 8
4 5 6
0 8 9
In Scalding format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros back (though I don't know how to insert that data).
Write my own operations on my own matrix format (not preferable, since I'm interested in Scalding's matrix operations and don't want to rewrite them all myself).
Or maybe there are methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (note that this won't scale, because at some point the rows will be bigger than what fits in memory), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe API, as it is generally
// easier to get big jobs right with it.
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type

val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)

rows.crossWithTiny(cols)
  .leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
  .map('val1 -> 'val1) { v: V =>
    if (v == null) // this value should be 0 in your type:
      zero
    else
      v
  }
  .groupBy('row) {
    _.toList[(Int, V)](('col, 'val1) -> 'cols)
  }
  .map('cols -> 'cols) { cols: List[(Int, V)] =>
    cols.sortBy(_._1).map(_._2).mkString("\t")
  }
  .write(TypedTsv[(Int, String)]("output"))
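To make the zero-filling idea concrete outside of Scalding, here is a small plain-Scala sketch that densifies the sparse (row, col, value) triples from the example above:

// Sparse triples in the Scalding format shown in the question.
val sparse = Seq((1, 1, 1), (1, 3, 8), (2, 1, 4), (2, 2, 5), (2, 3, 6), (3, 2, 8), (3, 3, 9))
val byPos = sparse.map { case (r, c, v) => (r, c) -> v }.toMap

// Enumerate every (row, col) cell and fall back to zero for missing entries.
(1 to 3).foreach { r =>
  println((1 to 3).map(c => byPos.getOrElse((r, c), 0)).mkString("\t"))
}
// 1    0    8
// 4    5    6
// 0    8    9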