Count unique values in list of sub-lists - scala

I have an RDD of the following structure (RDD[(String,Map[String,List[Product with Serializable]])]):
This is a sample of the data:
(600,Map(base_data -> List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)), additional_data -> List((1,2))))
(601,Map(base_data -> List((10:01 01-08-2016,600,111,1,2), (10:02 01-08-2016,619,111,1,2), (10:01 01-08-2016,600,111,1,4)), additional_data -> List((5,6))))
I want to calculate the number of unique values in the 4th field of the tuples in the sub-lists.
For instance, let's take the first entry. Its list is List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)). It contains 2 unique values (1 and 5) in the 4th field.
The second entry also contains 2 unique values (2 and 4), because 2 appears twice.
The resulting RDD should be of the format RDD[Map[String,Any]].
I tried to solve this task as follows:
val result = myRDD.map({
  line => Map(("id", line._1),
    ("unique_count", line._2.get("base_data").groupBy(l => l).count(_)))
})
However this code does not do what I need. In fact, I don't know how to properly indicate that I want to group by the 4th field...

You are quite close to the solution. There is no need to call groupBy: you can access a tuple field by index with productElement (zero-based, so the field you want is at index 4), transform the resulting List into a Set, and then just return the size of the Set, which corresponds to the number of unique elements:
("unique_count", line._2("base_data").map(bd => bd.productElement(4)).toSet.size)
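Put together, a runnable sketch of the whole transformation. It is shown here on a single record as a plain Scala value rather than inside an RDD, so it runs without a Spark context; the same Map construction applies per record inside myRDD.map:

```scala
// One record of the RDD, typed as in the question.
// Note that productElement is zero-based, so index 4 is the fifth tuple field.
val record: (String, Map[String, List[Product]]) =
  ("600", Map(
    "base_data" -> List(
      ("10:00 01-08-2016", 600, 111, 1, 1),
      ("10:15 01-08-2016", 615, 111, 1, 5)),
    "additional_data" -> List((1, 2))))

val result: Map[String, Any] = Map(
  "id" -> record._1,
  "unique_count" -> record._2("base_data").map(_.productElement(4)).toSet.size)
// result("unique_count") == 2
```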

Related

Scala Apriori Algorithm : if first index value is same, then Generate Three item set from two items set list

I want the following result:
If the input list is List(List("A","B"), List("A","C"), List("B","C"), List("B","D"))
the output should be List(List("A","B","C"), List("B","C","D"))
I think it should be done as follows: all lists whose first element is the same are grouped together, e.g. if the first element is "A" the group is ("A","B","A","C").distinct = ("A","B","C")
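No answer is posted for this one, but one possible sketch of the grouping step the question describes (group the two-item sets by their first element, flatten each group, deduplicate) could look like this. This only covers the merging step described above, not a full Apriori candidate generation:

```scala
val input = List(List("A", "B"), List("A", "C"), List("B", "C"), List("B", "D"))

// Group the two-item sets by their first element, then merge each group
// into one distinct item set (groupBy does not guarantee group order).
val output = input
  .groupBy(_.head)
  .values
  .map(_.flatten.distinct)
  .toList
// output contains List("A", "B", "C") and List("B", "C", "D")
```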

Scala: How to create a map over a collection from a set of keys?

Say I have a set of people Set[People]. Each person has an age. I want to create a function, which creates a Map[Int, Seq[People]] where for each age from, say, 0 to 100, there would be a sequence of people of that age or an empty sequence if there were no people of that age in the original collection.
I.e. I'm doing something along the lines of
Set[People].groupBy(_.age)
where the output was
Map(0 -> Seq(John, Mary), 1 -> Seq(), 2 -> Seq(Bill), ...)
groupBy of course omits all those ages for which there are no people. How should I implement this?
Configure a default value for your map:
val grouped = people.groupBy(_.age).withDefaultValue(Set())
if you need the values to be sequences you can map them
val grouped = people.groupBy(_.age).mapValues(_.toSeq).withDefaultValue(Seq())
Remember that, as the documentation puts it:
Note: `get`, `contains`, `iterator`, `keys`, etc are not affected by `withDefault`.
Since you've got a map with non-empty sequences only for the ages that occur, you can fill in the rest with empty collections (note that getOrElse needs an empty Seq as the default, not None, to keep the value type consistent):
val fullMap = (0 to 100).map(age => age -> grouped.getOrElse(age, Seq())).toMap
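Combining the suggestions into a self-contained sketch, with a minimal Person case class standing in for the question's People type:

```scala
case class Person(name: String, age: Int)

val people = Set(Person("John", 0), Person("Mary", 0), Person("Bill", 2))

// Group by age; convert each group to a Seq as the question asks.
val grouped = people.groupBy(_.age).map { case (age, ps) => age -> ps.toSeq }

// Materialize every age from 0 to 100, falling back to an empty Seq
// for ages with no people.
val fullMap: Map[Int, Seq[Person]] =
  (0 to 100).map(age => age -> grouped.getOrElse(age, Seq.empty[Person])).toMap
```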

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map { line =>
  val arr = line.split(',')
  for (h <- 0 until arr.length) {
    if (arr(h).trim == "") {
      val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } // crashing here
      arr(h) = // imputing with the value above
    }
  }
  arr.mkString(",")
}
This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.
avgrdd contains the average for the key value pairs where key is column index and the class label value. This avgrdd is calculated using the combiners which I see is calculating the results correctly.
dependent_col_index is the column containing the class labels.
The line with filter is crashing with a NullPointerException.
On removing this line, the original array is output (comma-separated).
I am confused why the filter operation causes a crash.
Please suggest how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
mean for class 1 is 84/4 = 21 and for class 0 is 29/2 = 14.5
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside another RDD's transformation; this causes an error (the inner RDD reference is not usable on the executors, which is why you see the NullPointerException).
The way to proceed is the following:
1. Transform the source RDD withoutHeader into an RDD of <Class, Value> pairs of the correct type (Double in your case, since the means can be fractional). Cache it.
2. Calculate avgrdd on top of withoutHeader. This should be an RDD of <Class, AvgValue> pairs.
3. Join the withoutHeader RDD and avgrdd together; this way, for each row you have a structure <Class, <Value, AvgValue>>.
4. Execute a map on top of the result to replace each missing Value with AvgValue.
Another option is to split the RDD into two parts at step 3 (one part containing the rows with missing values, the other the rows without), join avgrdd only with the RDD of missing-value rows, and then take the union of the two parts. This is faster if only a small fraction of the values is missing.
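The join-based logic can be illustrated on plain Scala collections (no Spark context needed to see the idea; in Spark the groupBy/map steps below become reduceByKey or a join on pair RDDs). The rows below re-create the example table, with None marking a missing value:

```scala
// (value, classLabel) rows from the example; None = missing value
val rows: List[(Option[Double], Int)] = List(
  (Some(4.0), 1), (Some(8.0), 0), (None, 1),
  (Some(21.0), 1), (Some(21.0), 0), (None, 1),
  (Some(25.0), 1), (None, 0), (Some(34.0), 1))

// Step 2: mean per class label over the non-missing values
val avgByClass: Map[Int, Double] = rows
  .collect { case (Some(v), cls) => cls -> v }
  .groupBy(_._1)
  .map { case (cls, vs) => cls -> vs.map(_._2).sum / vs.size }

// Steps 3-4: replace each missing value with its class mean
val imputed = rows.map { case (vOpt, cls) => (vOpt.getOrElse(avgByClass(cls)), cls) }
// avgByClass == Map(1 -> 21.0, 0 -> 14.5), matching the means in the question
```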

Index-wise most frequently occurring element

I have an array of the form
val array: Array[(Int, (String, Int))] = Array(
(idx1,(word1,count1)),
(idx2,(word2,count2)),
(idx1,(word1,count1)),
(idx3,(word3,count1)),
(idx4,(word4,count4)))
I want to get the top 10 and bottom 10 elements from this array for each index (idx1, idx2, ...). Basically I want the 10 most frequently occurring and the 10 least frequently occurring elements for each index value.
Please suggest how to achieve this in Spark in the most efficient way.
I have tried using a for loop over each index, but this makes the program too slow and runs sequentially.
An example would be this :
(0,("apple",1))
(0,("peas",2))
(0,("banana",4))
(1,("peas",2))
(1,("banana",1))
(1,("apple",3))
(2,("NY",3))
(2,("London",5))
(2,("Zurich",6))
(3,("45",1))
(3,("34",4))
(3,("45",6))
Suppose I do top 2 on this set output would be
(0,("banana",4))
(0,("peas",2))
(1,("apple",3))
(1,("peas",2))
(2,("Zurich",6))
(2,("London",5))
(3,("45",6))
(3,("34",4))
I also need bottom 2 in the same way
I understand this is equivalent to producing the entire column list with groupByKey on the (K, V) pairs and then sorting it. While that is correct, in a typical Spark environment the groupByKey operation involves a lot of shuffle output, which can make it inefficient.
Not sure about Spark, but I think you can go with something like:
def f(array: Array[(Int, (String, Int))], n: Int) =
  array.groupBy(_._1)
    .map(pair => (
      pair._1,
      pair._2.sortBy(_._2._2).toList // sort each group by count, ascending
    ))
    .map(pair => (
      pair._1,
      (
        pair._2.take(Math.min(n, pair._2.size)),    // bottom n (smallest counts)
        pair._2.drop(Math.max(0, pair._2.size - n)) // top n (largest counts)
      )
    ))
The groupBy returns a map from each index to its list of entries sorted by frequency. Note the sort key _._2._2, i.e. the count; sorting by _._2 would sort by the word first. After this, each entry list is mapped to a pair of lists, one containing the bottom n elements and the other containing the top n elements. Note that you could replace the named parameters with _; I used names for clarity.
This version assumes that you are always interested in computing both the top and bottom n elements, and thus does both in a single pass. If you usually only need one of the two, it's more efficient to add the .take or .drop immediately after the toList.
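Applied to part of the sample data with n = 2 (same group-sort-take pipeline, with the count _._2._2 as the sort key; takeRight(n) is equivalent to the drop expression above):

```scala
val data = Array(
  (0, ("apple", 1)), (0, ("peas", 2)), (0, ("banana", 4)),
  (1, ("peas", 2)), (1, ("banana", 1)), (1, ("apple", 3)))

val n = 2
// Map from index to (bottom n, top n), each sorted by count ascending.
val result = data
  .groupBy(_._1)
  .map { case (idx, entries) =>
    val sorted = entries.sortBy(_._2._2).toList
    idx -> (sorted.take(n), sorted.takeRight(n))
  }
// for index 0: bottom 2 = apple, peas; top 2 = peas, banana
// for index 1: bottom 2 = banana, peas; top 2 = peas, apple
```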

Sum of elements based on grouping by an element, using Scala?

I have the following Scala list:
List((192.168.1.1,8590298237), (192.168.1.1,8590122837), (192.168.1.1,4016236988),
(192.168.1.1,1018539117), (192.168.1.1,2733649135), (192.168.1.2,16755417009),
(192.168.1.1,3315423529), (192.168.1.2,1523080027), (192.168.1.1,1982762904),
(192.168.1.2,6148851261), (192.168.1.1,1070935897), (192.168.1.2,276531515092),
(192.168.1.1,17180030107), (192.168.1.1,8352532280), (192.168.1.3,8590120563),
(192.168.1.3,24651063), (192.168.1.3,4431959144), (192.168.1.3,8232349877),
(192.168.1.2,17493253102), (192.168.1.2,4073818556), (192.168.1.2,42951186251))
I want the following output:
List((192.168.1.1, sum of all values of 192.168.1.1),
(192.168.1.2, sum of all values of 192.168.1.2),
(192.168.1.3, sum of all values of 192.168.1.3))
How do I get the sum of the second elements of the list, grouped by the first element, using Scala?
You can use the groupBy function in Scala here. Your input data does have some issues, however: the IP addresses must be Strings, and the numbers must be Longs, since some of them overflow Int. Here is an example using groupBy:
val data = ??? // Your list
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
If the answer is correct, accept it; otherwise comment and I'll explain some more.
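For illustration, the same one-liner on a small made-up subset (the values here are not the question's real numbers):

```scala
val data = List(
  ("192.168.1.1", 10L), ("192.168.1.2", 5L),
  ("192.168.1.1", 7L), ("192.168.1.3", 1L))

// Group by IP, then sum the Long values within each group.
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
// contains ("192.168.1.1", 17L), ("192.168.1.2", 5L), ("192.168.1.3", 1L)
```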