Match and Count Sequence Data - scala

I'm trying to match and count elements in a sequence. This is what I have so far:
val fruits = Seq("apple", "PEAR", "Pear")
fruits.exists(_.equalsIgnoreCase("pear"))
From here, how do I count the matches so that the result is 2?

exists returns a Boolean, so it only tells you whether at least one element in the collection matches.
To count the matching elements, use count with the same predicate:
fruits.count(_.equalsIgnoreCase("pear"))
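As a small self-contained sketch of the difference between the two calls (the object name FruitCount and the helper countIgnoreCase are illustrative names, not from the question):

```scala
object FruitCount {
  val fruits: Seq[String] = Seq("apple", "PEAR", "Pear")

  // Hypothetical helper: counts the elements equal to `target`, ignoring case.
  def countIgnoreCase(items: Seq[String], target: String): Int =
    items.count(_.equalsIgnoreCase(target))

  def main(args: Array[String]): Unit = {
    println(fruits.exists(_.equalsIgnoreCase("pear"))) // true
    println(countIgnoreCase(fruits, "pear"))           // 2
  }
}
```

exists stops at the first match, while count walks the whole sequence, so exists is still the better choice when only a yes/no answer is needed.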

Related

Remove records from mutable.MutableList in Scala

I have a mutable.MutableList[emp] with the following structure.
case class emp(name: String,id:String,sal: Long,dept: String)
I am generating records based on the above case class in the mutable.MutableList[emp] below.
val list1: mutable.MutableList[emp] = mutable.MutableList(emp("mike", "1", 123, "HR"), emp("mike", "2", 123, "sys"), emp("Lind", "1", 2323, "sys"))
If the same name appears with both id 1 and id 2, I need to keep only the id 2 record and drop the id 1 record. If id 2 is not present, I have to keep id 1.
How do I achieve this? I tried it the following way, but the results are not accurate:
0. Converted the mutable.MutableList to a DataFrame.
1. Filtered the records with id 1 (id1s_DF).
2. Filtered the records with id 2 (other_rec_DF).
3. Joined the two on name, using leftsemi as the join type.
val join_info_DF = other_rec_DF.join(id1s_DF, id1s_DF("name") =!= other_rec_DF("name"),"leftsemi")
I expected the join above to return the names that are present in other_rec_DF but not in id1s_DF.
It looks like I am doing something wrong with the join, because I am not getting the expected results.
Could someone please help me achieve this, either with the MutableList directly or by converting it into a DataFrame?
Thanks,
Babu
If your data is small enough to fit in memory, you don't need something like Apache Spark for this task.
In plain Scala, the code would look something like this:
import scala.collection.mutable
case class Emp(name: String, id: Int, sal: Long, dept: String)
val list1: mutable.MutableList[Emp] = mutable.MutableList(
  Emp("mike", 1, 123, "HR"),
  Emp("mike", 2, 123, "sys"),
  Emp("Lind", 1, 2323, "sys")
)
// Group by name, sort each group by id in descending order, and keep the first record.
val result = list1
  .groupBy(_.name)
  .mapValues(_.sortBy(_.id)(Ordering[Int].reverse).head)
  .values
result.foreach(println)
The output of the above code would be
Emp(Lind,1,2323,sys)
Emp(mike,2,123,sys)
The idea is to group by the key you want to de-duplicate on (here, name), sort each group by id in descending order, and pick the first record, which is the one with the highest id. We then drop the keys and keep only the values.
The same approach (group by key, keep the record with the highest id) carries over to Spark.
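A variant of the grouping idea above, sketched with maxBy instead of a full sort, since only the largest id per name is needed. This is a minimal sketch on an immutable Seq; the object name KeepHighestId and the helper dedupe are illustrative, not part of the answer.

```scala
object KeepHighestId {
  case class Emp(name: String, id: Int, sal: Long, dept: String)

  // For each name, keep only the record with the largest id.
  def dedupe(emps: Seq[Emp]): Map[String, Emp] =
    emps.groupBy(_.name).map { case (name, group) => name -> group.maxBy(_.id) }

  def main(args: Array[String]): Unit = {
    val emps = Seq(
      Emp("mike", 1, 123, "HR"),
      Emp("mike", 2, 123, "sys"),
      Emp("Lind", 1, 2323, "sys")
    )
    // Print one record per name; iteration order of the map is not guaranteed.
    dedupe(emps).values.foreach(println)
  }
}
```

Keeping the result as a Map from name to record makes it easy to look up the surviving record for a given name; call .values when only the records are needed.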

How to filter out entries from List[Map[String,String]]?

I want to filter out those entries that have operation_id equal to "0".
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).toSet.size.toString
parsed is List[Map[String,String]].
How can I do it?
This is my draft, but I think it does the opposite and selects only those entries whose operation_id is equal to "0":
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).filter(p=>p.equals("0")).toSet.size.toString
The final objective is to count the number of unique operation_id values that are not equal to "0".
If I understand correctly, you want to retain only those entries whose operation_id is NOT equal to "0". In that case, the function passed to filter should be p=>!p.equals("0") or p=>p!="0".
filter retains the entries that fulfill the predicate, so your draft does exactly the opposite of what you want.
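Putting the corrected predicate into the full expression, a minimal sketch might look like the following. The object name OperationCount, the helper countNonZeroOps, and the sample data are made up for illustration; get is used here in place of lift, which returns the same Option for a Map.

```scala
object OperationCount {
  // Counts the distinct operation_id values that are not "0".
  // Maps without an operation_id key are skipped by flatMap.
  def countNonZeroOps(parsed: List[Map[String, String]]): Int =
    parsed.flatMap(_.get("operation_id")).filter(_ != "0").toSet.size

  def main(args: Array[String]): Unit = {
    // Made-up sample data: ids "0", "7", "7", "9", plus one map without the key.
    val parsed = List(
      Map("operation_id" -> "0"),
      Map("operation_id" -> "7"),
      Map("operation_id" -> "7"),
      Map("operation_id" -> "9"),
      Map("other" -> "x")
    )
    println(countNonZeroOps(parsed).toString) // 2
  }
}
```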

.contains giving empty string in rdd

I have an array of ids called id. I have an RDD called r which has a field called idval, which might contain some of the ids in the id array. I want to keep only the rows whose idval is in this array. I am using
val new_r = r.filter(x => id.contains(x.idval))
But, when I go to do
new_r.take(10).foreach(println)
I get a NumberFormatException: empty String
Does contains include empty strings?
Here is an example of lines in the RDD:
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
6, 'nose', 2011-01-01,1.0
I have a separate array with id's such as [1,3,4,5,18,...] and I want to extract the rows of the RDD above which have the idval in ids
So filtering this should give me
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
as idval 1 and 18 are in the array above.
The problem is that I get this empty-string error when I call foreach(println) on the rows of the new filtered RDD.
The RDD is loaded from a CSV file (loadFromUrl) and then mapped:
val r1 = rdd.map(s => s.split(","))
val r2 = r1.map(p => Event(p(0), p(1), dateFormat.parse(p(2).asInstanceOf[String]), p(3).toDouble))
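"NumberFormatException: empty String" is the message toDouble produces for an empty string, so one possible cause (an assumption, since the full input file is not shown) is a row with an empty field, such as a blank line. Because RDD transformations are lazy, such an error would only surface once an action like take runs. Below is a minimal plain-Scala sketch, with no Spark involved, that parses defensively and then filters by id. Row, parseRow, and the trailing blank line in the sample are hypothetical.

```scala
import scala.util.Try

object SafeParse {
  // Hypothetical, simplified row type; the date is kept as a String here.
  case class Row(idval: Int, part: String, date: String, sign: Double)

  // Returns None for the header, blank lines, and rows with missing or invalid fields.
  def parseRow(line: String): Option[Row] = {
    val p = line.split(",").map(_.trim)
    if (p.length < 4) None
    else Try(Row(p(0).toInt, p(1), p(2), p(3).toDouble)).toOption
  }

  def main(args: Array[String]): Unit = {
    val ids = Set(1, 3, 4, 5, 18)
    val lines = Seq(
      "idval,part,date,sign",
      "1,'leg',2011-01-01,1.0",
      "18,'arm',2013-01-01,1.0",
      "6, 'nose', 2011-01-01,1.0",
      "" // hypothetical blank line that would break an unguarded toDouble
    )
    // Keep only the rows that parse cleanly and whose idval is in ids.
    lines.flatMap(parseRow).filter(r => ids.contains(r.idval)).foreach(println)
  }
}
```

The same parseRow function could be passed to flatMap on an RDD of lines, so that malformed rows are dropped before the contains filter runs.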

Filtering RDD by substring values

I want to filter out some entries from an RDD[(String,List[(String,String,String,String)])] based on analyzing values in substrings:
This is my sample data:
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,null), (649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))
In particular, I want to check the 4th field in each substring (counting from 1). If it is equal to null in any substring, the whole entry should be deleted. The result should be the following:
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,3), (649,null,3,1)))
So, in this particular example the second entry should be deleted.
This is my attempt to solve this task:
val filtered = separated.map(l => (l._1,l._2.filter(!_._4.equals("null"))))
The problem is that it just deletes the substring, but not the whole entry. The result is the following (instead of the above-mentioned one):
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))
Filter your RDD by checking that the list of tuples does not contain a tuple whose 4th entry is "null":
yourRdd.filter({
case (id, list) => !list.exists(t => t._4.equals("null"))
})
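To see the difference between filtering inside the list and filtering whole entries, here is a minimal sketch that runs the same predicate on a plain Scala Seq in place of the RDD. The object name DropEntriesWithNull, the type alias Entry, and the helper dropWithNull are illustrative, and "null" is assumed to be the literal string, as in the question's attempt.

```scala
object DropEntriesWithNull {
  type Entry = (String, List[(String, String, String, String)])

  // Keeps an entry only if none of its tuples has "null" as the 4th field.
  def dropWithNull(entries: Seq[Entry]): Seq[Entry] =
    entries.filter { case (_, tuples) => !tuples.exists(_._4 == "null") }

  val sample: Seq[Entry] = Seq(
    ("600", List(("600", "111", "1", "1"), ("615", "111", "1", "5"))),
    ("600", List(("638", "111", "2", "null"), ("649", "222", "3", "1"))),
    ("600", List(("638", "111", "2", "3"), ("649", "null", "3", "1")))
  )

  def main(args: Array[String]): Unit =
    // Prints the first and third entries; the second is dropped as a whole.
    dropWithNull(sample).foreach(println)
}
```

The third entry survives because its "null" sits in the 2nd field, not the 4th.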

Distinct MongoDB function - How to use some criteria with distinct

I have a situation where I need to fetch only the distinct records whose value is greater than 0, plus all records whose value is 0.
For example, if I have a column called mid with rows like "0,0,1,1,2,3,5,5,3", then I should fetch only "0,0,1,2,5,3".
In short: the distinct records plus every mid with value 0.
I have used this
def distinctMIdCursor = dataSetCollection.distinct("mid",whereObject)
def distinctMIdList = distinctMIdCursor.asList()
but it fetches a result like "0,1,2,5,3".
Actual result: "0,1,2,5,3"
Expected result: "0,0,1,2,5,3"
How can I achieve this? Is there a better way?
You cannot achieve this with distinct alone, because returning duplicate zeros defeats the purpose of distinct. Instead, you can run two queries and concatenate the results.
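// Assumes dataSetCollection is a com.mongodb.DBCollection from the legacy MongoDB Java driver, used from Groovy.
import com.mongodb.BasicDBObject
// distinct mid values, excluding 0
def nonZeroDistinctList = dataSetCollection.distinct("mid", new BasicDBObject("mid", new BasicDBObject('$ne', 0)))
// every document with mid equal to 0, mapped to its mid value
def allZeroList = dataSetCollection.find(new BasicDBObject("mid", 0)).collect { it.get("mid") }
// concatenate the two lists
def result = nonZeroDistinctList + allZeroList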
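The same "distinct non-zero values plus every zero" logic can be sketched on an in-memory Scala list rather than a MongoDB collection. This is an illustration only: it does not call the MongoDB driver, and the names DistinctPlusZeros and distinctKeepingZeros are hypothetical.

```scala
object DistinctPlusZeros {
  // Keeps every 0 and only the first occurrence of each non-zero value.
  def distinctKeepingZeros(mids: List[Int]): List[Int] = {
    val zeros = mids.filter(_ == 0)
    val nonZeroDistinct = mids.filter(_ != 0).distinct
    zeros ++ nonZeroDistinct
  }

  def main(args: Array[String]): Unit =
    println(distinctKeepingZeros(List(0, 0, 1, 1, 2, 3, 5, 5, 3))) // List(0, 0, 1, 2, 3, 5)
}
```

The order of the non-zero values follows their first occurrence in the input, so it can differ from the order a database returns.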