too many arguments for method apply: (i: Int)String in class Array - scala

I have an RDD (r2Join1) which holds the following data
(100,(102|1001,201))
(100,(102|1001,200))
(100,(103|1002,201))
(100,(103|1002,200))
(150,(151|1003,204))
I want to transform this to the following
(102, (1001, 201))
(102, (1001, 200))
(103, (1002, 201))
(103, (1002, 200))
(151, (1003, 204))
i.e., I want to transform (k, (v1|v2, v3)) to (v1, (v2, v3)).
I did the following:
val m2 = r2Join1.map({case (k, (v1, v2)) => val p: Array[String] = v1.split("\\|") (p(0).toLong, (p(1).toLong, v2.toLong))})
I get the following error
error: too many arguments for method apply: (i: Int)String in class Array
I am new to Spark & Scala. Please let me know how this error can be resolved.

The code looks like it might be off in other areas, but without the rest I can't be sure. At a minimum, this should get you moving: you need either a semicolon after your split or to put the two statements on separate lines.
v1.split("\\|");(p(0).toLong, (p(1).toLong, v2.toLong))
Without the semicolon the compiler is interpreting it as:
v1.split("\\|").apply(p(0).toLong...)
where apply acts as an indexer of the array in this case.
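With the statements split onto separate lines, a minimal sketch of the corrected map (element types assumed from the sample data) looks like this:
val m2 = r2Join1.map { case (k, (v1, v2)) =>
  val p: Array[String] = v1.split("\\|")   // "102|1001" -> Array("102", "1001")
  (p(0).toLong, (p(1).toLong, v2.toLong))  // e.g. (102, (1001, 201))
}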

Related

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to design the code in such a way that if I provide 3 as input, it returns
Array((3,Ken), (3,Ben))
If input is 6, output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked; can anyone help me out with a solution to this problem?
Given the following, which you would have to define yourself:
// Provide your SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use union and filter to reach your goal:
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do on several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting if we could make sure that this function could work on any type of keys, not only Ints.
You can express this in the old RDD API as follows:
def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method a little bit more in detail.
This method allows you to filter by key an RDD that contains generic pairs, where the type of the first item is K and the type of the second is V (for key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, which takes a predicate (a function from the element type, here the pair (K, V), to Boolean) and makes sure that the resulting RDD only contains items that satisfy this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your RDDs, call union on them and then invoke the filterByKey function with the number you want to filter by (as shown in the example).
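For completeness, here is a minimal sketch using the sample data from the question (variable names are illustrative):
val rdd1 = sc.parallelize(Array((3, "Ken"), (5, "Jonny"), (4, "Adam"), (3, "Ben"), (6, "Rhonda"), (5, "Johny")))
val rdd2 = sc.parallelize(Array((4, "Rudy"), (7, "Micheal"), (5, "Peter"), (5, "Shawn"), (5, "Aaron"), (7, "Gilbert")))
filterByKey(rdd1.union(rdd2), 3).collect()
// Array((3,Ken), (3,Ben))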

Applying distinct on rdd cosidering both key- value pair and not just on the basis of keys

I have 2 pair RDDs on which I am doing union to give a third RDD.
But the resulting RDD has tuples which are repeated:
rdd3 = {(1,2) , (3,4) , (1,2)}
I want to remove duplicate tuples from rdd3, but only if both the key and the value of a tuple are the same.
How can I do that?
Just invoke the Spark Scala library API directly:
def distinct(): RDD[T]
Remember that it is a generic method with a type parameter.
If you invoke it on your rdd, which has type RDD[(Int, Int)], it will give you the distinct pairs of type (Int, Int) in your rdd, just as they are.
If you want to see the internals of this method, see the signature below:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
You can use distinct, for example:
val data = sc.parallelize(
  Seq(
    ("Foo", "41", "US", "3"),
    ("Foo", "39", "UK", "1"),
    ("Bar", "57", "CA", "2"),
    ("Bar", "72", "CA", "2"),
    ("Baz", "22", "US", "6"),
    ("Baz", "36", "US", "6"),
    ("Baz", "36", "US", "6")
  )
)
Remove the duplicates:
val distinctData = data.distinct()
distinctData.collect
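Applied to the pairs from the question, a minimal sketch (assuming a SparkContext named sc) would be:
val rdd3 = sc.parallelize(Seq((1, 2), (3, 4), (1, 2)))
rdd3.distinct().collect()
// Array((1,2), (3,4)) -- ordering is not guaranteed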

How to implement GetResult[List[String]] in scala slick?

I'm using SQLActionBuilder, e.g. sql"""select ...""", to create a common/wide SQL query, and I don't care how many result columns it has.
The documentation examples use as[TupleX] to decide the result type; in my case, I want to use List[String] instead of a TupleX type.
I have attempted sQLActionBuilder.as[List[String]], but a compile error is encountered:
Error:(8, 187) could not find implicit value for parameter rconv: slick.jdbc.GetResult[List[String]]
val x = reportUtilRepository.find(List())("td_report_region")(1469635200000L, 1475251200000L)("region" :: Nil, "clicks" :: "CPC" :: Nil)(List("1", "2"), List("regionType" -> "1"))(_.as[List[String]]).map(x => println(x.toString))
and sQLActionBuilder.as[List[(String, String, String)]] works well. So how can I use List[String] to match a result with an arbitrary number of columns?
I think a straightforward way is to implement a GetResult[List[String]], as the compiler suggests, but I don't know how to do it. Other approaches are also welcome.
Thanks.
First of all, querying a database always returns a list of tuples, so the result type will be List[TupleX]: each row is represented as a record, and the columns of each row become the respective tuple elements.
Therefore, your data will look like List((1,2,3),(3,4,5)), where the data type is List[(Int, Int, Int)]. To produce a List[Int] you might do the following:
val a = List((1, 2, 3), (3, 4, 5))
a.flatMap(x => List(x._1, x._2, x._3))
// res0: List[Int] = List(1, 2, 3, 3, 4, 5)
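If you do want the GetResult[List[String]] the compiler asks for, a possible sketch (an assumption on my part; it relies on Slick's PositionedResult exposing numColumns and nextString(), so check it against your Slick version) is to read every column of a row as a String:
import slick.jdbc.GetResult
// Hypothetical implicit: read all columns of the current row as strings
implicit val getListString: GetResult[List[String]] = GetResult { r =>
  (1 to r.numColumns).map(_ => r.nextString()).toList
}
With such an implicit in scope, sQLActionBuilder.as[List[String]] should compile.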

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in Spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD, and that reduceByKey() has a nifty underscore notation for conciseness.
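Putting it together with the snippet from the question, a minimal sketch would be:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))                   // RDD[Map[String, Int]]
val counts = pairs.flatMap(identity).reduceByKey(_ + _)   // RDD[(String, Int)]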

Summing items within a Tuple

Below is a data structure of a List of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a Tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get it into the format I expect, which is of type RDD[(String, Seq[(String, Int)])], i.e. RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
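A minimal sketch of that approach on plain Scala collections, using the f defined above and the question's data3, could look like this:
data3
  .groupBy(f)                                                             // Map[String, List[(String, String, Int)]]
  .map { case (id, items) => (id, items.head._2, items.map(_._3).sum) }   // sum the Int occurrences per id
  .toList
// List((id1,a,3), (id2,a,1)) -- ordering may differ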
The following is the most readable, efficient, and scalable approach:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. By using reduceByKey, the summation will parallelize; i.e. for very large groups it will be distributed, and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute to 10 cores.
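If you need the original (id, name, total) shape back, a minimal sketch that flattens the key afterwards (data is assumed to be an RDD[(String, String, Int)] as above) is:
data.map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)                                           // RDD[((String, String), Int)]
  .map { case ((key1, key2), total) => (key1, key2, total) }    // RDD[(String, String, Int)]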
A few more notes about the other answers:
Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) you can take those values from the key, e.g. .map { case ((id1, id2), v) => (id1, id2, v.map(_._3).sum) }
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}