I have the following tuple: (1,"3idiots",List("Action","Adventure","Horror")), which I need to convert into a list in the following format:
List(
(1,"3idiots","Action"),
(1,"3idiots","Adventure")
)
To add to the previous answers, you can also use a for-comprehension in this case; IMHO it might make things clearer:
for(
(a,b,l) <- ts;
s <- l
) yield (a,b,s)
So if you have:
val ts = List(
("a","1", List("foo","bar","baz")),
("b","2", List("foo1","bar1","baz1"))
)
You will get:
List(
(a,1,foo),
(a,1,bar),
(a,1,baz),
(b,2,foo1),
(b,2,bar1),
(b,2,baz1)
)
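To connect this with the flatMap-based answers: a for-comprehension with two generators behaves like a flatMap over the outer list combined with a map over the inner one. A minimal sketch using the same ts as above (the names viaFor and viaFlatMap are only illustrative):

```scala
val ts = List(
  ("a", "1", List("foo", "bar", "baz")),
  ("b", "2", List("foo1", "bar1", "baz1"))
)

// The for-comprehension from the answer above.
val viaFor = for {
  (a, b, l) <- ts
  s <- l
} yield (a, b, s)

// Roughly the same thing spelled out with flatMap and map.
val viaFlatMap = ts.flatMap { case (a, b, l) => l.map(s => (a, b, s)) }

assert(viaFor == viaFlatMap)
```

Both forms produce one output tuple per element of each inner list.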
Assuming that you have more than one tuple like this:
val tuples = List(
(1, "3idiots", List("Action", "Adventure", "Horror")),
(2, "foobar", List("Foo", "Bar"))
)
and you want result like this:
List(
(1, "3idiots", "Action"),
(1, "3idiots" , "Adventure"),
(1, "3idiots", "Horror"),
(2, "foobar", "Foo"),
(2, "foobar", "Bar")
)
the solution is to use flatMap, which maps each tuple to a list of results and then concatenates those lists into a single list:
tuples.flatMap(t =>
t._3.map(s =>
(t._1, t._2, s)
)
)
or shorter: tuples.flatMap(t => t._3.map((t._1, t._2, _)))
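One consequence worth keeping in mind: a tuple whose inner list is empty contributes nothing to the result, so it disappears from the output. A small sketch, assuming a second movie with no genres (that entry is made up for illustration):

```scala
val tuples = List(
  (1, "3idiots", List("Action", "Adventure", "Horror")),
  (2, "foobar", List.empty[String]) // hypothetical entry with no genres
)

// Pattern matching names the tuple fields instead of using _1, _2, _3.
val flattened = tuples.flatMap { case (id, title, genres) =>
  genres.map(genre => (id, title, genre))
}

// Only the rows for id 1 survive; id 2 has no genres to pair with.
assert(flattened == List(
  (1, "3idiots", "Action"),
  (1, "3idiots", "Adventure"),
  (1, "3idiots", "Horror")
))
```

If such entries must be kept, the inner list would need a placeholder value before flattening.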
This should do what you want:
val input = (1,"3idiots",List("Action","Adventure","Horror"))
val result = input._3.map(x => (input._1,input._2,x))
// gives List((1,3idiots,Action), (1,3idiots,Adventure), (1,3idiots,Horror))
You can use this.
val question = (1,"3idiots",List("Action","Adventure","Horror"))
val result = question._3.map(x=> (question._1 , question._2 ,x))
According to this link: https://github.com/amplab/training/blob/ampcamp6/machine-learning/scala/solution/MovieLensALS.scala
I don't understand the point of:
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
What does _._2.[user|product] mean?
That accesses the elements of a tuple. The following example might explain it better:
val xs = List(
(1, "Foo"),
(2, "Bar")
)
xs.map(_._1) // => List(1,2)
xs.map(_._2) // => List("Foo", "Bar")
// An equivalent way to write this
xs.map(e => e._1)
xs.map(e => e._2)
// Perhaps a better way is
xs.collect {case (a, b) => a} // => List(1,2)
xs.collect {case (a, b) => b} // => List("Foo", "Bar")
ratings is a collection of tuples of the form (timestamp % 10, Rating(userId, movieId, rating)). The first underscore in _._2.user refers to the current element being processed by map, which here is a tuple (a pair of values). For a pair t, you can refer to its first and second elements with the shorthand t._1 and t._2. So _._2 selects the second element of the tuple currently being processed, which is the Rating, and .user then reads that Rating's user field.
val ratings = sc.textFile(movieLensHomeDir + "/ratings.dat").map { line =>
val fields = line.split("::")
// format: (timestamp % 10, Rating(userId, movieId, rating))
(fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
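The same access pattern can be tried without Spark. The sketch below uses a plain List and a hypothetical stand-in case class named Rating with the same field names (user, product, rating); the sample values are made up, and size replaces the RDD's count:

```scala
// Hypothetical stand-in for the Rating class used in the linked code.
final case class Rating(user: Int, product: Int, rating: Double)

// Each element is a pair: (timestamp % 10, Rating(...)).
val ratings = List(
  (3L, Rating(1, 10, 4.0)),
  (7L, Rating(1, 20, 3.5)),
  (7L, Rating(2, 10, 5.0))
)

// _._2 picks the Rating out of each pair; .user / .product read its fields.
val numUsers = ratings.map(_._2.user).distinct.size
val numMovies = ratings.map(_._2.product).distinct.size

// The longer, equivalent spelling with a named parameter.
val numUsersExplicit = ratings.map(pair => pair._2.user).distinct.size
```

Here numUsers and numMovies are both 2, since users 1 and 2 and products 10 and 20 each appear at least once.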
I was playing around with spark and I am getting stuck with something that seems foolish.
Let's say we have two RDDs:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
If I do rdd1.subtractByKey(rdd2), I get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new to Spark; any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want to do can be expressed as Union - Intersection.
Here is how you can do this in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
I think this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded (== records with a matching key in r2) and remaining (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala collection sense of splitting a collection into two depending on a predicate); otherwise we could calculate subtracted and discarded in one pass.
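For readers unfamiliar with the collection method being referred to, here is what partition does on an ordinary Scala Seq. This is only an illustration on local collections, not Spark code, and keysToRemove is an assumed stand-in for the keys of the second RDD:

```scala
val pairs = Seq((1, 2), (3, 4), (3, 6))
val keysToRemove = Set(3) // stand-in for the keys present in the second data set

// partition splits the collection in one pass:
// elements satisfying the predicate go left, the rest go right.
val (discarded, subtracted) = pairs.partition { case (key, _) =>
  keysToRemove.contains(key)
}

assert(discarded == Seq((3, 4), (3, 6)))
assert(subtracted == Seq((1, 2)))
```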
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
But you won't be keeping the values, just running another subtraction.
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). I think one way to get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract:
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)
I have two lists -
A = (("192.168.1.1","private","Linux_server","str1"),
("192.168.1.2","private","Linux_server","str2"))
B = ("A","B")
I want the following output:
outputList = (("192.168.1.1","private","Linux_server","str1", "A"),
("192.168.1.2","private","Linux_server","str2","B"))
I want to append each element of the second list to the corresponding element of the first list.
The two lists will always have the same size.
How do I get the above output using Scala?
The short answer:
A = (A zip B).map({ case (x, y) => x :+ y })
Some compiling code to be more explicit:
val a = List(
List("192.168.1.1", "private", "Linux_server", "str1"),
List("192.168.1.2", "private", "Linux_server", "str2")
)
val b = List("A", "B")
val c = List(
List("192.168.1.1", "private", "Linux_server", "str1", "A"),
List("192.168.1.2", "private", "Linux_server", "str2", "B")
)
assert((a zip b).map({ case (x, y) => x :+ y }) == c)
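The question writes the rows with round brackets, so they may actually be tuples rather than Lists. Tuples have no :+ method, so under that assumption the rows can be destructured and rebuilt instead. A sketch (the field names ip, scope, role, label and tag are invented for readability):

```scala
val a = List(
  ("192.168.1.1", "private", "Linux_server", "str1"),
  ("192.168.1.2", "private", "Linux_server", "str2")
)
val b = List("A", "B")

// zip pairs each 4-tuple with its tag; the pattern rebuilds it as a 5-tuple.
val outputList = a.zip(b).map { case ((ip, scope, role, label), tag) =>
  (ip, scope, role, label, tag)
}

assert(outputList == List(
  ("192.168.1.1", "private", "Linux_server", "str1", "A"),
  ("192.168.1.2", "private", "Linux_server", "str2", "B")
))
```

Note that zip silently truncates to the shorter list, which is harmless here only because the two lists are stated to have the same size.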
The first list contains the following data:
List(("A",66729122803169854198650092,"SD"),("B",14941578978240528153321786,"HD"),("C",14941578978240528153321786,"PD"))
and the second list contains the following data:
List(("X",14941578978240528153321786),("Y",68277588597782900503675727),("Z",14941578978240528153321786),("L",66729122803169854198650092))
Using the above two lists, I want to form the following list by matching the number in each entry of the second list against the number in the entries of the first list, so my output should be as below:
List(("X",14941578978240528153321786,"B","HD"),("X",14941578978240528153321786,"C","PD"), ("Y",68277588597782900503675727,"",""),("Z",14941578978240528153321786,"B","HD"),("Z",14941578978240528153321786,"C","PD"),
("L",66729122803169854198650092,"A","SD"))
val tuples3 = List(
("A", "66729122803169854198650092", "SD"),
("B", "14941578978240528153321786", "HD"),
("C", "14941578978240528153321786", "PD"))
val tuples2 = List(
("X", "14941578978240528153321786"),
("Y", "68277588597782900503675727"),
("Z", "14941578978240528153321786"),
("L", "66729122803169854198650092"))
Group first list by target field:
val tuples3Grouped =
tuples3
.groupBy(_._2)
.mapValues(_.map(t => (t._1, t._3)))
.withDefaultValue(List(("", "")))
Then join the two lists on that field:
val result = for{ (first, second) <- tuples2
t <- tuples3Grouped(second)
} yield (first, second, t._1, t._2)
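On Scala 2.13 and later, mapValues on a Map returns a lazy view that has no withDefaultValue, so the grouping step above may not compile as written there. A variant that should work on both older and newer versions uses a plain map over the key/value pairs. This is a sketch; byNumber and joined are names chosen here:

```scala
val tuples3 = List(
  ("A", "66729122803169854198650092", "SD"),
  ("B", "14941578978240528153321786", "HD"),
  ("C", "14941578978240528153321786", "PD"))
val tuples2 = List(
  ("X", "14941578978240528153321786"),
  ("Y", "68277588597782900503675727"),
  ("Z", "14941578978240528153321786"),
  ("L", "66729122803169854198650092"))

// Group by the number, keep only (name, quality), and fall back to a
// single blank pair for numbers that have no match.
val byNumber: Map[String, List[(String, String)]] =
  tuples3
    .groupBy(_._2)
    .map { case (number, rows) => number -> rows.map(t => (t._1, t._3)) }
    .withDefaultValue(List(("", "")))

// One output row per (entry in tuples2, matching entry in tuples3).
val joined = for {
  (name, number) <- tuples2
  (label, quality) <- byNumber(number)
} yield (name, number, label, quality)
```

With this data, joined has six rows: two each for X and Z, one blank-padded row for Y, and one for L.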
I apply the groupBy function to my List collection; however, I want to remove the repeated values in the value part of the resulting Map. Here is the initial List collection:
PO_ID PRODUCT_ID RETURN_QTY
1 1 10
1 1 20
1 2 30
1 2 10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there any way to remove the repeated part [(1,1),(1,2)] from the Map's values?
Thanks..
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates on each value of the Map obtained from groupBy, which is a List of triplets, whereas in the inner map we extract the third element of each triplet.
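If Scala 2.13 or later is available (an assumption; the answer above does not say which version it targets), groupMap combines the grouping and the value projection in one call:

```scala
val a = Seq((1, 1, 10), (1, 1, 20), (1, 2, 30), (1, 2, 10))

// First argument list: how to compute the key.
// Second argument list: what to keep from each element in the group.
val grouped: Map[(Int, Int), Seq[Int]] =
  a.groupMap(v => (v._1, v._2))(_._3)

assert(grouped == Map((1, 1) -> Seq(10, 20), (1, 2) -> Seq(30, 10)))
```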
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.
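For completeness, the /: operator used above is deprecated in recent Scala versions; the same fold can be written with foldLeft. A sketch of that equivalent, folding directly over the triplets (empty and folded are names chosen here):

```scala
val ts = Seq((1, 1, 10), (1, 1, 20), (1, 2, 30), (1, 2, 10))

// Start from an empty map whose missing keys read as an empty list.
val empty = Map.empty[(Int, Int), List[Int]].withDefaultValue(List.empty[Int])

// Append each third element to the list stored under its (first, second) key.
val folded = ts.foldLeft(empty) { case (m, (a, b, c)) =>
  m.updated((a, b), m((a, b)) :+ c)
}

assert(folded == Map((1, 1) -> List(10, 20), (1, 2) -> List(30, 10)))
```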