I am very new to Spark and Scala, and I want to test if a value is a key from an RDD.
The data I have is like this:
RDD data: key -> value
RDD stat: key -> statistics
What I want to do is to filter all the key-value pairs in data whose key appears in stat.
My general idea is to convert the keys of stat into a set and then test whether each key in data belongs to this set.
Are there better approaches, and how do I convert the keys of an RDD into a set using Scala?
Thanks.
You can use lookup:
def lookup(key: K): List[V]
Return the list of values in the RDD for key key. This operation is
done efficiently if the RDD has a known partitioner by only searching
the partition that the key maps to.
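For example, a minimal sketch with made-up data (assuming sc is an existing SparkContext, as elsewhere on this page):

// Illustrative only: a small stat RDD of key -> statistic pairs.
val stat = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))

// lookup returns every value stored under the key (empty if the key is absent),
// so non-emptiness tells you whether "a" is a key of stat.
val isKey = stat.lookup("a").nonEmpty   // true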
You asked -
What I want to do is to filter all the key-value pairs in data that
has the key in stat.
I think you should join by key instead of doing a lookup.
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer
joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
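A rough sketch of the join-based filter, with made-up data standing in for the data and stat RDDs from the question:

// data: key -> value, stat: key -> statistic (illustrative values only)
val data = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z")))
val stat = sc.parallelize(Seq(("a", 10), ("c", 30)))

// join keeps only keys present in both RDDs; drop the stat side afterwards.
val filtered = data.join(stat).mapValues { case (v, _) => v }
// filtered contains ("a", "x") and ("c", "z")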
"close over an RDD inside another RDD."
Basically using an RDD inside the transformations (in this case filter) of another RDD.
Nesting of one RDD inside another is not allowed in Spark.
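If the key set of stat is small enough to collect to the driver, the set-based idea from the original question can be expressed with a broadcast variable instead of nesting RDDs. A rough sketch (assuming sc is the active SparkContext and data/stat are the pair RDDs from the question):

// Collect the keys of stat to the driver and broadcast them to the executors.
val statKeys = sc.broadcast(stat.keys.collect().toSet)

// The filter now closes over a plain Set, not over another RDD.
val filtered = data.filter { case (k, _) => statKeys.value.contains(k) }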
Related
So I have an RDD in Scala which is currently stored as a key-value mapping like the following:
(A, (B,C,D,E))
I was wondering if it was possible to somehow map this to an RDD which stores a key-value mapping like the following
(A,B)
(A,C)
(A,D)
(A,E)
i.e. is it possible to map the key separately to each of the values?
Found a way to do it. You can use flatMapValues(x => x) to turn them all into key-value pairs rather than one key-to-array pair.
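A minimal sketch of that, with illustrative data:

val pairs = sc.parallelize(Seq(("A", Seq("B", "C", "D", "E"))))

// flatMapValues keeps the key and emits one output pair per element of the value.
val flattened = pairs.flatMapValues(x => x)
// flattened contains ("A","B"), ("A","C"), ("A","D"), ("A","E")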
Let's assume that I have an RDD with items of type
case class Foo(name: String, nums: Seq[Int])
and a table my_schema.foo in Cassandra with a partitioning key composed of name and num columns
Now, I'd like to fetch for each element in the input RDD all corresponding rows, i.e. something like:
SELECT * from my_schema.foo where name = :name and num IN :nums
I've tried the following approaches:
Use the joinWithCassandraTable extension: rdd.joinWithCassandraTable("my_schema", "foo").on(SomeColumns("name")), but I don't know how I could specify the IN constraint.
For each element of the input RDD, issue a separate query (within a map function). This does not work, as the SparkContext is not serializable and cannot be passed into the map.
FlatMap the input RDD to generate a separate item (name, num) for each num in nums. This will work, but it will probably be far less efficient than using an IN clause.
What would be a proper way of solving this problem?
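For reference, the flatMap approach (the third option above) might look roughly like the sketch below. This is only an assumption-laden sketch, not a verified answer: it assumes my_schema.foo has the primary key (name, num) and that the spark-cassandra-connector is available.

import com.datastax.spark.connector._

// Expand each Foo into one (name, num) pair per element of nums...
val expanded = rdd.flatMap(foo => foo.nums.map(num => (foo.name, num)))

// ...then join those pairs against the Cassandra table on both key columns.
val joined = expanded
  .joinWithCassandraTable("my_schema", "foo")
  .on(SomeColumns("name", "num"))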
I'm trying to extend my RDD table by one column (with string values) using the answers to this question, but I cannot add a column name this way... I'm using Scala.
Is there any easy way to add a column to RDD?
Apache Spark has a functional approach to processing data. Fundamentally, an RDD[T] is some sort of collection of objects (RDD stands for Resilient Distributed Dataset).
Following the functional approach, you process the objects inside the RDD using transformations. Transformations construct a new RDD from a previous one.
One example of a transformation is the map method. Using map, you can transform each object of your RDD into any other type of object you need. So, if you have a data structure that represents a row, you can transform that structure into a new one with an added column.
For example, take the following piece of code.
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, String)] =
  sc.parallelize(List(("Hello", "World"), ("Such", "Wow")))

// This new RDD will have one more "column",
// which is the concatenation of the previous two.
val rddWithOneMoreColumn =
  rdd.map {
    case (a, b) =>
      (a, b, a + b)
  }
In this example an RDD of Tuple2 (a.k.a. a pair) is transformed into an RDD of Tuple3, simply by applying a function to each element of the RDD.
Clearly, you have to apply an action to rddWithOneMoreColumn to make the computation happen; Apache Spark evaluates all of your transformations lazily.
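For example, a small illustrative action (not part of the original answer) that forces the map above to actually run:

rddWithOneMoreColumn.collect().foreach(println)
// (Hello,World,HelloWorld)
// (Such,Wow,SuchWow)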
I am new to Spark (version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.
I want my mapped RDD to be grouped by key. When I read online, it is suggested that we should avoid groupByKey and use reduceByKey instead, but when I apply reduceByKey I am not getting a list of values for a given key, as my code expects. Example:
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
And when I will do group by on common key it will result in shuffle which i want to avoid.
I want something like this as output (key, List or Array of values) from my mapped RDD:
rdd.groupByKey()
which gives me the following output:
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get the values concatenated together, like the following. If a pipe ('|') or some other separator character had been there, e.g. (k2,v21|v22), my problem would have been partially solved, but having a list would still be better for good coding practice:
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer to the Spark documentation http://spark.apache.org/docs/latest/programming-guide.html,
for groupByKey it says:
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.”
The word Iterable is important here: when you get the value as (v21, v22), it is an iterable.
Further, it says:
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So from this, what I understand is: if you want the returned RDD to have iterable values, use groupByKey, and if you want a single aggregated value (like a sum), use reduceByKey.
Now, if instead of (String, String) => (K1, V1) your tuples were (String, ListBuffer[String]) => (K1, ListBuffer("V1")), then you could have done rdd.reduceByKey((x, y) => x ++= y).
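As an alternative sketch (not from the original answer), aggregateByKey can build the per-key list directly, without pre-wrapping every value in a ListBuffer:

import scala.collection.mutable.ListBuffer

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"),
                              ("k2", "v22"), ("k3", "v31")))

val grouped = rdd.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v,   // add one value to a partition-local buffer
  (b1, b2) => b1 ++= b2   // merge buffers coming from different partitions
)
// grouped contains (k1,ListBuffer(v11, v21)), (k2,ListBuffer(v21, v22)), (k3,ListBuffer(v31))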
My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first of each value array
Basically looks like this:
...
candidates
  .groupByKey()
  .map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
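Applied to a small example (illustrative data, not from the original post):

val candidates = sc.parallelize(Seq(("k1", "v11"), ("k1", "v21"), ("k2", "v21")))

// Keeps one (arbitrary) value per key without building per-key collections first.
val onePerKey = candidates.reduceByKey((x, _) => x)
// onePerKey contains ("k1", "v11") and ("k2", "v21")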