I have an RDD of Strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18],
and I want to convert it to a map with unique ids. I did 'val vertexMAp = vertices.zipWithUniqueId',
but this gave me another RDD of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to 'Map[String, Long]'?
Thanks
There's a built-in collectAsMap function in PairRDDFunctions that will give you a Map of the key/value pairs in the RDD.
val vertexMAp = vertices.zipWithUniqueId.collectAsMap
It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to go to the driver, and to be able to do that, they need to fit in the driver's memory.
From the comments, it looks like in your case you need to deal with a large dataset. Making a Map out of it is not going to work, as it won't fit in the driver's memory, causing OOM exceptions if you try.
You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:
import org.apache.spark.SparkContext._ // import implicits conversions to support PairRDDFunctions
val vertexMap = vertices.zipWithUniqueId
val vertixYId = vertexMap.lookup("vertexY")
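Note that lookup returns a Seq[Long] here, because a key can appear more than once in a pair RDD; if you expect at most one id per vertex, you can take vertixYId.headOption.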
Collect to the "local" machine and then convert the Array[(String, Long)] to a Map:
val rdd: RDD[String] = ???
val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
You do not need to convert anything. The implicits for PairRDDFunctions detect a two-tuple-based RDD and apply the PairRDDFunctions methods automatically.
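For example (a minimal sketch, assuming vertices is the RDD[String] from the question), the pair-RDD methods are available directly on the zipped RDD:
val vertexMap = vertices.zipWithUniqueId          // RDD[(String, Long)]
val ids       = vertexMap.lookup("vertexY")       // Seq[Long], via the PairRDDFunctions implicit
val smallMap  = vertexMap.collectAsMap()          // scala.collection.Map[String, Long]; only safe for small data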
Related
I have a scenario where I capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Let's see it with example data (empnum, empname, emplocation, empsal) in a text file.
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD of type RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So, my requirement is to create another RDD with only empnum, empname, emplocation (again as RDD[String]).
For that I have tried the code below, but I am getting RDD[(String, String, String)].
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have also tried slice, but it gives me RDD[Array[String]].
My required RDD should be of type RDD[String] so I can pass it to the required Scala class to do some operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me achieve this?
I would try this
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)})
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concisely, you could take the first 3 elements of the array and use the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
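One caveat: map with a pattern match will throw a MatchError on any line that doesn't have three comma-separated fields. If that can happen, a small sketch using the collect transformation (the partial-function overload, not the collect() action) simply drops the non-matching lines:
val r = "([^,]*),([^,]*),([^,]*).*".r
val cleaned = empRDD.collect { case r(id, name, city) => s"$id,$name,$city" }  // skips malformed lines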
I have the following syntax:
val data = sc.textFile("log1.txt,log2.txt")
val s = Seq(data)
val par = sc.parallelize(s)
The result that I obtained is as follows:
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[2] at parallelize at <console>:28
Question 1
How does a parallel collection work?
Question 2
Can I iterate through them and perform transformations?
Question 3
RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
What does this mean?
(An interesting case indeed)
When in doubt, I always recommend following the types in Scala (after all, the types are why we Scala developers use the language in the first place, aren't they?)
So, let's reveal the types:
scala> val data = sc.textFile("log1.txt,log2.txt")
data: org.apache.spark.rdd.RDD[String] = log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val s = Seq(data)
s: Seq[org.apache.spark.rdd.RDD[String]] = List(log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24)
scala> val par = sc.parallelize(s)
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[3] at parallelize at <console>:28
As the warning told you, org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] is not a supported case in Spark (it was, however, accepted by the Scala compiler, since it matches the signature of the SparkContext.parallelize method... unfortunately).
You don't really need val s = Seq(data), since the records in the two files log1.txt and log2.txt are already "inside" an RDD, and Spark will process all the records in both files in a distributed and parallel manner (which I believe is your use case).
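If you really do start with more than one RDD (say one per log file), the supported way to combine them is union rather than parallelize. A minimal sketch, with the filter condition as an arbitrary example:
val log1 = sc.textFile("log1.txt")
val log2 = sc.textFile("log2.txt")
val all  = log1.union(log2)                      // still RDD[String], no nesting
val errors = all.filter(_.contains("ERROR"))     // transformations and actions work as usual
println(errors.count())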
I think that answers all three questions; they are based on the same false expectation and hence are pretty much alike :)
I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
That's what the spark.implicits object is for. It allows you to convert common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2D list. Here is something I tried in spark-shell. I converted the 2D List to a List of Tuples and used the implicit conversion to DataFrame:
val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"), List("4", "4")).map(x => (x(0), x(1)))
import spark.implicits._
val df = values.toDF
Edit 2: The original question by MTT was "How to create a Spark dataframe from a Scala list" for a 2D list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. I'm adding this edit so that anyone looking for something similar to the original question can find it.
As zero323 mentioned, we need to first convert the List[Iterable[Any]] to a List[Row], then put the rows in an RDD, and prepare a schema for the Spark DataFrame.
To convert List[Iterable[Any]] to List[Row], we can say
val rows = values.map(x => Row(x.toSeq: _*))
and then, given a schema (referred to as schema below), we can make an RDD
val rdd = sparkContext.makeRDD(rows)
and finally create a Spark DataFrame:
val df = sqlContext.createDataFrame(rdd, schema)
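Putting it together, a minimal sketch (the example values, column names, and StringType columns are assumptions to adjust to your data; sparkContext and sqlContext are taken to be in scope as above):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val values: List[Iterable[Any]] = List(List("1", "One"), List("2", "Two"))
val rows   = values.map(x => Row(x.toSeq: _*))            // List[Row]
val schema = StructType(Seq(                              // one field per column
  StructField("id",   StringType, nullable = true),
  StructField("name", StringType, nullable = true)))
val rdd = sparkContext.makeRDD(rows)                      // RDD[Row]
val df  = sqlContext.createDataFrame(rdd, schema)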
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use a Dataset by just converting the list to a DS with the toDS API:
import spark.implicits._ // needed for toDS
val ds = list.flatMap(_.split(",")).toDS() // records split by comma
or
val ds = list.toDS()
This is more convenient than an RDD or DataFrame.
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))
I'm practicing doing sorts in the Spark shell. I have an RDD with about 10 columns/variables. I want to sort the whole RDD on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (string values) and the full original row (array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now, if you want the original record back, you need to get it from this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]...), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort. This way you do not need to bother with tuples and such.
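A sketch of that approach end to end, assuming rdd is the RDD[Array[String]] from the question (the output path is just a placeholder):
val sorted = rdd.sortBy(_(7))                                     // sort the rows by column 7
sorted.map(_.mkString(",")).saveAsTextFile("/tmp/sorted-output")  // or .collect() if the result is small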
If you want to sort the RDD using the 7th string in the array, you can just do it directly with
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that this is not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd"))
Just do this:
val rdd4 = rdd3.map(_._2)
I thought you might not be familiar with Scala,
so the code below should help you understand more:
rdd3.map(kv => {
  println(kv._1) // kv._1 is the String key (column 7)
  println(kv._2) // kv._2 is the Array[String] with the original row
})
Maybe I am missing something, but I expected the data to be sorted based on the key.
scala> val x=sc.parallelize(Array( "cat", "ant", "1"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at <console>:22
scala> val xxx=x.map(v=> (v,v.length))
xxx: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[161] at map at <console>:26
scala> xxx.sortByKey().foreach(println)
(1,1)
(cat,3)
(ant,3)
scala> xxx.sortByKey().foreach(println)
(cat,3)
(1,1)
(ant,3)
It works if I tell Spark to use only 1 partition, as below, but how do I make this work in a cluster or with more than 1 worker?
scala> xxx.sortByKey(numPartitions=1).foreach(println)
(1,1)
(ant,3)
(cat,3)
UPDATE:
I think I got the answer. It is being sorted correctly, as it works when I use collect:
scala> xxx.sortByKey().collect
res170: Array[(String, Int)] = Array((1,1), (ant,3), (cat,3))
Keeping the question open to validate my understanding.
That makes sense. foreach runs in parallel across the partitions, which creates non-deterministic output ordering; the printed order may be mixed. collect gives you an array of the partitions concatenated in their sorted order.
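If you want to see the sorted output on the driver without collecting everything at once, one option (a minimal sketch) is toLocalIterator, which fetches one partition at a time in partition order:
xxx.sortByKey().toLocalIterator.foreach(println) // partitions arrive in key order, so the output prints globally sorted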
Have a look at the Spark documentation to see why the collect() method fixed the issue for you.
e.g.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Calling collect() on the resulting RDD will return an ordered list of records.
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Remember that doing a collect() action on a very large distributed RDD can cause your driver program to run out of memory and crash. So, do not use collect() except when you are prototyping your Spark program on a small dataset.
Have a look at this article for more details
EDIT:
sortByKey(): sorts the RDD by key, so that each partition contains a sorted range of the elements. Since all partitions may not reside on the same executor node, you will not see a globally ordered set from foreach unless you call collect().
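To see what "each partition contains a sorted range" means in practice, you can tag every element with its partition index (a small sketch):
xxx.sortByKey()
  .mapPartitionsWithIndex((idx, it) => it.map(kv => (idx, kv)))
  .collect()
  .foreach(println) // keys are sorted within each partition, and lower partition indexes hold the smaller keys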