I am trying a map operation on a Spark DStream in the code below:
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map(record => {
val hashes: List[Int] = calculateIndexing(record.fields())
val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
(new RecordKey(record.key, hashes.length), ints)
})
The code looks fine in IntelliJ, but when I try to build I get an error which I don't really understand:
Error:(53, 61) type mismatch;
found : Array[Int]
required: scala.reflect.ClassTag[Int]
val ints: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))
This error remains even after I add the type parameter to the map operation, like so:
records.map[(RecordKey, Array[Int])](record => {...
This should fix your problem; it also avoids calling List.length, which is O(N), and uses Array.length instead, which is O(1).
val hashesInRecords: DStream[(RecordKey, Array[Int])] = records.map { record =>
val ints = calculateIndexing(record.fields()).toArray
(new RecordKey(record.key, ints.length), ints)
}
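For context, List.toArray does not take a destination array: its only parameter is an implicit scala.reflect.ClassTag, so the explicit Array[Int] argument lands in that implicit slot, which is exactly what the compiler complains about. A minimal standalone illustration (not from the original post):
val hashes: List[Int] = List(1, 2, 3)
// The implicit scala.reflect.ClassTag[Int] is normally supplied by the compiler:
val ok: Array[Int] = hashes.toArray
// Passing an Array explicitly fills that implicit parameter list instead, producing:
//   found: Array[Int], required: scala.reflect.ClassTag[Int]
// val bad: Array[Int] = hashes.toArray(Array.ofDim[Int](hashes.length))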
I have an RDD named other_nodes, shown below:
(4,(1,true))
(22,(1,true))
(14,(1,true))
(3,(1,true))
(8,(1,true))
(18,(1,true))
I wrote the case class below and applied it on a graph, and it gave the result I wanted:
case class nodes_properties(label:Int, isVisited:Boolean=false)
When I apply the case class on a graph, its result looks like this:
(1,nodes_properties(15,false))
(2,nodes_properties(11,false))
(3,nodes_properties(9,false))
Problem: how can I apply the case class I have defined on the other_nodes RDD to get a result like the one below:
(4,nodes_properties(1,true))
(22,nodes_properties(1,true))
(14,nodes_properties(1,true))
(3,nodes_properties(1,true))
(8,nodes_properties(1,true))
(18,nodes_properties(1,true))
This solution might work:
scala> val data = sc.parallelize(Seq((4,(1, true)),(22,(1,true))))
data: org.apache.spark.rdd.RDD[(Int, (Int, Boolean))] = ParallelCollectionRDD[72] at parallelize at <console>:39
scala> data.take(2)
res27: Array[(Int, (Int, Boolean))] = Array((4,(1,true)), (22,(1,true)))
scala> val data1 = data.map(elem => (elem._1, nodes_properties(elem._2._1, elem._2._2)))
data1: org.apache.spark.rdd.RDD[(Int, nodes_properties)] = MapPartitionsRDD[73] at map at <console>:42
scala> data1.take(2)
res28: Array[(Int, nodes_properties)] = Array((4,nodes_properties(1,true)), (22,nodes_properties(1,true)))
EDIT
The problem is that each element in others_rdd is of type (VertexId, Any). You need to convert it to type (VertexId, (Int, Boolean)) in order for your case class to apply. The way to do that is:
val newRdd = others_rdd.map(elem => (elem._1, elem._2.asInstanceOf[(Int,Boolean)]))
After performing this, you can apply the solution shown above by mapping to the nodes_properties class.
Let me know if it helps!!
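For completeness, here is a minimal sketch that combines the cast and the case-class construction in a single map (assuming, as in the EDIT, that the input RDD has type RDD[(VertexId, Any)]):
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

case class nodes_properties(label: Int, isVisited: Boolean = false)

// Cast each Any payload to (Int, Boolean), then wrap it in the case class.
def toProperties(other_nodes: RDD[(VertexId, Any)]): RDD[(VertexId, nodes_properties)] =
  other_nodes.map { case (id, value) =>
    val (label, visited) = value.asInstanceOf[(Int, Boolean)]
    (id, nodes_properties(label, visited))
  }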
First, I read a text file and turn it into RDD[(String,(String,Float))]:
val data = sc.textFile(dataInputPath)
val dataRDD: RDD[(String, (String, Float))] = data.map { f =>
  val temp = f.split("//x01")
  (temp(0), (temp(1), temp(2).toInt))
}
Then I run the following code to turn my data into the Rating type:
import org.apache.spark.mllib.recommendation.Rating
val imeiMap = dataRDD.reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
val ratings = dataRDD.map{case (imei, (doc_id,rating))=> Rating(imeiMap(imei),docidMap(doc_id),rating)};
But I got an error:
Error:(32, 77) type mismatch;
found : String
required: (String, (String, Float))
val ratings = dataRDD.map{case (imei, (doc_id,rating))=> Rating(imeiMap(imei),docidMap(doc_id),rating)};
Why does this happen? I think the String has already been changed to (String, (String, Float)).
The key of docidMap is not a String; it is a tuple (String, Int).
This is because you have the zipWithIndex before the .toMap method:
With this RDD as input for a quick test:
(String1,( String2,32.0))
(String1,( String2,35.0))
scala> val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
docidMap: scala.collection.immutable.Map[(String, Int),Int] = Map((" String2",1) -> 0)
scala> val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().toMap;
docidMap: scala.collection.immutable.Map[String,Int] = Map(" String2" -> 1)
The same will happen with your imeiMap; it seems that you just need to remove the zipWithIndex from there too:
val imeiMap = dataRDD.reduceByKey((s1,s2)=>s1).collect.toMap
It is not about your dataRDD; it is about imeiMap:
imeiMap: scala.collection.immutable.Map[(String, (String, Float)),Int]
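If the zipWithIndex was there to assign integer IDs for Rating, note that removing it also changes the map values. Here is a minimal sketch that keeps integer indices while using plain String keys (this is an assumption about the original intent, not part of the answer above):
// Assumption: the goal is a Map[String, Int] from each distinct key to a unique index.
val imeiMap: Map[String, Int] =
  dataRDD.keys.distinct().collect().zipWithIndex.toMap
val docidMap: Map[String, Int] =
  dataRDD.map(_._2._1).distinct().collect().zipWithIndex.toMap
val ratings = dataRDD.map { case (imei, (doc_id, rating)) =>
  Rating(imeiMap(imei), docidMap(doc_id), rating)
}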
Let's say I have the following DataFrame:
var randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3))
val df = sc.parallelize(randomData,2).toDF()
and I have this function, which will be an input for mapPartitions:
def trialIterator(row:Iterator[(String,Int)]): Iterator[(String,Int)] =
row.toArray.tail.toIterator
And using mapPartitions:
df.mapPartitions(trialIterator)
I get the following error message:
Type mismatch, expected: (Iterator[Row]) => Iterator[NotInferedR], actual: (Iterator[(String, Int)]) => Iterator[(String, Int)]
I can understand that this is happening due to the input and output types of my function, but how do I solve this?
If you want strongly typed input, don't use Dataset[Row] (DataFrame) but Dataset[T], where T in this particular scenario is (String, Int). Also don't convert to an Array, and don't blindly call tail without knowing whether the partition is empty:
def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)
randomData
.toDS // org.apache.spark.sql.Dataset[(String, Int)]
.mapPartitions(trialIterator _)
or
randomData.toDF // org.apache.spark.sql.Dataset[Row]
.as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
.mapPartitions(trialIterator _)
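To see the typed approach end to end, here is a self-contained sketch that builds its own local SparkSession (the session setup and object name are illustrative, not from the original post):
import org.apache.spark.sql.SparkSession

object MapPartitionsExample {
  // Drop the first element of each partition; drop(1) is safe even on empty partitions.
  def trialIterator(iter: Iterator[(String, Int)]): Iterator[(String, Int)] =
    iter.drop(1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mapPartitionsExample").getOrCreate()
    import spark.implicits._

    val randomData = Seq(("a", 8), ("h", 5), ("f", 3), ("a", 2), ("b", 8), ("c", 3))
    val ds = randomData.toDS().repartition(2) // Dataset[(String, Int)] with two partitions

    ds.mapPartitions(trialIterator _).show()
    spark.stop()
  }
}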
You are expecting type Iterator[(String,Int)] while you should expect Iterator[Row]:
def trialIterator(row: Iterator[Row]): Iterator[Row] = {
  row.next() // assumes the partition is non-empty
  row // seems to do the same thing w/o all the conversions
}
I have the following class in the Scala shell in Spark:
class StringSplit(val query:String)
{
def getStrSplit(rdd:RDD[String]):RDD[String]={
rdd.map(x=>x.split(query))
}
}
I am trying to call the method in this class like this:
val inputRDD=sc.parallelize(List("one","two","three"))
val strSplit=new StringSplit(",")
strSplit.getStrSplit(inputRDD)
-> This step fails with the error: getStrSplit is not a member of StringSplit.
Can you please let me know what is wrong with this?
It seems like a reasonable thing to do, but...
the result type for getStrSplit is wrong, because .split returns Array[String], not String
parallelizing List("one","two","three") results in "one", "two" and "three" being stored, and none of those strings needs a comma split.
Another way:
val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
The test input here is a list of two strings that each require some comma splitting to get to the data.
To parse input by splitting can be as easy as:
val parsedInput = input.map(_.split(","))
parsedInput: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[19] at map at <console>:25
Here _.split(",") is an anonymous function with one parameter _, where Scala infers the types from the other calls rather than the types being explicitly defined.
Notice the type is RDD[Array[String]] not RDD[String]
We could extract the 3rd element of each line with
parsedInput.map(_(2)).collect()
res27: Array[String] = Array(3, 7)
So how about the original question: doing the same operation in a class? I tried:
class StringSplit(query:String){
def get(rdd:RDD[String]) = rdd.map(_.split(query));
}
val ss = new StringSplit(",");
ss.get(input);
---> org.apache.spark.SparkException: Task not serializable
I'm guessing that occurs because the class is not serialized to each worker: Spark tries to send the split function, but that function refers to a constructor parameter of the class, and the class instance is not sent along with it.
scala> class commaSplitter {
def get(rdd:RDD[String])=rdd.map(_.split(","));
}
defined class commaSplitter
scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter#262f1580
scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10
scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
This parameter-free class works.
EDIT
You can tell Scala you want your class to be serializable by adding extends Serializable, like so:
scala> class stringSplitter(s:String) extends Serializable {
def get(rdd:RDD[String]) = rdd.map(_.split(s));
}
defined class stringSplitter
scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter#2a33abcd
scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10
scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
and this works.
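Another common workaround (a sketch, not part of the answer above) is to copy the constructor parameter into a local val inside the method, so the closure captures only a String rather than the enclosing, non-serializable class instance:
import org.apache.spark.rdd.RDD

class StringSplit(query: String) {
  def getStrSplit(rdd: RDD[String]): RDD[Array[String]] = {
    val q = query // the closure captures only this local String, not `this`
    rdd.map(_.split(q))
  }
}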
The following scala code compiles fine.
object Main extends App {
import scala.collection.mutable.IndexedSeq
def doIt() {
val nums: IndexedSeq[Int] = Array(3,5,9,11)
val view: IndexedSeq[Int] = nums.view
val half: IndexedSeq[Int] = view.take(2)
val grouped: Iterator[IndexedSeq[Int]] = half.grouped(2)
val firstPair: IndexedSeq[Int] = grouped.next() //throws exception here
}
doIt()
}
However, at runtime it throws java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeq
on the call grouped.next()
I would expect the call to grouped.next() to return something equal to IndexedSeq[Int](3,5)
I am wondering why this code fails, and whether there is a proper way to fix it.
If I repeat the same steps in the REPL, the type information confirms why the code compiles, but it does not give me any insight into why the exception was thrown:
scala> val nums = Array(3,5,9,11)
nums: Array[Int] = Array(3, 5, 9, 11)
scala> val view = nums.view
view: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqView(...)
scala> val half = view.take(2)
half: scala.collection.mutable.IndexedSeqView[Int,Array[Int]] = SeqViewS(...)
scala> val grouped = half.grouped(2)
grouped: Iterator[scala.collection.mutable.IndexedSeqView[Int,Array[Int]]] = non-empty iterator
scala> val firstPair = grouped.next()
java.lang.ClassCastException: scala.collection.SeqViewLike$$anon$1 cannot be cast to scala.collection.mutable.IndexedSeqView
Scala version 2.10.0-20121205-235900-18481cef9b -- Copyright 2002-2012, LAMP/EPFL
Looks like you ran into bug SI-6709.
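A possible workaround (a sketch, not from the original answer) is to force the lazy view into a concrete collection before calling grouped, so the iterator no longer has to cast a view to a mutable IndexedSeq; note this also uses the default immutable IndexedSeq instead of scala.collection.mutable.IndexedSeq:
object Main extends App {
  def doIt(): Unit = {
    val nums = Array(3, 5, 9, 11)
    // Materialize the first two elements eagerly instead of keeping a lazy view.
    val half: IndexedSeq[Int] = nums.view.take(2).toIndexedSeq
    val grouped: Iterator[IndexedSeq[Int]] = half.grouped(2)
    val firstPair: IndexedSeq[Int] = grouped.next() // IndexedSeq(3, 5)
    println(firstPair)
  }
  doIt()
}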