Spark type mismatch error - Scala

I have the function below:
def doSomething(line: RDD[(String, String)]): (String) = {
  val c = line.toLocalIterator.mkString
  val file2 = KeepEverythingExtractor.INSTANCE.getText(c)
  (file2)
}
It's of type org.apache.spark.rdd.RDD[(String, String)] => String
I have some files stored in HDFS which I have to access as below:
val logData = sc.wholeTextFiles("hdfs://localhost:9000/home/akshat/recipes/recipes/simplyrecipes/*/*/*/*")
It's of type org.apache.spark.rdd.RDD[(String, String)]
I have to map these files with the doSomething function:
val mapper = logData.map(doSomething)
But an error comes out like this:
<console>:32: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, String)] => String
required: ((String, String)) => ?
val mapper = logData.map(doSomething)
^
I have defined in my function what types the input and output should have, and I am giving input of exactly that type.
Why does this error occur, and what should I change to rectify it?
Thanks in advance!

What is passed to the map function is not the RDD[(String, String)] itself but its elements, pairs of (String, String); hence the error. In the same way, when you map over a list you don't get the list itself, but its elements, one by one.
Let's say you want to extract the file path; then what you need is something like this:
def doSomething(x: (String, String)): String = {
  x match {
    case (fname, content) => fname
  }
}
or simply:
logData.map(_._1)
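If the goal is instead to run the boilerpipe extractor on each file's content, as the original doSomething suggests, a minimal sketch (assuming KeepEverythingExtractor.INSTANCE.getText takes a String, as in the question) would be:

def doSomething(pair: (String, String)): String = pair match {
  // wholeTextFiles yields (filePath, fileContent) pairs; extract text from the content
  case (path, content) => KeepEverythingExtractor.INSTANCE.getText(content)
}

val mapper = logData.map(doSomething)  // RDD[String], one extracted text per file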


Maps and Flatmaps in Scala

I am new to Scala. I need a lot of help with using map and flatMap with tuples.
I have functions as follows:
def extract(url: String): String = {//some code}
def splitstring(content: String): Array[String]={//some code}
def SentenceDetect(paragraph: Array[String]): Array[String] = {//some code}
def getMd5(literal: String): String = {//some code}
I have an incoming list of URLs, and I want it to go through the above series of functions using map and flatMap.
var extracted_content=url_list.map(url => (url,extract(url)))
val trimmed_content=extracted_content.map(t => (t._1,splitstring(t._2)))
val sentences=trimmed_content.map(t => (t._1,SentenceDetect(t._2)))
val hashed_values=sentences.flatMap(t => (t._1,getMd5(t._2)))
The issue is that I am getting an error at flatMap with a type mismatch:
Error:(68, 46) type mismatch;
found : (String, String)
required: scala.collection.GenTraversableOnce[?]
val hashed_values=sentences.flatMap(t => (t._1,getMd5(t._2.toString)))
How can I get this done?
I think this is what you're after.
val hashed_values = sentences.map(t => (t._1, t._2.map(getMd5)))
This should result in type List[(String,Array[String])]. This assumes that you actually want the Md5 calculation of each element in the t._2 array.
Recall that the signature of flatMap() is flatMap(f: (A) ⇒ GenTraversableOnce[B]); in other words, it takes a function that takes an element and returns a collection of transformed elements. A tuple, (String, String), is not GenTraversableOnce, thus the error you're getting.
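A tiny illustration with plain Scala collections of why flatMap needs a collection-returning function (this snippet is purely for demonstration, not from the question):

List(1, 2, 3).map(n => List(n, n * 10))      // List(List(1, 10), List(2, 20), List(3, 30))
List(1, 2, 3).flatMap(n => List(n, n * 10))  // List(1, 10, 2, 20, 3, 30)
// List(1, 2, 3).flatMap(n => (n, n * 10))   // does not compile: a tuple is not GenTraversableOnce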
You are getting this error because getMd5(...) accepts a String; however, sentences is of type List[(String, Array[String])] (assuming url_list is a List[String]), so t._2 is of type Array[String].
Anyway, some notes regarding your code:
variable names in Scala are "lower camel case" (https://docs.scala-lang.org/style/naming-conventions.html), not lower-with-underscores
extracted_content should probably be a val
since all your variables are maps, and since you want to transform each map's values, you'd be better off using .mapValues instead of .map (see the sketch below)
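Putting it together, a minimal sketch of the whole pipeline (assuming url_list is a List[String], here renamed urlList, and the four functions declared above):

val urlList: List[String] = List("http://example.com")  // hypothetical input
val extractedContent = urlList.map(url => (url, extract(url)))
val trimmedContent   = extractedContent.map { case (url, content) => (url, splitstring(content)) }
val sentences        = trimmedContent.map { case (url, parts) => (url, SentenceDetect(parts)) }
// map, not flatMap: we want one (url, hashes) pair per URL, hashing each sentence
val hashedValues     = sentences.map { case (url, sents) => (url, sents.map(getMd5)) }
// hashedValues: List[(String, Array[String])]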

error: value split is not a member of (String, String)

I have two RDDs, rdd1 and rdd2, of type RDD[String].
I have performed a cartesian product on the two RDDs in Scala Spark:
val cartesianproduct = rdd1.cartesian(rdd2)
When I run the code below, I get an error.
val splitup = cartesianproduct.map(line => line.split(","))
Below is the error I am getting:
error: value split is not a member of (String, String)
cartesian returns an RDD of Tuple2[String, String], so you have to perform the map operation on a Tuple2[String, String], not on a String. Here is an example of how to handle the tuple in the map function:
val cartesianproduct = rdd1.cartesian(rdd2)
val splitup = cartesianproduct.map{ case (line1, line2) => line1.split(",") ++ line2.split(",")}
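A self-contained sketch of what this produces (the sample data is hypothetical, and sc is assumed to be a live SparkContext, e.g. in spark-shell):

val rdd1 = sc.parallelize(Seq("a,b", "c,d"))
val rdd2 = sc.parallelize(Seq("1,2"))
val splitup = rdd1.cartesian(rdd2).map { case (line1, line2) => line1.split(",") ++ line2.split(",") }
// splitup.collect() => Array(Array(a, b, 1, 2), Array(c, d, 1, 2))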

Spark groupBy type mismatch error

I am trying to get this variable, GroupsByP, to have a certain type. GroupsByP is built from a db-connection select/collect statement whose rows have 3 fields: two strings (p and Id) and an int (Order).
The expected result should be of the form Map[p, Set[(Id, Order)]]:
val GroupsByP = db.pLinkGroups.collect()
.groupBy(_.p)
.map(group => group._1 -> (group._2.map(_.Id -> group._2.map(_.Order)).toSet))
My desired type for this variable is
Map[String, Set[(String, Int)]]
but the actual type is Map[String, Set[(String, Array[Int])]].
If I got your question right, this should do it:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.map(group => group._1 -> group._2.map(record => (record.Id, record.Order)).toSet)
You should be mapping each record into a (Id, Order) tuple.
A very similar but perhaps more readable implementation might be:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.mapValues(_.map(record => (record.Id, record.Order)).toSet)
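A runnable sketch of the same logic with a hypothetical record type standing in for the db.pLinkGroups rows (field names taken from the snippet above):

case class PLink(p: String, Id: String, Order: Int)

val input = Seq(PLink("a", "x", 1), PLink("a", "y", 2), PLink("b", "z", 3))
val groupsByP: Map[String, Set[(String, Int)]] =
  input.groupBy(_.p)
       .map { case (p, records) => p -> records.map(r => (r.Id, r.Order)).toSet }
// Map(a -> Set((x,1), (y,2)), b -> Set((z,3)))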

Type mismatch in Scala Map from getOrElse returning Equals

I have this Scala snippet from a custom mapper (for use in a Spark mapPartitions) I'm writing to compute histograms of multiple Int fields simultaneously.
def multiFeatureHistogramFunc(iter: Iterator[Row]): Iterator[(Int, (Int, Long))] = {
  var featureHistMap: Map[Int, (Int, Long)] = Map()
  while (iter.hasNext) {
    val cur = iter.next
    indices.foreach { index: Int =>
      val v: Int = if (cur.isNullAt(index)) -100 else cur.getInt(index)
      var featureHist: Map[Int, Long] = featureHistMap.getOrElse(index, Map())
      val newCount = featureHist.getOrElse(v, 0L) + 1L
      featureHist += (v -> newCount)
      featureHistMap += (index -> featureHist)
    }
  }
  featureHistMap.iterator
}
But the error I'm getting is this:
<console>:49: error: type mismatch;
found : Equals
required: Map[Int,Long]
var featureHist:Map[Int, Long] =
featureHistMap.getOrElse(index, Map())
^
I couldn't find the answer to this specific issue. It looks to me like the default parameter in featureHistMap.getOrElse has a different type than the value type of featureHistMap itself, and their closest common supertype is Equals, which causes the type mismatch. I tried a number of different things, like changing the default parameter to a more specific type, but that just caused a different error.
Can someone explain what's going on here and how to fix it?
The problem is that you declared your featureHistMap as Map[Int, (Int, Long)] - note that you are mapping an Int to a pair (Int, Long). Later, you try to retrieve a value from it as a Map[Int, Long], instead of a pair (Int, Long).
You either need to redeclare the type of featureHistMap to Map[Int, Map[Int, Long]], or the type of featureHist to (Int, Long).
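A minimal sketch of the first fix, redeclaring featureHistMap as Map[Int, Map[Int, Long]] (note the iterator's element type changes accordingly; indices: Seq[Int] is assumed to be in scope, as in the original snippet):

import org.apache.spark.sql.Row

def multiFeatureHistogramFunc(iter: Iterator[Row]): Iterator[(Int, Map[Int, Long])] = {
  // one histogram (value -> count) per column index
  var featureHistMap: Map[Int, Map[Int, Long]] = Map()
  while (iter.hasNext) {
    val cur = iter.next()
    indices.foreach { index =>
      val v: Int = if (cur.isNullAt(index)) -100 else cur.getInt(index)
      val featureHist = featureHistMap.getOrElse(index, Map.empty[Int, Long])
      featureHistMap += (index -> (featureHist + (v -> (featureHist.getOrElse(v, 0L) + 1L))))
    }
  }
  featureHistMap.iterator
}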

How to convert RDD[(String, String)] into RDD[Array[String]]?

I am trying to append the filename to each record in the file. I thought that if the RDD were an Array it would be easy for me to do.
Some help with converting the RDD type or solving this problem would be much appreciated!
With the (String, String) type:

scala> myRDD.first()(1)
<console>:24: error: (String, String) does not take parameters
       myRDD.first()(1)

With the Array[String] type:

scala> myRDD.first()(1)
res1: String = abcdefgh
My function:
import scala.util.matching.Regex

def appendKeyToValue(x: Array[Array[String]]): Unit = {
  for (i <- 0 to (x.length - 1)) {
    val key = x(i)(0)
    val pattern = new Regex("\\.")
    val key2 = pattern.replaceAllIn(key, "|")
    val tempvalue = x(i)(1)
    val finalval = tempvalue.split("\n")
    for (ab <- 0 to (finalval.length - 1)) {
      // append the cleaned filename key to each record in the file
      val result = key2 + "|" + finalval(ab)
    }
  }
}
If you have an RDD[(String, String)], you can access the first field of the first tuple by calling:
val firstTupleField: String = myRDD.first()._1
If you want to convert an RDD[(String, String)] into an RDD[Array[String]], you can do the following:
val arrayRDD: RDD[Array[String]] = myRDD.map(x => Array(x._1, x._2))
You may also employ a partial function to destructure the tuples:
val arrayRDD: RDD[Array[String]] = myRDD.map { case (a,b) => Array(a, b) }
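Building on that, a hedged sketch of the original goal, prepending the cleaned filename to every line of each file, mirroring the appendKeyToValue logic above:

val appended: RDD[String] = myRDD.flatMap { case (fname, content) =>
  val key = fname.replaceAll("\\.", "|")            // same dot-to-pipe cleanup as in the question
  content.split("\n").map(line => key + "|" + line) // one output record per line, prefixed with the key
}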