error: value split is not a member of (String, String) - scala

I have two RDDs, rdd1 and rdd2, both of type RDD[String].
I have performed a cartesian product on the two RDDs in Scala Spark:
val cartesianproduct = rdd1.cartesian(rdd2)
When I run the code below, I get an error.
val splitup = cartesianproduct.map(line => line.split(","))
Below is the error I am getting:
error: value split is not a member of (String, String)

A cartesian join returns an RDD of Tuple2[String, String], so you have to perform the map operation on Tuple2[String, String], not on String. Here is an example of how to handle the tuple in the map function:
val cartesianproduct = rdd1.cartesian(rdd2)
val splitup = cartesianproduct.map{ case (line1, line2) => line1.split(",") ++ line2.split(",")}
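For reference, a minimal runnable sketch (assuming a live SparkContext named sc; the sample rows are made up):

val rdd1 = sc.parallelize(Seq("a,b", "c,d"))
val rdd2 = sc.parallelize(Seq("1,2", "3,4"))

val cartesianproduct = rdd1.cartesian(rdd2)   // RDD[(String, String)]
val splitup = cartesianproduct.map { case (line1, line2) =>
  line1.split(",") ++ line2.split(",")        // one Array[String] per pair
}
splitup.collect().foreach(arr => println(arr.mkString("|")))
// a|b|1|2  (and so on for the other pairs)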

Related

Convert RDD[(String, String, String)] to RDD[(String, (String, String))] in Spark Scala

There are 2 RDDs which I am trying to join.
The join works when there are 2 parameters in each RDD; however, when I add a new parameter to the existingGTINs RDD, I am facing the error below.
Below is the code:
newGTS.collect()
(00070137115045,00070137115045)
(00799999150451,00799999150451)
existingGTS.collect()
(00799999150451,(00003306-808b-46da-bc7f-419c5ae223a7,2016-10-10 10:23:12.0))
(00016700000653,(00006d79-94ea-4651-be0c-0ce77958cd45,2021-05-31 01:20:39.291))
(00923846453024,(0000704b-b40d-4b9e-b266-f7c66723df0e,null))
(00610074049265,(0000a7a1-587c-4b13-a155-7846df82fdee,2020-03-20 12:16:55.873))
(00034100516079,(0002495f-6084-49dd-aadb-20cd137d9694,null))
val join1 = newGTINs.leftOuterJoin(existingGTINs) mapValues {
case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString))
}
error: constructor cannot be instantiated to expected type;
found : (T1, T2, T3)
required: (String, Option[(String, String)])
case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString))
^
PS: UUID.randomUUID.toString --> this function is to create a random id
I am guessing that the newGTINs and existingGTINs used in the join are supposed to be the same as the newGTS and existingGTS shown with the collects.
Since your newGTINs looks to be an RDD[(String, String)] and existingGTINs is an RDD[(String, (String, String))], your newGTINs.leftOuterJoin(existingGTINs) will be an RDD[(String, (String, Option[(String, String)]))].
This means that your mapValues will expect a function (String, Option[(String, String)]) => SomeNewType as a parameter. It can also accept a partial function satisfying similar type semantics.
But your { case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString)) } is a partial function which corresponds to the type (String, String, String) => SomeNewType.
Notice the difference, hence the error. You can fix this by providing an appropriate partial function to satisfy the mapValues requirement.
val join1 =
  newGTINs
    .leftOuterJoin(existingGTINs)
    .mapValues {
      case (gtin, Some((iUUID, createDt))) =>
        // a matching record exists, so keep its existing UUID
        (false, iUUID)
      case (gtin, None) =>
        // what happens for GTINs without a matching element in the existing ones
        (true, UUID.randomUUID.toString)
    }
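As a sanity check, here is a minimal sketch with made-up values (again assuming a SparkContext named sc) showing the value type that mapValues receives after the join:

import java.util.UUID

val newGTINs = sc.parallelize(Seq(("g1", "g1"), ("g2", "g2")))
val existingGTINs = sc.parallelize(Seq(("g1", ("uuid-1", "2016-10-10 10:23:12.0"))))

// leftOuterJoin yields RDD[(String, (String, Option[(String, String)]))]
val join1 = newGTINs.leftOuterJoin(existingGTINs).mapValues {
  case (_, Some((iUUID, _))) => (false, iUUID)                   // match found: reuse the stored UUID
  case (_, None)             => (true, UUID.randomUUID.toString) // no match: mint a fresh UUID
}

join1.collect().foreach(println)
// (g1,(false,uuid-1))
// (g2,(true,<some random uuid>))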

Spark groupBy type mismatch error

I am trying to get the variable GroupsByP to have a certain type. GroupsByP is defined from a db connection select/collect statement which has 3 fields: 2 strings (p and Id) and an int (Order).
The expected result should be of the form Map[p, Set[(Id, Order)]].
val GroupsByP = db.pLinkGroups.collect()
.groupBy(_.p)
.map(group => group._1 -> (group._2.map(_.Id -> group._2.map(_.Order)).toSet))
My desired type for this variable is
Map[String, Set[(String, Int)]]
but the actual type is Map[String, Set[(String, Array[Int])]].
If I got your question right, this should do it:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.map(group => group._1 -> group._2.map(record => (record.Id, record.Order)).toSet)
You should be mapping each record into an (Id, Order) tuple.
A very similar but perhaps more readable implementation might be:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.mapValues(_.map(record => (record.Id, record.Order)).toSet)
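As a quick, Spark-free illustration (using a hypothetical record type in place of the actual db.pLinkGroups rows, since collect() returns a plain array anyway):

// Hypothetical record shape; the real rows come from the db connection.
case class PLinkGroup(p: String, Id: String, Order: Int)

val rows = Array(
  PLinkGroup("p1", "a", 1),
  PLinkGroup("p1", "b", 2),
  PLinkGroup("p2", "c", 3)
)

val groupsByP: Map[String, Set[(String, Int)]] =
  rows.groupBy(_.p)
      .map { case (p, recs) => p -> recs.map(r => (r.Id, r.Order)).toSet }
// Map(p1 -> Set((a,1), (b,2)), p2 -> Set((c,3)))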

How to do flatten in Scala horizontally?

I am trying some basic logic using Scala. I tried the code below but it throws an error.
scala> val data = ("HI",List("HELLO","ARE"))
data: (String, List[String]) = (HI,List(HELLO, ARE))
scala> data.flatmap( elem => elem)
<console>:22: error: value flatmap is not a member of (String, List[String])
data.flatmap( elem => elem)
Expected output:
(HI,HELLO,ARE)
Could someone help me fix this issue?
You are trying to flatMap over a tuple, which won't work. The following will work:
val data = List(List("HI"),List("HELLO","ARE"))
val a = data.flatMap(x => x)
This is trivial in Scala:
val data = ("HI",List("HELLO","ARE"))
println( data._1 :: data._2 )
What exact data structure are you working with?
If you are clear about your data structure:
type rec = (String, List[String])
val data: rec = ("HI", List("HELLO", "ARE"))
val f = (v: rec) => v._1 :: v._2
f(data)
A couple of observations:
Currently there is no flatten method for tuples (unless you use shapeless).
flatMap cannot be applied directly to a list whose elements are a mix of plain values and collections.
In your case, you can make the element "HI" part of a List:
val data = List(List("HI"), List("HELLO","ARE"))
data.flatMap(identity)
Or, you can define a function to handle your mixed element types accordingly:
val data = List("HI", List("HELLO","ARE"))
def flatten(l: List[Any]): List[Any] = l.flatMap {
  case x: List[_] => flatten(x)
  case x => List(x)
}
flatten(data)
You are trying to flatMap on a Tuple2, which is not available in the current API.
If you don't want to change your input, you can extract the first value from the Tuple2 and then extract the individual values from the second tuple element, as below:
val data = ("HI",List("HELLO","ARE"))
val output = (data._1, data._2(0), data._2(1))
println(output)
If that's what you want:
val data = ("HI",List("HELLO,","ARE").mkString(""))
println(data)
>>(HI,HELLO,ARE)

How to convert RDD[(String, String)] into RDD[Array[String]]?

I am trying to append the filename to each record in the file. I thought if the RDD were an Array it would have been easy for me to do.
Some help with converting the RDD type or solving this problem would be much appreciated!
With the (String, String) type:
scala> myRDD.first()(1)
<console>:24: error: (String, String) does not take parameters
       myRDD.first()(1)
With the Array[String] type:
scala> myRDD.first()(1)
res1: String = abcdefgh
My function:
import scala.util.matching.Regex

def appendKeyToValue(x: Array[Array[String]]): Unit = {
  for (i <- 0 to (x.length - 1)) {
    val key = x(i)(0)
    val pattern = new Regex("\\.")
    val key2 = pattern.replaceAllIn(key, "|")
    val tempvalue = x(i)(1)
    val finalval = tempvalue.split("\n")
    for (ab <- 0 to (finalval.length - 1)) {
      // append the filename key to each record in the file
      val result = key2 + "|" + finalval(ab)
    }
  }
}
If you have an RDD[(String, String)], you can access the first field of the first tuple by calling
val firstTupleField: String = myRDD.first()._1
If you want to convert an RDD[(String, String)] into an RDD[Array[String]], you can do the following:
val arrayRDD: RDD[Array[String]] = myRDD.map(x => Array(x._1, x._2))
You may also employ a partial function to destructure the tuples:
val arrayRDD: RDD[Array[String]] = myRDD.map { case (a,b) => Array(a, b) }
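For the original goal of appending the filename to each record, converting to arrays is not strictly necessary; you can work on the tuple directly. A rough sketch, assuming the tuple holds (fileName, fileContent) as wholeTextFiles produces, and reusing the dot-to-pipe replacement from the question:

// myRDD: RDD[(String, String)] of (fileName, fileContent)
val keyed = myRDD.flatMap { case (fileName, content) =>
  val key = fileName.replaceAll("\\.", "|")          // same dot replacement as in the question
  content.split("\n").map(line => key + "|" + line)  // one output record per input line
}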

Spark type mismatch error

I have the function below:
def doSomething(line: RDD[(String, String)]): (String) = {
val c = line.toLocalIterator.mkString
val file2 = KeepEverythingExtractor.INSTANCE.getText(c)
(file2)
}
It's of type org.apache.spark.rdd.RDD[(String, String)] => String.
I have some files stored in HDFS which I have to access as below:
val logData = sc.wholeTextFiles("hdfs://localhost:9000/home/akshat/recipes/recipes/simplyrecipes/*/*/*/*")
It's of type org.apache.spark.rdd.RDD[(String, String)]
I have to map these files using the doSomething function:
val mapper = logData.map(doSomething)
But an error comes out like this:
<console>:32: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, String)] => String
required: ((String, String)) => ?
val mapper = logData.map(doSomething)
^
I have defined in my function what type of input and output I should have, and I am giving input according to that.
Why is this error coming, then, and what should I change to rectify it?
Thanks in advance!
What is passed to the map function is not the RDD[(String, String)] itself but its individual elements, pairs of type (String, String), hence the error. In the same way, when you map over a list you don't get the list itself, but the elements of the list, one by one.
Let's say you want to extract the file path; then what you need is something like this:
def doSomething(x: (String, String)): String = {
  x match {
    case (fname, content) => fname
  }
}
or simply:
logData.map(_._1)
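Putting it together, a sketch of the corrected pipeline (assuming, as in the question, that KeepEverythingExtractor should be applied to each file's content):

// doSomething now takes a single (path, content) pair, which is what map supplies
def doSomething(pair: (String, String)): String = pair match {
  case (path, content) => KeepEverythingExtractor.INSTANCE.getText(content)
}

val logData = sc.wholeTextFiles("hdfs://localhost:9000/home/akshat/recipes/recipes/simplyrecipes/*/*/*/*")
val mapper = logData.map(doSomething)  // RDD[String], one extracted text per file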