SCALA : Read the text file and create tuple of it - scala

How to create a tuple from the below-existing RDD?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert

If you are intending to have Array[Tuples4] then you can do the following
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each fields as tuples
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have variable sized input file as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can write match case pattern matching as
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As #eliasah pointed that above procedure is a bad practice which is using product iterator. As his suggestion we should know the maximum elements of the input data and use following logic where we assign default values for no elements
val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as #philantrovert pointed out, we can verify the output in the following way, if we are not using REPL
arrayTuples.foreach(println)
which results to
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)

Related

How to convert var to List?

How to convert one var to two var List?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result should be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, input is not a valid json
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to make it valid json RDD ( as you are going to use apache spark)
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
Once you have valid json rdd, you can easily convert that to dataframe and then apply the logic you have
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
.groupBy("level")
.agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
.select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get desired output in dataframe as
+------------------------------------------------+--------------------------------------------+
|var1 |var2 |
+------------------------------------------------+--------------------------------------------+
|[WrappedArray(name1, name2), WrappedArray(add1)]|[WrappedArray(id1, id2), WrappedArray(city)]|
+------------------------------------------------+--------------------------------------------+
And you can convert the array to list if required
To get the values as in the question, you can do the following
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name1, name2), List(add1))
val second = rdd(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id1, id2), List(city))
I hope the answer is helpful
reduceByKey is the important function to achieve your required output. More explaination on step by step reduceByKey explanation
You can do the following
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
(values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))

Scala: GraphX: error: class Array takes type parameters

I am trying to build an Edge RDD for GraphX. I am reading a csv file and converting to DataFrame Then trying to convert to an Edge RDD:
val staticDataFrame = spark.
read.
option("header", true).
option("inferSchema", true).
csv("/projects/pdw/aiw_test/aiw/haris/Customers_DDSW-withDN$.csv")
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] =
staticDataFrame.select(
"dealer_customer_number",
"parent_dealer_cust_number",
"dealer_code"
).map{ (row: Array) =>
Edge((
row.getAs[Long]("dealer_customer_number"),
row.getAs[Long]("parent_dealer_cust_number"),
row("dealer_code")
))
}
But I am getting this error:
<console>:81: error: class Array takes type parameters
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] = staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").map((row: Array) => Edge((row.getAs[Long]("dealer_customer_number"), row.getAs[Long]("parent_dealer_cust_number"), row("dealer_code"))))
^
The result for
staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").take(1)
is
res3: Array[org.apache.spark.sql.Row] = Array([0000101,null,B110])
First, Array takes type parameters, so you would have to write Array[Something]. But this is probably not what you want anyway.
The dataframe is a Dataset[Row], not a Dataset[Array[_]], therefore you have to change
.map{ (row: Array) =>
to
.map{ (row: Row) =>
Or just omit the typing completely (it should be inferred):
.map{ row =>

Separate RDD Based on Membership of Id in Another RDD

I have two case classes and one RDD of each.
case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e: java.util.Date, f: Double)
val rdd1 = // Loads an rdd of type RDD[Thing1]
val rdd2 = // Loads an rdd of type RDD[Thing2]
I want to create 2 new RDD[Thing1]s, 1 that contains elements of rdd1 where the element has an Id present in rdd2, and another that contains elements of rdd1 where the element does not have an Id present in rdd2
Here's what I have tried (looked at this, Scala Spark contains vs. does not contain, and other stack overflow posts, but none have worked)
val rdd2_ids = rdd2.map(r => r.Id)
val rdd1_present = rdd1.filter{case r => rdd2 contains r.Id}
val rdd1_absent = rdd1.filter{case r => !(rdd2 contains r.Id)}
But this gets me the error error: value contains is not a member of org.apache.spark.rdd.RDD[String]
I have seen many questions on SO asking how to do similar things to what I am trying to do, but none have worked for me. I get the value _____ is not a member of org.apache.spark.rdd.RDD[String] error a lot.
Why are these other answers not working for me, and how can I achieve what I am trying to do?
I created two simple RDDs
val rdd1 = sc.parallelize(Array(
| Thing1(1,2),
| Thing1(2,3),
| Thing1(3,4) ))
rdd1: org.apache.spark.rdd.RDD[Thing1] = ParallelCollectionRDD[174] at parallelize
val rdd2 = sc.parallelize(Array(
| Thing2(1, "Two"),
| Thing2(2, "Three" ) ))
rdd2: org.apache.spark.rdd.RDD[Thing2] = ParallelCollectionRDD[175] at parallelize
Now you can join them by the respective element for which you want to find the common value in both :
val rdd1_present = rdd1.keyBy(_.a).join(rdd2.keyBy(_.a) ).map{ case(a, (t1, t2) ) => t1 }
//rdd1_present.collect
//Array[Thing1] = Array(Thing1(2,3), Thing1(1,2))
val rdd1_absent = rdd1.keyBy(_.a).subtractByKey(rdd1_present.keyBy(_.a) ).map{ case(a,t1) => t1 }
//rdd1_absent.collect
//Array[Thing1] = Array(Thing1(3,4))
Try full outer join-
val joined = rdd1.map(s=>(s.id,s)).fullOuterJoin(rdd2.map(s=>(s.id,s))).cache()
//only in left
joined.filter(s=> s._2._2.isEmpty).foreach(println)
//only in right
joined.filter(s=>s._2._1.isEmpty).foreach(println)
//in both
joined.filter(s=> !s._2._1.isEmpty && !s._2._2.isEmpty).foreach(println)

how to join two datasets by key in scala spark

I have two datasets and each dataset have two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets by using first column 'name.' I have tried to do this for a couple of hours, but I couldn't figure out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
Above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
You have to create pairRDDs first for your data sets then you have to apply join transformation. Your data sets are not looking accurate.
Please consider the below example.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your code should be like below in Scala
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from scala shell
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

Array[Byte] Spark RDD to String Spark RDD

I'm using the Cloudera's SparkOnHBase module in order to get data from HBase.
I get a RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to row key and a list of values. All of them represented by a byte array.
If I save the getRDD to a file, what I see is:
([B#f7e2590,[([B#22d418e2,[B#12adaf4b,[B#48cf6e81), ([B#2a5ffc7f,[B#3ba0b95,[B#2b4e651c), ([B#27d0277a,[B#52cfcf01,[B#491f7520), ([B#3042ad61,[B#6984d407,[B#f7c4db0), ([B#29d065c1,[B#30c87759,[B#39138d14), ([B#32933952,[B#5f98506e,[B#8c896ca), ([B#2923ac47,[B#65037e6a,[B#486094f5), ([B#3cd385f2,[B#62fef210,[B#4fc62b36), ([B#5b3f0f24,[B#8fb3349,[B#23e4023a), ([B#4e4e403e,[B#735bce9b,[B#10595d48), ([B#5afb2a5a,[B#1f99a960,[B#213eedd5), ([B#2a704c00,[B#328da9c4,[B#72849cc9), ([B#60518adb,[B#9736144,[B#75f6bc34)])
for each record (rowKey and the columns)
But what I need is to get the String representation of all and each of the keys and values. Or at least the values. In order to save it to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new on spark and scala and it's being quite hard to get something.
Could you please help me with that?
First lets create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is how the data is:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
I don't know about HBase but if those Array[Byte]s are Unicode strings, something like this should work:
rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map(k, l =>
(new String(k),
l.map(a =>
a.map(elem =>
new String(elem)
)
))
)
Sorry for bad styling and whatnot, I am not even sure it will work.