How to convert var to List? - scala

How can I convert one variable into two List variables?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want the result to be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))

First of all, the input is not valid JSON:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to turn it into an RDD of valid JSON strings (since you are going to use Apache Spark):
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
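To see what this chain of replace calls produces, it can be tried on a plain string without Spark. The following is only a sketch: the names ToJsonLines and toJsonLines are made up for illustration, and the approach assumes that no value contains a comma, colon, bracket, or ampersand.

```scala
// Sketch only: ToJsonLines and toJsonLines are hypothetical names, not Spark API.
object ToJsonLines {
  // Rewrites "[k:v,k:v][k:v,...]" into one JSON object string per record.
  // Assumes values never contain ',', ':', '[', ']' or '&'.
  def toJsonLines(input: String): Array[String] =
    input
      .replace(",", "\",\"")   // close and reopen quotes around every comma
      .replace(":", "\":\"")   // same around every colon
      .replace("[", "{\"")     // record start becomes {"
      .replace("]", "\"}")     // record end becomes "}
      .replace("}{", "}&{")    // mark the boundary between records
      .split("&")              // one string per record

  def main(args: Array[String]): Unit = {
    val input = "[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
    // Prints three lines, the first being {"level":"1","var1":"name","var2":"id"}
    toJsonLines(input).foreach(println)
  }
}
```

Every value, including level, ends up as a JSON string, so the level column that Spark infers from these lines is a string column.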
Once you have a valid JSON RDD, you can easily convert it to a dataframe and then apply your logic:
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
  .groupBy("level")
  .agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
  .select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get the desired output in the dataframe (Spark does not guarantee the order of the two groups):
+-----------------------------------------------+-------------------------------------------+
|var1                                           |var2                                       |
+-----------------------------------------------+-------------------------------------------+
|[WrappedArray(name, name1), WrappedArray(add1)]|[WrappedArray(id, id1), WrappedArray(city)]|
+-----------------------------------------------+-------------------------------------------+
You can convert the arrays to lists if required. To get the values as in the question, you can do the following:
val rows = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rows(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name, name1), List(add1))
val second = rows(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id, id1), List(city))
I hope the answer is helpful

reduceByKey is the key function for producing the required output: it merges all values that share a key using the function you pass in.
You can do the following:
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
  val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
  (values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))
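When the whole input is a single string that fits in memory, as it does here, the same grouping can also be sketched with plain Scala collections, which keeps the groups in the order in which each level first appears. The names ParseLevels, parse, and groupByLevel are made up for this illustration, and the regex assumes the exact level/var1/var2 layout shown in the question.

```scala
// Sketch only: plain Scala, no Spark. All names here are hypothetical.
object ParseLevels {
  // One record looks like "[level:1,var1:name,var2:id]".
  private val Record = """\[level:([^,\]]+),var1:([^,\]]+),var2:([^,\]]+)\]""".r

  /** Returns (level, var1, var2) triples in input order. */
  def parse(input: String): List[(String, String, String)] =
    Record.findAllMatchIn(input).map(m => (m.group(1), m.group(2), m.group(3))).toList

  /** Groups one field of the triples by level, keeping levels in first-seen order. */
  def groupByLevel(rows: List[(String, String, String)])(
      pick: ((String, String, String)) => String): List[List[String]] = {
    val levels = rows.map(_._1).distinct
    levels.map(level => rows.filter(_._1 == level).map(pick))
  }

  def main(args: Array[String]): Unit = {
    val input = "[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
    val rows = parse(input)
    println(groupByLevel(rows)(_._2)) // List(List(name, name1), List(add1))
    println(groupByLevel(rows)(_._3)) // List(List(id, id1), List(city))
  }
}
```

Unlike the reduceByKey version, the result order here is deterministic, because nothing is shuffled between partitions.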

Related

How do I split a Spark rdd Array[(String, Array[String])] to a single RDD

I want to turn the following into a single RDD of the form (id, (all names with that id)).
scala> val test = rddByKey.map{case(k,v)=> (k,v.collect())}
test: Array[(String, Array[String])] =
Array(
(45000,Array(Amit, Pavan, Ratan)),
(10000,Array(Kumar, Venkat, Sheela)),
(50000,Array(Tejas, Dinesh, Lokesh, Bhupesh))
)
I want to print it like this:
(45000,(Amit, Pavan, Ratan))
(10000,(Kumar, Venkat, Sheela))
This is what I have tried:
val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r=>(r.split(",")(0),r.split(",")(1)))
val groupByKey = rdd.groupByKey().collect()
val rddByKey = groupByKey.map{case(k,v) => k->sc.makeRDD(v.toSeq)}
val test = rddByKey.map{case(k,v)=> (k,v.collect())}
You don't need the complexity of calling collect. You can simply do:
val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r => {
  val x = r.split(",")
  (x(0), x(1))
})
val groupByKey = rdd.groupByKey().map{case (x, y) => (x :: y.toList)}
The elements of groupByKey are then:
List(45000, Amit, Pavan, Ratan)
List(10000, Kumar, Venkat, Sheela)
List(50000, Tejas, Dinesh, Lokesh, Bhupesh)
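If the goal is only to print each key with its names in the form shown in the question, the grouped values can be formatted with mkString. The sketch below uses plain Scala collections so that it runs without a cluster. The sample rows are assumed from the output in the question (the contents of data.csv are not shown), and GroupNames and formatGroups are hypothetical names.

```scala
// Sketch only: plain Scala, no Spark. GroupNames and formatGroups are hypothetical.
object GroupNames {
  /** Groups (id, name) pairs by id, keeping ids in first-seen order,
    * and renders each group as "(id,(name1, name2, ...))". */
  def formatGroups(pairs: Seq[(String, String)]): Seq[String] = {
    val ids = pairs.map(_._1).distinct
    ids.map { id =>
      val names = pairs.collect { case (k, name) if k == id => name }
      val joined = names.mkString(", ")
      s"($id,($joined))"
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample rows in the "id,name" layout implied by the question.
    val lines = Seq("45000,Amit", "10000,Kumar", "45000,Pavan",
                    "10000,Venkat", "45000,Ratan", "10000,Sheela")
    val pairs = lines.map { line =>
      val cols = line.split(",")
      (cols(0), cols(1))
    }
    formatGroups(pairs).foreach(println)
    // (45000,(Amit, Pavan, Ratan))
    // (10000,(Kumar, Venkat, Sheela))
  }
}
```

With Spark, the same mkString formatting could be applied inside the map that follows groupByKey, so no intermediate collect is needed.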
I hope the answer is helpful

SCALA : Read the text file and create tuple of it

How can I create tuples from the existing RDD below?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you intend to have an Array[Tuple4], you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert))
Then you can access each field through the tuple accessors:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have an input file whose lines vary in length, such as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can pattern match on the split array:
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As @eliasah pointed out, the procedure above is bad practice: the result is typed as Array[Product with Serializable], so the fields can only be reached through the product iterator. Following his suggestion, we should know the maximum number of elements in the input data and use the following logic, which assigns default values to missing elements:
import scala.util.Try
val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as @philantrovert pointed out, we can verify the output in the following way if we are not using the REPL:
arrayTuples.foreach(println)
which prints
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
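An alternative to placeholder defaults is to model missing fields explicitly with Option. The sketch below is plain Scala and rests on several assumptions: the case class Person and its field names (name, age, area, level) are invented here, since the source does not name the columns, and parseLine is a hypothetical helper.

```scala
import scala.util.Try

// Hypothetical record type; the column names are assumptions.
final case class Person(name: String, age: Option[Int], area: Option[String], level: Option[String])

object PersonParser {
  /** Parses one comma-separated line; missing or malformed fields become None. */
  def parseLine(line: String): Person = {
    val cols = line.split(",").map(_.trim).toVector
    Person(
      name  = cols.lift(0).getOrElse(""),
      age   = cols.lift(1).flatMap(s => Try(s.toInt).toOption),
      area  = cols.lift(2),
      level = cols.lift(3)
    )
  }

  def main(args: Array[String]): Unit = {
    Seq("Ankita,26,BigData,newbie", "Anita,26,big").map(parseLine).foreach(println)
    // Person(Ankita,Some(26),Some(BigData),Some(newbie))
    // Person(Anita,Some(26),Some(big),None)
  }
}
```

Every row then has the same static type, and a missing value cannot be confused with a real value that happens to be the string "Empty".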

Scala - Convert List[String] to tuple List[(Int, Int)]

I would like to read lines from a Source and convert each one to a tuple (Int, Int). I did it using foreach:
val values = collection.mutable.ListBuffer[(Int, Int)]()
Source.fromFile(invitationFile.ref.file).getLines().filter(line => !line.isEmpty).foreach(line => {
val value = line.split("\\s")
values += ((value(0).toInt, (value(1).toInt)))
})
What's the best way to write the same code without using foreach?
Use map, it builds a new list for you:
Source.fromFile(invitationFile.ref.file)
  .getLines()
  .filter(line => !line.isEmpty)
  .map(line => {
    val value = line.split("\\s")
    (value(0).toInt, value(1).toInt)
  })
  .toList
foreach should be a final operation, not a transformation.
In your case, you want to use the function map (getLines() returns a lazy Iterator, so finish with toList to materialize the result):
val values = Source.fromFile(invitationFile.ref.file).getLines()
  .filter(line => !line.isEmpty)
  .map(line => line.split("\\s"))
  .map(line => (line(0).toInt, line(1).toInt))
  .toList
Using a for comprehension:
val values = for {
  line <- Source.fromFile(invitationFile.ref.file).getLines()
  if !line.isEmpty
} yield {
  val splits = line.split("\\s")
  (splits(0).toInt, splits(1).toInt)
}
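All of the variants above throw if a line has fewer than two fields or a field is not a number. A more defensive version can be sketched as follows, working on any Iterator[String] so that it can be tried without a file. ParsePairs, parseLine, and parseAll are hypothetical names, and silently skipping bad lines is an assumption about the desired behaviour, not something the question asks for.

```scala
import scala.util.Try

// Sketch only: all names are hypothetical.
object ParsePairs {
  /** Parses "12 34" into Some((12, 34)); returns None for anything else. */
  def parseLine(line: String): Option[(Int, Int)] =
    line.trim.split("\\s+") match {
      case Array(a, b) => Try((a.toInt, b.toInt)).toOption
      case _           => None
    }

  /** Keeps only the lines that parse, in their original order. */
  def parseAll(lines: Iterator[String]): List[(Int, Int)] =
    lines.flatMap(line => parseLine(line).iterator).toList

  def main(args: Array[String]): Unit = {
    val sample = Iterator("1 2", "", "3   4", "x 5", "6")
    println(parseAll(sample)) // List((1,2), (3,4))
  }
}
```

With a file, parseAll could be given the iterator returned by getLines(); the Source should still be closed afterwards, which none of the snippets in this thread do.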

Loading several input files into one Dataframe in Scala / Spark 1.6

I'm trying to load several input files into a single dataframe:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield {df}
val inputDataFrame = unionAll(dataFrames, sqlContext)
// union of all given DataFrames
private def unionAll(dataFrames: Seq[DataFrame], sqlContext: SQLContext): DataFrame = dataFrames match {
case Nil => sqlContext.emptyDataFrame
case head :: Nil => head
case head :: tail => head.unionAll(unionAll(tail, sqlContext))
}
Compiler says
Error:(40, 8) type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.collection.GenTraversableOnce[?]
df <- sc.textFile(i).toDF()
^
Any idea?
First, SQLContext.read.text(...) accepts multiple filename arguments, so you can simply do:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val inputDataFrame = sqlContext.read.text(inputs: _*)
Or:
val inputDataFrame = sqlContext.read.text("input1.txt", "input2.txt", "input3.txt")
As for your code - when you write:
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield df
It is translated into:
inputs.flatMap(i => sc.textFile(i).toDF().map(df => df))
Which can't compile, because flatMap expects a function that returns a GenTraversableOnce[?], while the supplied function returns an RDD[Row] (See signature of DataFrame.map). In other words, when you write df <- sc.textFile(i).toDF() you're actually taking each row in the dataframe, and yielding a new RDD with these rows, which isn't what you intended.
What you were trying to do is simpler:
val dataFrames = for (i <- inputs) yield sc.textFile(i).toDF()
But, as mentioned at the beginning, the recommended approach is using sqlContext.read.text.
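The difference between a for expression with two generators and one with a single generator can be reproduced with ordinary lists. In this sketch, load is a hypothetical stand-in for reading a file and returns a List instead of a DataFrame, so both forms compile and their results can be compared with the hand-written flatMap and map calls.

```scala
// Sketch only: ForDesugar and load are hypothetical, and no Spark is involved.
object ForDesugar {
  val inputs = List("input1", "input2")

  // Stand-in for reading a file: two fake rows per input.
  def load(name: String): List[String] = List(s"$name-row1", s"$name-row2")

  // Two generators: the second one iterates over the rows of each loaded input.
  val flattened: List[String] = for (i <- inputs; row <- load(i)) yield row
  // Equivalent hand-written form: flatMap over the inputs, map over the rows.
  val flattenedDesugared: List[String] = inputs.flatMap(i => load(i).map(row => row))

  // One generator: one loaded collection per input.
  val perInput: List[List[String]] = for (i <- inputs) yield load(i)
  // Equivalent hand-written form: a plain map.
  val perInputDesugared: List[List[String]] = inputs.map(i => load(i))

  def main(args: Array[String]): Unit = {
    println(flattened == flattenedDesugared) // true
    println(perInput == perInputDesugared)   // true
    println(perInput.size)                   // 2
  }
}
```

The two-generator form only type-checks here because List.map returns another List; in the question, the inner map returns an RDD, which flatMap on a List cannot flatten.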

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two fields.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The code above does not work!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sc.parallelize(s1)
val rdd2 = sc.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
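One caveat with the parse function above: every line that fails to match is mapped to the key "", so malformed lines from both inputs would be joined with each other. A variant that returns an Option avoids this. The sketch below uses plain Scala collections so that it runs without Spark; JoinSketch, parse, and innerJoin are hypothetical names, and innerJoin is a simple nested loop that only imitates what join does on pair RDDs.

```scala
// Sketch only: plain Scala, no Spark. All names are hypothetical.
object JoinSketch {
  // Matches lines such as ('abc,def', 'monkey(1)').
  private val Line = "^\\('([^']+)',\\s*'([^']+)'\\)$".r

  /** Returns Some((key, value)) for a well-formed line, None otherwise. */
  def parse(s: String): Option[(String, String)] = s match {
    case Line(k, v) => Some((k, v))
    case _          => None
  }

  /** Inner join on the key, as a nested loop over two in-memory sequences. */
  def innerJoin(left: Seq[(String, String)],
                right: Seq[(String, String)]): Seq[(String, (String, String))] =
    for {
      (k1, a) <- left
      (k2, b) <- right
      if k1 == k2
    } yield (k1, (a, b))

  def main(args: Array[String]): Unit = {
    val s1 = Seq("('abc,def', 'monkey(1)')", "('df,gh', 'zebra')", "not a record")
    val s2 = Seq("('a,efg', 'apple')", "('abc,def', 'banana(1)')", "also broken")
    val joined = innerJoin(s1.flatMap(parse(_)), s2.flatMap(parse(_)))
    println(joined) // List((abc,def,(monkey(1),banana(1))))
  }
}
```

With RDDs, the same effect could be had by applying flatMap with an Option-returning parser before calling join, so that unparseable lines are dropped instead of sharing a key.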
You first have to create pair RDDs from your data sets, and then apply the join transformation. Your data sets do not look well-formed as posted.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your code should look like this in Scala:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))