Spark: format of an rdd to convert to dataframe - scala

Assuming I have the following RDD:
val rdd = sc.parallelize(Seq(('a'.toString,1.1,Array(1.1,2.2),0),
('b'.toString,1.5,Array(1.4,4.2),3),
('d'.toString,2.1,Array(3.3,7.4),4)))
rdd: org.apache.spark.rdd.RDD[(String, Double, Array[Double], Int)]
And I want to write the output in CSV format using .write.format("com.databricks.spark.csv"), which takes a DataFrame.
So first I need to convert the current schema to RDD[(String, String, String, String, String)] and then convert that to a DataFrame. I tried the following:
rdd.map { case (a, b, c, d) => (a, b, c.mkString(","), d) }
but this outputs:
RDD[(String, Double, String, Int)]
Any idea how to do it?

UPDATE
To work with tuples, you have to know how many elements they will contain and define the use case yourself. Hence, to work with a variable number of elements, you'll probably need to work with some collection.
For your use case, something like this can work:
rdd.map { case((a,b,c,d)) => a +: (b +: c) :+ d}.map(_.mkString(","))
This will result in an RDD[String] where each element corresponds to one line of the CSV file.
You are prepending and appending the other elements to the array c, so everything ends up in a single Array.
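Since the original goal was a DataFrame for the com.databricks.spark.csv writer, here is a minimal sketch of that route (assuming a SparkSession named spark, or an SQLContext on older Spark versions, is in scope; the column names and output path are placeholders):

import spark.implicits._

val stringRdd = rdd.map { case (a, b, c, d) =>
  (a, b.toString, c.mkString(","), d.toString)  // RDD[(String, String, String, String)]
}
val df = stringRdd.toDF("letter", "num", "values", "idx")   // placeholder column names
// note: the "values" field contains commas, so the CSV writer will quote it
df.write.format("com.databricks.spark.csv").save("output")  // placeholder path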

Related

How to create an RDD by selecting specific data from an existing RDD, where the output should be RDD[String]?

I have a scenario where I need to capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Let's look at example data (empnum, empname, emplocation, empsal) in a text file.
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So my requirement is to create another RDD with empnum, empname, emplocation (again as RDD[String]).
For that I have tried the code below, but I am getting RDD[(String, String, String)].
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have also tried slice, but it gives me RDD[Array[String]].
My required RDD should be RDD[String], so I can pass it to the Scala class that does the actual operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me achieve this?
I would try this:
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)})
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concisely, you could take the first 3 elements of the array and use the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
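If some lines might not match the pattern, a variant (just a sketch, not part of the original answer) is to use RDD.collect with a partial function, so non-matching lines are dropped instead of throwing a MatchError:

val r = "([^,]*),([^,]*),([^,]*).*".r
val empReqRDD = empRDD.collect { case r(id, name, city) => s"$id,$name,$city" }  // skips lines that don't match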

Scala method toLowerCase in spark

val file = sc.textFile(filePath)
val sol1=file.map(x=>x.split("\t")).map(x=>Array(x(4),x(5),x(1)))
val sol2=sol1.map(x=>x(2).toLowerCase)
In sol1, I have created an RDD[Array[String]], and for every array I want to put the 3rd string element in lower case, so I call toLowerCase, which should do that, but instead it seems to transform the string into lowercase chars?
I assume you want to convert the 3rd array element to lower case:
val sol1=file.map(x=>x.split("\t"))
.map(x => Array(x(4),x(5),x(1).toLowerCase))
In your code, sol2 will be an RDD of strings, not an RDD of arrays.
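If you would rather keep sol1 unchanged and fix it afterwards, here is a sketch that lower-cases only the 3rd element while keeping the RDD[Array[String]] shape (updated returns a copy of the array with that index replaced):

val sol2 = sol1.map(arr => arr.updated(2, arr(2).toLowerCase))  // still RDD[Array[String]]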

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing sorting in the Spark shell. I have an RDD with about 10 columns/variables. I want to sort the whole RDD on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (the string values) and the full original row (an array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save the sorted original data from rdd3 (the Array[String] part)? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map it to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to split the record, you need to get it out of this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort; this way you do not need to bother with tuples at all.
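Putting both suggestions together, a rough sketch (column index 7 follows the asker's own c(7); the output path is a placeholder):

// Option 1: keep the sortByKey pipeline, then drop the key
val sortedRows = rdd.map(c => (c(7), c)).sortByKey().map(_._2)

// Option 2: skip the tuple entirely
val sortedRows2 = rdd.sortBy(_(7))

// Save each sorted row as one comma-separated line
sortedRows2.map(_.mkString(",")).saveAsTextFile("sorted_output")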
If you want to sort the RDD using the 7th string in the array, you can do it directly:
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you the hassle of doing multiple transformations. The reason rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that this is not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
Just do this:
val rdd4 = rdd3.map(_._2)
I assume you're not very familiar with Scala, so the snippet below should help you see what each part of the tuple is (foreach, an action, is used here so the println calls actually run):
rdd3.foreach { kv =>
println(kv._1) // the String key (the sort column)
println(kv._2) // the Array[String] row (prints the array reference; use kv._2.mkString(",") to see its contents)
}
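Using the same pattern-match style to build the final, sorted lines (a sketch, equivalent to rdd3.map(_._2) followed by mkString; the path is a placeholder):

rdd3.map { case (_, arr) => arr.mkString(",") }.saveAsTextFile("sorted_rows")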

How to generate tab separated output using saveastextfile in Spark?

I'm using Scala, and I want saveAsTextFile to directly save the result tab-separated, for example:
a 1
b 4
c 5
(the separator is a tab)
I just want to use saveAsTextFile (not print), and when I have an RDD[(String, Double)], I cannot use
ranks = ranks.map( f => f._1 +"\t"+f._2)
It says the types do not match; I guess that's because f._1 is a String and f._2 is a Double?
The only mistake in your code is trying to re-assign the result of the mapping to the same ranks variable - I'm assuming ranks has type RDD[(String, Double)], so indeed you can't assign it a value of type RDD[String]. Simply use a separate variable:
val ranks: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1D), ("b", 4D)))
val tabSeparated: RDD[String] = ranks.map(f => f._1 +"\t"+f._2)
tabSeparated.saveAsTextFile("./test.tsv")
In general, it's almost always better to use vals and not vars to prevent such mistakes.
NOTE: a perhaps cleaner way to convert a tuple (of any size) into a tab-delimited string:
ranks.map(_.productIterator.mkString("\t"))
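A quick sketch of how that generalizes to tuples of different arities (the sample data here is made up):

val pairs   = sc.parallelize(Seq(("a", 1.0), ("b", 4.0)))
val triples = sc.parallelize(Seq(("a", 1.0, true)))

pairs.map(_.productIterator.mkString("\t"))    // "a\t1.0", "b\t4.0"
triples.map(_.productIterator.mkString("\t"))  // "a\t1.0\ttrue"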

Flattening an element of a RDD

I am using Spark scala API.
prods_grpd has this type: RDD[(String, mutable.HashSet[String])]
val prods_grpd = all_meds.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
prods_grpd.saveAsTextFile("scratch/prods_grpdby_users.tsv")
When I save this RDD, I get the output below. The first value is the key, followed by the set of values.
(8635214,Set(2013-01-01))
(3580112,Set(2013-01-01))
(146086,Set(2010-01-01, 2012-01-01))
(112220,Set(2013-01-01))
(2020,Set(2013-01-01))
(24218,Set(2013-01-01))
However, I want output like:
(8635214, 2013-01-01)
(3580112, 2013-01-01)
(146086, 2010-01-01, 2012-01-01)
(112220, 2013-01-01)
(2020, 2013-01-01)
(24218, 2013-01-01)
I would like to know how to unnest/flatten the 2nd element of the RDD.
You cannot simply convert a Set to a tuple because tuples are not collections and don't support an arbitrary number of elements. Instead, you can map the entries to strings with the desired format:
prods_grpd.map { case (k, s) =>
  val sstr = s.mkString(",")
  s"($k,$sstr)"
}.saveAsTextFile(...)
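If you want to reproduce the exact spacing shown in the desired output (a space after each comma), a small variation of the same idea works (the output path is a placeholder):

prods_grpd.map { case (k, s) => s"($k, ${s.mkString(", ")})" }
  .saveAsTextFile("scratch/prods_grpdby_users_flat.tsv")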