converting all elements of an RDD[string] to float in scala - scala

I'm trying to convert all the values of RDD[string] to float
The RDD contains data similar to this
15.994
1.008
4.9594
an so on
RDD is in RDD[string] format.
I need to calculate the sum of all these values and hence need to convert them into float.
I found a code for this problem in python, but I need it in scala
python code :
val massData1 = [map(float,i) for i in massData]
massData is the RDD[string]
Can anyone please tell me how I can add all the values in the RDD [string] by converting them into float.

Suppose you have an strRDD which contains string, do transform as below:
val doubleRDD = strRDD.map(_.toFloat)
Then add them up:
val result = doubleRDD.reduce(_ + _)

Related

Converting Fields to Ints, Doubles, ect. in Scala in Spark Shell RDD

I have an assignment where I need to load a csv dataset in a spark-shell using spark.read.csv(), and accomplish the following:
Convert the dataset to RDD
Remove the heading (first record (line) in the dataset)
Convert the first two fields to integers
Convert other fields except the last one to doubles. Questions marks should be NaN. The
last field should be converted to a Boolean.
I was able to do steps 1 and 2 with the following code:
//load the dataset as an RDD
val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23
dataRDD.count() //output 574914
//import Row since RDD is of Row
import org.apache.spark.sql.Row
//function to recognize if a string contains "id_1"
def isHeader(r : Row) = r.toString.contains("id_1")
//filter function will take !isHeader function and apply it to all lines in dataRDD and the //return will form another RDD
val nohead = dataRDD.filter(x => !isHeader(x))
nohead.count() //output is now 574913
nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]
nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28
I'm trying to convert the fields but every time I use a function like toDouble I get an error stating not a member of:
:25: error: value toDouble is not a member of
org.apache.spark.sql.Row
if ("?".equals(s)) Double.NaN else s.toDouble
I'm not sure what I'm doing wrong and I've taken a look at the website https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull()
but I still don't know what I'm doing wrong.
I'm not sure how to convert something if there isn't a toDouble, toInt, or toBoolean function.
Can someone please guide me in the right direction to figure what I'm doing wrong? Where I can possibly look to answer? I need to convert the first two fields to integers, the other fields except for the last one to doubles. Question marks should be NaN. The last field should be converted to Boolean.
Convert the first two fields to integers
Convert other fields except the last one to doubles. Questions marks should be NaN. The last field should be converted to a Boolean.
You can do both 3 and 4 at once using a parse function.
First create the toDouble function since it is used in the parse function:
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line: String) = {
val pieces = line.split(',')
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
(id1, id2, scores, matched)
}
After you do this, you can call parse on each row in your RDD using map; however, you still have the type issue. To fix this, you could convert nohead from an RDD[Row] to an RDD[String]; however its probably easier to just convert the row to a string as you pass it:
val parsed = noheader.map(line => parse(line.mkString(",")))
This will give parsed as type: RDD[(Int, Int, Array[Double], Boolean)]

Converting from vector column to Double[Array] column in scala Spark

I have a data frame doubleSeq whose structure is as below
res274: org.apache.spark.sql.DataFrame = [finalFeatures: vector]
The first record of the column is as follows
res281: org.apache.spark.sql.Row = [[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]]
I want to extract the double array
[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
from this -
doubleSeq.head(1)(0)(0)
gives
Any = [3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
Which is not solving my problem
Scala Spark - split vector column into separate columns in a Spark DataFrame
Is not solving my issue but its an indicator
So you want to extract a Vector from a Row, and turn it into an array of doubles.
The problem with your code is that the get method (and the implicit apply method you are using) returns an object of type Any. Indeed, a Row is a generic, unparametrized object and there is no way to now at compile time what types it contains. It's a bit like Lists in java 1.4 and before. To solve it in spark, you can use the getAs method that you can parametrize with a type of your choosing.
In your situation, you seem to have a dataframe containing a vector (org.apache.spark.ml.linalg.Vector).
import org.apache.spark.ml.linalg._
val firstRow = df.head(1)(0) // or simply df.head
val vect : Vector = firstRow.getAs[Vector](0)
// or all in one: df.head.getAs[Vector](0)
// to transform into a regular array
val array : Array[Double] = vect.toArray
Note also that you can access columns by name like this:
val vect : Vector = firstRow.getAs[Vector]("finalFeatures")

Spark convert PairRDD to data frame

How can I convert a pair RDD of the following type
joinResult
res16: org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]] = org.apache.spark.api.java.JavaPairRDD#264b550
to a data frame?
https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L72-L75
joinResult.toDF().show
will not work as well as

Spark: format of an rdd to convert to dataframe

Assuming I am having the following rdd:
val rdd = sc.parallelize(Seq(('a'.toString,1.1,Array(1.1,2.2),0),
('b'.toString,1.5,Array(1.4,4.2),3),
('d'.toString,2.1,Array(3.3,7.4),4)))
>>>rdd: org.apache.spark.rdd.RDD[(String,Double,Array[Double],Int)]
And I want to write the output to csv format by using .write.format("com.databricks.spark.csv") which takes a dataframe.
So firstly i need to convert the current schema to -> rdd[(String, String, String, String, String)] and after convert it to df. I tried the following:
rdd.map { case((a,b,c,d)) => (a,b,c.mkString(","),d)}
but this outputs:
rdd[(string,double,string,int)]
Any idea how to do it?
UPDATE
To work with Tuples, you have to know how many elements you're going to put in them and define the use case yourself. Hence, to work with variable number of elements, you'll probably need to work with some collection.
For your use case, something like this can work:
rdd.map { case((a,b,c,d)) => a +: (b +: c) :+ d}.map(_.mkString(","))
This will result in an RDD[String] corresponding to each line of the csv file.
You're prepending and appending the other elements to the Array "c" to result in a single Array.

Spark DataFrame zipWithIndex

I am using a DataFrame to read in a .parquet files but than turning them into an rdd to do my normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Anyone know how to do what I am trying to do, essentially trying to get the value and the column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but getting stuck on the last part as not sure how to do the same of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
Important thing to note here is that extracting type information becomes tricky. Row.toSeq returns Seq[Any] and resulting RDD is RDD[(Any, Int)].