val file = sc.textFile(filePath)
val sol1 = file.map(x => x.split("\t")).map(x => Array(x(4), x(5), x(1)))
val sol2 = sol1.map(x => x(2).toLowerCase)
In sol1, I have created an RDD[Array[String]], and I want to convert the 3rd string element of every array to lower case, so I call toLowerCase, which should do that, but instead it turns the whole thing into an RDD of lowercase strings?
I assume you want to convert the 3rd array element to lower case:
val sol1 = file.map(x => x.split("\t"))
               .map(x => Array(x(4), x(5), x(1).toLowerCase))
In your code, sol2 ends up as an RDD of strings, not an RDD of arrays.
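If you would rather keep your original sol1 and only fix the 3rd element in place, a minimal sketch (same index, using Array.updated) could be:

// Keeps the RDD[Array[String]] shape; only the element at index 2 is lower-cased.
val sol2 = sol1.map(a => a.updated(2, a(2).toLowerCase))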
I have an assignment where I need to load a csv dataset in a spark-shell using spark.read.csv(), and accomplish the following:
Convert the dataset to RDD
Remove the heading (first record (line) in the dataset)
Convert the first two fields to integers
Convert other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
I was able to do steps 1 and 2 with the following code:
//load the dataset as an RDD
val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23
dataRDD.count() //output 574914
//import Row since RDD is of Row
import org.apache.spark.sql.Row
//function to recognize if a string contains "id_1"
def isHeader(r : Row) = r.toString.contains("id_1")
//filter will apply the !isHeader predicate to every line in dataRDD; the result forms another RDD
val nohead = dataRDD.filter(x => !isHeader(x))
nohead.count() //output is now 574913
nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]
nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28
I'm trying to convert the fields, but every time I use a function like toDouble I get an error stating it is not a member of Row:
:25: error: value toDouble is not a member of org.apache.spark.sql.Row
       if ("?".equals(s)) Double.NaN else s.toDouble
I've taken a look at https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull() but I still don't know what I'm doing wrong.
I'm not sure how to convert something if there isn't a toDouble, toInt, or toBoolean function.
Can someone please guide me in the right direction to figure out what I'm doing wrong? Where can I look for an answer? I need to convert the first two fields to integers and the other fields except the last one to doubles; question marks should be NaN, and the last field should be converted to a Boolean.
Convert the first two fields to integers
Convert other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
You can do both 3 and 4 at once using a parse function.
First create the toDouble function since it is used in the parse function:
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line: String) = {
val pieces = line.split(',')
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
(id1, id2, scores, matched)
}
After you do this, you can call parse on each row in your RDD using map; however, you still have the type issue. To fix this, you could convert nohead from an RDD[Row] to an RDD[String]; however, it's probably easier to just convert each row to a string as you pass it:
val parsed = nohead.map(line => parse(line.mkString(",")))
This will give parsed as type: RDD[(Int, Int, Array[Double], Boolean)]
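Based on the sample row shown in the question, a quick sanity check in the shell might look like this (display details may differ):

val first = parsed.first()  // (Int, Int, Array[Double], Boolean)
first._1                    // 37291
first._3.take(2)            // Array(0.833333333333333, NaN): the "?" became NaN
first._4                    // true ("TRUE" parsed by toBoolean)
parsed.count()              // 574913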
I have a scenario where I need to capture some data (not all of it) from an existing RDD and then pass it to another Scala class for the actual operations. Let's look at example data (empnum, empname, emplocation, empsal) in a text file:
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So, my requirement is to create another RDD with empnum, empname, emplocation (again as RDD[String]).
I tried the code below for that, but I am getting RDD[(String, String, String)]:
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have also tried slice, but it gives me RDD[Array[String]].
My required RDD should be RDD[String] so I can pass it to the Scala class that does the actual operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me achieve this?
I would try this
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map { case (id, name, city) => "%s,%s,%s".format(id, name, city) }
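Collecting and printing that should match the expected output from the question:

rddString.collect().foreach(println)
// 11,John,Paris
// 12,Daniel,UK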
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concise, you could do it by taking the first 3 elements of the array and using the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
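One caveat with the regex version: a line that does not match the pattern will throw a MatchError inside map. If that can happen in your data, a sketch using collect with the same partial function simply drops non-matching lines instead:

// Non-matching lines are skipped rather than failing the job.
val safeRDD = empRDD.collect { case r(id, name, city) => s"$id,$name,$city" }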
Assume I have the following RDD:
val rdd = sc.parallelize(Seq(('a'.toString,1.1,Array(1.1,2.2),0),
('b'.toString,1.5,Array(1.4,4.2),3),
('d'.toString,2.1,Array(3.3,7.4),4)))
>>>rdd: org.apache.spark.rdd.RDD[(String,Double,Array[Double],Int)]
And I want to write the output in CSV format using .write.format("com.databricks.spark.csv"), which takes a DataFrame.
So first I need to convert the current schema to RDD[(String, String, String, String, String)] and then convert it to a DataFrame. I tried the following:
rdd.map { case((a,b,c,d)) => (a,b,c.mkString(","),d)}
but this outputs:
RDD[(String, Double, String, Int)]
Any idea how to do it?
UPDATE
To work with tuples, you have to know up front how many elements you're going to put in them. Hence, to work with a variable number of elements, you'll probably need to work with a collection.
For your use case, something like this can work:
rdd.map { case((a,b,c,d)) => a +: (b +: c) :+ d}.map(_.mkString(","))
This will result in an RDD[String] corresponding to each line of the csv file.
Prepending and appending the other elements to the array c collapses each record into a single array that mkString can join.
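If you still need a DataFrame for .write.format("com.databricks.spark.csv"), one hedged option is to flatten into a fixed-arity tuple and call toDF, assuming the array always holds exactly two values and a SparkSession named spark is in scope (the column names and output path below are made up):

import spark.implicits._

val df = rdd
  .map { case (a, b, c, d) => (a, b, c(0), c(1), d) }  // (String, Double, Double, Double, Int)
  .toDF("id", "score", "v1", "v2", "label")            // hypothetical column names

df.write.format("com.databricks.spark.csv").save("output_path")  // hypothetical path

This skips the intermediate all-String step, since toDF handles the numeric types directly.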
I'm using Scala, and I want saveAsTextFile to save the result directly as tab-separated, for example:
a 1
b 4
c 5
(the whitespace above is a tab)
I just want to use saveAsTextFile (not print), and when I have an RDD[(String, Double)], I cannot use
ranks = ranks.map( f => f._1 + "\t" + f._2)
It says the types do not match. I guess that is because f._1 is a String and f._2 is a Double?
The only mistake in your code is trying to re-assign the result of the mapping into the same ranks variable - I'm assuming ranks has type RDD[(String, Double)] so indeed you can't assign it with a value of type RDD[String]. Simply use a separate variable:
val ranks: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1D), ("b", 4D)))
val tabSeparated: RDD[String] = ranks.map(f => f._1 +"\t"+f._2)
tabSeparated.saveAsTextFile("./test.tsv")
In general, it's almost always better to use vals and not vars to prevent such mistakes.
NOTE: a perhaps cleaner way to convert a tuple (of any size) into a tab-delimited string:
ranks.map(_.productIterator.mkString("\t"))
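That version works for a tuple of any arity, for example:

// Three-element tuples joined by tabs, then saved as text.
sc.parallelize(Seq(("a", 1.0, 7), ("b", 4.0, 2)))
  .map(_.productIterator.mkString("\t"))
  .saveAsTextFile("./test3.tsv")  // hypothetical path; each line looks like a<tab>1.0<tab>7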
I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files, which are in a specific directory in HDFS, transform them into an RDD<Vector>, and use this method of the KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector
  (firstValue: Double, otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into an RDD<Vector>; any suggestions?
The earlier suggestion to use flatMap was based on the assumption that you wanted to map over the elements of the array produced by .split(","), and it satisfied the types by using Array instead of Tuple2.
The argument received by the .map/.flatMap function is an element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each element of the split, hence the source of the confusion.
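A tiny illustration of that confusion, with made-up values that are long enough to index: inside the map, fields is a single token, so fields(2) picks a character, and Char.toDouble returns the character code rather than the numeric value.

"100,200,0.5,1.5".split(",").map(fields => fields(2))           // Array(0, 0, 5, 5) as Chars
"100,200,0.5,1.5".split(",").map(fields => fields(2).toDouble)  // Array(48.0, 48.0, 53.0, 53.0), i.e. char codes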
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or if you want all but the first two fields converted to Double (in case there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There are two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
The type produced above, however, is Array[Tuple2[Double, Double]], which does not match either signature:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double, Double)] = s => s.split(',').map(fields => (fields(2).toDouble, fields(3).toDouble))
What is needed here is to create a new Array out of the input String, like this: (again focusing only on the specific parsing logic)
val parseLineToArray: String => Array[Double] = s => s.split(",").flatMap(fields => Array(fields(2).toDouble, fields(3).toDouble))
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))
(You can of course inline that code, I separated it here to focus on the issue at hand)
val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))
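From there, hooking into KMeans.train as in the linked example might look like this (the k and maxIterations values below are made up; tune them for your data):

import org.apache.spark.mllib.clustering.KMeans

val numClusters = 4      // hypothetical
val maxIterations = 20   // hypothetical
val model = KMeans.train(parsedData.cache(), numClusters, maxIterations)
model.clusterCenters.foreach(println)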