Converting String RDD to Int RDD - scala

I am new to scala..I want to know when processing large datasets with scala in spark is it possible to read as int RDD instead of String RDD
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to int rdd because down the line i need to do the below
val vectors = intArr.map(p => Vectors.dense(p))
which requires the type to be integer
Any kind of help is truly appreciated..thanks in advance

As far as I understood, one line should create one vector, so it should goes like:
val result = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toInt)))
numbers.map(_.toInt) will map every element of array to int, so result type will be Array[Int]

Related

Add new columns

ErrorHi I am trying to a new column to a Spark. I am trying in a data set where I want to add the percentage made by in all games.
The data set looks like this:
Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales
val vgdataLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/t1/vgsales-small.csv")
val vgdata = vgdataLines.map(_.split(";"))
def toPercentage(x: Double): Double = {x * 100} val countPubl = vgdata.map(r => (r(4),1)).reduceByKey(_+_)
val addpercen = countPubl.withColumn("count", toPercentage($"count"/countPubl.count(_._2)))
I used withColumn() to add new column 'count' and expected output to be like:
(Ubisoft,3,15.0)
Can anyone tell whats wrong here?
You cannot use "withColumn" with an RDD.
You could do as follow
val addpercen = countPubl.map({case(key, value) => (key, value, toPercentage(value))})
use map to add a calculated value as new column and convert to a DataFrame if you want
import spark.implicits._
val myDf = addpercen.toDF("key","value","myNewColumn")
myDf.show()
Hope it helps.
You can not use withColumn with an RDD hence convert it to DataFrame as below and then use it
val countPubl : DataFrame = vgdata.map(r => (r(4),1)).reduceByKey(_+_).toDF()
If you still looking to use RDD then just converto it back to RDD once you add the with column as
val javaRdd : JavaRDD[Row] = countPubl.withColumn("...",col("...")).toJavaRDD

spark-streaming scala: how can I pass an array of strings to a filter?

I want to replace the string "a" for an array of Strings making .contains() to check for every String in the array. Is that possible?
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.contains("a")))
Edit:
Also tried this (sc is sparkContext):
val ssc = new StreamingContext(sc, Seconds(15))
val stream = TwitterUtils.createStream(ssc, None)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
And got the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Then I tried to broadcast the array before it is used:
val aBroadcast = sc.broadcast(a)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(aBroadcast.value.contains(_)))
And got the same error.
Thanks
As I understand the question you want to see if the status text after being split contains a list of words which is a subset of a:
val a = Array("a1", "a2")
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.forall(a contains))

addition of two dataframe integer values in Scala/Spark

So I'm new to both Scala and Spark so it may be kind of a dumb question...
I have the following code :
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(List(1,2,3)).toDF();
df.foreach( value => println( value(0) + value(0) ) );
error: type mismatch;
found : Any
required: String
What is wrong with it? How do I tell "this is an integer not an any"?
I tried value(0).toInt but "value toInt is not a member of Any".
I tried List(1:Integer, 2:Integer, 3:Integer) but I can not convert into a dataframe afterward...
Spark Row is an untyped container. If you want to extract anything else than Any you have to use typed extractor method or pattern matching over the Row (see Spark extracting values from a Row):
df.rdd.map(value => value.getInt(0) + value.getInt(0)).collect.foreach(println)
In practice there should be reason to extract these values at all. Instead you can operate directly on the DataFrame:
df.select($"_1" + $"_1")

Why doesn't keys() and values() work on (String,String) one-pair RDD, while sortByKey() works

I create an RDD using the README.md file in Spark directory. The type of the newRDD is (String,String)
val lines = sc.textFile("README.md")
val newRDD = lines.map(x => (x.split(" ")(0),x))
So, when I try to runnewRDD.values() or newRDD.keys(), I get the error:
error: org.apache.spark.rdd.RDD[String] does not take parameters newRDD.values()or.keys() resp.
What I can understand from the error is maybe that String data type cannot be a key (And I think I am wrong). But if that's the case, why does
newRDD.sortByKey() work ?
Note: I am trying values() and keys() transformations because they're listed as valid transformations for one-pair RDDs
Edit: I am using Apache Spark version 1.5.2 in Scala
It doesn't work values (or keys) receives no parameters and because of that it has to be called without parentheses:
val rdd = sc.parallelize(Seq(("foo", "bar")))
rdd.keys.first
// String = foo
rdd.values.first
// String = bar

Spark-Scala RDD

I have a RDD RDD1 with the following Schema:
RDD[String, Array[String]]
(let's call it RDD1)
and I would like create a new RDD RDD2 with each row as RDD[String,String] with the key and value belonging to RDD1.
For example:
RDD1 =Array(("Fruit",("Orange","Apple","Peach")),("Shape",("Square","Rectangle")),("Mathematician",("Aryabhatt"))))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that it is because that map function is only applicable to the RDDs and not each string/char. Please help me with a way to use nested functions for this purpose in Spark.
Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x)
Apply that function to your rdd:
rdd1.flatMap(flattenLine)