Creating Map values in Spark using Scala

I am new to Spark/Scala development. I am trying to create map values in Spark using Scala, but I am getting a type mismatch error.
scala> val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
<console>:21: error: type mismatch;
found : scala.collection.immutable.Map[String,String]
required: Seq[?]
Error occurred in an application involving default arguments.
val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
How should I do this?

SparkContext.parallelize transforms a Seq[T] into an RDD[T]. If you want to create an RDD[(String, String)] where each element is an individual key-value pair from the original Map, use:
import org.apache.spark.rdd.RDD
val m = Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F")
val rdd: RDD[(String, String)] = sc.parallelize(m.toSeq)
If you want an RDD[Map[String,String]] (not that it makes much sense with a single element), use:
val rdd: RDD[Map[String,String]] = sc.parallelize(Seq(m))
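Either way you can sanity-check the result in the spark-shell; a quick sketch using the first (pair RDD) form:
rdd.collect() // Array((red,#FF0000), (azure,#F0FFFF), (peru,#CD853F)), order may vary
rdd.collectAsMap() // collects the pairs back into a Map on the driver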

Related

How to convert org.apache.spark.sql.Column to data types like Long or String

I am new to Scala and Spark. I am trying to load data from Spark SQL to build GraphX vertices, but I am facing an error that I don't know how to solve. This is the code:
val vRDD: RDD[(VertexId, String)] = spark.sparkContext.parallelize(Seq(spark.table("sw")))
.map(row => (row("id"), row("title_value")))
And this is the error:
<console>:36: error: type mismatch;
found : org.apache.spark.sql.Column
required: org.apache.spark.graphx.VertexId
(which expands to) Long
val vRDD: RDD[(VertexId, String)] = spark.sparkContext.parallelize(Seq(spark.table("sw")))
.map(row => (row("id"), row("title_value")))
The error message is correct: you are getting Column objects back. You can pull those values out of the columns with the following:
spark.sparkContext.parallelize(Seq(spark.table("testme")))
.map(row => (row("id").asInstanceOf[Long],row("name").toString))
or maybe:
spark.sparkContext.parallelize(Seq(spark.table("testme")))
.map(row => (row("id").asInstanceOf[VertexId],row("name").asInstanceOf[String]))

Convert data frame to strongly typed data set?

I have the following class, whose run method returns a list of ints from a database table.
class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(date: LocalDate) = {
    sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"dbo.GetList('$date')"
    )).load()
  }
}
The following code
val conf = new SparkConf()
val sc = new SparkContext(conf.setAppName("Test").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val itemListJob = new ItemList(sqlContext, jdbcSqlConn)
val processed = itemListJob.run(rc, priority).select("id").map(d => {
  runJob.run(d) // d expected to be int
})
processed.saveAsTextFile("c:\\temp\\mpa")
gets the error:
[error] ...\src\main\scala\main.scala:39: type mismatch;
[error] found : org.apache.spark.sql.Row
[error] required: Int
[error] runJob.run(d)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
I tried
val processed = itemListJob.run(rc, priority).select("id").as[Int].map(d =>
case class itemListRow(id: Int); ....as[itemListRow].
Both of them got the error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Update:
I'm trying to add the implicits import statements.
import sc.implicits._ got the error
value implicits is not a member of org.apache.spark.SparkContext
import sqlContext.implicits._ is OK. However, the later statement processed.saveAsTextFile("c:\\temp\\mpa") then got the error
value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[(Int, java.time.LocalDate)]
You should simply change the line with select("id") to be as follows:
select("id").as[Int]
You should import the implicits for converting Rows to Ints.
import sqlContext.implicits._ // <-- import implicits that add the "magic"
You could also change run to include the conversion, as follows (note the comments on the lines I added):
class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(date: LocalDate) = {
    import sqlContext.implicits._ // <-- import implicits that add the "magic"
    sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"dbo.GetList('$date')"
    )).load()
      .select("id") // <-- take only "id" (which Spark pushes down and hence makes your query faster)
      .as[Int]      // <-- convert Row into Int
  }
}
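With run now returning a Dataset[Int], the call site from the question can work on Ints directly. A rough sketch, assuming runJob.run accepts an Int and returns a type Spark can encode:
val processed = itemListJob.run(date).map(id => runJob.run(id)) // id is an Int here, not a Row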
value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[(Int, java.time.LocalDate)]
The compilation error is because you are trying to use the saveAsTextFile operation on a Dataset, where it is not available.
Writing in Spark SQL goes through DataFrameWriter, which is available via the write operator:
write: DataFrameWriter[T] Interface for saving the content of the non-streaming Dataset out into external storage.
So you should do the following:
processed.write.text("c:\\temp\\mpa")
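Note that the text data source writes a single string column only, so if processed is still the Dataset[(Int, java.time.LocalDate)] from the error message, you may need to render each record to a string first (a sketch, assuming sqlContext.implicits._ is in scope) or pick a format such as CSV:
processed.map { case (id, date) => s"$id,$date" }.write.text("c:\\temp\\mpa")
// or simply: processed.write.csv("c:\\temp\\mpa")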
Done!

VertexRDD giving me type mismatch error

I am running the following code attempting to create a Graph in GraphX in Apache Spark.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.VertexId
//loads file from the array
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/data/google-plus/2309.graph");
//maps lines and takes the first 21 characters of each line which is the node.
val result = lines.map( line => line.substring(0,20))
//creates a new variable with each node followed by a long .
val result2 = result.map(word => (word,1L).toLong)
//where i am getting an error
val vertexRDD: RDD[(Long,Long)] = sc.parallelize(result2)
I am getting the following error:
error: type mismatch;
found : org.apache.spark.rdd.RDD[(Long, Long)]
required: Seq[?]
Error occurred in an application involving default arguments.
val vertexRDD: RDD[(Long, Long)] = sc.parallelize(result2)
First, your maps can be simplified to the following code:
val vertexRDD: RDD[(Long, Long)] =
lines.map(line => (line.substring(0, 17).toLong, 1L))
Now, to your error: you cannot call sc.parallelize with an RDD. Your vertexRDD is already defined by result2. You can then create your graph with result2 and your EdgesRDD:
val g = Graph(result2, edgesRDD)
or, if using my suggestion:
val g = Graph(vertexRDD, edgesRDD)
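For completeness, a minimal sketch of how the pieces fit together, using a hypothetical edgesRDD with made-up vertex IDs:
import org.apache.spark.graphx.{Edge, Graph}
// hypothetical edges; in practice these come from your own edge data
val edgesRDD = sc.parallelize(Seq(Edge(1L, 2L, 1)))
val g = Graph(vertexRDD, edgesRDD)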

Converting String RDD to Int RDD

I am new to Scala. I want to know, when processing large datasets with Scala in Spark, whether it is possible to read the data as an Int RDD instead of a String RDD.
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because down the line I need to do the below:
val vectors = intArr.map(p => Vectors.dense(p))
which requires the type to be integer
Any kind of help is truly appreciated. Thanks in advance.
As far as I understand, one line should create one vector, so it should go like this:
val result = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toInt)))
numbers.map(_.toInt) will map every element of the array to an Int, so the result type will be Array[Int].
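One caveat: Vectors.dense expects Double values (its overloads take Array[Double] or Double varargs), so if the snippet above does not compile, converting to Double instead should work. A sketch of the same idea, assuming org.apache.spark.mllib.linalg.Vectors (the ml.linalg variant behaves the same way):
import org.apache.spark.mllib.linalg.Vectors
val vectors = sc
.textFile("Downloads/data/train.csv")
.map(line => line.split(","))
.map(numbers => Vectors.dense(numbers.map(_.toDouble))) // dense() wants Doubles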

Spark-Scala RDD

I have an RDD, RDD1, with the following schema:
RDD[String, Array[String]]
(let's call it RDD1)
and I would like to create a new RDD, RDD2, with each row as a (String, String) pair whose key and value come from RDD1.
For example:
RDD1 = Array(("Fruit",("Orange","Apple","Peach")),("Shape",("Square","Rectangle")),("Mathematician",("Aryabhatt")))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that this is because the map function is only applicable to RDDs and not to each string/char. Please help me find a way to use nested functions for this purpose in Spark.
Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x)
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
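Equivalently, the helper can be inlined with a pattern match; it is the same flatMap idea, just without the named function:
val rdd2 = rdd1.flatMap { case (key, values) => values.map(value => (key, value)) }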