I am new to Spark/Scala development. I am trying to parallelize a Map in Spark using Scala, but I am getting a type mismatch error.
scala> val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
<console>:21: error: type mismatch;
found : scala.collection.immutable.Map[String,String]
required: Seq[?]
Error occurred in an application involving default arguments.
val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
How should I do this?
SparkContext.parallelize transforms a Seq[T] into an RDD[T]. If you want to create an RDD[(String, String)] where each element is an individual key-value pair from the original Map, use:
import org.apache.spark.rdd.RDD
val m = Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F")
val rdd: RDD[(String, String)] = sc.parallelize(m.toSeq)
If you want an RDD[Map[String,String]] (not that it makes any sense with a single element), use:
val rdd: RDD[Map[String,String]] = sc.parallelize(Seq(m))
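Once you have the pair RDD (the RDD[(String, String)] version), the usual key-value operations are available on it; a minimal sketch:
val colors = rdd.keys.collect()      // Array(red, azure, peru)
val red = rdd.lookup("red")          // Seq(#FF0000)
val backToMap = rdd.collectAsMap()   // scala.collection.Map[String,String]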
Related
I am new to Scala and Spark. I am trying to load data from Spark SQL to build GraphX vertices, but I am facing an error that I don't know how to solve. This is the code:
val vRDD: RDD[(VertexId, String)] = spark.sparkContext.parallelize(Seq(spark.table("sw")))
.map(row => (row("id"), row("title_value")))
And this is the error:
<console>:36: error: type mismatch;
found : org.apache.spark.sql.Column
required: org.apache.spark.graphx.VertexId
(which expands to) Long
val vRDD: RDD[(VertexId, String)] = spark.sparkContext.parallelize(Seq(spark.table("sw")))
.map(row => (row("id"), row("title_value")))
The error message is correct: row("id") here returns a Column, not a value. Map over the table's rows instead and pull the values out, for example:
spark.table("testme").rdd
  .map(row => (row("id").asInstanceOf[Long], row("name").toString))
or maybe:
spark.table("testme").rdd
  .map(row => (row("id").asInstanceOf[VertexId], row("name").asInstanceOf[String]))
I have the following class; its run method returns a list of ints from a database table.
class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(date: LocalDate) = {
    sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"dbo.GetList('$date')"
    )).load()
  }
}
The following code
val conf = new SparkConf()
val sc = new SparkContext(conf.setAppName("Test").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val itemListJob = new ItemList(sqlContext, jdbcSqlConn)
val processed = itemListJob.run(rc, priority).select("id").map(d => {
  runJob.run(d) // d expected to be int
})
processed.saveAsTextFile("c:\\temp\\mpa")
gets the error:
[error] ...\src\main\scala\main.scala:39: type mismatch;
[error] found : org.apache.spark.sql.Row
[error] required: Int
[error] runJob.run(d)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
I tried
val processed = itemListJob.run(rc, priority).select("id").as[Int].map(d =>
case class itemListRow(id: Int); ....as[itemListRow].
Both of them got the error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Update:
I tried adding the implicits import statements.
import sc.implicits._ got the error:
value implicits is not a member of org.apache.spark.SparkContext
import sqlContext.implicits._ is OK. However, the later statement processed.saveAsTextFile("c:\\temp\\mpa") then got the error:
value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[(Int, java.time.LocalDate)]
You should simply change the line with select("id") to be as follows:
select("id").as[Int]
You should import the implicits for converting Rows to Ints.
import sqlContext.implicits._ // <-- import implicits that add the "magic"
You could also change run to include the conversion as follows (note the comments on the lines I added):
class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(date: LocalDate) = {
    import sqlContext.implicits._ // <-- import implicits that add the "magic"
    sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"dbo.GetList('$date')"
    )).load()
      .select("id") // <-- take only "id" (which Spark pushes down and hence makes your query faster)
      .as[Int]      // <-- convert Row into Int
  }
}
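For completeness, a sketch of how the calling code from the question could then look; runJob, jdbcSqlConn and the date value are the question's own names and are only assumed here:
import sqlContext.implicits._

val itemListJob = new ItemList(sqlContext, jdbcSqlConn)
val ids = itemListJob.run(LocalDate.now())    // Dataset[Int]
val processed = ids.map(id => runJob.run(id)) // id is an Int here, not a Row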
value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[(Int, java.time.LocalDate)]
The compilation error is because you are trying to use the saveAsTextFile operation on a Dataset, where it is not available.
Writing in Spark SQL is done through DataFrameWriter, which is available via the write operator:
write: DataFrameWriter[T] Interface for saving the content of the non-streaming Dataset out into external storage.
So you should do the following:
processed.write.text("c:\\temp\\mpa")
Done!
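One caveat (an addition, not from the original answer): the text data source accepts only a single string column, so writing a two-column Dataset[(Int, java.time.LocalDate)] with write.text will fail at runtime. A sketch of two ways around it, assuming sqlContext.implicits._ is in scope:
// flatten to a single string column, then use the text source
processed.map { case (id, date) => s"$id,$date" }.write.text("c:\\temp\\mpa")

// or let the csv source handle the two columns directly
processed.write.csv("c:\\temp\\mpa")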
I am running the following code attempting to create a Graph in GraphX in Apache Spark.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.VertexId
// loads the file from HDFS
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/data/google-plus/2309.graph")
// maps each line and takes its first 20 characters, which form the node
val result = lines.map(line => line.substring(0, 20))
// creates a new variable with each node followed by a Long
val result2 = result.map(word => (word.toLong, 1L))
// where I am getting the error
val vertexRDD: RDD[(Long,Long)] = sc.parallelize(result2)
I am getting the following error:
error: type mismatch;
found : org.apache.spark.rdd.RDD[(Long, Long)]
required: Seq[?]
Error occurred in an application involving default arguments.
val vertexRDD: RDD[(Long, Long)] = sc.parallelize(result2)
First, your maps can be simplified to the following code:
val vertexRDD: RDD[(Long, Long)] =
lines.map(line => (line.substring(0, 17).toLong, 1L))
Now, to your error: you cannot call sc.parallelize on something that is already an RDD; it expects a local Seq. Your vertexRDD is already defined by result2 (or by the simplified map above), so use it directly. You can then create your graph with result2 and your edges RDD:
val g = Graph(result2, edgesRDD)
or, if using my suggestion:
val g = Graph(vertexRDD, edgesRDD)
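If you don't have an edges RDD yet, a minimal sketch of how the graph could be assembled (the edges here are made up purely for illustration):
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val edgesRDD: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val g = Graph(vertexRDD, edgesRDD)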
I am new to Scala. I want to know, when processing large datasets with Scala in Spark, whether it is possible to read the data as an Int RDD instead of a String RDD.
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because down the line I need to do the following:
val vectors = intArr.map(p => Vectors.dense(p))
which requires the values to be numeric.
Any kind of help is truly appreciated. Thanks in advance.
As far as I understood, one line should create one vector, so it should go like this:
val result = sc
  .textFile("Downloads/data/train.csv")
  .map(line => line.split(","))
  .map(numbers => Vectors.dense(numbers.map(_.toDouble)))
numbers.map(_.toDouble) maps every element of the array to a Double, so the result type is Array[Double], which is what Vectors.dense expects (it will not accept an Array[Int] directly).
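If you really do want an integer RDD first, as in the original question, a minimal sketch (the conversion to Double is still needed at the point where Vectors.dense is called):
val intArr = sc
  .textFile("Downloads/data/train.csv")
  .map(_.split(",").map(_.toInt))   // RDD[Array[Int]]

val vectors = intArr.map(p => Vectors.dense(p.map(_.toDouble)))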
I have an RDD, RDD1, with the following schema:
RDD[(String, Array[String])]
(let's call it RDD1)
and I would like to create a new RDD, RDD2, where each row is a (String, String) pair made of a key from RDD1 and one of its values.
For example:
RDD1 =Array(("Fruit",("Orange","Apple","Peach")),("Shape",("Square","Rectangle")),("Mathematician",("Aryabhatt"))))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that it is because the map function is only applicable to RDDs and not to each string/char. Please help me with a way to use nested functions for this purpose in Spark.
Break the problem down: each input pair should become one output pair per value:
("Fruit", Array("Orange","Apple","Peach")) -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
A function that does this for a single line:
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x))
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
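Equivalently, the same idea can be written inline with a pattern match:
val rdd2 = rdd1.flatMap { case (key, values) => values.map(value => (key, value)) }
// rdd2.collect() yields Array((Fruit,Orange), (Fruit,Apple), (Fruit,Peach), (Shape,Square), ...)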