Scala Spark type mismatch

I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic])
  : RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      val count = events.size.toDouble
      ((p_id, f_code), count)
    }
  //should be in form:
  //diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
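A likely cause, sketched here as an assumption rather than a confirmed fix: the method body ends with a val definition, and a definition evaluates to Unit, so the whole method returns Unit instead of the grouped RDD. Making the pipeline the last expression of the method (or ending the body with grouped_patients) should satisfy the declared return type:

def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  // The pipeline is now the last expression of the method, so its type,
  // RDD[((String, String), Double)], becomes the method's result type.
  diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      ((p_id, f_code), events.size.toDouble)
    }
}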

Related

Scala type mismatch: how to convert Iterable[Double] to Double

I am trying to modify this example as follows:
@BigQueryType.fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]")
class Row

@BigQueryType.toTable
case class Result(weight_pounds: Double)

sc.typedBigQuery[Row]()
  .map(r => r.weight_pounds.getOrElse(0.0)) // return 0 if weight_pounds is None
  .top(100)
  .map(x => Result(x))
  // Convert elements from Result to TableRow and save output to BigQuery.
  .saveAsTypedBigQueryTable(
    Table.Spec(args("output")),
    writeDisposition = WRITE_TRUNCATE,
    createDisposition = CREATE_IF_NEEDED
  )
The above gives me the following error:
[error] /Users/me/scio/scio-examples/src/main/scala/com/spotify/scio/examples/foo.scala:66:24: type mismatch;
[error] found : Iterable[Double]
[error] required: Double
[error] .map(x => Result(x))
[error] ^
How can I fix this?
The trick is to add a flatMap(x => x) right after top(100): top(100) yields a single Iterable[Double], so flattening it turns the collection back into individual Double elements that map(x => Result(x)) can consume.
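A minimal sketch of that fix, keeping the same names and save options as in the question:

sc.typedBigQuery[Row]()
  .map(r => r.weight_pounds.getOrElse(0.0))
  .top(100)
  .flatMap(x => x) // Iterable[Double] -> individual Double elements
  .map(x => Result(x))
  .saveAsTypedBigQueryTable(
    Table.Spec(args("output")),
    writeDisposition = WRITE_TRUNCATE,
    createDisposition = CREATE_IF_NEEDED
  )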

Spark Scala: reading a text file with map and filter

I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        values(0).toLong,
        //util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    }
    .filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove .toSparse or .filter(x => x._1 > 0), the code compiles successfully.
Does anyone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
The code compiles if you remove .toSparse because the declared element type of your pair RDD is (ItemId, Vector).
Vectors.dense gives you an org.apache.spark.ml.linalg.Vector, but calling .toSparse narrows the inferred type to org.apache.spark.ml.linalg.SparseVector. Since RDD is invariant in its type parameter (as the compiler note says), RDD[(Long, SparseVector)] does not conform to the declared RDD[(ItemId, Vector)]. Upcasting the vector back to Vector, for example with a type ascription, resolves the mismatch.
As for filtering out non-numeric ids, your approach is a reasonable way to do it.
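A minimal sketch of that fix, assuming ItemId is an alias for Long as in the question, and reusing the commented-out util.Try so non-numeric ids are dropped by the filter:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Try

def readDataset(path: String, sparkSession: SparkSession): RDD[(Long, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        // Non-numeric ids become 0 and are removed by the filter below.
        Try(values(0).toLong).getOrElse(0L),
        // Ascribe the sparse vector as Vector so the element type stays
        // (Long, Vector) rather than (Long, SparseVector).
        Vectors.dense(values.drop(1).map(_.toDouble)).toSparse: Vector
      )
    }
    .filter(_._1 > 0)
}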

Slick 3.2 Error: No matching Shape found

I'm not sure what is wrong here.
The following code block throws an error:
(for {
  (e, r) <- tblDetail.joinLeft(tblMaster).on((e, r) => r.col1 === e.col3)
} yield (e.id)
Error
No matching Shape found.
[error] Slick does not know how to map the given types.
[error] Possible causes: T in Table[T] does not match your * projection,
[error] you use an unsupported type in a Query (e.g. scala List),
[error] or you forgot to import a driver api into scope.
[error] Required level: slick.lifted.FlatShapeLevel
[error] Source type: (slick.lifted.Rep[Int], slick.lifted.Rep[String],...)
[error] Unpacked type: T
[error] Packed type: G
[error] (e,r) <- tblDetail.joinLeft(tblMaster).on((e,r) => r.col1 === e.col3)
I checked the Slick tables for tblDetail and tblMaster; they seemed to be fine.
tblMaster
class TblMaster(tag: Tag)
  extends Table[(Int, String, ...)](tag, "tbl_master") {
  def id = column[Int]("id")
  def col3 = column[String]("col3")
  def * = (id, col3)
}
tblDetail
class TblDetail(tag: Tag)
  extends Table[Entity](tag, "tbl_detail") {
  def id = column[Int]("id")
  def col1 = column[String]("col1")
  def * : ProvenShape[Entity] = (id, col1) <>
    ((Entity.apply _).tupled, Entity.unapply)
}
Any help would be appreciated.
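No answer is included above, but the first hint in the error ("T in Table[T] does not match your * projection") does match the code shown: TblMaster is declared as Table[(Int, String, ...)] while its * projection is only (id, col3), i.e. (Int, String). A sketch of a consistent definition, under the assumption that the table really has just those two columns, would be:

// Assumes the profile API is in scope, e.g. import slick.jdbc.H2Profile.api._
class TblMaster(tag: Tag)
  extends Table[(Int, String)](tag, "tbl_master") {
  def id = column[Int]("id")
  def col3 = column[String]("col3")
  // The row type in Table[T] must match what * projects.
  def * = (id, col3)
}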

Use SQL in DStream.transform() over Spark Streaming?

There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform():
case class AlertMsg(host: String, count: Int, sum: Double)

val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
  if (rdd.count > 0) {
    val t = sqc.jsonRDD(rdd)
    t.registerTempTable("logstash")
    val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
  } else {
    rdd
  }
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems it only works if I use sqlreport.map(r => r.toString); is that the correct usage?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
Here, both branches of the if must produce the same type U, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
Remove the if (rdd.count > 0) optimization so that you have a single transformation path.
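A minimal sketch of that suggestion, keeping the names from the question (ssc, sqc, AlertMsg), so the transform always returns RDD[AlertMsg]:

case class AlertMsg(host: String, count: Int, sum: Double)

val lines = ssc.socketTextStream("localhost", 8888)
lines.transform { rdd =>
  // Always go through the SQL path so the result type is RDD[AlertMsg].
  val t = sqc.jsonRDD(rdd)
  t.registerTempTable("logstash")
  val sqlreport = sqc.sql(
    "SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a " +
    "FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 " +
    "GROUP BY host ORDER BY host_c DESC LIMIT 100")
  sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()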

Action.async and using a for-comprehension in a separate function

This might be even more Scala than Play related, but I have this issue within some Play code.
I have
def index = Action.async { implicit request =>
  val content1Future = Blogs.getAllBlogs(request)
  val sidebar1Future = Blogs.getAllBlogsOverview(request)
  for {
    sidebar1 <- sidebar1Future
    content1 <- content1Future
    sidebar1Body <- Pagelet.readBody(sidebar1)
    content1Body <- Pagelet.readBody(content1)
  } yield {
    val sidebarcontents = List(sidebar1Body)
    val contents = List(content1Body)
    Ok(views.html.main("Index", contents, sidebarcontents)).withHeaders(("Cache-Control", "no-cache"))
  }
}
and it works fine.
Now I try to extract the for-comprehension into a separate function that deals with the 'sidebar', passing in the 'content'.
Like
def index = Action.async { implicit request =>
  standardMain(Blogs.getAllBlogs(request))
}

def standardMain(content1Future: => Future[Result])(implicit request: Request[_]): Future[Result] = {
  val sidebar1Future = Blogs.getAllBlogsOverview(request)
  for {
    sidebar1 <- sidebar1Future
    content1 <- content1Future
    sidebar1Body <- Pagelet.readBody(sidebar1)
    content1Body <- Pagelet.readBody(content1)
  } yield {
    val sidebarcontents = List(sidebar1Body)
    val contents = List(content1Body)
    Ok(views.html.main("Index", contents, sidebarcontents)).withHeaders(("Cache-Control", "no-cache"))
  }
}
But then I get
[info] Compiling 1 Scala source to D:\Dropbox\Playground\PlayWorld\play-with-forms\target\scala-2.11\classes...
[error] D:\Dropbox\Playground\PlayWorld\play-with-forms\app\controllers\Application.scala:91: type mismatch;
[error] found : scala.concurrent.Future[play.api.mvc.Result]
[error] required: play.api.libs.iteratee.Iteratee[Array[Byte],?]
[error] content1 <- content1Future
[error] ^
[error] D:\Dropbox\Playground\PlayWorld\play-with-forms\app\controllers\Application.scala:90: type mismatch;
[error] found : play.api.libs.iteratee.Iteratee[Array[Byte],Nothing]
[error] required: scala.concurrent.Future[play.api.mvc.Result]
[error] sidebar1 <- sidebar1Future
[error] ^
[error] two errors found
[error] (compile:compile) Compilation failed
[error] application -
I tried many different calls, but somehow I do not know how to get a play.api.libs.iteratee.Iteratee[Array[Byte],?].
How can I achieve this? Thanks.