use SQL in DStream.transform() over Spark Streaming? - scala

There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform()?
case class AlertMsg(host: String, count: Int, sum: Double)
val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
  if (rdd.count > 0) {
    val t = sqc.jsonRDD(rdd)
    t.registerTempTable("logstash")
    val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
  } else {
    rdd
  }
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems the only usage that compiles is sqlreport.map(r => r.toString)?

dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, both branches of the if must result in the same type, which is not the case here:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
In this case, remove the if rdd.count ... optimization so that you have a unique transformation path.
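A rough sketch of that single-path version, reusing the names from the question (sqc, AlertMsg):

lines.transform { rdd =>
  // One path for every batch: always produce an RDD[AlertMsg],
  // so transform can infer U = AlertMsg.
  val t = sqc.jsonRDD(rdd)
  t.registerTempTable("logstash")
  val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
  sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()

If empty batches turn out to be a problem for jsonRDD's schema inference, an alternative is to keep the if but return rdd.sparkContext.emptyRDD[AlertMsg] in the else branch instead of rdd, so both branches still share the same element type.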

Related

Scala type mismatch how to convert Iterable[Double] to Double

I am trying to modify this example as follows:
@BigQueryType.fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]")
class Row

@BigQueryType.toTable
case class Result(weight_pounds: Double)

sc.typedBigQuery[Row]()
  .map(r => r.weight_pounds.getOrElse(0.0)) // return 0 if weight_pounds is None
  .top(100)
  .map(x => Result(x))
  // Convert elements from Result to TableRow and save output to BigQuery.
  .saveAsTypedBigQueryTable(
    Table.Spec(args("output")),
    writeDisposition = WRITE_TRUNCATE,
    createDisposition = CREATE_IF_NEEDED
  )
The above gives me the following error:
[error] /Users/me/scio/scio-examples/src/main/scala/com/spotify/scio/examples/foo.scala:66:24: type mismatch;
[error] found : Iterable[Double]
[error] required: Double
[error] .map(x => Result(x))
[error] ^
How can I fix this?
The trick is to add a flatMap(x => x) right after top(100): top collects the 100 largest values into a single Iterable[Double] element, so flattening it brings you back to individual Doubles that Result(x) can accept.
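For reference, a minimal sketch of where that flatten sits in the pipeline from the question:

sc.typedBigQuery[Row]()
  .map(r => r.weight_pounds.getOrElse(0.0)) // return 0 if weight_pounds is None
  .top(100)                                 // SCollection[Iterable[Double]]
  .flatMap(x => x)                          // back to SCollection[Double]
  .map(x => Result(x))
  .saveAsTypedBigQueryTable(
    Table.Spec(args("output")),
    writeDisposition = WRITE_TRUNCATE,
    createDisposition = CREATE_IF_NEEDED
  )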

scala spark type mismatching

I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      val count = events.size.toDouble
      ((p_id, f_code), count)
    }
  // should be in form:
  // diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
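The error message points at the cause: the function body ends with a val definition, which has type Unit, so the RDD is never returned. A hedged sketch of a likely fix, assuming patientID and code are Strings, is to make the pipeline the last expression of the block (the group key can also be reused directly instead of being recomputed from the events):

def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  // The last expression of the block is the return value, so drop the val.
  diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map { case ((p_id, f_code), events) =>
      ((p_id, f_code), events.size.toDouble)
    }
}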

Spark scala reading text file with map and filter

I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
And I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map({ line =>
      val values = line.split(",")
      (
        values(0).toLong,
        //util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    })
    .filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove the .toSparse or the .filter(x => x._1 > 0), the code compiles successfully.
Does anyone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
The code compiles successfully if you remove .toSparse because the element type of your pair RDD is then (ItemId, Vector), exactly what the signature declares.
Vectors.dense returns a value typed as org.apache.spark.ml.linalg.Vector, and calling .toSparse on it yields an org.apache.spark.ml.linalg.SparseVector. Because RDD is invariant in its type parameter (as the compiler note says), an RDD[(Long, SparseVector)] is not accepted where an RDD[(Long, Vector)] is required. Removing the .filter also compiles because then the expected return type flows directly into the map call and the tuple is typed as (Long, Vector); with the .filter in between, the map's element type is fixed as (Long, SparseVector) before the expected type is considered.
As for filtering out non-numeric IDs, I would say your method is a good way to do that.
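If you do want the sparse representation, a minimal sketch (same names as the question, with ItemId assumed to be an alias for Long) is to ascribe the value to the Vector supertype, so the inferred element type stays (Long, Vector), and to use the Try you already have commented out to skip non-numeric ids:

import scala.util.Try
import org.apache.spark.ml.linalg.{Vector, Vectors}

def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  sparkSession.sparkContext
    .textFile(path)
    .map { line =>
      val values = line.split(",")
      val id = Try(values(0).toLong).getOrElse(0L)   // non-numeric ids become 0
      val vec: Vector = Vectors.dense(values.drop(1).map(_.toDouble)).toSparse
      (id, vec)
    }
    .filter(_._1 > 0)                                // drops the id = 0 lines
}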

filter with instanceOf Tuple

I am trying to find the co-occurrence of words. The following is the code I am using:
val dataset = df.select("entity").rdd.map(row => row.getList(0)).filter(r => r.size() > 0).distinct()
println("dataset")
dataset.take(10).foreach(println)
Example Dataset
dataset
[aa]
[bb]
[cc]
[dd]
[ee]
[ab, ac, ad]
[ff]
[ef, fg]
[ab, gg, hh]
Code Snippet
case class tupleIn(a: String, b: String)
case class tupleOut(i: tupleIn, c: Long)
val cooccurMapping = dataset.flatMap(
  list => {
    list.toArray().map(e => e.asInstanceOf[String].toLowerCase).flatMap(
      ele1 => {
        list.toArray().map(e => e.asInstanceOf[String].toLowerCase).map(ele2 => {
          if (ele1 != ele2) {
            ((ele1, ele2), 1L)
          }
        })
      })
  })
How to filter from this?
I have tried
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
:121: warning: fruitless type test: a value of type Unit cannot also be a ((String, String), Long)
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
             ^
:121: error: isInstanceOf cannot test if value types are references.
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
and
.filter(e => e.isInstanceOf[tupleOut])
:122: warning: fruitless type test: a value of type Unit cannot also be a coocrTupleOut
.filter(e => e.isInstanceOf[tupleOut])
             ^
:122: error: isInstanceOf cannot test if value types are references.
.filter(e => e.isInstanceOf[tupleOut])
If I map
.map(e => e.asInstanceOf[Tuple2[(String, String), Long]])
The above snippet works fine but gives this exception after some time:
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast
to scala.Tuple2 at
$line84834447093.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$9.apply(:123)
at
$line84834447093.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$9.apply(:123)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Why does isInstanceOf not work in filter() while asInstanceOf works in map()?
The result of your code is a collection of items of type Unit (the innermost map produces Unit because the if has no else branch), so there are no tuples to filter or map in the first place. Note that map does asInstanceOf, which casts to the type you want and fails at runtime, whereas isInstanceOf only checks the type.
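For completeness, a hedged sketch of how the original flatMap could emit the pairs directly, so that no Unit values are produced at all (same dataset variable as above):

val cooccurMapping = dataset.flatMap { list =>
  val words = list.toArray().map(_.asInstanceOf[String].toLowerCase)
  for {
    ele1 <- words
    ele2 <- words
    if ele1 != ele2 // the guard replaces the else-less if
  } yield ((ele1, ele2), 1L)
}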
In any event, if I understand your intent correctly, you can get what you want with Spark's built-in functions:
// Assumes an existing SparkSession `spark` and SparkContext `sc`.
import org.apache.spark.sql.functions._
import spark.implicits._

val l = Seq(Seq("aa"), Seq("bb", "vv"), Seq("bbb"))
val rdd = sc.parallelize(l)
val df = rdd.map(Tuple1(_)).toDF("data")           // one array<string> column named "data"
val ndf = df.withColumn("data", explode($"data"))  // one row per element
val cm = ndf.select($"data".as("elec1"))
  .crossJoin(ndf.select($"data".as("elec2")))
  .withColumn("cnt", lit(1L))
val coocurenceMap = cm.filter($"elec1" =!= $"elec2")
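To turn that into actual co-occurrence counts, a grouped sum over the pair (a hypothetical continuation using the same column names) would be the natural last step:

val coocurrenceCounts = coocurenceMap
  .groupBy($"elec1", $"elec2")
  .agg(sum($"cnt").as("cnt"))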

Slick3.2 Error: No matching Shape found

I'm not sure what is wrong here.
The following code block is throwing error:
(for {
  (e, r) <- tblDetail.joinLeft(tblMaster).on((e, r) => r.col1 === e.col3)
} yield e.id)
Error
No matching Shape found.
[error] Slick does not know how to map the given types.
[error] Possible causes: T in Table[T] does not match your * projection,
[error] you use an unsupported type in a Query (e.g. scala List),
[error] or you forgot to import a driver api into scope.
[error] Required level: slick.lifted.FlatShapeLevel
[error] Source type: (slick.lifted.Rep[Int], slick.lifted.Rep[String],...)
[error] Unpacked type: T
[error] Packed type: G
[error] (e,r) <- tblDetail.joinLeft(tblMaster).on((e,r) => r.col1 === e.col3)
I checked the Slick tables for tblDetail and tblMaster; they seem to be fine.
tblMaster
class TblMaster(tag:Tag)
extends Table[(Int,String,...)](tag, "tbl_master") {
def id = column[Int]("id")
def col3 = column[String]("col3")
def * = (id,col3)
}
tblDetail
class TblDetail(tag:Tag)
extends Table[Entity](tag, "tbl_detail") {
def id = column[Int]("id")
def col1 = column[String]("col1")
def * : ProvenShape[Entity] = (id,col1) <>
((Entity.apply _).tupled, Entity.unapply)
}
Any help would be appreciated.
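One likely culprit, going by the "T in Table[T] does not match your * projection" hint in the error: TblMaster's type parameter lists more columns than the two in its * projection. A hedged sketch of that fix, assuming only id and col3 are actually needed (with the profile api in scope, e.g. import slick.jdbc.H2Profile.api._):

class TblMaster(tag: Tag)
  extends Table[(Int, String)](tag, "tbl_master") {
  def id   = column[Int]("id")
  def col3 = column[String]("col3")
  def *    = (id, col3)
}

Equivalently, extend the * projection so it covers every column named in the table's type parameter.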