scala fastparse typechecking

I am puzzled by why the following code using scala fastparse 0.4.3 fails typechecking.
val White = WhitespaceApi.Wrapper{
  import fastparse.all._
  NoTrace(CharIn(" \t\n").rep)
}
import fastparse.noApi._
import White._

case class Term(tokens: Seq[String])
case class Terms(terms: Seq[Term])

val token = P[String]( CharIn('a' to 'z', 'A' to 'Z', '0' to '9').rep(min = 1).! )
val term: P[Term] = P("[" ~ token.!.rep(sep = " ", min = 1) ~ "]").map(x => Term(x))
val terms = P("(" ~ term.!.rep(sep = " ", min = 1) ~ ")").map{ x => Terms(x) }
val parse = terms.parse("([ab bd ef] [xy wa dd] [jk mn op])")
The error message:
[error] .../MyParser.scala: type mismatch;
[error] found : Seq[String]
[error] required: Seq[Term]
[error] val terms = P("(" ~ term.!.rep(sep=" ", min=1) ~")").map{x => Terms(x)}
[error] ^
I would have imagined that since term is of type P[Term], and since the terms rule uses term.!.rep(...), the map should receive a Seq[Term].

I figured it out. My mistake was capturing (with !) redundantly in terms. That line should instead be written:
val terms = P("(" ~ term.rep(sep=" ", min=1) ~ ")").map{x => Terms(x)}
Notice that term.!.rep( has been rewritten to term.rep(.
Apparently, capturing (with !) in any rule returns the raw text matched by the captured sub-parser, overriding whatever value that sub-parser would otherwise produce. I guess this is a feature when used correctly. :)
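For reference, here is the complete corrected snippet (a minimal sketch, still against fastparse 0.4.3 as in the question), with comments on where each value is produced:

val White = WhitespaceApi.Wrapper{
  import fastparse.all._
  NoTrace(CharIn(" \t\n").rep)
}
import fastparse.noApi._
import White._

case class Term(tokens: Seq[String])
case class Terms(terms: Seq[Term])

// token captures its own text with `!`, so it produces a String
val token = P[String]( CharIn('a' to 'z', 'A' to 'Z', '0' to '9').rep(min = 1).! )
// token.!.rep yields Seq[String], which map turns into a Term
val term: P[Term] = P("[" ~ token.!.rep(sep = " ", min = 1) ~ "]").map(x => Term(x))
// term.rep (no `!`) yields Seq[Term], which map turns into a Terms
val terms: P[Terms] = P("(" ~ term.rep(sep = " ", min = 1) ~ ")").map(x => Terms(x))

val parse = terms.parse("([ab bd ef] [xy wa dd] [jk mn op])")
// parse should now be a Parsed.Success carrying Terms(Seq(Term(...), Term(...), Term(...)))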

Related

scala spark type mismatching

I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      val count = events.size.toDouble
      ((p_id, f_code), count)
    }
  // should be in form:
  // diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
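The error itself points at the cause: the method body ends with a val definition, so the whole block evaluates to Unit instead of an RDD. A minimal sketch of one way to fix it (assuming, per the error message, that FeatureTuple expands to ((String, String), Double)) is to make the transformed RDD the last expression of the method, for example by reusing the groupBy key instead of re-extracting it from the events:

def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  diagnostic
    .groupBy(x => (x.patientID, x.code))            // key by (patientID, code)
    .map { case ((patientId, code), events) =>
      ((patientId, code), events.size.toDouble)     // count of events per key
    }                                               // this expression is what the method returns
}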

Spark scala reading text file with map and filter

I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
And I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        values(0).toLong,
        // util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    }
    .filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove the .toSparse or the .filter(x => x._1 > 0), the code compiles successfully.
Does someone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the non-numeric id lines?
The code compiles if you remove .toSparse because the declared element type of your pair RDD is (ItemId, Vector).
Vectors.dense returns a dense vector statically typed as org.apache.spark.ml.linalg.Vector, but calling .toSparse on it yields an org.apache.spark.ml.linalg.SparseVector, which is not the type your pair RDD expects. As the compiler note says, SparseVector is a subtype of Vector, but RDD is invariant in its type parameter, so an RDD[(Long, SparseVector)] is not accepted where an RDD[(Long, Vector)] is required.
As for filtering out the non-integer IDs, I would say your method is a good way to do that.
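For what it's worth, a minimal sketch of the usual fix (my own suggestion, not part of the answer above): ascribe the element back to Vector so the inferred pair type matches the declared return type, and use the commented-out Try so that non-numeric ids become 0 and are dropped by the existing filter:

def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      val id = util.Try(values(0).toLong).getOrElse(0L)     // non-numeric ids become 0
      val vec: Vector =                                     // upcast SparseVector back to Vector
        Vectors.dense(values.slice(1, values.length).map(_.toDouble)).toSparse
      (id, vec)
    }
    .filter(_._1 > 0)                                       // drop the 0 placeholder ids
}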

Scala FastParse Library Error

I am trying to learn the Scala FastParse library. Towards this end I have written the following code:
import fastparse.noApi._
import fastparse.WhitespaceApi

object FastParsePOC {
  val White = WhitespaceApi.Wrapper{
    import fastparse.all._
    NoTrace(" ".rep)
  }

  def print(input: Parsed[String]): Unit = {
    input match {
      case Parsed.Success(value, index) => println(s"Success: $value $index")
      case f @ Parsed.Failure(error, line, col) => println(s"Error: $error $line $col ${f.extra.traced.trace}")
    }
  }

  def main(args: Array[String]): Unit = {
    import White._
    val parser = P("Foo" ~ "(" ~ AnyChar.rep(1).! ~ ")")
    val input1 = "Foo(Bar(10), Baz(20))"
    print(parser.parse(input1))
  }
}
But I get the error:
Error: ")" 21 Extra(Foo(Bar(10), Baz(20)), [traced - not evaluated]) parser:1:1 / (AnyChar | ")"):1:21 ...""
My expected output was "Bar(10), Baz(20)". It seems the parser above does not like the closing ")".
AnyChar.rep(1) greedily consumes the ) at the end of the input string as well, so the final ")" in ~ ")" is never reached.
If the ) symbol weren't used inside Bar and Baz, this could be solved by excluding ) from AnyChar like this:
val parser = P("Foo" ~ "(" ~ (!")" ~ AnyChar).rep(1).! ~ ")")
val input1 = "Foo(Bar(10*, Baz(20*)"
To make Bar and Baz work with the ) symbol, you could define separate parsers for each of them (also excluding ) from AnyChar). The following solution is a bit more flexible, as it allows any number of Bar and Baz occurrences, but I hope you get the idea.
val bar = P("Bar" ~ "(" ~ (!")" ~ AnyChar).rep(1) ~ ")")
val baz = P("Baz" ~ "(" ~ (!")" ~ AnyChar).rep(1) ~ ")")
val parser = P("Foo" ~ "(" ~ (bar | baz).rep(sep = ",").! ~ ")")
val input1 = "Foo(Bar(10), Baz(20))"
print(parser.parse(input1))
Result:
Success: Bar(10), Baz(20) 21

Unexpected Type Mismatch While Using Scala breeze.optimize.linear.LinearProgram

Right now I am playing around with the LinearProgram class in Scala Breeze, and I have gotten to the point where I optimize my linear programming problem using the following code.
import breeze.stats.distributions
import breeze.stats._
import breeze.linalg._
val lp = new breeze.optimize.linear.LinearProgram()
val unif_dist = breeze.stats.distributions.Uniform(-1,1)
val U = DenseMatrix.rand(1, 3, unif_dist).toArray
val V = DenseMatrix.rand(2, 3, unif_dist).toArray.grouped(3).toArray
val B = Array.fill(3)(lp.Binary())
val Objective = V
  .map(vi => U.zip(vi).map(uv => uv._1 * uv._2))
  .map(uvi => B.zip(uvi).map(buv => buv._1 * buv._2))
  .map(x => x.reduce(_ + _))
  .reduce(_ + _)
val lpp = ( Objective subjectTo() )
lp.maximize(lpp)
I receive the following error:
scala> lp.minimize(lpp)
<console>:45: error: type mismatch;
found : lp.Problem
required: lp.Problem
lp.minimize(lpp)
^
Has anyone here run into this before, and if so, did you come up with a way to fix it? Additionally, I am open to suggestions on a cleaner way to write the line where I assign Objective.

use SQL in DStream.transform() over Spark Streaming?

There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform():
case class AlertMsg(host: String, count: Int, sum: Double)
val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
  if (rdd.count > 0) {
    val t = sqc.jsonRDD(rdd)
    t.registerTempTable("logstash")
    val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
  } else {
    rdd
  }
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems that only something like sqlreport.map(r => r.toString) would be a correct usage?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, both branches of the if must result in the same type, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
In this case, remove the if (rdd.count > 0) optimization so that you have a single transformation path.
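For illustration, here is a minimal sketch of that single-path version (an assumption on my part: it simply reuses sqc and the query from the question; note that jsonRDD infers its schema from the data, so truly empty batches may still need an explicit schema passed to jsonRDD):

lines.transform { rdd =>
  val t = sqc.jsonRDD(rdd)          // infer the schema from the JSON lines of this batch
  t.registerTempTable("logstash")
  sqc.sql(
    "SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash " +
    "WHERE path = '/var/log/system.log' AND lineno > 70 " +
    "GROUP BY host ORDER BY host_c DESC LIMIT 100")
    .map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()
// every batch now yields an RDD[AlertMsg], so transform can infer U = AlertMsg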