I am trying to find the co-occurrence of words. Following is the code I am using.
val dataset = df.select("entity").rdd.map(row => row.getList(0)).filter(r => r.size() > 0).distinct()
println("dataset")
dataset.take(10).foreach(println)
Example Dataset
dataset
[aa]
[bb]
[cc]
[dd]
[ee]
[ab, ac, ad]
[ff]
[ef, fg]
[ab, gg, hh]
Code Snippet
case class tupleIn(a: String,b: String)
case class tupleOut(i: tupleIn, c: Long)
val cooccurMapping = dataset.flatMap(
list => {
list.toArray().map(e => e.asInstanceOf[String].toLowerCase).flatMap(
ele1 => {
list.toArray().map(e => e.asInstanceOf[String].toLowerCase).map(ele2 => {
if (ele1 != ele2) {
((ele1, ele2), 1L)
}
})
})
})
How to filter from this?
I have tried
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
:121: warning: fruitless type test: a value of type Unit cannot also be a ((String, String), Long)
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
^
:121: error: isInstanceOf cannot test if value types are references.
.filter(e => e.isInstanceOf[Tuple2[(String, String), Long]])
.filter(e => e.isInstanceOf[tupleOut])
:122: warning: fruitless type test: a value of type Unit cannot also be a coocrTupleOut
.filter(e => e.isInstanceOf[tupleOut])
^
:122: error: isInstanceOf cannot test if value types are references.
.filter(e => e.isInstanceOf[tupleOut])
If I map
.map(e => e.asInstanceOf[Tuple2[(String, String), Long]])
The above snippet compiles and runs, but after some time it throws this exception:
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to scala.Tuple2
at $line84834447093.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$9.apply(:123)
at $line84834447093.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$9.apply(:123)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Why does isInstanceOf not work in filter() while asInstanceOf works in map()?
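Your inner if has no else branch, so whenever ele1 == ele2 the expression evaluates to Unit, and the resulting collection does not contain only tuples. The map uses asInstanceOf, which casts to the type you want; it compiles but fails at runtime with a ClassCastException as soon as it meets a Unit value. The filter uses isInstanceOf, which checks the type, and the compiler rejects that check here.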
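A minimal sketch of this point, using plain Scala collections instead of an RDD. The sample lists and the names withUnits and pairs are illustrative assumptions, not taken from the question. An if without an else evaluates to () when the condition is false, so the mapped collection is not a collection of tuples; moving the condition into a for guard emits only the pairs, so no later filter or cast is needed.

```scala
// Sample input standing in for the lists held by the RDD (assumed data).
val lists: List[List[String]] = List(List("aa"), List("ab", "ac", "ad"))

// `if` without `else`: the false branch yields (), so the result
// is not a list of tuples and can only be typed as List[Any].
val withUnits: List[Any] = lists.flatMap { list =>
  list.flatMap(e1 => list.map(e2 => if (e1 != e2) ((e1, e2), 1L)))
}

// A guard inside the for-comprehension keeps only the real pairs.
val pairs: List[((String, String), Long)] = for {
  list <- lists
  e1   <- list.map(_.toLowerCase)
  e2   <- list.map(_.toLowerCase)
  if e1 != e2
} yield ((e1, e2), 1L)
```

The same comprehension shape could be returned from the function passed to the RDD's flatMap, since that function only has to return a collection of results.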
In any event, if I understand your intent correctly, you can get what you want with Spark's built-in functions:
val l = List(List("aa"), List("bb", "vv"), List("bbb"))
val rdd = sc.parallelize(l)
import spark.implicits._ // needed for toDF and the $"..." column syntax
val df = rdd.toDF("data")
import org.apache.spark.sql.functions._
val ndf = df.withColumn("data", explode($"data"))
val cm = ndf.select($"data".as("elec1")).crossJoin(ndf.select($"data".as("elec2"))).withColumn("cnt", lit(1L))
// =!= is the column inequality operator; !== is deprecated since Spark 2.0
val coocurenceMap = cm.filter($"elec1" =!= $"elec2")
Related
I need to group my rdd by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic])
: RDD[FeatureTuple] = {
val grouped_patients = diagnostic
.groupBy(x => (x.patientID, x.code))
.map(_._2)
.map{ events =>
val p_id = events.map(_.patientID).take(1).mkString
val f_code = events.map(_.code).take(1).mkString
val count = events.size.toDouble
((p_id, f_code), count)
}
//should be in form:
//diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD. However, I do not use collect(), so it does not help me.
I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
I want to read the file, ignoring lines such as a_123,0,0,0,6,...,3, to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
val sc = sparkSession.sparkContext
sc.textFile(path)
.map({ line => val values=line.split(",")
(
values(0).toLong,
//util.Try(values(0).toLong).getOrElse(0L),
Vectors.dense(values.slice(1, values.length).map {x => x.toDouble }).toSparse
)})
.filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove either the .toSparse or the .filter(x => x._1 > 0), the code compiles successfully.
Does anyone know why, and what I should do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
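The code compiles if you remove toSparse because the element type of your pair RDD is then (ItemId, Vector), which is what the method signature requires.
org.apache.spark.ml.linalg.Vector is the common supertype of DenseVector and SparseVector. Vectors.dense returns a Vector, but toSparse returns the more specific org.apache.spark.ml.linalg.SparseVector, so the map produces an RDD[(Long, SparseVector)]. Because RDD is invariant in its type parameter, that is not an RDD[(Long, Vector)], even though SparseVector is a subtype of Vector. Without the filter, the compiler can infer the element type of the map from the declared return type, which is why removing the filter also compiles. To keep both, ascribe the wider type to the value (write .toSparse: Vector inside the tuple) or give the type argument explicitly with map[(ItemId, Vector)].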
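The invariance mentioned in the compiler's note can be reproduced without Spark. The following is only a sketch: Vec, SparseVec and Box are hypothetical stand-ins for Vector, SparseVector and RDD, chosen so the snippet is self-contained. It shows that ascribing the wider element type inside the map lets the later filter keep the expected result type.

```scala
// Hypothetical stand-ins for Vector / SparseVector and for an invariant container like RDD.
sealed trait Vec
final case class SparseVec(size: Int) extends Vec
final case class Box[T](items: List[T]) {
  def map[U](f: T => U): Box[U] = Box(items.map(f))
  def filter(p: T => Boolean): Box[T] = Box(items.filter(p))
}

val ids = Box(List(1L, -2L, 3L))

// Without the `: Vec` ascription, map would infer U = (Long, SparseVec) and the
// expression would be a Box[(Long, SparseVec)], which is not a Box[(Long, Vec)].
val ok: Box[(Long, Vec)] =
  ids.map(id => (id, SparseVec(3): Vec)).filter(_._1 > 0)
```

Removing the ascription from this sketch should give a type mismatch of the same kind as the one in the question, because Box, like RDD, is declared without a variance annotation.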
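As for filtering out the non-numeric IDs, the commented-out util.Try(values(0).toLong).getOrElse(0L) combined with your filter is a reasonable way to do that. Note that a plain values(0).toLong throws a NumberFormatException on a line such as a_123,0,0,0,6,...,3.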
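An alternative to the sentinel id plus filter is to parse each line into an Option and let flatMap drop the failures. This is a sketch on a plain List with made-up sample lines; parseLine is a hypothetical helper name, and Array[Double] stands in for the Spark vector so that the snippet runs without Spark.

```scala
import scala.util.Try

// Sample lines in the question's format; the last one has a non-numeric id.
val lines = List("12345,0,0,1", "23456,0,1,0", "a_123,0,0,6")

// Parse one line; a NumberFormatException in any field turns into None.
def parseLine(line: String): Option[(Long, Array[Double])] = {
  val values = line.split(",")
  Try((values(0).toLong, values.drop(1).map(_.toDouble))).toOption
}

// flatMap drops the None results, so no sentinel id or later filter is needed.
val parsed: List[(Long, Array[Double])] = lines.flatMap(line => parseLine(line))
```

Unlike the getOrElse(0L) approach, this does not discard a legitimate line whose id happens to be 0.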
I am trying to build an Edge RDD for GraphX. I am reading a CSV file into a DataFrame and then trying to convert it to an Edge RDD:
val staticDataFrame = spark.
read.
option("header", true).
option("inferSchema", true).
csv("/projects/pdw/aiw_test/aiw/haris/Customers_DDSW-withDN$.csv")
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] =
staticDataFrame.select(
"dealer_customer_number",
"parent_dealer_cust_number",
"dealer_code"
).map{ (row: Array) =>
Edge((
row.getAs[Long]("dealer_customer_number"),
row.getAs[Long]("parent_dealer_cust_number"),
row("dealer_code")
))
}
But I am getting this error:
<console>:81: error: class Array takes type parameters
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] = staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").map((row: Array) => Edge((row.getAs[Long]("dealer_customer_number"), row.getAs[Long]("parent_dealer_cust_number"), row("dealer_code"))))
^
The result for
staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").take(1)
is
res3: Array[org.apache.spark.sql.Row] = Array([0000101,null,B110])
First, Array takes type parameters, so you would have to write Array[Something]. But this is probably not what you want anyway.
The dataframe is a Dataset[Row], not a Dataset[Array[_]], therefore you have to change
.map{ (row: Array) =>
to
.map{ (row: Row) =>
Or just omit the type annotation completely (it should be inferred):
.map{ row =>
My spark job raises a null pointer exception that I cannot trace down. When I print potential null variables, they're all populated on every worker. My data does not contain null values as the same job works within the spark shell. The execute function of the job is below, followed by the error message.
All helper methods not defined in the function are defined within the body of the spark job object, so I believe closure is not the problem.
override def execute(sc:SparkContext) = {
def construct_query(targetTypes:List[String]) = Map("query" ->
Map("nested" ->
Map("path"->"annotations.entities.items",
"query"-> Map("terms"->
Map("annotations.entities.items.type"-> targetTypes)))))
val sourceConfig = HashMap(
"es.nodes" -> params.targetClientHost
)
// Base elastic search RDD returning articles which match the above query on entity types
val rdd = EsSpark.esJsonRDD(sc,
params.targetIndex,
toJson(construct_query(params.entityTypes)),
sourceConfig
).sample(false,params.sampleRate)
// Mapping ES json into news article object, then extracting the entities list of
// well defined annotations
val objectsRDD = rdd.map(tuple => {
val maybeArticle =
try {
Some(JavaJsonUtils.fromJson(tuple._2, classOf[SearchableNewsArticle]))
}catch {
case e: Exception => None
}
(tuple._1,maybeArticle)
}
).filter(tuple => {tuple._2.isDefined && tuple._2.get.annotations.isDefined &&
tuple._2.get.annotations.get.entities.isDefined}).map(tuple => (tuple._1, tuple._2.get.annotations.get.entities.get))
// flat map the RDD of entities lists into a list of (entity text, (entity type, 1)) tuples
(line 79) val entityDataMap: RDD[(String, (String, Int))] = objectsRDD.flatMap(tuple => tuple._2.items.collect({
case item if (item.`type`.isDefined) && (item.text.isDefined) &&
(line 81)(params.entityTypes.contains(item.`type`.get)) => (cleanUpText(item.text.get), (item.`type`.get, 1))
}))
// bucketize the tuples RDD into entity text, List(entity_type, entity_count) to make count aggregation and file writeouts
// easier to follow
val finalResults: Array[(String, (String, Int))] = entityDataMap.reduceByKey((x, y) => (x._1, x._2+y._2)).collect()
val entityTypeMapping = Map(
"HealthCondition" -> "HEALTH_CONDITION",
"Drug" -> "DRUG",
"FieldTerminology" -> "FIELD_TERMINOLOGY"
)
for (finalTuple <- finalResults) {
val entityText = finalTuple._1
val entityType = finalTuple._2._1
if(entityTypeMapping.contains(entityType))
{
if(!Files.exists(Paths.get(entityTypeMapping.get(entityType).get+".txt"))){
val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get+".txt"),false)
printToFile(myFile) {p => p.println(entityTypeMapping.get(entityType))}
}
}
val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get+".txt"),true)
printToFile(myFile) {p => p.println(entityText)}
}
}
And the error message below:
java.lang.NullPointerException
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:81)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:79)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This question has been resolved. The params attribute was not serialized, so it was not available to the Spark workers. The solution is to create a Spark broadcast variable in the scope where the params attribute is needed.
There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform()?
case class AlertMsg(host:String, count:Int, sum:Double)
val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
if (rdd.count > 0) {
val t = sqc.jsonRDD(rdd)
t.registerTempTable("logstash")
val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
sqlreport.map(r => AlertMsg(r(0).toString,r(1).toString.toInt,r(2).toString.toDouble))
} else {
rdd
}
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems that only sqlreport.map(r => r.toString) compiles. Is that the correct usage?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
Here, both branches of the if must produce the same type, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
So remove the if (rdd.count > 0) optimization so that there is a single transformation path.
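The branch-type constraint can be illustrated without a streaming context. In this sketch plain Lists stand in for RDDs, AlertMsg mirrors the question's case class, and report is a hypothetical placeholder for the SQL step. It shows the variant where a branch is kept but both arms return the same element type, which is one assumed alternative to dropping the branch entirely.

```scala
// AlertMsg mirrors the question's case class; plain Lists stand in for RDDs.
case class AlertMsg(host: String, count: Int, sum: Double)

// Hypothetical stand-in for the SQL report step: one AlertMsg per input line.
def report(lines: List[String]): List[AlertMsg] =
  lines.map(l => AlertMsg(l, 1, 0.0))

// Both branches return List[AlertMsg], so the function has a single result type.
def transformFunc(lines: List[String]): List[AlertMsg] =
  if (lines.nonEmpty) report(lines) else List.empty[AlertMsg]
```

If the else branch returned the input lines instead, the two arms would only share a common supertype, which corresponds to the RDD[_ >: AlertMsg with String <: java.io.Serializable] seen in the error message.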