How to create a key-value RDD out of Kafka topic data - Scala

I am reading data from a Kafka topic in a Spark Streaming job. I need to create a key-value RDD out of the data.
val messages = KafkaUtils.createStream(streamingContext, "localhost:2181","abc",topics, StorageLevel.MEMORY_ONLY)
messages.print()
Create a key-value RDD out of CustomerId and Tokens:
val xactionByCustomer = messages.map(_._2).map {
  transaction =>
    val key = transaction.customerId
    var tokens = transaction.tokens
    (key, tokens)
}
Error:
[error] /home/ec2-user/alok/marseille/src/main/scala/com/jcalc/feed/MarkovPredictor.scala:115: value customerId is not a member of String
[error] val key = transaction.customerId
[error] ^
[error] /home/ec2-user/alok/marseille/src/main/scala/com/jcalc/feed/MarkovPredictor.scala:116: value tokens is not a member of String
[error] var tokens = transaction.tokens
[error] ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
Sample data:
(null,W3Q6TF3CCI,X84N230CIH,NNN)
(null,O8IV7KEXT0,G1D590G05V,NNS)
(null,LBQKYNE081,MYU0O7JC5H,NHN)
(null,SRB4P501SW,E0FTI4RN7X,LHL)
(null,HELRFMAXVS,W6F704TN21,LHN)
(null,FS4PLQLI63,TK1O9YHS15,NNN)
(null,KI70UDVJLC,4ANBDAW7SU,LNN)
(null,IP6IVPGCWQ,MD93GGGBKA,NNN)
(null,976N9RPXSP,JKU0SV7UMH,LNL)
(null,J4V3AB1YVT,J9WXC1BRAY,LHN)
I am only interested in the 2nd & 4th values for the pair RDD.
Any help?

Your data looks like a tuple (String, String, String, String), and since you're interested in the 2nd & 4th values, mapping:
val xactionByCustomer = messages.map(row => (row._2, row._4))
should be enough.
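If, on the other hand, the elements of messages are (key, value) pairs whose value is a comma-separated String (which is what the "customerId is not a member of String" compile error suggests), a split-based variant may be needed. A minimal sketch under that assumption, where the desired key and value are the 1st and 3rd fields of the CSV payload:
// Sketch, assuming each Kafka message value is a CSV string such as
// "W3Q6TF3CCI,X84N230CIH,NNN" (the null in the printed output being the Kafka key).
val xactionByCustomer = messages
  .map(_._2)                              // keep only the message value (a String)
  .map(_.split(","))                      // split the CSV payload
  .map(fields => (fields(0), fields(2)))  // (customerId, tokens)
xactionByCustomer.print()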


Scala Map Save to File - Hadoop

Scala is being used with Hadoop.
I have performed a map-reduce on a text file stored in HDFS (Hadoop). The file is large, so I have attempted to extract only the important rows: the five most-used words in the file.
I have therefore used the .take(n) method to extract the required elements. However, an error is raised when I attempt to save the result to a text file. I have tried saving the file a number of ways:
Method 1
val path = "Books/"+language+"/*"
val textFile = sc.textFile(path)
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey(_+_)
counts.collect()
val sortedCounts = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
sortedCounts.collect()
val check = sortedCounts.take(5)
check.foreach(d => Files.write(Paths.get(language), (d._1 + " " + d._2 + "\n").getBytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND))
Error
[info] Compiling 1 Scala source to /home/cloudera/Assessment 2/target/scala-2.10/classes ...
[error] /home/cloudera/Assessment 2/src/main/scala/task2.scala:27:22: not found: value Files
[error] check.foreach(d => Files.write(Paths.get(language), (d._1).getBytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND))
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
Method 2
val path = "Books/"+language+"/*"
val textFile = sc.textFile(path)
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey(_+_)
counts.collect()
val sortedCounts = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
sortedCounts.collect()
val check = sortedCounts.take(5)
check.saveAsTextFile(language)
Error
[info] Compiling 1 Scala source to /home/cloudera/Assessment 2/target/scala-2.10/classes ...
[error] /home/cloudera/Assessment 2/src/main/scala/task2.scala:27:9: value saveAsTextFile is not a member of Array[(Int, String)]
[error] check.saveAsTextFile(language)
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 11 s, completed Nov 26, 2019 6:20:11 AM
Important
The file saves correctly before the .take() method is used. When sortedCounts is used with saveAsTextFile(x), the entire map is saved, which, as mentioned, I do not want.
How does one save an Array[(Int, String)] to a text file on Hadoop via Scala?
You can't save to Hadoop after you take/collect an RDD, because at that point the result is a local Scala data structure, no longer managed by Spark.
You need to call saveAsTextFile on the RDD itself (e.g. sortedCounts.saveAsTextFile), and you should remove the counts.collect() call.
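If only the top five rows are wanted on HDFS, one option is to re-parallelize the taken array before saving; a minimal sketch, assuming the sc and language values from the question are in scope. (Method 1, by contrast, appears to fail only because of a missing import such as java.nio.file.{Files, Paths, StandardOpenOption}.)
// Sketch: turn the locally collected top-5 array back into a small RDD so it
// can be written to HDFS with saveAsTextFile.
val check = sortedCounts.take(5)
sc.parallelize(check.toSeq, numSlices = 1)
  .map { case (count, word) => s"$count $word" }
  .saveAsTextFile(language)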

Errors in converting JSON to Map in Scala

I am new to Scala and I am trying to write a function that takes in a JSON string, converts it to a Scala dictionary (Map), and checks for certain keys.
Below is part of a function that checks for a bunch of keys:
import play.api.libs.json.Json

def setParams(jsonString: Map[String, Any]) = {
  val paramsMap = Json.parse(jsonString)
  if (parmsMap.contains("key_1")) {
    println('key_1 present')
  }
On compiling it with sbt, I get the following errors
/Users/usr/scala_codes/src/main/scala/wrapped_code.scala:29:26: overloaded method value parse with alternatives:
[error] (input: Array[Byte])play.api.libs.json.JsValue <and>
[error] (input: java.io.InputStream)play.api.libs.json.JsValue <and>
[error] (input: String)play.api.libs.json.JsValue
[error] cannot be applied to (Map[String,Any])
[error] val paramsMap = Json.parse(jsonString)
[error] ^
[error] /Users/usr/scala_codes/src/main/scala/wrapped_code.scala:31:9: not found: value parmsMap
[error] if (parmsMap.contains("key_1")) {
Also, in the key-value pairs of the JSON, the keys are all strings, but the values could be integers, floats, or strings. Do I need to make changes to handle that?
It seems the input type of your setParams function should be String, not Map[String, Any],
and you have one typo: if (parmsMap.contains("key_1")) should be if (paramsMap.contains("key_1")).
Corrected function:
import play.api.libs.json.{JsValue, Json}

def setParams(jsonString: String): Unit = {
  val paramsMap = Json.parse(jsonString).as[Map[String, JsValue]]
  if (paramsMap.contains("key_1")) println("key_1 present")
}
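On the follow-up about mixed value types: once the values are parsed as JsValue, each one can be converted per key with as or asOpt. A small sketch (the key names and expected types below are hypothetical examples):
import play.api.libs.json.{JsValue, Json}

// Sketch: values stay as JsValue and are converted per key as needed.
val paramsMap = Json.parse("""{"key_1": "abc", "key_2": 42, "key_3": 3.14}""")
  .as[Map[String, JsValue]]

val asString: Option[String] = paramsMap.get("key_1").flatMap(_.asOpt[String])
val asInt: Option[Int]       = paramsMap.get("key_2").flatMap(_.asOpt[Int])
val asDouble: Option[Double] = paramsMap.get("key_3").flatMap(_.asOpt[Double])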

scala spark type mismatching

I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      val count = events.size.toDouble
      ((p_id, f_code), count)
    }
  //should be in form:
  //diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
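The mismatch arises because the function body ends with a val definition, whose type is Unit. A minimal sketch of one way to satisfy the declared return type, keeping the question's Diagnostic and FeatureTuple types, is to make the grouped RDD itself the last expression:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map { case ((patientID, code), events) =>
      ((patientID, code), events.size.toDouble)  // count of events per (patient, code)
    }
}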

Spark scala reading text file with map and filter

I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        values(0).toLong,
        //util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    }
    .filter(x => x._1 > 0)
}
However, this code cannot be compiled:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove the .toSparse or the .filter(x => x._1 > 0), this code compiles successfully.
Does anyone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
The code compiles successfully if you remove toSparse, because the type of your PairRDD is (ItemId, Vector).
The org.apache.spark.ml.linalg.Vector class/type represents the dense vector you are generating with Vectors.dense; when you call toSparse, it gets converted to an org.apache.spark.ml.linalg.SparseVector, which is not the type your PairRDD expects.
As for filtering out non-integer ids, I would say your method is a good way to do that.
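A minimal sketch of one way to keep toSparse while satisfying the declared element type, and to skip non-numeric ids with Try as the commented-out line hints (assuming ItemId is an alias for Long, as in the question): ascribe the sparse vector to the Vector supertype.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Try

def readDataset(path: String, sparkSession: SparkSession): RDD[(Long, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        Try(values(0).toLong).getOrElse(0L),  // non-numeric ids become 0L and are dropped below
        Vectors.dense(values.drop(1).map(_.toDouble)).toSparse: Vector  // upcast SparseVector to Vector
      )
    }
    .filter(_._1 > 0)
}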

use SQL in DStream.transform() over Spark Streaming?

There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform()?
case class AlertMsg(host: String, count: Int, sum: Double)

val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
  if (rdd.count > 0) {
    val t = sqc.jsonRDD(rdd)
    t.registerTempTable("logstash")
    val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
  } else {
    rdd
  }
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems the only correct usage is sqlreport.map(r => r.toString)?
DStream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, the if must produce the same type from both branches of the condition, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
In this case, remove the if rdd.count ... optimization so that you have a single transformation path.
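A minimal sketch of that single path, reusing the sqc SQLContext and AlertMsg case class from the question (this drops the empty-batch guard the original if provided, as suggested above):
// Sketch: one transformation path, so transform() can infer DStream[AlertMsg].
lines.transform { rdd =>
  val t = sqc.jsonRDD(rdd)
  t.registerTempTable("logstash")
  sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    .map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()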