Read a large zstd file with Spark (Scala)
I'm trying to load a zstd-compressed JSON file with an archive size of 16.4 GB using Spark 3.1.1 with Scala 2.12.10. See the sample file for reference.
For reference, my PC has 32 GB of RAM. The zstd decompressor I'm using comes from the Hadoop native libraries, loaded via
LD_LIBRARY_PATH=/opt/hadoop/lib/native
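As a sanity check (not part of my original setup), I believe native zstd support can be confirmed with hadoop checknative, or from Scala along these lines; both calls should exist in recent Hadoop versions, but treat this as an assumption:

// Rough check that the Hadoop native library (and its zstd support) is visible to the JVM.
// buildSupportsZstd() is a native method and may throw if libhadoop is not on the library path.
import org.apache.hadoop.util.NativeCodeLoader
println(s"native hadoop loaded: ${NativeCodeLoader.isNativeCodeLoaded}")
println(s"zstd supported: ${NativeCodeLoader.buildSupportsZstd()}")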
My configuration:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, SQLContext}

trait SparkProdContext {
  private val master = "local[*]"
  private val appName = "testing"

  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
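For context, the trait is mixed into a driver object roughly like this (the object name here is just illustrative, not from my actual project):

// Hypothetical driver object showing how the trait above is wired up.
object ZstdReadJob extends SparkProdContext {
  def run(): Unit = {
    // the reading code shown in the next section runs here, with ss in scope
  }
}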
My code:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection
import ss.implicits._
case class Comment(
  author: String,
  body: String,
  score: BigInt,
  subreddit_id: String,
  subreddit: String,
  id: String,
  parent_id: String,
  link_id: String,
  retrieved_on: BigInt,
  created_utc: BigInt,
  permalink: String
)
val schema = ScalaReflection.schemaFor[Comment].dataType.asInstanceOf[StructType]
val comments = ss.read.schema(schema).json("/home/user/Downloads/RC_2020-03.zst").as[Comment]
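As an aside, a quick way to double-check the reflected schema before kicking off the read (purely a debugging aid, not part of the failing code):

// Print the StructType derived from the Comment case class so the field names and types
// can be compared against the JSON keys in the dump.
println(schema.treeString)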
Upon running the code I get the following error:
22/01/06 23:59:44 INFO CodecPool: Got brand-new decompressor [.zst]
22/01/06 23:59:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: Frame requires too much memory for decoding
at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.base/java.io.InputStream.read(InputStream.java:205)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:152)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:192)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:69)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
Any ideas appreciated!
Related
Deserializing structured stream from kafka with Spark
I try to read a Kafka topic from Spark 3.0.2 with the Scala code below. Here are my imports:

import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode

Functions to retrieve the schema from a Schema Registry service and to convert it into Spark format:

var schemaRegistryClient: SchemaRegistryClient = _
var kafkaAvroDeserializer: AvroDeserializer = _

def lookupTopicSchema(topic: String): String = {
  schemaRegistryClient.getLatestSchemaMetadata(topic).getSchema
}

def avroSchemaToSparkSchema(avroSchema: String): SchemaConverters.SchemaType = {
  SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}

Definition of the class used to deserialize data from the Kafka topic:

object DeserializerWrapper {
  val deserializer: AvroDeserializer = kafkaAvroDeserializer
}

class AvroDeserializer extends AbstractKafkaAvroDeserializer {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }

  override def deserialize(bytes: Array[Byte]): String = {
    val value = super.deserialize(bytes)
    Option(value) match {
      case Some(Array()) | None => null
      case Some(a) =>
        val genericRecord = a.asInstanceOf[GenericRecord]
        genericRecord.toString
    }
  }
}

The main function to execute the job:

def main(): Unit = {
  val spark = SparkSession.builder.getOrCreate

  val bootstrapServers = "kafka1:9192"
  val topic = "sample_topic"
  val shemaRegistryName = "avro_schema_A"
  val schemaRegistryUrl = "http://myHost:8001"

  schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
  spark.udf.register("deserialize", (bytes: Array[Byte]) => DeserializerWrapper.deserializer.deserialize(bytes))

  val kafkaDataFrame = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topic)
    .option("kafka.group.id", "group_id")
    .option("startingOffsets", "latest")
    .load()
  )

  val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

  import org.apache.spark.sql.functions._

  val dfValueSchema = {
    val rawSchema = lookupTopicSchema(shemaRegistryName)
    avroSchemaToSparkSchema(rawSchema)
  }

  val formattedDataFrame = (valueDataFrame
    .select(from_json(col("message"), dfValueSchema.dataType).alias("parsed_value"))
    .select("parsed_value.*")
  )

  (formattedDataFrame
    .writeStream
    .format("console")
    .outputMode(OutputMode.Append())
    .option("truncate", false)
    .start()
    .awaitTermination()
  )
}

When I execute the main function, the job crashes with the error:

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+-----+-----+-----+-----+-----+-----+
|col-A|col-B|col-C|col-D|col-E|col-F|col-G|
+-----+-----+-----+-----+-----+-----+-----+
+-----+-----+-----+-----+-----+-----+-----+

21/07/23 15:41:31 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
21/07/23 15:41:31 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite#227770fb is aborting.
21/07/23 15:41:31 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite#227770fb aborted.
21/07/23 15:41:31 ERROR MicroBatchExecution: Query [id = 5c76f495-b2c8-46e7-90b0-dd715c8466f3, runId = c38ccde6-3c98-488c-bfd9-a71a17296503] terminated with error org.apache.spark.SparkException: Writing job aborted. at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:413) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:361) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:322) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:329) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:581) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:581) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:223) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191) at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13, 172.20.0.4, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$1119/1813496428: (binary) => string) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.subExpr_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at $line55.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$res3$1(<console>:41) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127) ... 
17 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:382) ... 37 more Caused by: org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$1119/1813496428: (binary) => string) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.subExpr_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at $line55.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$res3$1(<console>:41) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157) at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127) ... 17 more It seems that the udf throws a NullPointerException, but I explicity implemented a pattern matching to avoid it. Can anybody please help on this ? Thanks a lot.
I finally got the deserialization working using the ABRiS library. It makes it really easy to connect to the Schema Registry and do Ser/Deser operations on the Kafka stream. @OneCricketeer, I really appreciate your help on this. For information, the connection to the Schema Registry can be made simply as:

import za.co.absa.abris.config.AbrisConfig

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaById(123)
  .usingSchemaRegistry("http://schema-registry.com:8081")

Then the conversion is done with a single line of code:

import za.co.absa.abris.avro.functions.from_avro

val outputDf = inputStreamDf.select(from_avro(col("value"), abrisConfig) as "data")
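For completeness, a rough sketch of how this plugs into the streaming read (only a sketch; the option values are the ones from my question, and "value" is the raw Avro payload column produced by the Kafka source):

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro

// Read the topic as a stream and decode the Avro payload with the ABRiS config from above.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka1:9192")
  .option("subscribe", "sample_topic")
  .load()

val parsedDf = kafkaDf.select(from_avro(col("value"), abrisConfig) as "data")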
java.lang.NoSuchMethodError: com.mongodb.Mongo.<init>(Lcom/mongodb/MongoClientURI
I am very new to Scala, Spark and Mongo. I am trying to load some data into MongoDB with Spark using the following code:

import com.mongodb.spark.config.WriteConfig
import com.mongodb.spark.toDocumentRDDFunctions
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.Document

object MongoTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val documents = sc.parallelize((1 to 10).map(i => Document.parse(s"{test: $i}")))
    documents.saveToMongoDB(WriteConfig(Map("spark.mongodb.output.uri" -> "mongodb://127.0.0.1:27017/sampledb.testMongo")))
  }
}

My spark-submit fails with the following error:

java.lang.NoSuchMethodError: com.mongodb.Mongo.<init>(Lcom/mongodb/MongoClientURI;)V
at com.mongodb.MongoClient.<init>(MongoClient.java:328)
at com.mongodb.spark.connection.DefaultMongoClientFactory.create(DefaultMongoClientFactory.scala:43)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:55)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:239)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:152)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:184)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:116)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:115)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I use Spark version 2.4.0 and Scala version 2.11.12. Any idea where I am going wrong?
How to add current_timestamp() column to a streaming dataframe?
I'm using Spark 2.4.3 and Scala. I'm fetching messages from a streaming kafka source of the following structure: {"message": "Jan 7 17:53:48 PA-850.abc.com 1,2020/01/07 17:53:41,001801012404,TRAFFIC,drop,2304,2020/01/07 17:53:41,10.7.26.51,10.8.3.11,0.0.0.0,0.0.0.0,interzone-default,,,not-applicable,vsys1,SERVER-VLAN,VPN,ethernet1/6.18,,test-1,2020/01/07 17:53:41,0,1,45194,514,0,0,0x0,udp,deny,588,588,0,1,2020/01/07 17:53:45,0,any,0,35067255521,0x8000000000000000,10.0.0.0-10.255.255.255,10.0.0.0-10.255.255.255,0,1,0,policy-deny,0,0,0,0,,PA-850,from-policy,,,0,,0,,N/A,0,0,0,0,b804eab2-f240-467a-be97-6f8c382afd4c,0","source_host": "10.7.26.51"} My goal is to add a new timestamp column to each row with the current timestamp in my streaming data. I have to insert all these rows into a cassandra table. package devices import configurations._ import org.apache.spark.sql.Row import org.apache.spark.sql.functions.{col, from_json, lower, split} import org.apache.spark.sql.cassandra._ import scala.collection.mutable.{ListBuffer, Map} import scala.io.Source import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{StringType,TimestampType} import org.apache.spark.sql.functions.to_timestamp import org.apache.spark.sql.functions.unix_timestamp object PA { def main(args: Array[String]): Unit = { val spark = SparkBuilder.spark val df = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", configHelper.kafka_host) .option("subscribe", configHelper.log_topic) .option("startingOffsets", "earliest") .option("multiLine", true) .option("includeTimestamp",true) .load() df.printSchema() def getDeviceNameOSMapping():Map[String,String]= { val osmapping=scala.collection.mutable.Map[String, String]() val bufferedSource = Source.fromFile(configHelper.os_mapping_file) for (line <- bufferedSource.getLines) { val cols = line.split(",").map(_.trim) osmapping+=(cols(0).toLowerCase()->cols(1).toLowerCase()) } bufferedSource.close return osmapping } val deviceOSMapping = spark.sparkContext.broadcast(getDeviceNameOSMapping()) val debug = true val msg = df.selectExpr("CAST(value AS STRING)") .withColumn("value", lower(col("value"))) .select(from_json(col("value"), cefFormat.cef_json).as("data")) .select("data.*") import spark.sqlContext.implicits._ val newDF = msg.withColumn("created", lit(current_timestamp())) msg.writeStream .foreachBatch { (batchDF, _) => val syslogDF=batchDF.filter(!$"message".contains("invalid syslog message:")) .filter(!$"message".contains("fluentd worker is now stopping worker=0")) .filter(!$"message".contains("fluentd worker is now running worker=0")) val syslogRDD=syslogDF.rdd.map(r=>{ r.getString(0) }).map(x=>{ parseSysLog(x) }) .filter(x=>deviceOSMapping.value.contains(x._1)) try { val threat_9_0_DF = spark.sqlContext.createDataFrame(syslogRDD.filter(x => deviceOSMapping.value(x._1).equals("9.0") & x._2.equals("threat")) .map(x => Row.fromSeq(x._3)),formatPA.threat_9_0) if(debug) threat_9_0_DF.show(true) threat_9_0_DF.write .cassandraFormat(configHelper.cassandra_table_syslog, configHelper.cassandra_keyspace) .mode("append") .save println("threat_9_0_DF saved") } catch { case e:Exception=>{ println(e.getMessage) } } try { val traffic_9_0_DF = spark.sqlContext.createDataFrame(syslogRDD.filter(x => deviceOSMapping.value(x._1).equals("9.0") & x._2.equals("traffic")) .map(x => Row.fromSeq(x._3)),formatPA.traffic_9_0) if(debug) traffic_9_0_DF.show(true) traffic_9_0_DF.write .cassandraFormat(configHelper.cassandra_table_syslog, configHelper.cassandra_keyspace) 
.mode("append") .save println("traffic_9_0_DF saved") } catch { case e:Exception=>{ println(e.getMessage) } } }.start().awaitTermination() def parseSysLog(msg: String): (String,String,List[String]) = { //println("PRINTING MESSAGES") //println(msg) val splitmsg=msg.split(",") val traffic_type=splitmsg(3) val temp=splitmsg(0).split(" ") val date_time=temp.dropRight(2).mkString(" ") val domain_name=temp(temp.size-2) val future_use1=temp(temp.size-1) val device_name=domain_name.split("\\.")(0) var result=new ListBuffer[String]() //result+=temp2 result+=date_time result+=domain_name result+=future_use1 result=result++splitmsg.slice(1,splitmsg.size).toList (device_name,traffic_type,result.toList) } } } package configurations import org.apache.spark.sql.types.{StringType, StructType, TimestampType, DateType} object formatPA { val threat_9_0=new StructType() .add("date_time",StringType) .add("log_source",StringType) .add("future_use1",StringType) .add("received_time",StringType) .add("serial_number",StringType) .add("traffic_type",StringType) .add("threat_content_type",StringType) .add("future_use2",StringType) .add("generated_time",StringType) .add("src_ip",StringType) .add("dst_ip",StringType) .add("src_nat",StringType) .add("dst_nat",StringType) .add("rule_name",StringType) .add("src_user",StringType) .add("dst_user",StringType) .add("app",StringType) .add("vsys",StringType) .add("src_zone",StringType) .add("dst_zone",StringType) .add("igr_int",StringType) .add("egr_int",StringType) .add("log_fw_profile",StringType) .add("future_use3",StringType) .add("session_id",StringType) .add("repeat_count",StringType) .add("src_port",StringType) .add("dst_port",StringType) .add("src_nat_port",StringType) .add("dst_nat_port",StringType) .add("flags",StringType) .add("protocol",StringType) .add("action",StringType) .add("miscellaneous",StringType) .add("threat_id",StringType) .add("category",StringType) .add("severity",StringType) .add("direction",StringType) .add("seq_num",StringType) .add("act_flag",StringType) .add("src_geo_location",StringType) .add("dst_geo_location",StringType) .add("future_use4",StringType) .add("content_type",StringType) .add("pcap_id",StringType) .add("file_digest",StringType) .add("apt_cloud",StringType) .add("url_index",StringType) .add("user_agent",StringType) .add("file_type",StringType) .add("x_forwarded_for",StringType) .add("referer",StringType) .add("sender",StringType) .add("subject",StringType) .add("recipient",StringType) .add("report_id",StringType) .add("dghl1",StringType) .add("dghl2",StringType) .add("dghl3",StringType) .add("dghl4",StringType) .add("vsys_name",StringType) .add("device_name",StringType) .add("future_use5",StringType) .add("src_vm_uuid",StringType) .add("dst_vm_uuid",StringType) .add("http_method",StringType) .add("tunnel_id_imsi",StringType) .add("monitor_tag_imei",StringType) .add("parent_session_id",StringType) .add("parent_start_time",StringType) .add("tunnel_type",StringType) .add("threat_category",StringType) .add("content_version",StringType) .add("future_use6",StringType) .add("sctp_association_id",StringType) .add("payload_protocol_id",StringType) .add("http_headers",StringType) .add("url_category_list",StringType) .add("uuid_for_rule",StringType) .add("http_2_connection",StringType) .add("created",TimestampType) val traffic_9_0=new StructType() .add("date_time",StringType) .add("log_source",StringType) .add("future_use1",StringType) .add("received_time",StringType) .add("serial_number",StringType) .add("traffic_type",StringType) 
.add("threat_content_type",StringType) .add("future_use2",StringType) .add("generated_time",StringType) .add("src_ip",StringType) .add("dst_ip",StringType) .add("src_nat",StringType) .add("dst_nat",StringType) .add("rule_name",StringType) .add("src_user",StringType) .add("dst_user",StringType) .add("app",StringType) .add("vsys",StringType) .add("src_zone",StringType) .add("dst_zone",StringType) .add("igr_int",StringType) .add("egr_int",StringType) .add("log_fw_profile",StringType) .add("future_use3",StringType) .add("session_id",StringType) .add("repeat_count",StringType) .add("src_port",StringType) .add("dst_port",StringType) .add("src_nat_port",StringType) .add("dst_nat_port",StringType) .add("flags",StringType) .add("protocol",StringType) .add("action",StringType) .add("bytes",StringType) .add("bytes_sent",StringType) .add("bytes_received",StringType) .add("packets",StringType) .add("start_time",StringType) .add("end_time",StringType) .add("category",StringType) .add("future_use4",StringType) .add("seq_num",StringType) .add("act_flag",StringType) .add("src_geo_location",StringType) .add("dst_geo_location",StringType) .add("future_use5",StringType) .add("packet_sent",StringType) .add("packet_received",StringType) .add("session_end_reason",StringType) .add("dghl1",StringType) .add("dghl2",StringType) .add("dghl3",StringType) .add("dghl4",StringType) .add("vsys_name",StringType) .add("device_name",StringType) .add("action_source",StringType) .add("src_vm_uuid",StringType) .add("dst_vm_uuid",StringType) .add("tunnel_id_imsi",StringType) .add("monitor_tag_imei",StringType) .add("parent_session_id",StringType) .add("parent_start_time",StringType) .add("tunnel_type",StringType) .add("sctp_association_id",StringType) .add("sctp_chunks",StringType) .add("sctp_chunks_sent",StringType) .add("sctp_chunks_received",StringType) .add("uuid_for_rule",StringType) .add("http_2_connection",StringType) .add("created",TimestampType) } The output for the above code is as follows: +---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+ 
|date_time|log_source|future_use1|received_time|serial_number|traffic_type|threat_content_type|future_use2|generated_time|src_ip|dst_ip|src_nat|dst_nat|rule_name|src_user|dst_user|app|vsys|src_zone|dst_zone|igr_int|egr_int|log_fw_profile|future_use3|session_id|repeat_count|src_port|dst_port|src_nat_port|dst_nat_port|flags|protocol|action|miscellaneous|threat_id|category|severity|direction|seq_num|act_flag|src_geo_location|dst_geo_location|future_use4|content_type|pcap_id|file_digest|apt_cloud|url_index|user_agent|file_type|x_forwarded_for|referer|sender|subject|recipient|report_id|dghl1|dghl2|dghl3|dghl4|vsys_name|device_name|future_use5|src_vm_uuid|dst_vm_uuid|http_method|tunnel_id_imsi|monitor_tag_imei|parent_session_id|parent_start_time|tunnel_type|threat_category|content_version|future_use6|sctp_association_id|payload_protocol_id|http_headers|url_category_list|uuid_for_rule|http_2_connection|created| +---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+ +---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+ threat_9_0_DF saved 20/01/08 14:59:49 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 69 if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 69, created), TimestampType), true, false) AS created#773 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:292) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593) at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ArrayIndexOutOfBoundsException: 69 at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174) at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191) at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:166) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_34$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289) ... 25 more
It looks like it does not matter that the messages are in JSON format, does it? Let's then use a sample dataset of any schema and add a timestamp column.

val messages = spark.range(3)

scala> messages.printSchema
root
 |-- id: long (nullable = false)

val withTs = messages.withColumn("timestamp", current_timestamp())

scala> withTs.printSchema
root
 |-- id: long (nullable = false)
 |-- timestamp: timestamp (nullable = false)

That gives you a dataset with a timestamp column. The following line in your code should work, too (you don't need lit):

val xDF = thDF.withColumn("created", lit(current_timestamp())) // This does not get cast to TimestampType

What do you mean by "This does not get cast to TimestampType"? How do you check it out? Are you perhaps confusing TimestampType in Spark and Cassandra? The Spark connector for Cassandra should handle it. Let's give that a try:

val litTs = spark.range(3).withColumn("ts", lit(current_timestamp))

scala> litTs.printSchema
root
 |-- id: long (nullable = false)
 |-- ts: timestamp (nullable = false)

import org.apache.spark.sql.types._
val dataType = litTs.schema("ts").dataType
assert(dataType.isInstanceOf[TimestampType])

scala> println(dataType)
TimestampType
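In other words, in your code the extra wrapping can simply be dropped (just a sketch using the names from your code):

// current_timestamp() already returns a Column, so lit() is unnecessary here
val newDF = msg.withColumn("created", current_timestamp())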
NullPointerException in org.apache.spark.ml.feature.Tokenizer
I want to separately use TF-IDF features on the title and description fields, respectively and then combine those features in the VectorAssembler so that the final classifier can operate on those features. It works fine if I use a single serial flow that is simply titleTokenizer -> titleHashingTF -> VectorAssembler But I need both like so: titleTokenizer -> titleHashingTF -> VectorAssembler descriptionTokenizer -> descriptionHashingTF Code here: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.sql.SparkSession import org.apache.spark.ml.{Pipeline, PipelineModel} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.feature.{HashingTF, Tokenizer, StringIndexer, VectorAssembler} import org.apache.spark.ml.linalg.Vector import org.apache.spark.sql.Row import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator import org.apache.log4j.{Level, Logger} object SimplePipeline { def main(args: Array[String]) { // setup boilerplate val conf = new SparkConf() .setAppName("Pipeline example") val sc = new SparkContext(conf) val spark = SparkSession .builder() .appName("Session for SimplePipeline") .getOrCreate() val all_df = spark.read.json("file:///Users/me/data.json") val numLabels = all_df.count() // split into training and testing val Array(training, testing) = all_df.randomSplit(Array(0.75, 0.25)) val nTraining = training.count(); val nTesting = testing.count(); println(s"Loaded $nTraining training labels..."); println(s"Loaded $nTesting testing labels..."); // convert string labels to integers val indexer = new StringIndexer() .setInputCol("rating") .setOutputCol("label") // tokenize our string inputs val titleTokenizer = new Tokenizer() .setInputCol("title") .setOutputCol("title_words") val descriptionTokenizer = new Tokenizer() .setInputCol("description") .setOutputCol("description_words") // count term frequencies val titleHashingTF = new HashingTF() .setNumFeatures(1000) .setInputCol(titleTokenizer.getOutputCol) .setOutputCol("title_tfs") val descriptionHashingTF = new HashingTF() .setNumFeatures(1000) .setInputCol(descriptionTokenizer.getOutputCol) .setOutputCol("description_tfs") // combine features together val assembler = new VectorAssembler() .setInputCols(Array(titleHashingTF.getOutputCol, descriptionHashingTF.getOutputCol)) .setOutputCol("features") // set params for our model val lr = new LogisticRegression() .setMaxIter(10) .setRegParam(0.01) // pipeline that combines all stages val stages = Array(indexer, titleTokenizer, titleHashingTF, descriptionTokenizer, descriptionHashingTF, assembler, lr); val pipeline = new Pipeline().setStages(stages); // Fit the pipeline to training documents. val model = pipeline.fit(training) // Make predictions. val predictions = model.transform(testing) // Select example rows to display. predictions.select("label", "rawPrediction", "prediction").show() sc.stop() } } and my data file is simply a line-break separated file of JSON objects: {"title" : "xxxxxx", "description" : "yyyyy" .... } {"title" : "zzzzzz", "description" : "zxzxzx" .... 
} The error I get is very long a difficult to understand, but the important part (I think) is a java.lang.NullPointerException: ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 12) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39) at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39) ... 23 more How should I be properly crafting my Pipeline to do this? (Also I'm completely new to Scala)
The problem here is that you don't validate the data and some of the values are NULL. It is pretty easy to reproduce this:

val df = Seq((1, Some("abcd bcde cdef")), (2, None)).toDF("id", "description")
val tokenizer = new Tokenizer().setInputCol("description")

tokenizer.transform(df).foreach(_ => ())

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
  ...

You can, for example, drop the rows with nulls:

tokenizer.transform(df.na.drop(Array("description")))

or replace these with empty strings:

tokenizer.transform(df.na.fill(Map("description" -> "")))

whichever makes more sense in your application.
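Applied to the pipeline in the question, that could look like this (a sketch; column names are taken from the question):

// Replace missing title/description values with empty strings before fitting the pipeline
val cleanedTraining = training.na.fill(Map("title" -> "", "description" -> ""))
val model = pipeline.fit(cleanedTraining)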
structured streaming with Spark 2.0.2, Kafka source and scalapb
I am using Structured Streaming (Spark 2.0.2) to consume Kafka messages. The messages are in protobuf, handled with scalapb. I am getting the following error. Please help.

Exception in thread "main" scala.ScalaReflectionException: is not a term
at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:811)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:800)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:582)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:460)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
at PersonConsumer$.main(PersonConsumer.scala:33)
at PersonConsumer.main(PersonConsumer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

The following is my code:

object PersonConsumer {
  import org.apache.spark.rdd.RDD
  import com.trueaccord.scalapb.spark._
  import org.apache.spark.sql.{SQLContext, SparkSession}
  import com.example.protos.demo._

  def main(args : Array[String]) {

    def parseLine(s: String): Person =
      Person.parseFrom(org.apache.commons.codec.binary.Base64.decodeBase64(s))

    val spark = SparkSession.builder
      .master("local")
      .appName("spark session example")
      .getOrCreate()

    import spark.implicits._

    val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","person").load()

    val ds2 = ds1.selectExpr("CAST(value AS STRING)").as[String]

    val ds3 = ds2.map(str => parseLine(str)).createOrReplaceTempView("persons")

    val ds4 = spark.sqlContext.sql("select name from persons")

    val query = ds4.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
The line with val ds3 should be:

val ds3 = ds2.map(str => parseLine(str))
sqlContext.protoToDataFrame(ds3).registerTempTable("persons")

The RDD needs to be converted to a DataFrame before it is saved as a temp table.
In the Person class, gender is an enum, and this was the cause of the problem. After removing this field, it works fine. The following is the answer I got from Shixiong (Ryan) of Databricks:
The problem is "optional Gender gender = 3;". The generated class "Gender" is a trait, and Spark cannot know how to create a trait, so it's not supported. You can define your own class which is supported by the SQL Encoder, and convert this generated class to the new class in parseLine.
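A rough sketch of that idea (the accessors on the generated Person class are assumptions here, since the .proto is not shown):

// Plain case class that Spark's SQL encoders understand; the enum is flattened to a String.
case class PersonRecord(name: String, gender: String)

def parseLine(s: String): PersonRecord = {
  val p = Person.parseFrom(org.apache.commons.codec.binary.Base64.decodeBase64(s))
  // Field accessors below are illustrative; adjust them to the actual generated Person API.
  PersonRecord(p.getName, p.gender.map(_.name).getOrElse(""))
}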