Read a large zstd file with Spark (Scala)

I'm trying to load a zstd-compressed JSON file with an archive size of 16.4 GB using Spark 3.1.1 and Scala 2.12.10. See Sample file for reference.
For reference, my PC has 32 GB of RAM. The zstd decompressor I'm using comes from the Hadoop native libraries, loaded via
LD_LIBRARY_PATH=/opt/hadoop/lib/native
My configuration:
trait SparkProdContext {
  private val master = "local[*]"
  private val appName = "testing"
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")
  val ss: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
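For context, the snippet under "My code:" below assumes this trait is mixed into the job's entry point so that ss and its implicits are in scope; a minimal sketch (the object and method names are hypothetical):
// Hypothetical wrapper object; the code shown under "My code:" runs inside something like this.
object ZstdJsonJob extends SparkProdContext {
  def run(): Unit = {
    // reading code from below goes here
  }
}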
My code:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection
import ss.implicits._
case class Comment(
  author: String,
  body: String,
  score: BigInt,
  subreddit_id: String,
  subreddit: String,
  id: String,
  parent_id: String,
  link_id: String,
  retrieved_on: BigInt,
  created_utc: BigInt,
  permalink: String
)
val schema = ScalaReflection.schemaFor[Comment].dataType.asInstanceOf[StructType]
val comments = ss.read.schema(schema).json("/home/user/Downloads/RC_2020-03.zst").as[Comment]
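Any action that scans the file reproduces the failure; for example, a simple count (a sketch, not in the original post) is enough to trigger it:
// Hypothetical trigger: forces a full scan of the compressed file.
val n = comments.count()
println(s"Loaded $n comments")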
Upon running the code I'm getting the following error.
22/01/06 23:59:44 INFO CodecPool: Got brand-new decompressor [.zst]
22/01/06 23:59:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: Frame requires too much memory for decoding
at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.base/java.io.InputStream.read(InputStream.java:205)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:152)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:192)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:69)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
Any ideas appreciated!

Related

Deserializing structured stream from kafka with Spark

I'm trying to read a Kafka topic from Spark 3.0.2 with the Scala code below.
Here are my imports:
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode
Functions to retrieve the schema from a Schema Registry service and convert it into a Spark format:
var schemaRegistryClient: SchemaRegistryClient = _
var kafkaAvroDeserializer: AvroDeserializer = _
def lookupTopicSchema(topic: String): String = {
  schemaRegistryClient.getLatestSchemaMetadata(topic).getSchema
}
def avroSchemaToSparkSchema(avroSchema: String): SchemaConverters.SchemaType = {
  SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}
Definition of the class to deserialize data from the Kafka topic:
object DeserializerWrapper {
  val deserializer: AvroDeserializer = kafkaAvroDeserializer
}

class AvroDeserializer extends AbstractKafkaAvroDeserializer {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }

  override def deserialize(bytes: Array[Byte]): String = {
    val value = super.deserialize(bytes)
    Option(value) match {
      case Some(Array()) | None => null
      case Some(a) =>
        val genericRecord = a.asInstanceOf[GenericRecord]
        genericRecord.toString
    }
  }
}
The main function to execute the job:
def main(): Unit = {
  val spark = SparkSession.builder.getOrCreate
  val bootstrapServers = "kafka1:9192"
  val topic = "sample_topic"
  val shemaRegistryName = "avro_schema_A"
  val schemaRegistryUrl = "http://myHost:8001"

  schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
  spark.udf.register("deserialize", (bytes: Array[Byte]) => DeserializerWrapper.deserializer.deserialize(bytes))

  val kafkaDataFrame = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topic)
    .option("kafka.group.id", "group_id")
    .option("startingOffsets", "latest")
    .load()
  )

  val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

  import org.apache.spark.sql.functions._

  val dfValueSchema = {
    val rawSchema = lookupTopicSchema(shemaRegistryName)
    avroSchemaToSparkSchema(rawSchema)
  }

  val formattedDataFrame = (valueDataFrame
    .select(from_json(col("message"), dfValueSchema.dataType).alias("parsed_value"))
    .select("parsed_value.*")
  )

  (formattedDataFrame
    .writeStream
    .format("console")
    .outputMode(OutputMode.Append())
    .option("truncate", false)
    .start()
    .awaitTermination()
  )
}
When I execute the main function, the job crashes with the following error:
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+-----+-----+-----+-----+-----+-----+
|col-A|col-B|col-C|col-D|col-E|col-F|col-G|
+-----+-----+-----+-----+-----+-----+-----+
+-----+-----+-----+-----+-----+-----+-----+
21/07/23 15:41:31 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
21/07/23 15:41:31 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite#227770fb is aborting.
21/07/23 15:41:31 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite#227770fb aborted.
21/07/23 15:41:31 ERROR MicroBatchExecution: Query [id = 5c76f495-b2c8-46e7-90b0-dd715c8466f3, runId = c38ccde6-3c98-488c-bfd9-a71a17296503] terminated with error
org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:413)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:361)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:322)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:329)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:581)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:581)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:223)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13, 172.20.0.4, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$1119/1813496428: (binary) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.subExpr_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:441)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line55.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$res3$1(<console>:41)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 17 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:382)
... 37 more
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$1119/1813496428: (binary) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.subExpr_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:441)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line55.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$res3$1(<console>:41)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 17 more
It seems that the UDF throws a NullPointerException, but I explicitly implemented pattern matching to avoid it.
Can anybody please help with this?
Thanks a lot.
I finally got the deserialization working using the ABRiS library. It makes it very easy to connect to the Schema Registry and do ser/deser operations on the Kafka stream.
@OneCricketeer I really appreciate your help on this.
For reference, the connection to the Schema Registry can be made simply as:
import za.co.absa.abris.config.AbrisConfig
val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaById(123)
  .usingSchemaRegistry("http://schema-registry.com:8081")
Then, the conversion is done with a single line of code:
import za.co.absa.abris.avro.functions.from_avro
val outputDf = inputStreamDf.select(from_avro(col("value"), abrisConfig) as "data")
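Putting it together with the streaming read from the question, the end-to-end flow looks roughly like this (a sketch only; it reuses abrisConfig from above plus the bootstrapServers and topic values from the question's main, and the exact schema-lookup strategy may differ in your setup):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.OutputMode
import za.co.absa.abris.avro.functions.from_avro

val kafkaDataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .load()

// from_avro performs the Confluent-Avro deserialization that the UDF used to do
val outputDf = kafkaDataFrame.select(from_avro(col("value"), abrisConfig) as "data")

outputDf.select("data.*")
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .option("truncate", false)
  .start()
  .awaitTermination()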

java.lang.NoSuchMethodError: com.mongodb.Mongo.<init>(Lcom/mongodb/MongoClientURI

I am very new to Scala, Spark, and Mongo. I am trying to load some data into MongoDB with Spark using the following code.
import com.mongodb.spark.config.WriteConfig
import com.mongodb.spark.toDocumentRDDFunctions
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.Document
object MongoTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val documents = sc.parallelize((1 to 10).map(i => Document.parse(s"{test: $i}")))
    documents.saveToMongoDB(WriteConfig(Map("spark.mongodb.output.uri" -> "mongodb://127.0.0.1:27017/sampledb.testMongo")))
  }
}
My spark-submit fails with the following error:
java.lang.NoSuchMethodError: com.mongodb.Mongo.<init>(Lcom/mongodb/MongoClientURI;)V
at com.mongodb.MongoClient.<init>(MongoClient.java:328)
at com.mongodb.spark.connection.DefaultMongoClientFactory.create(DefaultMongoClientFactory.scala:43)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:55)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:239)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:152)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:184)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:116)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:115)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I use Spark version 2.4.0 and Scala version 2.11.12. Any idea where I am going wrong?
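A NoSuchMethodError at runtime generally means the class resolved on the classpath is binary-incompatible with the one the connector was compiled against, i.e. a mismatched mongo-java-driver version. A purely diagnostic sketch (not part of the original code) to see which jars the classes are actually loaded from:
// Hypothetical check: print the jar locations of the driver and connector classes
// to spot a conflicting mongo-java-driver on the classpath.
println(classOf[com.mongodb.MongoClient].getProtectionDomain.getCodeSource.getLocation)
println(classOf[com.mongodb.spark.MongoConnector].getProtectionDomain.getCodeSource.getLocation)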

How to add current_timestamp() column to a streaming dataframe?

I'm using Spark 2.4.3 and Scala.
I'm fetching messages from a streaming Kafka source with the following structure:
{"message": "Jan 7 17:53:48 PA-850.abc.com 1,2020/01/07 17:53:41,001801012404,TRAFFIC,drop,2304,2020/01/07 17:53:41,10.7.26.51,10.8.3.11,0.0.0.0,0.0.0.0,interzone-default,,,not-applicable,vsys1,SERVER-VLAN,VPN,ethernet1/6.18,,test-1,2020/01/07 17:53:41,0,1,45194,514,0,0,0x0,udp,deny,588,588,0,1,2020/01/07 17:53:45,0,any,0,35067255521,0x8000000000000000,10.0.0.0-10.255.255.255,10.0.0.0-10.255.255.255,0,1,0,policy-deny,0,0,0,0,,PA-850,from-policy,,,0,,0,,N/A,0,0,0,0,b804eab2-f240-467a-be97-6f8c382afd4c,0","source_host": "10.7.26.51"}
My goal is to add a new timestamp column to each row with the current timestamp in my streaming data. I have to insert all these rows into a Cassandra table.
package devices
import configurations._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, from_json, lower, split}
import org.apache.spark.sql.cassandra._
import scala.collection.mutable.{ListBuffer, Map}
import scala.io.Source
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType,TimestampType}
import org.apache.spark.sql.functions.to_timestamp
import org.apache.spark.sql.functions.unix_timestamp
object PA {
  def main(args: Array[String]): Unit = {
    val spark = SparkBuilder.spark
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", configHelper.kafka_host)
      .option("subscribe", configHelper.log_topic)
      .option("startingOffsets", "earliest")
      .option("multiLine", true)
      .option("includeTimestamp", true)
      .load()
    df.printSchema()

    def getDeviceNameOSMapping(): Map[String, String] = {
      val osmapping = scala.collection.mutable.Map[String, String]()
      val bufferedSource = Source.fromFile(configHelper.os_mapping_file)
      for (line <- bufferedSource.getLines) {
        val cols = line.split(",").map(_.trim)
        osmapping += (cols(0).toLowerCase() -> cols(1).toLowerCase())
      }
      bufferedSource.close
      return osmapping
    }

    val deviceOSMapping = spark.sparkContext.broadcast(getDeviceNameOSMapping())
    val debug = true

    val msg = df.selectExpr("CAST(value AS STRING)")
      .withColumn("value", lower(col("value")))
      .select(from_json(col("value"), cefFormat.cef_json).as("data"))
      .select("data.*")

    import spark.sqlContext.implicits._
    val newDF = msg.withColumn("created", lit(current_timestamp()))

    msg.writeStream
      .foreachBatch { (batchDF, _) =>
        val syslogDF = batchDF.filter(!$"message".contains("invalid syslog message:"))
          .filter(!$"message".contains("fluentd worker is now stopping worker=0"))
          .filter(!$"message".contains("fluentd worker is now running worker=0"))
        val syslogRDD = syslogDF.rdd.map(r => {
          r.getString(0)
        }).map(x => {
          parseSysLog(x)
        })
          .filter(x => deviceOSMapping.value.contains(x._1))
        try {
          val threat_9_0_DF = spark.sqlContext.createDataFrame(syslogRDD.filter(x => deviceOSMapping.value(x._1).equals("9.0") & x._2.equals("threat"))
            .map(x => Row.fromSeq(x._3)), formatPA.threat_9_0)
          if (debug)
            threat_9_0_DF.show(true)
          threat_9_0_DF.write
            .cassandraFormat(configHelper.cassandra_table_syslog, configHelper.cassandra_keyspace)
            .mode("append")
            .save
          println("threat_9_0_DF saved")
        }
        catch {
          case e: Exception => {
            println(e.getMessage)
          }
        }
        try {
          val traffic_9_0_DF = spark.sqlContext.createDataFrame(syslogRDD.filter(x => deviceOSMapping.value(x._1).equals("9.0") & x._2.equals("traffic"))
            .map(x => Row.fromSeq(x._3)), formatPA.traffic_9_0)
          if (debug)
            traffic_9_0_DF.show(true)
          traffic_9_0_DF.write
            .cassandraFormat(configHelper.cassandra_table_syslog, configHelper.cassandra_keyspace)
            .mode("append")
            .save
          println("traffic_9_0_DF saved")
        }
        catch {
          case e: Exception => {
            println(e.getMessage)
          }
        }
      }.start().awaitTermination()

    def parseSysLog(msg: String): (String, String, List[String]) = {
      //println("PRINTING MESSAGES")
      //println(msg)
      val splitmsg = msg.split(",")
      val traffic_type = splitmsg(3)
      val temp = splitmsg(0).split(" ")
      val date_time = temp.dropRight(2).mkString(" ")
      val domain_name = temp(temp.size - 2)
      val future_use1 = temp(temp.size - 1)
      val device_name = domain_name.split("\\.")(0)
      var result = new ListBuffer[String]()
      //result+=temp2
      result += date_time
      result += domain_name
      result += future_use1
      result = result ++ splitmsg.slice(1, splitmsg.size).toList
      (device_name, traffic_type, result.toList)
    }
  }
}
package configurations
import org.apache.spark.sql.types.{StringType, StructType, TimestampType, DateType}
object formatPA {
val threat_9_0=new StructType()
.add("date_time",StringType)
.add("log_source",StringType)
.add("future_use1",StringType)
.add("received_time",StringType)
.add("serial_number",StringType)
.add("traffic_type",StringType)
.add("threat_content_type",StringType)
.add("future_use2",StringType)
.add("generated_time",StringType)
.add("src_ip",StringType)
.add("dst_ip",StringType)
.add("src_nat",StringType)
.add("dst_nat",StringType)
.add("rule_name",StringType)
.add("src_user",StringType)
.add("dst_user",StringType)
.add("app",StringType)
.add("vsys",StringType)
.add("src_zone",StringType)
.add("dst_zone",StringType)
.add("igr_int",StringType)
.add("egr_int",StringType)
.add("log_fw_profile",StringType)
.add("future_use3",StringType)
.add("session_id",StringType)
.add("repeat_count",StringType)
.add("src_port",StringType)
.add("dst_port",StringType)
.add("src_nat_port",StringType)
.add("dst_nat_port",StringType)
.add("flags",StringType)
.add("protocol",StringType)
.add("action",StringType)
.add("miscellaneous",StringType)
.add("threat_id",StringType)
.add("category",StringType)
.add("severity",StringType)
.add("direction",StringType)
.add("seq_num",StringType)
.add("act_flag",StringType)
.add("src_geo_location",StringType)
.add("dst_geo_location",StringType)
.add("future_use4",StringType)
.add("content_type",StringType)
.add("pcap_id",StringType)
.add("file_digest",StringType)
.add("apt_cloud",StringType)
.add("url_index",StringType)
.add("user_agent",StringType)
.add("file_type",StringType)
.add("x_forwarded_for",StringType)
.add("referer",StringType)
.add("sender",StringType)
.add("subject",StringType)
.add("recipient",StringType)
.add("report_id",StringType)
.add("dghl1",StringType)
.add("dghl2",StringType)
.add("dghl3",StringType)
.add("dghl4",StringType)
.add("vsys_name",StringType)
.add("device_name",StringType)
.add("future_use5",StringType)
.add("src_vm_uuid",StringType)
.add("dst_vm_uuid",StringType)
.add("http_method",StringType)
.add("tunnel_id_imsi",StringType)
.add("monitor_tag_imei",StringType)
.add("parent_session_id",StringType)
.add("parent_start_time",StringType)
.add("tunnel_type",StringType)
.add("threat_category",StringType)
.add("content_version",StringType)
.add("future_use6",StringType)
.add("sctp_association_id",StringType)
.add("payload_protocol_id",StringType)
.add("http_headers",StringType)
.add("url_category_list",StringType)
.add("uuid_for_rule",StringType)
.add("http_2_connection",StringType)
.add("created",TimestampType)
val traffic_9_0=new StructType()
.add("date_time",StringType)
.add("log_source",StringType)
.add("future_use1",StringType)
.add("received_time",StringType)
.add("serial_number",StringType)
.add("traffic_type",StringType)
.add("threat_content_type",StringType)
.add("future_use2",StringType)
.add("generated_time",StringType)
.add("src_ip",StringType)
.add("dst_ip",StringType)
.add("src_nat",StringType)
.add("dst_nat",StringType)
.add("rule_name",StringType)
.add("src_user",StringType)
.add("dst_user",StringType)
.add("app",StringType)
.add("vsys",StringType)
.add("src_zone",StringType)
.add("dst_zone",StringType)
.add("igr_int",StringType)
.add("egr_int",StringType)
.add("log_fw_profile",StringType)
.add("future_use3",StringType)
.add("session_id",StringType)
.add("repeat_count",StringType)
.add("src_port",StringType)
.add("dst_port",StringType)
.add("src_nat_port",StringType)
.add("dst_nat_port",StringType)
.add("flags",StringType)
.add("protocol",StringType)
.add("action",StringType)
.add("bytes",StringType)
.add("bytes_sent",StringType)
.add("bytes_received",StringType)
.add("packets",StringType)
.add("start_time",StringType)
.add("end_time",StringType)
.add("category",StringType)
.add("future_use4",StringType)
.add("seq_num",StringType)
.add("act_flag",StringType)
.add("src_geo_location",StringType)
.add("dst_geo_location",StringType)
.add("future_use5",StringType)
.add("packet_sent",StringType)
.add("packet_received",StringType)
.add("session_end_reason",StringType)
.add("dghl1",StringType)
.add("dghl2",StringType)
.add("dghl3",StringType)
.add("dghl4",StringType)
.add("vsys_name",StringType)
.add("device_name",StringType)
.add("action_source",StringType)
.add("src_vm_uuid",StringType)
.add("dst_vm_uuid",StringType)
.add("tunnel_id_imsi",StringType)
.add("monitor_tag_imei",StringType)
.add("parent_session_id",StringType)
.add("parent_start_time",StringType)
.add("tunnel_type",StringType)
.add("sctp_association_id",StringType)
.add("sctp_chunks",StringType)
.add("sctp_chunks_sent",StringType)
.add("sctp_chunks_received",StringType)
.add("uuid_for_rule",StringType)
.add("http_2_connection",StringType)
.add("created",TimestampType)
}
The output for the above code is as follows:
+---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+
|date_time|log_source|future_use1|received_time|serial_number|traffic_type|threat_content_type|future_use2|generated_time|src_ip|dst_ip|src_nat|dst_nat|rule_name|src_user|dst_user|app|vsys|src_zone|dst_zone|igr_int|egr_int|log_fw_profile|future_use3|session_id|repeat_count|src_port|dst_port|src_nat_port|dst_nat_port|flags|protocol|action|miscellaneous|threat_id|category|severity|direction|seq_num|act_flag|src_geo_location|dst_geo_location|future_use4|content_type|pcap_id|file_digest|apt_cloud|url_index|user_agent|file_type|x_forwarded_for|referer|sender|subject|recipient|report_id|dghl1|dghl2|dghl3|dghl4|vsys_name|device_name|future_use5|src_vm_uuid|dst_vm_uuid|http_method|tunnel_id_imsi|monitor_tag_imei|parent_session_id|parent_start_time|tunnel_type|threat_category|content_version|future_use6|sctp_association_id|payload_protocol_id|http_headers|url_category_list|uuid_for_rule|http_2_connection|created|
+---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+
+---------+----------+-----------+-------------+-------------+------------+-------------------+-----------+--------------+------+------+-------+-------+---------+--------+--------+---+----+--------+--------+-------+-------+--------------+-----------+----------+------------+--------+--------+------------+------------+-----+--------+------+-------------+---------+--------+--------+---------+-------+--------+----------------+----------------+-----------+------------+-------+-----------+---------+---------+----------+---------+---------------+-------+------+-------+---------+---------+-----+-----+-----+-----+---------+-----------+-----------+-----------+-----------+-----------+--------------+----------------+-----------------+-----------------+-----------+---------------+---------------+-----------+-------------------+-------------------+------------+-----------------+-------------+-----------------+-------+
threat_9_0_DF saved
20/01/08 14:59:49 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 69
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 69, created), TimestampType), true, false) AS created#773
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:292)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 69
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191)
at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:166)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_34$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
... 25 more
It looks like it does not matter that the messages are in JSON format, does it?
Let's then use a sample dataset of any schema and add a timestamp column.
val messages = spark.range(3)
scala> messages.printSchema
root
|-- id: long (nullable = false)
val withTs = messages.withColumn("timestamp", current_timestamp())
scala> withTs.printSchema
root
|-- id: long (nullable = false)
|-- timestamp: timestamp (nullable = false)
That gives you a dataset with a timestamp column.
The following line in your code should work, too (you don't need lit).
val xDF = thDF.withColumn("created", lit(current_timestamp())) //This does not get cast to TimestampType
What do you mean by "This does not get cast to TimestampType"? How do you check it out? Are you perhaps confusing TimestampType in Spark and Cassandra? The Spark connector for Cassandra should handle it.
Let's give that a try:
val litTs = spark.range(3).withColumn("ts", lit(current_timestamp))
scala> litTs.printSchema
root
|-- id: long (nullable = false)
|-- ts: timestamp (nullable = false)
import org.apache.spark.sql.types._
val dataType = litTs.schema("ts").dataType
assert(dataType.isInstanceOf[TimestampType])
scala> println(dataType)
TimestampType

NullPointerException in org.apache.spark.ml.feature.Tokenizer

I want to compute TF-IDF features on the title and description fields separately, and then combine those features with a VectorAssembler so that the final classifier can operate on them.
It works fine if I use a single serial flow that is simply
titleTokenizer -> titleHashingTF -> VectorAssembler
But I need both like so:
titleTokenizer -> titleHashingTF
                                             -> VectorAssembler
descriptionTokenizer -> descriptionHashingTF
Code here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.log4j.{Level, Logger}
object SimplePipeline {
  def main(args: Array[String]) {
    // setup boilerplate
    val conf = new SparkConf()
      .setAppName("Pipeline example")
    val sc = new SparkContext(conf)
    val spark = SparkSession
      .builder()
      .appName("Session for SimplePipeline")
      .getOrCreate()

    val all_df = spark.read.json("file:///Users/me/data.json")
    val numLabels = all_df.count()

    // split into training and testing
    val Array(training, testing) = all_df.randomSplit(Array(0.75, 0.25))
    val nTraining = training.count();
    val nTesting = testing.count();
    println(s"Loaded $nTraining training labels...");
    println(s"Loaded $nTesting testing labels...");

    // convert string labels to integers
    val indexer = new StringIndexer()
      .setInputCol("rating")
      .setOutputCol("label")

    // tokenize our string inputs
    val titleTokenizer = new Tokenizer()
      .setInputCol("title")
      .setOutputCol("title_words")
    val descriptionTokenizer = new Tokenizer()
      .setInputCol("description")
      .setOutputCol("description_words")

    // count term frequencies
    val titleHashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(titleTokenizer.getOutputCol)
      .setOutputCol("title_tfs")
    val descriptionHashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(descriptionTokenizer.getOutputCol)
      .setOutputCol("description_tfs")

    // combine features together
    val assembler = new VectorAssembler()
      .setInputCols(Array(titleHashingTF.getOutputCol, descriptionHashingTF.getOutputCol))
      .setOutputCol("features")

    // set params for our model
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // pipeline that combines all stages
    val stages = Array(indexer, titleTokenizer, titleHashingTF, descriptionTokenizer, descriptionHashingTF, assembler, lr);
    val pipeline = new Pipeline().setStages(stages);

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Make predictions.
    val predictions = model.transform(testing)

    // Select example rows to display.
    predictions.select("label", "rawPrediction", "prediction").show()

    sc.stop()
  }
}
and my data file is simply a newline-delimited file of JSON objects (one per line):
{"title" : "xxxxxx", "description" : "yyyyy" .... }
{"title" : "zzzzzz", "description" : "zxzxzx" .... }
The error I get is very long and difficult to understand, but the important part (I think) is a java.lang.NullPointerException:
ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 12)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
... 23 more
How should I be properly crafting my Pipeline to do this?
(Also I'm completely new to Scala)
The problem here is that you don't validate the data and some of the values are NULL. It is pretty easy to reproduce this:
val df = Seq((1, Some("abcd bcde cdef")), (2, None)).toDF("id", "description")
val tokenizer = new Tokenizer().setInputCol("description")
tokenizer.transform(df).foreach(_ => ())
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
...
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
...
You can, for example, drop the rows with nulls:
tokenizer.transform(df.na.drop(Array("description")))
or replace these with empty strings:
tokenizer.transform(df.na.fill(Map("description" -> "")))
whichever makes more sense in your application.
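Applied to the pipeline from the question, that could look roughly like this (a sketch; it fills both text columns, whose names are taken from the question, with empty strings before fitting):
// Replace null title/description values so the Tokenizers never see a null input.
val cleanedTraining = training.na.fill(Map("title" -> "", "description" -> ""))
val cleanedTesting = testing.na.fill(Map("title" -> "", "description" -> ""))

val model = pipeline.fit(cleanedTraining)
val predictions = model.transform(cleanedTesting)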

structured streaming with Spark 2.0.2, Kafka source and scalapb

I am using Structured Streaming (Spark 2.0.2) to consume Kafka messages that are encoded in protobuf and parsed with ScalaPB. I am getting the following error. Please help.
Exception in thread "main" scala.ScalaReflectionException: is not a term
at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:811)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:800)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:582)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:460)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
at PersonConsumer$.main(PersonConsumer.scala:33)
at PersonConsumer.main(PersonConsumer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The following is my code ...
object PersonConsumer {
  import org.apache.spark.rdd.RDD
  import com.trueaccord.scalapb.spark._
  import org.apache.spark.sql.{SQLContext, SparkSession}
  import com.example.protos.demo._

  def main(args : Array[String]) {
    def parseLine(s: String): Person =
      Person.parseFrom(
        org.apache.commons.codec.binary.Base64.decodeBase64(s))

    val spark = SparkSession.builder.
      master("local")
      .appName("spark session example")
      .getOrCreate()

    import spark.implicits._

    val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","person").load()
    val ds2 = ds1.selectExpr("CAST(value AS STRING)").as[String]
    val ds3 = ds2.map(str => parseLine(str)).createOrReplaceTempView("persons")
    val ds4 = spark.sqlContext.sql("select name from persons")

    val query = ds4.writeStream
      .outputMode("append")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
The line with val ds3 should be:
val ds3 = ds2.map(str => parseLine(str))
sqlContext.protoToDataFrame(ds3).registerTempTable("persons")
The RDD needs to be converted to a DataFrame before it is saved as a temp table.
In the Person class, gender is an enum, and this was the cause of the problem. After removing this field, it works fine.
The following is the answer I got from Shixiong (Ryan) of Databricks.
The problem is "optional Gender gender = 3;". The generated class "Gender" is a trait, and Spark cannot know how to create a trait so it's not supported. You can define your class which is supported by SQL Encoder, and convert this generated class to the new class in parseLine.