Null pointer exception with GenericWriteAheadSink and FileCheckpointCommitter in Flink 1.8 - scala

I'm implementing a custom NSQ sink for Flink. I have it working as a subclass of RichSinkFunction, but I'd like to get the write-ahead log implementation working for extra data integrity.
Using O'Reilly's WriteAheadSinkExample available here, I attempted to implement my own:
package com.wistia.analytics
import java.net.{InetSocketAddress, SocketAddress}
import com.github.mitallast.nsq._
import org.apache.flink.api.scala.createTypeInformation
import java.lang.Iterable
import java.nio.file.{Files, Paths}
import java.util.UUID
import org.apache.commons.lang3.StringUtils
import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.streaming.runtime.operators.{CheckpointCommitter, GenericWriteAheadSink}
import scala.collection.mutable
class WALNsqSink(val topic: String) extends GenericWriteAheadSink[String](
// CheckpointCommitter that commits checkpoints to the local filesystem
new FileCheckpointCommitter(System.getProperty("java.io.tmpdir")),
// Serializer for records
createTypeInformation[String]
.createSerializer(new ExecutionConfig),
// Random JobID used by the CheckpointCommitter
UUID.randomUUID.toString) {
var client: NSQClient = _
var producer: NSQProducer = _
override def open(): Unit = {
val lookup = new NSQLookup {
def nodes(): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1",4150))
def lookup(topic: String): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1",4150))
}
client = NSQClient(lookup)
producer = client.producer()
}
def sendValues(readings: Iterable[String], checkpointId: Long, timestamp: Long): Boolean = {
val arr = mutable.Seq()
readings.forEach{ reading =>
arr :+ reading
}
producer.mpubStr(topic=topic, data=arr)
true
}
}
reusing FileCheckpointCommitter at the bottom of the class, but I get a null pointer exception inside GenericWriteAheadSink:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:638)
at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:123)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
at com.wistia.analytics.NsqProcessor$.main(NsqProcessor.scala:24)
at com.wistia.analytics.NsqProcessor.main(NsqProcessor.scala)
Caused by: java.lang.NullPointerException
at org.apache.flink.streaming.runtime.operators.GenericWriteAheadSink.processElement(GenericWriteAheadSink.java:277)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Nonzero exit code returned from runner: 1
(Compile / run) Nonzero exit code returned from runner: 1
Total time: 45 s, completed Feb 10, 2020 6:41:06 PM
I have no idea where to go from here. Any help appreciated

The issue here is most certainly the fact that You never call the open() method of the superclass. This will cause some of the variables to be uninitialized.
This should be solved by calling the super.open() inside Your open() method.

Related

How should I test akka-streams RestartingSource usage

I'm working on an application that has a couple of long-running streams going, where it subscribes to data about a certain entity and processes that data. These streams should be up 24/7, so we needed to handle failures (network issues etc).
For that purpose, we've wrapped our sources in RestartingSource.
I'm now trying to verify this behaviour, and while it looks like it functions, I'm struggling to create a test where I push in some data, verify that it processes correctly, then send an error, and verify that it reconnects after that and continues processing.
I've boiled that down to this minimal case:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink, Source}
import akka.stream.testkit.TestPublisher
import org.scalatest.concurrent.Eventually
import org.scalatest.{FlatSpec, Matchers}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext
class MinimalSpec extends FlatSpec with Matchers with Eventually {
"restarting a failed source" should "be testable" in {
implicit val sys: ActorSystem = ActorSystem("akka-grpc-measurements-for-test")
implicit val mat: ActorMaterializer = ActorMaterializer()
implicit val ec: ExecutionContext = sys.dispatcher
val probe = TestPublisher.probe[Int]()
val restartingSource = RestartSource
.onFailuresWithBackoff(1 second, 1 minute, 0d) { () => Source.fromPublisher(probe) }
var last: Int = 0
val sink = Sink.foreach { l: Int => last = l }
restartingSource.runWith(sink)
probe.sendNext(1)
eventually {
last shouldBe 1
}
probe.sendNext(2)
eventually {
last shouldBe 2
}
probe.sendError(new RuntimeException("boom"))
probe.expectSubscription()
probe.sendNext(3)
eventually {
last shouldBe 3
}
}
}
This test consistently fails on the last eventually block with Last failure message: 2 was not equal to 3. What am I missing here?
Edit: akka version is 2.5.31
I figured it out after having had a look at the TestPublisher code. Its subscription is a lazy val. So when RestartSource detects the error, and executes the factory method () => Source.fromPublisher(probe) again, it gets a new Source, but the subscription of the probe is still pointing to the old Source. Changing the code to initialize both a new Source and TestPublisher works.

Can't understand TwoPhaseCommitSinkFunction lifecycle

I needed a sink to Postgres DB, so I started to build a custom Flink SinkFunction. As FlinkKafkaProducer implements TwoPhaseCommitSinkFunction, then I decided to do the same. As stated in O'Reilley's book Stream Processing with Apache Flink, you just need to implement the abstract methods, enable checkpointing and you're up to go. But what really happens when I run my code is that commit method is called only once, and it is called before invoke, what is totally unexpected since you shouldn't be ready to commit if your set of ready-to-commit transactions is empty. And the worst is that, after committing, invoke is called for all of the transaction lines present in my file, and then abort is called, which is even more unexpected.
When the Sink is initialized, It is of my understanding that the following should occur:
beginTransaction is called and sends an identifier to invoke
invoke adds the lines to the transaction, according to the identifier received
pre-commit makes all final modification on current transaction data
commit handles the finalized transaction of pre-commited data
So, I can't see why my program doesn't show this behaviour.
Here goes my sink code:
package PostgresConnector
import java.sql.{BatchUpdateException, DriverManager, PreparedStatement, SQLException, Timestamp}
import java.text.ParseException
import java.util.{Date, Properties, UUID}
import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{SinkFunction, TwoPhaseCommitSinkFunction}
import org.apache.flink.streaming.api.scala._
import org.slf4j.{Logger, LoggerFactory}
class PostgreSink(props : Properties, config : ExecutionConfig) extends TwoPhaseCommitSinkFunction[(String,String,String,String),String,String](createTypeInformation[String].createSerializer(config),createTypeInformation[String].createSerializer(config)){
private var transactionMap : Map[String,Array[(String,String,String,String)]] = Map()
private var parsedQuery : PreparedStatement = _
private val insertionString : String = "INSERT INTO mydb (field1,field2,point) values (?,?,point(?,?))"
override def invoke(transaction: String, value: (String,String,String,String), context: SinkFunction.Context[_]): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val res = this.transactionMap.get(transaction)
if(res.isDefined){
var array = res.get
array = array ++ Array(value)
this.transactionMap += (transaction -> array)
}else{
val array = Array(value)
this.transactionMap += (transaction -> array)
}
LOG.info("\n\nPassing through invoke\n\n")
()
}
override def beginTransaction(): String = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val identifier = UUID.randomUUID.toString
LOG.info("\n\nPassing through beginTransaction\n\n")
identifier
}
override def preCommit(transaction: String): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
try{
val tuple : Option[Array[(String,String,String,String)]]= this.transactionMap.get(transaction)
if(tuple.isDefined){
tuple.get.foreach( (value : (String,String,String,String)) => {
LOG.info("\n\n"+value.toString()+"\n\n")
this.parsedQuery.setString(1,value._1)
this.parsedQuery.setString(2,value._2)
this.parsedQuery.setString(3,value._3)
this.parsedQuery.setString(4,value._4)
this.parsedQuery.addBatch()
})
}
}catch{
case e : SQLException =>
LOG.info("\n\nError when adding transaction to batch: SQLException\n\n")
case f : ParseException =>
LOG.info("\n\nError when adding transaction to batch: ParseException\n\n")
case g : NoSuchElementException =>
LOG.info("\n\nError when adding transaction to batch: NoSuchElementException\n\n")
case h : Exception =>
LOG.info("\n\nError when adding transaction to batch: Exception\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through preCommit...\n\n")
}
override def commit(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
if(this.parsedQuery != null) {
LOG.info("\n\n" + this.parsedQuery.toString+ "\n\n")
}
try{
this.parsedQuery.executeBatch
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\nExecuting batch\n\n")
}catch{
case e : SQLException =>
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\n"+"Error : SQLException"+"\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through commit...\n\n")
}
override def abort(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through abort...\n\n")
}
override def open(parameters: Configuration): Unit = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val driver = props.getProperty("driver")
val url = props.getProperty("url")
val user = props.getProperty("user")
val password = props.getProperty("password")
Class.forName(driver)
val connection = DriverManager.getConnection(url + "?user=" + user + "&password=" + password)
this.parsedQuery = connection.prepareStatement(insertionString)
LOG.info("\n\nConfiguring BD conection parameters\n\n")
}
}
And this is my main program:
package FlinkCEPClasses
import PostgresConnector.PostgreSink
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.cep.PatternSelectFunction
import org.apache.flink.cep.pattern.conditions.SimpleCondition
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.core.fs.{FileSystem, Path}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import java.util.Properties
import org.apache.flink.api.common.ExecutionConfig
import org.slf4j.{Logger, LoggerFactory}
class FlinkCEPPipeline {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPPipeline])
LOG.info("\n\nStarting the pipeline...\n\n")
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
//var input : DataStream[String] = env.readFile(new TextInputFormat(new Path("/home/luca/Desktop/lines")),"/home/luca/Desktop/lines",FileProcessingMode.PROCESS_CONTINUOUSLY,1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Raw stream")
var tupleStream : DataStream[(String,String,String,String)] = input.map(new S2PMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","luca")
properties.setProperty("password","root")
tupleStream.addSink(new PostgreSink(properties,env.getConfig)).name("Postgres Sink").setParallelism(1)
tupleStream.writeAsText("/home/luca/Desktop/output",FileSystem.WriteMode.OVERWRITE).name("File Sink").setParallelism(1)
env.execute()
}
My S2PMapFunction code:
package FlinkCEPClasses
import org.apache.flink.api.common.functions.MapFunction
case class S2PMapFunction() extends MapFunction[String,(String,String,String,String)] {
override def map(value: String): (String, String, String,String) = {
var tuple = value.replaceAllLiterally("(","").replaceAllLiterally(")","").split(',')
(tuple(0),tuple(1),tuple(2),tuple(3))
}
}
My pipeline works like this: I read lines from a file, map them to a tuple of strings, and use the data inside the tuples to save them in a Postgres DB
If you want to simulate the data, just create a file with lines in a format like this:
(field1,field2,pointx,pointy)
Edit
The execution order of the TwoPhaseCommitSinkFUnction's methods is the following:
Starting pipeline...
beginTransaction
preCommit
beginTransaction
commit
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
abort
I'm not an expert on this topic, but a couple of guesses:
preCommit is called whenever Flink begins a checkpoint, and commit is called when the checkpoint is complete. These methods are called simply because checkpointing is happening, regardless of whether the sink has received any data.
Checkpointing is happening periodically, regardless of whether any data is flowing through your pipeline. Given your very short checkpointing interval (10 msec), it does seem plausible that the first checkpoint barrier will reach the sink before the source has managed to send it any data.
It also looks like you are assuming that only one transaction will be open at a time. I'm not sure that's strictly guaranteed, but so long as maxConcurrentCheckpoints is 1 (which is the default), you should be okay.
So, here goes the "answer" for this question. Just to be clear: at this moment, the problem about the TwoPhaseCommitSinkFunction hasn't been solved yet. If what you're looking for is about the original problem, then you should look for another answer. If you don't care about what you'll use as a sink, then maybe I can help you with that.
As suggested by #DavidAnderson, I started to study the Table API and see if it could solve my problem, which was using Flink to insert lines in my database table.
It turned out to be really simple, as you'll see.
OBS: Beware of the version you are using. My Flink's version is 1.9.0.
Source code
package FlinkCEPClasses
import java.sql.Timestamp
import java.util.Properties
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.sinks.TableSink
import org.postgresql.Driver
class TableAPIPipeline {
// --- normal pipeline initialization in this block ---
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Original stream")
var tupleStream : DataStream[(String,Timestamp,Double,Double)] = input.map(new S2PlacaMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","myuser")
properties.setProperty("password","mypassword")
// --- normal pipeline initialization in this block END ---
// These two lines create what Flink calls StreamTableEnvironment.
// It seems pretty similar to a normal stream initialization.
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(env,settings)
//Since I wanted to sink data into a database, I used JDBC TableSink,
//because it is very intuitive and is a exact match with my need. You may
//look for other TableSink classes that fit better in you solution.
var tableSink : JDBCAppendTableSink = JDBCAppendTableSink.builder()
.setBatchSize(1)
.setDBUrl("jdbc:postgresql://localhost:5432/mydb")
.setDrivername("org.postgresql.Driver")
.setPassword("mypassword")
.setUsername("myuser")
.setQuery("INSERT INTO mytable (data1,data2,data3) VALUES (?,?,point(?,?))")
.setParameterTypes(Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE,Types.DOUBLE)
.build()
val fieldNames = Array("data1","data2","data3","data4")
val fieldTypes = Array[TypeInformation[_]](Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE, Types.DOUBLE)
// This is the crucial part of the code: first, you need to register
// your table sink, informing the name, the field names, field types and
// the TableSink object.
tableEnv.registerTableSink("postgres-table-sink",
fieldNames,
fieldTypes,
tableSink
)
// Then, you transform your DataStream into a Table object.
var table = tableEnv.fromDataStream(tupleStream)
// Finally, you insert your stream data into the registered sink.
table.insertInto("postgres-table-sink")
env.execute()
}

Class 'SessionTrigger' must either be declared abstract or implement abstract member

I am building a tutorial on Flink 1.2 and I want to run some simple windowing examples. One of them is Session Windows.
The code I want to run is the following:
import <package>.Session
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import scala.util.Try
object SessionWindowExample {
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = env.socketTextStream("localhost", 9000)
//session map
val values = source.map(value => {
val columns = value.split(",")
val endSignal = Try(Some(columns(2))).getOrElse(None)
Session(columns(0), columns(1).toDouble, endSignal)
})
val keyValue = values.keyBy(_.sessionId)
// create global window
val sessionWindowStream = keyValue.
window(GlobalWindows.create()).
trigger(PurgingTrigger.of(new SessionTrigger[GlobalWindow]()))
sessionWindowStream.sum("value").print()
env.execute()
}
}
As you 'll notice I need to instantiate a new SessionTrigger object which I do, based on this class:
import <package>.Session
import org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.Window
class SessionTrigger[W <: Window] extends Trigger[Session,W] {
override def onElement(element: Session, timestamp: Long, window: W, ctx: TriggerContext): TriggerResult = {
if(element.endSignal.isDefined) TriggerResult.FIRE
else TriggerResult.CONTINUE
}
override def onProcessingTime(time: Long, window: W, ctx: TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
override def onEventTime(time: Long, window: W, ctx: TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
}
However, InteliJ keeps complaining that:
Class 'SessionTrigger' must either be declared abstract or implement abstract member 'clear(window: W, ctx: TriggerContext):void' in 'org.apache.flink.streaming.api.windowing.triggers.Trigger'.
I tried adding this in the class:
override def clear(window: W, ctx: TriggerContext): Unit = ctx.deleteEventTimeTimer(4)
but it is not working. This is the error I am getting:
03/27/2017 15:48:38 TriggerWindow(GlobalWindows(), ReducingStateDescriptor{serializer=co.uk.DRUK.flink.windowing.SessionWindowExample.SessionWindowExample$$anon$2$$anon$1#1aec64d0, reduceFunction=org.apache.flink.streaming.api.functions.aggregation.SumAggregator#1a052a00}, PurgingTrigger(co.uk.DRUK.flink.windowing.SessionWindowExample.SessionTrigger#f2f2cc1), WindowedStream.reduce(WindowedStream.java:276)) -> Sink: Unnamed(4/4) switched to CANCELED
03/27/2017 15:48:38 Job execution switched to status FAILED.
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:900)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:843)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:843)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at co.uk.DRUK.flink.windowing.SessionWindowExample.SessionWindowExample$$anonfun$1.apply(SessionWindowExample.scala:27)
at co.uk.DRUK.flink.windowing.SessionWindowExample.SessionWindowExample$$anonfun$1.apply(SessionWindowExample.scala:24)
at org.apache.flink.streaming.api.scala.DataStream$$anon$4.map(DataStream.scala:521)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:185)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:272)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
Anybody know why?
Well the exception clearly says
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at co.uk.DRUK.flink.windowing.SessionWindowExample.SessionWindowExample$$anonfun$1.apply(SessionWindowExample.scala:27)
at co.uk.DRUK.flink.windowing.SessionWindowExample.SessionWindowExample$$anonfun$1.apply(SessionWindowExample.scala:24)
which obviously maps to the following line of code
Session(columns(0), columns(1).toDouble, endSignal)
So the next obvious thing is to log your columns and value after
val columns = value.split(",")
I suspect that value just doesn't contain second comma-separated column at least for some values.

Code working in Spark-Shell not in eclipse

I have a small Scala code which works properly on Spark-Shell but not in Eclipse with Scala plugin. I can access hdfs using plugin tried writing another file and it worked..
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java. io. _
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object FirstSpark {
def main(args: Array[String])={
val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
val sparkcontext = new SparkContext(conf)
val textFile =sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
val m = new Methods()
val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x))
q.saveAsTextFile("hdfs://pranay:8020/output") }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction
class Methods {
def isHeader(s:String):Boolean={
s.contains("id_1")
}
def parse(line:String) ={
val pieces = line.split(',')
val id1=pieces(0).toInt
val id2=pieces(1).toInt
val matches=pieces(11).toBoolean
val mapArray=pieces.slice(2, 11).map(toDouble)
MatchData(id1,id2,mapArray,matches)
}
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
}
case class MatchData(id1: Int, id2: Int,
scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).

NullPointerException when Plain SQL and String Interpolation

I'm having some trouble doing a Plain SQL Query using String Interpolation. I'm trying to query a MS-SQL 2008 database using Microsofts sqljdbc4.jar.
The table is simple:
[serv_num]
number (nvarchar(15)) | name (nvarchar(50))
=============================================
1234 | AA
+345 | BB
The code I'm using is as follows:
import java.sql.{Connection, DriverManager}
import org.slf4j.LoggerFactory
import scala.collection.JavaConversions._
import scala.slick.session.Database
import Database.threadLocalSession
import scala.slick.jdbc.{GetResult, StaticQuery => Q}
import Q.interpolation
object NumImport extends App {
def logger = LoggerFactory.getLogger("NumImport")
implicit val getServNumResult = GetResult(r => ServNum(r.<<, r.<<))
override def main(args: Array[String]) {
val uri = "jdbc:sqlserver://192.168.1.2;databaseName=slicktestdb;user=yyyy;password=xxxxxx"
val drv = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
Database.forURL(uri, driver = drv) withSession {
def getServNum(n: String) = sql"select number, name from serv_num where number = $n".as[ServNum]
val result = getServNum("1234")
println(result.firstOption)
}
}
case class ServNum(num: String, name: String)
}
My issue is that I'm getting a java.lang.NullPointerException on result.firstOption (I think!). This happens when a row is found. If try to select a number that doesn't exist, say getServNum("34"), then result.firstOption is a None, as expected.
Stacktrace:
08:54:47.932 TKD [run-main] DEBUG scala.slick.session.BaseSession - Preparing statement: select number, name from serv_num where number = ?
[error] (run-main) java.lang.NullPointerException
java.lang.NullPointerException
at scala.slick.jdbc.StaticQuery.extractValue(StaticQuery.scala:16)
at scala.slick.jdbc.StatementInvoker$$anon$1.extractValue(StatementInvoker.scala:37)
at scala.slick.session.PositionedResultIterator.foreach(PositionedResult.scala:212)
at scala.slick.jdbc.Invoker$class.foreach(Invoker.scala:91)
at scala.slick.jdbc.StatementInvoker.foreach(StatementInvoker.scala:10)
at scala.slick.jdbc.Invoker$class.firstOption(Invoker.scala:41)
at scala.slick.jdbc.StatementInvoker.firstOption(StatementInvoker.scala:10)
at scala.slick.jdbc.UnitInvoker$class.firstOption(Invoker.scala:148)
at scala.slick.jdbc.StaticQuery0.firstOption(StaticQuery.scala:95)
I tried debugging and following the stacktrace, but I'm quite new to both Scala and Slick, so I'm lost. If anyone could help me or point me in the right direction I would appreciate it a lot.
Thanks.
In looking at the slick source code where the exception was happening, I could see that it was trying to invoke apply on the GetResult object, but that object seemed to be null. In thinking about what might of caused it, I'm suggesting that you move the implicit inside of the main method. It seems that the way things are structured, slick is losing track of the implicit ResultSet to case class converter (the GetResult)