NullPointerException when using Flink's leftOuterJoinLateral in Scala

I am trying to follow the documentation and create a Table Function to "flatten" some data. The Table Function seems to work fine when using the joinLateral to do the flattening. When using leftOuterJoinLateral though, I get the following error. I'm using Scala and have tried both Table API and SQL with the same result:
Caused by: java.lang.NullPointerException: Null result cannot be stored in a Case Class.
Here is my job:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.functions.TableFunction
object example_job {
  // Split the List[Int] into multiple rows
  class Split() extends TableFunction[Int] {
    def eval(nums: List[Int]): Unit = {
      nums.foreach(x =>
        if (x != 3) {
          collect(x)
        })
    }
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    val splitMe = new Split()

    // Create some dummy data
    val events: DataStream[(String, List[Int])] = env.fromElements(("simon", List(1,2,3)), ("jessica", List(3)))

    val table = tableEnv.fromDataStream(events, 'name, 'numbers)
      .leftOuterJoinLateral(splitMe('numbers) as 'number)
      .select('name, 'number)

    table.toAppendStream[(String, Int)].print()
    env.execute("Flink jira ticket example")
  }
}
When I change .leftOuterJoinLateral to .joinLateral I get the expected result:
(simon,1)
(simon,2)
When using the .leftOuterJoinLateral I would expect something like:
(simon,1)
(simon,2)
(simon,null) // or (simon, None)
(jessica,null) // or (jessica, None)
Seems like this might be a bug with the Scala API? I wanted to check here first before raising a ticket just in case I'm doing something stupid!

The problem is that, by default, Flink expects all fields of a row to be non-null. That's why the program fails when it sees the null result from the outer join operation. In order to accept null values, you either need to disable the null check via
val tableConfig = tableEnv.getConfig
tableConfig.setNullCheck(false)
Or you can specify a result type that tolerates null values, e.g. a custom POJO output type:
table.toAppendStream[MyOutput].print()
with
class MyOutput(var name: String, var number: Integer) {
  def this() {
    this(null, null)
  }

  override def toString: String = s"($name, $number)"
}
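Applied to the job above, only the sink type changes. A minimal sketch, assuming MyOutput is defined as shown (its field names must match the selected column names):
// Same table as before, but emitted as the null-tolerant POJO instead of a (String, Int) tuple.
val table = tableEnv.fromDataStream(events, 'name, 'numbers)
  .leftOuterJoinLateral(splitMe('numbers) as 'number)
  .select('name, 'number)

table.toAppendStream[MyOutput].print() // rows without a split result come out with number = null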

Related

Cassandra (CQL) select IN query with Cassandra4IO

I am using Scala with the Cassandra4IO library. I am trying to perform a select IN query. The parameter of IN is like a tuple (comma-separated string values), and it has not worked for me. I tried different approaches.
// keys (List[String])
val clientIdCommaSepValues = keys.mkString(",")
val selectValue = selectQuery(clientIdCommaSepValues)
private def selectQuery(clientids: String) =
cql"select * from clientinformation WHERE (clientid IN ( ${clientids} ))".as[CassandraClientInfoRow]
This worked only when there is a single value (the length of keys is 1).
or
private val selectQuery =
cqlt"select * from clientinformation WHERE (clientid IN ${Put[String]}) ".as[CassandraClientInfoRow]
I also tried putting single quotes around the strings.
Sorry for the delay on this. It turns out that adding that extra set of parentheses around your value (in the example above, IN (${clientIds})) throws off the string interpolator, leading it to select the wrong Binder datatype, which is used to serialize the value in your query before it is sent off to Cassandra (ouch!).
This selected TEXT instead of List[TEXT]
What you want to do instead is reformulate the query like so:
val keys: List[String] = ???
val selectValue = selectQuery(keys)
private def selectQuery(clientids: List[String]) =
cql"select * from clientinformation WHERE clientid IN ${clientids}".as[CassandraClientInfoRow]"""
I was able to reproduce this on my end and confirm that dropping the parens fixes it. Here's what I did:
CREATE KEYSPACE example WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
CREATE TABLE IF NOT EXISTS test_data (
  id TEXT,
  data INT,
  PRIMARY KEY ((id))
);
package com.ringcentral.cassandra4io
import cats.effect._
import com.datastax.oss.driver.api.core.CqlSession
import com.ringcentral.cassandra4io.cql._
import fs2._
import java.net.InetSocketAddress
import scala.jdk.CollectionConverters._
object Investigation extends IOApp {
  final case class TestDataRow(id: String, data: Int)

  def insert(in: TestDataRow, session: CassandraSession[IO]): IO[Boolean] =
    cql"INSERT INTO test_data (id, data) VALUES (${in.id}, ${in.data})"
      .execute(session)

  override def run(args: List[String]): IO[ExitCode] = {
    val rSession = {
      val builder =
        CqlSession
          .builder()
          .addContactPoints(List(InetSocketAddress.createUnresolved("localhost", 9042)).asJava)
          .withLocalDatacenter("dc1")
          .withKeyspace("example")
      CassandraSession.connect[IO](builder)
    }

    rSession.use { session =>
      val insertData: Stream[IO, INothing] =
        Stream.eval(insert(TestDataRow("test", 1), session) *> insert(TestDataRow("test2", 2), session)).drain

      def query(ids: List[String]): Stream[IO, TestDataRow] =
        cql"SELECT id, data FROM test_data WHERE id IN $ids"
          .as[TestDataRow]
          .select(session)

      (insertData ++ query(List("test", "test2")))
        .evalTap(i => IO(println(i)))
        .compile
        .drain
        .as(ExitCode.Success)
    }
  }
}
This works great since it now selects the right Binder, which is List(TEXT), as you can see above! Sorry for the trouble you had and the cryptic error messages, but thank you for using this library :D

Mockito verify fails when method takes a function as argument

I have a Scala test which uses Mockito to verify that certain DataFrame transformations are invoked. I broke it down to this simple problematic example
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions._
import org.mockito.{Mockito, MockitoSugar}
class SimpleTest extends AnyFunSuite {
  def withGreeting(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("hello"))
  }

  test("sample test") {
    val mockDF = MockitoSugar.mock[DataFrame]
    val mockDF2 = MockitoSugar.mock[DataFrame]

    MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreeting)

    mockDF.transform(withGreeting)

    val orderVerifier = Mockito.inOrder(mockDF)
    orderVerifier.verify(mockDF).transform(withGreeting)
  }
}
I'm trying to assert that the transform was called on my mockDF, but it fails with
Argument(s) are different! Wanted:
dataset.transform(<function1>);
-> at org.apache.spark.sql.Dataset.transform(Dataset.scala:2182)
Actual invocations have different arguments:
dataset.transform(<function1>);
Why would the verify fail in this case?
You need to save the lambda expression argument for transform as a val and pass that same val to all transform calls:
def withGreeting(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello"))
}

test("sample test") {
  val mockDF = MockitoSugar.mock[DataFrame]
  val mockDF2 = MockitoSugar.mock[DataFrame]

  // Save the function as a val so every call sees the same Function1 instance
  val withGreetingExpression: DataFrame => DataFrame = df => withGreeting(df)

  MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreetingExpression)

  mockDF.transform(withGreetingExpression)

  val orderVerifier = Mockito.inOrder(mockDF)
  orderVerifier.verify(mockDF).transform(withGreetingExpression)
}
Mockito requires you to provide the same (or equal) arguments to the mocked function calls. When you pass the lambda expression without saving it, each call creates a new Function1[DataFrame, DataFrame] object:
transform(withGreeting)
is the same as:
transform(new Function1[DataFrame, DataFrame] {
  override def apply(df: DataFrame): DataFrame = withGreeting(df)
})
And these objects aren't equal to each other - this is the cause of the error message:
Argument(s) are different!
For example, try to execute:
println(((df: DataFrame) => withGreeting(df)) == ((df: DataFrame) => withGreeting(df))) //false
You can read more about object equality in Java (it's the same in Scala):
wikibooks
javaworld.com
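If the test only needs to confirm that transform was called with some function, and the exact instance does not matter, an argument matcher avoids saving the lambda altogether. A minimal sketch, assuming plain Mockito's ArgumentMatchers is on the classpath:
import org.mockito.ArgumentMatchers.any

// Verifies the call without pinning down a specific Function1 instance.
Mockito.verify(mockDF).transform(any[DataFrame => DataFrame]())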

Convert PreparedStatement object to JSON in Scala

I'm trying to convert a PreparedStatement (an object used for sending SQL statements to the database) to JSON with Scala.
So far, I've found that the best way to convert an object to JSON in Scala is to do it with the net.liftweb library.
But when I tried it, I got empty JSON.
This is the code:
import java.sql.DriverManager
import net.liftweb.json._
import net.liftweb.json.Serialization.write
object Main {
  def main(args: Array[String]): Unit = {
    implicit val formats = DefaultFormats

    val jdbcSqlConnStr = "sqlserverurl**"
    val conn = DriverManager.getConnection(jdbcSqlConnStr)
    val statement = conn.prepareStatement("exec select_all")

    val piedPierJSON2 = write(statement)
    println(piedPierJSON2)
  }
}
This is the result:
{}
When I used an object I created myself, the conversion worked:
case class Person(name: String, address: Address)
case class Address(city: String, state: String)
val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
val piedPierJSON3 = write(p)
println(piedPierJSON3)
This is the result
{"name":"Alvin Alexander","address":{"city":"Talkeetna","state":"AK"}}
I figured out where the problem was: PreparedStatement is an interface, and none of its subtypes are serializable...
I'm going to try to wrap it up and put it in a different class.
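A minimal sketch of that wrapping idea (the StatementInfo case class and its fields are made up for illustration): serialize plain values you already have rather than the JDBC object itself.
// Hypothetical wrapper: hold the SQL text you passed in plus any simple, serializable settings.
case class StatementInfo(sql: String, fetchSize: Int)

val sql = "exec select_all"
val statement = conn.prepareStatement(sql)
val statementJson = write(StatementInfo(sql, statement.getFetchSize))
println(statementJson) // e.g. {"sql":"exec select_all","fetchSize":0}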

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, output the matching data:
200,300 (from the 1st input line)
300 (from the 2nd input line)
123 (from the 4th input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map{r => parseLine(r)}
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer, why are we making it so complex? In this case we don't need a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
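To avoid the nested RDD entirely, the per-line comparison can be done with a plain Scala Set inside map, in the spirit of the earlier answer. A minimal sketch reusing the names from the question:
// The match set is ordinary serializable Scala data, so the closure can capture it safely.
val matchSet = Set("123", "200", "300")
val rawData = sc.textFile(filePathName)

// One output string per input line; no SparkContext is touched inside the transformation.
val matched = rawData.map(line => line.split(",").toSet.intersect(matchSet).mkString(","))
matched.collect().foreach(println)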

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
    (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
        Bytes.toString(key.get) -> extract(data, row, interpret)
      }
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose keys are column families and whose values are maps from column names to cell values.
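For illustration, a single cell can then be pulled out of that nested structure like this (a sketch using the column names from the call above):
// rowKey -> value of column "col1" in column family "cf", following the shape described above.
val col1Values = rdd map { case (rowKey, families) => rowKey -> families("cf")("col1") }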
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
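For reference, the TSV helper would be used like this (hypothetical path and field names, assuming TsvReadSupport is mixed in so the implicit conversion is in scope):
// Each line becomes a Map from field name to the corresponding column value.
val rows = sc.tsv("/data/users.tsv", Seq("name", "age"))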
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
If it's necessary to access the Spark context from within a distributed computation, the rdd.context method can be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
  case (k, v) =>
    val ctx = rdd.context
    ....
}