Flink: RowRowConverter seems to fail for nested DataTypes - scala

I am trying to load a complex JSON file (multiple different data types, nested objects/arrays, etc.) from my local file system, read it in as a source using the Table API File System Connector, convert it into a DataStream, and then do some action afterwards (not shown here for brevity).
The conversion gives me a DataStream of type DataStream[Row], which I need to convert to DataStream[RowData] (for sink purposes, won't go into details here). Thankfully, there's a RowRowConverter utility that helps to do this mapping. It worked when I tried a completely flat JSON, but once I introduced arrays and maps within the JSON, it no longer works.
Here is the exception that was thrown - a null pointer exception:
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.allocateWriter(ArrayObjectArrayConverter.java:140)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toBinaryArrayData(ArrayObjectArrayConverter.java:114)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:93)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:40)
at org.apache.flink.table.data.conversion.DataStructureConverter.toInternalOrNull(DataStructureConverter.java:61)
at org.apache.flink.table.data.conversion.RowRowConverter.toInternal(RowRowConverter.java:75)
at flink.ReadJsonNestedData$.$anonfun$main$2(ReadJsonNestedData.scala:48)
Interestingly, when I set breakpoints and stepped through with the debugger, this is what I discovered: the first call to RowRowConverter::toInternal works and goes all the way down to ArrayObjectArrayConverter::allocateWriter().
However, for some strange reason, RowRowConverter::toInternal runs twice, and if I continue stepping through, it eventually comes back to the same place, which is where the null pointer exception happens.
Example of the JSON (simplified to a single nested field for brevity). I placed it in my /src/main/resources folder:
{"discount":[670237.997082,634079.372133,303534.821218]}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.table.types.logical.RowType
import scala.collection.JavaConverters._
object ReadJsonNestedData {
  def main(args: Array[String]): Unit = {
    // setup
    val jsonResource = getClass.getResource("/NESTED.json")
    val jsonFilePath = jsonResource.getPath
    val tableName = "orders"
    val readJSONTable =
      s"""
         | CREATE TABLE $tableName (
         |   `discount` ARRAY<DECIMAL(12, 6)>
         | )WITH (
         |   'connector' = 'filesystem',
         |   'path' = '$jsonFilePath',
         |   'format' = 'json'
         |)""".stripMargin
    val colFields = Array(
      "discount"
    )
    val defaultDataTypes = Array(
      DataTypes.ARRAY(DataTypes.DECIMAL(12, 6))
    )
    val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
    val defaultDataTypesAsList = defaultDataTypes.toList.asJava
    val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
    val rowConverter = RowRowConverter.create(dataType)
    // Job
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    tableEnv.executeSql(readJSONTable)
    val ordersTable = tableEnv.from(tableName)
    val dataStream = tableEnv
      .toDataStream(ordersTable)
      .map(row => rowConverter.toInternal(row))
    dataStream.print()
    env.execute()
  }
}
I would hence like to know:
Why RowRowConverter is not working and how I can remedy it
Why RowRowConverter::toInternal is running twice for the same Row .. which may be the cause of that NullPointerException
If my method of instantiating and using the RowRowConverter is correct based on my code above.
Thank you!
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json

The first call of RowRowConverter::toInternal is an internal implementation detail: it makes a deep copy of the StreamRecord emitted by the table source, and it is independent of the converter in your map function. The reason for the NPE is that the RowRowConverter in the map function was never initialized by calling RowRowConverter::open. You can use a RichMapFunction instead and invoke RowRowConverter::open inside RichMapFunction::open.

Thank you to @renqs for the answer.
Here is the code, if anyone is interested.
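import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.table.data.RowData
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.types.Row
class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
    extends RichMapFunction[Row, RowData] {
  private final val rowRowConverter = RowRowConverter.create(fieldsDataType)
  // Initialize the converter once per task, before any record is mapped
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    rowRowConverter.open(this.getClass.getClassLoader)
  }
  override def map(row: Row): RowData =
    this.rowRowConverter.toInternal(row)
}
// at main function
// ... continue from previous
val dataStream = tableEnv
  .toDataStream(ordersTable)
  .map(new ConvertRowToRowDataMapFunction(dataType))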

Apache Flink: Cannot write out complex data type for Parquet

I am trying to write complex data types (e.g. Array, Map) into a Parquet File Format using Apache Flink. For my use-case, I am reading data from a JSON file, doing some internal data conversions and then attempting to use a FileSink.
However, it didn't work. This is strange because Parquet Documentation states the following:
Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.
I would expect that it was able to process nested data types properly, unless I am doing something wrong.
Here is the error message:
Caused by: java.lang.UnsupportedOperationException: Unsupported type: ARRAY<DECIMAL(12, 6)>
at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertToParquetType(ParquetSchemaConverter.java:615)
at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertToParquetType(ParquetSchemaConverter.java:553)
at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertToParquetMessageType(ParquetSchemaConverter.java:547)
at org.apache.flink.formats.parquet.row.ParquetRowDataBuilder$ParquetWriteSupport.<init>(ParquetRowDataBuilder.java:72)
at org.apache.flink.formats.parquet.row.ParquetRowDataBuilder$ParquetWriteSupport.<init>(ParquetRowDataBuilder.java:70)
at org.apache.flink.formats.parquet.row.ParquetRowDataBuilder.getWriteSupport(ParquetRowDataBuilder.java:67)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:563)
at org.apache.flink.formats.parquet.row.ParquetRowDataBuilder$FlinkParquetBuilder.createWriter(ParquetRowDataBuilder.java:135)
at org.apache.flink.formats.parquet.ParquetWriterFactory.create(ParquetWriterFactory.java:56)
Here is the JSON file that I'm using, stored under src/main/resources/NESTED.json
{"discount":[670237.997082,634079.372133,303534.821218]}
Source Code:
object ReadJsonNestedData {
  def main(args: Array[String]): Unit = {
    // setup
    val jsonResource = getClass.getResource("/NESTED.json")
    val jsonFilePath = jsonResource.getPath
    val tableName = "orders"
    val readJSONTable =
      s"""
         | CREATE TABLE $tableName (
         |   `discount` ARRAY<DECIMAL(12, 6)>
         | )WITH (
         |   'connector' = 'filesystem',
         |   'path' = '$jsonFilePath',
         |   'format' = 'json'
         |)""".stripMargin
    val colFields = Array("discount")
    val defaultDataTypes = Array(DataTypes.ARRAY(DataTypes.DECIMAL(12, 6)))
    val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
    val defaultDataTypesAsList = defaultDataTypes.toList.asJava
    val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
    // Job
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    tableEnv.executeSql(readJSONTable)
    val ordersTable = tableEnv.from(tableName)
    val dataStream = tableEnv
      .toDataStream(ordersTable)
      .map(new ConvertRowToRowDataMapFunction(dataType))
    val sink = FileSink
      .forBulkFormat(
        WriteParquetJobExample.outputBasePath,
        ParquetRowDataBuilder.createWriterFactory(rowType, Config.hadoopConfig, true)
      )
      .build()
    dataStream.sinkTo(sink)
    env.execute()
  }
}
class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
    extends RichMapFunction[Row, RowData] {
  private final val rowRowConverter = RowRowConverter.create(fieldsDataType)
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    rowRowConverter.open(this.getClass.getClassLoader)
  }
  override def map(row: Row): RowData =
    this.rowRowConverter.toInternal(row)
}
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json
Thank you in advance for the help!
Unfortunately MAP, ARRAY and ROW types are supported by Flink Parquet format only since Flink 1.15 (see FLINK-17782, not released yet). You may want to upgrade Flink version to 1.15 once it is released, or make your own implementation based on the latest code on master branch for now.

Create Spark UDF of a function that depends on other resources

I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords()
val tokens = tokenize("hello i am good", stopwords)
def tokenize(string: String, stopwords: List[String]): List[String] = {
  val splitted = string.split(" ")
  // Filter the split array with the stopwords,
  // then return the remaining items.
  splitted.filterNot(word => stopwords.contains(word)).toList
}
Now I want to make the tokenize method a UDF for Spark. I want to use it to create a new column in DataFrame transformations.
The simple UDFs I created before had no dependencies such as items that need to be read from a text file.
Can someone tell me how to do this kind of operation?
This is what I have tried, and it's working.
val moviesDF = Seq(
  ("kingdomofheaven"),
  ("enemyatthegates"),
  ("salesinfointheyearofdecember"),
).toDF("column_name")
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
  val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
  val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
  val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
  .................
  .................
}
It's giving me the expected output:
+----------------------------+--------------------------+
|columnname                  |tokenized                 |
+----------------------------+--------------------------+
|kingdomofheaven             |[kingdom, heaven]         |
|enemyatthegates             |[enemi, gate]             |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is: will it work when it's deployed? Currently I am running it locally. My main concern is that this function reads from a file to get information like stopwords, wordfreq, etc. to make the tokenization possible. Will registering it like this work properly?
At this point, if you deploy this code, Spark will try to serialize your DataProviderUtil, so you would need to mark that class as Serializable. Another possibility is to declare your logic inside an object. Functions inside objects are treated as static functions and are not serialized.
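The "logic inside an object" suggestion could look roughly like the sketch below. It is only an illustration: TokenizerResources and loadStopWords are hypothetical names, and the hard-coded stop-word set stands in for whatever file the real job reads.

```scala
// Hypothetical sketch: TokenizerResources and loadStopWords are illustrative names,
// not part of the question's code base.
object TokenizerResources {
  // Stand-in for reading the stop-word file. In a deployed job the file would
  // have to be readable from every executor, not only from the driver.
  private def loadStopWords(): Set[String] = Set("i", "am", "of", "the")

  // lazy val: initialized on first use in each JVM that touches this object.
  lazy val stopWords: Set[String] = loadStopWords()

  // Splits on single spaces and drops empty tokens and stop words.
  def tokenize(text: String): List[String] =
    text.split(" ").filter(_.nonEmpty).filterNot(word => stopWords.contains(word)).toList
}
```

In the Spark job this would presumably be registered as udf(TokenizerResources.tokenize _). Because the function lives in an object, the closure does not need to carry an instance along with it; each executor initializes the object on first use, which is also why the resource file must be available there.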

Getting java.lang.ArrayIndexOutOfBoundsException: 1 in spark when applying aggregate functions

I am trying to do some transformations on a data set. After reading the data set when performing df.show() operations, I am getting the rows listed in spark shell. But when I try to do df.count or any aggregate functions, I am getting
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")
import scala.util.control.Exception.catching
import java.sql.Timestamp
// Each helper returns None instead of throwing when the string cannot be parsed
implicit class StringImprovements(val s: String) {
  def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
  def toLongsafe = catching(classOf[NumberFormatException]) opt s.toLong
  def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}
case class Post(commentcount:Option[Int],lastactivitydate:Option[java.sql.Timestamp],ownerUserId:Option[Long],body:String,score:Option[Int],creattiondate:Option[java.sql.Timestamp],viewcount:Option[Int],title:String,tags:String,answerCount:Option[Int],acceptedanswerid:Option[Long],posttypeid:Option[Long],id:Long)
def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
    r(1).toTimeStampsafe,
    r(2).toLongsafe,
    r(3),
    r(4).toIntSafe,
    r(5).toTimeStampsafe,
    r(6).toIntSafe,
    r(7),
    r(8),
    r(9).toIntSafe,
    r(10).toLongsafe,
    r(11).toLongsafe,
    r(12).toLong)
}
val itpostsDFcase1 = itpostsrow.map{x=>stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
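Your function stringToPost() throws java.lang.ArrayIndexOutOfBoundsException if the text file contains an empty row or any row that yields fewer than 13 fields after the split. The message "1" means index 1 was out of bounds, that is, some row produced only a single field.
Due to Spark's lazy evaluation, such errors surface only when an action runs. df.show() typically evaluates only the first rows it needs to display (20 by default), whereas count has to parse every row, so it is the one that hits the bad row.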
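One way to guard against such rows is to validate the field count before indexing into the array. The sketch below is a minimal illustration in plain Scala; SafeSplit, ExpectedFields and splitFields are made-up names, and the count of 13 is taken from the Post case class in the question.

```scala
// Minimal sketch; SafeSplit, ExpectedFields and splitFields are illustrative names.
object SafeSplit {
  // Number of fields the Post case class in the question expects.
  val ExpectedFields = 13

  // Returns the fields only when the row has exactly the expected count.
  // limit = -1 keeps trailing empty fields, which String.split drops by default.
  def splitFields(row: String): Option[Array[String]] = {
    val fields = row.split("~", -1)
    if (fields.length == ExpectedFields) Some(fields) else None
  }
}
```

Rows for which splitFields returns None can be dropped or logged, for example via flatMap on the RDD, and the Post would then be built from the returned array rather than from a fresh split. Note that the default split("~") silently discards trailing empty fields, so a row whose last columns are empty can also come out shorter than expected.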

Spark UDF with Maxmind Geo Data

I'm trying to use the Maxmind snowplow library to pull out geo data on each IP that I have in a dataframe.
We are using Spark SQL (spark version 2.1.0) and I created an UDF in the following class:
class UdfDefinitions @Inject() extends Serializable with StrictLogging {
  sparkSession.sparkContext.addFile("s3n://s3-maxmind-db/latest/GeoIPCity.dat")
  val s3Config = configuration.databases.dataWarehouse.s3
  val lruCacheConst = 20000
  val ipLookups = IpLookups(geoFile = Some(SparkFiles.get(s3Config.geoIPFileName) ),
    ispFile = None, orgFile = None, domainFile = None, memCache = false, lruCache = lruCacheConst)
  def lookupIP(ip: String): LookupIPResult = {
    val loc: Option[IpLocation] = ipLookups.getFile.performLookups(ip)._1
    loc match {
      case None => LookupIPResult("", "", "")
      case Some(x) => LookupIPResult(Option(x.countryName).getOrElse(""),
        x.city.getOrElse(""), x.regionName.getOrElse(""))
    }
  }
  val lookupIPUDF: UserDefinedFunction = udf(lookupIP _)
}
The intention is to create the pointer to the file (ipLookups) outside the UDF and use it inside, so as not to open the file for each row. This gives a "Task not serializable" error, and when we call addFile inside the UDF instead, we get a "too many open files" error (on a large dataset; on a small dataset it does work).
This thread shows how to solve the problem using RDDs, but we would like to use Spark SQL: "using maxmind geoip in spark serialized".
Any thoughts?
Thanks
The problem here is that IpLookups is not Serializable. Yet it performs the lookups from a static file (from what I gathered), so you should be able to fix that. I would advise that you clone the repo and make IpLookups Serializable. Then, to make it work with Spark SQL, wrap everything in a class like you did. Then, in the main Spark job, you can write something as follows:
val IPResolver = new MySerializableIpResolver()
val resolveIP = udf((ip : String) => IPResolver.resolve(ip))
data.withColumn("Result", resolveIP($"IP"))
If you do not have that many distinct IP addresses, there is another solution: you could do everything in the driver.
val ipMap = data.select("IP").distinct.collect
  .map(/* calls to the non serializable IpLookups but that's ok, we are in the driver*/)
  .toMap
val resolveIP = udf((ip : String) => ipMap(ip))
data.withColumn("Result", resolveIP($"IP"))

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, output the matching values:
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)
  def parseLine(invCol: String) : RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }
  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map{r => parseLine(r)}
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define the UDF:
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  // Drop the trailing comma
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simple answer: why are we making it so complex? In this case we don't need a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
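val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
// split('|') takes a Char; split("|") would treat "|" as a regex alternation and split between every character
rawrdd.map(_.split('|'))
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
  .foreach(println)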
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063). Quoting that issue:
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
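For the question's code, this means parseLine must not call sc.parallelize inside rawData.map. A minimal sketch of one alternative, assuming the match values fit in an ordinary Scala Set, is to do the per-line comparison with plain collections; LineMatcher and matching are illustrative names, not part of the original code.

```scala
// Minimal sketch; LineMatcher and matching are illustrative names.
object LineMatcher {
  // Ordinary Scala set held in an object: no SparkContext or RDD is touched
  // inside the transformation that calls matching.
  val matchSet: Set[String] = Set("123", "200", "300")

  // Returns the ids of one input line that occur in matchSet, in input order.
  def matching(line: String): List[String] =
    line.split(",").map(_.trim).filter(id => matchSet.contains(id)).toList
}
```

In the job this could be used as rawData.map(LineMatcher.matching).filter(_.nonEmpty), which keeps all RDD operations on the driver side of the closure and leaves only plain collection code inside it.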