Scala Option types not recognized in Apache Flink Table API

I am working on building a Flink application which reads data from Kafka
topics, applies some transformations, and writes to an Iceberg table.
I read the data from the Kafka topic (which is in JSON) and use circe to decode
it into a Scala case class with Scala Option values in it.
All the transformations on the DataStream work fine.
The case class looks like below:
Event(app_name: Option[String], service_name: Option[String], ......)
But when I try to convert the stream to a table to write to the Iceberg table,
the Option columns are converted to RAW type, as shown below.
table.printSchema()
service_name RAW('scala.Option', '...'),
conversion_id RAW('scala.Option', '...'),
......
And the table write fails as below.
Query schema: [app_name: RAW('scala.Option', '...'), .........
Sink schema: [app_name: STRING, .......
Does the Flink Table API support Scala case classes with Option values?
I found at this documentation that it is supported in the DataStream API:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/types_serialization/#special-types
Is there a way to do this in the Table API?
Thanks in advance for the help.

I had the exact same issue of Option being parsed as RAW and found yet another workaround that might interest you:
TL;DR:
Instead of using .get, which returns an Option, and diligently declaring the return type to be Types.OPTION(Types.INSTANT) (which doesn't work), I use .getOrElse("key", null) and declare the type conventionally. The Table API then recognizes the column type, creates a nullable column, and interprets the null correctly. I can then filter those rows with IS NOT NULL.
Detailed example:
For me it starts with a custom map function in which I unpack data where certain fields could be missing:
class CustomExtractor extends RichMapFunction[MyInitialRecordClass, Row] {
  def map(in: MyInitialRecordClass): Row = {
    Row.ofKind(
      RowKind.INSERT,
      in._id,
      in.name,
      in.times.getOrElse("time", null) // This here did the trick instead of using .get and returning an Option
    )
  }
}
And then I use it like this, explicitly declaring the return type:
val stream: DataStream[Row] = data
  .map(new CustomExtractor())
  .returns(
    Types.ROW_NAMED(
      Array("id", "name", "time"),
      Types.STRING,
      Types.STRING,
      Types.INSTANT
    )
  )
val table = tableEnv.fromDataStream(stream)
table.printSchema()
// (
// `id` STRING,
// `name` STRING,
// `time` TIMESTAMP_LTZ(9),
// )
tableEnv.createTemporaryView("MyTable", table)
tableEnv
  .executeSql("""
    |SELECT
    | id, name, time
    |FROM MyTable""".stripMargin)
  .print()
// +----+------------------+-------------------+---------------------+
// | op | id | name | time |
// +----+------------------+-------------------+---------------------+
// | +I | foo | bar | <Null> |
// | +I | spam | ham | 2022-10-12 11:32:06 |
This was, at least for me, exactly what I wanted. I'm very much new to Flink and would be curious whether the pros here think this is workable or horribly hacky.
Using Scala 2.12.15 and Flink 1.15.2.

The type system of the Table API is more restrictive than that of the DataStream API. Unsupported classes are immediately treated as the black-box type RAW. This allows objects to still pass through the API, but they might not be supported by every connector.
From the exception, it looks like you declared the sink table with app_name: STRING, so I guess you are fine with a string representation. If that is the case, I would recommend implementing a user-defined function that performs the conversion to string.
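Roughly, such a function could look like the sketch below (OptionToString, the registration call, and the MyTable view name are placeholders of mine; this assumes Flink 1.15's ScalarFunction API and a StreamTableEnvironment called tableEnv):
import org.apache.flink.table.annotation.{DataTypeHint, InputGroup}
import org.apache.flink.table.functions.ScalarFunction

// Accepts any input (including the RAW scala.Option column) and unwraps it to a nullable STRING.
class OptionToString extends ScalarFunction {
  def eval(@DataTypeHint(inputGroup = InputGroup.ANY) value: AnyRef): String = value match {
    case Some(s) => s.toString
    case None    => null
    case other   => if (other == null) null else other.toString
  }
}

// Register the function and use it when selecting into the sink:
tableEnv.createTemporarySystemFunction("optionToString", classOf[OptionToString])
tableEnv.executeSql("SELECT optionToString(app_name) AS app_name FROM MyTable")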

Related

Getting description of BigQuery table

I am trying to get the description of a BigQuery table using the following:
val bigquery = BigQueryOptions.getDefaultInstance().getService()
val table = bigquery.getTable(tableId)
val tableDescription = table.getDescription()
What if no description was provided for the table I am reading from? The documentation says that getDescription returns a String, but what if there is no description? Will it return null? And if so, how do I avoid that, for example by using an Option, or in some other way?
The documentation indicates that the method has the signature:
public String getDescription()
It's reasonable to assume that it might return null, since we're dealing with a Java library. You can handle this safely by wrapping the expression in an Option:
val tableDescription: Option[String] = Option(table.getDescription())
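From there, downstream code can supply a fallback when no description is set, for example (the fallback text here is just a placeholder):
// Fall back to a placeholder string when the table has no description
val description: String = tableDescription.getOrElse("<no description provided>")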

How to enumerate over columns with tokio-postgres when the field types are unknown at compile-time?

I would like a generic function that converts the result of a SQL query to JSON. I would like to build a JSON string manually (or use an external library). For that to happen, I need to be able to enumerate the columns in a row dynamically.
let rows = client
    .query("select * from ExampleTable;", &[])
    .await?;
// This is how you read a string if you know the first column is a string type.
let thisValue: &str = rows[0].get(0);
Dynamic types are possible with Rust, but not with the tokio-postgres library API.
The row.get function of tokio-postgres is designed to require generic inference according to the source code
Without the right API, how can I enumerate rows and columns?
You need to enumerate the rows and columns; while doing so you can get the column reference and, from that, the PostgreSQL type. With the type information it is possible to write conditional logic that chooses different sub-functions to both: i) get the strongly typed value; and ii) convert it to a JSON value.
for (row_index, row) in rows.iter().enumerate() {
    for (col_index, column) in row.columns().iter().enumerate() {
        let col_type: String = column.type_().to_string();
        if col_type == "int4" { // i32
            let value: i32 = row.get(col_index);
            return value.to_string();
        } else if col_type == "text" {
            let value: &str = row.get(col_index);
            return value.to_string(); // TODO: escape characters
        }
        // TODO: more type support
        else {
            // TODO: raise error
        }
    }
}
Bonus tips for tokio-postgres code maintainers
Ideally, tokio-postgres would include a direct API that returns a dyn Any type. The internals of row.rs already use the database column type information to confirm that the supplied generic type is valid. Ideally a new API would use that internal column information directly with an improved FromSql API, but a simpler middle ground exists:
It would be possible to add an extra function layer in row.rs that uses the same column-type conditional logic as this answer and then leverages the existing get function. If a user such as myself needs to handle this kind of conditional logic, I also have to maintain that code whenever new types are handled by tokio-postgres; therefore, this kind of logic should be included inside the library, where such functionality can be better maintained.

Slick insert not working while trying to return inserted row

My goal here is to retrieve the Board entity upon insert. If the entity exists then I just want to return the existing object (which coincides with the argument of the add method). Otherwise I'd like to return the new row inserted in the database.
I am using Play 2.7 with Slick 3.2 and MySQL 5.7.
The implementation is based on this answer which is more than insightful.
Also from Essential Slick
exec(messages returning messages +=
Message("Dave", "So... what do we do now?"))
DAO code
@Singleton
class SlickDao @Inject()(db: Database, implicit val playDefaultContext: ExecutionContext) extends MyDao {
  override def add(board: Board): Future[Board] = {
    val insert = Boards
      .filter(b => b.id === board.id).exists.result.flatMap { exists =>
        if (!exists) Boards returning Boards += board
        else DBIO.successful(board) // no-op - return specified board
      }.transactionally
    db.run(insert)
  }
}
EDIT: I also tried replacing the += part with
Boards returning Boards.map(_.id) into { (b, boardId) => b.copy(id = boardId) } += board
and this does not work either.
The table definition is the following:
object Board {
  val Boards: TableQuery[BoardTable] = TableQuery[BoardTable]

  class BoardTable(tag: Tag) extends Table[BoardRow](tag, "BOARDS") {
    // columns
    def id = column[String]("ID", O.Length(128))
    def x = column[String]("X")
    def y = column[Option[Int]]("Y")
    // foreign key definitions
    .....
    // primary key definitions
    def pk = primaryKey("PK_BOARDS", (id, y))
    // default projection
    def * = (id, x, y).mapTo[BoardRow]
  }
}
I would expect there to be a new row in the table, but although the exists query gets executed
select exists(select `ID`, `X`, `Y`
from `BOARDS`
where ((`ID` = '92f10c23-2087-409a-9c4f-eb2d4d6c841f'));
and the result is false, there is no insert.
Nor is there any logging in the database showing that any insert statements were received (I am referring to the general_log file).
So first of all, the problem with the query execution was a mishandling of the futures that the DAO produced. I was assigning the insert statement to a future, but this future was never submitted to an execution context. My bad, even more so because I did not mention it in the description of the problem.
But once this was actually fixed, I could see the actual error in the logs of my application. The stack trace was the following:
slick.SlickException: This DBMS allows only a single column to be returned from an INSERT, and that column must be an AutoInc column.
at slick.jdbc.JdbcStatementBuilderComponent$JdbcCompiledInsert.buildReturnColumns(JdbcStatementBuilderComponent.scala:67)
at slick.jdbc.JdbcActionComponent$ReturningInsertActionComposerImpl.x$17$lzycompute(JdbcActionComponent.scala:659)
at slick.jdbc.JdbcActionComponent$ReturningInsertActionComposerImpl.x$17(JdbcActionComponent.scala:659)
at slick.jdbc.JdbcActionComponent$ReturningInsertActionComposerImpl.keyColumns$lzycompute(JdbcActionComponent.scala:659)
at slick.jdbc.JdbcActionComponent$ReturningInsertActionComposerImpl.keyColumns(JdbcActionComponent.scala:659)
So this is a MySQL thing at its core. I had to redesign my schema in order to make this retrieval after insert possible. The redesign includes the introduction of a dedicated primary key (completely unrelated to the business logic), which is also an AutoInc column, as the stack trace prescribes.
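For reference, the redesigned pattern would look roughly like this (the PK column, the adjusted BoardRow, and insertReturning are illustrative names, not my actual schema):
import slick.jdbc.MySQLProfile.api._

// Illustrative row type carrying the new surrogate key
case class BoardRow(pk: Long, id: String, x: String, y: Option[Int])

class BoardTable(tag: Tag) extends Table[BoardRow](tag, "BOARDS") {
  def pk = column[Long]("PK", O.PrimaryKey, O.AutoInc) // dedicated AutoInc key, unrelated to business logic
  def id = column[String]("ID", O.Length(128))
  def x = column[String]("X")
  def y = column[Option[Int]]("Y")
  def * = (pk, id, x, y).mapTo[BoardRow]
}

val Boards = TableQuery[BoardTable]

// MySQL can return a single AutoInc column on insert, so this works:
def insertReturning(board: BoardRow): DBIO[BoardRow] =
  Boards returning Boards.map(_.pk) into ((row, newPk) => row.copy(pk = newPk)) += board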
In the end the solution became too involved, so instead I decided to use the actual argument of the add method as the return value if the insert was successful. The implementation of the add method ended up being something like this:
override def add(board: Board): Future[Board] = {
  db.run(Boards.insertOrUpdate(board).map(_ => board))
}
with some appropriate Future error handling in the controller that invokes the underlying repo.
If you're lucky enough to not be using MySQL with Slick, I suppose you might have been able to do this without a dedicated AutoInc primary key. If not, then I suppose this is a one-way road.

Typing a "camel caser" in Flow: variance issues?

try flow link
I've been messing around with typing a "camel caser" function (one that consumes JSON and camel cases its keys). I've run into a few issues along the way, and I'm curious if y'all have any suggestions.
A camel caser never changes the shape of its argument, so I would like to preserve the type of whatever I pass in; ideally, calling camelize on an array of numbers would return another array of numbers, etc.
I've started with the following:
type JSON = null | string | number | boolean | { [string]: JSON } | JSON[]
function camelize<J: JSON>(json: J): J {
throw "just typecheck please"
}
This works great for the simple cases (null, string, number, and boolean), but things don't quite work for JSON dictionaries or arrays. For example:
const dictionary: { [string]: number } = { key: 123 }
const camelizedDictionary = camelize(dictionary)
will fail with a type error. A similar issue will come up if you pass in a value of, say, type number[]. I think I understand the issue: arrays and dictionaries are mutable, and hence invariant in the type of the values they point to; an array of numbers is not a subtype of JSON[], so Flow complains. If arrays and dictionaries were covariant though, I believe this approach would work.
Given that they're not covariant though, do y'all have any suggestions for how I should think about this?
Flow now supports $ReadOnly and $ReadOnlyArray, so a different approach would be to define the JSON type as:
type JSON =
| null
| string
| number
| boolean
| $ReadOnly<{ [string]: JSON }>
| $ReadOnlyArray<JSON>
This solves one of the issues noted above because $ReadOnlyArray<number> is a subtype of $ReadOnlyArray<JSON>.
This might not work depending on the implementation of the camelize function, since it might modify the keys to be camel case. But it's good to know about the read-only utilities in Flow, since they are powerful and allow for more concise and, possibly, more correct function types.
Use property variance to solve your problem with dictionaries:
type JSON = null | string | number | boolean | { +[string]: JSON } | JSON[]
https://flowtype.org/blog/2016/10/04/Property-Variance.html
As for your problem with Arrays, as you've pointed out the issue is with mutability. Unfortunately Array<number> is not a subtype of Array<JSON>. I think the only way to get what you want is to explicitly enumerate all of the allowed Array types:
type JSON = null | string | number | boolean | { +[string]: JSON } | Array<JSON> | Array<number>;
I've added just Array<number> here to make my point. Obviously this is burdensome, especially if you also want to include arbitrary mixes of JSON elements (like Array<string | number | null>). But it will hopefully address the common issues.
(I've also changed it to the array syntax with which I am more familiar but there should be no difference in functionality).
There has been talk of adding a read-only version of Array which would be analogous to covariant object types, and I believe it would solve your problem here. But so far nothing has really come of it.
Complete tryflow based on the one you gave.

How to make Hive UDFs defined using the spark Interface show up in the SHOW FUNCTIONS list?

I am trying to register a Hive UDF. Compare the following two approaches:
Using the Spark interface
sqlContext.registerFunction( "foobar", ( s : String ) => "foobar" )
Using the Hive interface
@Description(
  name = "foobar",
  value = "_FUNC_(str) - foobars the string",
  extended = "Example:\n" +
    " foobar('baz') == 'foobar' "
)
class FooBarFunction extends UDF {
  def evaluate(s: Text): Text = new Text("foobar")
}
sqlContext.sql("CREATE TEMPORARY FUNCTION foobar AS 'my.package.path.FooBarFunction'")
I would like to use the Spark interface, which is much more convenient, especially when it comes to defining aggregations.
However, there are two (related) things that I could not figure out:
UDFs defined using "CREATE TEMPORARY FUNCTION" show up in the list generated by "SHOW FUNCTIONS". UDFs defined using the Spark interface don't. Can I register the function somehow so that it does show up in the list?
I did not find a way to add a description to the function registered using the Spark interface, like I did with the @Description annotation. Is there any way to add this documentation?