How to output nested Row from Beam SQL (SqlTransform)? - apache-beam

I want to output a Row containing a nested Row from Beam SQL (SqlTransform), but I'm failing to do so.
Questions:
What is the proper way to output a Row with a nested Row from SqlTransform? (The Row type is described in the docs, so I believe it's supported.)
If this is a bug/missing feature, is the problem in Beam itself, or is it runner-dependent? (I'm currently using DirectRunner, but plan to use DataflowRunner in the future.)
Version info:
OS: macOS 10.15.7 (Catalina)
Java: 11.0.11 (AdoptOpenJDK)
Beam SDK: 2.32.0
Here's what I've tried, with no luck.
With Calcite dialect
SELECT ROW(foo, bar) as my_nested_row FROM PCOLLECTION
I was expecting this to output a row with the following schema
Field{name=my_nested_row, description=, type=ROW<foo STRING NOT NULL, bar INT64 NOT NULL> NOT NULL, options={{}}}
but actually the row is split into scalar fields like
Field{name=my_nested_row$$0, description=, type=STRING NOT NULL, options={{}}}
Field{name=my_nested_row$$1, description=, type=INT64 NOT NULL, options={{}}}
With ZetaSQL dialect
SELECT STRUCT(foo, bar) as my_nested_row FROM PCOLLECTION
I got an error
java.lang.UnsupportedOperationException: Does not support expr node kind RESOLVED_MAKE_STRUCT
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:363)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:323)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromComputedColumnWithFieldList (ExpressionConverter.java:375)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.retrieveRexNode (ExpressionConverter.java:203)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:45)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:29)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertNode (QueryStatementConverter.java:102)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convert (QueryStatementConverter.java:89)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertRootQuery (QueryStatementConverter.java:55)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel (ZetaSQLPlannerImpl.java:98)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal (ZetaSQLQueryPlanner.java:197)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel (ZetaSQLQueryPlanner.java:185)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery (BeamSqlEnv.java:111)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:482)
at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:363)
at dev.tmshn.playbeam.Main.main (Main.java:29)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:829)

Unfortunately Beam SQL does not yet support nested rows, mainly due to a lack of support in Calcite (and therefore a corresponding lack of support for the ZetaSQL implementation). See this similar question focused on Dataflow.
On the bright side, the Jira issue tracking this support seems to be resolved for 2.34.0, so proper support is likely upcoming.
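In the meantime, one workaround is to keep the SqlTransform output flat and repack it into a nested Row in a follow-up transform. The sketch below illustrates that idea with the schema from the question; the class name, sample data, and the MapElements repacking step are my own assumptions, not an official recipe.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class NestedRowWorkaround {
  public static void main(String[] args) {
    // Flat schema produced by the SQL projection, as in the question.
    Schema flatSchema = Schema.builder()
        .addStringField("foo")
        .addInt64Field("bar")
        .build();
    // Desired shape: a single ROW field wrapping foo and bar.
    Schema nestedSchema = Schema.builder()
        .addRowField("my_nested_row", flatSchema)
        .build();

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical sample input so the sketch is self-contained.
    PCollection<Row> input = p.apply(
        Create.of(Row.withSchema(flatSchema).addValues("hello", 42L).build())
            .withRowSchema(flatSchema));

    // Keep the SQL projection flat...
    PCollection<Row> flat = input.apply(SqlTransform.query("SELECT foo, bar FROM PCOLLECTION"));

    // ...and repack the flat fields into a nested Row afterwards.
    PCollection<Row> nested = flat
        .apply(MapElements.into(TypeDescriptor.of(Row.class))
            .via((Row r) -> Row.withSchema(nestedSchema)
                .addValue(Row.withSchema(flatSchema)
                    .addValues(r.getString("foo"), r.getInt64("bar"))
                    .build())
                .build()))
        .setRowSchema(nestedSchema);

    p.run().waitUntilFinish();
  }
}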

Related

Schema capitalization(uppercase) problem when reading with Spark

Using Scala here:
val df = spark.read.format("jdbc").
  option("url", "<host url>").
  option("dbtable", "UPPERCASE_SCHEMA.table_name").
  option("user", "postgres").
  option("password", "<password>").
  option("numPartitions", 50).
  option("fetchsize", 20).
  load()
The database I'm calling with the above code has many schemas, and they are all in uppercase letters (UPPERCASE_SCHEMA).
No matter how I try to denote that the schema is in all caps, Spark converts it to lowercase, which fails against the actual DB.
I've tried making it a variable, explicitly denoting that it is all uppercase, etc., in multiple languages, but no luck.
Would anyone know a workaround?
When I went into the actual DB (Postgres) and temporarily changed the schema to all lowercase, it worked absolutely fine.
Try setting spark.sql.caseSensitive to true (it is false by default):
spark.conf.set("spark.sql.caseSensitive", true)
You can see in the source code its definition:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833
In addition, you can see in the JDBCWriteSuite how it affects the JDBC connector:
https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("sample.csv")
test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
Type | Provider Id
-----|------------
A    | asd
A    | bsd
A    | csd
B    | rrr
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered if it was because the column name has a space, so I tried using backticks and removing/replacing the space, but nothing seemed to work. It also ran fine with Spark 3.x libraries but doesn't work with Spark 2.1.x (and I need to use 2.1.x).
Additional: I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error was the opposite: Provider Id shows, but Type now throws the exception.
Any suggestions?
test.printSchema()
You can use the result of printSchema() to see exactly how Spark read your column names in, then use those exact names in your code.

Window function in pyspark - strange behavior

I'm using PySpark window functions extensively in my code, but they don't seem to be working properly.
I'm getting the correct results only for the last record (by the ORDER BY column) within each partition.
The documentation says the API is experimental; can we use it in production systems?
http://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.Window
Sample code:
from pyspark.sql import Window
from pyspark.sql.functions import max, when
invWindow = Window.partitionBy(masterDrDF["ResId"], masterDrDF["vrsn_strt_dts"]).orderBy(masterDrDF["vrsn_strt_dts"]).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
max(when(invDetDF["InvoiceItemType"].like('ABD%'), 1).otherwise(0)).over(invWindow).alias("ABD_PKG_IN")

Use consistency level in Phantom-dsl and Cassandra

Currently using --
cqlsh> show version
[cqlsh 4.1.1 | Cassandra 2.0.17 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Using phantom-dsl 1.12.2, Scala 2.10.
I can't figure out how to set consistency levels on queries.
There are predefined functions insert() and select() as part of CassandraTable. How can I pass the consistency level to them?
insert.value(....).consistencyLevel_=(ConsistencyLevel.QUORUM)
does not work and fails with an error (probably because this appends "USING CONSISTENCY QUORUM" at the end of the query). Here's the actual exception I get:
com.datastax.driver.core.exceptions.SyntaxError: line 1:424 no viable alternative at input 'CONSISTENCY'
at com.datastax.driver.core.Responses$Error.asException(Responses.java:122) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:120) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:186) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:754) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:576) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
I see from the documentation and discussion on this pull request that I could do a setConsistencyLevel(ConsistencyLevel.QUORUM) on a SimpleStatement, but I would prefer not to rewrite all the different insert statements.
UPDATE
Just to close the loop on this issue. I worked around this by creating a custom InsertQuery and then using that instead of the one provided by final def insert in CassandraTable
def qinsert()(implicit keySpace: KeySpace) = {
  val table = this.asInstanceOf[T]
  new InsertQuery[T, M, Unspecified](table, CQLQuery("INSERT into keyspace.tablename", consistencyLevel = ConsistencyLevel.QUORUM))
}
First of all there is no setValue method inside phantom and the API method you are using is missing an = at the end.
The correct structure is:
Table.insert
  .value(_.name, "test")
  .consistencyLevel_=(ConsistencyLevel.Quorum)
As you are on Stack Overflow, an error stack trace and specific details of what doesn't work are generally preferable to "does not work".
I have finally figured out how to properly set the consistency level using phantom-dsl.
Using a statement you can do the following:
statement.setConsistencyLevel(ConsistencyLevel.QUORUM)
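For reference, that call lives on the underlying DataStax Java driver's Statement rather than in the CQL text. Below is a minimal standalone sketch against the 2.x Java driver; the contact point, keyspace, and table name are placeholders I made up.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class QuorumStatementExample {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("my_keyspace")) {
      // Consistency is set on the Statement object rather than appended to the
      // CQL text, which is why "USING CONSISTENCY ..." in the query string fails.
      Statement select = new SimpleStatement("SELECT * FROM my_table")
          .setConsistencyLevel(ConsistencyLevel.QUORUM);
      ResultSet rows = session.execute(select);
      rows.forEach(System.out::println);
    }
  }
}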
Also, take a look at the test project I've been working on to help people like you with phantom-dsl examples:
https://github.com/iamthiago/cassandra-phantom

MyBatis error: Caused by: java.sql.SQLSyntaxErrorException: unexpected token: < required: (

Why do I get this error from MyBatis 3?
Caused by: java.sql.SQLSyntaxErrorException: unexpected token: < required: (
This is my SQL:
SELECT * FROM GC0101.AGENT_POOL_CLIENT_ASSIGNMENT WHERE GO_CD = ?
AND ASSIGNMENT_STATUS_CD IN <foreach item="item" index="index"
collection="assignmentStatusCd" open="(" separator="," close=")">
? </foreach>
created from this query:
#Select("SELECT * FROM GC0101.AGENT_POOL_CLIENT_ASSIGNMENT WHERE GO_CD =
#{generalOfficeCd, jdbcType=CHAR} AND ASSIGNMENT_STATUS_CD IN " +
"<foreach item=\"item\" index=\"index\" collection=\"assignmentStatusCd\"
open=\"(\" separator=\",\" close=\")\"%gt; #{item, jdbcType=CHAR} %lt;/foreach%gt;")
If that is indeed your annotation value, then it will not work: you are trying to include XML inside the annotation's only value, which is not parsed for XML elements; it is taken as-is.
You can find a possible solution in this post: How to use Annotations with iBatis (myBatis) for an IN query? (see the @Select("<script>...") example in LordOfThePigs' answer, which works with the latest MyBatis version) or try a @SelectProvider annotation. See the MyBatis 3 API docs for more details, and this note as advice:
Java Annotations are unfortunately limited in their expressiveness and flexibility. Despite a lot of time spent in investigation, design and trials, the most powerful MyBatis mappings simply cannot be built with Annotations – without getting ridiculous that is.
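To make the <script> approach from the linked answer concrete, here is a minimal sketch adapted to the query in the question; the mapper interface name, method name, and Map-based return type are assumptions for illustration.
import java.util.List;
import java.util.Map;

import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

public interface AgentPoolClientAssignmentMapper {

  // Wrapping the value in <script> makes MyBatis parse the annotation text as
  // dynamic XML, so <foreach> is expanded into (?, ?, ...) instead of being
  // sent to the database literally.
  @Select("<script>"
      + "SELECT * FROM GC0101.AGENT_POOL_CLIENT_ASSIGNMENT"
      + " WHERE GO_CD = #{generalOfficeCd, jdbcType=CHAR}"
      + " AND ASSIGNMENT_STATUS_CD IN"
      + " <foreach item='item' index='index' collection='assignmentStatusCd'"
      + " open='(' separator=',' close=')'>"
      + " #{item, jdbcType=CHAR}"
      + " </foreach>"
      + "</script>")
  List<Map<String, Object>> selectAssignments(
      @Param("generalOfficeCd") String generalOfficeCd,
      @Param("assignmentStatusCd") List<String> assignmentStatusCd);
}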