How do I set "for fetch only" when querying ibm db2 using the jdbc driver from spark? - pyspark

I have some code that queries a DB2 database; it works if I don't include "for fetch only" but returns an error when I do. I was wondering whether it's already being applied, or how I could set it.
connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"
df = (spark
.read
.format("jdbc")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("url",connection_url)
.option("query",query)
.load())
return(df)
Error when I include for fetch only:
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=for;
and the detailed traceback is:
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
162 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
163 else:
--> 164 return self._df(self._jreader.load())
165
166 def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4192.load.
: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=for;
;), DRIVER=4.25.13
at com.ibm.db2.jcc.am.b6.a(b6.java:810)
at com.ibm.db2.jcc.am.b6.a(b6.java:66)
at com.ibm.db2.jcc.am.b6.a(b6.java:140)
at com.ibm.db2.jcc.am.k3.c(k3.java:2824)
at com.ibm.db2.jcc.am.k3.d(k3.java:2808)
at com.ibm.db2.jcc.am.k3.a(k3.java:2234)
at com.ibm.db2.jcc.am.k4.a(k4.java:8242)
at com.ibm.db2.jcc.t4.ab.i(ab.java:206)
at com.ibm.db2.jcc.t4.ab.b(ab.java:96)
at com.ibm.db2.jcc.t4.p.a(p.java:32)
at com.ibm.db2.jcc.t4.av.i(av.java:150)
at com.ibm.db2.jcc.am.k3.al(k3.java:2203)
at com.ibm.db2.jcc.am.k4.bq(k4.java:3730)
at com.ibm.db2.jcc.am.k4.a(k4.java:4609)
at com.ibm.db2.jcc.am.k4.b(k4.java:4182)
at com.ibm.db2.jcc.am.k4.bd(k4.java:780)
at com.ibm.db2.jcc.am.k4.executeQuery(k4.java:745)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:222)
at sun.reflect.GeneratedMethodAccessor704.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
I've searched IBM's documentation and Stack Overflow using every permutation I can think of.
I've read documentation about setting the isolation level, since I also get a failure when running queries with "with ur", and was thinking that if I could find out why that fails, I'd understand why "for fetch only" fails (there's an answer here), but it's still clear as mud because I couldn't use it to find an analogous solution for "for fetch only".
I've looked at the DB2 documentation on IBM's website and searched Stack Overflow, but this is eluding me.
Edit: queries that do and don't run
Runs in DbVisualizer and PySpark:
select id_number
from myschema.mytable
FETCH FIRST 10 ROWS ONLY
another one:
select id_number
from myschema.mytable
Runs in DbVisualizer but not in PySpark:
select id_number
from myschema.mytable
FETCH FIRST 10 ROWS ONLY FOR FETCH ONLY
another one:
select id_number
from myschema.mytable
FOR FETCH ONLY
Edit 2:
An example: I run this code:
connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"
df = (spark
.read
.format("jdbc")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("url",connection_url)
.option("query","""
select
id_number
from
myschema.mytable
FOR FETCH ONLY
""")
.load())
return(df)
It doesn't work. Then I run this code:
connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"
df = (spark
.read
.format("jdbc")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("url",connection_url)
.option("query","""
select
id_number
from
myschema.mytable
-- FOR FETCH ONLY
""")
.load())
return(df)
It does work. I also went into DbVisualizer and verified that both versions of the query run there, so as far as I can tell it isn't a syntax error.
DbVisualizer reports the database major version as 12 and minor version as 1, and I believe it's DB2 for z/OS. I'm using JDBC driver version 4.25.13 in both PySpark and DbVisualizer, downloaded from Maven here.
Edit 3:
This query runs fine in DbVisualizer but fails in PySpark:
select id_number
from myschema.mytable
FOR READ ONLY

Alright, I found out what's happening. tl;dr: Spark already does it.
The documentation here states:
A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. As an example, spark will issue a query of the following form to the JDBC Source.
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
I'm fairly certain the relevant code is here:
val sqlText = options.prepareQuery +
s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
s" $myWhereClause $getGroupByClause $getOrderByClause $myLimitClause $myOffsetClause"
So FOR FETCH ONLY ends up inside the subquery, which DB2 does not allow.
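To make that concrete, here is a rough Python illustration of what DB2 ends up receiving (the column list and alias name below are just stand-ins; Spark generates its own):
# Illustration only: Spark parenthesizes the "query" option and selects from it
# under a generated alias, roughly like this.
user_query = "select id_number from myschema.mytable FOR FETCH ONLY"
wrapped = f"SELECT id_number FROM ({user_query}) spark_gen_alias"
print(wrapped)
# DB2 rejects the wrapped statement with SQLCODE=-104 because FOR FETCH ONLY is
# only valid at the end of the outermost select, not inside a nested table expression.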
Fortunately, it looks like the CONCUR_READ_ONLY JDBC option is already set, which is equivalent to FOR READ ONLY per the documentation here:
JDBC setting             | Db2® cursor setting                                            | IBM Informix® cursor setting
CONCUR_READ_ONLY         | FOR READ ONLY                                                  | FOR READ ONLY
CONCUR_UPDATABLE         | FOR UPDATE                                                     | FOR UPDATE
HOLD_CURSORS_OVER_COMMIT | WITH HOLD                                                      | WITH HOLD
TYPE_FORWARD_ONLY        | SCROLL not specified                                           | SCROLL not specified
TYPE_SCROLL_INSENSITIVE  | INSENSITIVE SCROLL                                             | SCROLL
TYPE_SCROLL_SENSITIVE    | SENSITIVE STATIC, SENSITIVE DYNAMIC, or ASENSITIVE, depending on the cursorSensitivity Connection and DataSource property | Not supported
The relevant code in Spark is:
stmt = conn.prepareStatement(sqlText,
ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
from here
As a side note, even if it weren't specified explicitly in the code above, CONCUR_READ_ONLY is the default concurrency for a ResultSet in java.sql:
Concurrency                | Description
ResultSet.CONCUR_READ_ONLY | Creates a read-only result set. This is the default.
ResultSet.CONCUR_UPDATABLE | Creates an updateable result set.
source
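So the practical takeaway is simply to drop the clause and let Spark's read-only, forward-only cursor do the work. A minimal sketch, reusing the connection variables from the question:
connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"
df = (spark
      .read
      .format("jdbc")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("url", connection_url)
      # No FOR FETCH ONLY here: Spark already opens the statement as
      # TYPE_FORWARD_ONLY / CONCUR_READ_ONLY, which Db2 treats as FOR READ ONLY.
      .option("query", "select id_number from myschema.mytable")
      .load())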

Related

Spark doesn't recognize the column name in a SQL query but can output it to a dataset

I'm applying the SQL query like this:
s"SELECT * FROM my_table_joined WHERE (timestamp > '2022-01-23' and writetime is not null and acceptTimestamp is not null)"
and I'm getting the following error message:
warning: there was one deprecation warning (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation'
org.postgresql.util.PSQLException: ERROR: column "accepttimestamp" does not exist
Hint: Perhaps you meant to reference the column "mf_joined.acceptTimestamp".
Position: 103
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2497)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2233)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:310)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:446)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:370)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:149)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:108)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at $$$e76229fa87b6865de321c5274e52c2f9$$$$w$getDFFromJdbcSource(<console>:1133)
... 326 elided
If I omit acceptTimestamp like this:
s"SELECT * FROM my_table_joined WHERE (timestamp > '2022-01-23' and writetime is not null)"
I'm getting the data as below:
+-------------------+----------+----+------------------+-----------------+---+-----+------+----------+---------------+-------+-----------------------+----------+---------+-------------+------------+---------------+---------+-----+-------------------+-----------------------+---------------+--------------+-------------+-------------------+-------------------+---+---+------------------+-----+----+----+------------------+---+
|timestamp |flags |type|lon |lat |alt|speed|course|satellites|digital_twin_id|unit_id|unit_ts |name |unit_type|measure_units|access_level|uid |placement|stale|start |writetime |acceptTimestamp|delayWindowEnd|DiffInSeconds|time |hour |max|min|mean |count|max2|min2|mean2 |rnb|
+-------------------+----------+----+------------------+-----------------+---+-----+------+----------+---------------+-------+-----------------------+----------+---------+-------------+------------+---------------+---------+-----+-------------------+-----------------------+---------------+--------------+-------------+-------------------+-------------------+---+---+------------------+-----+----+----+------------------+---+
Please note acceptTimestamp is here!
So how should I handle this column in my query to make it taken into account?
From the exception, it seems this is related to Postgres, not Spark. If you look at the error message you got, the column name is folded to lowercase (accepttimestamp), whereas in your query the T is uppercase (acceptTimestamp).
To make the column name case-sensitive for Postgres, you need to use double-quotes. Try this:
val query = s"""SELECT * FROM my_table_joined
WHERE timestamp > '2022-01-23'
and writetime is not null
and "acceptTimestamp" is not null"""

Databricks - "Alter Table Owner to userid" is not working with Spark.sql in Pyspark notebook

I am trying to run the command below with Spark SQL in my PySpark notebook (Databricks), and it is getting an error, but the same command works in a SQL notebook.
ALTER TABLE sales.product OWNER TO `john001#mycomp.com`;
PySpark code below:
source_sql = "ALTER TABLE sales.product OWNER TO `john001#mycomp.com`;"
spark.Sql(source_sql)
Running the above through spark.sql throws an error, as shown below:
----> 7 spark.sql(source_sql)
/databricks/spark/python/pyspark/sql/session.py in sql(self, sqlQuery)
707 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
708 """
--> 709 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
710
711 #since(2.0)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
But if I run the same command in a %sql cell, it works.
Can someone suggest how to run the same thing with spark.sql("ALTER TABLE sales.product OWNER TO john001#mycomp.com;")?
Spark SQL's ALTER TABLE command does not have the OWNER TO option. This is what's being executed in your pyspark code, and why it fails.
Databricks' ALTER TABLE command does have this option; it is a different SQL dialect. This is what's being executed in your sql notebook, and why it succeeds.

Formatting SQL query

I am trying to make a query for my PostgreSQL database.
I think the format of my query is wrong and I can't seem to get it to work. I have posted my code below:
query = cur.execute('''SELECT "KINASE_NAME" FROM public."Phosphosite_table"
    WHERE "GENE_NAME" LIKE %(genename)s AND "RESIDUE" LIKE %(location)s''')
The aim is to return the kinase name if the gene name and location match.
My error message appears as follows:
ProgrammingError Traceback (most recent call last)
<ipython-input-33-9eae43b913d6> in <module>()
35 cur = connection.cursor()
36
---> 37 query = cur.execute('SELECT "KINASE_NAME" FROM public."Phosphosite_table" WHERE "GENE_NAME" LIKE%(genename)s AND "RESIDUE" LIKE %(location)s')
Thanks!
Connor
Don't use string operations to build SQL queries. Use the proper %s syntax.
genname = "foo"
location = "bar"
cur.execute("SELECT ... LIKE %s and ... LIKE %s", (genname, location))
Do not put quotes around the placeholders; the quoting will be done by the DB API library.
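For completeness, the named-placeholder style from the question also works, as long as the values are passed as a dict in the second argument. A minimal sketch, assuming psycopg2 and a placeholder connection string:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()
cur.execute(
    'SELECT "KINASE_NAME" FROM public."Phosphosite_table" '
    'WHERE "GENE_NAME" LIKE %(genename)s AND "RESIDUE" LIKE %(location)s',
    {"genename": "foo", "location": "bar"},
)
rows = cur.fetchall()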

How to change Postgres JDBC driver properties to change return class on count function?

I am running a Jasper report (via a jrxml), connecting to and reading from a Postgres database.
The SQL returns a value from a count function, which then causes a java.lang.ClassCastException when writing that value to the Jasper report (via an XML). Can I amend the JDBC driver properties to handle this (rather than amend the SQL)?
The line in the SQL that caused the error was
COALESCE(B.GP_COUNT,0) as GP_COUNT
If I amend the line that populates GP_COUNT using a CAST expression, then this works OK in the XML:
CAST(COUNT(DISTINCT PD_CDE) AS INT4) AS GP_COUNT
I am looking for a solution that avoids changes to the XMLs and jrxmls (as we have hundreds of reports to convert from DB2 to Postgres).
Any help appreciated; I am not a Java person, so I apologise in advance.
The PostgreSQL JDBC driver does not return a string, but a BIGINT, as the result of the count aggregate function.
This Java code:
Class.forName("org.postgresql.Driver");
java.sql.Connection conn = java.sql.DriverManager.getConnection(
"jdbc:postgresql://127.0.0.1/mydb?user=myuser"
);
java.sql.Statement stmt = conn.createStatement();
java.sql.ResultSet rs = stmt.executeQuery("SELECT count(*) FROM pg_class");
System.out.println("Type of count(*) is a BIGINT: "
+ (rs.getMetaData().getColumnType(1) == java.sql.Types.BIGINT)
);
rs.close();
stmt.close();
conn.close();
produces:
Type of count(*) is a BIGINT: true
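For what it's worth, the same check can be done from Python; a sketch assuming psycopg2 and a placeholder connection string (20 is the type OID of int8, i.e. bigint):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT count(*) FROM pg_class")
# cursor.description exposes the server-side type OID of each result column.
print(cur.description[0].type_code == 20)  # prints True: count(*) is int8/bigint
cur.close()
conn.close()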

Spark: strange NullPointerException when extracting data from PostgreSQL

I'm working with PostgreSQL 9.6 and Spark 2.0.0
I want to create a DataFrame from a PostgreSQL table, as follows:
val query =
"""(
SELECT events.event_facebook_id,
places.placeid, places.likes as placelikes,
artists.facebookId, artists.likes as artistlikes
FROM events
LEFT JOIN eventsplaces on eventsplaces.event_id = events.event_facebook_id
LEFT JOIN places on eventsplaces.event_id = places.facebookid
LEFT JOIN eventsartists on eventsartists.event_id = events.event_facebook_id
LEFT JOIN artists on eventsartists.artistid = artists.facebookid) df"""
The query is valid (if I run it in psql, I don't get any error), but with Spark, if I execute the following code, I get a NullPointerException:
sqlContext
  .read
  .format("jdbc")
  .options(Map(
    "url" -> claudeDatabaseUrl,
    "dbtable" -> query))
  .load()
  .show()
If, in the query, I replace artists.facebookId with another column such as artists.description (which, unlike facebookId, can be null), the exception disappears.
I find this very strange. Any idea?
You have two different spellings of facebookId in your query: artists.facebook[I]d in the SELECT list and artists.facebook[i]d in the last JOIN condition.
Please try to use the correct one consistently.
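A minimal sketch of what the corrected query string could look like, with the column spelled the same way in both places (whether the lowercase spelling is the right one depends on how the table was actually created):
query = """(
SELECT events.event_facebook_id,
       places.placeid, places.likes AS placelikes,
       artists.facebookid, artists.likes AS artistlikes
FROM events
LEFT JOIN eventsplaces ON eventsplaces.event_id = events.event_facebook_id
LEFT JOIN places ON eventsplaces.event_id = places.facebookid
LEFT JOIN eventsartists ON eventsartists.event_id = events.event_facebook_id
LEFT JOIN artists ON eventsartists.artistid = artists.facebookid) df"""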