Connect with postgresql using the V2 Spark SQL jdbc datasource

Connect with postgresql using the V2 Spark SQL jdbc datasource - scala

I am trying to create a connection with postgresql using the Spark SQL jdbc datasource. I am using spark 3.2.1
I have started the spark shell including the postgres driver and I can connect and execute queries without problems. I am using this statement:
val df = spark.read.format("jdbc").option("url", "jdbc:postgresql://host:port/").option("driver", "org.postgresql.Driver").option("dbtable", "test").option("user", "postgres").option("password", "*******").option("pushDownAggregate",true).load()
I am adding the pushDownAggregate option because I would like the aggregations are delegated to the source. But for some reason this is not happening.
If I run the df.queryExecution statement in my case it returns the following result:
res7: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
Relation [typesmallint#183,typeinteger#184,typebigint#185L,typenumeric#186,typereal#187,typedoubleprecision#188,typechar#189,typevarchar#190,typetext#191,typebytea#192,typetimestamp#193,typetimestamptz#194,typedate#195,typetime#196,typetimetz#197,typeinterval#198,typeboolean#199,typecidr#200,typeinet#201,typemacaddr#202,typemacaddr8#203,typebit#204,typepoint#205,typeline#206,... 7 more fields] JDBCRelation(test) [numPartitions=1]
== Analyzed Logical Plan ==
typesmallint: smallint, typeinteger: int, typebigint: bigint, typenumeric: decimal(38,0), typereal: float, typedoubleprecision: double, typechar: string, typevarchar: string, typetext: string, typebytea: binary, typetimestam...
Looking at this I get the feeling that for some reason the V2 JDBC data source is not being used. If this data source was in use, this query would return RelationV2 and not Relation.
What configuration would be necessary to use the V2 data source?

Related

How to create a temporary table in snowflake based on pyspark dataframe

I can read the snowflake table in pyspark dataframe using sqlContext
sql = f"""select * from table1""";
df = sqlContext.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.option("query", sql)
.load()
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?

just save as usual, with snowflake format
snowflake_options = {
...
'sfDatabase': 'dbabc',
'dbtable': 'tablexyz',
...
}
(df
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.save()
)

I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table; but persisting it is something that I have had a great deal of difficulty finding how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it successfully returns a java object indicating the temp table was created successfully; but once you try and run any further statements on it, you'll get nasty errors that mean it no longer exists. Somehow we mere mortals would need to find a way to access and persist the Snowflake session through the jvm api. That being said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery api described above to create "temporary" permanent/transient tables and designing Snowflake queries that insert directly to those and then clean them up (DROP them) when you are done.

How to get Create Statement of Table in some other database in Spark using JDBC

Problem statement:
I have a Impala database where multiple tables are present
I am creating Spark JDBC connection to Impala and loading these tables into spark dataframe for my validations like this which works fine:
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","tablename")
.load()
Now the next step and my actual problem is I need to find the create statement which was used to create the tables in Impala itself
Since I cannot run command like below as it gives error, is there anyway I can fetch the show create statement for tables present in Impala.
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","show create table tablename")
.load()

Perhaps you can use Spark SQL "natively" to execute something like
val createstmt = spark.sql("show create table <tablename>")
The resulting dataframe will have a single column (type string) which contains a complete CREATE TABLE statement.
But, if you still choose to go JDBC route there is always an option to use the good old JDBC interface. Scala understands everything written in Java, after all...
import java.sql.*
Connection conn = DriverManager.getConnection("url")
Statement stmt = conn.createStatement()
ResultSet rs = stmt.executeQuery("show create table <tablename>")
...etc...

How to call a stored proc in db2 database from spark scala

I have to call a stored proc in db2 that takes 3 arguments and return an integer . Can anyone help me to call this sp from spark Scala code.
Below is the stored proc in db2.
CREATE PROCEDURE TEST_PROC(IN V_DATE DATE,IN V_GROUP VARCHAR(20),IN V_FREQ
VARCHAR(20),IN V_RULE VARCHAR(20), OUT ID INTEGER)
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
LOCK TABLE CAL_LOG IN EXCLUSIVE MODE;
SET ID = (10+ COALESENCE((SELECT MAX(ID) FROM CAL_LOG WITH UR),0));
INSERT INTO CAL_RESULT(ID,P_DATE,GROUP,FREQ,RULE)
VALUES(ID,V-DATE,V_GROUP,V_FREQ,V_RULE);
COMMIT:
END;
PROC is created and it is working as expected.
Now i want to call this proc from spark scala code.
I am trying the below code
val result = spark.read.format("jdbc")
.options(Map(
"url"-> //the db2 url
"driver" - > // my db2 driver
"user name" - > // username
"password" -> // password
""dbtable" -> "(CALL TEST_PROC('2020-07-08','TEST',''TEST','TEST,?)) as proc_result;"
)).load()
but the code snippet is giving below error
DB# SQL Error: SQLCODE=-104, SQLSTATE=42601
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601

I think you should use JDBC connection directly instead of Spark, as your stored procedure only returns an integer. If you need that value you can retrieve that from the call to the stored procedure, but using Scala without using Spark.
You can find a sample at https://www.ibm.com/support/knowledgecenter/SSEPEK_12.0.0/java/src/tpc/imjcc_tjvcscsp.html
That's the standard way to call it on any language:
If you need to replace parameters, you can use the prepareCall as described on the link above
assign the parameters values with the registerParameter (In or Out)
run the executeQuery as your sp returns an integer
close the connection

I recommend ScalikeJDBC
Maven coordinates (Scala 2.11): org.scalikejdbc:scalikejdbc_2.11:3.4.1
import scalikejdbc._
// Initialize JDBC driver & connection pool
Class.forName(<db2 driver>)
ConnectionPool.singleton(<url>, <user>, <password>)
// ad-hoc session provider on the REPL
implicit val session = AutoSession
// Now you can run anything you want
sql"""
CREATE PROCEDURE TEST_PROC(IN V_DATE DATE,IN V_GROUP VARCHAR(20),IN V_FREQ
VARCHAR(20),IN V_RULE VARCHAR(20), OUT ID INTEGER)
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
LOCK TABLE CAL_LOG IN EXCLUSIVE MODE;
SET ID = (10+ COALESENCE((SELECT MAX(ID) FROM CAL_LOG WITH UR),0));
INSERT INTO CAL_RESULT(ID,P_DATE,GROUP,FREQ,RULE)
VALUES(ID,V-DATE,V_GROUP,V_FREQ,V_RULE);
COMMIT:
END;""".execute.apply()
Fetch data as follows
sql"""(CALL TEST_PROC('2020-07-08','TEST',''TEST','TEST,?))
as proc_result;""".execute.apply()
The result can be turned into a dataframe again if needed.

You cannot call stored procedure using Apache Spark though load same data using Spark load
Load fron db2
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df= sqlContext.load("jdbc", Map(
"url" -> "jdbc:db2://xx.xx.xx.xx:50000/SQLDB:securityMechanism=9;currentSchema=vaquarkhan;user=<ur-username>;password=xxxxx;",
"driver" -> "com.ibm.db2.jcc.DB2Driver",
"dbtable" -> "scheam.TableName"))
Create temp table/df and add filters to get required response.

How to determine if source is empty?

I have an etl process that is using an athena source. I cannot figure out how to create a data frame if there is no data yet in the source. I was using the GlueContext:
trans_ddf = glueContext.create_dynamic_frame.from_catalog(
database=my_db, table_name=my_table, transformation_ctx="trans_ddf")
This fails if there is no data in the source db, because it can't infer the schema.
I also tried using the sql function on the spark session:
has_rows_df = spark.sql("select cast(count(*) as boolean) as hasRows from my_table limit 1")
has_rows = has_rows_df.collect()[0].hasRows
This also fails because it can't infer the schema.
How can I create a data frame so I can determine if the source has any data?

has_rows_df.head(1).isEmpty
should do the job,robustly.
See How to check if spark dataframe is empty?

Unable to create Hbase table using Hive query through Spark

Using the following tutorial: https://hadooptutorial.info/hbase-integration-with-hive/, I was able to do the HBase integration with Hive. After the configuration I was successfully able to create Hbase table using Hive query with Hive table mapping.
Hive query:
CREATE TABLE upc_hbt(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,value:value")
TBLPROPERTIES ("hbase.table.name" = "upc_hbt");
Spark-Scala:
val createTableHql : String = s"CREATE TABLE upc_hbt2(key string, value string)"+
"STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'"+
"WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,value:value')"+
"TBLPROPERTIES ('hbase.table.name' = 'upc_hbt2')"
hc.sql(createTableHql)
But when I execute the same Hive query through Spark it throws the following error:
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
It’s seem like during the Hive execution through Spark it can’t find the auxpath jar location. Is there anyway to solve this problem?
Thank you very much in advance.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Connect with postgresql using the V2 Spark SQL jdbc datasource - scala

Related

How to create a temporary table in snowflake based on pyspark dataframe

How to get Create Statement of Table in some other database in Spark using JDBC

How to call a stored proc in db2 database from spark scala

How to determine if source is empty?

Unable to create Hbase table using Hive query through Spark

Categories

Resources