Converting Postgres Function to Impala UDF or a function in Spark - postgresql

I have a Postgres function that is called in a query. It's similar to this sample:
CREATE OR REPLACE FUNCTION test_function(id integer, dt date, days int[], accts text[], flag boolean) RETURNS float[] AS $$
DECLARE
pt_dates date[];
pt_amt integer[];
amt float[];
BEGIN
if flag then
    pt_dates := array(select dt from tab1);
    pt_amt := array(select amt from tab1);
    if array_upper(days, 1) is not null then
        for j in 1 .. array_upper(days, 1)
        loop
            amt[j] := coalesce(amt[j], 0) + pt_amt[j];
        end loop;
    end if;
end if;
return amt;
END;
$$ LANGUAGE plpgsql;
If I wish to convert this into the data lake environment, which is the best way to do it: an Impala UDF, a Spark UDF, or a Hive UDF? With an Impala UDF, how do I access the Impala database? If I write a Spark UDF, can I use it in the impala-shell?
Please advise.

There are a lot of questions packed into your one post, so I'll pick just the Spark-related one.
You have this SQL query that represents the data processing you wish to perform.
Here is a general formula to do this with Spark:
Take some amount of data, move it to S3
Go into AWS EMR and create a new cluster
SSH into the master node, and run pyspark console
once it has started, you can read in your S3 data via rdd = sc.textFile("s3://path/to/your/s3/buckets/")
apply a schema to it with a map function rdd2 = rdd.map(..add schema..)
convert that rdd2 into a dataframe and store that as a new var. rdd2DF = rdd2.toDF()
perform a rdd2DF.registerTempTable('newTableName') on that
write a SQL query and store the result: output = sqlContext.sql("SELECT a,b,c FROM newTableName")
show the output: output.show()
Now, I know this is too high-level to be a specific answer to your question, but everything I just said is very googleable.
This is an example of a separated compute-and-storage scenario, leveraging EMR with Spark and Spark SQL to process a lot of data with SQL queries.
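Put together, a minimal PySpark sketch of those steps might look like the following; the bucket path and the assumption of comma-separated a,b,c lines are placeholders rather than details from the question.
# Minimal sketch of the steps above; the S3 path and the a,b,c layout are hypothetical.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("s3-sparksql-example").getOrCreate()
sc = spark.sparkContext

# Read raw text lines from S3
rdd = sc.textFile("s3://path/to/your/s3/buckets/")

# Apply a schema with a map function (assumes lines like "a,b,c")
rdd2 = rdd.map(lambda line: line.split(",")).map(lambda p: Row(a=p[0], b=p[1], c=p[2]))

# Convert the RDD to a DataFrame and register it as a temp view
# (createOrReplaceTempView is the current name for registerTempTable)
rdd2DF = rdd2.toDF()
rdd2DF.createOrReplaceTempView("newTableName")

# Run SQL against it and show the result
output = spark.sql("SELECT a, b, c FROM newTableName")
output.show()
On an EMR pyspark console, spark and sc are already defined for you, so the first two lines after the import can be dropped there.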

Related

How to execute an UPDATE query on Spark SQL temp tables

I am trying the code below, but it throws an error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently, Spark SQL does not support UPDATE statements on temp views. The workaround is to create a Delta Lake / Iceberg table from your Spark DataFrame and execute your SQL query directly on that table.
For an Iceberg implementation, refer to:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
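For the Delta Lake route, a minimal sketch might look like the following; it assumes a SparkSession with the delta-spark package and its SQL extensions configured, and the table name temp_table_delta is made up for illustration.
# Assumes Delta Lake is configured on the session, e.g. via
#   --packages io.delta:delta-spark_2.12:<version>
#   --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
#   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# Persist the DataFrame as a Delta table (table name is hypothetical)
df.write.format("delta").mode("overwrite").saveAsTable("temp_table_delta")

# Delta tables accept UPDATE statements, unlike temp views
spark.sql("UPDATE temp_table_delta SET column_a = '1'")
spark.table("temp_table_delta").show()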

How to create a temporary table in snowflake based on pyspark dataframe

I can read the Snowflake table into a PySpark DataFrame using sqlContext:
sql = f"""select * from table1"""
df = (sqlContext.read
      .format(SNOWFLAKE_SOURCE_NAME)
      .options(**snowflake_options)
      .option("query", sql)
      .load())
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?
Just save as usual, with the Snowflake format:
snowflake_options = {
...
'sfDatabase': 'dbabc',
'dbtable': 'tablexyz',
...
}
(df
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.save()
)
I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table, but persisting it is something I have had a great deal of difficulty figuring out how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it successfully returns a Java object indicating the temporary table was created; but once you try to run any further statements against it, you'll get errors telling you it no longer exists, because a temporary table only lives for the Snowflake session that created it. Somehow we mere mortals would need to find a way to access and persist that Snowflake session through the JVM API. That said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create "temporary" permanent/transient tables, designing Snowflake queries that insert directly into those, and then cleaning them up (DROP them) when you are done, as sketched below.
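A rough PySpark sketch of that second option, reusing the runQuery call from above; the transient table tmp_results and its columns are made up, and snowflake_options / SNOWFLAKE_SOURCE_NAME are assumed to be defined as in the question.
# Sketch only: tmp_results and its columns are hypothetical;
# snowflake_options / SNOWFLAKE_SOURCE_NAME come from the question's setup.
run_query = spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery

# A transient table survives across sessions, unlike a true TEMPORARY table
run_query(snowflake_options,
          "create transient table if not exists tmp_results (id int, value text)")

# Write the DataFrame into it with the connector
(df.write
   .format(SNOWFLAKE_SOURCE_NAME)
   .options(**snowflake_options)
   .option("dbtable", "tmp_results")
   .mode("append")
   .save())

# ... run whatever Snowflake-side SQL you need against tmp_results ...

# Clean up when done
run_query(snowflake_options, "drop table if exists tmp_results")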

Using loop to create spark SQL queries

I am trying to create some Spark SQL queries for different tables, which I have collected in a list. I want to create SQL queries for all the tables present in the Hive database. The Hive context has been initialized. Following is my approach:
tables = spark.sql("show tables in survey_db")
# registering dataframe as temp view with 2 columns - tableName and db name
tables.createOrReplaceTempView("table_list")
# collecting my table names in a list
table_array= spark.sql("select collect_list(tableName) from table_list").collect()[0][0]
# array values(list)
table_array= [u'survey',u'market',u'customer']
I want to create Spark SQL queries for the table names stored in table_array, for example:
for i in table_array:
    spark.sql("select * from survey_db.'i'")
I can't use shell scripting, as I have to write a PySpark script for this. Please advise whether spark.sql queries can be created using a loop/map. Thanks, everyone.
You can achieve the same as follows:
sql_list = [f"select * from survey_db.{table}" for table in table_array]
for sql in sql_list:
    df = spark.sql(sql)
    df.show()

How to call a stored proc in db2 database from spark scala

I have to call a stored proc in DB2 that takes input arguments and returns an integer. Can anyone help me call this SP from Spark Scala code?
Below is the stored proc in db2.
CREATE PROCEDURE TEST_PROC(IN V_DATE DATE, IN V_GROUP VARCHAR(20), IN V_FREQ VARCHAR(20),
IN V_RULE VARCHAR(20), OUT ID INTEGER)
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
LOCK TABLE CAL_LOG IN EXCLUSIVE MODE;
SET ID = (10 + COALESCE((SELECT MAX(ID) FROM CAL_LOG WITH UR), 0));
INSERT INTO CAL_RESULT(ID, P_DATE, GROUP, FREQ, RULE)
VALUES(ID, V_DATE, V_GROUP, V_FREQ, V_RULE);
COMMIT;
END;
The proc is created and is working as expected.
Now I want to call this proc from Spark Scala code.
I am trying the code below:
val result = spark.read.format("jdbc")
  .options(Map(
    "url" -> "<the db2 url>",
    "driver" -> "<my db2 driver>",
    "user" -> "<username>",
    "password" -> "<password>",
    "dbtable" -> "(CALL TEST_PROC('2020-07-08','TEST','TEST','TEST',?)) as proc_result;"
  )).load()
but the code snippet gives the error below:
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601
I think you should use a JDBC connection directly instead of Spark, as your stored procedure only returns an integer. If you need that value, you can retrieve it from the call to the stored procedure, using plain Scala without Spark.
You can find a sample at https://www.ibm.com/support/knowledgecenter/SSEPEK_12.0.0/java/src/tpc/imjcc_tjvcscsp.html
That's the standard way to call it in any language:
prepare the call with prepareCall, as described in the link above
set the IN parameter values (setDate, setString) and register the OUT parameter with registerOutParameter
execute the call and read the OUT parameter, since your SP returns an integer
close the connection
I recommend ScalikeJDBC
Maven coordinates (Scala 2.11): org.scalikejdbc:scalikejdbc_2.11:3.4.1
import scalikejdbc._
// Initialize JDBC driver & connection pool
Class.forName(<db2 driver>)
ConnectionPool.singleton(<url>, <user>, <password>)
// ad-hoc session provider on the REPL
implicit val session = AutoSession
// Now you can run anything you want
sql"""
CREATE PROCEDURE TEST_PROC(IN V_DATE DATE,IN V_GROUP VARCHAR(20),IN V_FREQ
VARCHAR(20),IN V_RULE VARCHAR(20), OUT ID INTEGER)
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
LOCK TABLE CAL_LOG IN EXCLUSIVE MODE;
SET ID = (10 + COALESCE((SELECT MAX(ID) FROM CAL_LOG WITH UR), 0));
INSERT INTO CAL_RESULT(ID, P_DATE, GROUP, FREQ, RULE)
VALUES(ID, V_DATE, V_GROUP, V_FREQ, V_RULE);
COMMIT;
END;""".execute.apply()
Fetch data as follows
sql"""(CALL TEST_PROC('2020-07-08','TEST',''TEST','TEST,?))
as proc_result;""".execute.apply()
The result can be turned into a dataframe again if needed.
You cannot call a stored procedure using Apache Spark, though you can load the same data using a Spark JDBC load.
Load from DB2:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df= sqlContext.load("jdbc", Map(
"url" -> "jdbc:db2://xx.xx.xx.xx:50000/SQLDB:securityMechanism=9;currentSchema=vaquarkhan;user=<ur-username>;password=xxxxx;",
"driver" -> "com.ibm.db2.jcc.DB2Driver",
"dbtable" -> "scheam.TableName"))
Then create a temp table/DataFrame and add filters to get the required response.

Unable to ingest JSON data with MemSQL PIPELINE INTO PROCEDURE

I am facing an issue while ingesting JSON data via a PIPELINE into a table using a stored procedure.
I see NULL values getting inserted into the table.
Stored Procedure SQL:
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, GENERIC_JSON::USERNAME
FROM GENERIC_BATCH;
END //
DELIMITER ;
MemSQL Pipeline Command used:
CREATE OR REPLACE PIPELINE ops.tweet_pipeline_with_sp AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>:9092/user-topic'
INTO PROCEDURE ops.process_users FORMAT JSON ;
JSON Data Pushed to Kafka topic: {"USER_ID":"111","USERNAME":"Test_User"}
Table DDL Statement: CREATE TABLE ops.USER (USER_ID INTEGER, USERNAME VARCHAR(255));
It looks like you're getting help in the MemSQL forums at https://www.memsql.com/forum/t/unable-to-ingest-json-data-with-pipeline-into-procedure/1702/3. In particular, it looks like the difference between :: (which yields JSON) and ::$ (which converts to SQL types).
I got the solution from the MemSQL forum!
Below are the pipeline and stored procedure scripts that worked for me:
CREATE OR REPLACE PIPELINE OPS.TEST_PIPELINE_WITH_SP
AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>/TEST-TOPIC'
INTO PROCEDURE OPS.PROCESS_USERS(GENERIC_JSON <- %) FORMAT JSON ;
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, json_extract_string(GENERIC_JSON,'USERNAME')
FROM GENERIC_BATCH;
END //
DELIMITER ;