Unable to ingest JSON data with MemSQL PIPELINE INTO PROCEDURE - apache-kafka

I am facing an issue while ingesting JSON data via a PIPELINE into a table using a stored procedure.
I see NULL values getting inserted into the table.
Stored Procedure SQL:
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, GENERIC_JSON::USERNAME
FROM GENERIC_BATCH;
END //
DELIMITER ;
MemSQL Pipeline Command used:
CREATE OR REPLACE PIPELINE ops.tweet_pipeline_with_sp AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>:9092/user-topic'
INTO PROCEDURE ops.process_users FORMAT JSON ;
JSON Data Pushed to Kafka topic: {"USER_ID":"111","USERNAME":"Test_User"}
Table DDL Statement: CREATE TABLE ops.USER (USER_ID INTEGER, USERNAME VARCHAR(255));

It looks like you're getting help in the MemSQL Forums at https://www.memsql.com/forum/t/unable-to-ingest-json-data-with-pipeline-into-procedure/1702/3. In particular, it looks like the issue is the difference between :: (which yields JSON) and ::$ (which converts to SQL types).
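For illustration, a minimal sketch of that distinction inside the stored procedure's INSERT (a hedged variant, not the exact fix from the thread): extracting with ::$ hands plain SQL strings to the INSERT, instead of JSON values such as the still-quoted "111".
INSERT INTO ops.USER(USER_ID, USERNAME)
SELECT GENERIC_JSON::$USER_ID, GENERIC_JSON::$USERNAME  -- ::$ extracts as TEXT, which then casts to the column types
FROM GENERIC_BATCH;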

Got the solution from the MemSQL forum!
Below are the Pipeline and Stored Procedure scripts that worked for me:
CREATE OR REPLACE PIPELINE OPS.TEST_PIPELINE_WITH_SP
AS LOAD DATA KAFKA '<KAFKA_SERVER_IP>/TEST-TOPIC'
INTO PROCEDURE OPS.PROCESS_USERS(GENERIC_JSON <- %) FORMAT JSON ;
DELIMITER //
CREATE OR REPLACE PROCEDURE ops.process_users(GENERIC_BATCH query(GENERIC_JSON json)) AS
BEGIN
INSERT INTO ops.USER(USER_ID,USERNAME)
SELECT GENERIC_JSON::USER_ID, json_extract_string(GENERIC_JSON,'USERNAME')
FROM GENERIC_BATCH;
END //
DELIMITER ;
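As a quick sanity check (assuming the pipeline and table names above), the pipeline can be started and the target table queried to confirm that real values, not NULLs, are arriving:
START PIPELINE OPS.TEST_PIPELINE_WITH_SP;
SELECT USER_ID, USERNAME FROM ops.USER;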

Related

Azure Synapse Upsert Record into Dedicated Sql Pool

We have a requirement to fetch JSON data from the data lake storage and insert/update data into Synapse tables based on the lastmodified field in the source JSON and the table column.
We need to either insert or update a record based on the following conditions:
if (sourceJson.id == table.id) { // assume record already exists
    if (sourceJson.lastmodified > table.lastmodified) {
        // update existing record
    } else if (sourceJson.lastmodified < table.lastmodified) {
        // ignore record
    }
} else {
    // insert record
}
Is there any way to achieve this? If so, please help me by sharing a sample flow.
Thanks
The Copy data activity and Azure Data Flows both have an option to upsert, but they would not help with your requirement.
Since you have a key column id and also a special condition that decides whether to insert, update, or ignore a record, you can first create a stored procedure in your Azure Synapse dedicated pool.
The following is the data available in my table:
The following is the data available in my JSON:
[
{
"id":1,
"first_name":"Ana",
"lastmodified":"2022-09-10 07:00:00"
},
{
"id":2,
"first_name":"Cassy",
"lastmodified":"2022-09-07 07:00:00"
},
{
"id":5,
"first_name":"Topson",
"lastmodified":"2022-09-10 07:00:00"
}
]
Use a Lookup activity to read the input JSON file. Create a dataset, uncheck First row only, and run it. The following is my debug output:
Now, create a stored procedure. I have created it directly in my Synapse pool (you can use a Script activity to create it).
CREATE PROCEDURE mymerge
@array varchar(max)
AS
BEGIN
    --insert records whose id is not yet present in the table
    INSERT INTO demo1
    SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
    WHERE id NOT IN (SELECT id FROM demo1);

    --use MERGE to update records based on the matching id and lastmodified column condition
    MERGE INTO demo1 AS tgt
    USING (SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
           WHERE id IN (SELECT id FROM demo1)) AS ip
    ON (tgt.id = ip.id AND ip.lastmodified > tgt.lastmodified)
    WHEN MATCHED THEN
        UPDATE SET tgt.first_name = ip.first_name, tgt.lastmodified = ip.lastmodified;
END
Create a Stored procedure activity. Select the stored procedure created above and pass the Lookup output array as a string parameter to the stored procedure to get the required result:
@string(activity('Lookup1').output.value)
Running this would give the required result.
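To sanity-check the procedure outside the pipeline, it can also be executed directly with a small JSON string (values taken from the sample JSON above):
EXEC mymerge @array = '[{"id":2,"first_name":"Cassy","lastmodified":"2022-09-07 07:00:00"},{"id":5,"first_name":"Topson","lastmodified":"2022-09-10 07:00:00"}]';
SELECT * FROM demo1;  -- id 5 should be inserted; id 2 is updated only if its lastmodified is newer than the table's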

Using PostgreSQL comments as descriptions in dbt docs

We've been adding comments to the columns in postgres as column descriptions. Similarly, there are descriptions in dbt that can be written.
How would I go about writing SQL to automatically set the same descriptions from Postgres in the dbt docs?
Here's how I often do it.
Take a look at this answer on how to pull descriptions from pg_catalog.
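For reference, a minimal sketch (the schema name and relkind filter are assumptions) of pulling column comments straight out of the Postgres catalog:
-- col_description() reads pg_description for the given table OID and column number
SELECT c.relname AS table_name,
       a.attname AS column_name,
       col_description(c.oid, a.attnum) AS description
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE n.nspname = 'public'
  AND c.relkind IN ('r', 'v')
  AND a.attnum > 0
  AND NOT a.attisdropped;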
From there, you want to write a BQ query that generates a JSON which you can then convert to a YAML file you can use directly in dbt.
BQ link - save results as JSON file.
Use a json2yaml tool.
Save yaml file to an appropriate place in your project tree.
Code sample:
-- intended to be saved as JSON and converted to YAML
-- ex. cat script_job_id_1.json | python3 json2yaml.py | tee schema.yml
-- version will be created as version:'2' . Remove quotes after conversion
DECLARE database STRING;
DECLARE dataset STRING;
DECLARE dataset_desc STRING;
DECLARE source_qry STRING;
SET database = "bigquery-public-data";
SET dataset = "census_bureau_acs";
SET dataset_desc = "";
SET source_qry = CONCAT('''CREATE OR REPLACE TEMP TABLE tt_master_table AS ''',
'''(''',
'''SELECT cfp.table_name, ''',
'''cfp.column_name, ''',
'''cfp.description ''',
'''FROM `''', database, '''`.''', dataset, '''.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS cfp ''',
''')''');
EXECUTE IMMEDIATE source_qry;
WITH column_info AS (
SELECT table_name as name,
ARRAY_AGG(STRUCT(column_name AS name, COALESCE(description,"") AS description)) AS columns
FROM tt_master_table
GROUP by table_name
)
, table_level AS (
SELECT CONCAT(database, ".", dataset) AS name,
database,
dataset,
dataset_desc AS `description`,
ARRAY_AGG(
STRUCT(name, columns)) AS tables
FROM column_info
GROUP BY database,
dataset,
dataset_desc
LIMIT 1)
SELECT CAST(2 AS INT) AS version,
ARRAY_AGG(STRUCT(name, database, dataset, description, tables)) AS sources
FROM table_level
GROUP BY version

Get data of multiple inserted rows in one object using trigger in postgres

I am trying to write a trigger that gets data from the attributes table, where multiple rows corresponding to one actionId are inserted at a time, and groups all that data into one object.
Table schema:
actionId
key
value
I am firing the trigger on row insertion, so how can I handle this multiple-row insertion and how can I collect all the data?
CREATE TRIGGER attribute_changes
AFTER INSERT
ON attributes
FOR EACH ROW
EXECUTE PROCEDURE log_attribute_changes();
and the function:
CREATE OR REPLACE FUNCTION wflowr222.log_task_extendedattribute_changes()
RETURNS trigger AS
$BODY$
DECLARE
_message json;
_extendedAttributes jsonb;
BEGIN
SELECT json_agg(tmp)
INTO _extendedAttributes
FROM (
-- your subquery goes here, for example:
SELECT attributes.key, attributes.value
FROM attributes
WHERE attributes.actionId=NEW.actionId
) tmp;
_message :=json_build_object('actionId',NEW.actionId,'extendedAttributes',_extendedAttributes);
INSERT INTO wflowr222.irisevents(message)
VALUES(_message );
RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
and the data format is:
actionId  key     value
2         flag    true
2         image   http:test.com/image
2         status  New
I tried to do it via an INSERT trigger, but it fires for each row inserted.
Does anyone have any idea about this?
I expect that the problem is that you're using a FOR EACH ROW trigger; what you likely want is a FOR EACH STATEMENT trigger - i.e. one which only fires once for your multi-line INSERT statement. See the description at https://www.postgresql.org/docs/current/sql-createtrigger.html for a more thorough explanation.
AFAICT, you will also need to add REFERENCING NEW TABLE AS NEW in this mode to make the NEW reference available to the trigger function. So your CREATE TRIGGER syntax would need to be:
CREATE TRIGGER attribute_changes
AFTER INSERT
ON attributes
REFERENCING NEW TABLE AS NEW
FOR EACH STATEMENT
EXECUTE PROCEDURE log_attribute_changes();
I've read elsewhere that the required REFERENCING NEW TABLE ... syntax is only supported in PostgreSQL 10 and later.
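Putting the pieces together, a minimal sketch of the trigger function for this mode (assuming PostgreSQL 10+; the transition table is named new_rows here rather than NEW, to keep it clearly distinct from plpgsql's row-level NEW record):
CREATE OR REPLACE FUNCTION wflowr222.log_attribute_changes()
RETURNS trigger AS
$BODY$
BEGIN
    -- one firing per INSERT statement: aggregate every inserted row per actionId
    INSERT INTO wflowr222.irisevents(message)
    SELECT json_build_object(
               'actionId', n.actionId,
               'extendedAttributes', json_agg(json_build_object('key', n.key, 'value', n.value))
           )
    FROM new_rows n
    GROUP BY n.actionId;
    RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER attribute_changes
AFTER INSERT ON attributes
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE PROCEDURE wflowr222.log_attribute_changes();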
Considering the version of Postgres you have, and therefore keeping in mind that you can't use a trigger defined FOR EACH STATEMENT for your purpose, the only alternative I see is:
using an AFTER INSERT trigger in order to collect some information about the changes in a utility table
using a unix cron job that executes a PL/pgSQL routine that does the work on the data set
For example:
Your utility table
CREATE TABLE utility (
actionid integer,
createtime timestamp
);
You can define a trigger FOR EACH ROW with a body that does something like this:
INSERT INTO utility VALUES (NEW.actionid, current_timestamp);
And, finally, have a UNIX crontab entry that executes a file or a procedure that does something like this:
SELECT a.* FROM utility u JOIN yourtable a ON a.actionid = u.actionid WHERE u.createtime < current_timestamp;
-- do something here with the records selected above
TRUNCATE table utility;
If you had postgres 9.5 you could have used pg_cron instead of unix cron...
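For completeness, a hedged sketch of the pg_cron variant (the extension must be installed, and process_pending_actions is a hypothetical function wrapping the SELECT/TRUNCATE logic above):
-- hypothetical: run the processing routine every 5 minutes from inside the database
SELECT cron.schedule('*/5 * * * *', $$SELECT process_pending_actions()$$);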

Oracle 12c: use JSON_QUERY in a trigger

I have a CLOB Json in a column of a table with the following structure:
{"sources": [1,2,4]}
I'm trying to write a trigger that reads the array [1,2,4] and performs some checks.
I'm trying with:
DECLARE
TYPE source_type IS TABLE OF NUMBER;
SOURCES source_type;
[...]
SELECT json_query(:NEW.COL, '$.sources') BULK COLLECT INTO SOURCES FROM dual;
but I got the error:
Row 1: ORA-01722: invalid number
Any ideas?
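A hedged sketch of one commonly used approach (not from the original thread, and assuming the JSON functions accept the :NEW bind on this 12c patch level): JSON_QUERY returns the whole array as a single JSON fragment, which is what likely triggers the ORA-01722; JSON_TABLE instead expands the array into rows that BULK COLLECT can load into the collection.
DECLARE
    TYPE source_type IS TABLE OF NUMBER;
    SOURCES source_type;
BEGIN
    -- expand $.sources into one NUMBER row per array element
    SELECT jt.src
    BULK COLLECT INTO SOURCES
    FROM JSON_TABLE(:NEW.COL, '$.sources[*]'
                    COLUMNS (src NUMBER PATH '$')) jt;
    -- perform the checks on SOURCES here
END;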

Converting Postgres Function to Impala UDF or a function in Spark

I have a postgres function that is called in a query. It's similar to this sample:
CREATE OR REPLACE FUNCTION test_function(id integer, dt date, days int[], accts text[], flag boolean) RETURNS float[] AS $$
DECLARE
    pt_dates date[];
    pt_amt integer[];
    amt float[];
BEGIN
    if flag then
        pt_dates := array(select dt from tab1);
        pt_amt := array(select amt from tab1);
        if array_upper(days, 1) is not null then
            for j in 1 .. array_upper(days, 1)
            loop
                amt := amt || pt_amt[j]::float;  -- accumulate one amount per day
            end loop;
        end if;
    end if;
    return amt;
END;
$$ LANGUAGE plpgsql;
If I wish to convert this to the data lake environment, which is the best way to do it: an Impala UDF, a Spark UDF, or a Hive UDF? With an Impala UDF, how do I access the Impala database? If I write a Spark UDF, can I use it in the impala-shell?
Please advise.
There are a lot of questions in your one post, so I'll choose just the Spark-related question.
You have this SQL query that represents the data processing you wish to perform.
Here is a general formula to do this with Spark:
Take some amount of data, move it to S3
Go into AWS EMR and create a new cluster
SSH into the master node, and run pyspark console
once it has started, you can read in your S3 data via rdd = sc.textFile("s3://path/to/your/s3/buckets/")
apply a schema to it with a map function rdd2 = rdd.map(..add schema..)
convert that rdd2 into a dataframe and store that as a new var. rdd2DF = rdd2.toDF()
perform a rdd2DF.registerTempTable('newTableName') on that
write a SQL query and store the result: output = sqlContext.sql("SELECT a,b,c FROM newTableName")
show the output: output.show()
Now, I know this is literally too high-level to be a specific answer to your question, but everything I just said is very googleable.
And this is an example of a separated Compute and Storage scenario leveraging EMR with Spark and SparkSQL to process a lot of data with SQL queries.