Flink Table JDBC lookup.cache and related properties not working in a streaming environment

When a SQL query runs in a streaming environment and joins streaming data with a JDBC table, the task that reads the JDBC table finishes immediately after reading all the table records. Adding the properties lookup.cache, lookup.partial-cache.max-rows and lookup.partial-cache.expire-after-write to the JDBC table has no effect on the task lifecycle, so the lookup.cache mechanism is not working as expected.
I have created the table as:
CREATE TABLE U_ZRB_C_RISKLI_MCC_0 (
  RISKLIMCC STRING,
  PRIMARY KEY (RISKLIMCC) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:oracle:thin:@localhost:1521/orcl',
  'table-name' = 'U_ZRB_C_RISKLI_MCC_0',
  'driver' = 'oracle.jdbc.OracleDriver',
  'username' = 'username',
  'password' = 'password',
  'lookup.cache' = 'PARTIAL',
  'lookup.partial-cache.expire-after-write' = '10s'
)
The related task still stops right after reading the table records.

Thanks to @Xianxun Ye: when I added the FOR SYSTEM_TIME AS OF clause after the table name, the lookup functionality runs as expected, as described at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
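For reference, a minimal sketch of such a lookup join; the streaming table Orders, its column mcc and its processing-time attribute proc_time are hypothetical names, not from the original setup:
SELECT o.order_id, o.mcc, r.RISKLIMCC
FROM Orders AS o
JOIN U_ZRB_C_RISKLI_MCC_0 FOR SYSTEM_TIME AS OF o.proc_time AS r
  ON o.mcc = r.RISKLIMCC;
With this clause the JDBC table is queried per incoming row (and cached according to the lookup.* options) instead of being scanned once as a bounded source.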
I think the Flink documentation page
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/#lookup-cache
should also refer to the lookup-join page.

Related

How to query parquet data files from Azure Synapse when data may be structured and exceed 8000 bytes in length

I am having trouble reading, querying and creating external tables from Parquet files stored in Data Lake Storage Gen2, using Azure Synapse.
Specifically I see this error while trying to create an external table through the UI:
"Error details
New external table
Previewing the file data failed. Details: Failed to execute query. Error: Column 'members' of type 'NVARCHAR' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'. File/External table name: [DELETED] Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes.
. If the issue persists, contact support and provide the following id :"
My main hunch is that, since a couple of columns were originally JSON types and some of the rows are quite long (up to 9000 characters right now, which could increase at any point during my ETL), this is some kind of conflict with the default limits I have seen referenced in the documentation. The data looks internally like the following example; please bear in mind it can sometimes be much longer:
["100.001", "100.002", "100.003", "100.004", "100.005", "100.006", "100.023"]
If I try to manually create the external table (which has worked every other time I have tried), following code similar to this:
CREATE EXTERNAL TABLE example1(
[id] bigint,
[column1] nvarchar(4000),
[column2] nvarchar(4000),
[column3] datetime2(7)
)
WITH (
LOCATION = 'location/**',
DATA_SOURCE = [datasource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
the table is created with no errors or warnings, but when I try a very simple select:
SELECT TOP (100)
    [id],
    [column1],
    [column2],
    [column3]
FROM [schema1].[example1]
The following error is shown:
"External table 'dbo' is not accessible because content of directory cannot be listed."
It can also show the equivalent:
"External table 'schema1' is not accessible because content of directory cannot be listed."
This error persists even when creating the external table with the "max" argument, as it appears in this doc.
Summary: how can I create an external table from Parquet files with fields exceeding 4000 or 8000 bytes, or even up to 2 GB, which would be the maximum size according to this?
Thank you all in advance
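For completeness, a sketch of the nvarchar(max) variant mentioned above, reusing the same placeholder data source and file format names from the question; whether max avoids the length error depends on the limits discussed in the linked docs:
CREATE EXTERNAL TABLE example1_max (
    [id]      bigint,
    [column1] nvarchar(max),  -- columns holding long JSON-like strings
    [column2] nvarchar(max),
    [column3] datetime2(7)
)
WITH (
    LOCATION = 'location/**',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [SynapseParquetFormat]
);
GO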

Redshift COPY throws error but 'stl_load_errors' system table does not provide details

When I attempt to copy a CSV from S3 into a new table in Redshift (which normally works for other tables), I get this error:
ERROR: Load into table 'table_name' failed. Check 'stl_load_errors'
system table for details.
But, when I run the standard query to investigate stl_load_errors
SELECT errors.tbl, info.table_id::integer, info.table_id, *
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id
I don't see any results related to this COPY. I see errors from previous failed COPY commands, but none related to the most recent one that I am interested in.
Please make sure that you are querying the stl_load_errors table with the same user that ran the COPY command. You can also try to avoid using the svv_table_info table in the query, or change the INNER join to a LEFT join.
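For example, a sketch of the same diagnostic query with a LEFT join, so rows from stl_load_errors are kept even when they don't match svv_table_info:
SELECT le.starttime, le.tbl, info.table_id, le.filename,
       le.line_number, le.colname, le.err_code, le.err_reason
FROM stl_load_errors le
LEFT JOIN svv_table_info info
       ON le.tbl = info.table_id
ORDER BY le.starttime DESC
LIMIT 20;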

Is there a way to describe an external/spectrum table via redshift?

In AWS Athena you can write
SHOW CREATE TABLE my_table_name;
and see a SQL-like query that describes how to build the table's schema. It works for tables whose schema is defined in AWS Glue. This is very useful for creating tables in a regular RDBMS, and for loading and exploring data views.
Interacting with Athena in this way is manual, and I would like to automate the process of creating regular RDBMS tables that have the same schema as those in Redshift Spectrum.
How can I do this through a query that can be run via psql? Or is there another way to get this via the aws-cli?
Redshift Spectrum does not support the SHOW CREATE TABLE syntax, but there are system tables that can deliver the same information. I have to say it's not as useful as the ready-to-use SQL returned by Athena, though.
The tables are
svv_external_schemas - gives you information about the Glue database mapping and the IAM roles bound to it
svv_external_tables - gives you the location information, and also the data format and serdes used
svv_external_columns - gives you the column names, types and order information.
Using that data, you could reconstruct the table's DDL.
For example, to get the list of columns and their types in the CREATE TABLE format, one can do:
select distinct
listagg(columnname || ' ' || external_type, ',\n')
within group ( order by columnnum ) over ()
from svv_external_columns
where tablename = '<YOUR_TABLE_NAME>'
and schemaname = '<YOUR_SCHEMA_NAME>'
The query gives you output similar to:
col1 int,
col2 string,
...
*) I am using the listagg window function and not the aggregate function, as apparently the listagg aggregate function can only be used with user-defined tables. Bummer.
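To fill in the rest of the DDL (location, format and serde, as mentioned above), a similar query against svv_external_tables can be used, for example:
select location, input_format, output_format, serialization_lib, serde_parameters
from svv_external_tables
where schemaname = '<YOUR_SCHEMA_NAME>'
  and tablename  = '<YOUR_TABLE_NAME>';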
I had been doing something similar to @botchniaque's answer in the past, but recently stumbled across a solution in the AWS Labs amazon-redshift-utils code package that seems to be more reliable than my hand-spun queries:
amazon-redshift-utils: v_generate_external_tbl_ddl
If you don't have the ability to create a view backed with the ddl listed in that package, you can run it manually by removing the CREATE statement from the start of the query. Assuming you can create it as a view, usage would be:
SELECT ddl
FROM admin.v_generate_external_tbl_ddl
WHERE schemaname = '<external_schema_name>'
-- Optionally include specific table references:
-- AND tablename IN ('<table_name_1>', '<table_name_2>', ..., '<table_name_n>')
ORDER BY tablename, seq
;
They have added SHOW EXTERNAL TABLE now.
SHOW EXTERNAL TABLE external_schema.table_name [ PARTITION ]
SHOW EXTERNAL TABLE my_schema.my_table;
https://docs.aws.amazon.com/redshift/latest/dg/r_SHOW_EXTERNAL_TABLE.html

Hortonworks 2.5 ACID insert in Hive hangs

I am using HDP 2.5. I want to perform a delete operation from a Hive table. I have created a table with the following command:
hive> create table test (x int, y string) clustered by (x) into 2 buckets stored as ORC tblproperties ("transactional" = "true");
OK
Time taken: 0.148 seconds
Further, I have set the following Hive properties:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
set hive.optimize.sort.dynamic.partition=false;
Then I do a test insert into the Hive table using the following command:
INSERT INTO test VALUES (1, 'a');
Unfortunately my Hive CLI shell hangs and I have to press Ctrl+C to get out of the shell. May I know why this is happening? Is it because my Hive schema/database contains a mixture of ACID and non-ACID tables? Any suggestion to resolve this problem will be very helpful.
Thanks in advance!
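Not from the original thread, but as a diagnostic sketch: Hive's SHOW TRANSACTIONS and SHOW LOCKS commands (available once DbTxnManager is configured) can show whether the INSERT is stuck waiting on a lock or an open transaction rather than failing outright:
-- run from another Hive CLI/Beeline session while the INSERT is hanging
SHOW TRANSACTIONS;
SHOW LOCKS test;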

PostgreSQL: query results to a new table

Windows / .NET / ODBC
I would like to get the query results into a new table in some handy way that I can then see through a data adapter, but I can't find a way to do it.
There are not many examples around that suit a beginner's level on this.
I don't know whether it should be temporary or not, but after seeing the results the table is no longer needed, so I can delete it 'by hand' or it can be deleted automatically.
This is what I try:
mCmd = New OdbcCommand("CREATE TEMP TABLE temp1 ON COMMIT DROP AS " & _
"SELECT dtbl_id, name, mystr, myint, myouble FROM " & myTable & " " & _
"WHERE myFlag='1' ORDER BY dtbl_id", mCon)
n = mCmd.ExecuteNonQuery
This runs without error, and in 'n' I get the correct number of matched rows!
But with pgAdmin I don't see that table anywhere, no matter whether I look inside the open transaction or after the transaction is closed.
Second, should I define the columns for the temp1 table first, or can they be created automatically based on the query results (that would be nice!)?
Please give a minimal example, based on the code above, of how to get a new table filled with the query results.
A shorter way to do the same thing your current code does is with CREATE TEMPORARY TABLE AS SELECT ... . See the entry for CREATE TABLE AS in the manual.
Temporary tables are not visible outside the session ("connection") that created them, they're intended as a temporary location for data that the session will use in later queries. If you want a created table to be accessible from other sessions, don't use a TEMPORARY table.
Maybe you want UNLOGGED (9.1 or newer) for data that's generated and doesn't need to be durable, but must be visible to other sessions?
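A minimal sketch of that alternative, based on the original command (the literal table name mytable stands in for the myTable variable above):
CREATE UNLOGGED TABLE temp1 AS
SELECT dtbl_id, name, mystr, myint, myouble
FROM mytable
WHERE myFlag = '1'
ORDER BY dtbl_id;

-- the table is now visible from other sessions (e.g. pgAdmin);
-- drop it by hand once it is no longer needed:
DROP TABLE temp1;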
See related: Is there a way to access temporary tables of other sessions in PostgreSQL?