How does Spark SQL execute a SQL query with a join operation? - postgresql

For the following operation, which runs a SQL statement in Spark SQL to join two tables in PostgreSQL:
val df = spark.read.jdbc(url, "select * from table_1 join table_2 on a where x", connproperties)
Will the database engine execute the join and send back the joined results? Or will the database send all records of table_1 and table_2 to the Spark job, with Spark doing the join? Is there any documentation that explains this behaviour? Thanks!

The PostgreSQL database will only return a single result set for a single query. If you use valid SQL, that result set can be the joined result, or nothing if no records match your conditions.
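For illustration, here is a minimal Scala sketch (the table and column names are made up; url and connproperties are the ones from the question) of pushing the join down to PostgreSQL by passing a parenthesized subquery as the table argument of spark.read.jdbc:
// Because the join is written inside the subquery passed as the "table",
// PostgreSQL executes the join and only the joined rows are shipped to Spark.
val joined = spark.read.jdbc(
  url,
  "(select t1.id, t1.a, t2.b from table_1 t1 join table_2 t2 on t1.id = t2.id) as pushed_join",
  connproperties
)
joined.show()
If you instead load table_1 and table_2 as two separate JDBC DataFrames and join them with df1.join(df2, ...), the join runs inside Spark and both tables are read over JDBC first.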

Related

How to use a temporal table join in batch mode with Flink SQL?

In order to revise data at T+1 (because of data delays), I want to execute a temporal table join using Flink SQL in batch mode. The official Flink documentation says joins are supported in batch mode, but I get an error when executing the SQL: org.apache.flink.table.api.TableException: unexpected correlate variable $cor1 in the plan
set table.dynamic-table-options.enabled=true;
with ods as (select ts, id, name from tbl_ods /*+ options('read.streaming.enabled'='false') */ where pdate=xxx),
dim as (select ts, id, city from tbl_dim /*+ options('read.streaming.enabled'='false') */ where pdate=xxx)
select
ods.id,
ods.name,
dim.city
from ods
left join dim for system_time as of dim.ts as dm on ods.id=dm.id;
Flink version: 1.15.2
Hudi: 0.12.2
Is the temporal table join not supported in Flink SQL batch mode, or do I need to change the SQL?

Spark SQL query is taking a long time in Scala

I am new to Spark SQL. I have a Spark SQL query running inside a for loop, for example:
val sQuery = "select distinct col1, col2, col3 from HiveDB.HiveTableName"
The query is executed sequentially in the loop for different Hive databases and tables, almost 200 of them. There is a performance hit because the query has to scan the entire table each time. I tried to rewrite/optimize the query like this:
val sQuery = "select * from (select col1, col2, col3, dense_rank() over (order by `date` desc) as rnk from HiveDB.HiveTableName ) b where b.rnk =1"
But it still takes about the same time as the first query. Can anyone suggest how to optimize the query?
Spark version used: 2.3.2
The tables are external, stored as ORC, and partitioned by yyyy-mm.
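As a hedged sketch of the loop described above (the table list and the cutoff value are made up, and it assumes the yyyy-mm `date` column is the partition column), filtering on the partition column lets Spark prune partitions instead of scanning the whole table on every iteration:
// Hypothetical list of the ~200 db.table names iterated over in the loop.
val tables = Seq("HiveDB.HiveTableName" /*, ... */)
tables.foreach { t =>
  // Assumption: `date` (yyyy-mm) is the partition column, so this predicate
  // only reads the partitions that are actually needed.
  val sQuery = s"select distinct col1, col2, col3 from $t where `date` >= '2023-01'"
  val result = spark.sql(sQuery)
  result.show()
}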

Can't we join two tables and fetch data in Kafka?

I have joined two tables and fetched data using the Postgres source connector, but every time it gives the same issue, i.e.:
I have run the same query in Postgres and it runs without any issue. Is fetching data by joining tables not possible in Kafka?
I solved this issue by using a subquery. The problem was that when I used an alias, the aliased column was interpreted as a whole as the column name, which caused the error. Here is my query:
select * from (select p."Id" as "p_Id", p."CreatedDate" as p_createddate, p."ClassId" as p_classid, c."Id" as c_id, c."CreatedDate" as c_createddate, c."ClassId" as c_classid from "PolicyIssuance" p left join "ClassDetails" c on p."DocumentNumber" = c."DocumentNo") as result

Flink Table and SQL API

I want to know whether we can write a query using two tables (a join) with the Flink Table and SQL API.
I am new to Flink. I want to create two tables from two different data sets, query them, and produce another data set.
My query would be like select ... from table1, table2 ..., so can we write a query like this that queries two or more tables?
Thanks
Flink's Table API supports join operations (full, left, right, inner joins) on batch tables (e.g. those created from a DataSet).
SELECT c, g FROM Table3, Table5 WHERE b = e
For streaming tables (e.g. those created from a DataStream), Flink does not yet support join operations. But the Flink community is actively working to add them in the near future.
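As a rough Scala sketch of the batch case (ds3 and ds5 are assumed DataSets whose column names follow the documentation example above; exact method names vary across Flink versions, e.g. newer releases use BatchTableEnvironment.create(env), and older ones use sql() instead of sqlQuery()):
import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

// Register the two DataSets as tables that SQL can refer to by name.
tEnv.registerDataSet("Table3", ds3, 'a, 'b, 'c)
tEnv.registerDataSet("Table5", ds5, 'd, 'e, 'f, 'g, 'h)

// Join the two registered tables with plain SQL.
val result = tEnv.sqlQuery("SELECT c, g FROM Table3, Table5 WHERE b = e")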

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2 from TABLE_2")
table_2.join(table_1, Seq("key_attribute"), "left_outer")
Not much progress so far, because I am facing too many difficulties:
How do I handle the SELECT and process the data efficiently? Keep everything in separate DataFrames?
How do I apply the WHERE/GROUP BY clause with attributes from several sources?
Is there any other/better way than Spark SQL?
A few steps for handling this:
First, create a DataFrame from your raw data.
Then register it as a temporary table/view.
You can use filter(), or a WHERE condition in Spark SQL, to get the resulting DataFrame.
Then, as you already did, you can use joins between DataFrames. You can think of a DataFrame as a representation of a table.
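Putting those steps together, here is a rough Scala sketch of the original statement (it assumes table_1 and table_2 are already loaded as DataFrames, e.g. via spark.table, and uses a left_anti join to express the LEFT OUTER JOIN ... IS NULL pattern; this is an illustration, not a drop-in implementation):
import org.apache.spark.sql.functions._

// Keys present in TABLE_2 but missing from TABLE_1
// (the LEFT OUTER JOIN ... WHERE TABLE_1.key_attribute IS NULL part).
val newRows = table_2
  .where(col("key_attribute").isNotNull)
  .join(table_1, Seq("key_attribute"), "left_anti")
  .select("key_attribute").distinct()
  .withColumn("attribute_1", current_timestamp())
  .withColumn("attribute_2", lit("Some_String"))

// Append the result to TABLE_1 (assumes TABLE_1 exists in the metastore).
newRows.write.insertInto("TABLE_1")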
Regarding efficiency, since the processing is done in parallel, that is already taken care of. If you need anything more specific regarding efficiency, please mention it.