How to pull data by inner joining two DB2 tables using pyspark?

I'm using the following query to pull data from DB2:
SELECT A.Emp_ID,B1.Manager_Name,B1.Manager_Phone,B1.Manager_mail
FROM Employee A
INNER JOIN Manager_DETAIL B1
ON (B1.EMP_ID = A.EMP_ID
OR B1.Manager_mail = A.SuperVisor_mail)
AND B1.Join_year = '2017' AND B1.QTR = 'Q1'
AND B1.Dept_Name IN ('support')
How do I do the same thing using pyspark?
I tried this code:
tab_A = spark.read.jdbc("My Connection String", "Employee",
                        properties={"user": "my user id",
                                    "password": "my password",
                                    "driver": "com.ibm.db2.jcc.DB2Driver"})
tab_A.registerTempTable('data_table')
# query to get columns necessary to create indexes
sql = "SELECT * FROM data_table"
A = spark.sql(sql)
tab_B = spark.read.jdbc("My Connection String", "Manager_DETAIL",
                        properties={"user": "my user id",
                                    "password": "my password",
                                    "driver": "com.ibm.db2.jcc.DB2Driver"})
tab_B.registerTempTable('data_table1')
# query to get columns necessary to create indexes
sql = "SELECT * FROM data_table1"
B1 = spark.sql(sql)
C=spark.sql("SELECT A.Emp_ID,B1.Manager_Name,B1.Manager_Phone,B1.Manager_mail
FROM A
INNER JOIN B1
ON (B1.EMP_ID == A.EMP_ID) |\
(B1.Manager_mail == A.SuperVisor_mail) \
& (B1.Join_year == '2017' & B1.QTR == 'Q1') \
& B1.Dept_Name IN ('support')")
But I'm getting an invalid syntax error.
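The attempt above puts Python column operators (==, |, &) inside a SQL string, spreads a double-quoted string over several lines, and refers to the Python variables A and B1 rather than the registered view names. A minimal sketch of one way it could be written follows. It assumes the intended grouping is (EMP_ID match OR mail match) AND the remaining filters; build_join_sql and run_join are hypothetical helper names, and the connection values are placeholders.

```python
def build_join_sql(emp_view="data_table", mgr_view="data_table1"):
    """Return Spark SQL for the join, aliasing the temp views as A and B1."""
    # The text passed to spark.sql() is SQL, so it uses =, OR and AND
    # rather than the Python column operators ==, | and &.
    return f"""
        SELECT A.Emp_ID, B1.Manager_Name, B1.Manager_Phone, B1.Manager_mail
        FROM {emp_view} A
        INNER JOIN {mgr_view} B1
          ON (B1.EMP_ID = A.EMP_ID
              OR B1.Manager_mail = A.SuperVisor_mail)
         AND B1.Join_year = '2017'
         AND B1.QTR = 'Q1'
         AND B1.Dept_Name IN ('support')
    """


def run_join(spark, url, props):
    """Load both DB2 tables over JDBC, register them as views and join them."""
    spark.read.jdbc(url, "Employee", properties=props) \
        .createOrReplaceTempView("data_table")
    spark.read.jdbc(url, "Manager_DETAIL", properties=props) \
        .createOrReplaceTempView("data_table1")
    # A triple-quoted string lets the SQL span several lines.
    return spark.sql(build_join_sql())
```

With an existing SparkSession, calling run_join(spark, "My Connection String", properties) with the same properties dict as in the question would return the joined DataFrame.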

Related

Page result returns different values than the generated SQL

After upgrading to spring-data-jpa 3.0.0, a JPQL query that uses Pageable returns fewer elements than expected.
When I execute the generated SQL from the console, it returns the correct number of elements.
I don't see a count SQL query being generated when using spring-data-jpa 3.0.0.
JPQL Query:
@Query(value = "select fsd from FeeScheduleDrugEntity fsd "
+ "left join DrugNdcEntity ndc on fsd.drug.id = ndc.drug.id and fsd.drug.noc = true "
+ "left join FeeScheduleSourceEntity fsse on fsse.id = fsd.drugFeeScheduleSource.id "
+ "where fsd.feeSchedule.id = :feeScheduleId ")
Generated SQL:
select f1_0.fee_schedule_drug_id,
f1_0.allowable_per_billing_unit,
f1_0.fee_schedule_drug_source,
f1_0.created_at,
f1_0.created_by,
f1_0.drug_id,
f1_0.fee_schedule_item_source_id,
f1_0.fee_schedule_id,
f1_0.modified_at,
f1_0.modified_by
from core.fee_schedule_drug f1_0
join core.drug d2_0 on d2_0.drug_id = f1_0.drug_id
left join core.drug_ndc d1_0 on f1_0.drug_id = d1_0.drug_id and d2_0.is_noc = true
left join core.fee_schedule_item_source f2_0
on f2_0.fee_schedule_item_source_id = f1_0.fee_schedule_item_source_id
where f1_0.fee_schedule_id=?
order by d1_0.ndc asc
offset ? rows fetch first ? rows only
I expect the paged query to return the same number of results as the generated SQL.

Querying Partitioned Table in Lambda

I have a partitioned table, 'table1'. I can run a SELECT on it in Athena and get results.
However, when I run the query on 'table1' from a Lambda function, I get the following error:
'SYNTAX_ERROR: line 1:8: SELECT * not allowed from relation that has no columns'
Below is the Python script for the Lambda function:
import boto3

client = boto3.client('athena')

# Set up and submit the query
response = client.start_query_execution(
    QueryString='SELECT * FROM table1',
    QueryExecutionContext={
        'Database': 'test'
    },
    ResultConfiguration={
        'OutputLocation': 's3://test/'
    }
)
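One possible cause, offered as an assumption rather than a diagnosis, is that the Lambda resolves 'test.table1' to a different catalog entry than the Athena console does (for example in another region), and that entry lists no columns. A minimal sketch for checking what the Glue Data Catalog reports from inside the Lambda follows; table_columns is a hypothetical helper, and the Lambda role is assumed to have the glue:GetTable permission.

```python
def table_columns(glue_client, database, table):
    """Return (data column names, partition key names) for a Glue table."""
    tbl = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    # Data columns live under StorageDescriptor; partition keys are listed separately.
    columns = [c["Name"] for c in tbl.get("StorageDescriptor", {}).get("Columns", [])]
    partitions = [c["Name"] for c in tbl.get("PartitionKeys", [])]
    return columns, partitions
```

Called as table_columns(boto3.client('glue'), 'test', 'table1') inside the handler, an empty first list would be consistent with the error message, while a populated one would point the search elsewhere.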

Delete in Apache Hudi - Glue Job

I have to build a Glue job that updates and deletes old rows in an Athena table.
When I run the job to delete rows, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose rows have to be updated or deleted, and the second is the table that receives the new update and delete records.
In ds I have selected all rows that have to be deleted from the old table.
op stands for operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation in your code is invalid; the supported write operations are upsert, insert and bulk_insert. Check the Hudi documentation.
Also, what is your intention when deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
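A minimal sketch of the hard-delete configuration described in this answer follows. It assumes a Hudi version in which the operation has to be 'upsert', as the answer states, and that the other write options from the question stay unchanged; hard_delete_options is a hypothetical helper name.

```python
EMPTY_PAYLOAD = "org.apache.hudi.common.model.EmptyHoodieRecordPayload"


def hard_delete_options(base_options):
    """Return a copy of base_options set up for a hard delete via upsert."""
    options = dict(base_options)  # leave the caller's dict untouched
    # Upsert the keys to delete with an empty payload class.
    options["hoodie.datasource.write.operation"] = "upsert"
    options["hoodie.datasource.write.payload.class"] = EMPTY_PAYLOAD
    return options


# Options taken from the question, minus the operation.
base = {
    "hoodie.table.name": "test_table_output",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.table.name": "test_table_output",
    "hoodie.datasource.write.precombine.field": "name",
}
hudi_delete_options = hard_delete_options(base)
```

The resulting dict would be passed the same way as in the question, through options(**hudi_delete_options) on the DataFrame writer.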

OrientDB SQL: select from multiple tables

I can't run "select from multiple tables" commands in OrientDB 3.0 (CentOS).
I tested with the following command:
SELECT *
FROM Employee A, City B
WHERE A.city = B.id
Error message: "Error parsing query: ^ Encountered " "SELECT "" at line 1, column 1. Was expecting one of: ... ... ";" ... DB name="
The most important difference between OrientDB and a relational database is that relationships are represented by LINKS instead of JOINs.
For this reason, the classic JOIN syntax is not supported. OrientDB uses the "dot (.) notation" to navigate LINKS.
Example 1: In SQL you might create a join such as:
SELECT *
FROM Employee A, City B
WHERE A.city = B.id
AND B.name = 'Rome'
In OrientDB an equivalent operation would be:
SELECT * FROM Employee WHERE city.name = 'Rome'
For more information: https://orientdb.com/docs/2.2.x/SQL.html#joins
Hope it helps
Regards

AS400 index configuration table

How can I view the indexes of a particular table on AS400? In which table are a table's index descriptions stored?
If your "index" is really a logical file, you can see a list of these using:
select * from qsys2.systables
where table_schema = 'YOURLIBNAME' and table_type = 'L'
To complete the previous answers: if your AS400/IBM i files are "IBM old style" physical and logical files, qsys2.syskeys and qsys2.sysindexes are empty.
In that case you can retrieve the index information from the QADBKFLD table (for "index" info) and the QADBXREF table (for the field list):
select * from QSYS.QADBXREF where DBXFIL = 'YOUR_LOGICAL_FILE_NAME' and DBXLIB = 'YOUR_LIBRARY'
select * from QSYS.QADBKFLD where DBKFIL = 'YOUR_LOGICAL_FILE_NAME' and DBKLB2 = 'YOUR_LIBRARY'
WARNING: YOUR_LOGICAL_FILE_NAME is not your "table name" but the name of the logical file! You have to join another table, QSYS.QADBFDEP, to match LOGICAL_FILE_NAME to TABLE_NAME.
To find the indexes from your table's name:
Select r.*
from QSYS.QADBXREF r, QSYS.QADBFDEP d
where d.DBFFDP = r.DBXFIL and d.DBFLIB=r.DBXLIB
and d.DBFFIL = 'YOUR_TABLE_NAME' and d.DBFLIB = 'YOUR_LIBRARY'
To find all the index fields for your table:
Select DBXFIL , f.DBKFLD, DBKPOS , t.DBXUNQ
from QSYS.QADBXREF t
INNER JOIN QSYS.QADBKFLD f on DBXFIL = DBKFIL and DBXLIB = DBKLIB
INNER JOIN QSYS.QADBFDEP d on d.DBFFDP = t.DBXFIL and d.DBFLIB=t.DBXLIB
where d.DBFFIL = 'YOUR_TABLE_NAME' and d.DBFLIB = 'YOUR_LIBRARY'
order by DBXFIL, DBKPOS
If your indexes were created with SQL, you can see the list of indexes in the sysindexes system view:
SELECT * FROM qsys2.sysindexes WHERE TABLE_SCHEMA='YOURLIBNAME' and
TABLE_NAME = 'YOURTABLENAME'
If you want the column details for each index, you can join the syskeys table:
SELECT KEYS.INDEX_NAME, KEYS.COLUMN_NAME
FROM qsys2.syskeys KEYS
JOIN qsys2.sysindexes IX ON KEYS.ixname = IX.name
WHERE TABLE_SCHEMA='YOURLIBNAME' and TABLE_NAME = 'YOURTABLENAME'
order by INDEX_NAME
You could also use commands to get the information. Command DSPDBR FILE(LIBNAME/FILENAME) will show a list of the objects dependent on a physical file. The objects that show a data dependency can then be further explored by running DSPFD FILE(LIBNAME/FILENAME). This will show the access paths of the logical file.