Apache Airflow - Slow to parse SQL queries on AWS MWAA - postgresql

I'm trying to build a DAG on AWS MWAA that exports data from Postgres (RDS) to S3. The problem appears when MWAA parses all the queries for my task: the DAG has to export 385 tables in total, but it gets stuck in the running state and never starts my task.
Basically, this process will:
Load the table schema
Rename Some Columns
Export data to S3
Function
def export_to_s3(dag, conn, db, pg_hook, export_date, s3_bucket, schemas):
    tasks = []
    run_queries = []
    for schema, features in schemas.items():
        t = features.get("tables")
        if t:
            tables = t
        else:
            tables = helper.get_tables(pg_hook, schema).table_name.tolist()
        is_full_export = features.get("full")
        for table in tables:
            columns = helper.get_table_schema(
                pg_hook, table, schema
            ).column_name.tolist()
            masked_columns = helper.masking_pii(columns, pii_columns=PII_COLS)
            masked_columns_str = ",\n".join(masked_columns)
            if is_full_export:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}"'
            else:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}" order by random() limit 10000'
            s3_bucket_key = export_date + "_" + schema + "_" + table + ".csv"
            sql_export = f"""
            SELECT * from aws_s3.query_export_to_s3(
            '{statement}',
            aws_commons.create_s3_uri(
            '{s3_bucket}',
            '{s3_bucket_key}',
            'ap-southeast-2'),
            options := 'FORMAT csv, DELIMITER $$|$$'
            )""".strip()
            run_queries.append(sql_export)
def get_table_schema(pg_hook, table_name, table_schema):
    """ Gets the schema details of a given table in a given schema."""
    query = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = '{0}'
    AND table_name = '{1}'
    order by ordinal_position
    """.format(table_schema, table_name)
    df_schema = pg_hook.get_pandas_df(query)
    return df_schema
def get_tables(pg_hook, schema):
    query = """
    select table_name from information_schema.tables
    where table_schema = '{}' and table_type = 'BASE TABLE' and table_name != '_sdc_rejected' """.format(schema)
    df_schema = pg_hook.get_pandas_df(query)
    return df_schema
Task
    task = PostgresOperator(
        sql=run_queries,
        postgres_conn_id=conn,
        task_id="export_to_s3",
        dag=dag,
        autocommit=True,
    )
    tasks.append(task)
    return tasks
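The inner SELECT is embedded in a single-quoted SQL literal, so a single quote inside a masked column expression would end that literal early. A minimal sketch of building one export statement with the quotes doubled. The helper name `build_export_sql` and the sample values are hypothetical and not part of the original DAG:

```python
def build_export_sql(statement, s3_bucket, s3_bucket_key, region="ap-southeast-2"):
    """Wrap a SELECT statement in an aws_s3.query_export_to_s3 call.

    Hypothetical helper: single quotes are doubled so that each value
    stays inside one SQL string literal.
    """
    def quote(value):
        return "'" + value.replace("'", "''") + "'"

    return (
        "SELECT * from aws_s3.query_export_to_s3(\n"
        f"    {quote(statement)},\n"
        f"    aws_commons.create_s3_uri({quote(s3_bucket)}, {quote(s3_bucket_key)}, {quote(region)}),\n"
        "    options := 'FORMAT csv, DELIMITER $$|$$'\n"
        ")"
    )


# Sample values, assumed for illustration only.
sql = build_export_sql(
    "select id, 'xxx' as email from mydb.public.\"users\"",
    "my-bucket",
    "2021-01-01_public_users.csv",
)
print(sql)
```

Keeping the string building in one small function also makes it easy to check a single statement without parsing the whole DAG.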
Airflow list_dags output
DAGS
-------------------------------------------------------------------
mydag
-------------------------------------------------------------------
DagBag loading stats for /usr/local/airflow/dags
-------------------------------------------------------------------
Number of DAGs: 1
Total task number: 3
DagBag parsing time: 159.94030800000002
-----------------------------------------------------+--------------------+---------+----------
file | duration | dag_num | task_num
-----------------------------------------------------+--------------------+---------+----------
/mydag.py | 159.05215199999998 | 1 | 3
/ActivationPriorityCallList/CallList_Generator.py | 0.878734 | 0 | 0
/ActivationPriorityCallList/CallList_Preprocessor.py | 0.00744 | 0 | 0
/ActivationPriorityCallList/CallList_Emailer.py | 0.001154 | 0 | 0
/airflow_helperfunctions.py | 0.000828 | 0 | 0
-----------------------------------------------------+--------------------+---------+----------
Observation
If I enable only one table to be loaded in the task, it works well, but it fails if all tables are enabled.
The behavior is the same if I run Airflow from Docker pointing to RDS.

The issue was solved when I increased these two settings on MWAA:
webserver.web_server_master_timeout
webserver.web_server_worker_timeout
The default value was 30 (seconds); I changed both to 480.
Link to the documentation:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
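Separately from the timeouts, the parsing time reported by list_dags appears to come from mydag.py running one information_schema query per table whenever the file is parsed. One possible way to cut that down, sketched under assumptions, is to fetch the columns of a whole schema in a single query and group them in Python. `COLUMNS_QUERY`, `group_columns` and the sample rows are hypothetical; in the DAG the rows would come from the hook (for example `pg_hook.get_records`), and unlike `get_tables` this query does not filter on BASE TABLE.

```python
from collections import defaultdict

# One query per schema instead of one query per table (assumed replacement).
COLUMNS_QUERY = """
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = %s
ORDER BY table_name, ordinal_position
"""


def group_columns(rows):
    """Group (table_name, column_name) rows into {table_name: [column, ...]}."""
    columns_by_table = defaultdict(list)
    for table_name, column_name in rows:
        columns_by_table[table_name].append(column_name)
    return dict(columns_by_table)


# Sample rows standing in for the query result.
rows = [("users", "id"), ("users", "email"), ("orders", "id")]
columns_by_table = group_columns(rows)
print(columns_by_table)
```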


Querying jsonb field with @> through Postgrex adapter

I'm trying to query a jsonb field via the Postgrex adapter, but I receive errors I cannot understand.
Notification schema
def all_for(user_id, external_id) do
  from(n in __MODULE__,
    where: n.to == ^user_id and fragment("? @> '{\"external_id\": ?}'", n.data, ^external_id)
  )
  |> order_by(desc: :id)
end
it generates the following sql
SELECT n0."id", n0."data", n0."to", n0."inserted_at", n0."updated_at" FROM "notifications"
AS n0 WHERE ((n0."to" = $1) AND n0."data" @> '{"external_id": $2}') ORDER BY n0."id" DESC
and then I receive the following error
↳ :erl_eval.do_apply/6, at: erl_eval.erl:680
** (Postgrex.Error) ERROR 22P02 (invalid_text_representation) invalid input syntax for type json. If you are trying to query a JSON field, the parameter may need to be interpolated. Instead of
p.json["field"] != "value"
do
p.json["field"] != ^"value"
query: SELECT n0."id", n0."data", n0."to", n0."inserted_at", n0."updated_at" FROM "notifications" AS n0 WHERE ((n0."to" = $1) AND n0."data" @> '{"external_id": $2}') ORDER BY n0."id" DESC
Token "$" is invalid.
(ecto_sql 3.9.1) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql 3.9.1) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
(ecto 3.9.2) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
(ecto 3.9.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
However, if I copy and paste the generated SQL into the psql console with the values filled in and run it, it succeeds.
SELECT n0."id", n0."data", n0."to", n0."inserted_at", n0."updated_at" FROM "notifications" AS n0 WHERE ((n0."to" = 233) AND n0."data" @> '{"external_id": 11}') ORDER BY n0."id" DESC
notifications-# ;
id | data | to | inserted_at | updated_at
----+---------------------+-----+---------------------+---------------------
90 | {"external_id": 11} | 233 | 2022-12-15 14:07:44 | 2022-12-15 14:07:44
(1 row)
data is jsonb column
Column | Type | Collation | Nullable | Default
-------------+--------------------------------+-----------+----------+-------------------------------------------
data | jsonb | | | '{}'::jsonb
What am I missing in my elixir notification query code?
Searching for a solution, I only came across using a raw SQL statement, as I couldn't figure out what's wrong with my query when it gets passed through Postgrex.
So as a solution I ended up with the following:
def all_for(user_id, external_ids) do
  {:ok, result} =
    Ecto.Adapters.SQL.query(
      Notifications.Repo,
      search_by_external_id_query(user_id, external_ids)
    )
  Enum.map(result.rows, &Map.new(Enum.zip(result.columns, &1)))
end
defp search_by_external_id_query(user_id, external_id) do
  """
  SELECT * FROM "notifications" AS n0 WHERE ((n0."to" = #{user_id})
  AND n0.data @> '{\"external_id\": #{external_id}}')
  ORDER BY n0."id" DESC
  """
end
Be aware that this returns a list of maps rather than Ecto.Schema structs, which is what I would get from an Ecto.Query run through Postgrex.
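The original error comes from the placeholder sitting inside a quoted literal: the server receives the text `$2` as part of the JSON string rather than a bound value. A sketch of the usual alternative, shown in Python purely for illustration: build the JSON document in application code and bind it as one parameter to the jsonb containment operator `@>`. The function name and the `%s` placeholder style (as used by DB-API drivers such as psycopg2) are assumptions; none of this is Ecto or Postgrex API.

```python
import json


def notifications_query(user_id, external_id):
    """Return (sql, params) with the JSON document bound as a single parameter."""
    sql = (
        'SELECT * FROM "notifications" AS n0 '
        'WHERE n0."to" = %s AND n0."data" @> %s::jsonb '
        'ORDER BY n0."id" DESC'
    )
    # The whole JSON document is one value; nothing is spliced into the SQL text.
    params = (user_id, json.dumps({"external_id": external_id}))
    return sql, params


sql, params = notifications_query(233, 11)
print(sql)
print(params)
```

Binding the value also avoids the string interpolation used in the raw SQL workaround, which would be open to SQL injection if the ids came from user input.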

RedShift copy command return

Can we get the number of rows inserted through the COPY command? Some records might fail, so how many records were successfully inserted?
I have a file with JSON objects in Amazon S3 and am trying to load the data into Redshift using the COPY command. How do I know how many records were successfully inserted and how many failed?
Loading some example data:
db=# copy test from 's3://bucket/data' credentials '' maxerror 5;
INFO: Load into table 'test' completed, 4 record(s) loaded successfully.
COPY
db=# copy test from 's3://bucket/err_data' credentials '' maxerror 5;
INFO: Load into table 'test' completed, 1 record(s) loaded successfully.
INFO: Load into table 'test' completed, 2 record(s) could not be loaded. Check 'stl_load_errors' system table for details.
COPY
Then the following query:
with _successful_loads as (
    select
        stl_load_commits.query
        , listagg(trim(filename), ', ') within group(order by trim(filename)) as filenames
    from stl_load_commits
    left join stl_query using(query)
    left join stl_utilitytext using(xid)
    where rtrim("text") = 'COMMIT'
    group by query
),
_unsuccessful_loads as (
    select
        query
        , count(1) as errors
    from stl_load_errors
    group by query
)
select
    query
    , filenames
    , sum(stl_insert.rows) as rows_loaded
    , max(_unsuccessful_loads.errors) as rows_not_loaded
from stl_insert
inner join _successful_loads using(query)
left join _unsuccessful_loads using(query)
group by query, filenames
order by query, filenames
;
Giving:
query | filenames | rows_loaded | rows_not_loaded
-------+------------------------------------------------+-------------+-----------------
45597 | s3://bucket/err_data.json | 1 | 2
45611 | s3://bucket/data1.json, s3://bucket/data2.json | 4 |
(2 rows)
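If the client that issues the COPY exposes the server's INFO messages as plain strings, the counts can also be read from them directly. A small sketch under that assumption. The function name is hypothetical, and the sample messages are copied from the output shown earlier:

```python
import re

# Patterns matching the two INFO messages printed by COPY above.
LOADED = re.compile(r"(\d+) record\(s\) loaded successfully")
FAILED = re.compile(r"(\d+) record\(s\) could not be loaded")


def summarize_copy_notices(notices):
    """Return (rows_loaded, rows_not_loaded) summed over the notice lines."""
    loaded = failed = 0
    for line in notices:
        match = LOADED.search(line)
        if match:
            loaded += int(match.group(1))
        match = FAILED.search(line)
        if match:
            failed += int(match.group(1))
    return loaded, failed


notices = [
    "INFO: Load into table 'test' completed, 1 record(s) loaded successfully.",
    "INFO: Load into table 'test' completed, 2 record(s) could not be loaded. "
    "Check 'stl_load_errors' system table for details.",
]
summary = summarize_copy_notices(notices)
print(summary)  # (1, 2)
```

This depends on the exact message wording, so the system-table query is the more robust source.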

Hyphen as blend_char doesn't seem to work

MariaDB> select id,name from t where type='B' and name='Foo-Bar';
+----------------+---------+
| item_source_id | name |
+----------------+---------+
| 2000245 | Foo-Bar |
+----------------+---------+
1 row in set (0.00 sec)
index base_index { # Don't use this directly; it's for inheritance only.
    blend_chars = +, &, U+23, U+22, U+27, -, /
    blend_mode = trim_none, trim_head, trim_tail, trim_both
}
source b_source : base_source {
    sql_query = select id,name from t where type='B'
    sql_field_string = name
}
index b_index_lemma : base_index {
    source = b_source
    path = /path/b_index_lemma
    morphology = lemmatize_en_all
}
SphinxQL> select * from b_index_lemma where match('Foo-Bar');
Empty set (0.00 sec)
Other Sphinx queries have results, so the problem isn't e.g. that the index is empty. Yet the hyphenated form does not, and I'd like it to. Am I misusing blend_chars-cum-blend_mode?

How to Check for Two Columns and Query Every Table When They Exist?

I'm interested in doing a COUNT(*), SUM(LENGTH(blob)/1024./1024.), and ORDER BY SUM(LENGTH(blob)) for every table in my database that has a 'blob' column. I'd like to GROUP BY the 'synchlevel' column; for tables where synchlevel does not exist, I still want the output:
Example
+--------+------------+--------+-----------+
| table | synchlevel | count | size_mb |
+--------+------------+--------+-----------+
| tableA | 0 | 924505 | 3013.47 |
| tableA | 7 | 981 | 295.33 |
| tableB | 6 | 1449 | 130.50 |
| tableC | 1 | 64368 | 68.43 |
| tableD | NULL | 359 | .54 |
| tableD | NULL | 778 | .05 |
+--------+------------+--------+-----------+
I would like to do a pure SQL solution, but I'm having a bit of difficulty with that. Currently, I'm wrapping some SQL into BASH.
#!/bin/bash
USER=$1
DBNAME=$2
function psql_cmd(){
    cmd=$1
    prefix='\pset border 2 \\ '
    echo $prefix $cmd | psql -U $USER $DBNAME | grep -v "Border\| row"
}
function synchlevels(){
    echo "===================================================="
    echo " SYNCH LEVEL STATS "
    echo "===================================================="
    tables=($(psql -U $USER -tc "SELECT table_name FROM information_schema.columns
                                 WHERE column_name = 'blob';" $DBNAME))
    for table in ${tables[@]}; do
        count_size="SELECT t.synchlevel,
                    COUNT(t.blob) AS count,
                    to_char(SUM(LENGTH(t.blob)/1024./1024.),'99999D99') AS size_mb
                    FROM $table AS t
                    GROUP BY t.synchlevel
                    ORDER BY SUM(LENGTH(t.blob)) DESC;"
        echo $table
        psql_cmd "$count_size"
    done
    echo "===================================================="
}
I could extend this by creating a second tables BASH array of tables which have the 'synchlevel' column, compare and use that list to run through the SQL, but I was wondering if there was a way I could just do the SQL portion purely in SQL without resorting to making these lists in BASH and doing the comparisons externally. i.e. I want to avoid needing to externally loop through the tables and making numerous queries in tables=($(psql -U $USER....
I've tried the following SQL to test on a table where I know the column doesn't exist...
SELECT
CASE WHEN EXISTS(SELECT * FROM information_schema.columns
WHERE column_name = 'synchlevel'
AND table_name = 'archivemetadata')
THEN synchlevel
END,
COUNT(blob) AS count,
to_char(SUM(LENGTH(blob)/1024./1024.),'99999D99') AS size_mb
FROM archivemetadata, information_schema.columns AS info
WHERE info.column_name = 'blob'
However, it fails on THEN synchlevel for tables where it doesn't exist. It seems really simple to do, but I just can't seem to find a way to do this which doesn't require either:
Resorting to external array comparisons in BASH.
Can be done, but I'd like to simplify my solution rather than add another layer.
Creating PL/PGSQL functions.
This script is really just to help with some database data analysis for improving performance in a third-party software. We are not a shop of DB Admins, so I would prefer not to dive into PL/PGSQL as that would require more folks from our shop to also become acquainted with the language in order to support the script. Again, simplicity is the motivation here.
Postgresql 8.4 is the engine. (We cannot upgrade due to security constraints by an overseeing IT body.)
Thanks for any suggestions you might have!
The following is untested, but how about creating some dynamic sql in one psql session and piping it to another? Grouping by table name yields one statement per table, and each generated statement ends with a semicolon so the second psql runs them separately.
psql -d <yourdb> -qtAc "
select 'select ' || (case when bool_or(info.column_name = 'synchlevel') then 'synchlevel, ' else '' end) ||
'count(*) as cnt, ' ||
'to_char(SUM(LENGTH(blob)::NUMERIC/1024/1024),''99999D99'') AS size_mb' ||
' from ' || info.table_name ||
(case when bool_or(info.column_name = 'synchlevel') then ' group by synchlevel order by synchlevel' else '' end) || ';'
from information_schema.columns as info
where info.table_name IN (select distinct table_name from information_schema.columns where column_name = 'blob')
group by info.table_name" | psql -d <yourdb>
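The "generate the SQL, then run it" idea can be sketched outside the database as well. The following Python sketch builds one statement per table that has a blob column and leaves out the GROUP BY when synchlevel is missing. The function names and the sample column map are hypothetical; in practice the map would be filled from information_schema.columns.

```python
def build_stats_sql(table, has_synchlevel):
    """Build the count/size statement for one table (hypothetical helper)."""
    level = "synchlevel" if has_synchlevel else "NULL AS synchlevel"
    sql = (
        f"SELECT {level}, COUNT(blob) AS count, "
        "to_char(SUM(LENGTH(blob)/1024./1024.),'99999D99') AS size_mb "
        f"FROM {table}"
    )
    if has_synchlevel:
        sql += " GROUP BY synchlevel ORDER BY SUM(LENGTH(blob)) DESC"
    return sql + ";"


def build_stats_queries(columns_by_table):
    """Return one statement per table that has a 'blob' column."""
    return [
        build_stats_sql(table, "synchlevel" in columns)
        for table, columns in sorted(columns_by_table.items())
        if "blob" in columns
    ]


# Sample column map, assumed for illustration.
columns_by_table = {
    "tablea": {"id", "blob", "synchlevel"},
    "tabled": {"id", "blob"},
    "other": {"id"},
}
queries = build_stats_queries(columns_by_table)
for query in queries:
    print(query)
```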

Split a string and populate a table for all records in table in SQL Server 2008 R2

I have a table EmployeeMoves:
| EmployeeID | CityIDs
+------------------------------
| 24 | 23,21,22
| 25 | 25,12,14
| 29 | 1,2,5
| 31 | 7
| 55 | 11,34
| 60 | 7,9,21,23,30
I'm trying to figure out how to expand the comma-delimited values from the EmployeeMoves.CityIDs column to populate an EmployeeCities table, which should look like this:
| EmployeeID | CityID
+------------------------------
| 24 | 23
| 24 | 21
| 24 | 22
| 25 | 25
| 25 | 12
| 25 | 14
| ... and so on
I already have a function called SplitADelimitedList that splits a comma-delimited list of integers into a rowset. It takes the delimited list as a parameter. The SQL below will give me a table with split values under the column Value:
select value from dbo.SplitADelimitedList ('23,21,1,4');
| Value
+-----------
| 23
| 21
| 1
| 4
The question is: How do I populate EmployeeCities from EmployeeMoves with a single (even if complex) SQL statement using the comma-delimited list of CityIDs from each row in the EmployeeMoves table, but without any cursors or looping in T-SQL? I could have 100 records in the EmployeeMoves table for 100 different employees.
This is how I tried to solve this problem. It seems to work and is very quick in performance.
INSERT INTO EmployeeCities
SELECT
em.EmployeeID,
c.Value
FROM EmployeeMoves em
CROSS APPLY dbo.SplitADelimitedList(em.CityIDs) c;
UPDATE 1:
This update provides the definition of the user-defined function dbo.SplitADelimitedList. This function is used in the query above to split a comma-delimited list into a table of integer values.
CREATE FUNCTION dbo.fn_SplitADelimitedList1
(
    @String NVARCHAR(MAX)
)
RETURNS @SplittedValues TABLE(
    Value INT
)
AS
BEGIN
    DECLARE @SplitLength INT
    DECLARE @Delimiter VARCHAR(10)
    SET @Delimiter = ',' --set this to the delimiter you are using
    WHILE len(@String) > 0
    BEGIN
        SELECT @SplitLength = (CASE charindex(@Delimiter, @String)
                               WHEN 0 THEN
                                   datalength(@String) / 2
                               ELSE
                                   charindex(@Delimiter, @String) - 1
                               END)
        INSERT INTO @SplittedValues
        SELECT cast(substring(@String, 1, @SplitLength) AS INTEGER)
        WHERE
            ltrim(rtrim(isnull(substring(@String, 1, @SplitLength), ''))) <> '';
        SELECT @String = (CASE ((datalength(@String) / 2) - @SplitLength)
                          WHEN 0 THEN
                              ''
                          ELSE
                              right(@String, (datalength(@String) / 2) - @SplitLength - 1)
                          END)
    END
    RETURN
END
Preface
This is not the right way to do it. You shouldn't create comma-delimited lists in SQL Server. This violates first normal form, which should sound like an unbelievably vile expletive to you.
It is trivial for a client-side application to select rows of employees and related cities and display this as a comma-separated list. It shouldn't be done in the database. Please do everything you can to avoid this kind of construction in the future. If at all possible, you should refactor your database.
The Right Answer
To get the list of cities, properly expanded, from a table containing lists of cities, you can do this:
INSERT dbo.EmployeeCities
SELECT
    M.EmployeeID,
    C.Value
FROM
    EmployeeMoves M
    CROSS APPLY dbo.SplitADelimitedList(M.CityIDs) C
;
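To make the row expansion concrete, here is a small Python sketch of what CROSS APPLY does with a splitting function: the function runs once per input row, and every value it returns is paired with that row's EmployeeID. The function names are hypothetical, and the sample rows are taken from the EmployeeMoves table in the question.

```python
def split_delimited_list(text, delimiter=","):
    """Split a delimited list into integers, skipping empty items."""
    return [int(part) for part in text.split(delimiter) if part.strip()]


def expand_moves(employee_moves):
    """Pair each EmployeeID with every CityID in its delimited list."""
    return [
        (employee_id, city_id)
        for employee_id, city_ids in employee_moves
        for city_id in split_delimited_list(city_ids)
    ]


employee_moves = [(24, "23,21,22"), (31, "7"), (55, "11,34")]
employee_cities = expand_moves(employee_moves)
print(employee_cities)
# [(24, 23), (24, 21), (24, 22), (31, 7), (55, 11), (55, 34)]
```

As with CROSS APPLY, a row whose list is empty produces no output rows.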
The Wrong Answer
I wrote this answer due to a misunderstanding of what you wanted: I thought you were trying to query against properly-stored data to produce a list of comma-separated CityIDs. But I realize now you wanted the reverse: to query the list of cities using existing comma-separated values already stored in a column.
WITH EmployeeData AS (
SELECT
M.EmployeeID,
M.CityID
FROM
dbo.SplitADelimitedList ('23,21,1,4') C
INNER JOIN dbo.EmployeeMoves M
ON Convert(int, C.Value) = M.CityID
)
SELECT
E.EmployeeID,
CityIDs = Substring((
SELECT ',' + Convert(varchar(max), CityID)
FROM EmployeeData C
WHERE E.EmployeeID = C.EmployeeID
FOR XML PATH (''), TYPE
).value('.[1]', 'varchar(max)'), 2, 2147483647)
FROM
(SELECT DISTINCT EmployeeID FROM EmployeeData) E
;
Part of my difficulty in understanding is that your question is a bit disorganized. Next time, please clearly label your example data and show what you have, and what you're trying to work toward. Since you put the data for EmployeeCities last, it looked like it was what you were trying to achieve. It's not a good use of people's time when questions are not laid out well.