Implementing Hive UNION in PySpark

I am trying to read a SQL query from a file and run it inside a PySpark job. The SQL is structured as below:
select <statements>
sort by rand()
limit 333333
UNION ALL
select <statements>
sort by rand()
limit 666666
Here is the error I get when I run it:
pyspark.sql.utils.ParseException: u"\nmismatched input 'UNION'
expecting {<EOF>, '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE',
RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*',
'/', '%', 'DIV', '&', '|', '^'}
Is this because UNION ALL/UNION is not supported by Spark SQL, or has the parsing gone wrong somewhere?

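Both PySpark and Hive support UNION in SQL statements. The key is to wrap each SELECT in parentheses so that its ORDER BY and LIMIT stay attached to that branch instead of running into the UNION keyword.
I am able to run the following Hive statement: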
(SELECT * from x ORDER BY rand() LIMIT 50)
UNION
(SELECT * from y ORDER BY rand() LIMIT 50)
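In PySpark you can also run each query separately and combine the DataFrames. Note that DataFrame.union keeps duplicate rows (UNION ALL semantics), so append .distinct() if you need SQL UNION behavior: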
df1=spark.sql('SELECT * from x ORDER BY rand() LIMIT 50')
df2=spark.sql('SELECT * from y ORDER BY rand() LIMIT 50')
df=df1.union(df2)
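For the case in the question, where the SQL is assembled or read from a file, the parenthesized form can be built mechanically. The following is only a sketch: the helper name parenthesized_union is made up for illustration, and the tables x and y are the placeholders from the answer above.

```python
def parenthesized_union(branches, operator="UNION ALL"):
    """Join SELECT statements, wrapping each one in parentheses.

    The parentheses keep each branch's ORDER BY / SORT BY / LIMIT attached
    to that branch instead of leaving it dangling in front of UNION.
    """
    # Drop surrounding whitespace and any trailing semicolon per branch.
    cleaned = [b.strip().rstrip(";") for b in branches]
    return f"\n{operator}\n".join(f"({b})" for b in cleaned)


# Placeholder table names x and y, as in the answer above.
query = parenthesized_union([
    "SELECT * FROM x ORDER BY rand() LIMIT 50",
    "SELECT * FROM y ORDER BY rand() LIMIT 50",
])
print(query)
```

The resulting string could then be passed to spark.sql(query) in a job that already has a SparkSession.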


If I have comma-separated values in a column and use split_part to match them in a WHERE condition, the query is slow in pgAdmin 4

In PostgreSQL, I have a column receipt_id which holds either comma-separated values or a single value.
Eg:
I need to match the values in the third column against another table called voucher in a WHERE condition.
I have used split_part:
select ap.document_no AS invoice_number,
ap.curr_date AS invoice_date,ap.receipt_id,split_part(ap.receipt_id::text, ','::text, 1),
split_part(ap.receipt_id::text, ','::text, 2)
from ap_invoice_creation ap , voucher v
where (v.voucher_id::text IN
( SELECT split_part(ap_invoice_creation.receipt_id::text, ','::text, 1) AS parts
FROM ap_invoice_creation
WHERE ap_invoice_creation.receipt_id::text = ap.receipt_id::text
UNION
SELECT split_part(ap_invoice_creation.receipt_id::text, ','::text, 2) AS parts
FROM ap_invoice_creation
WHERE ap_invoice_creation.receipt_id::text = ap.receipt_id::text
UNION
SELECT split_part(ap_invoice_creation.receipt_id::text, ','::text, 3) AS parts
FROM ap_invoice_creation
WHERE ap_invoice_creation.receipt_id::text = ap.receipt_id::text)) AND ap.status::text = 'Posted'::text
This is only one part of the query, but it is slow, and it slows down the entire query.
Is there any other way to handle this?
Ideally, you should not be storing CSV data like this at all. That being said, there is no need for SPLIT_PART() or the big, ugly union here. Consider this version:
SELECT
ap.document_no AS invoice_number,
ap.curr_date AS invoice_date,
ap.receipt_id,
SPLIT_PART(ap.receipt_id::text, ',', 1),
SPLIT_PART(ap.receipt_id::text, ',', 2)
FROM ap_invoice_creation ap
INNER JOIN voucher v
ON ',' || ap.receipt_id || ',' LIKE '%,' || v.voucher_id::text || ',%';
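The reason the join condition wraps both sides in commas can be checked outside the database. The sketch below mimics the LIKE test in Python; the function name csv_contains is hypothetical, and it assumes the ids contain no spaces and no LIKE wildcard characters (% or _).

```python
def csv_contains(csv_value, item):
    """Mimic: ',' || csv_value || ',' LIKE '%,' || item || ',%'."""
    # Padding both sides with commas means every id, including the first
    # and last, is surrounded by delimiters, so only whole ids can match.
    return f",{item}," in f",{csv_value},"


receipt_id = "12,345,6"  # made-up sample value

print(csv_contains(receipt_id, "345"))  # whole id: matches
print(csv_contains(receipt_id, "34"))   # partial id: does not match
print("34" in receipt_id)               # naive substring test would match
```

Without the padding, a voucher id such as 34 would wrongly match the receipt list 12,345,6.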
One way to improve performance would be to get rid of the dreaded CSV value, and store the data in a proper one-to-many relationship.
If you can't do that, you can use an EXISTS condition rather than a cross join with an IN condition that uses three queries:
select ap.document_no AS invoice_number,
ap.curr_date AS invoice_date,
ap.receipt_id,
split_part(ap.receipt_id, ',', 1),
split_part(ap.receipt_id, ',', 2)
from ap_invoice_creation ap
WHERE EXISTS (select *
from voucher v
where v.voucher_id::text = any (string_to_array(ap.receipt_id, ',')))
AND ap.status = 'Posted'

ERROR: syntax error at or near "BY"¶ Position: 321

How can I convert the Oracle query below to Postgres? It currently fails with this error:
ERROR: syntax error at or near "BY"¶ Position: 321
Query
SELECT listagg(app_rule_cd,',') within GROUP (
ORDER BY abc_cd) AS ERR_LST,
'1' AS JOIN1
FROM ABC_RULE
WHERE abc_cd IN
( WITH CTE AS
(SELECT VAL FROM config_server WHERE NAME = 'XXXXXXXXXX'
)
SELECT TRIM(REGEXP_SUBSTR( VAL, '[^,]+', 1, LEVEL))
FROM CTE
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(VAL, '[^,]+')) + 1 -- BY is position 321
);
You did not explain what this query does, but the convoluted connect by level with regexp_replace() is a typical pattern for splitting a comma-separated string into elements in Oracle.
That can be done much more easily in Postgres:
SELECT string_agg(app_rule_cd,',' ORDER BY abc_cd) AS ERR_LST,
'1' AS JOIN1
FROM ABC_Rule
WHERE abc_cd = ANY ( (SELECT string_to_array(val, ',')
FROM config_server WHERE NAME = 'XXXXXXXXXX') )
Note the duplicated parentheses around the sub-query are necessary. Another way is to use the IN operator:
SELECT string_agg(app_rule_cd,',' ORDER BY abc_cd) AS ERR_LST,
'1' AS JOIN1
FROM ABC_Rule
WHERE abc_cd IN (SELECT unnest(string_to_array(val, ','))
FROM config_server
WHERE NAME = 'XXXXXXXXXX')
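To make the logic of both Postgres variants concrete, here is a small Python sketch of the same steps: split the config value on commas, keep the rules whose abc_cd is in the resulting list, then concatenate app_rule_cd ordered by abc_cd. All data values are invented for illustration. Like string_to_array, str.split does not trim whitespace, whereas the Oracle original applies TRIM to each element.

```python
def string_to_array(val, delimiter=","):
    """Rough stand-in for Postgres string_to_array on a plain delimited string."""
    return val.split(delimiter)


# Invented sample data: one config value and (app_rule_cd, abc_cd) rows.
config_val = "A1,B2,C3"
abc_rule = [("r3", "C3"), ("r9", "ZZ"), ("r1", "A1")]

wanted = string_to_array(config_val)

# WHERE abc_cd = ANY (...) followed by string_agg(... ORDER BY abc_cd)
matching = sorted((row for row in abc_rule if row[1] in wanted), key=lambda row: row[1])
err_lst = ",".join(app_rule_cd for app_rule_cd, _ in matching)

print(err_lst)
```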

Merge or Combine Multiple Rows Records to Single Column Record with Comma delimiters SNOWSQL

I know this question has been asked before here, but for SNOWSQL in particular, is there a function similar to 'STUFF' to combine two values into a single record? I basically want to be able to use this query:
SELECT ISSUE_ID,
STUFF((SELECT ', ' + AFFECTS_VERSION
FROM VW_JIRA_ISSUES
WHERE ISSUE_ID = T.ISSUE_ID
FOR XML PATH (''), type).value('.', 'varchar(max)'), 1, 1, '')
AS VERSIONS
FROM VW_JIRA_ISSUES T
GROUP BY ISSUE_ID
How about Snowflake's INSERT() function? I understand it is basically the same as MySQL's INSERT() function which in turn is the equivalent of STUFF() in SQL Server.
References:
https://docs.snowflake.net/manuals/sql-reference/functions/insert.html
https://database.guide/whats-the-mysql-equivalent-of-stuff-in-sql-server/
select issue_id,
listagg(AFFECTS_VERSION, ', ') within group (order by issue_id desc)
FROM VW_JIRA_ISSUES
group by issue_id
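What the listagg query computes can be sketched in a few lines of Python: group the rows by issue_id and join each group's AFFECTS_VERSION values with ', '. The sample rows and the helper name listagg_by_key are invented for illustration, and the values are joined in input order.

```python
from collections import defaultdict


def listagg_by_key(rows, sep=", "):
    """Group (key, value) pairs by key and join each group's values with sep."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: sep.join(values) for key, values in grouped.items()}


# Invented (issue_id, affects_version) rows.
rows = [(1, "1.0"), (1, "1.1"), (2, "2.0")]
versions = listagg_by_key(rows)
print(versions)
```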

PostgreSQL: how to get & set an enumerated list variable in a script?

I am using GUC style variables in an SQL script like this:
set mycustom.var = 5;
select current_setting('mycustom.var');
That works fine for strings and integers, but how do I get and set enumerated lists of integers?
Ideally, I'd like to populate the enumerated list with random unique values using this code:
SELECT num
FROM GENERATE_SERIES (1, 10) AS s(num)
ORDER BY RANDOM()
LIMIT 6
The problem to overcome: SET expects literal input. You can't feed the result of a query to it directly.
One way around it: dynamic SQL like:
DO
$$
BEGIN
EXECUTE format(
'SET mycustom.var = %L'
, ARRAY(
SELECT *
FROM generate_series(1, 10)
ORDER BY random()
LIMIT 6
)::text
);
END
$$;
Or use set_config():
SELECT set_config('mycustom.var'
, ARRAY(
SELECT *
FROM generate_series(1, 10)
ORDER BY random()
LIMIT 6
)::text
, false);
Then:
SELECT current_setting('mycustom.var')::int[];
This returns an integer array: int[].
A temporary function would be an alternative, possibly with a dynamically computed result (whereas this solution only stores a fixed result). See:
Is there a way to define a named constant in a PostgreSQL query?
Use set_config():
select set_config(
'mycustom.list',
(
select array_agg(num)::text
from (
select num
from generate_series (1, 10) as s(num)
order by random()
limit 6
) s
),
false
);
Of course, the setting is of type text:
select current_setting('mycustom.list', true);
current_setting
-----------------
{2,6,1,3,10,8}
(1 row)
However, you can easily convert it to a set of rows:
select *
from unnest(current_setting('mycustom.list', true)::int[])
unnest
--------
2
6
1
3
10
8
(6 rows)
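The round trip both answers rely on (array to text for storage, text back to int[] when reading) can be illustrated in Python. This is only a sketch of the idea for simple integer arrays: the helper names are made up, and the parser does not handle quoting, NULLs, or nested arrays the way the real Postgres cast does.

```python
import random


def to_pg_array_text(values):
    """Format integers like the text form of a Postgres int[] value, e.g. {2,6,1}."""
    return "{" + ",".join(str(v) for v in values) + "}"


def from_pg_array_text(text):
    """Parse a simple one-dimensional integer array literal back into a list."""
    inner = text.strip()[1:-1]  # drop the surrounding braces
    return [int(part) for part in inner.split(",")] if inner else []


# Six unique random values from 1..10, like ORDER BY random() LIMIT 6.
picked = random.sample(range(1, 11), 6)

setting = to_pg_array_text(picked)      # what gets stored in the setting (text)
restored = from_pg_array_text(setting)  # what the ::int[] cast recovers

print(setting, restored)
```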

How to perform "a UNION b" when a and b are CTEs?

If I try to UNION (or INTERSECT or EXCEPT) a common table expression I get a syntax error near the UNION. If instead of using the CTE I put the query into the union directly, everything works as expected.
I can work around this but for some more complicated queries using CTEs makes things much more readable. I also just don't like not knowing why something is failing.
As an example, the following query works:
SELECT *
FROM
(
SELECT oid, route_group
FROM runs, gpspoints
WHERE gpspoints.oid = runs.start_point_oid
UNION
SELECT oid, route_group
FROM runs, gpspoints
WHERE gpspoints.oid = runs.end_point_oid
) AS allpoints
;
But this one fails with:
ERROR: syntax error at or near "UNION"
LINE 20: UNION
WITH
startpoints AS
(
SELECT oid, route_group
FROM runs, gpspoints
WHERE gpspoints.oid = runs.start_point_oid
),
endpoints AS
(
SELECT oid, route_group
FROM runs, gpspoints
WHERE gpspoints.oid = runs.end_point_oid
)
SELECT *
FROM
(
startpoints
UNION
endpoints
) AS allpoints
;
The data being UNIONed together is identical but one query fails and the other does not.
I'm running PostgreSQL 9.3 on Windows 7.
The problem is that CTEs are not direct text substitutions, and a UNION b is not valid SELECT syntax. The SELECT keyword is a mandatory part of the grammar, so the syntax error is raised before the CTEs are even taken into account.
This is why
SELECT * FROM a
UNION
SELECT * FROM b
works: the syntax is valid, and the CTEs (represented by a and b) are then used in the table position (via with_query_name).
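The difference can be reproduced with a quick experiment. The sketch below uses Python's built-in sqlite3 module, so it runs against SQLite rather than PostgreSQL; that swap is an assumption made purely so the example is self-contained, and the error text will differ from the Postgres message. The CTEs a and b hold made-up single-row data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Valid: each side of the UNION is a full SELECT that reads from a CTE.
valid = """
WITH a AS (SELECT 1 AS n), b AS (SELECT 2 AS n)
SELECT * FROM a
UNION
SELECT * FROM b
"""

# Invalid: bare CTE names are not SELECT statements, so UNION cannot join them.
invalid = """
WITH a AS (SELECT 1 AS n), b AS (SELECT 2 AS n)
SELECT * FROM (a UNION b) AS allpoints
"""

rows = sorted(conn.execute(valid).fetchall())

try:
    conn.execute(invalid)
    bare_union_error = None
except sqlite3.Error as exc:
    # The parser rejects the statement before any CTE is evaluated.
    bare_union_error = str(exc)

print(rows)
print(bare_union_error)
```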
At least in SQL Server, I can easily do this: create two CTEs and SELECT from each, combined with a UNION:
WITH FirstNames AS
(
SELECT DISTINCT FirstName FROM Person
), LastNames AS
(
SELECT DISTINCT LastName FROM Person
)
SELECT * FROM FirstNames
UNION
SELECT * FROM LastNames
Not sure if this works in Postgres, too - give it a try!