Does Spark support the cascaded query below? - postgresql

I have a requirement to run some queries against tables in a PostgreSQL database to populate a dataframe. The tables are as follows.
Table 1 has the following data.
QueryID, WhereClauseID, Enabled
1 1 true
2 2 true
3 3 true
...
Table 2 has the following data.
WhereClauseID, WhereClauseString
1 a>b
2 a>c
3 a>b && a<c
...
Table 3 has the following data.
a, b, c, value
30, 20, 30, 100
20, 10, 40, 200
...
I want to query in the following way. From table 1, I pick the rows where Enabled is true. Based on the WhereClauseID in each of those rows, I pick the matching rows in table 2. Using the WhereClauseString from table 2 as the WHERE clause, I query table 3 to get the value. In the end, I want all records in table 3 that satisfy any of the WHERE clauses enabled in table 1.
I know I can go through table 1 row by row and use the parameterized string to build a SQL query against table 3, but querying row by row is very inefficient, especially if table 1 is big. Is there a better way to organize the query to improve efficiency? Thanks a lot!

Depending on your use case, you might be able to solve it using pyspark's when expression.
Here is a suggestion.
import pyspark.sql.functions as F

# Keep only the rows of table1 that are enabled, per the requirement.
tbl1 = spark.table("table1").filter(F.col("Enabled"))
tbl3 = spark.table("table3")

tbl3 = (
    tbl3
    .withColumn(
        "WhereClauseID",
        ## You can do some fancy parsing of your tbl2 here if you want this
        ## to be evaluated programmatically from your table2.
        (
            F.when(F.col("a") > F.col("b"), 1)
            .when(F.col("a") > F.col("c"), 2)
            .otherwise(-1)
        )
    )
)

tbl1_with_tbl_3 = tbl1.join(tbl3, "WhereClauseID", "left")
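
If you do want to evaluate the conditions straight from table2 rather than hard-coding them, one possible approach is to build the chained when expression in a loop with F.expr. This is only a sketch: it assumes table2 is small enough to collect to the driver and that its WhereClauseString values become valid Spark SQL once "&&" is rewritten as "AND".

import pyspark.sql.functions as F

tbl2 = spark.table("table2")

# Build the chained when() from table2's rows; the first matching clause wins,
# just as in the hand-written version above.
where_col = None
for row in tbl2.collect():  # e.g. Row(WhereClauseID=1, WhereClauseString='a>b')
    cond = F.expr(row["WhereClauseString"].replace("&&", " AND "))
    where_col = (F.when(cond, row["WhereClauseID"]) if where_col is None
                 else where_col.when(cond, row["WhereClauseID"]))
where_col = where_col.otherwise(-1) if where_col is not None else F.lit(-1)

tbl3_tagged = spark.table("table3").withColumn("WhereClauseID", where_col)

This keeps the evaluation inside Spark; only table2, which is presumably small, is pulled to the driver.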

Related

Query table by a value in the second dimension of a two dimensional array column

WHAT I HAVE
I have a table with the following definition:
CREATE TABLE "Highlights"
(
    id     uuid,
    chunks numeric[][]
)
WHAT I NEED TO DO
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
Notes
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test whether a row's chunks column contains a two-dimensional array that has some specific values in any of its tuples.
Maybe there's a better alternative, but this might do: you just go over the first dimension of each array and test your condition:
select *
from highlights as h
where
  exists (
    select
    from generate_series(1, array_length(h.chunks, 1)) as tt(i)
    where
      -- your condition goes here
      h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
  )
db<>fiddle demo
Update: as @arie-r pointed out, it'd be better to use the generate_subscripts function:
select *
from highlights as h
where
  exists (
    select *
    from generate_subscripts(h.chunks, 1) as tt(i)
    where
      h.chunks[tt.i][3] = 6
  )
db<>fiddle demo

pyspark processing & compare 2 dataframes

I am working in pyspark (Spark 2.2.0) with 2 dataframes that have common columns. The requirement I am dealing with is to join the 2 frames as per the rules below.
frame1 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
frame2 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
key = [Column 1, Column 2] ### is an array
If frame1.[Column1, Column2] == frame2.[Column1, Column2]
    if frame1.column_n == frame2.column_n
        write to a new data frame DF_A using values from frame2 as is
    if frame1.column_n != frame2.column_n
        write to a new data frame DF_A using values from frame1 as is
        write to a new data frame DF_B using values from frame2, but with column3 & column5 hard-coded
To do this, I am first creating 2 temp views and constructing 3 SQLs dynamically.
sql_1 = select frame1.* from frame1 join frame2 on [frame1.keys] = [frame2.keys]
        where frame1.column_n = frame2.column_n
DF_A = sqlContext.sql(sql_1)

sql_2 = select [all columns from frame1] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
        where frame1.column_n != frame2.column_n
DF_A = DF_A.union(sqlContext.sql(sql_2))

sql_3 = select [all columns from frame2 except for column3 & column5, which are hard coded] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
        where frame1.column_n != frame2.column_n
DF_B = sqlContext.sql(sql_3)
Question 1: is there a better way to dynamically pass key columns for joining? I am currently doing this by maintaining the key columns in arrays (it works) and constructing the SQL.
Question 2: is there a better way to dynamically pass selection columns without changing the sequence of columns? I am currently doing this by maintaining column names in an array and performing concatenation.
I did consider a single full outer join, but since the column names are the same I thought it would have more renaming overhead.
For questions #1 and #2, I went with getting the column names from the dataframe schema (df.schema.names and df.columns) and string processing inside the loop.
For the logic, I went with a minimum of 2 SQLs, one of them a full outer join.
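
For question #1 specifically, a possible alternative to building SQL strings is to construct the join condition directly from the key-column list. This is only a sketch under the names used in the question (frame1, frame2, key, column_n); it assumes the key columns exist with the same names in both frames:

from functools import reduce

key = ["Column1", "Column2"]  # illustrative key columns from the question

# Build one equality condition per key column and AND them together.
join_cond = reduce(
    lambda acc, c: acc & (frame1[c] == frame2[c]),
    key[1:],
    frame1[key[0]] == frame2[key[0]],
)

joined = frame1.join(frame2, join_cond, "inner")

# Matching keys and equal column_n: keep frame2's values as is.
df_a = joined.filter(frame1["column_n"] == frame2["column_n"]).select(frame2["*"])

Because the join condition keeps the two frames' columns distinct, frame1["column_n"] and frame2["column_n"] can still be compared after the join without any renaming.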

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database with one particular table that has many rows. One column in this table, called data, is a float array (REAL[]) and gets filled with an array of ~4500 elements. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods or setup when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}
    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the comment below, here is the code I'm using to try to select all rows (out of 1000) whose data column contains values >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset number: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, why do I use the le operator? Also, along the same lines, there should be 234 rows with data values between >= 1.0 and <= 1.2, but this statement fails to give the correct subset.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db datadb
schema Table
pk data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data values >= 1.0, while just 2 have values in the range >= 1.0 to <= 1.2. In the first case, the query I would run to grab the rows is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows at pk=0,1,2,3, and this query does what I expect. The second case is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and it should return the 2 rows at pk=1,3. However, this query just returns the 4 rows from the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:
rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        ).all()
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]

for item in items:
    print(item)
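
One reason the combined filter can't express the range test is that each .any() is satisfied independently, so one element may satisfy the >= 1.0 check while a different element satisfies the <= 1.2 check. If you want to keep the filtering on the server rather than in Python, a possible sketch (not part of the original answer, untested against the asker's schema, and assuming the column really is named data) is to push an unnest-based EXISTS predicate through text():

from sqlalchemy import text

# "Some single element of data lies between 1.0 and 1.2", evaluated in PostgreSQL.
# The raw SQL and the name MyTable are assumptions for illustration.
rows = (
    session.query(MyTable)
    .filter(text(
        "EXISTS (SELECT 1 FROM unnest(data) AS elem WHERE elem BETWEEN 1.0 AND 1.2)"
    ))
    .all()
)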

Using two different rows from the same table in an expression

I'm using PostgreSQL + PostGIS.
In one table I have a point geometry and a line geometry in the same column, in different rows. To get the line I run:
SELECT the_geom
FROM filedata
WHERE id=3
If I want to take the point I run:
SELECT the_geom
FROM filedata
WHERE id=4
I want to take the point and line together, like they're shown in this WITH expression, but using a real query against the table instead:
WITH data AS (
  SELECT 'LINESTRING (50 40, 40 60, 50 90, 30 140)'::geometry AS road,
         'POINT (60 110)'::geometry AS poi
)
SELECT ST_AsText(
         ST_Line_Interpolate_Point(road, ST_Line_Locate_Point(road, poi))
       ) AS projected_poi
FROM data;
As you can see, in this example the data comes from a hand-written WITH expression. I want to take it from my filedata table. My problem is that I don't know how to work with data from two different rows of one table at the same time.
One possible way: a subquery to retrieve the other value from a different row.
SELECT ST_AsText(
         ST_Line_Interpolate_Point(
           the_geom,
           ST_Line_Locate_Point(
             the_geom,
             (SELECT the_geom FROM filedata WHERE id = 4)
           )
         )
       ) AS projected_poi
FROM filedata
WHERE id = 3;
Use a self-join:
SELECT ST_AsText(
         ST_Line_Interpolate_Point(fd_road.the_geom, ST_Line_Locate_Point(
           fd_road.the_geom,
           fd_poi.the_geom
         ))
       ) AS projected_poi
FROM filedata fd_road, filedata fd_poi
WHERE fd_road.id = 3 AND fd_poi.id = 4;
Alternatively, use a subquery to fetch the other row, as Erwin pointed out.
The main options for using multiple rows from one table in a single expression are:
Self-join the table with two different aliases as shown above, then filter the rows;
Use a subquery expression to get a value for all but one of the rows, as Erwin's answer shows;
Use a window function like lag() or lead() to get a row relative to the current row within the query result; or
JOIN on a subquery that returns a table.
The latter two are more advanced options that solve problems that are difficult or inefficient to handle with the simpler self-join or subquery expression.

Long-running query on a self-joined table

I am trying to improve the performance of a query which updates a column on each row of a table by comparing the current row's values with all other rows in the same table. Here is the query:
update F set
  PartOfPairRC = 1
from RangeChange F
where Reject = 0
  and exists(
    select 1 from RangeChange S
    where F.StoreID = S.StoreID
      and F.ItemNo = S.ItemNo
      and F.Reject = S.Reject
      and F.ChangeDateEnd = S.ChangeDate - 1)
The query's performance degrades rapidly as the number of rows in the table increases. I have 50 million rows in the table.
Is there a better way to do this? Would SSIS be able to handle such an operation better?
Any help much appreciated, thanks Robert
You can try to create an index on that table:
create index idx_test on RangeChange(StoreID, ItemNo, Reject, ChangeDateEnd) where reject = 0
-- if you are not using SQL Server Enterprise, get rid of the WHERE condition in the index
-- and put the Reject column as an included column in the index instead
-- make sure you already have a clustered index on the table (if not, you can create the index above as clustered)

-- I would write the query as a join:
update F set
  F.PartOfPairRC = 1
from RangeChange F
join RangeChange S
  on F.StoreID = S.StoreID
  and F.ItemNo = S.ItemNo
  and F.Reject = S.Reject
  and F.ChangeDateEnd = S.ChangeDate - 1
where F.Reject = 0 and S.Reject = 0