Does Spark support the cascaded query below?

I need to run queries against several tables in a PostgreSQL database to populate a DataFrame. The tables are as follows.
Table 1 has the data below.
QueryID, WhereClauseID, Enabled
1, 1, true
2, 2, true
3, 3, true
Table 2 has the data below.
WhereClauseID, WhereClauseString
1, a>b
2, a>c
3, a>b && a<c
Table 3 has the data below.
a, b, c, value
30, 20, 30, 100
20, 10, 40, 200
I want to query in the following way. From table 1, I pick the rows where Enabled is true. Using the WhereClauseID in each of those rows, I pick the matching rows in table 2. Using the WhereClauseString from table 2 as the WHERE clause, I query table 3 to get the value. In the end, I want all records in table 3 that satisfy the where clauses enabled in table 1.
I know I can go through table 1 row by row and use a parameterized string to build the SQL query against table 3. But querying row by row seems very inefficient, especially if table 1 is big. Is there a better way to organize the query to improve efficiency? Thanks a lot!

Depending on your use case, for PySpark DataFrames you might be able to solve it using the .when function in PySpark.
Here is a suggestion.
import pyspark.sql.functions as F
tbl1 = spark.table("table1")
tbl3 = spark.table("table3")
# Tag each row of table 3 with the ID of the first where clause it satisfies.
# You can do some fancy parsing of your tbl2 here if you want this
# to be evaluated programmatically from your table2.
tbl3 = tbl3.withColumn(
    "WhereClauseID",
    F.when(F.col("a") > F.col("b"), 1)
    .when(F.col("a") > F.col("c"), 2),
)
tbl1_with_tbl_3 = tbl1.join(tbl3, "WhereClauseID", "left")
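If the clause strings in table 2 are trusted, another way to avoid querying table 3 once per row of table 1 might be to OR all enabled clauses into a single filter. The following is only a minimal sketch of that idea: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, it assumes a row should be returned when it matches any enabled clause, and the helper name build_combined_filter and the translation of && to AND are assumptions made for illustration.

```python
import sqlite3


def build_combined_filter(clauses):
    """OR together trusted WHERE-clause strings, translating '&&' to SQL AND."""
    parts = ["(" + clause.replace("&&", "AND") + ")" for clause in clauses]
    # With no enabled clauses, return a filter that matches nothing.
    return " OR ".join(parts) if parts else "0 = 1"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (QueryID INTEGER, WhereClauseID INTEGER, Enabled INTEGER);
    CREATE TABLE table2 (WhereClauseID INTEGER, WhereClauseString TEXT);
    CREATE TABLE table3 (a INTEGER, b INTEGER, c INTEGER, value INTEGER);
    INSERT INTO table1 VALUES (1, 1, 1), (2, 2, 1), (3, 3, 1);
    INSERT INTO table2 VALUES (1, 'a>b'), (2, 'a>c'), (3, 'a>b && a<c');
    INSERT INTO table3 VALUES (30, 20, 30, 100), (20, 10, 40, 200);
""")

# One query fetches every enabled clause string.
clauses = [
    row[0]
    for row in conn.execute("""
        SELECT t2.WhereClauseString
        FROM table1 t1
        JOIN table2 t2 ON t1.WhereClauseID = t2.WhereClauseID
        WHERE t1.Enabled = 1
        ORDER BY t2.WhereClauseID
    """)
]

# A second query applies all clauses to table 3 at once.
combined = build_combined_filter(clauses)
matched = conn.execute(
    "SELECT a, b, c, value FROM table3 WHERE " + combined + " ORDER BY value"
).fetchall()
print(combined)
print(matched)
```

In PySpark the combined string could be passed to DataFrame.filter, which accepts a SQL expression string. Because the clauses are spliced into the query text, this is only safe when table 2 cannot contain untrusted input.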


Query table by a value in the second dimension of a two dimensional array column

I have a table with the following definition:
CREATE TABLE "Highlights" (
    id uuid,
    chunks numeric[][]
);
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test whether a row's chunks column contains a two-dimensional array in which any of the tuples has some specific values.
Maybe there's a better alternative, but this might do: you go over the first dimension of each array and test your condition:
select *
from highlights as h
where exists (
    select *
    from generate_series(1, array_length(h.chunks, 1)) as tt(i)
    where
        -- your condition goes here
        h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
)
Update: as @arie-r pointed out, it would be better to use the generate_subscripts function:
select *
from highlights as h
where exists (
    select *
    from generate_subscripts(h.chunks, 1) as tt(i)
    where h.chunks[tt.i][3] = 6
)
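To make the semantics of the EXISTS test concrete, here is a small Python sketch of the same predicate on nested lists. The function name has_matching_tuple and the sample values are hypothetical. Note that PostgreSQL array subscripts start at 1 by default, while Python indexes start at 0.

```python
def has_matching_tuple(chunks):
    """Return True if any inner tuple has 1st element > 10 and 3rd element < 20."""
    # chunks[i][1] and chunks[i][3] in SQL correspond to row[0] and row[2] here.
    return any(row[0] > 10 and row[2] < 20 for row in chunks)


# Hypothetical sample data: the second tuple of sample_match satisfies both tests.
sample_match = [[1, 2, 3], [11, 0, 19]]
# Here one tuple passes only the first test and the other only the second.
sample_no_match = [[11, 0, 25], [5, 0, 6]]

print(has_matching_tuple(sample_match))
print(has_matching_tuple(sample_no_match))
```

Both conditions must hold for the same tuple, which is why the SQL version evaluates them together inside one EXISTS subquery.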

pyspark processing & compare 2 dataframes

I am working in PySpark (Spark 2.2.0) with 2 DataFrames that have common columns. The requirement is to join the 2 frames according to the rule below.
frame1 = [Column 1, Column 2, Column 3, ..., column_n] ### dataframe
frame2 = [Column 1, Column 2, Column 3, ..., column_n] ### dataframe
key = [Column 1, Column 2] ### is an array
if frame1.[Column 1, Column 2] == frame2.[Column 1, Column 2]
    if frame1.column_n == frame2.column_n
        write to a new data frame DF_A using values from frame 2 as is
    if frame1.column_n != frame2.column_n
        write to a new data frame DF_A using values from frame 1 as is
        write to a new data frame DF_B using values from frame 2 but with hard-coded values for column 3 and column 5
To do this, I first create 2 temp views and construct 3 SQL statements dynamically.
sql_1 = select frame1.* from frame1 join frame2 on [frame1.keys] = [frame2.keys]
    where frame1.column_n = frame2.column_n
DF_A = sqlContext.sql(sql_1)
sql_2 = select [all columns from frame1] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
    where frame1.column_n != frame2.column_n
DF_A = DF_A.union(sqlContext.sql(sql_2))
sql_3 = select [all columns from frame2 except for column3 & column5 to be hard coded] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
    where frame1.column_n != frame2.column_n
DF_B = sqlContext.sql(sql_3)
Question 1: Is there a better way to dynamically pass the key columns for joining? I currently do this by keeping the key columns in arrays (which works) and constructing the SQL.
Question 2: Is there a better way to dynamically pass the selected columns without changing the column order? I currently do this by keeping the column names in an array and concatenating them.
I did consider a single full outer join, but since the column names are the same in both frames I thought the renaming would add more overhead.
For questions 1 and 2, I went with getting the column names from the DataFrame schema (df.schema.names and df.columns) and doing string processing inside a loop.
For the logic, I went with a minimum of 2 SQL statements, one of them with a full outer join.
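The string processing described above could look roughly like the sketch below. The helper names quote and build_join_sql, their parameters, and the sample column names are hypothetical. The temp view names frame1 and frame2 come from the question, and backticks are used because Spark SQL needs them for column names that contain spaces.

```python
def quote(name):
    """Wrap a column name in backticks so names with spaces stay valid."""
    return "`" + name + "`"


def build_join_sql(keys, select_cols, compare_col, same, select_from="frame1"):
    """Build the join query from lists of key and selection columns.

    keys        -- columns used in the ON clause
    select_cols -- columns to select, kept in the given order
    compare_col -- column compared between the two frames in the WHERE clause
    same        -- True for '=', False for '!='
    select_from -- which view the selected columns are read from
    """
    on_clause = " AND ".join(
        "frame1.{0} = frame2.{0}".format(quote(k)) for k in keys
    )
    select_list = ", ".join(
        "{0}.{1}".format(select_from, quote(c)) for c in select_cols
    )
    op = "=" if same else "!="
    return (
        "SELECT {sel} FROM frame1 JOIN frame2 ON {on} "
        "WHERE frame1.{cmp} {op} frame2.{cmp}"
    ).format(sel=select_list, on=on_clause, cmp=quote(compare_col), op=op)


key = ["Column 1", "Column 2"]
columns = ["Column 1", "Column 2", "Column 3"]  # e.g. taken from frame1.columns
sql_1 = build_join_sql(key, columns, "Column 3", same=True)
sql_2 = build_join_sql(key, columns, "Column 3", same=False)
print(sql_1)
print(sql_2)
```

If building SQL strings is not a requirement, DataFrame.join also accepts a list of column names for its on argument, so the key list could be passed directly.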

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database, and one particular table in it has many rows. One column in this table, called data, is a float array, REAL[], and is filled with an array of ~4500 elements. I want to access this table through a query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(conditional).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods, or setup, when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload' : True, 'schema' : 'testdb', 'extend_existing':True}
    data = deferred(Column(ARRAY(Float)))
    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
Based on the comment below, here is the code I'm using to try to select all rows (out of 1000) that have data (from 1 column) >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0, operator=operators.le)).all()
This gives the correct subset size: len(rows) = 537. However, I don't understand the logic of this operator: why do I use the le operator to select data >= 1.0? Along the same lines, there should be 234 rows that have data between the values >=1.0 and <1.0, but this statement fails to give the correct subset.
rows = session.query(datadb.Table).filter(,operator=operators.le)).filter(,
* EDIT 2 *
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db datadb
schema Table
pk data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows, 4 of them have data with values >= 1.0, while just 2 have values in the range >= 1.0 and <= 1.2. The query I would do to grab the rows is in the first case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0, operator=operators.le)).all()
This should return the 4 rows at pk=0,1,2,3. This query does what I expect. The second case is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0, operator=operators.le)).filter(datadb.Table.data.any(1.2, operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However, this query just returns the 4 rows from the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0, operator=operators.le), datadb.Table.data.any(1.2, operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:
from sqlalchemy.sql import operators
# any(1.0, operator=operators.le) renders as "1.0 <= ANY (data)",
# which is why le selects rows that contain a value >= 1.0
rows = (session.query(Table)
        .filter(Table.data.any(1.0, operator=operators.le))
        )
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(,
     )
# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]
for item in items:
    print(item)
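Why the two chained ANY filters return 4 rows instead of 2 can be seen in plain Python. The sketch below uses the sample rows from EDIT 2. The helper names any_ge and any_le are hypothetical and only model what "value <= ANY (data)" and "value >= ANY (data)" test: each one may be satisfied by a different element of the array.

```python
# Sample rows from EDIT 2 of the question, keyed by pk.
rows = {
    0: [0.0, 0.0, 0.5, 0.3, 1.3, 1.9, 0.3, 0.0, 0.0],
    1: [0.1, 0.0, 1.0, 0.7, 1.1, 1.5, 1.2, 0.3, 1.4],
    2: [0.0, 0.6, 0.4, 0.3, 1.6, 1.7, 0.4, 1.3, 0.0],
    3: [0.0, 0.1, 0.2, 0.4, 1.0, 1.1, 1.2, 0.9, 0.0],
    4: [0.0, 0.0, 0.5, 0.3, 0.2, 0.1, 0.7, 0.3, 0.1],
}


def any_ge(data, low):
    """Models 'low <= ANY (data)': some element is >= low."""
    return any(x >= low for x in data)


def any_le(data, high):
    """Models 'high >= ANY (data)': some element is <= high."""
    return any(x <= high for x in data)


# Two independent ANY tests: different elements may satisfy each one.
chained = [pk for pk, data in rows.items()
           if any_ge(data, 1.0) and any_le(data, 1.2)]

# The intended test: a single element lies inside the range.
same_element = [pk for pk, data in rows.items()
                if any(1.0 <= x <= 1.2 for x in data)]

print(chained)
print(same_element)
```

Row 0, for example, passes the chained version because 1.3 satisfies the lower bound and 0.0 satisfies the upper bound, although no single value lies in the range. That is why the range test is applied per element in memory above.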

Using two different rows from the same table in an expression

I'm using PostgreSQL + PostGIS.
I have a point geometry and a line geometry in the same column of the same table, in different rows. To get the line I run:
SELECT the_geom
FROM filedata
WHERE id=3
To get the point I run:
SELECT the_geom
FROM filedata
WHERE id=4
I want to take the point and the line together, as shown in this WITH expression, but using a real query against the table instead:
WITH data AS (
    SELECT 'LINESTRING (50 40, 40 60, 50 90, 30 140)'::geometry AS road,
           'POINT (60 110)'::geometry AS poi)
SELECT ST_AsText(
    ST_Line_Interpolate_Point(road, ST_Line_Locate_Point(road, poi))) AS projected_poi
FROM data;
As you can see, in this example the data comes from a hand-created WITH expression. I want to take it from my filedata table. My problem is that I don't know how to work with data from two different rows of one table at the same time.
One possible way:
A subquery to retrieve another value from a different row.
SELECT ST_AsText(
    ST_Line_Interpolate_Point(the_geom, ST_Line_Locate_Point(
        the_geom
        ,(SELECT the_geom FROM filedata WHERE id = 4)
    ))
) AS projected_poi
FROM filedata
WHERE id = 3;
Use a self-join:
SELECT ST_AsText(
    ST_Line_Interpolate_Point(fd_road.the_geom, ST_Line_Locate_Point(
        fd_road.the_geom,
        fd_poi.the_geom
    ))) AS projected_poi
FROM filedata fd_road, filedata fd_poi
WHERE fd_road.id = 3 AND fd_poi.id = 4;
Alternately use a subquery to fetch the other row, as Erwin pointed out.
The main options for using multiple rows from one table in a single expression are:
Self-join the table with two different aliases as shown above, then filter the rows;
Use a subquery expression to get a value for all but one of the rows, as Erwin's answer shows;
Use a window function like lag() and lead() to get a row relative to the current row within the query result; or
JOIN on a subquery that returns a table.
The latter two are more advanced options for problems that are difficult or inefficient to solve with the simpler self-join or subquery expression.
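The first two options can be compared side by side in a small sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL with PostGIS, and a hypothetical integer column val stands in for the_geom, so the subtraction only illustrates an expression that needs values from two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filedata (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO filedata VALUES (3, 100), (4, 7);
""")

# Option 1: self-join with two aliases, one per row of interest.
self_join = conn.execute("""
    SELECT fd_road.val - fd_poi.val
    FROM filedata fd_road, filedata fd_poi
    WHERE fd_road.id = 3 AND fd_poi.id = 4
""").fetchone()[0]

# Option 2: scalar subquery that fetches the value from the other row.
subquery = conn.execute("""
    SELECT val - (SELECT val FROM filedata WHERE id = 4)
    FROM filedata
    WHERE id = 3
""").fetchone()[0]

print(self_join, subquery)
```

Both queries read row 3 and row 4 in a single expression and give the same result; which one reads better tends to depend on how many rows and columns the expression needs.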

Long running query on a self joined table

I am trying to improve the performance of a query that updates a column on each row of a table by comparing the current row's values with all other rows in the same table. Here is the query:
update F set
PartOfPairRC = 1
from RangeChange F
where Reject=0
and exists(
select 1 from RangeChange S
where F.StoreID = S.StoreID
and F.ItemNo = S.ItemNo
and F.Reject = S.Reject
and F.ChangeDateEnd = S.ChangeDate - 1)
The query's performance degrades rapidly as the number of rows in the table increases. I have 50 million rows in the table.
Is there a better way to do this? Would SSIS be able to handle such an operation better?
Any help is much appreciated. Thanks, Robert
You can try to create an index on that table:
create index idx_test on RangeChange(StoreID, ItemNo, Reject, ChangeDateEnd) where reject = 0
-- if you are not using SQL Server Enterprise, remove the where condition from the index and add the Reject column as an included column
-- make sure you already have a clustered index on the table (if not, you can create the index above as clustered)
-- I would write the query as a join:
update F set
F.PartOfPairRC = 1
from RangeChange F
join RangeChange S
on F.StoreID = S.StoreID
and F.ItemNo = S.ItemNo
and F.Reject = S.Reject
and F.ChangeDateEnd = S.ChangeDate - 1
where F.Reject=0 and S.Reject = 0
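As a rough analogy for why an index on the join columns helps, the Python sketch below finds the pairs with one keyed lookup per row instead of comparing every row with every other row. It is not how SQL Server executes the query; the sample rows are hypothetical, and a set stands in for the index.

```python
from datetime import date, timedelta

# Hypothetical sample rows using the column names from the query.
rows = [
    {"StoreID": 1, "ItemNo": 10, "Reject": 0,
     "ChangeDate": date(2020, 1, 1), "ChangeDateEnd": date(2020, 1, 9), "PartOfPairRC": 0},
    {"StoreID": 1, "ItemNo": 10, "Reject": 0,
     "ChangeDate": date(2020, 1, 10), "ChangeDateEnd": date(2020, 1, 20), "PartOfPairRC": 0},
    {"StoreID": 1, "ItemNo": 11, "Reject": 0,
     "ChangeDate": date(2020, 1, 1), "ChangeDateEnd": date(2020, 1, 9), "PartOfPairRC": 0},
    {"StoreID": 1, "ItemNo": 10, "Reject": 1,
     "ChangeDate": date(2020, 1, 21), "ChangeDateEnd": date(2020, 1, 30), "PartOfPairRC": 0},
]

# "S" side: for each non-rejected row, the end date a preceding row must have
# (ChangeDate - 1), keyed together with StoreID and ItemNo.
successor_keys = {
    (r["StoreID"], r["ItemNo"], r["ChangeDate"] - timedelta(days=1))
    for r in rows
    if r["Reject"] == 0
}

# "F" side: one set lookup per row instead of a scan over all other rows.
for r in rows:
    key = (r["StoreID"], r["ItemNo"], r["ChangeDateEnd"])
    if r["Reject"] == 0 and key in successor_keys:
        r["PartOfPairRC"] = 1

flags = [r["PartOfPairRC"] for r in rows]
print(flags)
```

Only the first row is flagged: its ChangeDateEnd is one day before the ChangeDate of the second row for the same store and item, while the only candidate successor of the second row is rejected.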