SQLAlchemy asyncio: translating a Postgres GROUP BY query

I want to translate the Postgres query below into SQLAlchemy asyncio form, but so far I can retrieve either only the first column or the whole row at once, while I need exactly two columns per record:
SELECT
table.xml_uri,
max(table.created_at) AS max_1
FROM
table
GROUP BY
table.xml_uri
ORDER BY
max_1 DESC;
I arrived at the translation below, but it only returns the first column, xml_uri, while I need both columns. I have left the order_by clause commented out for now, since enabling it also produces the error shown further down:
SQLAlchemy query:
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession

query = "%{}%".format(query)
records = await session.execute(
    select(BaseModel.xml_uri, func.max(BaseModel.created_at))
    # .order_by(BaseModel.created_at.desc())  # enabling this raises the error below
    .group_by(BaseModel.xml_uri)
    .filter(BaseModel.xml_uri.like(query))
)
# Get all the records
result = records.scalars().all()
Error generated when the order_by clause is enabled:
column "table.created_at" must appear in the GROUP BY clause or be used in an aggregate function

The query is returning a result set consisting of two-element tuples, and Result.scalars() keeps only the first element of each tuple. Consuming the result of session.execute() directly, e.g. with records.all(), will provide the desired behaviour.
It's not permissible to order by the date column directly, as it is neither in the GROUP BY clause nor inside an aggregate, but you can give the max() column a label and use that label to order.
Here's an example script:
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()

class MyModel(Base):
    __tablename__ = 't73018397'
    id = sa.Column(sa.Integer, primary_key=True)
    code = sa.Column(sa.String)
    value = sa.Column(sa.Integer)

engine = sa.create_engine('postgresql:///test', echo=True, future=True)
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
Session = orm.sessionmaker(engine, future=True)

with Session.begin() as s:
    for i in range(10):
        # Split values based on odd or even
        code = 'AB'[i % 2 == 0]
        s.add(MyModel(code=code, value=i))

with Session() as s:
    q = (
        sa.select(MyModel.code, sa.func.max(MyModel.value).label('mv'))
        .group_by(MyModel.code)
        .order_by(sa.text('mv desc'))
    )
    res = s.execute(q)
    for row in res:
        print(row)
which generates this query:
SELECT
t73018397.code,
max(t73018397.value) AS mv
FROM t73018397
GROUP BY t73018397.code
ORDER BY mv desc
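The same pattern carries straight over to the asyncio API from the question. A minimal sketch, assuming the questioner's BaseModel and an AsyncSession named session:
from sqlalchemy import func, select

max_created = func.max(BaseModel.created_at).label("max_created")
stmt = (
    select(BaseModel.xml_uri, max_created)
    .filter(BaseModel.xml_uri.like(query))
    .group_by(BaseModel.xml_uri)
    .order_by(max_created.desc())
)
result = await session.execute(stmt)
rows = result.all()  # list of (xml_uri, max_created) Row objects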

Related

Flask SQLAlchemy Filter On A Postgres JSON List Object Based on a Single String

I have a model called Testing. The field called alias is a JSON field (a list really) and has values such as ["a", "b"] or ["d", "e"] and so on.
class Testing(db.Model):
    __tablename__ = 'testing'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(25))
    alias = db.Column(JSON)

    def __init__(self, name, alias):
        self.name = name
        self.alias = alias
In my Flask view I grab a URL parameter that I want to use to filter Testing, returning all Testing objects whose alias JSON list contains the parameter value. For example, if url_param_value="a", I want every Testing object where "a" is in alias, so an alias value of ["a", "b"] would be a hit.
Here is my approach, but it's not working, and I assume it has to do with serialization.
Testing.query.filter(Testing.alias.contains(url_param_val)).all()
I am getting the below error
NotImplementedError: Operator 'contains' is not supported on this expression
The alias field is a JSON type, not an array type. JSON columns don't have a contains method, even if you happen to be storing array data (how would the database know?)
In Postgres, you can use json_array_elements to expand a JSON array to a set of JSON values; this will return one row per element:
select id, json_array_elements(alias) as val from testing;
 id | val
----+------
  1 | "a"
  1 | "b"
  2 | "d"
  2 | "e"
You can use that as a subquery to select records that contain a matching value:
select t.id, t.name, t.alias, cast(q.val as varchar)
from testing t, (
select id, json_array_elements(alias) as val
from testing
) q
where q.id=t.id and cast(q.val as varchar) = '"a"';
In SQLAlchemy syntax:
import sqlalchemy as sa
from sqlalchemy import cast, func

subq = session.query(
    Testing.id,
    func.json_array_elements(Testing.alias).label("val")
).subquery()

q = session.query(Testing).filter(
    cast(subq.c.val, sa.Unicode) == '"a"',
    subq.c.id == Testing.id)
Warning: this is going to be very inefficient for large tables; you're probably better off fixing the types to match your data, and then creating appropriate indexes.
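For instance, if the data really is a list of strings, switching the column to the Postgres JSONB type lets you use the containment operator @> directly and back it with a GIN index. A sketch of that direction (the JSONB column and the index are assumptions, not part of the original setup):
from sqlalchemy.dialects.postgresql import JSONB

# alias = db.Column(JSONB)   # instead of db.Column(JSON)

# JSONB's contains() renders as:  alias @> '["a"]'
results = Testing.query.filter(Testing.alias.contains(["a"])).all()

# Backed by a GIN index, e.g.:
# CREATE INDEX ix_testing_alias ON testing USING gin (alias);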

Expressing Postgresql VALUES command in SQLAlchemy ORM?

How to express the query
VALUES ('alice'), ('bob') EXCEPT ALL SELECT name FROM users;
(i.e. "list all names in VALUES that are not in table 'users'") in SQLAlchemy ORM? In other words, what should the statement 'X' below be like?
def check_for_existence_of_all_users_in_list(list):
    logger.debug(f"checking that each user in {list} is in the database")
    query = X(list)
(There is sqlalchemy.values which could be used like this:
query = sa.values(sa.column('name', sa.String)).data(['alice', 'bob']) # .???
but it appears that it can only be used as argument to INSERT or UPDATE.)
I am using SQLAlchemy 1.4.4.
This should work for you:
from sqlalchemy import String, column, select, values

user_names = ['alice', 'bob']
q = values(column('name', String), name="temp_names").data([(_,) for _ in user_names])
query = select(q).except_all(select(users.c.name))  # 'users' is a Table instance
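To run the check from the question's own function, a minimal sketch (assuming a synchronous Session named session and a Core Table named users):
def check_for_existence_of_all_users_in_list(user_names):
    logger.debug(f"checking that each user in {user_names} is in the database")
    q = values(column('name', String), name="temp_names").data(
        [(name,) for name in user_names])
    stmt = select(q).except_all(select(users.c.name))
    # Names from the list that are not present in the users table
    return session.execute(stmt).scalars().all()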

Replace rows based on a modified timestamp

I am looking for an efficient method (which I can reuse for similar situations) to drop rows which have been updated.
My table has many columns, but the important ones are:
creation_timestamp, id, last_modified_timestamp
My primary key is the creation_timestamp and the id. However, after an id has been created, it can be modified by other users, which is indicated by the last_modified_timestamp. Each day I need to:
1) Read a daily file and add any new rows (based on creation_timestamp and id)
2) Remove old rows which have a different last_modified_timestamp and replace them with the latest versions.
I typically do most of my operations with Pandas (Python library) and psycopg2, so I am not extremely familiar with PostgreSQL 9.6, which is the database I am using. My initial approach is to just add the last_modified_timestamp to the primary key, and then just use a view to SELECT DISTINCT based on the latest changes. However, it seems like that is 'cheating' and I will be wasting space, since I do not need to retain previous versions.
EDIT:
def create_update_query(df, table=FACT_TABLE):
    columns = ', '.join([f'{col}' for col in DATABASE_COLUMNS])
    constraint = ', '.join([f'{col}' for col in PRIMARY_KEY])
    placeholder = ', '.join([f'%({col})s' for col in DATABASE_COLUMNS])
    updates = ', '.join([f'{col} = EXCLUDED.{col}' for col in DATABASE_COLUMNS])
    query = f"""
    INSERT INTO {table} ({columns})
    VALUES ({placeholder})
    ON CONFLICT ({constraint})
    DO UPDATE SET {updates};"""
    query = ' '.join(query.split())
    return query

def load_updates(df, connection=DATABASE):
    conn = connection.get_conn()
    cursor = conn.cursor()
    df1 = df.where(pd.notnull(df), None)
    insert_values = df1.to_dict(orient='records')
    for row in insert_values:
        cursor.execute(create_update_query(df), row)
    conn.commit()
    cursor.close()
    del cursor
    conn.close()
This appears to work. I was running into some issues, so right now I am looping through each row of the DataFrame as a dictionary, then inserting that row. Also, I had to figure out a way to fill in the NaN columns with None, because I was getting errors with Timestamp dtypes with blank values, etc.
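If the row-by-row loop ever becomes a bottleneck, psycopg2's execute_batch helper can run the same ON CONFLICT statement for many rows with far fewer round trips. A sketch built on the functions above (the batched variant and its page size are illustrative, not part of the original code):
from psycopg2.extras import execute_batch

def load_updates_batched(df, connection=DATABASE):
    conn = connection.get_conn()
    cursor = conn.cursor()
    df1 = df.where(pd.notnull(df), None)
    rows = df1.to_dict(orient='records')
    # One statement, executed for all rows in batches of 100
    execute_batch(cursor, create_update_query(df), rows, page_size=100)
    conn.commit()
    cursor.close()
    conn.close()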

Using many connection.cursor()

I want to fetch data from 3 tables in a single database at once, and I used 3 conn.cursor() objects to do it. Is there a more sophisticated way to do this?
conn = psycopg2.connect(database="plottest", user="postgres")
self.statusbar.showMessage("Database opened Successfully", 1000)
cur = conn.cursor()
cur1 = conn.cursor()
cur2 = conn.cursor()
cur.execute("SELECT id ,actual from \"%s\" " % date)
rows = cur.fetchall()
cur1.execute("SELECT qty from DAILY where date = \'%s\'" % date)
dailyqty = cur1.fetchone()
cur2.execute("SELECT qty from MONTHLY where month = \'%s\'" % month)
monthqty = cur2.fetchone()
Awoogah awoogah, SQL injection warning! Don't write code using string interpolation. What happens if someone calls your code with the "date" ');-- DROP TABLE DAILY;-- ?
Use bind parameters. Always.
The only exception is for dynamic identifiers, like in the case above where you seem to use a table named after the current date. In that case you must "double quote" them and double any contained double-quotes. In your case that means that date should be date.replace('"', '""') where you substitute it into the SQL.
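With psycopg2 2.7 or later there is also a helper for quoting dynamic identifiers, so you don't have to hand-roll the doubling. A sketch using the table-per-date layout from the question:
from psycopg2 import sql

# sql.Identifier() double-quotes the table name safely
cur.execute(
    sql.SQL("SELECT id, actual FROM {}").format(sql.Identifier(date)))
rows = cur.fetchall()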
Now, back to our regular programming.
Since you fetchall() from each cursor, you can just re-use a single cursor. You don't need a new cursor for each query.
You can also combine the daily and monthly stats if you want, with a UNION ALL. I fixed your capitalisation and parameter binding in the process:
cur.execute("""SELECT 1, qty FROM daily WHERE date = %s
               UNION ALL
               SELECT 2, qty FROM monthly WHERE month = %s
               ORDER BY 1""",
            (date, month))
Note that string interpolation isn't used; instead, a 2-tuple of parameters is passed for psycopg2 to bind directly. There's no need for quotes around the parameters; psycopg2 adds them if needed.
This avoids a client-server round trip by bundling the two queries. The extra column and ORDER BY are technically needed so you can safely assume the first row is the daily result and the second is the monthly one. In practice PostgreSQL won't re-order them with UNION ALL, though.
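One way to read the combined result back on the client (a sketch; the tag column from the UNION tells you which row is which, even if one of the queries returns nothing):
rows = cur.fetchall()  # e.g. [(1, daily_qty), (2, monthly_qty)]
dailyqty = next((qty for tag, qty in rows if tag == 1), None)
monthqty = next((qty for tag, qty in rows if tag == 2), None)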
You can combine
SELECT a1 FROM t1 WHERE b1 = 'v1';
and
SELECT a2 FROM t2 WHERE b2 = 'v2';
to a single statement like this:
SELECT t1.a1, t2.a2 FROM t1, t2
WHERE t1.b1 = 'v1' AND t2.b2 = 'v2';
provided that both queries return exactly one row.
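Applied to the daily and monthly tables from the question, that looks roughly like this (again assuming each query matches exactly one row, and keeping the bind parameters from the first answer):
cur.execute(
    "SELECT d.qty, m.qty FROM daily d, monthly m "
    "WHERE d.date = %s AND m.month = %s",
    (date, month))
dailyqty, monthqty = cur.fetchone()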

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database with one particular table that has many rows. One column in this table, called data, is a float array, REAL[], and gets filled with an array of ~4500 elements per row. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For instance, I want to select all rows where the data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods or setup when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}
    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the comment below, here is the code I'm using to try to select all rows (out of 1000) that have data (from 1 column) >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset number: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, I use the le operator? Along the same lines, there should be 234 rows that have data values between >= 1.0 and <= 1.2, but this statement fails to give the correct subset:
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db: datadb
schema: Table

pk   data
0    [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1    [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2    [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3    [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4    [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data values >= 1.0, while just 2 have values in the range >= 1.0 and <= 1.2. In the first case, the query I would run to grab the rows is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows, at pk=0,1,2,3. This query does what I expect. The second case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However, this query just returns the same 4 rows as the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator: data.any(value, operator=op) renders as value op ANY (data), with the bound value on the left-hand side, so operator=operators.le produces 10 <= ANY (data), i.e. rows where some element is >= 10. According to the docs, you should be able to do the following:
from sqlalchemy.sql import operators

rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        ).all()
EDIT:
# The combined filter does not work: each .any() is an independent
# "value op ANY (data)" test, so the two conditions can be satisfied by
# different elements of the array. Applying one of them is still useful,
# though, as it reduces the result set before filtering in memory.
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# Filter the remaining rows in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]

for item in items:
    print(item)
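If the range test has to stay on the database side, one alternative (a sketch, not from the original answer) is to unnest the array inside a correlated EXISTS, written here as a textual clause since the column is a plain Postgres REAL[]:
from sqlalchemy import text

rows = (session.query(MyTable)
        .filter(text(
            "EXISTS (SELECT 1 FROM unnest(data) AS elem "
            "WHERE elem BETWEEN 1.0 AND 1.2)"))
        .all())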