Remove duplicates from a query sqlalchemy postgres func.max group_by - postgresql

I'm having a problem with one query.
Field X.1 is duplicated across rows.
I would like to use group_by and func.max to keep, for each X.1 value, only the row that has the max value.
There is not much to choose from for the "max" column; I would prefer the timestamp.
But it is not working for me, even with an int field.
Do you know what I'm doing wrong?
q = self.session.query(
    X.0,   # object id
    X.1,   # these values are duplicated; in the output I want only the row with the max value per group
    X.2,
    X.3,
    X.4,
    X.5,
    X.6,
    X.7,
    X.8,
    X.9,
    X.10,  # this is the timestamp
).filter(X.3 == filter_value1)
q.group_by(
    X.0,   # object id
    X.1,   # the duplicated field
    X.2,
    X.3,
    X.4,
    X.5,
    X.6,
    X.7,
    X.8,
    X.9,
    func.max(X.10))  # timestamp
q.all()
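For what it's worth, two things stand out in the snippet above: Query.group_by() returns a new query, so calling q.group_by(...) without assigning the result back to q discards it, and an aggregate like func.max() belongs in the select list, not inside group_by(). Below is a minimal sketch of the usual "keep the max row per group" pattern, assuming a mapped class X with hypothetical column names object_id, dup_field (the duplicated X.1), created_at (the X.10 timestamp) and some_column (the X.3 filter column): compute the max timestamp per duplicated value in a subquery, then join the full rows back to it.

from sqlalchemy import func

# max timestamp per duplicated value (hypothetical column names, adjust to the real model)
latest = (
    self.session.query(
        X.dup_field,
        func.max(X.created_at).label('max_created_at'),
    )
    .group_by(X.dup_field)
    .subquery()
)

# keep only the rows whose timestamp equals the per-group maximum
q = (
    self.session.query(X)
    .join(
        latest,
        (X.dup_field == latest.c.dup_field)
        & (X.created_at == latest.c.max_created_at),
    )
    .filter(X.some_column == filter_value1)  # the original X.3 filter
)
rows = q.all()

On PostgreSQL, DISTINCT ON combined with an ORDER BY on the timestamp is another common way to do this, but the subquery/join form stays closest to the group_by/func.max idea in the question.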

Related

date matching between fields

/*
I have some records I need to filter through. The 2nd field must match the 1st field of the next record, and the record with the max hier_record_starting must be the final record. So I need a query that pulls records 1, 2, 4 and 6, and I need the row_number to be dynamic. Any help would be appreciated, as I'm having trouble conceptualizing this one. (See the sketch after the sample data below.)
*/
create temp table data (
    hier_record_starting date
    ,hier_record_ending date
    ,row_number smallint
);
insert into data(hier_record_starting,hier_record_ending,row_number)
values('2013-09-16','2013-10-08','1') -- 09/16/13 thru 10/08/13
,('2013-10-08','2021-10-31','2') -- 10/08/13 thru 10/31/21
,('2021-10-31','2021-11-27','3') -- invalid as 2nd field value does not match 1st field value of next record
,('2021-10-31','2021-12-25','4') -- 10/31/21 thru 12/25/21
,('2021-11-27','9999-01-01','5') -- invalid as 2nd field value does not match 1st field value of next record
,('2021-12-25','9999-01-01','6') -- 12/25/21 thru 01/01/99
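Not a full answer, but here is a rough sketch in plain Python (one reading of the rule, not a definitive query) of how the valid chain can be reconstructed: start from the record with the max hier_record_starting, i.e. the final record, and walk backwards, keeping each record whose hier_record_ending equals the hier_record_starting of the record kept after it. On the sample data this keeps rows 1, 2, 4 and 6; in SQL the same walk would usually be expressed with a recursive CTE.

from datetime import date

rows = [  # (hier_record_starting, hier_record_ending, row_number)
    (date(2013, 9, 16), date(2013, 10, 8), 1),
    (date(2013, 10, 8), date(2021, 10, 31), 2),
    (date(2021, 10, 31), date(2021, 11, 27), 3),
    (date(2021, 10, 31), date(2021, 12, 25), 4),
    (date(2021, 11, 27), date(9999, 1, 1), 5),
    (date(2021, 12, 25), date(9999, 1, 1), 6),
]

current = max(rows, key=lambda r: r[0])   # final record = max hier_record_starting
chain = [current]
while True:
    # predecessor: the row whose ending matches the current row's starting
    prev = [r for r in rows if r[1] == current[0]]
    if not prev:
        break
    current = prev[0]
    chain.append(current)

print(sorted(r[2] for r in chain))   # -> [1, 2, 4, 6]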

Tableau count number of Records that have the Max value for that field

I have a field where I'd like to count the number of times the field holds the max value for that column. For example, if the max value for a given column is 20, I want to know how many 20s are in that column. I've tried the following formula, but I received a "Cannot mix aggregate and non-aggregate arguments with this function." error.
IF [Field1] = MAX([Field1])
THEN 1
ELSE 0
END
Try
IF ATTR([Field1]) = MAX([Field1])
THEN 1
ELSE 0
END
ATTR() is an aggregation, which allows you to compare aggregate and non-aggregate values. As long as the field you are aggregating with ATTR() contains a unique value, this won't have an impact on your data.
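If it helps to see the intent outside Tableau, here is a tiny plain-Python illustration (made-up values) of what the corrected calculation counts: how many entries in a column equal that column's maximum.

values = [12, 20, 7, 20, 15]                             # hypothetical Field1 values
max_value = max(values)                                  # MAX([Field1])
count_of_max = sum(1 for v in values if v == max_value)  # the IF ... THEN 1 ELSE 0 sum
print(count_of_max)                                      # -> 2, since 20 appears twice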

druid query count for multiple columns

I have a query to count null values in a column. How can I adapt it to return the count of null values across multiple columns? I have tried adding a list of fields, e.g. ['ip_address', 'user_agent'], to the dimension field, but this didn't work.
{"intervals":["2019-05-26T00:00:00.000Z/2019-06-25T00:00:00.000Z"],
"granularity":"all",
"context":{"timeout":60000,
"queryId":"71fe66b2-e654-45dc-8a8c-38ed160e79f5"},
"queryType":"timeseries",
"dataSource":"dataset-tablename”,
"aggregations":[{"type":"count",
"name":"count"}],
"filter":{"type":"and",
"fields":[{"type":"selector",
"dimension":"ip_address",
"value":"null"}]}}
This returns two columns:
Timestamp | Count
2019-04-27T04:55:01.000Z | 246,933
which is the count of ip_address records with null values in the timeframe. How can I return the counts for the other fields as well?
You can use filtered aggregators:
{"intervals":["2019-05-26T00:00:00.000Z/2019-06-25T00:00:00.000Z"],
"granularity":"all",
"context":{"timeout":60000, "queryId":"71fe66b2-e654-45dc-8a8c-38ed160e79f5"},
"queryType":"timeseries",
"dataSource":"dataset-tablename",
"aggregations":[
{"type":"filtered", "filter":{"type":"selector", "dimension":"ip_address", "value":"null"},
"aggregator": {"type":"count", "name":"null_ip_address_count"}},
{"type":"filtered", "filter":{"type":"selector", "dimension":"user_agent", "value":"null"},
"aggregator": {"type":"count", "name":"null_user_agent_count"}}]
}
That is, instead of applying the filter to the entire query, apply the filter to individual aggregators.
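If you are running this outside the Druid console, here is a minimal usage sketch in Python (the broker URL is an assumption; adjust it to your cluster) that POSTs the query to Druid's native query endpoint:

import requests

# hypothetical broker address; native queries are POSTed to /druid/v2/ on the broker
DRUID_URL = "http://localhost:8082/druid/v2/"

query = {
    "intervals": ["2019-05-26T00:00:00.000Z/2019-06-25T00:00:00.000Z"],
    "granularity": "all",
    "context": {"timeout": 60000},
    "queryType": "timeseries",
    "dataSource": "dataset-tablename",
    "aggregations": [
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "ip_address", "value": "null"},
         "aggregator": {"type": "count", "name": "null_ip_address_count"}},
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "user_agent", "value": "null"},
         "aggregator": {"type": "count", "name": "null_user_agent_count"}},
    ],
}

response = requests.post(DRUID_URL, json=query, timeout=60)
print(response.json())  # one entry per interval containing both null counts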

Min value with GROUP BY in Power BI Desktop

id   datetime              new_column            datetime_rankx
1    12.01.2015 18:10:10   12.01.2015 18:10:10   1
2    03.12.2014 14:44:57   03.12.2014 14:44:57   1
2    21.11.2015 11:11:11   03.12.2014 14:44:57   2
3    01.01.2011 12:12:12   01.01.2011 12:12:12   1
3    02.02.2012 13:13:13   01.01.2011 12:12:12   2
3    03.03.2013 14:14:14   01.01.2011 12:12:12   3
I want to add a new column that holds the minimum datetime value for each row's id group.
How can I do this in Power BI Desktop using a DAX expression?
Use this expression:
NewColumn =
CALCULATE(
    MIN(Table[datetime]),
    FILTER(Table, Table[id] = EARLIER(Table[id]))
)
In Power BI, applied to a table with your data, it produces the new_column values shown in the question.
UPDATE: Explanation and EARLIER function usage.
Basically, the EARLIER function gives you access to values from a different (earlier) row context.
When you use the CALCULATE function it creates a row context over the whole table; conceptually it iterates over every table row. The same happens when you use the FILTER function: it iterates over the whole table and evaluates every row against the filter condition.
So far we have two row contexts: the one created by CALCULATE and the one created by FILTER. Note that FILTER uses EARLIER to get access to CALCULATE's row context. In other words, for every row in the outer (CALCULATE) context, FILTER returns the set of rows that match that outer row's id.
If you have a programming background this may feel familiar: it is similar to a nested loop.
Hopefully this Python code illustrates the main idea:
# Each "outer" row comes from CALCULATE's row context; FILTER then scans the
# whole table again, and EARLIER lets the inner scan see the outer row's id.
outer_context = [(1, '2015-01-12 18:10:10'), (2, '2014-12-03 14:44:57'), (2, '2015-11-21 11:11:11')]
inner_context = outer_context  # FILTER iterates over the same table

for outer_id, outer_datetime in outer_context:
    group_min = None
    for inner_id, inner_datetime in inner_context:
        if inner_id == outer_id:  # this comparison is what FILTER and EARLIER do
            # calculate the min datetime using the filtered rows
            if group_min is None or inner_datetime < group_min:
                group_min = inner_datetime
    print(outer_id, outer_datetime, group_min)
UPDATE 2: Adding a ranking column.
To get the desired rank you can use this expression:
RankColumn =
RANKX(
    CALCULATETABLE(Table, ALLEXCEPT(Table, Table[id])),
    Table[datetime],
    Table[datetime],
    1
)
This produces the rank column shown as datetime_rankx in the question.
Let me know if this helps.

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database with one particular table that has many rows. One column in this table, called data, is a float array (REAL[]) and gets filled with an array of ~4500 elements per row. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where data contains values >= 10, or values between 10 and 20 inclusive.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods or setup when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}
    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the SQLAlchemy ARRAY Comparator docs, but those only seem to work on exact values. My data is precise to 6 significant figures, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the comment below, here is the code I'm using to try to select all rows (out of 1000) whose data column contains values >= 1.0. There should be 537 such rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, why do I use the le operator? Along the same lines, there should be 234 rows whose data contains values between 1.0 and 1.2 inclusive, but this statement fails to give the correct subset:
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db datadb
schema Table
pk data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data containing values >= 1.0, while just 2 have values in the range 1.0 to 1.2 inclusive. For the first case, the query I would run to grab the rows is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows at pk=0,1,2,3, and it does what I expect. The second case is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
which should return the 2 rows at pk=1,3. However, this query just returns the same 4 rows as the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:
from sqlalchemy.sql import operators

rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        ).all()
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]

for item in items:
    print(item)
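A short note on the operator question, since it tripped up the asker: ARRAY.Comparator.any(value, operator=op) puts the scalar on the left of ANY, so operators.le renders roughly as "1.0 <= ANY (data)", i.e. "some element of data is >= 1.0". That also explains why chaining the two .any() filters returns too many rows: each condition can be satisfied by a different element of the same array, so the per-element range check has to be done another way (for example in memory, as above). A minimal sketch with a hypothetical model, just to show the generated SQL:

from sqlalchemy import Column, Float, Integer
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base
from sqlalchemy.sql import operators

Base = declarative_base()

class Demo(Base):                       # hypothetical stand-in for datadb.Table
    __tablename__ = 'demo'
    pk = Column(Integer, primary_key=True)
    data = Column(postgresql.ARRAY(Float))

# scalar on the left of ANY: "1.0 <= ANY (demo.data)"
expr = Demo.data.any(1.0, operator=operators.le)
print(expr.compile(dialect=postgresql.dialect()))
# prints something like:  %(param_1)s <= ANY (demo.data)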