SqlAlchemy: count of distinct over multiple columns - postgresql

I can't do:
>>> session.query(
func.count(distinct(Hit.ip_address, Hit.user_agent))).first()
TypeError: distinct() takes exactly 1 argument (2 given)
I can do:
session.query(
func.count(distinct(func.concat(Hit.ip_address, Hit.user_agent)))).first()
That is fine for my purpose (a count of unique users in a 'pageload' db table).
However, it isn't correct in the general case: because the two values are simply concatenated, it will give a count of 1 instead of 2 for the following table:
col_a | col_b
----------------
xx | yy
xxy | y
Is there any way to generate the following SQL (which is valid in postgresql at least)?
SELECT count(distinct (col_a, col_b)) FROM my_table;

distinct() accepts more than one argument when appended to the query object:
session.query(Hit).distinct(Hit.ip_address, Hit.user_agent).count()
It should generate something like:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT ON (hit.ip_address, hit.user_agent)
hit.ip_address AS hit_ip_address, hit.user_agent AS hit_user_agent
FROM hit) AS anon_1
which is even a bit closer to what you wanted.

The exact query can be produced using the tuple_() construct:
session.query(
func.count(distinct(tuple_(Hit.ip_address, Hit.user_agent)))).scalar()

It looks like SQLAlchemy's distinct() accepts only one column or expression.
Another way around this is to use group_by and count. This should be more efficient than concatenating the two columns, because with GROUP BY the database can use indexes on those columns if they exist:
session.query(Hit.ip_address, Hit.user_agent).\
group_by(Hit.ip_address, Hit.user_agent).count()
The generated query still looks different from the one you asked about:
SELECT count(*) AS count_1
FROM (SELECT hittable.user_agent AS hittable_user_agent, hittable.ip_address AS hittable_ip_address
FROM hittable GROUP BY hittable.user_agent, hittable.ip_address) AS anon_1
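As a rough illustration of the GROUP BY approach, here is a minimal sketch that runs the same shape of SQL through Python's built-in sqlite3 module. SQLite is only a stand-in for PostgreSQL here, and the `hit` table and its sample rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit (ip_address TEXT, user_agent TEXT)")
conn.executemany(
    "INSERT INTO hit VALUES (?, ?)",
    [
        ("xx", "yy"),
        ("xxy", "y"),  # concatenates to the same string as the row above
        ("xx", "yy"),  # true duplicate of the first row
    ],
)

# Count distinct (ip_address, user_agent) pairs by grouping in a subquery.
(pair_count,) = conn.execute(
    """
    SELECT count(*)
    FROM (SELECT ip_address, user_agent
          FROM hit
          GROUP BY ip_address, user_agent) AS anon_1
    """
).fetchone()

# The same data counted via plain concatenation, for comparison.
(concat_count,) = conn.execute(
    "SELECT count(DISTINCT ip_address || user_agent) FROM hit"
).fetchone()

print(pair_count, concat_count)
```

The grouped subquery treats the two columns as a pair, so it does not merge rows that merely concatenate to the same string.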

You can add a separator character in the concat function so that different column pairs produce different strings, provided the separator never occurs in the data itself. Taking your example as a reference, it would be:
session.query(
func.count(distinct(func.concat(Hit.ip_address, "-", Hit.user_agent)))).first()
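A small pure-Python sketch of why the separator trick is only as safe as the separator itself. The helper names and sample rows below are hypothetical and chosen to show the edge case; they do not come from the question's data.

```python
def count_distinct_concat(rows, sep=""):
    """Count distinct rows after joining the columns into one string."""
    return len({sep.join(row) for row in rows})


def count_distinct_pairs(rows):
    """Count distinct rows by comparing the columns as a tuple."""
    return len({tuple(row) for row in rows})


# The table from the question: plain concatenation merges the two rows.
question_rows = [("xx", "yy"), ("xxy", "y")]
no_sep = count_distinct_concat(question_rows)         # both become "xxyy"
with_sep = count_distinct_concat(question_rows, "-")  # "xx-yy" vs "xxy-y"

# Rows whose values contain the separator can still collide.
tricky_rows = [("a-", "b"), ("a", "-b")]
tricky_sep = count_distinct_concat(tricky_rows, "-")  # both become "a--b"
tricky_pairs = count_distinct_pairs(tricky_rows)

print(no_sep, with_sep, tricky_sep, tricky_pairs)
```

If the separator can appear in either column, comparing the columns as a pair (as tuple_ or GROUP BY do) avoids the problem entirely.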


Full outer join with different WHERE clauses in Knex.js for PostgreSQL

I am trying to get a single row with two columns showing aggregation results: one column should show the total sum based on one WHERE clause, while the other column should show the total sum based on a different WHERE clause.
Desired output:
amount_vic amount_qld
100 70
In raw PostgreSQL I could write something like that:
select
sum(a.amount) as amount_vic,
sum(b.amount) as amount_qld
from mytable a
full outer join mytable b on 1=1
where a.state='vic' and b.state= 'qld'
Question: How do I write this or a similar query that returns the desired outcome in knex.js? For example: the 'on 1=1' probably needs knex.raw() and I think the table and column aliases do not work for me and it always returns some errors.
One of my non-working attempts in knex.js:
knex
.sum({ amount_vic: 'a.amount' })
.sum({ amount_qld: 'b.amount' })
.from('mytable')
.as('a')
.raw('full outer join mytable on 1=1')
.as('b')
.where({
a.state: 'vic',
b.state: 'qld'
})
Thank you for your help.
Disclaimer: this does not answer the Knex part of the question - but it is too long for a comment.
Although your current query does what you want, the way it is phrased seems suboptimal. There is no need to generate a self-cartesian product here, which is what full join ... on 1 = 1 does. You can just use conditional aggregation.
In Postgres, you would phrase this as:
select
sum(amount) filter(where state = 'vic') amount_vic,
sum(amount) filter(where state = 'qld') amount_qld
from mytable
where state in ('vic', 'qld')
I don't know Knex so I cannot tell how to translate the query to it. Maybe this query is easier for you to translate.
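To make the conditional aggregation idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. It uses sum(CASE WHEN ...) rather than FILTER, since CASE is the more portable spelling of the same idea; the sample rows are invented so that the totals match the desired output in the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (state TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("vic", 60), ("vic", 40), ("qld", 70), ("nsw", 5)],  # made-up rows
)

# One pass over the table; each sum only sees the rows for its state,
# because CASE yields NULL for other rows and sum() ignores NULLs.
row = conn.execute(
    """
    SELECT
        sum(CASE WHEN state = 'vic' THEN amount END) AS amount_vic,
        sum(CASE WHEN state = 'qld' THEN amount END) AS amount_qld
    FROM mytable
    WHERE state IN ('vic', 'qld')
    """
).fetchone()

print(row)  # (100, 70)
```

Since both columns are plain aggregate expressions over a single table, no join or table alias is needed, which may make the query easier to express through knex.raw() for the two select expressions.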

Casting rows to arrays in PostgreSQL

I need to query a table as in
SELECT *
FROM table_schema.table_name
except that each row needs to be a TEXT[] whose elements are the column values cast to TEXT, in the same order as in SELECT *. So, assuming the table has columns a, b and c, I need the result to look like
SELECT ARRAY[a::TEXT, b::TEXT, c::TEXT]
FROM table_schema.table_name
only it shouldn't explicitly list columns by name. Ideally it should look like
SELECT as_text_array(a)
FROM table_schema.table_name AS a
The best I came up with looks ugly and relies on the hstore extension
WITH columnz AS ( -- get ordered column name array
SELECT array_agg(attname::TEXT ORDER BY attnum) AS column_name_array
FROM pg_attribute
WHERE attrelid = 'table_schema.table_name'::regclass AND attnum > 0 AND NOT attisdropped
)
SELECT hstore(a)->(SELECT column_name_array FROM columnz)
FROM table_schema.table_name AS a
I have a feeling there must be a simpler way to achieve that.
UPDATE 1
Another query that achieves the same result, though arguably as ugly and inefficient as the first one, is inspired by the answer by @bspates. It may be even less efficient, but it doesn't rely on extensions
SELECT r.text_array
FROM table_schema.table_name AS a
INNER JOIN LATERAL ( -- parse ROW::TEXT presentation of a row
SELECT array_agg(COALESCE(replace(val[1], '""', '"'), NULLIF(val[2], ''))) AS text_array
FROM regexp_matches(a::text, -- parse double-quoted and simple values separated by commas
'(?<=\A\(|,) (?: "( (?:[^"]|"")* )" | ([^,"]*) ) (?=,|\)\Z)', 'xg') AS t(val)
) AS r ON TRUE
It is still far from ideal
UPDATE 2
I tested all 3 options existing at the moment
Using JSON. It doesn't rely on any extensions, it is short to write, easy to understand and the speed is ok.
Using hstore. This alternative is the fastest (more than 10 times faster than the JSON approach on a 100K dataset) but requires an extension. hstore in general is a very handy extension to have, though.
Using regex to parse TEXT presentation of a ROW. This option is really slow.
A somewhat ugly hack is to convert the row to a JSON value, then unnest the values and aggregate them back into an array:
select array(select (json_each_text(to_json(t))).value) as row_value
from some_table t
Which is to some extent the same as your hstore hack.
If the order of the columns is important, json_each_text combined with WITH ORDINALITY can be used to preserve it:
select array(select val
from json_each_text(to_json(t)) with ordinality as t(k,val,idx)
order by idx)
from the_table t
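The easiest (read: hackiest) way I can think of is to convert the row to a string first and then parse that string into an array, like so:
SELECT string_to_array(table_name::text, ',') FROM table_name
BUT the text form of a row includes the surrounding parentheses and double-quotes any value that contains a comma, so the split is only approximate, and depending on the size and type of the data in the table this could perform very badly.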
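The fragility of splitting a row's text form on commas can be sketched in plain Python. The string below is what PostgreSQL prints for a three-column row whose second value contains a comma; the csv-based parse is only an approximation of the row syntax (it does not distinguish NULL from an empty string, for instance) and is shown as an assumption, not as a faithful parser.

```python
import csv

# Text form of a row with values 1, 'a,b' and 'c'.
row_text = '(1,"a,b",c)'

# Naive split, which is roughly what string_to_array(row::text, ',') does:
# the parentheses stay attached and the quoted value is cut in two.
naive = row_text.split(",")

# Stripping the parentheses and honouring double quotes recovers the values.
parsed = next(csv.reader([row_text[1:-1]]))

print(naive)   # ['(1', '"a', 'b"', 'c)']
print(parsed)  # ['1', 'a,b', 'c']
```

This is why the regex-based and JSON-based variants in this thread go to the trouble of handling quoting explicitly.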

Why am I getting the Aggregate Function error with the following SELECT?

I'm attempting to count the number of instances of a value that includes "WMH" and then sort the result (like a "Top 10 Most Popular" list). But I am getting the aggregate function error.
Any thoughts about this:
SELECT TrackingLabel, Count(TrackingLabel) AS CountOfTrackingLabel
FROM CTM_Export
GROUP BY TrackingLabel
HAVING TrackingLabel Like 'WMH'
The HAVING clause filters groups after aggregation and normally tests an aggregate function from the select list, for example:
SELECT column_name, aggregate_function(column_name)
FROM table_name
GROUP BY column_name
HAVING aggregate_function(column_name) operator value;
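With regard to your query: a condition on a plain column belongs in WHERE, which filters rows before they are grouped. Note also that Like 'WMH' without wildcard characters matches only the exact value 'WMH'; to match labels that merely include it, add your database's wildcard characters to the pattern. If I understand your requirements, you could rewrite the query like: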
SELECT TrackingLabel, Count(TrackingLabel) AS CountOfTrackingLabel
FROM CTM_Export
WHERE TrackingLabel Like 'WMH'
GROUP BY TrackingLabel;
regards, Michael
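A minimal sketch of the WHERE-before-GROUP BY form, including the sorting the question asks for, run through Python's built-in sqlite3 module. SQLite is an assumption here (the question does not name its database), so the '%' wildcard and LIMIT may need to be adapted; the sample labels are invented.

```python
import sqlite3

# In-memory SQLite database; the question's actual database may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CTM_Export (TrackingLabel TEXT)")
conn.executemany(
    "INSERT INTO CTM_Export VALUES (?)",
    [("WMH-Radio",), ("WMH-Radio",), ("WMH-Radio",),
     ("TV-WMH",), ("Other",), ("Other",)],  # made-up labels
)

# Filter rows first, then group, then sort by the count for a "top 10".
top = conn.execute(
    """
    SELECT TrackingLabel, count(TrackingLabel) AS CountOfTrackingLabel
    FROM CTM_Export
    WHERE TrackingLabel LIKE '%WMH%'
    GROUP BY TrackingLabel
    ORDER BY CountOfTrackingLabel DESC, TrackingLabel
    LIMIT 10
    """
).fetchall()

print(top)  # [('WMH-Radio', 3), ('TV-WMH', 1)]
```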

Order by custom named rows

I’d like to sort my postgres results by some fancy ranking function, but for sake of simplicity, let’s say that I’d like to add two custom rows and sort by them.
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
ORDER BY distance + age_in_hours;
However, it doesn’t work; I get the error: ERROR: column "distance" does not exist.
Is it possible to order my results by those custom-named columns?
I’m running postgres 9.1.x
In PostgreSQL, an output column alias from the SELECT list can be used in ORDER BY only on its own, not inside an expression, so ORDER BY distance works but ORDER BY distance + age_in_hours does not.
You can use column-position specification (e.g. ORDER BY 1, 2), but that doesn't accept an expression either; you cannot ORDER BY 1+2, for example. So you need to use a subquery to generate the result set and then sort it in an outer query:
SELECT *
FROM (
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
) x
ORDER BY distance + age_in_hours;
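The subquery pattern can be tried out with Python's built-in sqlite3 module. This is only a sketch: SQLite is a stand-in for PostgreSQL (and is more lenient about aliases in ORDER BY), and the columns `age_raw` and `dist_raw` with their simple arithmetic are hypothetical placeholders for the real age and distance calculations.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (name TEXT, age_raw INTEGER, dist_raw INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("a", 10, 1), ("b", 2, 3), ("c", 5, 20)],  # made-up rows
)

# The inner query defines the aliases; the outer query sees them as
# ordinary columns and can combine them freely in ORDER BY.
ordered = [
    name
    for (name,) in conn.execute(
        """
        SELECT name
        FROM (
            SELECT my_table.*,
                   age_raw * 2  AS age_in_hours,
                   dist_raw + 1 AS distance
            FROM my_table
        ) AS x
        ORDER BY distance + age_in_hours
        """
    )
]

print(ordered)  # ['b', 'a', 'c']
```

The sort keys here are 22 for a, 8 for b and 31 for c, which gives the order shown.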

Parameterize set-returning function with column in the same query

Essentially, what I want to do is:
SELECT set_returning_func(id) FROM my_table;
However, the result will be a single column in record syntax, e.g.
set_returning_func
---------------------------------------------
(old,17,"August 2, 2011 at 02:54:59 PM")
(old,28,"August 4, 2011 at 08:03:12 AM")
(2 rows)
I want it to be unpacked into columns. If I write the query this way instead:
SELECT srf.* FROM my_table, set_returning_func(my_table.id) AS srf;
I get an error message:
ERROR: function expression in FROM cannot refer to other relations of same query level
How, then, do I get a result set, while also supplying the set-returning function with an argument?
The syntax I was looking for is:
SELECT (set_returning_func(id)).* FROM my_table;
Details
set_returning_func(id) is of composite type. Just as the * syntax can be used on tables:
SELECT my_table.* FROM my_table, my_other_table
It can also be used on composite values (though they must be wrapped in parentheses). Intuitively, one can also select individual columns from a composite-returning function:
SELECT (set_returning_func(id)).time FROM my_table;
Some set-returning functions have a scalar rather than composite return type. In these cases, the (expr).* syntax doesn't make sense, and produces an error:
> SELECT (generate_series(1,5)).*;
ERROR: type integer is not composite
The correct syntax is simply:
> SELECT generate_series(1,5);
generate_series
-----------------
1
2
3
4
5
(5 rows)