Counting all entries with KSQL - apache-kafka

Is it possible to use KSQL to not only count entries of a specific column via GROUP BY but instead get an aggregate over all the entries that stream through the application?
I'm searching for something like this:
| Count all | Count id1 | Count id2 |
|-----------|-----------|-----------|
|    245    |    150    |    95     |
Or more like this in KSQL:
[some timestamp] | Count all | 245
[some timestamp] | Count id1 | 150
[some timestamp] | Count id2 | 95
...
Thank you
- Tim

You cannot get both the count over all rows and the count for each key from the same query. You can use two queries here: one that counts each value in the given column, and another that counts all rows.
Let's assume you have a stream with two columns, col1 and col2.
To count the occurrences of each value in col1 with an infinite window size, you can use the following query:
SELECT col1, count(*) FROM mystream1 GROUP BY col1;
To count all rows you need two queries, since KSQL always requires a GROUP BY clause for aggregation. First create a new column with a constant value; then count the values in that column. Since the value is the same in every row, its count is the total row count. Here is an example:
CREATE STREAM mystream2 AS SELECT 1 AS col3 FROM mystream1;
SELECT col3, count(*) FROM mystream2 GROUP BY col3;
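If you want the total maintained as a continuously updated table rather than a one-off query result, a persistent aggregate along these lines should work (a sketch, reusing mystream2 from above; the table name totalcount is illustrative):
CREATE TABLE totalcount AS
  SELECT col3, COUNT(*) AS total
  FROM mystream2
  GROUP BY col3;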

This also works to get the total row count for a table (here GROUP BY 1 groups every row under the constant 1):
ksql> SELECT COUNT(*) FROM `mytable` GROUP BY 1 EMIT CHANGES;
+------------+
| KSQL_COL_0 |
+------------+
| 2298       |

You can also run an extended describe on the stream or table to see the total number of messages:
ksql> DESCRIBE EXTENDED <stream or table name>;
Sample output:
Local runtime statistics
------------------------
messages-per-sec: 0 total-messages: 2415888 last-message: 2019-12-06T02:29:43.005Z

Related

Querying for the distinct count used to create a grouped, aggregated, and filtered row set

I have a table that looks like this:
control=# select * from animals;
 age_range | weight | species
-----------+--------+---------
 0-9       |      1 | lion
 0-9       |      2 | lion
 10-19     |      2 | tiger
 10-19     |      3 | horse
 20-29     |      2 | tiger
 20-29     |      2 | zebra
I perform a query that summarizes weights of animals within age range groups, and I only want to return rows that have aggregated weights above
a certain number.
Summary Query:
SELECT
    age_range,
    SUM(animals.weight) AS weight,
    COUNT(DISTINCT animals.species) AS distinct_species
FROM animals
GROUP BY age_range
HAVING SUM(animals.weight) > 3;
Summary Results:
 age_range | weight | distinct_species
-----------+--------+------------------
 10-19     |      5 |                2
 20-29     |      4 |                2
Now here's the rub. Along with this summary, I want to report the distinct number of species used to create the above summary row set as a whole. For simplicity, let's refer to this number as the 'Distinct Species Total'. In this simple example, only 3 species (tiger, zebra, horse) went into the 2 rows of the summary ('lion' was filtered out), so the 'Distinct Species Total' should be 3.
But I can't figure out how to successfully query for that number. Since the summary query must use a HAVING clause to filter the already grouped and aggregated row set, this complicates querying for the 'Distinct Species Total'.
This returns the wrong number, 2, because it is incorrectly a distinct count of a distinct count:
SELECT
    COUNT(DISTINCT distinct_species) AS distinct_species_total
FROM (
    SELECT
        age_range,
        SUM(animals.weight) AS weight,
        COUNT(DISTINCT animals.species) AS distinct_species
    FROM animals
    GROUP BY age_range
    HAVING SUM(animals.weight) > 3
) x;
And of course this returns the wrong number, 4, because it does not consider filtering the grouped and aggregated summary result using a having clause:
SELECT
    COUNT(DISTINCT species) AS distinct_species_total
FROM animals;
Any help at all in leading me down the right path here is appreciated, and will hopefully help others with a similar problem. In the end, I need a solution that works with Amazon Redshift.
Join the filtered result set back to the original animals table and count the distinct species across all of the joined rows:
select count(distinct y.species) as distinct_species_total
from
(
    select age_range
    from animals
    group by age_range
    having sum(animals.weight) > 3
) x
join animals y on x.age_range = y.age_range;
For the sample data this returns 3 (tiger, horse, zebra).
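If you want the summary rows and the total in a single result, one option that works on Redshift (which supports common table expressions) is to compute both and cross join them; the CTE names here are illustrative:
with summary as
(
    select age_range, sum(weight) as weight, count(distinct species) as distinct_species
    from animals
    group by age_range
    having sum(weight) > 3
),
total as
(
    select count(distinct a.species) as distinct_species_total
    from animals a
    join summary s on a.age_range = s.age_range
)
select summary.*, total.distinct_species_total
from summary cross join total;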

Find all multipolygons from one table within another

So, I've got two tables - PLUTO (pieces of land), and NYZMA (rezoning boundaries). They look like:
pluto:
 id | geom
----+-----------------
  1 | MULTIPOLYGON(x)
  2 | MULTIPOLYGON(y)
nyzma:
 name | geom
------+-----------------
 A    | MULTIPOLYGON(a)
 B    | MULTIPOLYGON(b)
And I want it to spit out something like this, assuming that PLUTO record 1 is in multipolygons A and B, and PLUTO record 2 is in neither:
 pluto_id | nyzma_id
----------+----------
 1        | [A, B]
 2        |
How do I, for every PLUTO record's corresponding geometry, cycle through each NYZMA record, and print the names of any whose geometry matches?
Join the two tables using the spatial function ST_Contains, then use GROUP BY and ARRAY_AGG in the main query:
WITH subquery AS (
    SELECT pluto.id, nyzma.name
    FROM pluto LEFT OUTER JOIN nyzma
        ON ST_Contains(nyzma.geom, pluto.geom)
)
SELECT id, array_agg(name) FROM subquery GROUP BY id;
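One caveat: with the LEFT OUTER JOIN, a pluto row that matches no nyzma polygon contributes a single NULL to its array, so pluto_id 2 comes back as {NULL}. If you would rather get an empty array, a variant like this should work (a sketch, assuming PostgreSQL 9.4+ for the aggregate FILTER clause):
WITH subquery AS (
    SELECT pluto.id, nyzma.name
    FROM pluto LEFT OUTER JOIN nyzma
        ON ST_Contains(nyzma.geom, pluto.geom)
)
SELECT id,
       COALESCE(array_agg(name) FILTER (WHERE name IS NOT NULL), '{}') AS nyzma_ids
FROM subquery
GROUP BY id;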

Finding exact matches to a requested set of values

Hi, I'm facing a challenge. There is a table named progress:
 User_id | Assessment_id
---------+---------------
       1 | Test_1
       2 | Test_1
       3 | Test_1
       1 | Test_2
       2 | Test_2
       1 | Test_3
       3 | Test_3
I need to pull out the user_ids who have completed only Test_1 and Test_2 (i.e. user_id 2). The input parameter would be the list of assessment ids.
Edit:
I want those who have completed all the assessments on the list, but no others.
User 3 did not complete Test_2, and so is excluded.
User 1 completed an extra test, and is also excluded.
Only User 2 has completed exactly those assessments requested.
You don't need a complicated join or even subqueries. You can combine the INTERSECT and EXCEPT set operators: intersect the completers of each required test, then subtract anyone who also completed a test outside the list:
select user_id from progress where assessment_id = 'Test_1'
intersect
select user_id from progress where assessment_id = 'Test_2'
except
select user_id from progress where assessment_id not in ('Test_1', 'Test_2');
I interpreted your question to mean that you want users who have completed all of the tests in your assessment list, but not any other tests. I'll use a technique called common table expressions so that you can follow step by step, but it is all one query statement.
Let's say you supply your assessment list as rows in a table called Checktests. We can count those values to find out how many tests are needed.
If we use a LEFT OUTER JOIN, values from the right-side table will be null wherever there is no match. So the test_matched column will be null if an assessment is not on your list. COUNT() ignores null values, so we can use this to find out how many of the tests a user took were on the list, and compare that to the total number of tests the user took.
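For this example, the checktests list might be set up like this (a hypothetical setup; the table and column names just need to match the query below):
create table checktests (assessment_id varchar(50));
insert into checktests values ('Test_1'), ('Test_2');
With that in place, the full query: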
with x as
    -- how many tests are on the list
    (select count(assessment_id) as tests_needed
     from checktests
    ),
dtl as
    (select p.user_id,
            p.assessment_id as test_taken,
            c.assessment_id as test_matched
     from progress p
     left join checktests c on p.assessment_id = c.assessment_id
    ),
y as
    (select user_id,
            count(test_taken) as all_tests,
            count(test_matched) as wanted_tests -- count() ignores nulls
     from dtl
     group by user_id
    )
select user_id
from y
join x on y.wanted_tests = x.tests_needed
where y.wanted_tests = y.all_tests;

How to rank in postgres query

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres; maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876, yet this query ran for 10 minutes and never came back. (I actually re-ran it for quite a while and it returned no results, even though there is a row in cluster 9876 for feed_id 1234.) I'm expecting this to tell me something like "feed_id 1234 was 5th for the criteria given". It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect without specifying a PARTITION BY in the OVER clause or any predicate in that subselect, your query asks for a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long: to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.
select
    cluster_id
    ,feed_id
    ,pub_date
    ,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
    select
        cluster_id
        ,feed_id
        ,pub_date
        ,rank() over (order by pub_date asc) as rank
    from url_info
    where cluster_id = 9876
) as ranked
where feed_id = 1234;
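And if you wanted ranks for every cluster at once rather than filtering to a single one, PARTITION BY restarts the ranking within each cluster (a sketch along the same lines):
select
    cluster_id
    ,feed_id
    ,pub_date
    ,rank() over (partition by cluster_id order by pub_date asc) as rank
from url_info;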
Sharing another example, using PostgreSQL's DENSE_RANK(): a sample query to find the top 3 students.
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), calculate the rank of each student:
WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)
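For contrast, plain RANK() over the same data leaves gaps after ties: Roy and Jenny get rank 1, Anvesh and Mahi get rank 3, and Maria gets rank 5, so the WHERE StudRank <= 3 filter would drop her. You can compare with (same sample table as above):
SELECT
 StudName
 ,TotalMark
 ,RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students;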

PostgreSQL: Can't use DISTINCT for some data types

I have a table called _sample_table_delme_data_files which contains some duplicates. I want to copy its records, without duplicates, into data_files:
INSERT INTO data_files (SELECT distinct * FROM _sample_table_delme_data_files);
ERROR: could not identify an ordering operator for type box3d
HINT: Use an explicit ordering operator or modify the query.
The problem is that PostgreSQL cannot compare (or order) box3d types. How do I supply such an ordering operator so I can get only the distinct rows into my destination table?
Thanks in advance,
Adam
If you don't add the operator, you could try translating the box3d data to text using its output function, something like:
INSERT INTO data_files (SELECT distinct othercols,box3dout(box3dcol) FROM _sample_table_delme_data_files);
Edit: the next step is to cast it back to box3d:
INSERT INTO data_files SELECT othercols, box3din(b) FROM (SELECT distinct othercols, box3dout(box3dcol) AS b FROM _sample_table_delme_data_files) AS t;
(I don't have box3d on my system so it's untested.)
The datatype box3d doesn't have the ordering operator that DISTINCT requires. You have to create the operator yourself, or ask the PostGIS project; maybe somebody has already fixed this problem.
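For illustration, "creating the operator" could look roughly like the following: a btree operator class that compares box3d values by their text representation. This is an untested sketch; all function and operator-class names here are made up, and it assumes box3d values can be cast to text:
-- comparison functions based on the text form of box3d
CREATE FUNCTION box3d_lt(box3d, box3d) RETURNS boolean
    AS 'SELECT $1::text <  $2::text' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION box3d_le(box3d, box3d) RETURNS boolean
    AS 'SELECT $1::text <= $2::text' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION box3d_eq(box3d, box3d) RETURNS boolean
    AS 'SELECT $1::text =  $2::text' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION box3d_ge(box3d, box3d) RETURNS boolean
    AS 'SELECT $1::text >= $2::text' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION box3d_gt(box3d, box3d) RETURNS boolean
    AS 'SELECT $1::text >  $2::text' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION box3d_cmp(box3d, box3d) RETURNS integer
    AS 'SELECT CASE WHEN $1::text < $2::text THEN -1
                    WHEN $1::text > $2::text THEN 1
                    ELSE 0 END' LANGUAGE sql IMMUTABLE;

CREATE OPERATOR <  (LEFTARG = box3d, RIGHTARG = box3d, PROCEDURE = box3d_lt);
CREATE OPERATOR <= (LEFTARG = box3d, RIGHTARG = box3d, PROCEDURE = box3d_le);
CREATE OPERATOR =  (LEFTARG = box3d, RIGHTARG = box3d, PROCEDURE = box3d_eq);
CREATE OPERATOR >= (LEFTARG = box3d, RIGHTARG = box3d, PROCEDURE = box3d_ge);
CREATE OPERATOR >  (LEFTARG = box3d, RIGHTARG = box3d, PROCEDURE = box3d_gt);

-- the operator class ties the five operators and the compare function
-- together so sorting (and therefore DISTINCT) can use them
CREATE OPERATOR CLASS box3d_text_ops DEFAULT FOR TYPE box3d USING btree AS
    OPERATOR 1 < ,
    OPERATOR 2 <= ,
    OPERATOR 3 = ,
    OPERATOR 4 >= ,
    OPERATOR 5 > ,
    FUNCTION 1 box3d_cmp(box3d, box3d);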
Finally, this was solved by a colleague.
First, let's see how many rows there are, duplicates included:
SELECT COUNT(*) FROM _sample_table_delme_data_files ;
count
-------
12728
(1 row)
Now, we shall add another column to the source table to help us differentiate similar rows:
ALTER TABLE _sample_table_delme_data_files ADD COLUMN id2 serial;
We can now see the dups:
SELECT id, id2 FROM _sample_table_delme_data_files ORDER BY id LIMIT 10;
id | id2
--------+------
198748 | 6449
198748 | 85
198801 | 166
198801 | 6530
198829 | 87
198829 | 6451
198926 | 88
198926 | 6452
199062 | 6532
199062 | 168
(10 rows)
And remove them:
DELETE FROM _sample_table_delme_data_files
WHERE id2 IN (SELECT max(id2) FROM _sample_table_delme_data_files
GROUP BY id
HAVING COUNT(*)>1);
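Note that this DELETE removes only one duplicate per id per pass, which suffices here because every duplicated id appears exactly twice. If an id could appear three or more times, keeping only the smallest id2 per id removes any number of duplicates in one pass (a sketch):
DELETE FROM _sample_table_delme_data_files
WHERE id2 NOT IN (SELECT min(id2)
                  FROM _sample_table_delme_data_files
                  GROUP BY id);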
Let's verify that it worked:
SELECT id FROM _sample_table_delme_data_files GROUP BY id HAVING COUNT(*)>1;
id
----
(0 rows)
Remove the auxiliary column:
ALTER TABLE _sample_table_delme_data_files DROP COLUMN id2;
ALTER TABLE
Insert the remaining rows into the destination table:
INSERT INTO data_files (SELECT * FROM _sample_table_delme_data_files);
INSERT 0 6364