How to count the number of times a certain value occurs in a column - PostgreSQL

I have a table with two columns. The first is called 'animal' and lists different types of animals such as cat, dog, bird etc. The second column 'num' lists different numbers.
I want to count how many times the word 'cat' appears in the 'animal' column, so that I can output 'cat' together with its frequency.
SELECT animal, count(*) as Total FROM mydata GROUP BY animal;
However, this outputs all animals and their frequencies, but I would like to limit my output to cat and its frequency only.
When I try to do:
SELECT count(animal) FROM mydata WHERE animal = 'cat';
I am given an output that says the total is 0.

This should satisfy your requirements.
SELECT 'cat' AS animal, count(*) AS total FROM mydata WHERE animal = 'cat';
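As for the query in the question returning 0: if the exact match finds nothing, the stored values probably differ from 'cat' in case or carry stray whitespace. A quick check, normalizing both sides (a sketch, assuming that kind of mismatch):
SELECT count(*) AS total
FROM mydata
WHERE lower(trim(animal)) = 'cat';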

Related

In Apache Pig, select DISTINCT rows based on a single column

Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://example.com/beth?extra=blah
003 http://example.com/charlie
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://example.com/charlie
The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
For my purposes, I do not care which of the rows with ID 002 are returned.
I found one way to do this, using the GROUP BY and the TOP operators:
my_table = LOAD 'my_table_file' AS (A, B);
my_table_grouped = GROUP my_table BY A;
my_table_distinct = FOREACH my_table_grouped {
    -- For each group, $0 refers to the group name, (A),
    -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
    -- Here, we take only the first (top 1) row in the bag:
    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);
};
DUMP my_table_distinct;
This results in one distinct row per ID column:
(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)
I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.
(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
I have found that you can do this with a nested grouping and LIMIT. Using Arel's example:
my_table = LOAD 'my_table_file' AS (A, B);
-- The nested FOREACH groups rows sharing the same A,
-- then limits each bag to 1 row:
my_table_distinct = FOREACH (GROUP my_table BY A) {
    result = LIMIT my_table 1;
    GENERATE FLATTEN(result);
};
DUMP my_table_distinct;
You can use Apache DataFu™ (incubating)'s FirstTupleFromBag:
REGISTER datafu-pig-incubating-1.3.1.jar;
DEFINE FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
my_table_grouped = GROUP my_table BY A;
my_table_grouped_first_tuple = FOREACH my_table_grouped GENERATE FLATTEN(FirstTupleFromBag(my_table, null));

Why does usage of lower() change the order of the result set?

I have a table where I store information about users. The table has the following structure:
CREATE TABLE PERSONS
(
    ID NUMBER(20, 0) NOT NULL,
    FIRSTNAME VARCHAR2(40),
    LASTNAME VARCHAR2(40),
    BIRTHDAY DATE,
    CONSTRAINT PERSONEN_PK PRIMARY KEY (ID) ENABLE
);
After inserting some test data:
SET DEFINE OFF;
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('1','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('2','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('3','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('6','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
I want to select all duplicates of a given user. Let's use "Max Mustermann" for example:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
This gives me a result like this:
id first last birthday
=================================
1 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
I want to do a case insensitive compare, so I change the query using lower (and trim) like this:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname;
Now, surprise: the order has changed!
id first last birthday
=================================
1 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
Why does the order change just by using lower() (the same happens without trim())? I can get a stable ordering by adding the id column to the ORDER BY. But shouldn't lower() have no effect on the ordering?
Workaround by also using id column for ORDER BY:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname,p.id;
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname,p.id;
If the values being ordered by are identical, then the DBMS is free to return the rows in any order it likes (the same way it is free to choose any order if no ORDER BY is specified altogether).
Because all values of the ORDER BY columns are identical, the resulting order is not stable. The only way to get a stable order is to include a unique column as an additional criterion to break ties - exactly what you did when you added the id column.
Why does the order change, just by using lower()
From a technical point of view, I'd guess that applying lower() changed the execution plan and therefore the access path to the data.
But again (just to make sure): ordering on identical values never guarantees a stable order!
There is no ordering without an ORDER BY clause. Sometimes it looks like there might be (GROUP BY fooled a lot of people in older releases), but it's only coincidental and must not be relied upon. In your case you're ordering by some columns, but you expect duplicates within that ordering to be further ordered implicitly, which won't happen - or at least cannot be relied on.
In this case Oracle probably happens to be retrieving the rows for your first query in the order you inserted them, purely as a side effect of how it reads data from the blocks, and the ORDER BY sorts them within that set without actually changing them (or quite likely it skips the sort step internally if it realises it's pointless; the explain plan would tell you that).
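You can check that yourself via the explain plan; a sketch (the plan output varies by Oracle version and optimizer settings):
EXPLAIN PLAN FOR
SELECT p.id, p.firstname, p.lastname, p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
ORDER BY p.firstname, p.lastname;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
If no SORT ORDER BY step shows up, Oracle decided the sort was unnecessary and the row order is simply whatever the scan produced.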
If you change the order in which the records are created:
...
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
...
then the result 'order' changes too:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
ID FIRSTNAME LASTNAME BIRTHDAY
---------- -------------------- -------------------- ---------
1 Max Mustermann 31-OCT-89
2 Max Mustermann 31-OCT-89
5 Max Mustermann 31-OCT-89
4 Max Mustermann 31-OCT-89
Once you add the function, things change enough for that happy accident to go out of the window, even if the records are inserted in id order (which has no relevance to the DB internally). lower() isn't changing the ordering; you just aren't getting lucky any more.
You cannot expect or rely on an order unless you fully specify it in the order by clause.

SQLite like query for multiple rows

I have the following table
---------------------------------------
id, type, keyword, default
---------------------------------------
1, 1, mcdonalds, 1
2, 1, food, 0
3, 1, drinks, 0
4, 2, vending machine, 1
5, 2, drinks, 0
6, 3, station, 1
7, 3, travel, 0
8, 3, train, 0
The idea behind this is that I want a search query returning 'a unique' type for keywords (the default row): when I search for mcdonalds, I get the first row, but when I search for food or drinks I also get the first row.
I have done this using a subselect.
SELECT type, keyword FROM keywords
WHERE type IN (SELECT DISTINCT type FROM keywords WHERE keyword LIKE '%?%')
AND `default` = 1;
This works like a charm. Now, however, I want to be able to give multiple keywords, for example "drinks & food". I have tried:
SELECT type, keyword FROM keywords
WHERE type IN (SELECT DISTINCT type FROM keywords WHERE keyword LIKE '%?%' OR keyword LIKE '%?%')
AND `default` = 1;
But then when I search for "food & drinks", I would get both the "vending machine" and the "mcdonalds". However, the vending machine only has the associated keyword "drinks" (it serves no food), so I don't want that one in my results.
When I use AND instead of OR with the LIKEs, I get no results at all (since one row can't have both values at the same time).
Does anyone have any suggestions on how I could solve this?
Use two independent subqueries, then combine the lookups with AND:
SELECT type,
       keyword
FROM keywords
WHERE type IN (SELECT DISTINCT type FROM keywords WHERE keyword LIKE '%drinks%')
  AND type IN (SELECT DISTINCT type FROM keywords WHERE keyword LIKE '%food%')
  AND "default" = 1
Alternatively, count how many rows of each type match any of the terms and keep the best-matching type:
/*
t = keywords
*/
SELECT *
FROM t
WHERE type IN (
    SELECT type
    FROM (
        SELECT COUNT(id) AS count_matches,
               type
        FROM t
        WHERE keyword LIKE '%drinks%'
           OR keyword LIKE '%food%'
        GROUP BY type
    ) q
    ORDER BY count_matches DESC
    LIMIT 1
)
AND "default" = 1
Well, thanks for all your help. It has been figured out! :)
SELECT type, keyword FROM keywords WHERE type IN
    (SELECT type FROM keywords
     WHERE (`keyword` LIKE '%drinks%' OR keyword LIKE '%food%') AND `default` != 1
     GROUP BY type
     HAVING COUNT(id) >= 2)
AND `default` = 1
The query can be built dynamically: the COUNT in the subquery must be at least as large as the number of keywords the user entered.
Edit: this does not seem to work; the OR'd LIKE conditions sometimes give a higher count than expected.
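A possible fix (my own sketch, not from the thread): count distinct matched search terms per type instead of matched rows, so several rows matching the same term don't inflate the count:
SELECT type, keyword FROM keywords WHERE type IN
    (SELECT type FROM keywords
     WHERE keyword LIKE '%drinks%' OR keyword LIKE '%food%'
     GROUP BY type
     -- one 'vote' per search term, however many rows match it;
     -- note a keyword matching both terms is credited to 'drinks' only
     HAVING COUNT(DISTINCT CASE WHEN keyword LIKE '%drinks%' THEN 'drinks'
                                WHEN keyword LIKE '%food%' THEN 'food' END) >= 2)
AND `default` = 1;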

postgres full text search word count

I've been messing around with full text search in postgres, and I was wondering, is it possible to return total word counts of all rows?
So, let's say you have
text_col
_______
'dog'
'dog cat'
'dog bird dog'
the count of 'dog' should be four, the count of 'cat' should be one and bird should also be one.
Right now I have all the tsvectors saved to a gin indexed column as well.
Of course this would be across all rows where you could say something like
select max(ts_count(text_col_tsvector)) from mytable;
(I made that up, but I hope you get the general idea)
Is it only possible to return the count of a lexeme? And if so, how does one return the dictionary (or array) of the lexemes returned?
how about:
select * from ts_stat('select text_col_tsvector from mytable')
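ts_stat returns one row per lexeme with the columns word, ndoc (how many documents contain it) and nentry (total occurrences), so the per-word totals from the example are directly available:
select word, nentry
from ts_stat('select text_col_tsvector from mytable')
order by nentry desc;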
Edit:
You mean:
with words as (
    select regexp_split_to_table(text_column, E'\\W+') as word
    from mytable
)
select word, count(*) as cnt
from words
group by 1
order by 2 desc
?
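Note the difference between the two approaches: regexp_split_to_table() counts raw words from the original text, while ts_stat() counts lexemes from the tsvector, which are normalized (stemmed, with stop words removed), so the two totals can differ.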

Perl + PostgreSQL-- Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to choose certain columns (Realtime, Value) in a file and concatenate them into a row, while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS
SELECT id, cat,
       ARRAY_AGG(realtime) AS time,
       ARRAY_AGG(value) AS "values"  -- quoted: VALUES is a reserved word
FROM input
GROUP BY id, cat;
Then select whatever you want out of the output table.
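For example, to flatten the arrays back into comma-separated strings (array_to_string() is standard PostgreSQL; column names as above):
SELECT id, cat,
       array_to_string(time, ',') AS times,
       array_to_string("values", ',') AS vals
FROM output
ORDER BY id, cat;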
SELECT id
     , cat
     , string_agg(realtime, ',') AS realtimes
     -- cast in case value is numeric; quote "values" (reserved word):
     , string_agg(value::text, ',') AS "values"
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.
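For instance, with a long expression in the SELECT list, the ordinal saves repeating it (illustrative only; the timestamp column is hypothetical):
SELECT date_trunc('month', measured_at) AS month, count(*) AS cnt
FROM input
GROUP BY 1
ORDER BY 1;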