PostgreSQL - get records with null values - postgresql

I'm trying to write a query that shows distributors that haven't sold anything in the last 90 days, but I run into a problem with NULL values. It seems PostgreSQL ignores NULL values even when I query for them (or maybe I'm doing it the wrong way).
Let's say there are 1000 distributors. With this query I get only 1 distributor, but there should be more that didn't sell anything: if I write a query for distributors that sold any amount in the last 90 days, it shows about 500. So where are the other 499? If I understand correctly, those 499 didn't have any sales, so all their records are NULL and they don't show up in the query.
Does anyone know how to show rows from one table where the related table has no matching rows (e.g. the partner in res_partner exists, but there is nothing in sale_order)? I also tried filtering with so.id IS NULL, but that way I get an empty result.
Code of my query:
(
SELECT
min(f1.id) as id,
f1.partner as partner,
f1.sum1
FROM
(
SELECT
min(f2.id) as id,
f2.partner as partner,
sum(f2.null_sum) as sum1
FROM
(
SELECT
min(rp.id) as id,
rp.search_name as partner,
CASE
WHEN
sol.price_subtotal IS NULL
THEN
0
ELSE
sol.price_subtotal
END as null_sum
FROM
sale_order as so,
sale_order_line as sol,
res_partner as rp
WHERE
sol.order_id=so.id and
so.partner_id=rp.id
and
rp.distributor=TRUE
and
so.date_order <= now()::timestamp::date
and
so.date_order >= date_trunc('day', now() - '90 day'::interval)::timestamp::date
and
rp.contract_date <= date_trunc('day', now() - '90 day'::interval)::timestamp::date
GROUP BY
partner,
null_sum
)as f2
GROUP BY
partner
) as f1
WHERE
sum1=0
GROUP BY
partner,
sum1
)as fld
EDIT: 2012-09-18 11 AM.
I think I understand why PostgreSQL behaves like this. It is because of the time interval. It checks whether there is any non-NULL value in that interval. So it only found one record, because that record had a sale order with zero (it was not converted from NULL to zero), and the part which checked for NULL values was just skipped. If I delete the time interval, I see all distributors that didn't sell anything at all. But with the time interval, for some reason it stops checking NULL values and only looks at non-NULL values.
So does anyone know how to make it check for NULL values too in a given interval (the last 90 days, to be exact)?

Aggregates like sum() and min() do ignore NULL values. This is required by the SQL standard and every DBMS I know behaves like that.
If you want to treat a NULL value as e.g. a zero, then use something like this:
sum(coalesce(f2.null_sum, 0)) as sum1
But as far as I understand your question and your invalid query, you actually want an outer join between res_partner and the sales tables.
Something like this:
SELECT min(rp.id) as id,
rp.search_name as partner,
sum(coalesce(sol.price_subtotal,0)) as price_subtotal
FROM res_partner as rp
-- date filters go in the ON clause: in WHERE they would discard partners without orders
LEFT JOIN sale_order as so ON so.partner_id=rp.id
and so.date_order <= CURRENT_DATE
and so.date_order >= date_trunc('day', now() - '90 day'::interval)::timestamp::date
LEFT JOIN sale_order_line as sol ON sol.order_id=so.id
-- filters on res_partner itself belong in WHERE
WHERE rp.distributor=TRUE
and rp.contract_date <= date_trunc('day', now() - '90 day'::interval)::timestamp::date
GROUP BY rp.search_name
-- add HAVING sum(coalesce(sol.price_subtotal,0)) = 0 to keep only distributors without sales
I'm not 100% sure I understood your problem correctly, but it might give you a headstart.
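One thing to watch with an outer join like this: a filter on the right-hand table behaves differently in WHERE than in ON. In WHERE it is applied after the join, so the NULL-extended rows for partners without orders fail the comparison and disappear. Here is a minimal sketch of that effect, using Python's sqlite3 as a stand-in for PostgreSQL; the partner and sale tables, their columns, and the cutoff date are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE partner (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (partner_id INTEGER, sold_on TEXT, amount REAL);
    INSERT INTO partner VALUES (1, 'active'), (2, 'stale'), (3, 'never');
    INSERT INTO sale VALUES (1, '2012-09-01', 50.0),
                            (2, '2012-01-15', 70.0);
""")

# Hypothetical cutoff, roughly 90 days before 2012-09-18.
WINDOW_START = "2012-06-20"

# Date filter in WHERE: rows where s.sold_on is NULL (no sale at all) or too
# old fail the comparison, so only partners WITH recent sales survive.
in_where = conn.execute("""
    SELECT p.name, SUM(COALESCE(s.amount, 0))
    FROM partner p
    LEFT JOIN sale s ON s.partner_id = p.id
    WHERE s.sold_on >= ?
    GROUP BY p.name
    ORDER BY p.name
""", (WINDOW_START,)).fetchall()

# Date filter in ON: every partner is kept; partners without a sale in the
# window get NULLs, which COALESCE turns into 0 before summing.
in_on = conn.execute("""
    SELECT p.name, SUM(COALESCE(s.amount, 0)) AS total
    FROM partner p
    LEFT JOIN sale s ON s.partner_id = p.id AND s.sold_on >= ?
    GROUP BY p.name
    HAVING SUM(COALESCE(s.amount, 0)) = 0
    ORDER BY p.name
""", (WINDOW_START,)).fetchall()

print(in_where)  # only the partner with a recent sale
print(in_on)     # the partners with no sale in the window
```

With the filter in ON, both the partner whose last sale is too old and the partner who never sold anything come back with a total of 0, which is the "no sales in 90 days" list the question asks for.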

Try to name your subqueries and reference their columns as q1.col, q2.col, etc., so it's clear which query or subquery each column comes from. Maybe it's something simple, e.g. the grouping collapses several rows containing only NULLs into one row? Also, at least for debugging, it helps to add count(*) to the select list of each query/subquery to see how many rows went into each result row. Otherwise it's hard to guess what exactly happened.

Related

Count distinct users over n-days

My table consists of two fields, CalDay a timestamp field with time set on 00:00:00 and UserID.
Together they form a compound key but it is important to have in mind that we have many rows for each given calendar day and there is no fixed number of rows for a given day.
Based on this dataset I would need to calculate how many distinct users there are over a set window of time, say 30d.
Using Postgres 9.3, I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
xx.*
,(
SELECT COUNT(DISTINCT UserID)
FROM data_table AS yy
WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.u_ts
) as rolling_count
FROM data_table AS xx
ORDER BY yy.CalDay
In theory, this should work, right? I am not sure yet because I started the query about 20 mins ago and it is still running. Here lies the problem, the dataset is still relatively small (25000 rows) but will grow over time. I would need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on speed, but it should take a lot less time than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
SELECT calday, COUNT(DISTINCT userid) AS daily
FROM data_table
GROUP BY calday
) t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table for all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30 day on that. Keeps the join much smaller and returns quickly (just under 1 second for 45000 rows in the source table on my system).
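To see what the pre-aggregated join returns on a tiny data set, here is a sketch that runs the same shape of query through Python's sqlite3. SQLite is only a stand-in here: date(x, '-30 days') replaces the PostgreSQL interval arithmetic, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (calday TEXT, userid INTEGER)")
conn.executemany("INSERT INTO data_table VALUES (?, ?)", [
    ("2024-01-01", 1), ("2024-01-01", 2),
    ("2024-01-15", 2), ("2024-01-15", 3),
    ("2024-03-01", 4),
])

# First collapse to one row per day, then join each day to the raw rows of
# its trailing 30-day window and count distinct users there.
rows = conn.execute("""
    SELECT t1.calday, t1.daily, COUNT(DISTINCT t2.userid) AS last_30_days
    FROM (
        SELECT calday, COUNT(DISTINCT userid) AS daily
        FROM data_table
        GROUP BY calday
    ) t1
    JOIN data_table t2
      ON t2.calday BETWEEN date(t1.calday, '-30 days') AND t1.calday
    GROUP BY t1.calday, t1.daily
    ORDER BY t1.calday
""").fetchall()

for row in rows:
    print(row)
```

User 2 appears on both January days but is counted once in the 30-day figure for 2024-01-15, and the March row sees only its own user because the January rows have dropped out of the window.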

PostgreSQL row diff timestamp, and calculate stddev for group

I have a table with an ID column called mmsi and another column of timestamp, with multiple timestamps per mmsi.
For each mmsi I want to calculate the standard deviation of the difference between consecutive timestamps.
I'm not very experienced with SQL but have tried to construct a function as follows:
SELECT
mmsi, stddev(time_diff)
FROM
(SELECT mmsi,
EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY mmsi ASC, timestamp ASC)))
FROM ais_messages.ais_static
ORDER BY mmsi ASC, timestamp ASC) AS time_diff
WHERE time_diff IS NOT NULL
GROUP BY mmsi;
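Your query is on the right track, but it has several problems. You put the alias time_diff on the subquery rather than on the computed column, and then selected it as if it were a column; the subquery returns multiple rows and columns, so this doesn't make sense. The LAG window also needs PARTITION BY mmsi, otherwise the first row of each mmsi is compared with the last row of the previous one. Here is a corrected version: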
SELECT
t.mmsi,
STDDEV(t.time_diff) AS std
FROM
(
SELECT
mmsi,
EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
(PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
FROM ais_messages.ais_static
ORDER BY mmsi, timestamp
) t
WHERE t.time_diff IS NOT NULL
GROUP BY t.mmsi
This approach should be fine, but there is one edge case where it might not behave as expected. If a given mmsi has only one record, it will not appear in the result set of standard deviations at all. This is because LAG returns NULL for that single record, and the row is filtered out.
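The edge cases are easier to see with the computation spelled out. Below is a rough plain-Python equivalent (not the query itself): group by mmsi, take the gaps between consecutive timestamps, then take the sample standard deviation. The function name and sample data are made up; the None for a single gap mirrors PostgreSQL's stddev, which is the sample standard deviation and yields NULL for one input value.

```python
from collections import defaultdict
from datetime import datetime
from statistics import stdev

def gap_stddev(rows):
    """rows: iterable of (mmsi, datetime) pairs.

    Returns {mmsi: sample stddev of the gaps, in seconds, between
    consecutive timestamps}. Groups with a single record are omitted.
    """
    by_mmsi = defaultdict(list)
    for mmsi, ts in rows:
        by_mmsi[mmsi].append(ts)

    result = {}
    for mmsi, stamps in by_mmsi.items():
        stamps.sort()
        # Same idea as timestamp - LAG(timestamp) within one mmsi partition.
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if not gaps:
            continue  # one record: its only diff would be NULL and is filtered out
        # A sample standard deviation needs at least two values.
        result[mmsi] = stdev(gaps) if len(gaps) > 1 else None
    return result

sample = [
    (100, datetime(2020, 1, 1, 0, 0, 0)),
    (100, datetime(2020, 1, 1, 0, 0, 10)),
    (100, datetime(2020, 1, 1, 0, 0, 30)),
    (200, datetime(2020, 1, 1, 0, 0, 0)),
    (200, datetime(2020, 1, 1, 0, 5, 0)),
    (300, datetime(2020, 1, 1, 0, 0, 0)),
]
result = gap_stddev(sample)
print(result)
```

So there are really two cases to keep in mind: an mmsi with one record is missing from the output, and an mmsi with exactly two records is present but has no defined standard deviation.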

T-SQL Count of items based on date

To make the example super simple, let's say that I have a table with three columns: ID, Name, and Date. I need to find the count of all IDs belonging to a specific name where the ID does not belong to this month.
Using that example, I would want this output:
In other words, I want to count how many ID's that a name has that aren't this month/year.
I'm more into PowerShell and still fairly new to SQL. I tried doing a case statement, but because it's not a foreach it seems to be returning "If the Name has ANY date in this month, return NULL" which is not what I want. I want it to count how many ID's per name do not appear in this month.
SELECT NAME,
CASE
WHEN ( Month(date) NOT LIKE Month(Getdate())
AND Year(date) NOT LIKE Year(Getdate()) ) THEN Count(id)
END AS TotalCount
FROM dbo.table
GROUP BY NAME,
date
I really hope this makes sense, but if it doesn't please let me know and I can try to clarify more. I tried researching cursors, but I'm having a hard time grasping them to get them into my statement. Any help would be greatly appreciated!
You only want to group by the non-aggregated columns that are in the result set (in this case, Name). You totally don't need a cursor for this, it's a fairly straight-forward query.
select
Name,
Count(*) count
from
tbl
where
tbl.date > eomonth(getdate()) or
tbl.date <= eomonth(dateadd(mm, -1, getdate()))
group by
Name
I did a little bit of trickery on the exclusion of rows that are in the current month. Generally, you want to avoid running functions on the columns you're comparing to if you can so that SQL Server can use an index to speed up its search. I assumed that the ID column is unique, if it's not, change count(*) to count(distinct ID).
Here's an alternative where clause if you're using an older version of SQL Server (EOMONTH requires SQL Server 2012 or later). If the table is small enough, you can just do it directly (similar to what you tried originally, except that it goes in the query's where clause rather than inside a case):
where
Month(date) <> Month(Getdate())
OR Year(date) <> Year(Getdate())
If you have a large table and keeping the predicate sargable (able to use an index) matters, you can rebuild what eomonth does with dateadd and the date part functions, but it's a pain.
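Whether the month and year comparisons are joined with OR or AND matters. "Not in the current month" means NOT (same month AND same year), which is the same as (different month OR different year). Joining the two inequalities with AND also throws away rows from the same month of other years and from other months of this year. A small sketch with Python's sqlite3 (strftime stands in for MONTH()/YEAR(), and "today" is pinned to a made-up date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, name TEXT, date TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, "ann", "2024-05-10"),  # current month: must be excluded
    (2, "ann", "2023-05-10"),  # same month, other year
    (3, "ann", "2024-02-01"),  # same year, other month
    (4, "bob", "2022-11-30"),  # differs in both
])

TODAY = "2024-05-20"  # hypothetical fixed "today" so the result is stable

def counts(op):
    """Count rows per name, joining the two inequalities with OR or AND."""
    sql = f"""
        SELECT name, COUNT(*) FROM tbl
        WHERE strftime('%m', date) <> strftime('%m', :today)
           {op} strftime('%Y', date) <> strftime('%Y', :today)
        GROUP BY name
        ORDER BY name
    """
    return conn.execute(sql, {"today": TODAY}).fetchall()

with_or = counts("OR")    # everything except the current month
with_and = counts("AND")  # only rows that differ in BOTH month and year

print(with_or)
print(with_and)
```

With OR, ann keeps her two older rows; with AND, she disappears entirely because each of her rows shares either the month or the year with today.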
SELECT Name, COUNT(ID) AS TotalCount
FROM dbo.[table]
WHERE DATEPART(MONTH, [Date]) != DATEPART(MONTH, GETDATE()) OR DATEPART(YEAR, [Date]) != DATEPART(YEAR, GETDATE())
GROUP BY Name;
In T-SQL:
SELECT
NAME,
COUNT(id)
FROM dbo.[table]
WHERE MONTH(Date_M) <> MONTH(GETDATE()) OR YEAR(Date_M) <> YEAR(GETDATE())
GROUP BY NAME

postgresql: exclude data based on other incomplete data

In this data - there are multiple DATA_ID values associated with time-series data. I am trying to exclude all data from any DATA_ID values that return a NULL value for USE for any timestamp value.
In other words, I only want to return DATA_ID values (and their data) if they have complete (not any NULL) values for all timestamp values.
Sample query given below:
SELECT
My.Table.DATA_ID,
MY.Table.timestamp,
My.Table.USE
FROM
My.TABLE
WHERE timestamp BETWEEN '2012-06-01 00:00:00' AND '2012-06-02 23:59:59'
-- Something here that says exclude all data from DATA_ID(s)
-- with any missing USE data, i.e. USE=NULL
ORDER BY DATA_ID, timestamp
Assuming I understand your question correctly, you want to exclude every DATA_ID that has at least one NULL value of USE within the time range:
SELECT
o.DATA_ID,
o.timestamp,
o.USE
FROM
My.TABLE o
WHERE o.timestamp BETWEEN '2012-06-01 00:00:00' AND '2012-06-02 23:59:59'
-- skip every DATA_ID that has a NULL USE anywhere in the same range
and not exists (select 1 from My.TABLE i
where i.use is null
and i.data_id = o.data_id
and i.timestamp BETWEEN '2012-06-01 00:00:00' AND '2012-06-02 23:59:59')
ORDER BY o.DATA_ID, o.timestamp
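To check that the NOT EXISTS form really drops whole DATA_IDs rather than just the NULL rows, here is a small sketch run through Python's sqlite3. The readings table and the ts and use_val column names are stand-ins I made up for the table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (data_id INTEGER, ts TEXT, use_val REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    (1, "2012-06-01 00:00:00", 5.0),
    (1, "2012-06-01 01:00:00", 6.0),
    (2, "2012-06-01 00:00:00", 7.0),
    (2, "2012-06-01 01:00:00", None),  # gap inside the range: drop all of data_id 2
    (3, "2012-06-01 00:00:00", 8.0),
    (3, "2012-07-01 00:00:00", None),  # gap outside the range: data_id 3 is kept
])

LO, HI = "2012-06-01 00:00:00", "2012-06-02 23:59:59"

# Keep a row only if no row with the same data_id has a NULL value
# anywhere inside the same time range.
rows = conn.execute("""
    SELECT o.data_id, o.ts, o.use_val
    FROM readings o
    WHERE o.ts BETWEEN :lo AND :hi
      AND NOT EXISTS (
          SELECT 1
          FROM readings i
          WHERE i.data_id = o.data_id
            AND i.use_val IS NULL
            AND i.ts BETWEEN :lo AND :hi)
    ORDER BY o.data_id, o.ts
""", {"lo": LO, "hi": HI}).fetchall()

for row in rows:
    print(row)
```

The non-NULL row of data_id 2 is gone as well, which is the difference from a plain use IS NOT NULL filter. Whether a NULL outside the requested range should also disqualify a DATA_ID is a design choice; drop the inner BETWEEN if it should.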
The simple thing to do is something like this:
CREATE FUNCTION missing_info(MY.TABLE)
RETURNS BOOL
LANGUAGE SQL AS
$$ select $1.use is null -- chain conditions together with or.
-- no from clause needed. no where clause needed.
$$;
Then you can just add:
where (My.Table).missing_info is not true;
And when you need to change the logic for what counts as missing info, you can just change it in the function and everything else keeps working.
This is the sort of encapsulation of derived information where ORDBMSs like PostgreSQL really shine.
Edit: Re-reading your example, it looks like what you are looking for is the IS NULL operator. However if you need to re-use some sort of logic, see the above example. NULL never "equals" NULL (because we can't say whether two unknown values are the same). But IS NULL tells you whether it is NULL or not.

Postgresql vlookup

Let's say I have a table "uservalue" with the following columns:
integer user_id
integer group_id
integer value
I can get the maximum value for each group easily:
select max(value) from uservalue group by group_id;
What I would like is for it to return the user_id in each group that had the highest value. The max function in matlab will also return the index of the maximum, is there some way to make postgresql do the same thing?
The proper way to do this is with a subquery.
select
u.user_id,
u.value
from
uservalue u
join
(select group_id, max(value) as max_value from uservalue group by group_id) mv
on u.value = mv.max_value and mv.group_id = u.group_id
However, I sometimes prefer a simpler hack.
select max(value*100000 + user_id) % 100000 as user_id, max(value) from uservalue group by group_id
Make sure that number (100000) is higher than any user_id you expect to have, and that value is never negative. When several users tie on the maximum value, this returns only one user_id per group (the highest one), whereas the join above returns every tied user.
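The two approaches differ when several users tie on the maximum. Here is a quick sketch through Python's sqlite3 with made-up rows; it assumes the user_id is recovered from the packed number with a modulo and that values are non-negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uservalue (user_id INTEGER, group_id INTEGER, value INTEGER)"
)
conn.executemany("INSERT INTO uservalue VALUES (?, ?, ?)", [
    (1, 10, 5), (2, 10, 9), (3, 10, 9),  # users 2 and 3 tie in group 10
    (4, 20, 7),
])

# Join against the per-group maximum: every tied user is returned.
via_join = conn.execute("""
    SELECT u.group_id, u.user_id, u.value
    FROM uservalue u
    JOIN (SELECT group_id, MAX(value) AS max_value
          FROM uservalue
          GROUP BY group_id) mv
      ON u.group_id = mv.group_id AND u.value = mv.max_value
    ORDER BY u.group_id, u.user_id
""").fetchall()

# Pack value and user_id into one number, take the max, unpack with modulo:
# exactly one row per group, ties resolved in favour of the highest user_id.
via_packing = conn.execute("""
    SELECT group_id,
           MAX(value * 100000 + user_id) % 100000 AS user_id,
           MAX(value) AS value
    FROM uservalue
    GROUP BY group_id
    ORDER BY group_id
""").fetchall()

print(via_join)
print(via_packing)
```

Group 10 yields two rows from the join and one from the packing trick, so which one is right depends on whether you want all winners or just one per group.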
Seems you should be able to do this with a windowing query, something like:
SELECT DISTINCT
group_id,
first_value(user_id) OVER w AS user,
first_value(value) OVER w AS val
FROM
uservalue
WINDOW w AS (PARTITION BY group_id ORDER BY value DESC)
This query will also work if you have multiple users with the same value (unless you add a second column to ORDER BY you will not know which one you will get back though - but you will only get one row back per group)
Here are several ways to do this.
It's pretty much a FAQ.