Hey guys,
So I have this report that I am grouping into different age buckets. I want the count of an age bucket to be zero if there are no rows associated with this age bucket. So I did an outer join in my database select and that works fine. However, I need to add a group based on another column in my database.
When I add this group, the agebuckets that had no rows associated with them dissapear. I thought it might have been because the column that I was trying to group by was null for that row, so I added a row number to my select, and then grouped by that (I basically just need to group by each row and I can't just put it in the details... I can explain more about this if necessary). But after adding the row number the agebuckets that have no data are still null! When I remove this group that I added I get all age buckets.
Any ideas? Thanks!!
It's because the outer join to age group is not also an outer join to whatever your other
group is - you are only guaranteed to have one of each age group per data set, not one of each age group per [other group].
So if, for example, your other group is Region, you need a Cartesian / Cross join from your age range table to a Region table (so that you get every possible combination of age range and region), before outer joining to the rest of your dataset.
EDIT - based on the comments, a query like the following should work:
select date_helper.date_description, c.case_number, e.event_number
from
(select 0 range_start, 11 range_end, '0-10 days' date_description from dual union
select 11, 21, '11-20 days' from dual union
select 21, 31, '21-30 days' from dual union
select 31, 99999, '31+ days' from dual) date_helper
cross join case_table c
left outer join event_table e
on e.event_date <= date_helper.range_start*-1 + sysdate
and e.event_date > date_helper.range_end*-1 + sysdate
and c.case_number = e.case_number
(assuming that it's the event_date that needs to be grouped into buckets.)
I had trouble understanding your question.
I do know that Crystal Reports' NULL support is lacking in some pretty fundamental ways. So I usually try not to depend on it.
One way to approach this problem is to hard-code age ranges in the database query, e.g.:
SELECT p.person_type
, SUM(CASE WHEN
p.age <= 2
THEN 1 ELSE 0 END) AS "0-2"
, SUM(CASE WHEN
p.age BETWEEN 2 AND 18
THEN 1 ELSE 0 END) AS "3-17"
, SUM(CASE WHEN
p.age >= 18
THEN 1 ELSE 0 END) AS "18_and_over"
FROM person p
GROUP BY p.person_type
This way you are sure to get zeros where you want zeros.
I realize that this is not a direct answer to your question. Best of luck.
Related
UserID
CalMonth
ActiveFlag
Months_since_last_active
A
1/1/2021
1
0
A
2/1/2021
1
A
3/1/2021
2
A
4/1/2021
1
0
B
1/1/2021
1
0
B
2/1/2021
1
B
3/1/2021
1
0
Problem --> The first 3 colums are given. Generate the last one 'Months_since_last_active' by adding 1 until the use is active again
My Solution as below:
With active_sessions as (
Select
User_Id
, CalMonth
, active flag as current_flag
, LAG (ActiveFlag,1) over (partition by User_Id order by CalMonth) as previous_flag
)
Select User_Id, CalMonth, current_flag, sum(case when current_flag =1 then 0
when current_flag IS NULL then Months_since_last_active + 1
END
) as Months_since_last_active
from active_sessions
order by 1,2
I was asked the above question in an interview and told that my proposed solution would not work because:
When it comes to 3/1/2021 and beyond, the previous values of 'Months_since_last_active' are not in the table yet -- they are only in the code
If I wanted to use LAG function, then it'd take innumerable LAG functions to achieve what I was trying to achieve
I will appreciate if someone can comment on my solution.
Your solution has 3 major problems, 2 of them may be related to copy/past errors. The active_sessions CTE is missing the from clause, so there is no data source. Then the main portion uses the aggregate function SUM, however, the query has no group by which is required for the aggregate function. These are easily corrected. The other issue concerns the LAG function and your use of it.
First off in the CTE you alias the result as previous_flag, then in the main query you reference Months_since_last_active which does not exist yet. I think this is the source of the interviewer's first point.
The interviewer's second point also stems form the LAG function. As written it always looks back exactly 1 row, but from the current row yet it needs to look back 2 rows for (userid, calmonth) = ('A', 2021-03-01), and 3 rows for (A, 2021-04-01), etc. Basically you need to look back to to the last row with active_flag = 1. This leads directly to the it'd take innumerable LAG functions as you do not know how far beck you need to look. Suppose you had 30-40 or more inactive rows between active rows. You need a LAG(activeflag,n) ... for each possibility.
A solution. I dislike the problem statement it should not contain by adding 1 until the use is active again (is it yours or theirs). Either way this is an XY. If theirs they should be telling you what to solve, i.e. find number of months since last active. If yours you have created the problem for yourself. The problem statement should not say anything about how to solve the it. I will ignore that portion of the problem (And in a real interview I would/have ignored it, but be prepared to explain why).
What you have a a version of a Gaps And Islands (google it, you will find more that to think about). In this version lets consider each row with activeflag = 'Y' an as island, and anything else as a gap. Nor what you are looking for is the length of the gaps between islands. In the following the island_num CTE does 2 things. It assigns a sequence number to each row for a (userid, calmonth) and generates a boolean for each island. The `gap_points' then joins the results with itself, selecting the assigned for the max island whose calmonth is less than the current rows calmonth. In the main part the Months_since_last_active is assigned 0 if the current row is an island, and the difference between the generated row numbers if it is a gap. (see demo)
with island_num (userid, cal_month, active_flag, is_island, row_num) as
( select am.*
, case when am.activeflag = 1 then true else false end is_island
, row_number() over (partition by am.userid order by am.calmonth) rn
from active_month am
) -- select * from island_num
, gap_points(userid, cal_month, active_flag, is_island, row_num, island_row) as
( select *
from island_num i1
join lateral
(select max(row_num)
from island_num i2
where i1.userid = i2.userid
and i2.cal_month < i1.cal_month
and i2.is_island
) s0
on true
) --select * from gap_points;
select userid "User Id"
, cal_month "Cal Month"
, active_flag "Active Flag"
, case when is_island then 0
else row_num - island_row
end "Months_since_last_active"
from gap_points;
I have two tables: accounts and opportunities. One account has 0-n opportunities, but only 0 or 1 opportunities at any point of time (within the contract_from/contract_to range).
I want to report for the past 4 months which account had which opportunity in this month.
I came up with this query:
WITH numbers AS (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4)
SELECT * FROM
(
(SELECT id, name FROM accounts WHERE is_active) AS acc(acct_id, name)
CROSS JOIN
(SELECT dateadd(MONTH, -n,
date_trunc('month', current_date))::date AS start,
dateadd(DAY, -1, dateadd(MONTH, -n + 1,
date_trunc('month', current_date)))::date AS stop
FROM numbers) AS period(start, stop)
)
LEFT OUTER JOIN
(SELECT acct_id, subscription_type, contract_from, contract_to
FROM opportunities) AS opp(acct_id, subscription, start, stop)
ON (acc.acct_id = opp.acct_id AND
opp.start <= period.start AND
(opp.stop ISNULL OR
opp.stop > period.stop))
My problem is, that some of the accounts only have two resulting rows, even thou I did a left join so I expect them to always have four rows with having the months without active opportunity resulting in null values in columns subscription, start and stop.
Is mixing these joins not supported in Redshift?
After some more iterations on my query I found out that the left join indeed works, but the order gets mixed up. The rows with the nulls end up further down. Probably because Redshift first does the left join and then "fills" up the rows which don't have a corresponding right match.
Also: OUTER JOIN is the wrong choice here, because if there are more than 1 opportunity at a given date, then the additional opportunity cause more resulting rows.
In this postgres query
SELECT
q.*,
q.user_id = 1 mine,
first(u.username) username,
sum(case when v.answer=0 then 1 else 0 end) no_count,
sum(case when v.answer=1 then 1 else 0 end) yes_count
FROM question q
JOIN "user" u ON q.user_id=u.id
JOIN vote v ON v.question_id=q.id
GROUP BY q.id
ORDER BY q.date_created desc
LIMIT 20
OFFSET 0
It says I need an aggregate on username. In this case, I am getting questions, and each question is created by some user. I join on votes, but group by the question id, so before grouping the username field will be the same for each unique question. So that means when aggregating, really I can pick any value since it's all the same. However I can't find any aggregate function to use. I tried random, first, last, any, but none work or even defined.
Does anyone know how to handle this?
Ok, so problem is that sql engine doesn't know that username will be always the same. In other words it can't tell relation one-to-one from one-to-many between guestions and users. You could use string_agg() with DISTINCT like string_agg(DISTINCT u.username, ','::text). What that functions does is it's aggragetes text values into one big string seperated by specified delimiter (in my case comma). When you add DISTINCT it takes only unique values. Since you've said theres always only one username per question the output will be that single value.
The whole query:
SELECT
q.*,
q.user_id = 1 mine,
string_agg(DISTINCT u.username, ','::text) username,
sum(case when v.answer=0 then 1 else 0 end) no_count,
sum(case when v.answer=1 then 1 else 0 end) yes_count
FROM question q
JOIN "user" u ON q.user_id=u.id
JOIN vote v ON v.question_id=q.id
GROUP BY q.id
ORDER BY q.date_created desc
LIMIT 20;
Footnote: I dropped the OFFSET 0 because it's default value for LIMIT.
From the box postgresql do not allow or cant do such operations, but there is workaround.
You can extend aggregate functions First/Last by this method: https://wiki.postgresql.org/wiki/First/last_(aggregate)
There is plugin and portable sql version for postgresql.
Hello everyone I have to get data from and to date, I tried using between clause which fails to retrieve data what I need. Here is what I need.
I have table called hall_info which has following structure
hall_info
id | hall_name |address |contact_no
1 | abc | India |XXXX-XXXX-XX
2 | xyz | India |XXXX-XXXX-XX
Now I have one more table which is events, that contains data about when and which hall is booked on what date, the structure is as follows.
id |hall_info_id |event_date(booked_date)| event_name
1 | 2 | 2015-10-25 | Marriage
2 | 1 | 2015-10-28 | Marriage
3 | 2 | 2015-10-26 | Marriage
So what I need now is I wanna show hall_names that are not booked on selected dates, suppose if user chooses from 2015-10-23 to 2015-10-30 so I wanna list all halls that are not booked on selected dates. In above case both the halls of hall_info_id 1 and 2 ids booked in given range but still I wanna show them because they are free on 23,24,27 and on 29 date.
In second case suppose if user chooses date from 2015-10-25 and 2015-10-26 then only hall_info_id 2 is booked on both the dates 25 and 26 so in this case i wanna show only hall_info_id 1 as hall_info_id 2 is booked.
I tried using inner query and between clause but I am not getting required result to simply i have given only selected fields I have more tables to join so i cant paste my query please help with this. Thanks in advance for all who are trying.
Some changes in Yasen Zhelev's code:
SELECT * FROM hall_info
WHERE id not IN (
SELECT hall_info_id FROM events
WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
GROUP BY hall_info_id
HAVING COUNT(DISTINCT event_date) > DATE_PART('day', '2015-10-30'::timestamp - '2015-10-23'::timestamp))
I have not tried it but how about checking if the number of bookings per hall is less than the actual days in the selected period.
SELECT * FROM hall_info WHERE id NOT IN
(SELECT hall_info_id FROM events
WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
GROUP BY hall_info_id
HAVING COUNT(id) < DATEDIFF(day, '2015-10-30', '2015-10-23')
);
That will only work if you have one booking per day per hall.
To get the "available dates" for the hall returned, your query needs a row source of all possible dates. For example, if you had a calendar table populated with possible date values, e.g.
CREATE TABLE cal (dt DATE NOT NULL PRIMARY KEY) Engine=InnoDB
;
INSERT INTO cal (dt) VALUES ('2015-10-23')
,('2015-10-24'),('2015-10-25'),('2015-10-26'),('2015-10-27')
,('2015-10-28'),('2015-10-29'),('2015-10-30'),('2015-10-31')
;
The you could use a query that performs a cross join between the calendar table and hall_info... to get every hall on every date... and an anti-join pattern to eliminate rows that are already booked.
The anti-join pattern is an outer join with a restriction in the WHERE clause to eliminate matching rows.
For example:
SELECT cal.dt, h.id, h.hall_name, h.address
FROM cal cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_id = h.id
AND e.event_date = cal.dt
WHERE e.id IS NULL
AND cal.dt >= '2015-10-23'
AND cal.dt <= '2015-10-30'
The cross join between cal and hall_info gets all halls for all dates (restricted in the WHERE clause to a specified range of dates.)
The outer join to events find matching rows in the events table (matching on hall_id and event_date. The trick is the predicate (condition) in the WHERE clause e.id IS NULL. That throws out any rows that had a match, leaving only rows that don't have a match.
This type of problem is similar to other "sparse data" problems. e.g. How do you return a zero total for sales by a given store on a given date, when there are no rows with that store and date...
In your case, the query needs a source of rows with available date values. That doesn't necessarily have to be a table named calendar. (Other databases give us the ability to dynamically generate a row source; someday, MySQL may have similar features.)
If you want the row source to be dynamic in MySQL, then one approach would be to create a temporary table, and populate it with the dates, run the query referencing the temporary table, and then dropping the temporary table.
Another approach is to use an inline view to return the rows...
SELECT cal.dt, h.id, h.hall_name, h.address
FROM (
SELECT '2015-10-23'+INTERVAL 0 DAY AS dt
UNION ALL SELECT '2015-10-24'
UNION ALL SELECT '2015-10-25'
UNION ALL SELECT '2015-10-26'
UNION ALL SELECT '2015-10-27'
UNION ALL SELECT '2015-10-28'
UNION ALL SELECT '2015-10-29'
UNION ALL SELECT '2015-10-30'
) cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_id = h.id
AND e.event_date = c.dt
WHERE e.id IS NULL
FOLLOWUP: When this question was originally posted, it was tagged with mysql. The SQL in the examples above is for MySQL.
In terms of writing a query to return the specified results, the general issue is still the same in PostgreSQL. The general problem is "sparse data".
The SQL query needs a row source for the "missing" date values, but the specification doesn't provide any source for those date values.
The answer above discusses several possible row sources in MySQL: 1) a table, 2) a temporary table, 3) an inline view.
The answer also mentions that some databases (not MySQL) provide other mechanisms that can be used as a row source.
For example, PostgreSQL provides a nifty generate_series function (Reference: http://www.postgresql.org/docs/9.1/static/functions-srf.html.
It should be possible to use the generate_series function as a row source, to supply a set of rows containing the date values needed by the query to produced the specified result.
This answer demonstrates the approach to solving the "sparse data" problem.
If the specification is to return just the list of halls, and not the dates they are available, the queries above can be easily modified to remove the date expression from the SELECT list, and add a GROUP BY clause to collapse the rows into a distinct list of halls.
We have a table that is populated from information on multiple computers every day. The problem is sometimes it doesn't pull information from certain computers.
So for a rough example, the table columns would read computer_name, information_pulled, qty_pulled, date_pulled.
So Lets say it pulled every day in a week, except the 15th. A query will pull
Computer_name, Information_pulled, qty_pulled, date_pulled
computer1 infopulled 2 2014-06-14
computer2 infopulled 3 2014-06-14
computer3 infopulled 2 2014-06-14
computer1 infopulled 2 2014-06-15
computer3 infopulled 1 2014-06-15
computer1 infopulled 3 2014-06-16
computer2 infopulled 2 2014-06-16
computer3 infopulled 4 2014-06-16
As you can see, nothing pulled in for computer 2 on the 15th. I am looking to write a query that pulls up missing rows for a specific date.
For Example, after running it it says
computer 2 null null 20140615
or anything close to this. We're trying to catch it each morning when this table isn't populated that way we can be proactive and I am not positive I can even query for missing data w/o searching for null values.
You need to have a master list of all your computers somewhere, so that you know when a computer is not accounted for in your table. Say that you have a table called Computer that holds this.
Declare a variable to store the date you want to check:
declare #date date
set #date = '6/15/2014'
Then you can query for missing rows like this:
select c.Computer_name, null, null, #date
from Computer c
where not exists(select 1
from myTable t
where t.Computer_name = c.Computer_name
and t.date_pulled = #date)
SQL Fiddle
If you are certain that every computer_name already exists in your table at least once, you could skip creating a separate Computer table, and modify the query like this:
select c.Computer_name, null, null, #date
from (select distinct Computer_name from myTable) c
where not exists(select 1
from myTable t
where t.Computer_name = c.Computer_name
and t.date_pulled = #date)
This query isn't as robust because it will not show computers that do not already have a row in your table (e.g. a new computer, or a problematic computer that has never had its information pulled).
I think a cross-join will answer your problem.
In the query below, every computer will have to have successfully uploaded at least once and at least one every day.
This way you'll get every missing computer/date couple.
select
Compare.*
from Table_1 T1
right join (
select *
from
(select Computer_name from Table_1 group by Computer_name) CPUS,
(select date_pulled from Table_1 group by date_pulled) DAYs
) Compare
on T1.Computer_name=Compare.Computer_name
and T1.date_pulled=Compare.date_pulled
where T1.Computer_name is null
Hope this help.
If you join the table to itself by date and computer_name like the following, you should get a list of missing dates
SELECT t1.computer_name, null as information_pulled, null as qty_pulled,
DATEADD(day,1,t1.date_pulled) as missing_date
FROM computer_info t1
LEFT JOIN computer_info t2 ON t2.date_pulled = DATEADD(day,1,t1.date_pulled)
AND t2.computer_name = t1.computer_name
WHERE t1.date_pulled >= '2014-06-14'
AND t2.date_pulled IS NULL
This will also get the next date that hasn't been pulled yet, but that should be clear and you could add an additional condition to filter it out.
AND DATEADD(day,1,t1.date_pulled) < '2014-06-17'
Of course, this only works if you know each of the computer names already exist in the table for previous days. If not, #Jerrad's suggestion to create a separate computer table would help.
EDIT: if the gap is larger than a single day, you may want to see that
SELECT t1.computer_name, null as info, null as qty_pulled,
DATEADD(day,1,t1.date_pulled) as missing_date,
t3.date_pulled AS next_pulled_date
FROM computer_info t1
LEFT JOIN computer_info t2 ON t2.date_pulled = DATEADD(day,1,t1.date_pulled)
AND t2.computer_name = t1.computer_name
LEFT JOIN computer_info t3 ON t3.date_pulled > t1.date_pulled
AND t3.computer_name = t1.computer_name
LEFT JOIN computer_info t4 ON t4.date_pulled > t1.date_pulled
AND t4.date_pulled < t3.date_pulled
AND t4.computer_name = t1.computer_name
WHERE t1.date_pulled >= '2014-06-14'
AND t2.date_pulled IS NULL
AND t4.date_pulled IS NULL
AND DATEADD(day,1,t1.date_pulled) < '2014-06-17'
The 't3' join will join all dates over the first missing one and the 't4' join along with t4.pulled_date IS NULL will exclude all but the lowest of those dates.
You could do this with subqueries as well, but excluding joins have served me well in the past.