How to get date range between dates from the records in the same table? - tsql

I have a table with employment records. It has the employee code, a status, and the date when the record was updated.
Like this:
Employee | Status  | Date
001      | termed  | 01/01/2020
001      | rehired | 02/02/2020
001      | termed  | 03/03/2020
001      | rehired | 04/04/2021
Problem: I need the length of each period during which an employee was working for the company, and if it was less than a year, that record should not be displayed.
There can be multiple hire/rehire cycles for each employee; 10-20 is normal.
So I'm thinking about two separate selects into two tables, and then looking for the closest date from a hire in table 1 to a termination in table 2. But that seems overcomplicated.
Is there a better way?

Many approaches, but something like this could work:
SELECT
    Employee,
    SUM(DaysWorked) AS DaysWorked
FROM
(
    SELECT
        a1.employee,
        -- days from each hire/rehire row to the next termination;
        -- when there is no later termination, count up to today
        IsNull(DateDiff(DD, a1.[Date],
            (SELECT TOP 1 a2.[Date] FROM aaa a2 WHERE a2.employee = a1.employee AND a2.[Date] > a1.[Date] AND a2.[Status] = 'termed' ORDER BY a2.[Date])
        ), DateDiff(DD, a1.[Date], getDate())) AS DaysWorked
    FROM
        aaa a1
    WHERE
        a1.[Status] <> 'termed'
) Totals
GROUP BY
    Totals.employee
HAVING SUM(DaysWorked) >= 365
Using CROSS APPLY (or OUTER APPLY, to keep rows that have no later match) instead of the correlated subquery is also an option and perhaps more efficient. In this example, replace 'aaa' with the actual table name. The IsNull deals with an employee who is still working.
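A minimal sketch of the same pairing idea, run through Python's sqlite3 rather than T-SQL so it can be tried without a server. It uses LEAD to fetch the next row's date instead of a correlated subquery, and it assumes the rows for an employee strictly alternate between hire and termination. The dates are stored in ISO format and "today" is pinned to 2022-01-01 so the numbers are reproducible; both are assumptions of this example, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aaa (employee TEXT, status TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO aaa VALUES (?, ?, ?)",
    [
        ("001", "termed", "2020-01-01"),
        ("001", "rehired", "2020-02-02"),
        ("001", "termed", "2020-03-03"),
        ("001", "rehired", "2021-04-04"),
    ],
)

# For every row, LEAD picks up the date of the employee's next row.
# Keeping only the non-termination rows leaves one row per working period;
# a NULL end date means the employee is still working.
periods = conn.execute(
    """
    SELECT employee, start_date, end_date,
           CAST(julianday(COALESCE(end_date, :today))
                - julianday(start_date) AS INTEGER) AS days_worked
    FROM (
        SELECT employee, status, date AS start_date,
               LEAD(date) OVER (PARTITION BY employee ORDER BY date) AS end_date
        FROM aaa
    )
    WHERE status <> 'termed'
    ORDER BY employee, start_date
    """,
    {"today": "2022-01-01"},  # fixed stand-in for getDate()
).fetchall()

for row in periods:
    print(row)
```

From here, filtering on days_worked >= 365 per period, or summing per employee as in the query above, is a one-line change.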

How to create a working pivot table without killing the system

Let's say you have a Customer table, a simple customer table with just 4 columns:
customerCode numeric(7,0)
customerName char(50)
customerVATNumber char(11)
customerLocation char(35)
Keep in mind that the customers table contains 3 million rows because it holds all the customers of the last 40 years, but only 980,000 of them are active.
Suppose we then have a table called Sales structured in this way:
saleID integer
customerCode numeric(7,0)
agentID numeric(6,0)
productID char(2)
dateBeginSale date
dateEndSale date
There are about three and a half million rows in this table (here too we have data from 40 years ago), but the current supplies for the various products total one million. The company only sells 4 products. Each customer can purchase up to 4 products with 4 different contracts, even from 4 different agents. Most (90%) buy only one, the rest from two to four (only a handful buy the complete assortment).
I was asked to build a pivot table showing, for each customer with its name and location, all the products purchased and from which agent.
The proposed layout for this pivot table is:
customerCode
customerName
customerLocation
productID1
agentID1
saleID1
dateBeginSale1
dateEndSale1
productID2
agentID2
saleID2
dateBeginSale2
dateEndSale2
productID3
agentID3
saleID3
dateBeginSale3
dateEndSale3
productID4
agentID4
saleID4
dateBeginSale4
dateEndSale4
I built the pivot with a view.
First I created 4 views, one for each product id on the Sales table, also useful for other statistical and reporting purposes
View1 as
customerCode1
productID1
agentID1
saleID1
dateBeginSale1
dateEndSale1
View2 as
customerCode2
productID2
agentID2
saleID2
dateBeginSale2
dateEndSale2
and so on till View4
Then I joined the 4 views with the customer table and created the PivotView I needed.
Now Select * from PivotView works perfectly.
So does Select * from PivotView Where customerLocation='NEW YORK CITY'.
Any other request, for example selecting and counting the customers residing in LOS ANGELES who have purchased the products from the same agent or from different sales agents, literally brings the machine to its knees: I see memory usage grow (probably due to the construction of some temporary table or view) and the query often crashes.
However, if I build the same pivot as a table instead of a view, the times of the various selections drop sharply and, even if heavy (there are always about a million records to scan to check the various conditions), they become acceptable.
For sure I am getting something wrong, and/or there must be a better way to achieve the result: having a pivot built on online data instead of one built from data extracted nightly.
I'll be happy to read your comments and suggestions.
I don't clearly understand your data layout and what you need. But I'll say that the usual problem with pivoting data on Db2 for IBM i is that there's no built in way to dynamically pivot the data.
Given that you only have 4 products, the above limitation doesn't really apply.
Your problem would seem to be that by creating 4 views over the same table, you're processing records repeatedly. Instead, try to touch the data one time.
create view PivotSales as
select
customerCode,
-- product 1
max(case productID when '01' then productID end) as productID1,
max(case productID when '01' then agentID end) as agentID1,
max(case productID when '01' then saleID end) as saleID1,
max(case productID when '01' then dateBeginSale end) as dateBeginSale1,
max(case productID when '01' then dateEndSale end) as dateEndSale1,
-- product 2
max(case productID when '02' then productID end) as productID2,
max(case productID when '02' then agentID end) as agentID2,
max(case productID when '02' then saleID end) as saleID2,
max(case productID when '02' then dateBeginSale end) as dateBeginSale2,
max(case productID when '02' then dateEndSale end) as dateEndSale2,
-- repeat for product 3 and 4
from Sales
group by customerCode;
Now you can have a CustomerSales view:
create view CustomerSales as
select *
from Customer join PivotSales using (customerCode);
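A small sketch of the single-pass pivot, run in Python's sqlite3 as a stand-in for Db2 for IBM i. The sample rows, the reduced column set (agent only), and the product codes '01' and '02' are made up for illustration. It answers the "customers in LOS ANGELES who bought both products from the same agent" kind of question directly against the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (customerCode INTEGER PRIMARY KEY,
                       customerName TEXT, customerLocation TEXT);
CREATE TABLE Sales (saleID INTEGER PRIMARY KEY, customerCode INTEGER,
                    agentID INTEGER, productID TEXT);
INSERT INTO Customer VALUES (1, 'Acme', 'LOS ANGELES'),
                            (2, 'Bolt', 'LOS ANGELES'),
                            (3, 'Cog',  'NEW YORK CITY');
INSERT INTO Sales VALUES (10, 1, 100, '01'), (11, 1, 200, '02'),
                         (12, 2, 100, '01'), (13, 2, 100, '02'),
                         (14, 3, 300, '01');

-- One scan of Sales, one output row per customer.
CREATE VIEW PivotSales AS
SELECT customerCode,
       MAX(CASE productID WHEN '01' THEN agentID END) AS agentID1,
       MAX(CASE productID WHEN '02' THEN agentID END) AS agentID2
FROM Sales
GROUP BY customerCode;
""")

# Customers in LOS ANGELES whose product 1 and product 2 came from the same agent.
same_agent = conn.execute("""
SELECT c.customerCode, p.agentID1, p.agentID2
FROM Customer c
JOIN PivotSales p ON p.customerCode = c.customerCode
WHERE c.customerLocation = 'LOS ANGELES'
  AND p.agentID1 = p.agentID2
ORDER BY c.customerCode
""").fetchall()

print(same_agent)
```

Customers missing one of the two products have a NULL in that column, so the equality test drops them without any extra condition.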
Run your queries, using Visual Explain to see what indexes the system suggests are needed. At minimum, you should have these indexes:
Customer (customerCode)
Customer (customerLocation, customerCode)
Sales (customerCode)
I suspect that some Encoded Vector Indexes (EVI) over various columns in Sales and Customer would prove helpful. Especially since you mention "counting". An EVI keeps track of the counts of the symbols. So counting is "free". An example:
create encoded vector index customerLocEvi
on Customer (customerLocation);
-- this doesn't have to read any rows in Customer
select count(*)
from Customer
where customerLocation = 'LOS ANGELES';
"For sure I am getting something wrong, and/or there must be a better way to achieve the result: having a pivot built on online data instead of one built from data extracted nightly."
Don't be too sure about that. The DB structure that best supports Business Intelligence type queries usually doesn't match the typical transactional data structure. A periodic "extract, transform, load (ETL)" is pretty typical.
For your particular use case, you could turn CustomerSales into a Materialized Query Table (MQT), build some supporting indexes for it, and just run queries directly over it. A nightly rebuild would be as simple as REFRESH TABLE CustomerSales;
Or, if you wanted to, since Db2 for IBM i doesn't support SYSTEM MAINTAINED MQTs, a trigger over Sales could automatically propagate data to CustomerSales instead of rebuilding it nightly.

postgresql selecting the most representative value

I have a table in which objects have ids and they have names. The ids are correct by definition, the names are almost always correct, but sometimes dirty incoming data causes names to be null or even wrong.
So I do a query like
SELECT id, name, AGGR1(a) as a, AGGR2(b) as b, AGGR3(c) as c
FROM my_table
WHERE d = 3
GROUP BY id
I'd like to have name in the results, but of course the above is wrong. I'd have to group on id, name, in which case what should be one row sometimes becomes more than one -- say, id 2 has names 'John' (correct), 'Jon' (no, but only 1%), or NULL (also a small fraction).
Is there a construct or idiom in postgresql that lets me select what a human looking at the list would say is obviously the consensus name?
(I hear our postgres installation is finally being upgraded soon, if that matters here.)
sample output, in case prose wasn't clear
SELECT id, name, COUNT(id) as c
FROM my_table
WHERE d = 3
GROUP BY id
id name c
2 John 2000
2 Jon 3
2 (NULL) 5
vs
id name c
2 John 2008
You can get the names with
WITH names as (
SELECT
id,
name,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY COUNT(1) DESC) as rn
FROM my_table
GROUP BY id, name
)
SELECT id, name
FROM names
WHERE rn=1;
and then do your calculations by id only, joining names from this query.
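A rough sketch of that last step (aggregate by id only, then attach the consensus name), run in Python's sqlite3 as a stand-in for PostgreSQL. The sample rows and the SUM/COUNT aggregates are placeholders for AGGR1 and friends. Two choices here are assumptions rather than part of the answer above: NULL names are excluded from the vote, and the counting is done in its own CTE before ROW_NUMBER is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT, a INTEGER, d INTEGER)")
sample = (
    [(2, "John", 1, 3)] * 5      # the consensus spelling
    + [(2, "Jon", 1, 3)]         # a rare typo
    + [(2, None, 1, 3)] * 2      # missing names
    + [(7, "Mary", 10, 3)] * 2
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", sample)

result = conn.execute("""
WITH counts AS (              -- how often each non-NULL name occurs per id
    SELECT id, name, COUNT(*) AS n
    FROM my_table
    WHERE d = 3 AND name IS NOT NULL
    GROUP BY id, name
),
names AS (                    -- rank the names per id, most frequent first
    SELECT id, name,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY n DESC) AS rn
    FROM counts
)
SELECT t.id, n.name, SUM(t.a) AS a, COUNT(*) AS c
FROM my_table t
JOIN names n ON n.id = t.id AND n.rn = 1
WHERE t.d = 3
GROUP BY t.id, n.name
ORDER BY t.id
""").fetchall()

print(result)
```

The rows with the typo and the NULL names still contribute to the aggregates; only the displayed name comes from the vote.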

number of lessons for a teacher SQL

I have four tables, one is teacher and 3 different kinds of lessons.
My task is to find the teacher that has given the most lessons during a year.
input:
lesson1.id lesson1.date teacher.id
1 2020-12-01 1
2 2020-04-01 1
lesson2.id lesson2.date teacher.id
1 2020-10-01 2
2 2020-05-01 3
lesson3.id lesson3.date teacher.id
1 2020-02-01 1
2 2020-06-01 3
teacher.id teacher.name
1 john
2 scott
3 david
output:
teacher.id teacher.name lessons_given
1 john 3
I tried to join them together with left join on teacher but its not working...
Hope you guys can help me out:)
Thanks
What you are attempting to build is a many-to-many (m:m) relationship between Teacher and Lesson. Instead, what you have is many one-to-many relationships. While that works for a small number of lessons (with some difficulty), think about the same requirement with 50 or 500 or more lessons. What you actually need is 3 tables:
create table lessons( lesson_id integer generated always as identity
, name text
, subject text -- for example
-- other lesson related attributes
);
create table teachers( teacher_id integer generated always as identity
, name text
-- other related teacher attributes
);
create table teacher_lessons( teacher_id integer
, lesson_id integer
, lesson_date date
);
Now you have a structure that can handle any number of teachers and lessons. It can also be extended as is to other uses, say linking students to lessons. See fiddle for the current issue.
You could union all the three lesson tables to get a "flat" list of lessons, and then join that on the teachers table:
SELECT t.id, t.name, COUNT(*) AS lessons_given
FROM teacher t
JOIN (SELECT teacher_id FROM lesson1 UNION ALL
SELECT teacher_id FROM lesson2 UNION ALL
SELECT teacher_id FROM lesson3) l ON t.id = l.teacher_id
GROUP BY t.id, t.name
ORDER BY lessons_given DESC
LIMIT 1
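For reference, a self-contained sketch of the UNION ALL approach on the sample data from the question, run in Python's sqlite3. The teacher_id column name and the explicit filter on the year 2020 are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE lesson1 (id INTEGER, date TEXT, teacher_id INTEGER);
CREATE TABLE lesson2 (id INTEGER, date TEXT, teacher_id INTEGER);
CREATE TABLE lesson3 (id INTEGER, date TEXT, teacher_id INTEGER);
INSERT INTO teacher VALUES (1, 'john'), (2, 'scott'), (3, 'david');
INSERT INTO lesson1 VALUES (1, '2020-12-01', 1), (2, '2020-04-01', 1);
INSERT INTO lesson2 VALUES (1, '2020-10-01', 2), (2, '2020-05-01', 3);
INSERT INTO lesson3 VALUES (1, '2020-02-01', 1), (2, '2020-06-01', 3);
""")

# Flatten the three lesson tables, keep one year, count per teacher, take the top one.
top = conn.execute("""
SELECT t.id, t.name, COUNT(*) AS lessons_given
FROM teacher t
JOIN (SELECT teacher_id, date FROM lesson1 UNION ALL
      SELECT teacher_id, date FROM lesson2 UNION ALL
      SELECT teacher_id, date FROM lesson3) l ON l.teacher_id = t.id
WHERE l.date >= '2020-01-01' AND l.date < '2021-01-01'
GROUP BY t.id, t.name
ORDER BY lessons_given DESC
LIMIT 1
""").fetchone()

print(top)
```

With LIMIT 1 a tie for first place is broken arbitrarily, so add a tie-breaker to the ORDER BY if that matters.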
The obvious way to do it is to use Belayer's solution.
However, if and only if you cannot put all the data in a single table for some reason (for example, if lesson1, lesson2 and lesson3 all have specific attributes), then another solution would be to use table inheritance.
For instance :
CREATE TABLE lesson (
id INT,
date TIMESTAMP,
teacher INT
);
ALTER TABLE lesson1 INHERIT lesson;
ALTER TABLE lesson2 INHERIT lesson;
ALTER TABLE lesson3 INHERIT lesson;
Now, in order to count the number of lessons each teacher is involved in, you can just use the lesson table:
SELECT teacher.id, teacher.name, COUNT(lesson.id)
FROM teacher
LEFT JOIN lesson ON lesson.teacher = teacher.id
GROUP BY teacher.id, teacher.name
ORDER BY COUNT(lesson.id) DESC
FETCH FIRST ROW WITH TIES;
You can replace the last line with LIMIT 1 if you are only interested in getting one of the most active teachers, but then your result is no longer deterministic.
Again, please do not use inheritance if there is no need to.

fetch data from and to date to get all matching results

Hello everyone, I have to get data between a from date and a to date. I tried using a BETWEEN clause, but it fails to retrieve the data I need. Here is what I need.
I have table called hall_info which has following structure
hall_info
id | hall_name |address |contact_no
1 | abc | India |XXXX-XXXX-XX
2 | xyz | India |XXXX-XXXX-XX
Now I have one more table, events, which records which hall is booked on which date. Its structure is as follows.
id |hall_info_id |event_date(booked_date)| event_name
1 | 2 | 2015-10-25 | Marriage
2 | 1 | 2015-10-28 | Marriage
3 | 2 | 2015-10-26 | Marriage
So what I need is to show the hall names that are not booked on the selected dates. Suppose the user chooses 2015-10-23 to 2015-10-30: I want to list all halls that are free on at least one of those dates. In the case above, both halls (hall_info_id 1 and 2) are booked within the given range, but I still want to show them because they are free on the 23rd, 24th, 27th and 29th.
In the second case, suppose the user chooses 2015-10-25 to 2015-10-26: hall_info_id 2 is booked on both dates, so in this case I want to show only hall_info_id 1.
I tried using an inner query and a BETWEEN clause, but I am not getting the required result. To keep it simple I have shown only selected fields; I have more tables to join, so I can't paste my query. Please help with this. Thanks in advance to all who try.
Some changes in Yasen Zhelev's code:
SELECT * FROM hall_info
WHERE id not IN (
SELECT hall_info_id FROM events
WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
GROUP BY hall_info_id
HAVING COUNT(DISTINCT event_date) > DATE_PART('day', '2015-10-30'::timestamp - '2015-10-23'::timestamp))
I have not tried it, but how about checking whether the number of bookings per hall is less than the number of days in the selected period? The subquery finds the halls that are booked on every day, and NOT IN excludes them.
SELECT * FROM hall_info WHERE id NOT IN
(SELECT hall_info_id FROM events
WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
GROUP BY hall_info_id
HAVING COUNT(id) >= DATEDIFF(day, '2015-10-23', '2015-10-30') + 1
);
That will only work if you have one booking per day per hall.
To get the "available dates" for the hall returned, your query needs a row source of all possible dates. For example, if you had a calendar table populated with possible date values, e.g.
CREATE TABLE cal (dt DATE NOT NULL PRIMARY KEY) Engine=InnoDB
;
INSERT INTO cal (dt) VALUES ('2015-10-23')
,('2015-10-24'),('2015-10-25'),('2015-10-26'),('2015-10-27')
,('2015-10-28'),('2015-10-29'),('2015-10-30'),('2015-10-31')
;
Then you could use a query that performs a cross join between the calendar table and hall_info, to get every hall on every date, and an anti-join pattern to eliminate rows that are already booked.
The anti-join pattern is an outer join with a restriction in the WHERE clause to eliminate matching rows.
For example:
SELECT cal.dt, h.id, h.hall_name, h.address
FROM cal cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_info_id = h.id
AND e.event_date = cal.dt
WHERE e.id IS NULL
AND cal.dt >= '2015-10-23'
AND cal.dt <= '2015-10-30'
The cross join between cal and hall_info gets all halls for all dates (restricted in the WHERE clause to a specified range of dates.)
The outer join to events finds matching rows in the events table (matching on hall_info_id and event_date). The trick is the predicate (condition) in the WHERE clause, e.id IS NULL. That throws out any rows that had a match, leaving only rows that don't have a match.
This type of problem is similar to other "sparse data" problems. e.g. How do you return a zero total for sales by a given store on a given date, when there are no rows with that store and date...
In your case, the query needs a source of rows with available date values. That doesn't necessarily have to be a table named calendar. (Other databases give us the ability to dynamically generate a row source; someday, MySQL may have similar features.)
If you want the row source to be dynamic in MySQL, then one approach would be to create a temporary table, populate it with the dates, run the query referencing the temporary table, and then drop the temporary table.
Another approach is to use an inline view to return the rows...
SELECT cal.dt, h.id, h.hall_name, h.address
FROM (
SELECT '2015-10-23'+INTERVAL 0 DAY AS dt
UNION ALL SELECT '2015-10-24'
UNION ALL SELECT '2015-10-25'
UNION ALL SELECT '2015-10-26'
UNION ALL SELECT '2015-10-27'
UNION ALL SELECT '2015-10-28'
UNION ALL SELECT '2015-10-29'
UNION ALL SELECT '2015-10-30'
) cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_info_id = h.id
AND e.event_date = cal.dt
WHERE e.id IS NULL
FOLLOWUP: When this question was originally posted, it was tagged with mysql. The SQL in the examples above is for MySQL.
In terms of writing a query to return the specified results, the general issue is still the same in PostgreSQL. The general problem is "sparse data".
The SQL query needs a row source for the "missing" date values, but the specification doesn't provide any source for those date values.
The answer above discusses several possible row sources in MySQL: 1) a table, 2) a temporary table, 3) an inline view.
The answer also mentions that some databases (not MySQL) provide other mechanisms that can be used as a row source.
For example, PostgreSQL provides a nifty generate_series function (Reference: http://www.postgresql.org/docs/9.1/static/functions-srf.html).
It should be possible to use the generate_series function as a row source, to supply a set of rows containing the date values needed by the query to produce the specified result.
This answer demonstrates the approach to solving the "sparse data" problem.
If the specification is to return just the list of halls, and not the dates they are available, the queries above can be easily modified to remove the date expression from the SELECT list, and add a GROUP BY clause to collapse the rows into a distinct list of halls.
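As a rough illustration of a dynamically generated row source plus the anti-join, here is a sketch in Python's sqlite3. A recursive CTE plays the role that generate_series would play in PostgreSQL; the free_halls helper is a hypothetical name, and the query returns the distinct list of halls that are free on at least one date in the range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hall_info (id INTEGER PRIMARY KEY, hall_name TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY, hall_info_id INTEGER,
                     event_date TEXT, event_name TEXT);
INSERT INTO hall_info VALUES (1, 'abc'), (2, 'xyz');
INSERT INTO events VALUES (1, 2, '2015-10-25', 'Marriage'),
                          (2, 1, '2015-10-28', 'Marriage'),
                          (3, 2, '2015-10-26', 'Marriage');
""")

def free_halls(conn, start, end):
    """Halls with at least one unbooked date between start and end (inclusive)."""
    sql = """
    WITH RECURSIVE cal(dt) AS (      -- one row per date in the range
        SELECT :start
        UNION ALL
        SELECT date(dt, '+1 day') FROM cal WHERE dt < :end
    )
    SELECT DISTINCT h.id, h.hall_name
    FROM cal
    CROSS JOIN hall_info h
    LEFT JOIN events e
           ON e.hall_info_id = h.id
          AND e.event_date = cal.dt
    WHERE e.id IS NULL               -- anti-join: keep only hall/date pairs with no booking
    ORDER BY h.id
    """
    return conn.execute(sql, {"start": start, "end": end}).fetchall()

print(free_halls(conn, "2015-10-23", "2015-10-30"))
print(free_halls(conn, "2015-10-25", "2015-10-26"))
```

Dropping DISTINCT and adding cal.dt to the SELECT list gives back the individual free dates per hall instead.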

Creating a many to many in postgresql

I have two tables that I need to link in a many-to-many relationship. The first table, which we will call inventory, is populated via a form. The other table, sales, is populated by importing CSVs into the database weekly.
I want to step through the sales table and associate each sale row with a row with the same sku in the inventory table. Here's the kicker: I need to associate only the number of sales rows indicated in the Quantity field of each Inventory row.
Now I know I can do this by creating a Perl script that steps through the sales table and creates links using the ItemIDUniqueKey field in a loop based on the Quantity field. What I want to know is: is there a way to do this using SQL commands alone? I've read a lot about many-to-many relationships and I've not found anyone doing this.
Assuming tables:
create table a(
item_id integer,
quantity integer,
supplier_id text,
sku text
);
and
create table b(
sku text,
sale_number integer,
item_id integer
);
following query seems to do what you want:
update b b_updated set item_id = (
select item_id
from (select *, sum(quantity) over (partition by sku order by item_id) as sum from a) a
where
a.sku=b_updated.sku and
(a.sum)>
(select count(1) from b b_counted
where
b_counted.sale_number<b_updated.sale_number and
b_counted.sku=b_updated.sku
)
order by a.sum asc limit 1
);
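To see what the update does, here is a small worked example in Python's sqlite3 (a stand-in for PostgreSQL, with made-up rows). The running total of quantity per sku marks where each inventory row's allocation ends, and each sale is matched to the first inventory row whose running total exceeds the number of earlier sales for that sku. The alias running_qty replaces sum purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (item_id INTEGER, quantity INTEGER, supplier_id TEXT, sku TEXT);
CREATE TABLE b (sku TEXT, sale_number INTEGER, item_id INTEGER);
INSERT INTO a VALUES (1, 2, 's1', 'X'), (2, 1, 's1', 'X'), (3, 1, 's2', 'Y');
INSERT INTO b (sku, sale_number) VALUES
    ('X', 101), ('X', 102), ('X', 103), ('X', 104), ('Y', 201);
""")

conn.execute("""
UPDATE b SET item_id = (
    SELECT a.item_id
    FROM (SELECT item_id, sku,
                 SUM(quantity) OVER (PARTITION BY sku ORDER BY item_id) AS running_qty
          FROM a) AS a
    WHERE a.sku = b.sku
      AND a.running_qty > (SELECT COUNT(*)       -- earlier sales of the same sku
                           FROM b AS b_counted
                           WHERE b_counted.sku = b.sku
                             AND b_counted.sale_number < b.sale_number)
    ORDER BY a.running_qty
    LIMIT 1
)
""")

links = conn.execute(
    "SELECT sale_number, item_id FROM b ORDER BY sale_number"
).fetchall()
print(links)
```

Sale 104 stays unlinked because sku X only has three units in inventory, which is the "only as many as Quantity" behaviour asked for.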