filter data based on month on mysql - mysql-workbench

I just want to find the data that appears only in April-June 2018, ignoring data that also appears in earlier months. I don't know why the result is 0 rows, because when I manually check random rows, matching data exists. Here is my query:
SELECT DISTINCT
d1.buyer_id,
d1.tgl
FROM data_2018 d1
INNER JOIN data_2018 d2
ON d1.buyer_id = d2.buyer_id
INNER JOIN data_2017 d3
ON d1.buyer_id = d3.buyer_id
WHERE
MONTH(d1.tgl) IN (4, 5, 6) AND
MONTH(d2.tgl) NOT IN (1, 2, 3) AND
MONTH(d3.tgl) NOT IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);

I recommend using date literals to specify the date range you want. Also, I suggest using EXISTS logic here:
SELECT DISTINCT
d1.buyer_id,
d1.tgl
FROM data_2018 d1
WHERE
d1.tgl >= '2018-04-01' AND d1.tgl < '2018-07-01' AND
NOT EXISTS (SELECT 1 FROM data_2018 d2
WHERE d2.buyer_id = d1.buyer_id AND d2.tgl < '2018-04-01');
This assumes that you only want to eliminate records which also appear before April 2018. If you also want to restrict to not appearing after June 2018, then we would need to add a check for that to the EXISTS subquery.
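As a quick sanity check, here is a minimal sketch of the NOT EXISTS query above, run against a few made-up rows. It uses Python's sqlite3 module as a stand-in for MySQL, with dates stored as ISO text, and the sample rows are invented for illustration. It may also help to notice that, in the original query, MONTH(d3.tgl) NOT IN (1, ..., 12) cannot be true for any valid calendar date, which would explain the empty result.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_2018 (buyer_id INTEGER, tgl TEXT)")

# Hypothetical sample rows: buyer 1 also bought before April, buyer 2 only in
# April-June, buyer 3 only after June.
conn.executemany(
    "INSERT INTO data_2018 VALUES (?, ?)",
    [
        (1, "2018-02-10"),
        (1, "2018-05-03"),
        (2, "2018-04-15"),
        (2, "2018-06-20"),
        (3, "2018-08-01"),
    ],
)

rows = conn.execute(
    """
    SELECT DISTINCT d1.buyer_id, d1.tgl
    FROM data_2018 d1
    WHERE d1.tgl >= '2018-04-01' AND d1.tgl < '2018-07-01'
      AND NOT EXISTS (SELECT 1 FROM data_2018 d2
                      WHERE d2.buyer_id = d1.buyer_id
                        AND d2.tgl < '2018-04-01')
    ORDER BY d1.buyer_id, d1.tgl
    """
).fetchall()

print(rows)  # [(2, '2018-04-15'), (2, '2018-06-20')]
```

Only buyer 2 survives: buyer 1 is dropped because of the February row, and buyer 3 never falls inside the April-June range.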


PostgreSQL sum some values together and don't for other

SELECT
t.id,
sum(o.amount),
t.parent_id
FROM tab t
LEFT JOIN order o ON o.deal = t.id
GROUP BY t.id
Current output:
id | sum | parent_id
 1 |  10 |
 2 |  10 |
 3 |  15 | 5
 4 |  30 | 5
 5 |   0 |
 6 |   0 | 8
 7 |   0 | 8
 8 |  20 |
Desired logic: if a row has a parent_id, skip that row but add its amount to the parent's sum. For ids 3, 4 and 5 the total would be 45, and only id 5 would be shown. The amounts can sit in the "sub tabs" or in the "main tab", but everything should be summed together.
Desired output:
id | sum | parent_id
 1 |  10 |
 2 |  10 |
 5 |  45 |
 8 |  20 |
So far I have tried sub-selects and played around with GROUP BY. Can someone point me in the right direction?
Use coalesce().
with the_data(id, sum, parent_id) as (
values
(1, 10, null),
(2, 10, null),
(3, 15, 5),
(4, 30, 5),
(5, 0, null),
(6, 0, 8),
(7, 0, 8),
(8, 20, null)
)
select coalesce(parent_id, id) as id, sum(sum)
from the_data
group by 1
order by 1
Read about the feature in the documentation.
Your query isn't valid in PostgreSQL:
SELECT
t.id,
sum(o.amount),
t.parent_id
FROM tab t
LEFT JOIN order o ON o.deal = t.id
GROUP BY t.id
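Unlike MySQL with ONLY_FULL_GROUP_BY disabled, PostgreSQL rejects selected columns that are neither aggregated nor listed in GROUP BY, unless they are functionally dependent on the grouped columns (since 9.1 this covers the case where t.id is the primary key of tab).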
Anyway, if you're using t.id in your GROUP BY clause, then each t.id will produce one row, so you'll always have 3 and 4 separated, for example.
It looks like you're trying to use the parent_id as the main criterion to group by, falling back on the id when the parent_id is NULL.
You could use COALESCE(t.parent_id, t.id) to get this value for each row, and then group using it.
For example:
SELECT
COALESCE(t.parent_id, t.id) AS id,
SUM(o.amount)
FROM tab t
-- "order" is a reserved word in PostgreSQL, so the table name must be quoted
LEFT JOIN "order" o ON o.deal = t.id
GROUP BY COALESCE(t.parent_id, t.id)
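To see the grouping in action, here is a small sketch that runs the COALESCE query on the sample values from the question. It uses Python's sqlite3 module as a stand-in for PostgreSQL, and the split of the data into tab and "order" rows is an assumption reconstructed from the "current output" table.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.execute('CREATE TABLE "order" (deal INTEGER, amount INTEGER)')

# Rows reconstructed from the question's current output (hypothetical layout).
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [(1, None), (2, None), (3, 5), (4, 5), (5, None), (6, 8), (7, 8), (8, None)],
)
conn.executemany(
    'INSERT INTO "order" VALUES (?, ?)',
    [(1, 10), (2, 10), (3, 15), (4, 30), (8, 20)],
)

rows = conn.execute(
    """
    SELECT COALESCE(t.parent_id, t.id) AS id, SUM(o.amount) AS total
    FROM tab t
    LEFT JOIN "order" o ON o.deal = t.id
    GROUP BY COALESCE(t.parent_id, t.id)
    ORDER BY 1
    """
).fetchall()

print(rows)  # [(1, 10), (2, 10), (5, 45), (8, 20)]
```

Rows 3 and 4 fold into their parent 5, and rows 6 and 7 fold into 8. SUM ignores the NULL amounts produced by the LEFT JOIN for rows without orders.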

filter several data in several date mysql

I want to list the buyers who made a transaction in months 1, 2, 3 (Jan-Mar) of 2019 and who also made a transaction in months 4, 5, 6 (Apr-June) of 2017. If a buyer made a transaction before April 2017, that buyer should not appear in the list. I tried the query below, but I don't know why it returns so many rows:
SELECT DISTINCT
d1.buyer_id
FROM data_2019 d1
WHERE
MONTH (d1.tgl) IN (1, 2, 3) AND
NOT EXISTS (SELECT 1 FROM data_2017 d2
WHERE d2.buyer_id = d1.buyer_id AND d2.tgl < '2017-04-01')
GROUP BY
buyer_id;
Can you tell me what is wrong?
I suspect there are two other problems, beyond what Tim Biegeleisen noted.
First, every sale by every buyer in data_2019 will result in tests in data_2017. I suggest querying from a table of all buyers, with an EXISTS() clause on data_2019. This should also eliminate the need for the DISTINCT clause.
Second, partitioning the data into different tables by year will be a serious headache as time passes. Why not put it all into a single table?
Thus:
SELECT
b.buyer_id
FROM buyer b
WHERE
EXISTS (SELECT 1 FROM data_all d
WHERE d.buyer_id = b.buyer_id AND
d.tgl >= '2019-01-01' AND d.tgl < '2019-04-01') AND
EXISTS (SELECT 1 FROM data_all d
WHERE d.buyer_id = b.buyer_id AND
d.tgl >= '2017-04-01' AND d.tgl < '2017-07-01') AND
NOT EXISTS (SELECT 1 FROM data_all d
WHERE d.buyer_id = b.buyer_id AND
d.tgl >= '2017-01-01' AND d.tgl < '2017-04-01');
At this point, if you wanted to extend the "not before April 2017" condition to all earlier years, you would just remove the d.tgl >= '2017-01-01' condition, where you might otherwise need a separate NOT EXISTS clause for each year.
I would express this using two EXISTS clauses:
SELECT DISTINCT
d1.buyer_id
FROM data_2019 d1
WHERE
d1.tgl >= '2019-01-01' AND d1.tgl < '2019-04-01' AND
EXISTS (SELECT 1 FROM data_2017 d2
WHERE d2.buyer_id = d1.buyer_id AND
d2.tgl >= '2017-04-01' AND d2.tgl < '2017-07-01') AND
NOT EXISTS (SELECT 1 FROM data_2017 d2
WHERE d2.buyer_id = d1.buyer_id AND d2.tgl < '2017-04-01');
The EXISTS clause asserts that a buyer from the first quarter of 2019 was also active between April and June (inclusive) of 2017. The NOT EXISTS clause makes sure that this same buyer had no activity in the first quarter of 2017.
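A minimal sketch of how the EXISTS / NOT EXISTS pair behaves on a handful of invented buyers, again using Python's sqlite3 module as a stand-in for MySQL with ISO text dates. All sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_2019 (buyer_id INTEGER, tgl TEXT)")
conn.execute("CREATE TABLE data_2017 (buyer_id INTEGER, tgl TEXT)")

# Hypothetical buyers:
#   1: Q1 2019 + May 2017 only          -> should be listed
#   2: Q1 2019 + Feb and May 2017       -> excluded (activity before April 2017)
#   3: Q1 2019, nothing in 2017         -> excluded (no Apr-Jun 2017 activity)
#   4: May 2019 only + Apr 2017         -> excluded (not in Q1 2019)
conn.executemany(
    "INSERT INTO data_2019 VALUES (?, ?)",
    [(1, "2019-02-01"), (2, "2019-02-05"), (3, "2019-03-09"), (4, "2019-05-01")],
)
conn.executemany(
    "INSERT INTO data_2017 VALUES (?, ?)",
    [(1, "2017-05-10"), (2, "2017-02-01"), (2, "2017-05-11"), (4, "2017-04-20")],
)

rows = conn.execute(
    """
    SELECT DISTINCT d1.buyer_id
    FROM data_2019 d1
    WHERE d1.tgl >= '2019-01-01' AND d1.tgl < '2019-04-01'
      AND EXISTS (SELECT 1 FROM data_2017 d2
                  WHERE d2.buyer_id = d1.buyer_id
                    AND d2.tgl >= '2017-04-01' AND d2.tgl < '2017-07-01')
      AND NOT EXISTS (SELECT 1 FROM data_2017 d2
                      WHERE d2.buyer_id = d1.buyer_id
                        AND d2.tgl < '2017-04-01')
    ORDER BY d1.buyer_id
    """
).fetchall()

print(rows)  # [(1,)]
```

Buyer 3 illustrates why the original query returned too many rows: with only the NOT EXISTS test, a buyer with no 2017 activity at all passes the filter.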

How to add a dash between running numbers and comma between non-running numbers

I would like to replace a set of running and non running numbers with commas and hyphens where appropriate.
Using STUFF & XML PATH I was able to accomplish some of what I want by getting something like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 15, 19, 20, 21, 22, 24.
WITH CTE AS (
SELECT DISTINCT t1.ORDERNo, t1.Part, t2.LineNum
FROM [DBName].[DBA].Table1 t1
JOIN Table2 t2 ON t2.Part = t1.Part
WHERE t1.ORDERNo = 'AB12345')
SELECT c1.ORDERNo, c1.Part, STUFF((SELECT ', ' + CAST(LineNum AS VARCHAR(5))
FROM CTE c2
WHERE c2.ORDERNo= c1.ORDERNo
FOR XML PATH('')), 1, 2, '') AS [LineNums]
FROM CTE c1
GROUP BY c1.ORDERNo, c1.Part
Here is some sample output:
ORDERNo Part LineNums
ON5650 PT01-0181 5, 6, 7, 8, 12
ON5652 PT01-0181 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 15, 19, 20, 21, 22, 24
ON5654 PT01-0181 1, 4
ON5656 PT01-0181 1, 2, 4
ON5730 PT01-0181 1, 2
ON5253 PT16-3934 1, 2, 3, 4, 5
ON1723 PT02-0585 1, 2, 3, 6, 8, 9, 10
Would like to have:
OrderNo Part LineNums
ON5650 PT01-0181 5-8, 12
ON5652 PT01-0181 1-10, 13, 15, 19-22, 24
ON5654 PT01-0181 1, 4
ON5656 PT01-0181 1-2, 4
ON5730 PT01-0181 1-2
ON5253 PT16-3934 1-5
ON1723 PT02-0585 1-3, 6, 8-10
This is a classic gaps-and-islands problem.
(a good read on the subject is Itzik Ben-Gan's Gaps and islands from SQL Server MVP Deep Dives)
The idea is that you first need to identify the groups of consecutive numbers. Once you've done that, the rest is easy.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
N int
);
INSERT INTO @T VALUES
(1), (2), (3), (4),
(6),
(8),
(10), (11),
(13), (14), (15),
(17),
(19), (20), (21),
(25);
Then, use a common table expression to identify the groups.
With Grouped AS
(
SELECT N,
N - ROW_NUMBER() OVER(ORDER BY N) As Grp
FROM @T
)
The result of this CTE is:
N Grp
1 0
2 0
3 0
4 0
6 1
8 2
10 3
11 3
13 4
14 4
15 4
17 5
19 6
20 6
21 6
25 9
As you can see, while the numbers are consecutive, the grp value stays the same.
When a row has a number that isn't consecutive with the previous number, the grp value changes.
Then you select from that CTE, using a CASE expression to select either a single number (if it's the only one in its group) or the start and end of the group, separated by a dash:
SELECT STUFF(
(
SELECT ', ' +
CASE WHEN MIN(N) = MAX(N) THEN CAST(MIN(N) as varchar(11))
ELSE CAST(MIN(N) as varchar(11)) +'-' + CAST(MAX(N) as varchar(11))
END
FROM Grouped
GROUP BY grp
FOR XML PATH('')
), 1, 2, '') As GapsAndIslands
The result:
GapsAndIslands
1-4, 6, 8, 10-11, 13-15, 17, 19-21, 25
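The "value minus row number" trick is not specific to SQL. Here is a minimal Python sketch of the same idea, assuming the input is a list of integers; the function name format_islands is made up for this illustration.

```python
from itertools import groupby


def format_islands(numbers):
    """Collapse runs of consecutive integers into 'start-end' ranges."""
    ordered = sorted(set(numbers))
    # Value minus its 1-based position is constant within a consecutive run,
    # which mirrors N - ROW_NUMBER() OVER (ORDER BY N) in the CTE.
    keyed = [(n - pos, n) for pos, n in enumerate(ordered, start=1)]
    parts = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        run_values = [n for _, n in run]
        first, last = run_values[0], run_values[-1]
        parts.append(str(first) if first == last else f"{first}-{last}")
    return ", ".join(parts)


sample = [1, 2, 3, 4, 6, 8, 10, 11, 13, 14, 15, 17, 19, 20, 21, 25]
print(format_islands(sample))  # 1-4, 6, 8, 10-11, 13-15, 17, 19-21, 25
```

Because the values are distinct and sorted, the key never decreases, so grouping adjacent equal keys is enough to find the islands.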
For fun I put together another way using window aggregates (e.g. SUM() OVER ...). I also use some newer T-SQL functionality such as CONCAT (2012+) and STRING_AGG (2017+). This uses Zohar's sample data.
DECLARE @T AS TABLE(N INT PRIMARY KEY CLUSTERED);
INSERT INTO @T VALUES (1),(2),(3),(4),(6),(8),(10),(11),(13),(14),(15),(17),(19),(20),(21),(25);
WITH
a AS (
SELECT t.N,isNewGroup = SIGN(t.N-LAG(t.N,1,t.N-1) OVER (ORDER BY t.N)-1)
FROM @T AS t),
b AS (
SELECT a.N, GroupNbr = SUM(a.isNewGroup) OVER (ORDER BY a.N)
FROM a),
c AS (
SELECT b.GroupNbr,
txt = CONCAT(MIN(b.N), REPLICATE(CONCAT('-',MAX(b.N)), SIGN(MAX(b.N)-MIN(b.N))))
FROM b
GROUP BY b.GroupNbr)
SELECT STRING_AGG(c.txt,', ') WITHIN GROUP (ORDER BY c.GroupNbr) AS Islands
FROM c;
Returns:
Islands
1-4, 6, 8, 10-11, 13-15, 17, 19-21, 25
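The LAG plus running SUM idea can be sketched in Python as well: flag each row that starts a new group, then take a cumulative sum of the flags to number the groups. The function names below are hypothetical and exist only for this illustration.

```python
from itertools import accumulate


def group_numbers(ordered):
    """Number the islands: the counter goes up by 1 wherever a gap occurs."""
    # 1 when the gap to the previous value is larger than 1, else 0.
    # The first row never opens a "new" group, like LAG's default of N-1.
    flags = [0] + [1 if cur - prev > 1 else 0
                   for prev, cur in zip(ordered, ordered[1:])]
    return list(accumulate(flags))  # running SUM() OVER (ORDER BY N)


def format_islands_running_sum(numbers):
    ordered = sorted(set(numbers))
    bounds = {}  # group number -> (min, max); dicts keep insertion order
    for n, grp in zip(ordered, group_numbers(ordered)):
        low, _ = bounds.get(grp, (n, n))
        bounds[grp] = (low, n)
    return ", ".join(
        str(low) if low == high else f"{low}-{high}"
        for low, high in bounds.values()
    )


sample = [1, 2, 3, 4, 6, 8, 10, 11, 13, 14, 15, 17, 19, 20, 21, 25]
print(group_numbers(sample))
print(format_islands_running_sum(sample))
```

Unlike the N minus ROW_NUMBER() key, the group numbers here are dense (0, 1, 2, ...), which is why the last island gets 7 rather than 9.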
And here an approach using a recursive CTE.
DECLARE @T AS TABLE(N INT PRIMARY KEY CLUSTERED);
INSERT INTO @T VALUES (1),(2),(3),(4),(6),(8),(10),(11),(13),(14),(15),(17),(19),(20),(21),(25);
WITH Numbered AS
(
SELECT N, ROW_NUMBER() OVER(ORDER BY N) AS RowIndex FROM @T
)
,recCTE AS
(
SELECT N
,RowIndex
,CAST(N AS VARCHAR(MAX)) AS OutputString
,(SELECT MAX(n2.RowIndex) FROM Numbered n2) AS MaxRowIndex
FROM Numbered WHERE RowIndex=1
UNION ALL
SELECT n.N
,n.RowIndex
,CASE WHEN A.TheEnd =1 THEN CONCAT(r.OutputString,CASE WHEN IsIsland=1 THEN '-' ELSE ',' END, n.N)
WHEN A.IsIsland=1 AND A.IsWithin=0 THEN CONCAT(r.OutputString,'-')
WHEN A.IsIsland=1 AND A.IsWithin=1 THEN r.OutputString
WHEN A.IsIsland=0 AND A.IsWithin=1 THEN CONCAT(r.OutputString,r.N,',',n.N)
ELSE CONCAT(r.OutputString,',',n.N)
END
,r.MaxRowIndex
FROM Numbered n
INNER JOIN recCTE r ON n.RowIndex=r.RowIndex+1
CROSS APPLY(SELECT CASE WHEN n.N-r.N=1 THEN 1 ELSE 0 END AS IsIsland
,CASE WHEN RIGHT(r.OutputString,1)='-' THEN 1 ELSE 0 END AS IsWithin
,CASE WHEN n.RowIndex=r.MaxRowIndex THEN 1 ELSE 0 END AS TheEnd) A
)
SELECT TOP 1 OutputString FROM recCTE ORDER BY RowIndex DESC;
The idea in short:
First we create a numbered set.
The recursive CTE will use the row's index to pick the next row, thus iterating through the set row-by-row
The APPLY determines three BIT values:
IsIsland: if the distance to the previous value is 1, we are on an island; otherwise not.
IsWithin: if the last character of the growing output string is a hyphen, we are waiting for the end of an island; otherwise not.
TheEnd: whether we have reached the last row.
The CASE handles the resulting combinations:
First we deal with the end, to avoid a trailing hyphen.
Reaching an island, we add a hyphen.
Staying on the island, we just continue.
Reaching the end of an island, we add the last number and a comma, and start a new island.
Any other case just adds a comma and starts a new island.
Hint: You can read island as group or section, while the commas mark the gaps.
Combining what I already had and using Zohar Peled's code I was finally able to figure out a solution:
WITH cteLineNums AS (
SELECT TOP 100 PERCENT t1.OrderNo, t1.Part, t2.LineNum
-- LineNum minus its rank within the order/part is constant across a consecutive run
, (t2.LineNum - ROW_NUMBER() OVER(PARTITION BY t1.OrderNo, t1.Part ORDER BY t1.OrderNo, t1.Part, t2.LineNum)) AS RowSeq
FROM [DBName].[DBA].Table1 t1
JOIN Table2 t2 ON t2.Part = t1.Part
WHERE t1.OrderNo = 'AB12345'
GROUP BY t1.OrderNo, t1.Part, t2.LineNum
ORDER BY t1.OrderNo, t1.Part, t2.LineNum)
SELECT OrderNo, Part
, STUFF((SELECT ', ' +
CASE WHEN MIN(LineNum) = MAX(LineNum) THEN CAST(MIN(LineNum) AS VARCHAR(3))
WHEN MIN(LineNum) = (MAX(LineNum)-1) THEN CAST(MIN(LineNum) AS VARCHAR(3)) + ', ' + CAST(MAX(LineNum) AS VARCHAR(3))
ELSE CAST(MIN(LineNum) AS VARCHAR(3)) + '-' + CAST(MAX(LineNum) AS VARCHAR(3))
END
FROM cteLineNums c1
WHERE c1.OrderNo = c2.OrderNo
AND c1.Part = c2.Part
-- one output item per run of consecutive line numbers
GROUP BY c1.RowSeq
ORDER BY MIN(c1.LineNum)
FOR XML PATH('')), 1, 2, '') AS [LineNums]
FROM cteLineNums c2
GROUP BY OrderNo, Part
I used ROW_NUMBER() OVER (PARTITION BY ...) since I return multiple records with different order numbers and part numbers. All this led to me still having to do the self join in the second part in order to get the correct LineNums to show for each record.
The second WHEN in the CASE statement is due to the code defaulting to having something like 2, 5, 8-9, 14 displayed when it should be 2, 5, 8, 9, 14.

Redshift Dist key, IDentity column or join column? Cardinality of Column, Used in Join consideration for sort Key

I have a table that has an Identity column called ID and another column called DateID that references another table.
The date column is used in joins but the ID column has much more cardinality.
Distinct count for ID column : 657167
Distinct count for DateID column: 350
Can anyone please provide any insights as to which column would be a better choice for distribution key?
Also, regarding a related question (merged into this older one):
I have a dilemma in selecting sort and dist keys for my table.
Sort keys:
Should I consider cardinality when selecting a sort key?
Columns that join with other tables would be candidates for a sort key. Is my assumption correct?
If I use a compound sort key with two columns, does the order of the columns matter?
If I define the column DateID as the dist key, should I put DateID in front of customerId when defining the compound sort key?
P.S. I read some articles regarding choosing dist key and they say I should be using a column that is used in joining with other tables and has greater cardinality.
SELECT SP.*,
CP.*,
TV.*
FROM
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f -- contains about 600K records
JOIN DimDate d -- contains about 700 records
ON f.DateID = d.DateID
JOIN DimTime t -- contains 24 records
ON f.TimeID = t.HourID
JOIN DimSalesBranch s -- contains about 64K records
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-12-31'
AND StartHour >= 9
AND starthour > 0
AND (EndHour <= 22)
) SP
LEFT JOIN
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimTime t
ON f.TimeID = t.HourID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
AND StartHour >= 9
AND (EndHour <= 22)
) CP
ON SP.StartDate = CP.StartDate_CP
AND SP.EndDate = CP.EndDate_CP
LEFT JOIN
(
SELECT * --> there are about 6 aggregation statements in the select statement
FROM FactSalesTargetBranch f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
) TV
ON SP.StartDate = TV.StartDate_TV
AND SP.EndDate = TV.EndDate_TV;
Any insights much appreciated.
Regards.
In this case:
Use "even" distribution for your main table; this will allow good parallelism (DateID would be a bad candidate).
Use "all" distribution for your DateID table (the smaller table that you join with).
Generally, "even" distribution is a good choice and will give you the best results unless you need to join large tables together.
see https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html

PostgreSQL: multiple LEFT JOIN with multiple conditions

Here is an extract of my data model (including an extract of tables content).
I need to compute the number of operations of type 1 for the year 2015. I also want the complete list of towns in my result, not only the towns referenced in the operation table (with a count of zero for towns with no registered operations). I then need to specify several conditions, but a WHERE clause turns my LEFT JOIN into an INNER JOIN (see this post), so I have to specify the conditions inside the ON clauses.
SELECT
town.town_code,
count(operation.*) AS nb
FROM town
LEFT JOIN operation ON town.town_code = operation.ope_town AND operation.ope_year = 2015
LEFT JOIN intervention ON operation.ope_id = intervention.int_ope_id
LEFT JOIN nature ON intervention.int_id = nature.int_id AND nature.type_id = 1
GROUP BY town.town_code ORDER BY town.town_code ;
I get the following result:
town_code | nb
------------+-----
86000 | 1
86001 | 0
86002 | 1
86003 | 1
86004 | 0
86005 | 0
There is a problem with town code 86003 which should have 0. This town code refers to one operation (#5) which refers to one intervention (#16) which refers to a nature type = 3. So one of the conditions is not filled...
How can I deal with several conditions within ON clauses?
EDIT : Here is the script to create the tables and test.
CREATE TABLE town (town_code INTEGER, town_name CHARACTER VARYING(255)) ;
CREATE TABLE operation (ope_id INTEGER, ope_year INTEGER, ope_town INTEGER) ;
CREATE TABLE intervention (int_id INTEGER, int_ope_id INTEGER) ;
CREATE TABLE nature (int_id INTEGER, type_id INTEGER) ;
INSERT INTO town VALUES (86000, 'Lille'), (86001, 'Paris'), (86002, 'Nantes'), (86003, 'Rennes'), (86004, 'Marseille'), (86005, 'Londres') ;
INSERT INTO operation VALUES (1, 2014, 86000), (2, 2015, 86000), (3, 2012, 86001), (4, 2015, 86002), (5, 2015, 86003) ;
INSERT INTO intervention VALUES (12, 1), (13, 2), (14, 3), (15, 4), (16, 5) ;
INSERT INTO nature VALUES (12, 1), (13, 1), (14, 3), (15, 1), (16, 3) ;
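It's because count(operation.*) counts the rows of the first LEFT JOIN, and operation #5 matches town 86003 whether or not the later joins find a nature of type 1. For example, you can use: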
SELECT t.town_code, count(j.*) AS nb FROM town t
LEFT JOIN (SELECT o.ope_town cd, o.ope_year yr FROM operation o, intervention i, nature n
WHERE o.ope_year = 2015
AND o.ope_id = i.int_ope_id AND n.type_id = 1
AND i.int_id = n.int_id) j
ON j.cd = t.town_code
GROUP BY t.town_code ORDER BY t.town_code;
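Here is a small sketch that checks the idea against the test script from the question, using Python's sqlite3 module as a stand-in for PostgreSQL. Two details differ from the answer and are assumptions of this sketch: the subquery is written with explicit JOINs, and it counts DISTINCT j.ope_id instead of count(j.*), partly because SQLite does not accept count(j.*) and partly so an operation with several matching interventions would be counted once.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE town (town_code INTEGER, town_name TEXT);
    CREATE TABLE operation (ope_id INTEGER, ope_year INTEGER, ope_town INTEGER);
    CREATE TABLE intervention (int_id INTEGER, int_ope_id INTEGER);
    CREATE TABLE nature (int_id INTEGER, type_id INTEGER);
    INSERT INTO town VALUES (86000, 'Lille'), (86001, 'Paris'), (86002, 'Nantes'),
                            (86003, 'Rennes'), (86004, 'Marseille'), (86005, 'Londres');
    INSERT INTO operation VALUES (1, 2014, 86000), (2, 2015, 86000), (3, 2012, 86001),
                                 (4, 2015, 86002), (5, 2015, 86003);
    INSERT INTO intervention VALUES (12, 1), (13, 2), (14, 3), (15, 4), (16, 5);
    INSERT INTO nature VALUES (12, 1), (13, 1), (14, 3), (15, 1), (16, 3);
    """
)

rows = conn.execute(
    """
    SELECT t.town_code, COUNT(DISTINCT j.ope_id) AS nb
    FROM town t
    LEFT JOIN (
        -- only operations from 2015 that have a type-1 nature survive here
        SELECT o.ope_town, o.ope_id
        FROM operation o
        JOIN intervention i ON i.int_ope_id = o.ope_id
        JOIN nature n ON n.int_id = i.int_id
        WHERE o.ope_year = 2015 AND n.type_id = 1
    ) j ON j.ope_town = t.town_code
    GROUP BY t.town_code
    ORDER BY t.town_code
    """
).fetchall()

for town_code, nb in rows:
    print(town_code, nb)
```

Town 86003 now gets 0, because operation #5 is filtered out inside the subquery before the LEFT JOIN to town happens.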