T SQL Group By Having Count of returning wrong data - tsql

I have a complex t-sql query "for me anyway" that isn't functioning the way I need it to.
The query is designed to return simular records unioned across two databases that have simular records in each database.
If a product fails, it will be assigned a "Failed" in one DB, or a "PF" in the other DB. "PR" means "PRODUCT READY" in both.
I am trying to return a list that includes only "Failed or PF" data that has < two records based on the ProdNo column.
"This is to prompt the employee to test the product again", if 2 records exist in either DB, no action is needed."
My query breaks down when I try to limit the results to show only entries that have less than 2 duplicate "ProdNo" values.
In other words, a product is produced and given a ProdNo number. After testing, it can be marked as a PR, PF, or Failed.
My query should never produce any results with PR, yet when a test is performed several days after the original test, PR values appear in my results.
Here is the query with notes.
-- 1st half of union query
-- Find all run failed's that do not have a PR'ed 2nd test.
Declare #daysback int
set #daysback = -2
select min(sid3)as 'ProdNo',
min([Timestamp])as 'TimeS',
min(Burn) as 'type',
min(Mixer) as 'Mixer'
from [Stat].[dbo].[oedata]
where sid3 IN
(
-- Find run faileds and PRs in Stat db
SELECT [sid3]
from [Stat].[dbo].[oedata]
where (type ='wos') and (burn = 'failed')
and (Flag = '128')
)
--- Limit Results to return only instances of 1 record
AND [Timestamp] > DATEADD( d, #daysback, getdate())
group by Sid3
having COUNT(Sid3) = 1
union all
-- Find PF's in CompanyMES MLab DB
select min(mProd_ProdNumber)as 'ProdNo',
min([Timestamp])as 'TimeS',
min(CheckType) as 'type',
min(Mixer) as 'Mixer'
from [CompanyMES].[dbo].[mLab]
where mProd_ProdNumber IN
(
-- Find failed DFs or scrap wos products
SELECT [mProd_ProdNumber]
from [CompanyMES].[dbo].[mLab]
where (CheckType = 'PF' )
)
-- Limit Results to instances with only 1 record
AND [Timestamp] > DATEADD( d, #daysback, getdate())
group by mProd_ProdNumber
having COUNT(mProd_ProdNumber) < 2
order by TimeS Desc
--------------------------------------------------------------------------
Example data and results:
ProdNo Type
=================
'1111' 'PF'
'1111' 'PR'
'1112' 'PR'
'1113' 'PF'
'1114' 'Failed'
ProdNo 1111 shouldn't return anything as it has 2 records as well as a PR exists.
1113 and 1114 should return results as they both have only 1 record as well as have PF and Failed Types

I think the issue is that you are applying a filter on the Timestamp in your outer queries, but not the inner one where you are filtering the Product Numbers. So, for 1111 and 1112, it could have a 'PF' (or 'Failed') outside of your timestamp filtered range, but only 'PR' inside of it (in one of the tables).

Related

Aggregation on updating order data in Druid

I have streaming data using Kafka to Druid. It's an eCommerce de-normalized order event data where status and few fields get updated in every event.
I need to do aggregate query based on timestamp with the most updated entry only.
For example: If data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now if I want to query on all orders stuck on "Initiated" status for more than 2 days then for above data it should only show orderId "abc".
But if I query something like
Select orderId,qty,paymentId from order where status = Initiated and WHERE "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
This query will return both orders "123" and "abc", but 123 has another event received after 2 days so the previous events should not be included in result.
Is their any good and optimized way to handle this kind of scenarios in Apache druid?
One way I was thinking to use a separate lookup table to store orderId and latest status and perform a join with this lookup and above aggregation query on orderId and status
EDIT 1:
This query works but it joins on whole table, which can give resource limit exception for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done easier. If you replace the status to a numeric representation, you can use it more easy.
First use an inline lookup to replace the status. See this page how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now, we have for example these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
Since we now have a numeric representation, you can simply do a group by query and see the max status number. A query like this would be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING part makes sure that you only see orders which are Initiated.
I hope this solves your problem.

EXISTS in filter returning too many values

I need to write a query that uses EXISTS, rather than IN, so that it will run fast. The filter is being fed so many parameter values that EXISTS seems like the only option. The difference is between a 20+ minute query and a 5 second query.
This is the query I have:
SELECT DISTINCT d.GROUP_NAME
FROM [EMPLOYEE] e JOIN [DATA_FACT] d ON (e.KEY = d.KEY)
WHERE d.DATE BETWEEN #Start and #End
AND EXISTS
(
select '1234567' -- #ID
)
AND e.Location IN (#Location)
ORDER BY d.GROUP_NAME ASC
The problem is that it is returning too many records. Based on the values I'm passing to filter on, I should get 1 row back but instead I am getting 28.
If I remove the EXISTS and add the following then I get the 1 record I need:
AND e.ID IN ('1234567')
Is there a way to fix the query to work with EXISTS so that I get the correct results?
This is essentially what you want if you are going to try to use exists to filter your data_fact table by parameters in your employee table. Not sure how much it's going to improve your performance though when you throw a massive number of employee IDs at it.
SELECT
d.GROUP_NAME
FROM [DATA_FACT] AS d
WHERE d.DATE BETWEEN #Start and #End
AND EXISTS
(
select 1
from EMPLOYEE AS e
WHERE d.[KEY] = e.[KEY]
AND e.[Location] IN (#Location)
AND e.ID IN ('1234567')
)
ORDER BY d.GROUP_NAME ASC

The command has not been executed - Republicated With New Informations - MATCH Clause

I have a Orientdb Database with 2 vertex
Pesquisador with 1 million records
Publicacao with 1 million records too
and an Edge COLABORA_COM with 6,239,382
I need select thousand researchers with the highest number of publications.
I execute the command
SELECT psq1.psq_nome AS nomePesquisador, COUNT(pub1) AS qtdPub
FROM (
MATCH
{class:Pesquisador, as:psq1}.outE("PUBLICOU").inV(){as:pub1}
RETURN psq1, pub1
)
GROUP BY psq1
ORDER BY qtdPub DESC, nomePesquisador
LIMIT 1000;
then there are the error "The command has not been executed"
But if a execute the query
SELECT ordem, out.psq_nome, out.psq_data_nascimento
FROM PUBLICOU
WHERE in.pub_id = 5022
ORDER BY ordem;
the query execute ok
is there an error of memory with MATCH?
Should I change the memory settings?

Tableau - Calculating average where date is less than value from another data source

I am trying to calculate the average of a column in Tableau, except the problem is I am trying to use a single date value (based on filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: Parameters will not work for me here, since new date values are being added constantly to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully can pull the filtered date, but the formula does not work as expected. 2 versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
Below a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT([AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] = P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY[ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated],
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order of taking the average ie
average(IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated]) then [Raw Score]) END)
as written, I believe you'll be averaging the data before returning it from the if statement, whereas you want to return the data, then average it.

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database, that I want to do some logic to update new fields.
The primary key is id for the table harvard_assignees
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1/cnt FROM (SELECT count(*) AS cnt FROM example_table AS x WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp WHERE cnt > 0)
That one will be kind of slow though.
I'm thinking on a solution based on window functions, you may want to explore those too.