How to query the first row efficiently? - kdb

I have a table with a large number of records:
date instrument price
2019.03.07 X 1.1
2019.03.07 X 1.0
2019.03.07 X 1.2
...
When I query for the day's opening price, I use:
1 sublist select from prices where date = 2019.03.07, instrument = `X
It takes a long time to execute because it selects all the prices on that day and then takes the first one.
I also tried:
select from prices where date = 2019.03.07, instrument = `X, i = 0 //It does not return any record (why?)
select from prices where date = 2019.03.07, instrument = `X, i = first i //Seem to work. Does it?
In Oracle the equivalent would be:
select * from prices where date = to_date(...) and instrument = 'X' and rownum = 1
and Oracle will stop immediately when it finds the first record.
How can I do this in kdb (i.e. stop immediately after the first record is found)?

In kdb, where subclauses in select statements are executed sequentially, i.e. only those records which pass the first "test" get passed to the second test. With that in mind, looking at your two attempts:
select from prices where date = 2019.03.07, instrument = `X, i = 0 //It does not return any record (why?)
This doesn't (necessarily) return anything, because by the time it gets to the i=0 check, you've already filtered out some records (possibly including the first record in the original table, which would have i=0)
select from prices where date = 2019.03.07, instrument = `X, i = first i //Seem to work. Does it?
This one should work. First you filter by date. Then within the records for that date, you select the records for instrument `X. Then within those records, you take the record where i equals first i (i has already been filtered down at this point, so first i is simply the index of the first remaining record; note that it is still the index from the original table, not from the filtered-down version).

The q-SQL equivalent for that is select[n], which also performs better than other approaches in most cases. A positive 'n' will give the first n records and a negative 'n' will give the last n records.
q) select[1] from prices where date = 2019.03.07, instrument = `X
There is no built-in functionality to stop after the first match. You could write a custom function for that, but it would probably execute more slowly than the supported version above.

Related

Sqlalchemy; count items between two dates over a time period

My postgres DB has a Price table where I store price data for a bunch of products. For each Price object I store when it was created (Price.timestamp), and whenever there is a new price for the product, I create a new Price object and store when the old Price object ended (Price.time_ended). Both times are datetime objects.
Now, I want to count how many Prices there are over a time period. Easy I thought, so I did the query below:
trunc_date = db.func.date_trunc('day', Price.timestamp)
query = db.session.query(trunc_date, db.func.count(Price.id))
query = query.order_by(trunc_date.desc())
query = query.group_by(trunc_date)
prices_count = query.all()
Which is great, but only counts how many prices were new/created for each day. So what I thought I could do, was to filter so that I would get prices where the trunc_date is between the beginning and the end for the Price, like below:
query = query.filter(Price.timestamp < trunc_date < Price.time_ended)
But apparently you are not allowed to use trunc_date this way. Can anyone help me with how I am supposed to write my query?
Data example:
Price.id Price.timestamp Price.time_ended
1 2022-18-09 2022-26-09
2 2022-13-09 2022-20-09
The query result i would like to get is:
2022-27-09; 0
2022-26-09; 1
2022-25-09; 1
...
2022-20-09; 2
2022-19-09; 2
2022-18-09; 2
2022-17-09; 1
...
2022-12-09; 0
Have you tried separating the conditions inside the filter?
query = db.session.\
    query(trunc_date, db.func.count(Price.id)).\
    filter(
        (Price.timestamp < trunc_date),
        (trunc_date < Price.time_ended)
    ).\
    group_by(trunc_date).\
    order_by(trunc_date.desc()).\
    all()
You can use
trunc_date.between(Price.timestamp, Price.time_ended)
I figured it out.
First I created a date range by using a subquery.
from datetime import datetime, timedelta

todays_date = datetime.today() - timedelta(days = 1)
numdays = 360
min_date = todays_date - timedelta(days = numdays)
date_series = db.func.generate_series(min_date, todays_date, timedelta(days=1))
trunc_date = db.func.date_trunc('days', date_series)
subquery = db.session.query(trunc_date.label('day')).subquery()
Then I used the subquery as input in my original query, and I was finally able to filter on the dates from the subquery.
query = db.session.query(subquery.c.day, db.func.count(Price.id))
query = query.order_by(subquery.c.day.desc())
query = query.group_by(subquery.c.day)
query = query.filter(Price.timestamp < subquery.c.day)
query = query.filter(Price.time_ended > subquery.c.day)
Now, query.all() will give you a nice list that counts the prices for each day specified in the date_series.

SQL: How to select first record per day, assuming that each day contains more than 1 value

I am trying to write a SQL query where the results would show the first value (ID) per user per day for the last year.
I tried using the query below and am able to get results for one day, but when I try to change the time range to > 2021-06-01, it does not give me the results I expect.
select * from table
where value in
(
SELECT min(value)
FROM table
WHERE valueid = x
group by user
) and Time = '2022-05-30' and value is not null

Filter portal for most recently created record by group

I have a portal on my "Clients" table. The related table contains the results of surveys that are updated over time. For each combination of client and category (a field in the related table), I only want the portal to display the most recently collected row.
Here is a link to a trivial example that illustrates the issue I'm trying to address. I have two tables in this example (Related on ClientID):
Clients
Table 1 Get Summary Method
The Table 1 Get Summary Method table looks like this:
Where:
MaxDate is a summary field = Maximum of Date
MaxDateGroup is a calculated field = GetSummary ( MaxDate ; ClientIDCategory )
ShowInPortal = If ( Date = MaxDateGroup ; 1 ; 0 )
The table is sorted on ClientIDCategory
Issue 1 that I'm stumped on:
ShowInPortal should equal 1 in row 3 (PKTable01 = 5), row 4 (PKTable01 = 6), and row 6 (PKTable01 = 4) in the table above. I'm not sure why FM is interpreting 1Red and 1Blue as the same category, or perhaps I'm just misunderstanding what the GetSummary function does.
The Clients table looks like this:
Where:
The portal records are sorted on ClientIDCategory
Issue 2 that I'm stumped on:
I only want rows with a ShowInPortal value equal to 1 to appear in the portal. I tried creating a portal filter with the following formula: Table 1 Get Summary Method::ShowInPortal = 1. However, using that filter removes all rows from the portal.
Any help is greatly appreciated.
One solution is to use ExecuteSQL to grab the Max Date. This removes the need for summary functions and sorts, and works as expected. I propose returning it as a number to avoid any issues with date formats.
GetAsTimestamp (
    ExecuteSQL (
        "SELECT DISTINCT COALESCE(MaxDate,'')
         FROM Survey
         WHERE ClientIDCategory = ? "
        ; "" ; "" ; ClientIDCategory )
)
Also, you need to change the ShowInPortal field to an unstored calc field with:
If ( GetAsNumber(Date) = MaxDateGroupSQL ; 1 ; 0 )
Then filter the portal on this field.
I can send you the sample file if you want.

Using many connection.cursor()

I want to fetch data from 3 tables in a single database at once. I used 3 conn.cursor() objects to do it. Are there any more sophisticated ways to do it?
conn = psycopg2.connect(database="plottest", user="postgres")
self.statusbar.showMessage("Database opened Successfully", 1000)
cur = conn.cursor()
cur1 = conn.cursor()
cur2 = conn.cursor()
cur.execute("SELECT id ,actual from \"%s\" " % date)
rows = cur.fetchall()
cur1.execute("SELECT qty from DAILY where date = \'%s\'" % date)
dailyqty = cur1.fetchone()
cur2.execute("SELECT qty from MONTHLY where month = \'%s\'" % month)
monthqty = cur2.fetchone()
Awoogah awoogah, SQL injection warning! Don't write code using string interpolation. What happens if someone calls your code with the "date" ');-- DROP TABLE DAILY;-- ?
Use bind parameters. Always.
The only exception is for dynamic identifiers, like in the case above where you seem to use a table named after the current date. In that case you must "double quote" them and double any contained double-quotes. In your case that means that date should be date.replace('"', '""') where you substitute it into the SQL.
Now, back to our regular programming.
Since you fetch the full result from each cursor before running the next query, you can just re-use one cursor. You don't need new cursors each time.
You can also combine the daily and monthly stats if you want, with a UNION ALL. I fixed your capitalisation and parameter binding in the process:
cur.execute("""SELECT 1, qty FROM daily WHERE date = %s
UNION ALL
SELECT 2, qty FROM monthly WHERE month = %s
ORDER BY 1""",
(date, month))
Note that string interpolation isn't used; instead, a 2-tuple of parameters is passed to psycopg2 to bind directly. There's no need for quotes around the parameters, psycopg2 adds them if needed.
This avoids a client-server round trip by bundling the two queries. The extra column and ORDER BY are technically needed so you can safely assume the first row is the daily result and the second is the monthly one. In practice PostgreSQL won't re-order them with UNION ALL, though.
You can combine
SELECT a1 FROM t1 WHERE b1 = 'v1';
and
SELECT a2 FROM t2 WHERE b2 = 'v2';
to a single statement like this:
SELECT t1.a1, t2.a2 FROM t1, t2
WHERE t1.b1 = 'v1' AND t2.b2 = 'v2';
provided that both queries return exactly one row.

Tableau - Calculating average where date is less than value from another data source

I am trying to calculate the average of a column in Tableau, except the problem is I am trying to use a single date value (based on filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: Parameters will not work for me here, since new date values are being added constantly to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully can pull the filtered date, but the formula does not work as expected. 2 versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
Below a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT(CASE WHEN [AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] >= P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY[ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated]
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order of taking the average ie
average(IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN [Raw Score] END)
As written, I believe you'll be averaging the data before returning it from the IF statement, whereas you want to return the data, then average it.