KDB/q: inserting a list into a table

I am looking to insert a list of data into a table. I have tried both upsert and insert. Neither updated the trade table.
// simTrade is a function that takes in
// number of orders and generate a table with random data
simTrade:{[nOrders]
seed:-314159;
openTime:`time$09:30;
closeTime:`time$16:00;
listCustomers:`XNYS`ARCX`XCHI`XASE`XCIS`XNAS`XBOS`XPHL`BATS`BATY`EDGA`EDGX`IEXG;
listProducts: `Derivative`Futures`Indicies
system "S ",string seed;
times: asc closeTime&openTime+nOrders?390*60*1000;
dates: asc 2015.03.01&2015.01.01+nOrders?30;
customers: nOrders?listCustomers;
products: nOrders?listProducts;
orderIds: 1+til nOrders;
versions: nOrders?5;
sizes: 100*nOrders?10;
trade:([]
time:`time$();
date:`date$();
customer:`symbol$(); / Customer name {xyz fund, asd,fund}
product:`symbol$(); / product name {Derivative, Equities}
orderId:`long$(); / order id {1-10}
version:`long$(); / version {1-10}
size:`long$()
)
insert[`trade; (times;dates;customers;products;orderIds;versions;sizes)];
show trade
}
lob:simTrade[100]
I have checked the data types for a type error but couldn't find any issue with them.
May I also ask why, when I change insert to upsert, the following error is returned?
evaluation error: length
Thanks for any help.

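insert and upsert, when given a table name such as `trade, can be applied to global variables only. Here trade is a local variable of the function, hence insert/upsert throws a type error. For more details see the documentation for insert.
I would suggest filling in the table values directly when you construct the table: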
...
trade: ([]
time:times;
date:dates;
customer:customers; / Customer name {xyz fund, asd,fund}
product:products; / product name {Derivative, Equities}
orderId:orderIds; / order id {1-10}
version:versions; / version {1-10}
size:sizes
);
...

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
I also have a table that uses an array of this composite type:
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
I want to update the mydata_t element in data_list whose user_id is 100000, but I don't know which array element has that user_id.
So I first have to search for the element whose user_id equals 100000, and that is my problem: I don't know how to write the query. In short, I want to update the value of the array element whose user_id is 100000, in the tbl row with a given id (for example 1). What would the query be?
Something like this (I know it's wrong!):
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update the tbl row whose id is 2 and set the value of the data element whose user_id is 11 to 'XXXX'.
The final result for Row2 should be:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current content of the value field, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current content of the value field, the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
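To make the intent of the rebuild query easier to follow, here is a minimal Python sketch of the same idea on an in-memory list: keep every element as it is except the one with the matching user_id. The function name replace_value and the list-of-tuples representation are assumptions made for illustration, not part of PostgreSQL.

```python
# Hypothetical in-memory stand-in for one row's data_list:
# each element is a (user_id, value) pair, like a mydata_t value.
def replace_value(data_list, user_id, new_value):
    """Return a new list where the element(s) matching user_id get new_value."""
    return [
        (uid, new_value if uid == user_id else val)
        for uid, val in data_list
    ]

row2 = [(10, "YYYY"), (11, "YYYY")]
updated = replace_value(row2, 11, "XXXX")
print(updated)
```

Unlike the SQL version, this sketch preserves the element order and adds nothing when no element matches; it also builds a whole new list, which mirrors the fact that the array is rewritten rather than patched.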
Keep in mind, though, that a lot of work goes on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish, and you cannot use an index on it. Furthermore, an update actually writes a new row version to the underlying file on disk, and if your array has more than a few entries this involves substantial work. It gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. All of this happens behind the scenes, so it will work, but at a performance penalty. Even though array_replace sounds as if changes are made in place (and in memory they indeed are), the UPDATE command writes a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table of its own (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
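As a rough illustration of the normalised design, the sketch below uses Python's built-in sqlite3 module in place of PostgreSQL. The column layout of data_list (id, user_id, value) and the composite primary key are assumptions; the answer only names the table and the two columns used in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout: one row per former array element.
    CREATE TABLE data_list (
        id      INTEGER NOT NULL,   -- id of the parent tbl row
        user_id INTEGER NOT NULL,
        value   TEXT,
        PRIMARY KEY (id, user_id)   -- doubles as the lookup index
    );
    INSERT INTO data_list VALUES
        (1, 5, 'YYYY'), (1, 6, 'YYYY'),
        (2, 10, 'YYYY'), (2, 11, 'YYYY');
""")

# The update touches exactly one small row instead of rewriting an array.
cur = conn.execute(
    "UPDATE data_list SET value = 'XXXX' WHERE id = 2 AND user_id = 11"
)
rows_changed = cur.rowcount
result = conn.execute(
    "SELECT id, user_id, value FROM data_list ORDER BY id, user_id"
).fetchall()
print(rows_changed, result)
```

The primary key on (id, user_id) is one possible choice of "appropriate index" for this lookup; a different key or a separate index may suit other access patterns better.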

updatexml for particular rows only

Context: I want to increase the allowance value of some employees from £1875 to £7500, and update their balance to be £7500 minus whatever they have currently used.
My UPDATE statement works for one employee at a time, but I need to update around 200 records in a table containing about 6000.
I am struggling to work out how to modify the statement below so that it updates more than one record, but only the 200 records I need.
UPDATE employeeaccounts
SET xml = To_clob(Updatexml(Xmltype(xml),
'/EmployeeAccount/CurrentAllowance/text()',187500,
'/EmployeeAccount/AllowanceBalance/text()',
750000 - (SELECT Extractvalue(Xmltype(xml),
'/EmployeeAccount/AllowanceBalance',
'xmlns:ts=\"http://schemas.com/\", xmlns:xt=\"http://schemas.com\"'
)
FROM employeeaccounts
WHERE id = '123456')))
WHERE id = '123456'
Here is an example of the xml column (stored as a CLOB) that I want to update. The table's ID column holds the employee ID as its primary key, e.g. 123456.
<EmployeeAccount>
<LastUpdated>2016-06-03T09:26:38+01:00</LastUpdated>
<MajorVersion>1</MajorVersion>
<MinorVersion>2</MinorVersion>
<EmployeeID>123456</EmployeeID>
<CurrencyID>GBP</CurrencyID>
<CurrentAllowance>187500</CurrentAllowance>
<AllowanceBalance>100000</AllowanceBalance>
<EarnedDiscount>0.0</EarnedDiscount>
<NormalDiscount>0.0</NormalDiscount>
<AccountCreditLimit>0</AccountCreditLimit>
<AccountBalance>0</AccountBalance>
</EmployeeAccount>
You don't need a subquery to get the old balance; you can use the value from the current row. That means there is nothing to correlate, and you can simply use an in() in the main statement:
UPDATE employeeaccounts
SET xml = To_clob(Updatexml(Xmltype(xml),
'/EmployeeAccount/CurrentAllowance/text()',187500,
'/EmployeeAccount/AllowanceBalance/text()',
750000 - Extractvalue(Xmltype(xml),
'/EmployeeAccount/AllowanceBalance',
'xmlns:ts=\"http://schemas.com/\", xmlns:xt=\"http://schemas.com\"')
))
WHERE id in (123456, 654321, ...);
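The same per-row logic can be sketched in Python with the standard xml.etree.ElementTree module: each document's new balance is computed from that document's own old balance, and only the selected IDs are touched. The function update_account, the trimmed-down XML documents, and the sample IDs other than 123456 and 654321 are assumptions for illustration; the literals 187500 and 750000 are copied from the statement above.

```python
import xml.etree.ElementTree as ET

def update_account(xml_text, new_allowance=187500, ceiling=750000):
    """Mirror the UPDATE: set CurrentAllowance, recompute AllowanceBalance."""
    root = ET.fromstring(xml_text)
    balance = root.find("AllowanceBalance")
    old_balance = int(balance.text)          # value from the current "row"
    root.find("CurrentAllowance").text = str(new_allowance)
    balance.text = str(ceiling - old_balance)
    return ET.tostring(root, encoding="unicode")

def make_doc(balance):
    # Trimmed-down, hypothetical version of the EmployeeAccount document.
    return (
        "<EmployeeAccount>"
        "<CurrentAllowance>187500</CurrentAllowance>"
        f"<AllowanceBalance>{balance}</AllowanceBalance>"
        "</EmployeeAccount>"
    )

accounts = {"123456": make_doc(100000), "654321": make_doc(50000), "999999": make_doc(1)}
ids_to_update = {"123456", "654321"}   # plays the role of the IN (...) list

updated = {
    emp_id: update_account(doc) if emp_id in ids_to_update else doc
    for emp_id, doc in accounts.items()
}
```

No lookup of another record is needed at any point, which is why the correlated subquery in the original statement can be dropped.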

PostgreSQL hierarchical nested set huge database

I have a database that must store thousands of scenarios (each scenario identified by a single unix_timestamp value). Each scenario has 1,800,000 records organized in a Nested Set structure.
The general table structure is given by:
table_skeleton:
- unix_timestamp integer
- lft integer
- rgt integer
- value
Usually, my SELECTs fetch all nested values within a specific scenario, for example:
SELECT * FROM table_skeleton WHERE unix_timestamp = 123 AND lft >= 10 AND rgt <= 53
So I hierarchically divided my table into master / children within groups of dates, for example:
table_skeleton_201303 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
and
table_skeleton_201304 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
I also created an index on each child table to match the searches I expect, for example:
Create Index idx_201303
on table_skeleton_201303
using btree(unix_timestamp, lft, rgt)
This improved retrieval, but each select still takes about 1 minute.
I assumed this was because the index was too big to stay in memory, so I tried creating a partial index for each timestamp, for example:
Create Index idx_201303_1362981600
on table_skeleton_201303
using btree(lft, rgt)
WHERE unix_timestamp = 1362981600
The second type of index is indeed much, much smaller than the general one. However, when I run EXPLAIN ANALYZE on the SELECT shown above, the query planner ignores my new partial index and keeps using the giant old one.
Is there a reason for that?
Is there another approach to optimizing this kind of huge hierarchical nested set database?
When you filter a table on field_a > x and field_b > y, an index on (field_a, field_b) will be used only for "field_a > x" (and even then only maybe, depending on the distribution of values and the percentage of rows with field_a > x, as per the collected statistics); field_b > y is then checked with a sequential search.
In the case above, two indexes, one per field, could both be used and their results joined, the internal equivalent of:
SELECT *
FROM table t
JOIN (
SELECT id FROM table WHERE field_a > x) ta ON (ta.id = t.id)
JOIN (
SELECT id FROM table WHERE field_b > y) tb ON (tb.id = t.id);
There is a chance you could benefit from a GiST index, treating your lft and rgt fields as points:
CREATE INDEX ON table USING GIST (unix_timestamp, point(lft, rgt));
SELECT * FROM table
WHERE unix_timestamp = 123 AND
point(lft,rgt) <@ box(point(10,'-inf'), point('inf',53));
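To see why the point-in-box test expresses the same condition as the original range filter, here is a small Python sketch that compares the two predicates on a few sample (lft, rgt) pairs. The function names and sample values are made up for illustration; the bounds 10 and 53 come from the query in the question.

```python
import math

def in_box(lft, rgt, lo_lft=10, hi_rgt=53):
    """Point (lft, rgt) inside the box with corners (lo_lft, -inf) and (+inf, hi_rgt)."""
    x_ok = lo_lft <= lft <= math.inf
    y_ok = -math.inf <= rgt <= hi_rgt
    return x_ok and y_ok

def range_filter(lft, rgt, lo_lft=10, hi_rgt=53):
    """The original WHERE clause: lft >= 10 AND rgt <= 53."""
    return lft >= lo_lft and rgt <= hi_rgt

samples = [(10, 53), (9, 53), (10, 54), (20, 30), (1, 100)]
agree = all(in_box(l, r) == range_filter(l, r) for l, r in samples)
matches = [(l, r) for l, r in samples if in_box(l, r)]
print(agree, matches)
```

The two infinite corners simply leave the upper bound on lft and the lower bound on rgt open, so a single two-dimensional containment check covers both range conditions at once.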

Temporary Table Value into a Table-Value UDF

I was having some trouble with a SQL 2k sproc, which we moved to SQL 2k5 so we could use table-valued UDFs instead of scalar UDFs.
This is simplified, but this is my problem.
I have a temporary table that I fill with product information. I then pass that product information into a UDF and return the results to my main result set. It doesn't seem to work.
Am I not allowed to pass a temporary table value into a CROSS APPLY'd table-valued UDF?
--CREATE AND FILL #brandInfo
SELECT sku, upc, prd_id, cp.customerPrice
FROM products p
JOIN #brandInfo b ON p.brd_id=b.brd_id
CROSS APPLY f_GetCustomerPrice(b.priceAdjustmentValue, b.priceAdjustmentAmount, p.Price) cp
--f_GetCUstomerPrice uses the AdjValue, AdjAmount, and Price to calculate users actual price
When I put dummy values in for b.priceAdjustmentValue and b.priceAdjustmentAmount it works great. But as soon as I try to pass in the temp table values, it bombs:
Msg 207, Level 16, State 1, Line 140
Invalid column name 'b.priceAdjustmentValue'.
Msg 207, Level 16, State 1, Line 140
Invalid column name 'b.priceAdjustmentAmount'.
Have you tried:
--CREATE AND FILL #brandInfo
SELECT sku, upc, prd_id, cp.customerPrice
FROM products p
JOIN #brandInfo b ON p.brd_id=b.brd_id
CROSS APPLY (
SELECT *
FROM f_GetCustomerPrice(b.priceAdjustmentValue, b.priceAdjustmentAmount, p.Price)
) cp
--f_GetCUstomerPrice uses the AdjValue, AdjAmount, and Price to calculate users actual price
That might give the UDF the proper context to resolve the column references.
EDIT:
I have built the following UDF in my local Northwind 2005 database:
CREATE FUNCTION dbo.f_GetCustomerPrice(@adjVal DECIMAL(28,9), @adjAmt DECIMAL(28,9), @price DECIMAL(28,9))
RETURNS TABLE
AS RETURN
(
SELECT Level = 'One', AdjustValue = @adjVal, AdjustAmount = @adjAmt, Price = @price
UNION
SELECT Level = 'Two', AdjustValue = 2 * @adjVal, AdjustAmount = 2 * @adjAmt, Price = 2 * @price
)
GO
And referenced it in the following query without issue:
SELECT p.ProductID,
p.ProductName,
b.CompanyName,
f.Level
FROM Products p
JOIN Suppliers b
ON p.SupplierID = b.SupplierID
CROSS APPLY dbo.f_GetCustomerPrice(p.UnitsInStock, p.ReorderLevel, p.UnitPrice) f
Are you certain that your definition of #brandInfo has the priceAdjustmentValue and priceAdjustmentAmount columns defined on it? More importantly, if you are putting this in a stored procedure as you mentioned, does there exist a #brandInfo table already without those columns defined? I know #brandInfo is a temporary table, but if it exists at the time you attempt to create the stored procedure and it lacks the columns, the parsing engine may be getting tripped up. Oddly, if the table doesn't exist at all, the parsing engine simply glides past the missing table and creates the SP for you.

Select most reviewed courses starting from courses having at least 2 reviews

I'm using Flask-SQLAlchemy with PostgreSQL. I have the following two models:
class Course(db.Model):
id = db.Column(db.Integer, primary_key = True )
course_name =db.Column(db.String(120))
course_description = db.Column(db.Text)
course_reviews = db.relationship('Review', backref ='course', lazy ='dynamic')
class Review(db.Model):
__table_args__ = ( db.UniqueConstraint('course_id', 'user_id'), { } )
id = db.Column(db.Integer, primary_key = True )
review_date = db.Column(db.DateTime)#default=db.func.now()
review_comment = db.Column(db.Text)
rating = db.Column(db.SmallInteger)
course_id = db.Column(db.Integer, db.ForeignKey('course.id') )
user_id = db.Column(db.Integer, db.ForeignKey('user.id') )
I want to select the most reviewed courses, considering only courses with at least two reviews. The following SQLAlchemy query worked fine with SQLite:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).group_by(models.Review.course_id).\
having(func.count(models.Review.course_id) >1).\
order_by(func.count(models.Review.course_id).desc()).all()
But when I switched to PostgreSQL in production it gives me the following error:
ProgrammingError: (ProgrammingError) column "review.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT review.id AS review_id, review.review_date AS review_...
^
'SELECT review.id AS review_id, review.review_date AS review_review_date, review.review_comment AS review_review_comment, review.rating AS review_rating, review.course_id AS review_course_id, review.user_id AS review_user_id, count(review.course_id) AS count_1 \nFROM review GROUP BY review.course_id \nHAVING count(review.course_id) > %(count_2)s ORDER BY count(review.course_id) DESC' {'count_2': 1}
I tried to fix the query by adding models.Review in the GROUP BY clause but it did not work:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).group_by(models.Review.course_id).\
having(func.count(models.Review.course_id) >1).\
order_by(func.count(models.Review.course_id).desc()).all()
Can anyone please help me with this issue. Thanks a lot
SQLite and MySQL both allow a query that has aggregates (like count()) without applying GROUP BY to all the other columns. In terms of standard SQL this is invalid, because if more than one row is present in an aggregated group, the database has to pick the first one it sees to return, which is essentially random.
So your query for Review basically returns to you the first "Review" row for each distinct course id - like for course id 3, if you had seven "Review" rows, it's just choosing an essentially random "Review" row within the group of "course_id=3". I gather the answer you really want, "Course", is available here because you can take that semi-randomly selected Review object and just call ".course" on it, giving you the correct Course, but this is a backwards way to go.
But once you get on a proper database like PostgreSQL you need to use correct SQL. The data you need from the "review" table is just the course_id and the count, nothing else, so query just for that (assume for now that we don't need to display the counts; we'll get to that in a minute):
most_rated_course_ids = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
all()
But that's not your Course object: you want to take that list of ids and apply it to the course table. We first need to keep our list of course ids as a SQL construct instead of loading the data, that is, turn it into a derived table by converting the query into a subquery (change .all() to .subquery()):
most_rated_course_id_subquery = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
subquery()
One simple way to link that to Course is to use an IN:
courses = session.query(Course).filter(
Course.id.in_(most_rated_course_id_subquery)).all()
But that's essentially going to throw away the "ORDER BY" you're looking for, and it also doesn't give us any nice way of reporting those counts along with the course results. We need to have that count along with our Course so that we can report it and also order by it. For this we use a JOIN from the "course" table to our derived table. SQLAlchemy is smart enough to know to join on the "course_id" foreign key if we just call join():
courses = session.query(Course).join(most_rated_course_id_subquery).all()
Then, to get at the count, we need to add it to the columns returned by our subquery, along with a label so we can refer to it:
most_rated_course_id_subquery = session.query(
Review.course_id,
func.count(Review.course_id).label("count")
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
subquery()
courses = session.query(
Course, most_rated_course_id_subquery.c.count
).join(
most_rated_course_id_subquery
).order_by(
most_rated_course_id_subquery.c.count.desc()
).all()
A great article I like to point people to about GROUP BY and this kind of query is "SQL GROUP BY techniques", which points out the common need for the "select from A join to (subquery of B with aggregate/GROUP BY)" pattern.
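For reference, here is what that "select from A join to (subquery of B with aggregate/GROUP BY)" pattern looks like as plain SQL, run through Python's built-in sqlite3 module so it can be tried without a PostgreSQL server. The sample rows, the trimmed table definitions, and the alias names r and review_count are assumptions for illustration; the SQL that SQLAlchemy actually emits will differ in naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Trimmed-down versions of the two models' tables.
    CREATE TABLE course (id INTEGER PRIMARY KEY, course_name TEXT);
    CREATE TABLE review (id INTEGER PRIMARY KEY, course_id INTEGER, user_id INTEGER);
    INSERT INTO course VALUES (1, 'Algebra'), (2, 'Biology'), (3, 'Chemistry');
    INSERT INTO review (course_id, user_id) VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1),
        (3, 1), (3, 2);
""")

# Aggregate in a derived table first, then join it to course.
# Every selected column is either grouped or aggregated, so this is
# valid standard SQL and not specific to SQLite.
most_reviewed = conn.execute("""
    SELECT course.id, course.course_name, r.review_count
    FROM course
    JOIN (
        SELECT course_id, COUNT(course_id) AS review_count
        FROM review
        GROUP BY course_id
        HAVING COUNT(course_id) > 1
    ) AS r ON r.course_id = course.id
    ORDER BY r.review_count DESC
""").fetchall()
print(most_reviewed)
```

Course 2 has a single review and is filtered out by the HAVING clause, while the remaining courses come back ordered by their review count.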