executing query having over billion rows - kdb

I have a table say 'T' in kdb which has rows over 6 billion. When I tried to execute query like this
select from T where i < 10
it throws wsfull expection. Is there any way I can execute queries like this in table having large amount of data.

10#T
The expression as you wrote it first makes a bitmap containing all of the elements where i (rownumber) < 10, which is as tall as one of your columns. It then does where (which just contains til 10) and then gets them from each row. You can save the last step with:
T[til 10]
but 10#T is shorter.

Assuming you have a partitioned table here, it is normally beneficial to have the partitioning column (date, int etc.) as the first item in the where clause of your query - otherwise as mentioned previously you are reading a six billion item list into memory, which will result in a 'wsfull signal for any machine with less than the requisite amount of RAM.
Bear in mind that row index starts at 0 for each partition, and is not reflective of position in the overall table. The query that you gave as an example in your question would return the first ten rows of each partition of table T in your database.
In order to do this without reaching your memory limit, you can try running the following (if your database is date-partitioned):
raze{10#select from T where date=x}each date

Related

What kind of index should I use in postgresql for a column with 3 values

I have a table with 100Mil+ records and 237 fields.
One of the fields is a varchar 1 field with three possible values (Y,N,I)
I need to find all of the records with N.
Right now I have a b-tree index built and the query below takes about 20 min to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say at least 15% of each value) and roughly equally distributed throughout the table (some physically at the beginning, some in the middle, some at the end) then no.
If you think about it you'll see why. You'll have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the values as they match. The planner knows this and would probably not use the index at all.
However - if you only have 17 rows with "N" or they are all very recently added to the table and so physically happen to be close to each other then yes, and index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If however you mostly insert to this table you might find a BRIN index helpful. That can let the planner see that e.g. the first 80% of your table doesn't have any "N" blocks and so it just needs to look at the last bit.

IBM Datastage : Creating column that is calculation

I have a table which columns are location and credit, the location contains string rows which mainly is location_name and npl_of_location_name. the credit contains integer rows which mainly is credit_of_location_name and credit_npl_of_location_name. I need to make a column which calculates the ((odd rows of the credit - the even rows of the credit)*0.1). How do i do this?
When you specify "odd rows" and "even rows" are you referring to row numbers? Because, unless your query sorts the data, you have not control over row order; the database server returns rows however they are physically stored.
Once you are sure that your rows are properly sorted, then you can use a technique such as Mod(#INROWNUM,2) = 1 to determine "odd" and zero is even. This works best if the Transformer is executing in sequential mode; if it is executed in parallel mode then you need to use a partitioning algorithm that ensures that the odd and even rows for a particular location are in the same node.

Simple update query taking too long - Postgres

I have a table with 28 million rows that I want to update. It has around 60 columns and a ID column (primary key) with an index created on it. I created four new columns and I want to populate them with the data from four columns from other table which also has an ID column with an index created on it. Both tables have the same amount of rows and just the primary key and the index on the IDENTI column. The query has been running for 15 hours and since it is high priority work, we are starting to get nervous about it and we don't have so much time to experiment with queries. We have never update a table so big (7 GB), so we are not sure if this amount of time is normal.
This is the query:
UPDATE consolidated
SET IDEDUP2=uni.IDEDUP2
USE21=uni.USE21
USE22=uni.USE22
PESOXX2=uni.PESOXX2
FROM uni_group uni, consolidated con
WHERE con.IDENTI=uni.IDENTI
How can I make it faster? Is it possible? If not, is there a way to check how much longer it is going to take (without killing the process)?
Just as additional information, we have ran before much more complex queries for 3 million row tables (postgis) and It has taken it about 15 hours as well.
You should not repeat the target table in the FROM clause. Your statement creates a cartesian join of the consolidated table with itself, which is not what you want.
You should use the following:
UPDATE consolidated con
SET IDEDUP2=uni.IDEDUP2
USE21=uni.USE21
USE22=uni.USE22
PESOXX2=uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI

long running queries and new data

I'm looking at a postgres system with tables containing 10 or 100's of millions of rows, and being fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp, that's what I'll work with for ranges). It may be a "closed range", with a start and an end I know are contained in the table, and I know no new data will fall into the range, or an open range : ie one of the range boundary might not be "in the table yet" and rows being fed in the table might thus fall in that range.
Since the response will itself contains millions of rows, and the processing per row can take some time (10s of ms) I'm fully aware I'll use a cursor and fetch, say, a few 1000 rows at a time. My question is:
If I run an "open range" query: will I only get the result as it was when I started the query, or will new rows being inserted in the table that fall in the range while I run my fetch show up ?
(I tend to think that no I won't see new rows, but I'd like a confirmation...)
updated
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
but Postgres insures it only in Serializable isolation
Well, I think when you make a query, that means you create a new transaction and it will not receive/update data from any other transaction until it commit.
So, basically "you only get the result as it was when you started the query"

Postgresql Hstore and Toast Bloat

I was using hstore, Postgresql 9.3.4, to store a count for each time an event happened in a given day, with an update like the following.
days_count = days_count || hstore('x', (coalesce((days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the year. After running a simulation of expected behavior for production I ended up with a table that was 150MB + 2GB Toast + 25-30MB for the index, after Analyze and Vacuum.
I am now instead breaking up the above column into one for each month like the following
y_month_days_count = y_month_days_count || hstore('x', (coalesce((y_month_days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the month, and y is the month of the year.
I am still running the simulation right now, but so far at third of the way done I am at 60MB + A pretty steady 20-30MB of Toast + 25-30MB for the index. Which means in the end I should end up with about 180MB + 30-40MB for Toast + 25MB-30MB for the index after Analyze and Vacuum.
So first is there any known issues with Hstore and Toast bloat that would explain my issue with my first set up?
Second will my current solution of breaking up the columns cause any type of issues with hstore and performance in the future because of the number of hstore columns on one table? It seems to be steady now with row numbers in the hundred of thousands, and while I know more columns can make things slower, I am unsure if this is worse with hstore columns.
Finally I did find something out. I have one hstore column that ends up representing each hour a day, so it has 24 different keys. When I run the simulation for just this column I end up with almost no toast, in the KB, but when I run the whole simulation, with the days broken up into months columns, my largest hstore has 52 keys.
So for a simple store of either a counter or a word or two, the max number of keys before I see any amount of toast for hstore is between 24 and 52 keys.
So first is there any known issues with Hstore and Toast bloat that would explain my issue with my first set up?
Yes.
When you update any part of an out-of-line stored TOASTed field like text, hstore or json the whole field must be re-written as a new row version. This is a consequence of MVCC - it's necessary to retain a copy of every version of the row that might still be visible to another transaction.
The old one can be vacuumed away when it's no longer required by any running transaction, so in practice this has minimal impact so long as autovacuum is running aggressively enough.
So if you're updating lots of rows with big text, hstore or json fields, or updating them frequently, tune autovacuum up so it runs more often and does work faster. Make sure you don't have long running <IDLE> in transaction connections.
You say the table sizes you quoted were "after analyze and vacuum" but I'm guessing you only ran a regular vacuum, so the table bloat would've been freed for re-use by PostgreSQL but not released back to the OS. See if VACUUM FULL compacts it.
Will my current solution of breaking up the columns cause any type of issues with hstore and performance in the future because of the number of hstore columns on one table?
Depends on your query patterns and workload, but probably not.