I have five years of minute bars for around 10,000 symbols, as CSV files. It totals around 50GB of text. My RAM is 32GB.
I'm trying to load all this data into a KDB table, so it's easily queryable.
/ list the files in the directory and strip the ".csv" extension
symbols: `$(-4_')(string') key `:/home/chris/sync/us_equities
/ build the full path for a given symbol
f: {`$(":/home/chris/sync/us_equities/", x, ".csv")}
/ parse one file, cast the timestamp column, and tag each row with its symbol
load_symbol: {(0!(update "P"$-1_/:t, s: x from flip `t`o`h`l`c`v!("*FFFFI";",")0: f(string x)))}
/ append each symbol's rows to the on-disk table
({`:/home/chris/sync/price_data insert (load_symbol x)}) each symbols
Should I be using a straightforward table, or should I use partitions/splays?
I'm adding the ticker as an extra column of type symbol; is that right?
The last line, insert, is painfully slow. It looks like it would take around a day to process, perhaps longer. How might I optimize it? I tried peach but this was even slower. It looks like it starts out very fast, and gets slower with each step of the each.
Thank you!
Using a flat file is not recommended in this case, due to the data size and the frequency of updates. The file must be rewritten from scratch on every insert, so insert time grows linearly with the total row count.
q)t:([]a:til 10000000;b:til 10000000)
q)`:t set t
`:t
q)\t `:t insert t
305
q)\t `:t insert t
365
q)\t `:t insert t
574
q)\t `:t insert t
809
q)\t `:t insert t
1236
q)\t `:t insert t
2687
q)\t `:t insert t
3200
Compare this to a splayed table, where each column of the new data is appended to the corresponding column file, resulting in constant-time inserts.
q)t:([]a:til 10000000;b:til 10000000)
q)`:t/ set t
`:t/
q)\t `:t insert t
166
q)\t `:t insert t
101
q)\t `:t insert t
97
q)\t `:t insert t
100
q)\t `:t insert t
111
q)\t `:t insert t
113
If the symbol is not already in the file, then adding it as a column is the right approach. However, I would suggest naming the column sym instead of s; this is the convention within kdb+, and some built-in functions assume this name.
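For example, the columns could be renamed after loading; the following assumes the column order produced by the loader above, with data as a placeholder for the in-memory table:
/ rename the last column from s to sym (column order assumed from the loader above)
`t`o`h`l`c`v`sym xcol data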
By my estimates, this table will be far too large for a simple splayed table. I would partition it by either date or month, depending on the types of queries you are running.
Sorting by sym and applying a parted attribute is a must if your queries will often select a subset of the syms. Sorting by time bucket within each sym is required to use an asof join, as aj relies on binary searches.
The following code will do this, but as your files are already separated by sym, you should be able to skip the sym sort.
/ to sort a table in memory and apply parted attribute
update `p#sym from `sym`time xasc data
/ to sort a table on disk and apply parted attribute
`sym`time xasc `:path/to/partition
@[`:path/to/partition;`sym;`p#]
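Once the data is sorted this way, an asof join can be used directly; a usage sketch, where trades and quotes are illustrative table names:
/ latest quote as of each trade, matched per sym (trades/quotes are illustrative)
aj[`sym`time; trades; quotes]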
If your queries are more geared toward selecting a specific time window across all symbols, you may be better off sorting only by time bucket and applying a sorted attribute to that column.
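A sketch of that variant on disk, using the same illustrative path:
/ sort on disk by time only, then apply the sorted attribute
`time xasc `:path/to/partition
@[`:path/to/partition; `time; `s#]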
Additionally, you may want to consider streaming the CSV files with .Q.fs or .Q.fsn to cap the memory usage of any single load. This would also let you use multi-threading or additional processes to load the data within the same memory budget.
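A minimal sketch of one file streamed in chunks, assuming headerless CSVs whose timestamps cast directly with "P"; the HDB root `:/db and the partition path are illustrative:
/ assumes headerless CSVs; `:/db and the partition path below are illustrative
loadchunk: {[dst; s; lines]
  data: flip `t`o`h`l`c`v!("PFFFFI";",") 0: lines;  / parse one chunk of complete lines
  dst upsert .Q.en[`:/db] update sym: s from data } / enumerate syms, append to the splay
.Q.fsn[loadchunk[`:/db/2015.01/price_data/; `AAPL];
  `:/home/chris/sync/us_equities/AAPL.csv; 32000000] / stream the file in ~32MB chunks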
Related
I have 2 billion records in a table in SQL Developer and want to export the records to a CSV file, but while exporting I want to sort one column in ascending order. Is there any efficient or quick way to do this?
For example:
Suppose the table name is TEMP and I want to sort the A_KEY column in ascending order and then export it:
TEMP
P_ID  ADDRESS      A_KEY
1     242 Street   4
2     242 Street   5
3     242 Street   3
4     242 Long St  1
Expected result in the CSV file:
P_ID, ADDRESS, A_KEY
4, 242 Long St, 1
3, 242 Street, 3
1, 242 Street, 4
2, 242 Street, 5
I have tried using the below query:
insert into temp2 select * from TEMP order by A_KEY ASC;
and then exporting the table from SQL Developer, but is there an efficient or quick way to export the records directly, without the intermediate table?
Creating a new table (TEMP2) is wasted effort: an ORDER BY clause during an INSERT means nothing, because rows in a table have no guaranteed order. It is the ORDER BY in the SELECT statement you export from that matters.
Therefore, run
select * from temp order by a_key;
and export result returned by that query.
2 billion rows? It'll take time. What will you do with such a CSV file? That's difficult to work with.
If you're trying to move data into another Oracle database, then consider using Data Pump export and import utilities which are designed for such a purpose.
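A minimal sketch of such an export; the credentials, connect string, directory object, and file names are illustrative:
# credentials, connect string, directory object, and file names below are illustrative
expdp scott/tiger@orcl tables=TEMP directory=DATA_PUMP_DIR dumpfile=temp.dmp logfile=temp_exp.log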
I have an existing table in my database named price (264 rows), and I converted it into a hypertable price_hypertable by running:
CREATE TABLE price_hypertable (LIKE price INCLUDING DEFAULTS INCLUDING CONSTRAINTS EXCLUDING INDEXES);
SELECT create_hypertable('price_hypertable', 'start');
and the output it gave me is as follows:
create_hypertable
-------------------------------
(4,public,price_hypertable,t)
(1 row)
The next thing I did was to populate the price_hypertable as follows:
insert into price_hypertable select * from price;
And I got the following output:
INSERT 0 264
Now, I wanted to check the chunks created, for which I did:
select public.show_chunks('price_hypertable');
and the output I got:
show_chunks
----------------------------------------
_timescaledb_internal._hyper_4_3_chunk
_timescaledb_internal._hyper_4_4_chunk
(2 rows)
When I do:
select * from _timescaledb_internal._hyper_4_3_chunk;
select * from _timescaledb_internal._hyper_4_4_chunk ;
I see that the 264 entries are split as follows:
_timescaledb_internal._hyper_4_3_chunk has 98 rows
_timescaledb_internal._hyper_4_4_chunk has 166 rows
I have a few questions about these steps and their outputs:
Can someone please explain what the values 4 and t represent in the output of
SELECT create_hypertable('price_hypertable', 'start');?
After populating the price_hypertable, the data was automatically split into chunks, but of different size. Why does this happen? Why wasn't the data just split in half (132 rows in each chunk instead of 98 and 166)?
Any help is appreciated. Thanks
For the first question, it is easier to see what they represent by executing create_hypertable as
SELECT * FROM create_hypertable('price_hypertable', 'start');
This gives something like:
hypertable_id | schema_name | table_name | created
---------------+-------------+--------------------+---------
4 | public | price_hypertable | t
For the second question, TmTron has already answered it: rows are sorted into buckets based on time, and the rows are not necessarily evenly distributed over time. There is no automation that picks the correct interval for each bucket.
You can find information about the return values in the API documentation on create_hypertable, which also discusses the chunk_time_interval parameter that can be used to set the chunk size.
Related to your 2nd question:
When you don't specify chunk_time_interval explicitly, the default is 7 days: see create-hypertable, Best Practices.
So the number of rows in each chunk depends on the distribution of your data (according to your start date-time column).
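For example, a different chunk size could be requested at creation time; the interval chosen here is purely illustrative:
-- chunk_time_interval below is an illustrative choice, not a recommendation
SELECT create_hypertable('price_hypertable', 'start',
                         chunk_time_interval => INTERVAL '1 month');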
I want to find whether a row is part of an indexed column in PostgreSQL.
For example:
When I open the table object I see my indexes and the cardinality of each index.
The total number of rows in the table is 145,454, but the cardinality of all the indexes is 145,300. Some 154 rows are not indexed in any of the indexes that I have created.
I ran the below query to find the cardinality,
SELECT relname,
relkind,
reltuples AS cardinality,
relpages
FROM pg_class
WHERE relname LIKE '%table_name%';
Could someone please explain why some rows are left out of the indexes, and how to find the 154 rows that are not indexed in my original table.
the total number of rows in the table is 145,454 but the cardinality of all the indexes is 145,300
That means you have about 154 duplicate index entries: some entries (154 or fewer) each point to more than one row.
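One way to see which indexed values are shared by several rows is to group on the indexed column; the table and column names here are illustrative:
-- my_table/a_key are placeholders for your table and indexed column
SELECT a_key, count(*)
FROM my_table
GROUP BY a_key
HAVING count(*) > 1;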
From the postgres documentation on planner statistics:
For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the entire table (which is commonly the case) will incrementally update the reltuples count on the basis of the part of the table it did scan, resulting in an approximate value. In any case, the planner will scale the values it finds in pg_class to match the current physical table size, thus obtaining a closer approximation.
In other words, as long as that number is approximately correct, there's nothing wrong and nothing to worry about. If it were wildly off (say "203" instead of its current value), then it would be time to issue a VACUUM or ANALYZE job on the table.
Also worth checking is the value of default_statistics_target. If that's set too low, your statistics will be less accurate.
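A quick way to refresh and inspect these numbers; the table and column names are illustrative:
-- my_table/a_key are placeholders
VACUUM ANALYZE my_table;                                     -- refreshes reltuples/relpages
SHOW default_statistics_target;                              -- 100 by default
ALTER TABLE my_table ALTER COLUMN a_key SET STATISTICS 500;  -- raise the per-column target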
We run PostgreSQL 9.5.2 on an RDS instance. One thing we noticed is that a certain table sometimes grows very rapidly in size.
The table in question has only 33k rows and ~600 columns, all numeric (decimal(25, 6)). After a VACUUM FULL, the "total_bytes" reported by the following query
select c.relname, pg_total_relation_size(c.oid) AS total_bytes
from pg_class c;
is about 150MB. However, we observed this grow to 71GB at one point. In a recent episode, total_bytes grew by 10GB in a 30-minute period.
During the episode mentioned above, we had a batch update query running ~4 times per minute that updates every record in the table. However, at other times the table size remained constant despite similar update activity.
I understand this is probably caused by "dead rows" left over from the updates. Indeed, when this table grows too big, simply running VACUUM FULL shrinks it back to its normal size (150MB). My questions are:
have other people experienced similar rapid growth in table size in postgresql and is this normal?
if our batch update queries are causing the rapid growth in table size, why doesn't it happen every time? In fact, I tried to reproduce it manually by running something like
update my_table set x = x * 2
but couldn't -- table size remained the same before and after the query.
The problem is having 600 columns in a single table, which is never a good idea. This is going to cause a lot of problems, table size is just one of them.
From the PostgreSQL docs...
The actual storage requirement [for numeric values] is two bytes for each group of four decimal digits, plus three to eight bytes overhead.
So decimal(25, 6) works out to something like 8 + (25 / 4 × 2), roughly 21-24 bytes per column. At 600 columns per row that's up to about 14,400 bytes, or 14 KB per row. At 33,000 rows that's about 450 MB.
If you're updating every row 4 times per minute, that's going to leave about 1.8 gigs per minute of dead rows.
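One way to watch the bloat accumulate is the statistics view; the table name is illustrative:
-- my_table is a placeholder
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';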
You should fix your schema design.
You shouldn't need to touch every row of a table 4 times a minute.
You should ask a question about redesigning that table and process.
I have a large table (~1 million rows, 7 columns including the primary key). The table contains two columns (symbol_01 and symbol_02) that are indexed and used for querying. It contains rows such as:
id  symbol_01  symbol_02  value_01  value_02
1   aaa        bbb        12        15
2   bbb        aaa        12        15
3   ccc        ddd        20        50
4   ddd        ccc        20        50
As the example shows, rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped, while value_01 and value_02 are the same; likewise for rows 3 and 4. This is the case for the entire table: there are essentially two rows for each combination of symbol_01+symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id  symbol      value_01  value_02
1   ,aaa,bbb,   12        15
2   ,ccc,ddd,   20        50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work, but my question is about performance. This is still going to be a big table (and it will get bigger soon). So my question is: is this the best solution for this scenario, given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results?
Is there a better way to do this? I'm not sure how well LIKE performs, but I don't see an alternative.
There's no high performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column and your query will be simple and fast.
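A minimal sketch of that design, with all table, column, and index names illustrative:
-- all names below are illustrative
CREATE TABLE pair (
    id       serial PRIMARY KEY,
    value_01 integer,
    value_02 integer
);
CREATE TABLE pair_symbol (
    pair_id integer REFERENCES pair (id),
    symbol  text
);
CREATE INDEX pair_symbol_idx ON pair_symbol (symbol);

-- find the pair containing both symbols
SELECT p.value_01, p.value_02
FROM pair p
JOIN pair_symbol s1 ON s1.pair_id = p.id AND s1.symbol = 'aaa'
JOIN pair_symbol s2 ON s2.pair_id = p.id AND s2.symbol = 'bbb';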
With this index:
create index symbol_index on t (
least(symbol_01, symbol_02),
greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where
least(symbol_01, symbol_02) = least('aaa', 'bbb')
and
greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates, keeping one row per combination:
delete from t
where id not in (
    select distinct on (
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
    ) id
    from t
    order by
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
);
Depending on the columns' semantics, it might be better to normalize the table as suggested by @Bohemian.