Insert into TimescaleDB filling up RAM - wrong chunk_time_interval? - postgresql

I have a postgres table which I would like to migrate to a timescaledb hypertable. I am using the faster method from this tutorial https://docs.timescale.com/timescaledb/latest/how-to-guides/migrate-data/same-db/#convert-the-new-table-to-a-hypertable to do so.
The command I am using is: INSERT INTO new_table SELECT * FROM old_table; where new_table is a hypertable
Is the problem that I have set chunk_time_interval incorrectly? I used 1h, which really should be fine. The total dataset is about 650GB in the original Postgres table and spans about 5 months, so the average chunk is about 200MB, well below the recommended 25% of the 32GB of RAM. I actually purposely chose an interval I thought was much too small because of additional data I will load into other hypertables in the future.
If this is not the problem then what is?
Is there a way to limit postgres or timescaledb to not go over a set amount of ram to protect other processes?
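For reference, the full setup boils down to something like this (the time column name "time" and the one-week batching window are just placeholders):
-- create the hypertable with 1-hour chunks
SELECT create_hypertable('new_table', 'time', chunk_time_interval => INTERVAL '1 hour');
-- migrate everything in one statement, as in the tutorial
INSERT INTO new_table SELECT * FROM old_table;
-- a possible workaround if that exhausts RAM: migrate in smaller time windows,
-- so each transaction only touches a limited set of chunks at a time
INSERT INTO new_table SELECT * FROM old_table
WHERE time >= '2021-01-01' AND time < '2021-01-08';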

I have experienced this problem before when using a space partition together with the time partition. Check that the number of chunks you have is not too high, and make sure to always include a time range in your queries. What's the output of
select * from timescaledb_information.hypertables;
Does it show a high number of chunks in your hypertable?
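On TimescaleDB 2.x you can also count the chunks directly, something along these lines (view names are from the timescaledb_information schema):
SELECT hypertable_name, count(*) AS num_chunks
FROM timescaledb_information.chunks
GROUP BY hypertable_name;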

Related

Why doesn't the restore end?

At first I thought the restore was just too big, so instead of a single 2GB (compressed) database backup I split it into several backups, one per schema. The schema I'm restoring now is about 600 MB; the next step would be to split it by table.
It contains some spatial data (a map of my country); I'm not sure if that is relevant.
The restore has been running for almost 2 hours. The disks aren't really in use anymore: when the restore started, disk usage hit 100% several times, but for the last hour it has been flat at 0%.
I can also access the data in all the restored tables, so it looks like the restore is already done.
Is this normal?
Is there anything I can check to see what the restore is doing?
Hardware setup:
Core i7 @ 3.4 GHz, 24 GB RAM
DB on a 250 GB SSD; backup files on a SATA disk
EDIT
SELECT application_name, query, *
FROM pg_stat_activity
ORDER BY application_name, query;
Yes, that seems perfectly normal.
Most likely you are observing index or constraint creation. Look at the output of
SELECT * FROM pg_stat_activity;
to confirm that (it should contain CREATE INDEX or ALTER TABLE).
It is too late now, but increasing maintenance_work_mem will speed up index creation.
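To narrow pg_stat_activity down to the restore, and to prepare for the next run, something like this should do (the 2GB value is only an example, size it to your RAM):
-- show only backends building indexes or adding constraints
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE query ILIKE '%CREATE INDEX%' OR query ILIKE '%ALTER TABLE%';
-- for future restores (PostgreSQL 9.4+): raise the default maintenance_work_mem
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();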

Slow PostgreSQL sequential scans on RDS?

I have an RDS PostgreSQL instance that's running simple queries much more slowly than I would expect - particularly sequential scans, like copying a table or counting its rows.
E.g. create table copied_table as (select * from original_table) or select count(*) from some_table
Running count(*) on a 30GB table takes ~15 minutes (with indexes, immediately following a vacuum).
It's an RDS db.r3.large: 15 GB memory, 400GB SSD. Watching the metrics logs, I've never seen Read IOPS exceed 1,400 and it's usually around 500, well below the baseline I'd expect.
Configuration:
work_mem: 2GB
shared_buffers: 3GB
effective_cache_size: 8GB
wal_buffers: 16MB
checkpoint_segments: 16
Is this the expected timing? Should I be seeing higher IOPS?
There is not much you can do about plain count queries like that in Postgres, except on 9.6, which implements parallel sequential scans but is not yet available on RDS.
Even so, there are some tips you can find here. Generally, it's recommended to try to get Postgres to use an Index Only Scan by creating an index on the column and selecting only that column in the projection:
SELECT id FROM table WHERE id > 6 and id <100;
-- or
SELECT count(id) FROM table ...
The table should have an index on that column.
The queries you gave as examples won't avoid the sequential scan. For the CREATE TABLE, if you don't care about the row order in the new table, you can open a few backends and import in parallel by filtering on a key range, as sketched below. Also, the only way to speed this up on RDS is to provision more IOPS.
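A sketch of the parallel-import idea, assuming original_table has an integer primary key named id (run each INSERT from a separate connection):
-- create the empty target once
CREATE TABLE copied_table (LIKE original_table);
-- session 1
INSERT INTO copied_table SELECT * FROM original_table WHERE id < 1000000;
-- session 2
INSERT INTO copied_table SELECT * FROM original_table WHERE id >= 1000000 AND id < 2000000;
-- ...and so on for the remaining key ranges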

Heroku Postgres database size doesn't go down after deleting rows

I'm using a Dev level database on Heroku that was about 63GB and approaching 9.9 million rows (close to the 10 million row limit for this tier). I ran a script that deleted about 5 million rows I didn't need, and now (a few days later) the Postgres control panel / pginfo:table-size shows roughly 4.7 million rows, but the database is still at 63GB. 64GB is the limit for the next tier, so I need to reduce the size.
I've tried vacuuming, but pginfo:bloat said the bloat was only about 3GB. Any idea what's happening here?
If you have VACUUMed the table, don't worry about the size on disk remaining unchanged. The space has been marked as reusable for new rows, so you can easily add another 4.7 million rows and the size on disk won't grow.
The standard form of VACUUM removes dead row versions in tables and
indexes and marks the space available for future reuse. However, it
will not return the space to the operating system, except in the
special case where one or more pages at the end of a table become
entirely free and an exclusive table lock can be easily obtained. In
contrast, VACUUM FULL actively compacts tables by writing a complete
new version of the table file with no dead space. This minimizes the
size of the table, but can take a long time. It also requires extra
disk space for the new copy of the table, until the operation
completes.
If you want to shrink it on disk, you will need to VACUUM FULL, which locks the tables and needs as much extra disk space as the size of the tables while the operation is in progress. So check your quota before you try this, and expect your site to be unresponsive while it runs.
Update:
You can get a rough idea about the size of your data on disk by querying the pg_class table like this:
SELECT SUM(relpages*8192) from pg_class
Another method is a query of this nature:
SELECT pg_database_size('yourdbname');
This link: https://www.postgresql.org/docs/9.5/static/disk-usage.html provides additional information on disk usage.
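If you want to see which tables are actually worth a VACUUM FULL, a per-table breakdown is more useful than the database total, for example:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;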

Cassandra: Load large data fast

We're currently working with Cassandra on a single-node cluster to test application development on it. Right now, we have a really huge data set consisting of approximately 70M lines of text that we would like to dump into Cassandra.
We have tried all of the following:
Line by line insertion using python Cassandra driver
Copy command of Cassandra
Set compression of sstable to none
We have explored the option of the sstable bulk loader, but we don't have an appropriate .db format for this. Our text file to be loaded has 70M lines that look like:
2f8e4787-eb9c-49e0-9a2d-23fa40c177a4 the magnet programs succeeded in attracting applicants and by the mid-1990s only #about a #third of students who #applied were accepted.
The column family that we're intending to insert into has this creation syntax:
CREATE TABLE post (
  postid uuid,
  posttext text,
  PRIMARY KEY (postid)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={};
Problem:
Loading the data into even a simple column family is taking forever - 5 hours for the 30M lines inserted so far. We were wondering if there is any way to expedite this, as loading 70M lines of the same data into MySQL takes approximately 6 minutes on our server.
We were wondering if we have missed something? Or if someone could point us in the right direction?
Many thanks in advance!
The sstableloader is the fastest way to import data into Cassandra. You have to write the code to generate the sstables, but if you really care about speed this will give you the most bang for your buck.
This article is a bit old, but the basics still apply to how you generate the SSTables.
If you really don't want to use the sstableloader, you should be able to go faster by doing the inserts in parallel. A single node can handle multiple connections at once, and you can scale out your Cassandra cluster for increased throughput.
I have a two-node Cassandra 2.? cluster (each node is an i7-4200MQ laptop with a 1 TB HDD and 16 GB of RAM). I have imported almost 5 billion rows using the COPY command. Each CSV file is about 63 GB with approximately 275 million rows and takes about 8-10 hours to import, i.e. roughly 6,500 rows per second.
The YAML file is set to use 10 GB of RAM, in case that helps.
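If you stay with COPY, newer cqlsh releases expose a few tuning options that help a lot; a sketch, assuming the data has first been converted to a proper two-column CSV (option names and availability depend on your cqlsh version):
COPY post (postid, posttext) FROM 'posts.csv'
WITH NUMPROCESSES = 8 AND CHUNKSIZE = 5000 AND MAXBATCHSIZE = 50;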

Database table size did not decrease proportionately

I am working with a PostgreSQL 8.4.13 database.
Recently I had around 86.5 million records in a table. I deleted almost all of them - only 5000 records are left. I ran reindex and vacuum analyze after deleting the rows, but I still see that the table is occupying a large amount of disk space:
jbossql=> SELECT pg_size_pretty(pg_total_relation_size('my_table'));
pg_size_pretty
----------------
7673 MB
Also, the index values of the remaining rows are still pretty high - in the million range. I thought that after vacuuming and re-indexing, the index values of the remaining rows would start from 1.
I read the documentation and it's pretty clear that my understanding of re-indexing was skewed.
But nonetheless, my intention is to reduce the table size after the delete operation and bring down the index values so that read operations (SELECT) on the table don't take so long - currently it takes around 40 seconds to retrieve just one record from the table.
Update
Thanks Erwin. I have corrected the pg version number.
vacuum full worked for me. I have one follow-up question here:
Restart primary key numbers of existing rows after deleting most of a big table
To actually return disk space to the OS, run VACUUM FULL.
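For the table in the question that would be roughly the following (expect an exclusive lock and extra temporary disk space while it runs):
VACUUM FULL my_table;
ANALYZE my_table;
SELECT pg_size_pretty(pg_total_relation_size('my_table'));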
Further reading:
VACUUM returning disk space to operating system