what is the use of Store on disk in tMap component? - talend

What is the use of "Store on disk" in the tMap component?
https://i.stack.imgur.com/E6dbg.jpg

This option lets you store the data on disk while the job processes it. If you don't use the option, the data is loaded into RAM rather than onto disk. Sometimes you don't have enough RAM, so you check this option to make the job use the disk (which has more space) instead.
The disadvantage is that it is slower than the default "in RAM" processing.
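The trade-off can be sketched outside Talend. The example below is only an analogy, not how tMap is implemented: it loads a small lookup table into SQLite either in RAM (":memory:") or in a file on disk, then joins a main flow against it. The function names and sample rows are made up for illustration.

```python
import os
import sqlite3
import tempfile


def build_lookup(rows, store_on_disk=False):
    """Load (key, value) lookup rows into a SQLite table.

    store_on_disk=False keeps the table in RAM; store_on_disk=True puts it
    in a file, which is slower but not limited by available memory.
    """
    if store_on_disk:
        path = os.path.join(tempfile.mkdtemp(), "lookup.db")
    else:
        path = ":memory:"
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE lookup (k TEXT PRIMARY KEY, v TEXT)")
    conn.executemany("INSERT INTO lookup VALUES (?, ?)", rows)
    conn.commit()
    return conn


def join_main_flow(main_rows, conn):
    """Attach the lookup value to each main-flow row (None if no match)."""
    out = []
    for key, payload in main_rows:
        hit = conn.execute("SELECT v FROM lookup WHERE k = ?", (key,)).fetchone()
        out.append((key, payload, hit[0] if hit else None))
    return out


lookup_rows = [("FR", "France"), ("DE", "Germany")]
main_rows = [("FR", 10), ("US", 20)]

in_ram = join_main_flow(main_rows, build_lookup(lookup_rows))
on_disk = join_main_flow(main_rows, build_lookup(lookup_rows, store_on_disk=True))
```

Both calls return the same joined rows; only where the lookup data lives (and therefore the speed and memory footprint) differs.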

Related

What is the difference between the MongoDB WiredTiger cache and an in-memory DB?

As far as I know, MongoDB caches the working set in RAM.
So if I increase wiredTigerCacheSizeGB enough to hold all of the data on disk, will it be as fast as an in-memory DB?
If not, what is the difference?
See In-Memory Storage Engine and WiredTiger Storage Engine
(In-memory) By avoiding disk I/O, the in-memory storage engine allows for more predictable latency of database operations.
Keep in mind that you are limited to 10000 GB when setting wiredTigerCacheSizeGB. You should also disable journaling and set storage.syncPeriodSecs to 0 in order to increase the performance of WiredTiger. But WiredTiger still has to create WiredTiger.wt and WiredTiger.turtle at least...
PS. I think this link might answer your question
I cannot answer all your questions.
A cache reads data from disk and keeps it in the RAM. When you access such data again then you read it from RAM instead of reading it again from disk - which would be much slower.
So, a cache is useless if you have to read the data only once. Some applications anticipate the data you may read in future and put it into the cache in advance.
The MongoDB in-memory DB puts all data into RAM only, it does not read or write anything from disk, apart from some logging data. When you stop an in-memory MongoDB process then all data is lost.
The wiredTiger storage engine is a data format used by MongoDB to store data persistently on disk.
If you set wiredTigerCacheSizeGB high enough to hold all of your data, then all of your reads will be satisfied from the cache. Writes will update the cache and also be written to storage.
If you use the in-memory configuration then all of your reads will be satisfied from memory. Writes will only go to memory and will not be stored on disk.
So if your workload is mostly reads, then the large cache will behave similarly to an in-memory DB. If your workload has a lot of writes, then the large cache configuration may be slower because it needs to write to disk.
Also, the in-memory DB will not preserve your data in the event of a crash, since it only holds data in memory.
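The difference can be illustrated with a deliberately crude toy model. It is not how MongoDB works internally (the real WiredTiger engine persists data through a journal and checkpoints rather than writing on every update); the class names and the single-key workload are invented for the sketch.

```python
class CachedDiskStore:
    """Toy store with a big cache in front of disk: writes go to both."""

    def __init__(self):
        self.cache = {}
        self.disk = {}
        self.disk_writes = 0

    def write(self, key, value):
        self.cache[key] = value
        self.disk[key] = value      # persistence costs a disk write
        self.disk_writes += 1

    def read(self, key):
        if key in self.cache:       # cache hit: no disk access
            return self.cache[key]
        value = self.disk.get(key)  # cache miss: fall back to disk
        if value is not None:
            self.cache[key] = value
        return value

    def restart(self):
        self.cache = {}             # RAM contents are lost, disk survives


class InMemoryStore:
    """Toy in-memory store: nothing is ever written to disk."""

    def __init__(self):
        self.data = {}
        self.disk_writes = 0        # stays at zero

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

    def restart(self):
        self.data = {}              # everything is lost


cached, in_memory = CachedDiskStore(), InMemoryStore()
for store in (cached, in_memory):
    store.write("a", 1)
    store.restart()

print(cached.read("a"))     # 1: reloaded from disk after the restart
print(in_memory.read("a"))  # None: the data is gone
```

Reads behave the same in both models once the data is cached; the write path and the behaviour after a restart are where they diverge.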

PostgreSQL benchmarking over a RAMdisk?

I have been considering the idea of moving to a RAMdisk for a while. I know its risks, but just wanted to do a little benchmark. I just had two questions: (a) when reading the query plan, will it still differentiate between disk and buffers hits? If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
(b) a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take? Is it the same as usual e.g. COPY command?
I do not recommend using RAM disks in PostgreSQL for persistent storage. With careful tuning, you can get PostgreSQL not to use more disk I/O than what is required to make your data persistent.
I recommend doing this:
Have more RAM in your machine than the size of the database.
Define shared_buffers big enough to contain the database (on Linux, define memory hugepages to contain them).
Increase checkpoint_timeout and max_wal_size to get fewer checkpoints.
Set synchronous_commit = off to keep PostgreSQL from syncing WAL to disk on every commit.
If you are happy to lose all your data in the case of a crash, define your tables UNLOGGED. The data will survive a normal shutdown.
Anyway, to answer your questions:
(a) You should set seq_page_cost and random_page_cost way lower to tell PostgreSQL how fast your storage is.
(b) You could run backups with either pg_dump or pg_basebackup, they don't care what kind of storage you have got.
when reading the query plan, will it still differentiate between disk and buffers hits?
It never distinguished between them in the first place. It distinguishes between "hit" and "read", but the "read" can't tell which are truly from disk and which are from OS/FS cache.
PostgreSQL has no idea you are running on a RAM disk, so will continue to report those as it always has.
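To see the "hit" versus "read" split for a query, you can look at the Buffers line that EXPLAIN (ANALYZE, BUFFERS) prints. The sketch below only parses such a line from text; the plan shown is a made-up sample, and the helper name is hypothetical. Counts on the top plan node already include its child nodes, so only the first Buffers line is used.

```python
import re

# Hypothetical sample output, not taken from a real run.
SAMPLE_PLAN = """\
Seq Scan on t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..1.200 rows=10000 loops=1)
  Buffers: shared hit=300 read=58
"""


def shared_buffer_counts(plan_text):
    """Return (hit, read) from the first 'Buffers: shared ...' line.

    'hit' pages were found in shared_buffers; 'read' pages were requested
    from the OS, which may have served them from its own cache, from a
    RAM disk, or from a physical disk. PostgreSQL cannot tell which.
    """
    match = re.search(r"Buffers: shared(?: hit=(\d+))?(?: read=(\d+))?", plan_text)
    if match is None:
        return (0, 0)
    hit, read = match.groups()
    return (int(hit or 0), int(read or 0))


hit, read = shared_buffer_counts(SAMPLE_PLAN)
print(hit, read)  # 300 58
```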
If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
This is a question that should be answered through your benchmarking. On some systems, memory can be read-ahead from main memory into the faster caches, making sequential reads still faster than random reads. If you care, you will have to benchmark it on your own system.
Reading data from RAM into shared_buffers is still surprisingly expensive due to things like lock management. So as a rough starting point, maybe seq_page_cost=0.1 and random_page_cost=0.15.
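As a rough illustration of what lowering these settings does: the PostgreSQL documentation gives the estimated cost of a plain sequential scan as (pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost). The sketch below applies that formula to a hypothetical table of 358 pages and 10000 rows, using the default cpu_tuple_cost of 0.01. It ignores filters, index scans, and everything else the planner considers.

```python
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Estimated cost of an unfiltered sequential scan (simplified)."""
    return pages * seq_page_cost + rows * cpu_tuple_cost


# Hypothetical table size; the defaults are PostgreSQL's stock settings.
default_cost = seq_scan_cost(358, 10000)
ramdisk_cost = seq_scan_cost(358, 10000, seq_page_cost=0.1)

print(round(default_cost, 1))  # 458.0
print(round(ramdisk_cost, 1))  # 135.8
```

With cheaper page costs, the CPU terms dominate the estimate, which is the signal you want to give the planner when storage is effectively as fast as memory.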
a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take?
The risk would be that your system crashes before the export has finished. But what precaution can you take against that?

MongoDB and disk space

We have a MongoDB cluster with 4 shards.
Our primary shard disk space has 700GB, and according to db.stats() that shard is using ~530GB.
When checking df -h, the disk usage is at 99% (9.5 GB free); I'm guessing this means that all the rest is data files pre-allocated by Mongo.
I ran compact on a couple of collections, and the free disk space dropped to 3.5GB(?)
We're going to run a process that will generate ~140GB of extra data (35GB per shard).
Should we be concerned about running out of disk space?
Thanks in advance.
compact doesn't decrease disk usage at all; in fact, it can even lead to additional file preallocation. To reduce disk usage you could use the repairDatabase command or start mongod with the --repair option. However, that requires additional free space on disk.
The situation you describe can arise if you did a lot of document deletions or operations that forced documents to move. In that case your database would be highly fragmented. The compact command helps you reduce fragmentation, so you will have more space for new records, but again, it doesn't return any space to the OS.
The best option for you is to find out why you have such a level of fragmentation.
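For a worst-case feel of the numbers in the question, here is a small sketch. It assumes, pessimistically, that none of the new data fits into space already allocated inside the existing data files, so every new gigabyte needs free space on the volume. In practice, freed space inside the files can be reused, so the real shortfall may be smaller. The function name is made up.

```python
def shortfall_gb(free_gb, incoming_gb):
    """GB still missing if all incoming data needs fresh disk space."""
    return max(0.0, incoming_gb - free_gb)


incoming_per_shard = 140 / 4   # ~140 GB spread over 4 shards

print(shortfall_gb(9.5, incoming_per_shard))  # 25.5 (before compact)
print(shortfall_gb(3.5, incoming_per_shard))  # 31.5 (after compact)
```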

PostgreSQL Table in memory

I created a database containing a total of 3 tables for a specific purpose. The total size of all tables is about 850 MB - very lean... out of which one single table contains about 800 MB (including index) of data and 5 million records (daily addition of about 6000 records).
The system is PostgreSQL on a Windows 7 laptop with 8 GB RAM and an SSD.
I allocated 2048MB as shared_buffers, 256MB as temp_buffers and 128MB as work_mem.
I execute a single query multiple times against the single table - hoping that the table stays in RAM (hence the above parameters).
But although I see a spike in memory usage during execution (by about 200 MB), I do not see memory consumption remaining at 500 MB or more (as it would if the data stayed in memory). All running postgres.exe processes show 2-6 MB in Task Manager. Hence, I suspect the LRU does not keep the data in memory.
Average query execution time is about 2 seconds (a very simple single-table query), but I need to get it down to about 10-20 ms, or even less if possible, because the same query is going to be executed a great many times. I believe this can only be achieved by keeping the data in memory.
Any advice?
Regards,
Kapil
You should not expect postgres processes to show large memory use, even if the whole database is cached in RAM.
That is because PostgreSQL relies on buffered reads from the operating system buffer cache. In simplified terms, when PostgreSQL does a read(), the OS looks to see whether the requested blocks are cached in the "free" RAM that it uses for disk cache. If the block is in cache, the OS returns it almost instantly. If the block is not in cache the OS reads it from disk, adds it to the disk cache, and returns the block. Subsequent reads will fetch it from the cache unless it's displaced from the cache by other blocks.
That means that if you have enough free memory to fit the whole database in "free" operating system memory, you won't tend to hit the disk for reads.
Depending on the OS, behaviour for disk writes may differ. Linux will write-back cache "dirty" buffers, and will still return blocks from cache even if they've been written to. It'll write these back to the disk lazily unless forced to write them immediately by an fsync() as Pg uses at COMMIT time. When it does that it marks the cached blocks clean, but doesn't flush them. I don't know how Windows behaves here.
The point is that PostgreSQL can be running entirely out of RAM with a 1GB database, even though no PostgreSQL process seems to be using much RAM. Having shared_buffers too high just leads to double-caching and can reduce the amount of RAM available for the OS to cache blocks.
It isn't easy to see exactly what's cached in RAM because Pg relies on the OS cache. That's why I referred you to pgfincore.
If you're on Windows and this won't work, you really just have to rely on observing disk activity. Does performance monitor show lots of uncached disk reads? Does operating system memory monitoring show lots of memory used for disk cache in the OS?
Make sure that effective_cache_size correctly reflects the RAM used for disk cache. It will help PostgreSQL choose appropriate query plans.
You are making the assumption, without apparent evidence, that the query performance you are experiencing is explained by disk read delays, and that it can be improved by in-memory caching. This may not be the case at all. You need to look at explain analyze output and system performance metrics to see what's going on.

Cassandra storage vs memory sizing

I am considering developing an application with a Cassandra backend. I am hoping that I will be able to run each cassandra node on commodity hardware with the following specs:
Quad Core 2GHz i7 CPU
2x 750GB disk drives
16 GB installed RAM
Now, I have been reading online that the available disk-space for Cassandra should be double the amount that is stored on the disks, which would mean that each node (set up in a RAID-1 configuration) would be able to store 375 GB of data, which is acceptable.
My question is whether 16 GB of RAM is enough to efficiently serve 375 GB of data per node. The data in the application will also be fairly time-dependent, such that recent data will be the data most often read from the database. In fact, most of the data will be deleted after about 6 months.
Also, should I assign Cassandra a heap (-Xmx) close to 16 GB, or does Cassandra utilize off-heap memory?
You should not set the Cassandra heap to more than 8GB; bigger than that, and garbage collection will kill you with large pauses. Cassandra will use the buffer cache (like other applications) so the remaining memory isn't wasted.
16GB of RAM will be enough to serve the data if your hot set fits entirely in RAM, or if the required read rate can be served off disk. Disks can do about 100 random IO/s, so with your setup, if you need more than 200 reads / second you will need to make sure the data is in cache. Cassandra exports good cache statistics (cassandra-cli show keyspaces), so you should easily be able to tell how effective your cache is.
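The arithmetic behind that 200 reads / second figure can be sketched as follows. It assumes roughly 100 random IO/s per disk, as stated above, and that reads are spread evenly across both mirrored disks. The 1000 reads / second workload is a made-up example, and the function names are hypothetical.

```python
def uncached_read_capacity(disks, iops_per_disk=100):
    """Random reads per second the disks can serve without any cache."""
    return disks * iops_per_disk


def required_cache_hit_ratio(reads_per_sec, disks, iops_per_disk=100):
    """Fraction of reads that must come from cache to keep up."""
    capacity = uncached_read_capacity(disks, iops_per_disk)
    if reads_per_sec <= capacity:
        return 0.0
    return 1 - capacity / reads_per_sec


print(uncached_read_capacity(2))                    # 200
print(round(required_cache_hit_ratio(1000, 2), 2))  # 0.8
```

So at a hypothetical 1000 reads / second, about 80% of reads would have to be served from cache for two such disks to keep up.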
Do bear in mind, with only two disks in RAID-1, you will not have a dedicated commit log. This could hamper write performance quite badly. You may want to consider turning off the commit log if it does affect performance, and forgo durable writes.
Although it is probably wise not to use a really huge heap with Cassandra, at my company we have used 10GB to 12GB heaps without any issues so far. Our servers typically have at least 48 GB of memory (RAM is cheap -- so why not :-)) and so we may try expanding the heap a bit more and see what happens.