Postgres cursor calculation in Talend - postgresql

I am reading data from a Postgres table and writing it to a file using Talend. The table is 1.8 GB, with 1,050,000 records and around 125 columns.
I set the JVM options to -Xms256M -Xmx1024M. The job failed with an out-of-memory error. My understanding is that Postgres keeps the result set in physical memory until the query completes, so the entire JVM heap was filled, which caused the out-of-memory error. Please correct me if my understanding is wrong.
I then enabled the Cursor option with a value of 100,000 and kept the JVM options at -Xms256M -Xmx1024M. The job failed again, this time with java.lang.OutOfMemoryError: Java heap space
I don't understand the concept here. Cursor used here denotes the fetch size of rows. In my case, 100,000 was set. So 100,000 will be fetched and stored in physical memory and it will be pushed to file. Then, the occupied memory will be released and the next batch will be fetched. Please correct me if I'm wrong.
In my case, 1,050,000 records occupy 1.8 GB, so each record takes about 1.8 KB. 100,000 * 1.8 KB = 180,000 KB, so one batch should need only about 175 MB. Why does the job not run with a 1 GB JVM heap? Could someone please explain how this process works?
Also, some records were dropped after I set the cursor option in Talend, and I cannot trace the cause.
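The estimate in the question can be reproduced as a quick calculation. This is only a minimal sketch: the table size, row count, fetch size and heap limit come from the question, while HEAP_OVERHEAD_FACTOR is a hypothetical multiplier (Java objects generally take more heap than the same data takes on disk, but the real ratio depends on the column types and was not measured here). It also shows that if the fetch size is not honored and the whole result set is buffered, the raw data alone exceeds a 1 GB heap.

```python
# Back-of-the-envelope estimate of the memory needed for one fetch batch.
# Table size, row count, fetch size and heap limit come from the question.
# HEAP_OVERHEAD_FACTOR is an assumed, unmeasured multiplier.

TABLE_BYTES = 1.8 * 1024**3      # 1.8 GB table
ROW_COUNT = 1_050_000
FETCH_SIZE = 100_000
HEAP_LIMIT_MB = 1024             # -Xmx1024M
HEAP_OVERHEAD_FACTOR = 4         # assumption: Java objects vs. on-disk bytes

bytes_per_row = TABLE_BYTES / ROW_COUNT
batch_mb_on_disk = bytes_per_row * FETCH_SIZE / 1024**2
batch_mb_in_heap = batch_mb_on_disk * HEAP_OVERHEAD_FACTOR
full_result_mb = TABLE_BYTES / 1024**2

print(f"bytes per row:          {bytes_per_row:,.0f}")
print(f"one batch, raw size:    {batch_mb_on_disk:,.1f} MB")
print(f"one batch, in heap:     {batch_mb_in_heap:,.1f} MB (assumed x{HEAP_OVERHEAD_FACTOR})")
print(f"whole result, raw size: {full_result_mb:,.1f} MB")
print(f"whole result fits in heap: {full_result_mb < HEAP_LIMIT_MB}")
```

The raw batch size matches the roughly 175 MB worked out in the question; the in-heap figure is only as good as the assumed overhead factor.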

Had the same problem with a tPostgresInput component. Disabling the Auto Commit setting in the tPostgresConnection component fixed the problem for me!
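A likely explanation for this fix: the PostgreSQL JDBC driver documentation states that cursor-based fetching only takes effect when the connection is not in autocommit mode; otherwise the driver retrieves the whole result set at once regardless of the fetch size. The sketch below illustrates the batch-wise pattern that a cursor is meant to provide. It uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the Postgres table, so the table name, row count and batch size are assumptions made for illustration only.

```python
import sqlite3

def stream_rows(conn, query, batch_size):
    """Yield rows one at a time while fetching at most batch_size rows per round."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

# In-memory SQLite database standing in for the Postgres table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    ((f"row-{i}",) for i in range(1050)),
)

written = 0
for row in stream_rows(conn, "SELECT id, payload FROM records ORDER BY id", 100):
    written += 1   # a real job would write the row to the output file here

print(written)
```

Comparing the number of rows written with the row count of the source table is also a simple way to check whether any records were dropped.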

Related

What happens if one working set exceeds memory in MongoDB?

I understand that the working set is cached every time a query is run in MongoDB.
What happens if a single working set exceeds the cache size, given that previously cached pages are evicted to make room for new ones?
Example:
cacheSizeGB: 0.5
total size of the documents matched by the query: 1 GB
Also, does it help to reduce the size of the documents being cached by using projection?
The cache is managed by WiredTiger. The mongod requests the documents it needs, and if WiredTiger doesn't find them in the cache, it reads them in from disk. If this pushes cache usage past the threshold (default 80% of maximum size), the background eviction workers start removing the least recently used items from the cache.
The result set is constructed/sorted in heap memory, not cache. If your result set is too large, mongod may be able to use disk for temporary space while sorting.
Large results may also cause the operating system to use swap, and in extreme cases it might run out of memory and get killed by the OOM killer.
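The eviction behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch of least-recently-used eviction, not WiredTiger's actual algorithm; the ToyCache class, the 1 MB page size and the scan pattern are all assumptions made for illustration. It models the example from the question: a 0.5 GB cache and a query that touches 1 GB of data.

```python
from collections import OrderedDict

class ToyCache:
    """Toy LRU cache; a simplification, not WiredTiger's actual algorithm."""

    def __init__(self, max_bytes, evict_threshold=0.8):
        self.limit = max_bytes * evict_threshold
        self.pages = OrderedDict()   # page id -> size in bytes, oldest first
        self.used = 0
        self.disk_reads = 0

    def read(self, page_id, size):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as most recently used
            return
        self.disk_reads += 1                  # cache miss: load from disk
        self.pages[page_id] = size
        self.used += size
        # Evict least recently used pages until usage is back under the limit.
        while self.used > self.limit and len(self.pages) > 1:
            _, evicted_size = self.pages.popitem(last=False)
            self.used -= evicted_size

MB = 1024 * 1024
cache = ToyCache(max_bytes=512 * MB)          # cacheSizeGB: 0.5
for _ in range(2):                            # scan the same 1 GB twice
    for page in range(1024):                  # 1024 pages of 1 MB each
        cache.read(page, MB)

print(cache.disk_reads, cache.used // MB)
```

In this model the query still completes, but because the scanned data is larger than the cache, every page of the second scan has already been evicted and must be read from disk again.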

SnappyData Spark Scala java.sql.BatchUpdateException

I have around 35 GB of zip files, each containing 15 CSV files. I have written a Scala script that processes each zip file and each CSV file inside it.
The problem is that after some number of files the script throws this error:
ERROR Executor: Exception in task 0.0 in stage 114.0 (TID 3145)
java.io.IOException: java.sql.BatchUpdateException: (Server=localhost/127.0.0.1[1528] Thread=pool-3-thread-63) XCL54.T : [0] insert of keys [7243901, 7243902,
And the string continues with all the keys (records) that were not inserted.
From what I have found, the memory being used is apparently full (I say apparently because of my limited knowledge of Scala, Snappy and Spark). So my question is: how do I increase the amount of memory that can be used? Or how do I flush the data that is in memory and save it to disk?
Can I close the session I started and free the memory that way?
I have had to restart the server and remove the processed files before I can continue with the import, but after some more files the same exception occurs again.
My CSV files are big (the biggest one is around 1 GB), but this exception happens not just with the big files but also when multiple files accumulate, once some total size is reached. So where do I change that memory limit?
I have 12GB RAM...
You can use RDD persistence and store to disk, memory, or a combination: https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence
Also, try adding a large number of partitions when reading the file(s): sc.textFile(path, 200000)
I think you are running out of available memory. The exception message is misleading. If you only have 12GB of memory on your machine, I wonder if your data would fit.
What I would do is first figure out how much memory you need.
1) Copy conf/servers.template to conf/servers
2) Change this file to something like this: localhost -heap-size=3g -memory-size=6g
   This essentially allocates 3g in your server for computations (Spark, etc.) and 6g of off-heap memory for your data (column tables only).
3) Start your cluster using snappy-start-all.sh
4) Load some subset of your data (I doubt you have enough memory for all of it)
5) Check the memory used in the SnappyData Pulse UI (localhost:5050)
6) If you think you have enough memory, load the full data.
Hopefully that works out.
BatchUpdateException tells me that you are creating Snappy tables and inserting data in them. Also, BatchUpdateException in most of the cases means low memory (exception message needs to be better). So, I believe you may be right about the memory. For freeing the memory, you will have to drop the tables that you created. For information about memory size and table sizing, you may want to read these docs:
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#memory-management-heap-and-off-heap
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#table-memory-requirements
Also, if you have a lot of data that can't fit in memory, you can overflow it to disk. See the following doc about the overflow configuration:
http://snappydatainc.github.io/snappydata/best_practices/design_schema/#overflow-configuration
Hope it helps.

Getting an error when I want to fetch bulk records

I am getting the error below when I try to fetch more than 300,000 records.
I'm using link to fetch records and using multiple classes.
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
Please let me know the solution for this.
Thanks
In your case, memory allocated to JVM is not sufficient.
You can try by allocating more memory as follows :
Run --> Run Configurations --> select the "JRE" tab --> then enter -Xmx2048m
I believe you are running the program with the default VM arguments.
You can also work out the memory requirement by analyzing a heap dump with a memory analyzer.
Even though this may resolve your issue temporarily (depending on how much memory is required for 300,000 records), I would suggest changing your program, for example to fetch records in batches.
I would suggest referring to this post:
How to deal with "java.lang.OutOfMemoryError: Java heap space" error (64MB heap size)

Spark throwing Out of Memory error

I have a single test node with 8 GB of RAM on which I am loading barely 10 MB of data (from CSV files) into Cassandra (on the same node). I'm trying to process this data using Spark (running on the same node).
Please note that I'm allocating 1 GB of RAM for SPARK_MEM and the same for SPARK_WORKER_MEMORY. Allocating any more memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which more often than not indicates that Spark is looking for extra memory (as per the SPARK_MEM and SPARK_WORKER_MEMORY properties) and coming up short.
When I try to load and process all data in the Cassandra table using spark context object, I'm getting an error during processing. So, I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them and put them in another table.
My source code has the following structure
var data=sc.cassandraTable("keyspacename","tablename").where("value=?",1)
data.map(x=>tranformFunction(x)).saveToCassandra("keyspacename","tablename")
for(i<-2 to 50000){
  data=sc.cassandraTable("keyspacename","tablename").where("value=?",i)
  data.map(x=>tranformFunction(x)).saveToCassandra("keyspacename","tablename")
}
Now, this works for a while, for around 200 loops, and then this throws an error: java.lang.OutOfMemoryError: unable to create a new native thread.
I've got two questions:
Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?
You are running a query inside the for loop. If the 'value' column is not a key/indexed column, Spark will load the table into memory and then filter on the value. This will certainly cause an OOM.

Spark out of memory

I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
- manually loop through all the files, do the calculations per file and merge the results in the end
- read the whole folder into one RDD, do all the operations on this single RDD and let Spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a Java out of heap exception when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug the memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I successfully processed CSV data of over 1 TB on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.
- Repartitioning an RDD requires additional computation whose overhead can exceed your heap size. Instead, try loading the file with more parallelism by decreasing the split size in TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat).
- Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (by increasing the number of partitions).
- Increase the driver and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the Spark context.
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
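The idea behind the mapPartitions tip can be sketched without a Spark installation. The example below is a pure-Python stand-in, not Spark code: partitions, summarize_partition and merge are hypothetical helper names, and the tab-separated sample lines are made up. Each partition is processed as an iterator, with temporary state created once per partition and only a small aggregate emitted.

```python
def partitions(items, num_partitions):
    """Split a list into consecutive chunks (a stand-in for RDD partitions)."""
    size = -(-len(items) // num_partitions)   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def summarize_partition(lines):
    """Process one partition lazily, keeping only a small running aggregate."""
    counts = {}                               # temporary state, created once per partition
    for line in lines:
        key = line.split("\t")[0]
        counts[key] = counts.get(key, 0) + 1
    yield from counts.items()

def merge(partials):
    """Combine the per-partition (key, count) pairs into one result."""
    total = {}
    for key, n in partials:
        total[key] = total.get(key, 0) + n
    return total

lines = [f"k{i % 3}\tvalue{i}" for i in range(10)]
partials = [pair for part in partitions(lines, 4) for pair in summarize_partition(part)]
print(merge(partials))
```

In PySpark, a function shaped like summarize_partition (iterator in, iterator out) is what rdd.mapPartitions expects. Using more partitions means each call sees less data at once, which is the lever the tip refers to.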
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was that I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
    print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, the PySpark RDD/DataFrame collect() function retrieves all the elements of the dataset (from all nodes) to the driver node. collect() should generally be used only on smaller datasets, usually after operations such as filter(), groupBy() or count(). Retrieving a larger dataset can result in an out-of-memory error.
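The difference between collecting everything and saving incrementally can be shown in plain Python. This is a sketch under stated assumptions rather than Spark code: processed_items is a hypothetical generator standing in for the distributed dataset, and save_as_text_file is a local helper written for this example, not the Spark method of a similar name.

```python
import os
import tempfile

def processed_items(n):
    """Hypothetical stand-in for a large dataset: yields items lazily."""
    for i in range(n):
        yield f"item-{i}"

def save_as_text_file(items, path):
    """Write items one at a time, so memory use does not grow with the dataset."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for item in items:
            out.write(item + "\n")
            count += 1
    return count

output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
written = save_as_text_file(processed_items(100_000), output_path)
print(written)
```

Calling list(processed_items(n)) instead would hold every item in memory at once, which is the local analogue of calling collect() on a large dataset.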