I need to fetch millions of rows from the database into a GSP page. I wrote a query like
"select * from tablename";
Right now I am able to retrieve only about a thousand rows at a time; if I load more than that, I get an error like
java.lang.OutOfMemoryError: GC overhead limit exceeded
I am not using Hibernate. How can I fetch a large amount of data in a Grails project?
You have two options: use pagination or use a query result iterator.
If you're using Grails, I recommend using Hibernate, which lets you build SQL queries without writing them by hand and handles security concerns such as SQL injection for you. Moreover, be restrictive in your query: the * is rarely necessary, and selecting only the columns you need saves query time and memory.
Pagination
This is the best way to handle a large amount of data: split the query into sub-queries, each returning a known number of rows. To do so, use the SQL clauses LIMIT and OFFSET.
For example, your query could be select * from tablename LIMIT 100 OFFSET 2000. You then just increase the OFFSET parameter on each request until all rows have been retrieved.
Thanks to that, your backend never has to handle a huge amount of data at once. Moreover, you can use JavaScript to request the next page while the previous results are still rendering, which improves the perceived response time (asynchronous/infinite scrolling works this way, for example).
Grails has a built-in pagination mechanism that you can use as is; please look at the official documentation. You may have to tweak it a little if you don't use Hibernate.
Query result iterator
You can handle a huge amount of data by using an iterator over the result set, but it depends on the querying framework. Moreover, with that method you will generate huge HTML pages, whose size may itself be a problem (remember: you are hitting an OutOfMemory error, so we are talking about hundreds or thousands of MB that the user would have to download synchronously at some point!).
Related
I am looking to support a use case that returns kdb datasets back to users. The users connect to kdb using the Java API, run the query synchronously, and retrieve the results.
However, issues come up when returning larger datasets, so I would like to return the data from kdb to the Java process in pages/slices. Unfortunately, users need to be able to run queries that return millions of rows, and it would be easier to handle if they were passed back in slices of, say, 100,000 rows (Cassandra and other DBs do this sort of thing).
The potential approaches I have come up with are as follows:
Run the "where" part of the query on the database and return only the indices/date partitions (if applicable) of the data required. The java process would then use these indices to select the data required slice by slice . This approach would control memory usage on the kdb side as it would not have to load all HDB data required at once. However, overall this would increase the run time of the query as data would have to be searched/queried multiple times. This could work well for simple selects but complicated queries may need to go through an "onboarding" process which I want to avoid.
Store results of the query in a global variable in kdb which the java process can then query slice by slice. This simpler method could support any query but could potentially hit limits on the kdb side (memory/timeout) if too large a dataset is queried.
Other points to consider:
It should support users running queries on any type of process - gateway, hdb, rdb, etc.
It should support more than just simple selects, e.g.
((1!select sym, price from trade where sym=`AAA) uj
1!select sym,price from order where sym=`AAA)
lj select avgBid:avg bid by sym from quote where sym=`AAA
The paging functionality should be hidden from the end user
Does anyone have any views on whether there are any options available other than the ones listed above? Essentially I am looking for a select[m n]-type approach that supports any query.
I have some expensive queries (each of them takes around 90 seconds). The good news is that my queries do not change much; as a result, most of them are duplicates. I am looking for a way to cache query results in PostgreSQL. I have searched for an answer but could not find one (some answers are outdated and others are not clear).
I use an application which is connected directly to Postgres.
The query is a simple SQL query which returns thousands of rows:
SELECT * FROM Foo WHERE field_a<100
Is there any way to cache a query result for at least a couple of hours?
It is possible to cache expensive queries in Postgres using a technique called a "materialized view"; however, given how simple your query is, I'm not sure this will give you much gain.
You may be better off caching this information directly in your application, in memory, or, if possible, caching a further-processed set of data rather than the raw rows.
ref:
https://www.postgresql.org/docs/current/rules-materializedviews.html
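If you do go the materialized-view route, here is a minimal sketch of what it could look like driven from application code. It assumes a Node.js application using the node-postgres (pg) package; the view name, refresh interval, and connection details are placeholders, not anything from your setup.

const { Client } = require('pg');

async function main() {
  const client = new Client();   // connection settings come from the PG* environment variables
  await client.connect();

  // Create the cache once; IF NOT EXISTS makes this safe to run at every start-up.
  await client.query(
    'CREATE MATERIALIZED VIEW IF NOT EXISTS foo_cache AS SELECT * FROM Foo WHERE field_a < 100'
  );

  // Refresh the cached result every couple of hours; until then, reads are served
  // from the precomputed view instead of re-running the 90-second query.
  setInterval(() => {
    client.query('REFRESH MATERIALIZED VIEW foo_cache').catch(console.error);
  }, 2 * 60 * 60 * 1000);

  // Reads now hit the cached data.
  const { rows } = await client.query('SELECT * FROM foo_cache');
  console.log(rows.length);
}

main().catch(console.error);

The REFRESH could just as well be scheduled outside the application, e.g. with cron.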
Depending on what your application looks like, a TEMPORARY TABLE might work for you. It is only visible to the connection that created it and it is automatically dropped when the database session is closed.
CREATE TEMPORARY TABLE tempfoo AS
SELECT * FROM Foo WHERE field_a<100;
The downside to this approach is that you get a snapshot of Foo when you create tempfoo. You will not see any new data that gets added to Foo when you look at tempfoo.
Another approach: if you have access to the database, you may be able to significantly speed up your queries by adding an index on field_a (the column used in your WHERE clause).
My question revolves around understanding the following two procedures (particularly performance and code logic) that I used to collect trade data from the US Census Bureau API. I have already collected the data, but I ended up writing two different ways of requesting and saving it, and my questions pertain to those two approaches.
Summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited my procedure using tiny-async-pool (which caps how many invocations of a function run concurrently) so as not to request too much at once, hit timeouts, or overload my database with queries. Simply put, the bottleneck I was facing was the database, since the API requests returned rather quickly (1-15 seconds depending on body size), but saving each array item to its own MongoDB document took anywhere from 100 ms to 700 ms (the returned data was a nested array, sometimes a few hundred items and sometimes over one hundred thousand, with at most 10 values per item). To save time on potential errors and avoid redoing the same queries, I also checked my database before making each request to see whether that query had already been completed. The end result was that I did not follow this method, since it was very error-prone and susceptible to timeouts when the data was very large (even with the timeout set to 10 minutes in the request options).
Second way: npm request and save data to csv
I used the same approach as the first method for the requests and concurrency; however, I saved each query result to its own CSV file. To handle errors without redoing successful queries, I also checked whether the file already existed and, if so, skipped that query. This approach was error-free: I ran it, and after a few hours all the data was saved. Writing to CSV was insanely fast, much more so than using MongoDB.
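For reference, a rough sketch of what this second approach boils down to (the endpoint URLs, query list, and file layout are placeholders; the original script used the npm request package and tiny-async-pool, while this sketch assumes Node 18+ with the built-in fetch and runs the queries one at a time):

const fs = require('fs/promises');
const { existsSync } = require('fs');

// Hypothetical query descriptors; the real ones were built from Census API parameters.
const queries = [
  { id: 'exports-2019', url: 'https://api.example.com/trade?year=2019' },
  { id: 'exports-2020', url: 'https://api.example.com/trade?year=2020' },
];

async function fetchAndSave(q) {
  const file = `data/${q.id}.csv`;          // assumes a data/ directory exists
  if (existsSync(file)) return;             // skip queries that already succeeded on a previous run

  const res = await fetch(q.url);           // Node 18+ built-in fetch
  const rows = await res.json();            // the API returns a nested array of rows

  // One CSV line per array item (no quoting/escaping, for simplicity). Writing the whole
  // file in one call is a single fast I/O operation, unlike one MongoDB insert per item.
  await fs.writeFile(file, rows.map(r => r.join(',')).join('\n'));
}

(async () => {
  for (const q of queries) await fetchAndSave(q);   // the real run processed a few in parallel
})();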
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used JavaScript because that's where I learned API requests and async operations, even though I will do most of my data analysis with Python and pandas. I first tried the database method, mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques, I still could not get it to work properly. I resorted to the CSV method, which was a) much less code to write, b) fewer checks, c) faster, and d) more reliable.
My final questions are these:
Why was the CSV approach better than the database approach? Are there any counterarguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?
I'm currently doing a project with my own MEAN stack.
Now, in a new project I'm creating, I've got a collection that I'm paging with Express on the server side, returning one page at a time (e.g. 10 results out of a total of 2000) along with the total number of rows found for the query the user performed (e.g. 193 for UserID 3).
Although this works fine, I'm afraid that this will create an enormous load on the server since a user can easily pull 50-60 pages a session with 10, 20, 50 or even 100 results each.
My question to you guys is: if I have say 1000 concurrent users paging every few seconds like this, will MongoDB be able to cope with this? If not, what might be my alternatives here?
Also, is there any way I can simulate such concurrent read tests on my app/MongoDB?
Please take into account that I must do server-side paging, because the app will be quite dynamic and the information can change very often.
If you're planning on only using a single webserver, you could cache the result set belonging to a certain page in memory. If you're planning on using multiple webservers, caching in-memory would lead to different result sets across servers, so in that case I'd recommend storing your cache either in MongoDB or in Redis.
A given result set would be stored under a specific key in your cache. Your key would probably be composed of something like entityName + filterOptions + offset + resultsLimit. So, for example, if you're loading movies with title=titanic, skipping the first 100 (offset=100), and loading only 50 per page (limit=50), all of that would be concatenated into a single key.
When a request comes in, you would first try to load the result set from the cache. If the result set is inside the cache, you'll return that to the client. If it's not in the cache, you'd query the database for the latest result set, put that in the cache and return it to the client.
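As a rough sketch of that flow (the route, collection, TTL, and connection string are illustrative only; with multiple web servers you would swap the in-memory Map for Redis or a MongoDB collection, as noted above):

const express = require('express');
const { MongoClient } = require('mongodb');

const cache = new Map();                  // key -> { expires, data }
const TTL_MS = 30 * 1000;                 // how long a cached page stays valid

async function start() {
  const client = await MongoClient.connect('mongodb://localhost:27017');  // placeholder URI
  const movies = client.db('mydb').collection('movies');                  // placeholder db/collection

  const app = express();

  app.get('/movies', async (req, res) => {
    const { title = '', offset = 0, limit = 10 } = req.query;
    const key = `movies:${title}:${offset}:${limit}`;   // entityName + filterOptions + offset + resultsLimit

    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) {
      return res.json(hit.data);                        // cache hit: no database round-trip
    }

    // Cache miss: query MongoDB, store the page, then return it.
    const data = await movies
      .find(title ? { title } : {})
      .skip(Number(offset))
      .limit(Number(limit))
      .toArray();

    cache.set(key, { expires: Date.now() + TTL_MS, data });
    res.json(data);
  });

  app.listen(3000);
}

start().catch(console.error);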
Whether or not you could pull it off with 1000 concurrent users depends a lot on your hardware, the data you are loading, how you're loading it and the efficiency of your implementation. There's one way to find out, and that's testing.
Of course by using the asynchronous capabilities of Node.js you can achieve the best scalability, so every call that can be executed async, such as database calls, should definitely be executed asynchronously.
You can load test your application for free from your local computer using Apache JMeter, or have it tested using, for example, Azure.
I have a table of 3M rows.
I wanted to retrieve all those rows and do a visualization using dc.js.
The problem I have is that it takes about 70 seconds for just a single column.
And if I run the query I actually need, it takes about 240 seconds to retrieve those rows.
I'm using a select query on the columns, like this:
SELECT COL1, COL2 FROM TABLE
That's it. No grouping, nothing.
But it takes a hell of a lot of time.
I have heard of indexing and created an index on the columns I use, but even so, no fruitful results.
You should not retrieve 3M rows in any single query, and sending 3M records will always take a lot of time. This has little to do with the database itself; it is transfer speed, and it will kill your I/O. The bulk of the time is spent on the transfer (I/O) between the request originator and the Postgres database.
Consider breaking that request into batches of async requests that get streamed down to the client. That means restructuring your front-end code (JavaScript) for an improved user experience, along the lines of the sketch below.
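For illustration only, a minimal browser-side sketch of that batching idea (the /api/rows endpoint and its limit/offset parameters are hypothetical; your server would back each slice with something like SELECT col1, col2 FROM yourtable LIMIT $1 OFFSET $2):

async function loadInBatches(onBatch, batchSize = 50000) {
  let offset = 0;
  for (;;) {
    // Ask the server for one slice at a time instead of all 3M rows at once.
    const res = await fetch(`/api/rows?limit=${batchSize}&offset=${offset}`);
    const rows = await res.json();
    if (rows.length === 0) break;     // no more data

    onBatch(rows);                    // e.g. add the slice to the crossfilter behind dc.js and redraw
    offset += rows.length;
  }
}

loadInBatches(rows => {
  // ndx.add(rows); dc.redrawAll();   // sketch of an incremental chart update
  console.log('received', rows.length, 'rows');
});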
You didn't specify the environment in which you are using PostgreSQL.
As an example, in Node.js you can solve this problem by streaming the data with the help of pg-query-stream and rendering it on the client side at the same time, so the client doesn't have to wait for the query to finish and can see intermediate results.
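A minimal sketch of that idea, assuming an Express app with the pg and pg-query-stream packages (the table and column names are placeholders standing in for the ones in the question, and connection details come from the environment):

const express = require('express');
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool();               // connection settings from the PG* environment variables
const app = express();

app.get('/rows', async (req, res) => {
  const client = await pool.connect();
  try {
    const stream = client.query(new QueryStream('SELECT col1, col2 FROM mytable'));

    // Send rows as newline-delimited JSON while the query is still running, so the
    // browser can start rendering before all 3M rows have arrived.
    res.setHeader('Content-Type', 'application/x-ndjson');
    for await (const row of stream) {
      res.write(JSON.stringify(row) + '\n');
    }
    res.end();
  } finally {
    client.release();
  }
});

app.listen(3000);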
This may not be the best solution though. A better solution would be to implement data aggregation within a database function to provide a smaller data subset.
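And a rough sketch of the aggregation suggestion: instead of shipping 3M raw rows, let the database reduce them to something chart-sized first (the grouping and the /rows/summary route are hypothetical examples, not anything from the question):

const express = require('express');
const { Pool } = require('pg');

const pool = new Pool();               // PG* environment variables again
const app = express();

// One row per col1 value instead of 3M raw rows; the same query could live in a SQL
// view or function so the application only ever sees the aggregated subset.
app.get('/rows/summary', async (req, res) => {
  const { rows } = await pool.query(
    'SELECT col1, count(*) AS n, avg(col2) AS avg_col2 FROM mytable GROUP BY col1'
  );
  res.json(rows);   // typically a small result: cheap to transfer and to chart with dc.js
});

app.listen(3000);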