I am currently using DOMXPath to extract information from websites, but most of the time I get a maximum execution time exceeded error. Is there any method similar to DOMXPath but much faster?
I'm trying to write an application that provides offline capability for a vast number of records (over 20 million).
I've tried to do it using sqflite, and tests show that it's not feasible, since it either takes very long to write (if the indexes are predefined) or too long to index (after the inserts).
So I've decided to use the file system: since I'm only going to query by an id, I use the file name as the id and look the record up that way.
But this time the main problem is that the file write operation takes very long. I'm using the File(filePath).writeAsString API, and for 50 to 120 character strings it sometimes takes more than 100 ms.
I'm trying to utilize isolates for better performance, but it still does not help very much.
Is there a better approach, or a file manipulation API better suited to this kind of operation?
My question revolves around understanding the following two procedures (particularly their performance and code logic) that I used to collect trade data from the US Census Bureau API. I already collected the data, but I ended up writing two different ways of requesting and saving it, which my questions pertain to.
A summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited my procedure using tiny-async-pool (which caps the concurrency of a given function) so as not to request too much at once, hit timeouts, or overload my database with queries. Simply put, the bottleneck was the database: the API requests returned fairly quickly (1-15 seconds depending on body size), but saving each array item (the returned data was a nested array, ranging from a few hundred items to over one hundred thousand, with at most 10 values in each array) to its own mongodb document took anywhere from 100 ms to 700 ms.
To save time on potential errors and avoid redoing the same queries, I also checked the database before making each request to see whether that query was already complete. In the end I did not follow this method, since it was very error prone and susceptible to timeouts when the data was large (even with the timeout set to 10 minutes in the request options).
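For reference, here is a stripped-down sketch of the shape of this first pipeline (not my actual code: it assumes tiny-async-pool's v1 promise-returning API, the request package, and the official mongodb driver, and the URL list, database, collection and field names are placeholders). The insertMany at the end is the bulk variant; my actual runs saved each array item as its own document, which is exactly where the 100-700 ms per save went.

    // Sketch only: placeholder names, tiny-async-pool v1 API assumed.
    const asyncPool = require("tiny-async-pool");
    const request = require("request");
    const { MongoClient } = require("mongodb");

    function fetchJson(url) {
      return new Promise((resolve, reject) => {
        request({ url, json: true, timeout: 600000 }, (err, res, body) =>
          err ? reject(err) : resolve(body));
      });
    }

    async function collect(censusUrls) {
      const client = new MongoClient("mongodb://localhost:27017");
      await client.connect();
      const trades = client.db("census").collection("trades");

      await asyncPool(5, censusUrls, async (url) => {
        // Skip queries that already made it into the database on a previous run.
        if (await trades.findOne({ sourceUrl: url })) return;

        const rows = await fetchJson(url); // nested array of arrays
        // One bulk insert per response instead of one insert per array item.
        await trades.insertMany(
          rows.map((values) => ({ sourceUrl: url, values })),
          { ordered: false });
      });

      await client.close();
    }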
Second way: npm request and save data to csv
I used the same approach as the first method for the requests and concurrency; however, I saved each query to its own csv file. To handle errors and avoid redoing successful queries, I also checked whether the file already existed and, if so, skipped that query. This approach was error free: I ran it, and after a few hours all the data was saved. Writing to csv was insanely fast, much more so than using mongodb.
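Again for reference, a rough sketch of the csv variant (same caveats: fetchJson is the same hypothetical request helper as above, and the file layout and naming are placeholders, not my actual code):

    // Sketch only: one csv file per query, file existence doubles as the
    // "already done" check.
    const fs = require("fs");
    const path = require("path");
    const asyncPool = require("tiny-async-pool");

    async function collectToCsv(censusUrls, outDir) {
      await asyncPool(5, censusUrls, async (url) => {
        const file = path.join(outDir, encodeURIComponent(url) + ".csv");
        if (fs.existsSync(file)) return; // skip queries that already succeeded

        const rows = await fetchJson(url); // nested array of arrays
        // Naive join: fine for plain numeric/code values, but proper csv
        // escaping would need quoting of commas and quotes.
        const csv = rows.map((values) => values.join(",")).join("\n");
        await fs.promises.writeFile(file, csv);
      });
    }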
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used javascript because that's where I learned api requests and async operations, even though I will do most of my data analysis with python and pandas. I first tried the database method mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques, I still could not get it to work properly. I resorted to the csv method, which was a) much less code to write, b) fewer checks, c) faster, and d) more reliable.
My final questions are these:
Why was the csv approach better than the database approach? Are there any counterarguments, or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?
How can I benchmark SQL performance in PostgreSQL? I tried using EXPLAIN ANALYZE, but it gives a different execution time every time I repeat the same query.
I am applying some tuning techniques to my query and trying to see whether they improve its performance. EXPLAIN ANALYZE shows varying execution times that I can't benchmark and compare. The tuning has an impact of only milliseconds, so I am looking for a benchmark that gives stable values to compare against.
There will always be variations in the time it takes a statement to complete:
Pages may be cached in memory or have to be read from disk. This is usually the source of the greatest deviations.
Concurrent processes may need CPU time
You have to wait for internal short-lived locks (latches) to access a data structure
These are just the first three things that come to my mind.
In short, execution time is always subject to small variations.
Run the query several times and take the median of the execution times. That is as good as it gets.
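As a sketch of what that can look like (assuming the node-postgres pg driver; the connection string and query are placeholders), a small script that runs the statement a number of times and reports the median:

    // Sketch only: time the query N times client-side and report the median.
    const { Client } = require("pg");

    async function medianRuntimeMs(sql, runs = 11) {
      const client = new Client({ connectionString: "postgres://localhost/mydb" });
      await client.connect();
      const times = [];
      for (let i = 0; i < runs; i++) {
        const start = process.hrtime.bigint();
        await client.query(sql);
        times.push(Number(process.hrtime.bigint() - start) / 1e6); // ns -> ms
      }
      await client.end();
      times.sort((a, b) => a - b);
      return times[Math.floor(times.length / 2)];
    }

    medianRuntimeMs("SELECT * FROM mytable WHERE id = 42").then(console.log);

Note that client-side timing also includes planning and the network round trip, unlike the execution time reported by EXPLAIN ANALYZE, so always compare like with like.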
Tuning for milliseconds only makes sense if it is a query that is executed a lot.
Also, tuning only makes sense if you have realistic test data. Don't make the mistake of examining and tuning a query against only a handful of test rows when it will have to perform against millions of rows.
I need to fetch millions of rows from the database to a GSP page. I have written a query like
"select * from tablename";
Right now I am able to retrieve only a thousand rows at a time; if I load more than that, I get an error like
java.lang.OutOfMemoryError: GC overhead limit exceeded
I am not using Hibernate. How can I fetch a large amount of data in a Grails project?
You have two options: use pagination or use a query result iterator.
If you're using Grails, I recommend using Hibernate, which lets you build SQL queries without writing them by hand and takes care of many security concerns. Moreover, be restrictive in your request: select * is not always necessary, and selecting only the columns you need can save request time and memory.
Pagination
This is the best way to handle a large amount of data: you just split the query into sub-queries, each returning a known number of rows. To do so, you use the SQL clauses LIMIT and OFFSET.
For example, your query could be: select * from tablename LIMIT 100 OFFSET 2000. You just have to change the OFFSET parameter to retrieve all values.
Thanks to that, your backend does not have to handle a huge amount of data at a time. Moreover, you can use JavaScript to send requests to your backend while it is rendering previous results, which improves the perceived response time (asynchronous/infinite scrolling works like this, for example).
Grails has a default pagination system that you can use as is. Please look at the official documentation here. You may have to tweak it a little if you don't use Hibernate.
Query result iterator
You can handle a huge amount of data by using an iterator on the result, but it depends on the querying framework. Moreover, with that method you will generate huge HTML pages, whose size may become a problem (remember: you are already getting an OutOfMemoryError, so you are talking about hundreds or thousands of MB that the user would have to download synchronously at some point!)
How can I optimize the view count calculation in MongoDB?
We have a huge number of pages that are almost static apart from the view count. We've tried to calculate it from the logs, without triggering a DB operation while users are viewing the page, and to process the logs during off-peak hours. Is there a more elegant way to optimize this view count calculation?
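For concreteness, the kind of off-peak batch step we have in mind looks roughly like this (a sketch with placeholder names, using the mongodb Node driver's bulkWrite; not our production code):

    // Sketch only: aggregate view events from the access log, then apply them
    // as a single bulkWrite of $inc updates during off-peak hours.
    const { MongoClient } = require("mongodb");

    async function flushViewCounts(viewsPerPage /* Map<pageId, count> */) {
      const client = new MongoClient("mongodb://localhost:27017");
      await client.connect();
      const pages = client.db("site").collection("pages");

      const ops = [...viewsPerPage].map(([pageId, count]) => ({
        updateOne: {
          filter: { _id: pageId },
          update: { $inc: { viewCount: count } },
        },
      }));

      // One round trip for the whole batch instead of one update per page view.
      if (ops.length) await pages.bulkWrite(ops, { ordered: false });
      await client.close();
    }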
You could use Google Analytics or something similar to do it for you. Plus you'd get a whole lot of other useful metrics.