I am facing an issue with MongoDB inserts. When I use insert_one from IDEA, each insert takes a few seconds to respond on average, while using insert_many to insert all of the data at once also takes only a few seconds in total. The interesting thing is that inserting the same data through Navicat takes milliseconds.
Another interesting point is that when I run the code on a MacBook it is several times faster than on Windows, and the difference between the two is obvious. Other high-spec Windows environments are also very slow.
By the way, querying data is fast in every environment.
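For concreteness, the two call patterns being compared look roughly like this (a sketch using the Node.js driver with placeholder connection and collection names; the insert_one/insert_many names above suggest a different driver, e.g. PyMongo, but the pattern is the same):

```typescript
import { MongoClient } from "mongodb";

// Sketch only: placeholder connection string, database, and collection names.
async function compareInsertPatterns(docs: Array<Record<string, unknown>>) {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const db = client.db("test");

  // Pattern 1: one round trip and one write acknowledgment per document.
  console.time("insertOne loop");
  for (const doc of docs) {
    await db.collection("one_by_one").insertOne(doc);
  }
  console.timeEnd("insertOne loop");

  // Pattern 2: the whole array in a single batched call.
  console.time("insertMany");
  await db.collection("batched").insertMany(docs, { ordered: false });
  console.timeEnd("insertMany");

  await client.close();
}
```

The per-call round trip and write acknowledgment is usually what makes a loop of insert_one so much slower than a single insert_many, and environment-level differences (for example, how localhost name resolution behaves on Windows versus macOS) can magnify it.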
My question revolves around understanding the following two procedures (particularly their performance and code logic) that I used to collect trade data from the US Census Bureau API. I already collected the data, but I ended up writing two different ways of requesting and saving it, and those two approaches are what my questions are about.
Summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited my procedure using tiny-async-pool (which caps how many invocations of a function run concurrently) so that I would not request too much at once, hit timeouts, or overload my database with queries. Simply put, the bottleneck I was facing was the database: the API requests returned rather quickly (1-15 seconds depending on body size), but saving each array item to its own MongoDB document took 100 ms to 700 ms (the return data was a nested array, sometimes a few hundred items and sometimes over one hundred thousand, with at most 10 values per item). To avoid redoing the same queries after errors, I also checked my database before making each request to see whether that query had already been completed. The end result was that I did not follow this method, since it was very error prone and susceptible to timeouts when the data was very large (even with the timeout set to 10 minutes in the request options).
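The general shape of that pipeline was roughly the following. This is only a sketch, not the original code: URLs, collection names, and document shape are placeholders, the built-in fetch stands in for the request package, and it batches the per-item saves with insertMany, which is the usual fix for the 100-700 ms-per-document cost described above:

```typescript
import asyncPool from "tiny-async-pool"; // v2-style promise API; v3+ exposes an async iterator instead
import { MongoClient } from "mongodb";

// Sketch only: placeholder URL list, collection names, and document shape.
const queries: string[] = [/* Census API URLs */];

async function run() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const col = client.db("census").collection("trade_rows");

  const fetchAndSave = async (url: string) => {
    // Skip queries whose results are already in the database.
    if (await col.findOne({ sourceUrl: url })) return "skipped";

    const res = await fetch(url);               // 1-15 s depending on body size
    const rows: unknown[][] = await res.json(); // nested-array payload

    // One document per array item, written in a single batched call
    // rather than one insert per item.
    await col.insertMany(
      rows.map((values) => ({ sourceUrl: url, values })),
      { ordered: false }
    );
    return "saved";
  };

  // At most 5 requests (and their saves) in flight at any time.
  await asyncPool(5, queries, fetchAndSave);
  await client.close();
}

run().catch(console.error);
```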
Second way: npm request and save data to csv
I used the same approach as in the first method for the requests and concurrency; however, I saved each query's result to its own CSV file. To handle errors without redoing successful queries, I also checked whether the file already existed and, if so, skipped that query. This approach was error free: I ran it, and after a few hours all the data was saved. Writing to CSV was insanely fast, much more so than using MongoDB.
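The CSV variant looked roughly like this (again a sketch with placeholder paths and file naming; a real CSV writer should also escape commas and quotes inside values):

```typescript
import { promises as fs } from "fs";
import * as path from "path";

// Sketch only: placeholder output directory and file naming scheme.
const outDir = "./census_csv";

async function fetchToCsv(url: string) {
  await fs.mkdir(outDir, { recursive: true });
  const file = path.join(outDir, encodeURIComponent(url) + ".csv");

  // Skip queries that already produced a file (the error-recovery check).
  try {
    await fs.access(file);
    return "skipped";
  } catch {
    // File does not exist yet; fall through and fetch.
  }

  const res = await fetch(url);
  const rows: unknown[][] = await res.json();

  // One CSV line per array item, written with a single sequential file write.
  const csv = rows.map((r) => r.join(",")).join("\n");
  await fs.writeFile(file, csv, "utf8");
  return "saved";
}
```

A single sequential write to a local file has no network round trip, no index maintenance, and no per-document write acknowledgment, which accounts for most of the speed difference against per-document database writes.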
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used javascript because that's where I learned api requests and async operations, even though I will do most of my data analysis with python and pandas. I first tried the database method mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques I still could not get it to work properly. I resorted to the csv method which was a) much less code to write, b) less checks, c) faster, and d) more reliable.
My final questions are these:
Why was the csv approach better than the database approach? Any counter arguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?
I have a table of 3M rows.
I wanted to retrieve all those rows and do a visualization using dc.js.
The problem I have is that retrieving just a single column takes about 70 seconds.
And if I write my full query, it takes about 240 seconds to retrieve those rows.
I'm using a plain SELECT on the columns, like this:
SELECT COL1, COL2 FROM TABLE
That's it. No grouping, nothing.
But it takes a hell of a lot of time.
I had heard about indexing, so I created an index on the columns I use, but it brought no improvement.
You should not retrieve 3M rows in any query, and sending 3M records will always take a lot of time; that has nothing to do with the database, it is transfer speed, and it will kill your IO. The bulk of the time is spent on the transfer (IO) between the request originator and the Postgres database.
Consider breaking that request into batches of async requests that get streamed down to the client. That means restructuring your front-end code (JavaScript) for a better user experience; one way to do the batching is sketched below.
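For example, keyset pagination keeps each round trip small and lets the front end render as chunks arrive. A sketch using node-postgres; the table and column names (my_table, id, col1, col2) are placeholders, and it assumes an indexed id column:

```typescript
import { Client } from "pg";

// Fetch the table in chunks keyed on an indexed id column, yielding each chunk
// so it can be sent to the browser (or written out) while the next one is fetched.
async function* fetchInBatches(batchSize = 50_000) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    let lastId = 0;
    for (;;) {
      const { rows } = await client.query(
        "SELECT id, col1, col2 FROM my_table WHERE id > $1 ORDER BY id LIMIT $2",
        [lastId, batchSize]
      );
      if (rows.length === 0) break;
      lastId = rows[rows.length - 1].id;
      yield rows;
    }
  } finally {
    await client.end();
  }
}

// usage: for await (const chunk of fetchInBatches()) { sendToClient(chunk); }
```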
You didn't specify the environment in which you are using PostgreSQL.
As an example, in Node.js you can solve this problem by streaming the data with the help of pg-query-stream and rendering it on the client side at the same time, so the client doesn't have to wait for the query to finish and can see intermediate results.
This may not be the best solution though. A better solution would be to implement data aggregation within a database function to provide a smaller data subset.
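A minimal sketch of the pg-query-stream approach (placeholder table and column names; in a real application each row, or small batch of rows, would be piped into the HTTP response rather than stdout):

```typescript
import { Client } from "pg";
import QueryStream from "pg-query-stream";

async function streamRows() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // The stream fetches rows from the server in small batches instead of
    // materializing all 3M rows in memory before responding.
    const stream = client.query(new QueryStream("SELECT col1, col2 FROM my_table"));
    for await (const row of stream) {
      // Forward each row to the client as it arrives.
      process.stdout.write(JSON.stringify(row) + "\n");
    }
  } finally {
    await client.end();
  }
}

streamRows().catch(console.error);
```

If the visualization only needs aggregates, pushing the aggregation into a database function (as suggested above) is usually the bigger win, since far fewer rows ever leave the server.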
I noticed that the first time I run a query on Redshift, it takes 3-10 seconds. When I run the same query again, even with different arguments in the WHERE condition, it runs fast (0.2 sec).
The query I am talking about runs on a table of ~1M rows, over 3 integer columns.
Is this huge difference in execution times caused by the fact that Redshift compiles the query the first time it is run and then reuses the compiled code?
If yes - how to always keep this cache of compiled queries warm?
One more question:
Given queryA and queryB.
Let's assume queryA was compiled and executed first.
How similar should queryB be to queryA, such that execution of queryB will use the code compiled for queryA?
The answer to the first question is yes. Amazon Redshift compiles code for the query and caches it. The compiled code is shared across sessions in a cluster, so the same query, even with different parameters and in a different session, will run faster because the compilation overhead is gone.
They also recommend using the result of the second execution of a query when benchmarking.
The answer to this question, with more details, is at the following link:
http://docs.aws.amazon.com/redshift/latest/dg/c-compiled-code.html
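If you want to verify whether a given run compiled its segments or reused cached code, Redshift records this in the SVL_COMPILE system view (compile = 1 means the segment was compiled on that run, 0 means cached code was reused). A hedged sketch that queries it with node-postgres over the Postgres wire protocol Redshift accepts; the cluster endpoint, credentials, and query ID are placeholders:

```typescript
import { Client } from "pg";

// Sketch only: placeholder cluster endpoint, credentials, and query ID (taken from STL_QUERY).
async function inspectCompileCache(queryId: number) {
  const client = new Client({
    host: "my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port: 5439,
    database: "dev",
    user: "awsuser",
    password: process.env.REDSHIFT_PASSWORD,
    ssl: true, // adjust to your cluster's TLS requirements
  });
  await client.connect();
  try {
    // compile = 1: segment compiled on this run; compile = 0: cached code was reused.
    const { rows } = await client.query(
      `SELECT segment, compile, datediff(ms, starttime, endtime) AS duration_ms
         FROM svl_compile
        WHERE query = ${queryId}
        ORDER BY segment`
    );
    console.table(rows);
  } finally {
    await client.end();
  }
}

inspectCompileCache(12345).catch(console.error);
```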
I have an application written on Play Framework 1.2.4 with Hibernate(default C3P0 connection pooling) and PostgreSQL database (9.1).
Recently I turned on slow query logging (>= 100 ms) in postgresql.conf and found some issues.
But when I tried to analyze and optimize one particular query, I found that it is blazing fast in psql (0.5 - 1 ms) in comparison to 200-250 ms in the log. The same thing happened with the other queries.
The application and database server are running on the same machine and communicate over the localhost interface.
JDBC driver - postgresql-9.0-801.jdbc4
I wonder what could be wrong, because the query duration in the log is calculated from database processing time only, excluding external factors such as network round trips.
Possibility 1: If the slow queries occur occasionally or in bursts, it could be checkpoint activity. Enable checkpoint logging (log_checkpoints = on), make sure the log level (log_min_messages) is 'info' or lower, and see what turns up. Checkpoints that're taking a long time or happening too often suggest you probably need some checkpoint/WAL and bgwriter tuning. This isn't likely to be the cause if the same statements are always slow and others always perform well.
Possibility 2: Your query plans are different because you're running them directly in psql while Hibernate, via PgJDBC, will at least sometimes be doing a PREPARE and EXECUTE (at the protocol level so you won't see actual statements). For this, compare query performance with PREPARE test_query(...) AS SELECT ... then EXPLAIN ANALYZE EXECUTE test_query(...). The parameters in the PREPARE are type names for the positional parameters ($1,$2,etc); the parameters in the EXECUTE are values.
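A sketch of that comparison; the table name, column, and parameter type (orders, customer_id, int) are placeholders standing in for whatever your slow query looks like, and the same two statements can just as well be pasted into psql:

```typescript
import { Client } from "pg";

// Compare the one-off plan with the generic plan used for a server-side prepared statement.
async function comparePlans() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // One-off plan, as psql would produce it.
    const direct = await client.query(
      "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42"
    );
    console.log(direct.rows.map((r) => r["QUERY PLAN"]).join("\n"));

    // Prepared plan, closer to what PgJDBC sends at the protocol level.
    await client.query(
      "PREPARE test_query(int) AS SELECT * FROM orders WHERE customer_id = $1"
    );
    const prepared = await client.query("EXPLAIN ANALYZE EXECUTE test_query(42)");
    console.log(prepared.rows.map((r) => r["QUERY PLAN"]).join("\n"));
  } finally {
    await client.end();
  }
}

comparePlans().catch(console.error);
```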
If the prepared plan is different to the one-off plan, you can set PgJDBC's prepare threshold via connection parameters to tell it never to use server-side prepared statements.
This difference between the plans of prepared and unprepared statements should go away in PostgreSQL 9.2. It has been a long-standing wart, but Tom Lane dealt with it for the upcoming release.
It's very hard to say for sure without knowing all the details of your system, but I can think of a couple of possibilities:
The data is cached. If you run the same query twice in a short space of time, it will almost always complete much more quickly on the second pass, because PostgreSQL keeps recently read data in its shared buffers (and the OS keeps it in the filesystem cache) for just this purpose. If you are pulling the queries from the tail of your log and executing them immediately, this could be what's happening.
Other processes are interfering. The execution time for a query varies depending on what else is going on in the system. If the queries are taking 100ms during peak hour on your website when a lot of users are connected but only 1ms when you try them again late at night this could be what's happening.
The point is you are correct that the query duration isn't affected by which library or application is calling it, so the difference must be coming from something else. Keep looking, good luck!
There are several possible reasons. First, if the database was very busy when the slow queries executed, they may simply have run slower; you may need to observe the OS load at that moment for further analysis.
Second, the plan used when the slow query was logged may differ from the plan in your current session, so you may need to install auto_explain to capture the actual plan of the slow query.
I am trying to anonymize a large data set of about 600k records (removing sensitive information like email, etc.) so that it can be used for some performance tests.
I am using Scala (Casbah) with Mongo. The actual script is pretty simple and straightforward. When I run the script, the entire process starts off pretty fast, parsing 1000 records every 2-3 seconds, but it slows down tremendously and ends up crawling.
I know this is pretty vague without too much details, but any idea why this is happening, and any hints on how I could speed this up?
It turned out to be an issue with the driver and not with Mongo. When I tried the same inserts using the mongo shell, they went through without breaking a sweat.
UPDATE
So, I tried both approaches: writing into the existing collection versus dumping the results into a new collection. The first approach was faster for me. Of course, one should never assume this will always be true and must benchmark before choosing the first approach over the second. In both cases, Mongo was very fast (meaning it was not taking hours to get this done). The problem was with the Java interface I was using to connect to Mongo, which was more of a stupid mistake on my part.
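For reference, the two approaches look roughly like this with the Node.js driver (the original script used Scala/Casbah, so this is only an analogue; database, collections, and fields are placeholders, and approach one is read here as bulk-updating the existing documents in place):

```typescript
import { MongoClient } from "mongodb";

// Analogue sketch only: placeholder database, collection, and field names.
async function anonymize() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const db = client.db("perf_test");
  const users = db.collection("users");

  // Approach 1: bulk-update the existing collection in place, in batches of 1000.
  let ops: any[] = [];
  for await (const doc of users.find({}, { projection: { _id: 1 } })) {
    ops.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { email: `user_${doc._id}@example.com`, name: "REDACTED" } },
      },
    });
    if (ops.length === 1000) {
      await users.bulkWrite(ops, { ordered: false });
      ops = [];
    }
  }
  if (ops.length) await users.bulkWrite(ops, { ordered: false });

  // Approach 2: stream anonymized copies into a new collection instead,
  // batching with insertMany rather than issuing one insert per document:
  //   await db.collection("users_anon").insertMany(batch, { ordered: false });

  await client.close();
}

anonymize().catch(console.error);
```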