Big Batch Command in OrientDB - orientdb

I need to execute a considerably sized insertion operation from my Java-based web app. Due to the potential size of the operation, I'm using a dynamically generated OCommandScript, which I then execute using
orientGraph.command(new OCommandScript("sql", sql)).execute(params);
Now, in the typical scenario, sql is big (~5000 characters), params contains about 20k entries and OrientDB takes more than one minute to execute the script. This is far more than I can afford, as this is executed in the context of a REST request. Am I doing something wrong? Is there any way I can speed up this operation for a large number of insertions?

Related

Flutter file write takes too long

I'm trying to write an application that provides offline access to a very large number of records (above 20m).
I've tried to do it using sqflite, and tests show that it's not feasible: it either takes very long to write (if the indexes are predefined) or takes too long to index (after the inserts).
So I've decided to use the file system. Since I'm only going to query by id, I use the id as the file name and look the record up that way.
This time, however, the main problem is that the file write operation takes very long. I'm using the File(filePath).writeAsString API, and for 50 to 120 char strings it sometimes takes more than 100ms.
I'm trying to use isolates for better performance, but it still does not help very much.
Is there a better approach, or a file-manipulation API better suited to this kind of operation?

Create data as a load on J-Meter

I'm working on Kubernetes on Microsoft Azure with real data. Now I need to generate a sample of data in JMeter and then use it as a workload to stress the CPU in the Tea-Store microservices on Kubernetes. Any hint or source on how to do that, and which types of files work with JMeter?
If you want a specific answer, you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config reads CSV files into JMeter Variables, so each virtual user reads the next line from the CSV file on each iteration
__CSVRead() function does more or less the same, but it can be declared/used at runtime, so you can have a dynamic filename/path and you decide when to proceed to the next column/row
JDBC Request sampler allows reading test data from the database or creating test data in the database
__StringFromFile() function reads the next line from a file each time it is called
__FileToString() function reads the whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - number of current thread
__time() and __timeShift() - current timestamp in various formats plus possibility to generate dates in future or past
__Random() - generate a random number
__RandomString() - generate a random string out of provided characters
__UUID() - generate unique GUID-like structure
__groovy() - for everything else, it executes arbitrary Groovy code and returns the result
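As a rough illustration of the external-data route, test data can be generated ahead of time and written to a CSV file that CSV Data Set Config then reads line by line. The sketch below is a minimal Java example under stated assumptions: the two columns (username, email), the row count and the file name users.csv are made up for illustration and are not specific to Tea-Store.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class TestDataGenerator {

    // Builds `rows` CSV lines of the form "username,email".
    // The column layout is hypothetical; adapt it to the requests under test.
    static List<String> generateRows(int rows) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= rows; i++) {
            String username = "user" + i;
            // A random UUID keeps the e-mail addresses unique across runs.
            String email = username + "-" + UUID.randomUUID() + "@example.com";
            lines.add(username + "," + email);
        }
        return lines;
    }

    // Writes the lines to `target`, one record per line, without a header row.
    static void writeCsv(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "users.csv" is an assumed file name; point CSV Data Set Config at it.
        writeCsv(Paths.get("users.csv"), generateRows(1000));
    }
}
```

In CSV Data Set Config, the Filename field would then point at the generated file and Variable Names would be set to username,email, so that each iteration can refer to ${username} and ${email}.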
In addition to Dmitri's great answer, I would like to add my two cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range
of workload scenarios (e.g. concurrency levels, heap sizes, message
sizes, etc.). Running the tests manually for each of these scenarios
is time-consuming and likely to cause errors. Therefore it is
important to automate the performance tests prior to executing them.
We automate our performance tests using a shell script:
start_performance_test.sh.
This script may give you an idea for something similar. The article as a whole also introduces JMeter usage with some examples.

Why is saving data from an API to CSV faster than uploading it to MongoDB database

My question revolves around understanding the following two procedures (particularly performance and code logic) that I used to collect trade data from the US Census Bureau API. I already collected the data, but I ended up writing two different ways of requesting and saving it, and my questions are about those two approaches.
Summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited my procedure using tiny-async-pool (sets concurrency of a certain function to perform) to not try to request too much at once or receive a timeout or overload my database with queries. Simply put, the bottleneck I was facing was the database since the API requests returned rather quickly (depending on body size 1-15 secs), but to save each array item (return data was nested array, sometimes from a few hundred items to over one hundred thousand items with max 10 values in each array) to its own mongodb document ranged from 100 ms to 700 ms. To save time from potential errors and not redoing the same queries, I also performed a check in my database before making the query to see if the query was already complete. The end result was that I did not follow this method since it was very error prone and susceptible to timeouts if the data was very large (I even set the timeout to 10 minutes in request options).
Second way: npm request and save data to csv
I used the same approach as the first method for the requests and concurrency, however I saved each query to its own csv file. In case of errors and not redoing successful queries I also did a check to see if the file already existed and if so skipped that query. This approach was error free, I ran it and after a few hours was able to have all the data saved. To write to csv was insanely fast, much more so than using mongodb.
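The "one file per query, skip it if the file already exists" check described above can be sketched as follows. The original code was JavaScript; this is a Java rendering for illustration only, and the names CsvCheckpoint, saveIfMissing and fetchRows are hypothetical. Writing to a temporary file before renaming is an extra precaution that is not part of the procedure as described.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.function.Supplier;

class CsvCheckpoint {

    // Fetches and saves the rows for one query unless its CSV file already exists.
    // Returns true when the file was written, false when the query was skipped.
    static boolean saveIfMissing(Path dir, String queryId, Supplier<List<String>> fetchRows)
            throws IOException {
        Path target = dir.resolve(queryId + ".csv");
        if (Files.exists(target)) {
            return false; // finished on an earlier run, so the API is not called again
        }
        List<String> rows = fetchRows.get(); // stands in for the API request
        // Assumption beyond the original: write under a temporary name first so a
        // crash mid-write never leaves a partial file that looks like a finished query.
        Path tmp = dir.resolve(queryId + ".csv.tmp");
        Files.write(tmp, rows, StandardCharsets.UTF_8);
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```

Because the only state is the presence of a file, a failed run can simply be restarted and will pick up where it left off.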
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used JavaScript because that's where I learned API requests and async operations, even though I will do most of my data analysis with Python and pandas. I first tried the database method mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques, I still could not get it to work properly. I resorted to the csv method, which was a) much less code to write, b) fewer checks, c) faster, and d) more reliable.
My final questions are these:
Why was the csv approach better than the database approach? Any counter arguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?

How to use Task Parallel Library with ADO .NET data access

I'm trying to optimise ADO.NET (.NET 4.5) data access with the Task Parallel Library (.NET 4.5). For example, when selecting 1000,000,000 records from a database, how can we use the machine's multicore processor effectively with the Task Parallel Library? If anyone has found useful sources to get some idea, please post :)
The following applies to all DB access technologies, not just ADO.NET.
Client-side processing is usually the wrong place to solve data access problems. You can achieve several orders of magnitude improvement in performance by optimizing your schema, creating proper indexes and writing proper SQL queries.
Why transfer 1M records to a client for processing, over a limited network connection with significant latency, when a proper query could return the 2-3 records that matter?
RDBMS systems are designed to take advantage of available processors, RAM and disk arrays to perform queries as fast as possible. DB servers typically have far larger amounts of RAM and faster disk arrays than client machines.
What type of processing are you trying to do? Are you perhaps trying to analyze transactional data? In this case you should first extract the data to a reporting database or, better yet, an OLAP database. A star schema with proper indexes and precalculated analytics can be 1000x faster than an OLTP schema for analysis.
Improved SQL coding can also result in a 10x-50x improvement or more. A typical mistake by programmers not accustomed to SQL is to use cursors instead of set operations to process data. This usually leads to horrendous performance degradation, on the order of 50x or worse.
Pulling all data to the client to process them row-by-row is even worse. This is essentially the same as using cursors, only the data has to travel over the wire and processing will have to use the client's often limited memory.
The only place where asynchronous processing offers any advantage is when you want to fire off a long operation and execute code when processing finishes. ADO.NET already provides asynchronous operations using the APM model (BeginExecute/EndExecute). You can use TPL to wrap this in a task to simplify programming, but you won't get any performance improvements.
It could be that your problem is not suited to database processing at all. If your algorithm requires that you scan the entire dataset multiple times, it would be better to extract all the data to a suitable file format in one go, and transfer it to another machine for processing.

PostgreSQL. Slow queries in log file are fast in psql

I have an application written on Play Framework 1.2.4 with Hibernate(default C3P0 connection pooling) and PostgreSQL database (9.1).
Recently I turned on slow queries logging ( >= 100 ms) in postgresql.conf and found some issues.
But when I tried to analyze and optimize one particular query, I found that it is blazing fast in psql (0.5 - 1 ms) in comparison to 200-250 ms in the log. The same thing happened with the other queries.
The application and the database server are running on the same machine and communicate over the localhost interface.
JDBC driver - postgresql-9.0-801.jdbc4
I wonder what could be wrong, because query duration in the log is calculated considering only database processing time excluding external things like network turnarounds etc.
Possibility 1: If the slow queries occur occasionally or in bursts, it could be checkpoint activity. Enable checkpoint logging (log_checkpoints = on), make sure the log level (log_min_messages) is 'info' or lower, and see what turns up. Checkpoints that're taking a long time or happening too often suggest you probably need some checkpoint/WAL and bgwriter tuning. This isn't likely to be the cause if the same statements are always slow and others always perform well.
Possibility 2: Your query plans are different because you're running them directly in psql while Hibernate, via PgJDBC, will at least sometimes be doing a PREPARE and EXECUTE (at the protocol level so you won't see actual statements). For this, compare query performance with PREPARE test_query(...) AS SELECT ... then EXPLAIN ANALYZE EXECUTE test_query(...). The parameters in the PREPARE are type names for the positional parameters ($1,$2,etc); the parameters in the EXECUTE are values.
If the prepared plan is different to the one-off plan, you can set PgJDBC's prepare threshold via connection parameters to tell it never to use server-side prepared statements.
This difference between the plans of prepared and unprepared statements should go away in PostgreSQL 9.2. It's been a long-standing wart, but Tom Lane dealt with it for the upcoming release.
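A minimal sketch of the comparison described in Possibility 2, written as a small Java helper that builds the statements to paste into psql and a JDBC URL carrying the prepareThreshold connection parameter. The query, the orders table, the database name appdb and the helper names are all hypothetical. To my understanding, a prepareThreshold of 0 tells PgJDBC not to switch to server-side prepared statements, but check the documentation for the driver version in use.

```java
class PlanComparison {

    // Builds the PREPARE statement; paramTypes are type names for $1, $2, ...
    static String prepareStatement(String name, String paramTypes, String query) {
        return "PREPARE " + name + "(" + paramTypes + ") AS " + query + ";";
    }

    // Builds the EXPLAIN ANALYZE EXECUTE statement; paramValues are actual values.
    static String explainExecute(String name, String paramValues) {
        return "EXPLAIN ANALYZE EXECUTE " + name + "(" + paramValues + ");";
    }

    // Builds a PgJDBC URL with prepareThreshold=0 (assumed here to disable
    // server-side prepared statements for connections opened with this URL).
    static String jdbcUrlWithoutServerPrepare(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database + "?prepareThreshold=0";
    }

    public static void main(String[] args) {
        // Hypothetical query against a hypothetical table, for illustration only.
        String query = "SELECT * FROM orders WHERE customer_id = $1";
        System.out.println(prepareStatement("test_query", "integer", query));
        System.out.println(explainExecute("test_query", "42"));
        System.out.println(jdbcUrlWithoutServerPrepare("localhost", 5432, "appdb"));
    }
}
```

Comparing the plan printed by EXPLAIN ANALYZE EXECUTE with the plan of the same query run directly (with the literal value inlined) shows whether the prepared plan is the slow one; if it is, the URL form above is one way to pass the threshold to the driver.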
It's very hard to say for sure without knowing all the details of your system, but I can think of a couple of possibilities:
The query results are cached. If you run the same query twice in a short space of time, it will almost always complete much more quickly on the second pass. PostgreSQL maintains a cache of recently retrieved data for just this purpose. If you are pulling the queries from the tail of your log and executing them immediately this could be what's happening.
Other processes are interfering. The execution time for a query varies depending on what else is going on in the system. If the queries are taking 100ms during peak hour on your website when a lot of users are connected but only 1ms when you try them again late at night this could be what's happening.
The point is you are correct that the query duration isn't affected by which library or application is calling it, so the difference must be coming from something else. Keep looking, good luck!
There are several possible reasons. First, if the database was very busy when the slow queries executed, the queries may have been slower. So you may need to record the OS load at that moment for later analysis.
Second, the plan used when the slow query ran may differ from the plan you get in your current session. So you may need to install auto_explain to see the actual plan of the slow query.