client_backend vs parallel_worker? - postgresql

I'm running:
select *
from pg_stat_activity
And it shows 2 rows with the same query text (in the query field), both in the active state,
but one row shows client_backend (backend_type) and the other shows parallel_worker (backend_type).
Why do I have 2 instances of the same query? (I ran just one query from my app.)
What is the difference between client_backend and parallel_worker?

Since PostgreSQL v10 there is parallel processing for queries:
If the optimizer decides it is a good idea and there are enough resources, PostgreSQL will start parallel worker processes that execute the query together with your client backend. Eventually, the client backend will gather all the information from the parallel workers and finish query processing.
This speeds up query processing, but uses more resources on the database server.
The parameters that govern this are, among others, max_parallel_workers, which caps the total number of parallel worker processes in the cluster, and max_parallel_workers_per_gather, which limits the number of parallel workers for a single query.
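On PostgreSQL 13 or later, pg_stat_activity has a leader_pid column that ties each parallel worker to the client backend it belongs to; a sketch of how to see the relationship (requires a live session running a parallel query):

```sql
-- leader_pid is NULL for the client backend itself and holds the
-- leader's pid for its parallel workers (PostgreSQL 13+).
SELECT pid,
       leader_pid,
       backend_type,
       state,
       query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY COALESCE(leader_pid, pid), backend_type;
```

Rows sharing the same COALESCE(leader_pid, pid) value belong to one and the same query.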

Related

MongoDB process documents concurrently (simple task queue)

Consider that I have a collection of tasks in MongoDB 5.2 and a number of independent worker processes that need to take batches of tasks and process them.
Is there a way to do this in MongoDB in a safe, concurrent way? E.g. only one worker should be working on a specific set of tasks. Other workers should be able to get unclaimed tasks and process them in parallel without stepping on each other's toes.
PostgreSQL has the very convenient SELECT ... FOR UPDATE SKIP LOCKED, which fetches unclaimed records and locks them at the same time. Is it possible to implement a similar system using MongoDB?
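A common approach is to claim tasks one at a time with findOneAndUpdate, which is atomic per document, so two workers can never claim the same task. A mongosh sketch; the collection name, status values, and workerId variable are assumptions, not from the question:

```javascript
// Atomically claim one unclaimed task; returns null when nothing is pending.
const task = db.tasks.findOneAndUpdate(
  { status: "pending" },                    // only unclaimed tasks
  { $set: {
      status: "processing",
      workerId: workerId,                   // hypothetical id of this worker
      claimedAt: new Date()
  } },
  { sort: { createdAt: 1 },                 // oldest task first
    returnDocument: "after" }               // return the updated document
);
```

Calling this in a loop until it returns null gives each worker its own stream of tasks; a periodic sweep that resets stale "processing" tasks (e.g. by claimedAt age) covers workers that die mid-job.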

Measuring load per database in Postgres using 'active' processes in pg_stat_activity?

I am trying to measure the load that various databases living on the same Postgres server are incurring, to determine how to best split them up across multiple servers. I devised this query:
select
    now() as now,
    datname as database,
    usename as user,
    count(*) as processes
from pg_stat_activity
where state = 'active'
  and waiting = 'f'
  and query not like '%from pg_stat_activity%'
group by
    datname,
    usename;
But there were surprisingly few active processes!
Digging deeper I ran a simple query that returns 20k rows and took 5 seconds to complete, according to the client I ran it from. When I queried pg_stat_activity during that time, the process was idle! I repeated this experiment several times.
The Postgres documentation says active means
The backend is executing a query.
and idle means
The backend is waiting for a new client command.
Is it really more nuanced than that? Why was the process running my query not active when I checked in on it?
If this approach is flawed, what alternatives are there for measuring load at a database granularity, other than periodically sampling the number of active processes?
Your expectations regarding active, idle, and idle in transaction are right. The only explanation I can think of is a large delay in displaying the data client-side: the query actually finished on the server and the session went idle, yet you had not seen the result in the client.
Regarding the load measurement: I would not rely much on the number of active sessions. It is pure luck to catch a fast query in the active state. For example, you could check pg_stat_activity every second and see one active session, while between measurements one db was queried 10 times and another once; neither number will be seen, because those sessions were active between your samples. And these 10+1 active states (although they mean one db is queried 10 times more often) tell you little about load, because the cluster is so lightly loaded that you can't even catch the executions. Conversely, you can catch many active sessions and it still would not mean the server is actually loaded.
So at the very least, add now() - query_start to your query to catch longer-running queries. Even better, save execution times for some frequent queries and measure whether they degrade over time. Or better still, select the pid and check the resources consumed by that process.
Btw, for longer queries look into pg_stat_statements; watching how its numbers change over time can give you some idea of how the load changes.
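The now() - query_start suggestion can be sketched as a sampling query; the one-second threshold is an arbitrary choice, not from the original answer:

```sql
-- Currently running statements that have been active for more than
-- one second: slow enough to be caught reliably by periodic sampling.
SELECT datname,
       pid,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 second'
ORDER BY runtime DESC;
```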

How read in parallel all data from OrientDB

I want to read all data from an OrientDB database, and I don't want an iterator; I want to read all the data in parallel, in chunks, from distinct PCs across the network. Is there any way to read the database's clusters in parallel, or any other way to do this?
I have seen the Spark connector for OrientDB; it queries the clusters of the OrientDB classes directly in order to read the values of a complete class in parallel.
Orient-Spark
Git-code
You can use PARALLEL in a SELECT query.
See: https://orientdb.com/docs/last/SQL-Query.html
PARALLEL executes the query against x concurrent threads, where x is the number of processors or cores found on the host operating system running the query. You may find PARALLEL execution useful on long-running queries or queries that involve multiple clusters. For simple queries, using PARALLEL may cause a slowdown due to the overhead inherent in using multiple threads.
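In OrientDB SQL the keyword simply trails the statement; a minimal sketch, where the class name Person and field age are assumptions:

```sql
-- Scan the Person class using one thread per core on the server.
SELECT FROM Person WHERE age > 30 PARALLEL
```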

Does PostgreSQL query partitions in parallel?

Postgres now has parallel queries. Are parallel queries used when the table is partitioned, the query is on the master table, and more than one partition (child table) is involved?
For example, I partition by the hour of the day. Then I want to count a type of event over more than one hour. The aggregation can be done on each partition, with the results added up at the end.
The alternative is to use a union between the partitions (child tables). In this case Postgres does parallel execution.
No, partitions are not queried in parallel. At this time (9.6) only table scans use parallel execution. The table is divided among the available workers, and each worker scans part of the table. At the end the primary worker combines the partial results.
A side effect of this is that the optimizer is more likely to choose a full table scan when parallel query execution is enabled.
As far as I can tell, there is no plan to parallelize execution based on partitions (or union all). A suggestion to add this has been added here.
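To check whether a given query gets a parallel plan at all, you can look for a Gather node in its EXPLAIN output; a sketch, where the table and column names are assumptions:

```sql
-- Allow up to 4 workers per Gather node for this session, then inspect
-- the plan: a parallel plan shows "Gather" with "Workers Planned: N".
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT count(*) FROM events WHERE event_type = 'click';
```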
Edit: My original answer was wrong. This answer has been completely revised.

How can I speed up MongoDB?

I have a crawler which consists of 6+ processes. Half of the processes are masters which crawl the web, and when they find jobs they put them inside a jobs collection. Most of the time the masters save 100 jobs at once (I mean they get 100 jobs and save each of them separately, as fast as possible).
The second half of the processes are slaves which constantly check whether new jobs of some type are available for them; if so, a slave marks them in_process (done using findOneAndUpdate), then processes the job and saves the result in another collection.
Moreover, from time to time the master processes have to read a lot of data from the jobs collection to synchronize.
So, to sum up, there are a lot of read and write operations on the db. When the db was small it worked ok, but now that I have ~700k job records (a job document is small: it has 8 fields and proper indexes / compound indexes) my db slacks. I can observe it when displaying the "stats" page, which basically executes ~16 count operations with some conditions (on indexed fields).
When the master/slave processes are not running, the stats page displays after 2 seconds. When they are running, the same page takes around 1 minute to display, and sometimes it does not display at all (timeout).
So how can I make my db handle more requests per second? Do I have to replicate it?