HiveQL counter limit exceeded error

I am running a create table query in HiveQL and get the following error when it runs:
Status: Failed
Counters limit exceeded: Too many counters: 2001 max=2000
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Counters limit exceeded: Too many counters: 2001 max=2000
I have attempted to set the counters to a greater number, i.e.
set tez.counters.max=16000;
However, it still falls over with the same error.
My query incorporates 13 left joins, but the data sets are relatively small (thousands of rows). The query did work when there were roughly 10 joins, but since I added more joins it has started to fail.
Any suggestions on how I can configure this to work would be greatly appreciated!

You need to find the real initial error in the log of a failed container. The error you have shown here is not the initial error. 2001 containers (including their restart attempts) failed because of some other error (which is what you really need to fix); then the whole job was terminated and all other containers were killed because the Failed Counters limit was reached. Go to the job tracker, find a failed (not killed) container, and read its log. The real problem is not the limit, and changing the Failed Counters limit will not help.

Divide your query into multiple steps and then run them.
As you said, your query works with 10 joins. So first create a table that holds the result of the first 10 joins, and then create a second table that joins this new table with the three remaining tables.
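A rough sketch of that two-step split, assuming hypothetical table and column names (base_table, t1 ... t13, id, stage_1, final_table are all placeholders, and only two of the first ten joins are written out):

```sql
-- Step 1: materialize the joins that are known to work.
-- Only two are shown; the real query would list the first ten.
CREATE TABLE stage_1 AS
SELECT b.id, b.col_a, t1.col_1, t2.col_2
FROM base_table b
LEFT JOIN t1 ON b.id = t1.id
LEFT JOIN t2 ON b.id = t2.id;

-- Step 2: join the intermediate table to the remaining three tables.
CREATE TABLE final_table AS
SELECT s.*, t11.col_11, t12.col_12, t13.col_13
FROM stage_1 s
LEFT JOIN t11 ON s.id = t11.id
LEFT JOIN t12 ON s.id = t12.id
LEFT JOIN t13 ON s.id = t13.id;

-- Clean up the intermediate table once the final one exists.
DROP TABLE stage_1;
```

Each statement runs as its own job, so neither one has to carry all 13 joins at once.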
I faced the same issue when I was applying a UNION ALL statement to 100 tables. When I ran it on only 10 tables at a time, it worked.
Hope this helps!

Related

spark.databricks.queryWatchdog.outputRatioThreshold Error for FPGrowth using Pyspark on Databricks

I'm working on Market Basket Analysis using Pyspark on Databricks.
The transactional dataset consists of a total of 5.4 Million transactions, with approx. 11,000 items.
I'm able to run FPGrowth on the dataset, but whenever I'm trying to either display or take a count of model.freqItemsets & model.associationRules, I'm getting this weird error every time:
org.apache.spark.SparkException: Job 88 cancelled because Task 8084 in Stage 283 exceeded the maximum allowed ratio of input to output records (1 to 235158, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold
I don't understand why I am facing this error or how I can resolve it.
Any help would be appreciated. Thanks in advance!
I tried reading the docs provided by Databricks, but I still can't work out why I am getting this error.

Is there a way to query Prometheus to count failed jobs in time range?

There are several metrics collected for cron jobs; unfortunately I'm not sure how to use them properly.
I wanted to use the kube_job_status_failed == 1 metric. I can use a regex such as job=~".+myjobname.+" to aggregate all failed attempts for a cron job.
This is where I got stuck. Is there a way to count the number of distinct labels (= number of failed attempts) in a given time period?
Or can I use the metrics the other way around, meaning checking whether there was kube_job_status_succeeded{job=~".+myjobname.+"}==1 in a given time period?
I feel like I’m so close to solving this but I just can’t wrap my head around it.
EDIT: Added a picture. It shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.
This should give you the number of failed jobs matching the job name in 1h period:
count_over_time((kube_job_status_failed{job=~".+myjobname.+"} == 1)[1h:])
I searched for this answer myself and found offset working for my purpose.
kube_job_failed{job_name=~"^your_job_name.*", namespace="your_teamspace",} - kube_job_failed{job_name=~"^your_job_name.*", namespace="your_teamspace",} offset 6h > 2
I needed 6h rather than 1h, and the number of failed jobs to be larger than 2 in this time range.

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return error 40001 faster, even if it caused more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions" but I could set the timeout only for transactions by this client that's able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking being better than failing is when, on retry, you can pick a different parent row, if that makes sense in your context. In this case, maybe the best thing to do is explicitly lock the parent row with NOWAIT before attempting the insert. That way you can perhaps deal with failures in a more nuanced way.
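A minimal sketch of the NOWAIT idea, assuming the parent row lives in the same restservices table (as the error context suggests) and using 'parent-id-placeholder' as a stand-in for the real parent_id value:

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Take the same kind of row lock the foreign-key check takes, but do not wait:
-- if another transaction holds a conflicting lock on the parent row, this
-- fails immediately with SQLSTATE 55P03 (lock_not_available).
SELECT 1
FROM restservices
WHERE id = 'parent-id-placeholder'  -- hypothetical value; use the parent_id of the new row
FOR KEY SHARE NOWAIT;

-- If the SELECT succeeded, run the original INSERT here.

COMMIT;
```

On a 55P03 error the client can roll back and decide what to do (pick another parent, back off, or report the conflict) instead of stalling. If some waiting is acceptable, SET LOCAL lock_timeout is a related option that bounds only the time spent waiting for locks, not the whole statement.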
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)

How to determine number of write transactions per second in Postgres

Is there a way to measure how many write transactions are happening per second in Postgres? As I understand pg_stat_database.xact_commit will show total number of transactions committed, but I want to exclude readonly queries and only see the number of commits that actually modified data.
Run
SELECT txid_current();
to get the current transaction number.
If you do that at two points in time and subtract the numbers, you know how many transactions (committed or rolled back) have occurred in the meantime.
Read-only transactions do not consume a transaction ID.
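As a sketch, the two samples can be taken in a single psql session. This assumes psql with autocommit on (the default); the 10-second window and the names start_txid and write_tx_per_sec are arbitrary choices for the example:

```sql
-- First sample: store the current transaction ID in the psql variable start_txid.
SELECT txid_current() AS start_txid \gset

-- Wait for the measurement window (10 seconds here).
SELECT pg_sleep(10);

-- Second sample. Calling txid_current() here consumes one transaction ID itself,
-- so subtract 1 to count only the IDs used by other sessions.
SELECT (txid_current() - :start_txid - 1) / 10.0 AS write_tx_per_sec;
```

The result still includes transactions that were rolled back after writing, so treat it as an upper bound on committed write transactions.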
This script can be used to count number of transaction commits performed between starting the script and killing it: https://gist.github.com/dmos62/aa754a04ff8bf36d6565d74b2dad6513
Usage looks like this:
./count_txs.sh psql postgresql://x:y@z:1234/w
ctrl-c to stop counting
^C
55
This means that 55 transaction commits have been performed between starting the script and killing it.

Cannot update large amount of records in orientdb

While updating 20,000 records through the OrientDB Java API, I get the following warning message. A new process then starts and updates the records again from the beginning, even though the previous update process is still running and has already updated 12,000 records.
warning: connection re-acquired transparently after xxx ms and y retries : no errors will be thrown at application level
I tried to insert the 20,000 records with an increased timeout period, but that doesn't work.
Could you please help me stop the new process from starting?