Count(*) returns a value of 0 (even though it should be at least 1+) - hiveql

I have a table of users in S3 that I'm running some queries against. In particular, I'm trying to get a count of records for a particular user ID. I start by querying the entire table as:
Select *
From table
Limit 100
That works just fine and returns results. I then copy one of the user IDs that I get from that result and run this query:
Select count(id)
From table
Where id = 'abc123'
Since I copied the user ID directly from the table, I should get a count of at least 1, as I know there is at least one record for that ID. However, Hive returns a result of 0.
I have tried ANALYZE TABLE to compute statistics and then re-ran my query, but I still got a count of 0.
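For reference, the statistics commands I ran were roughly these (a sketch, assuming the table is unpartitioned; table is the same placeholder name as in the queries above):
ANALYZE TABLE table COMPUTE STATISTICS;
ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS;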
So I then tried the following query, but it timed out and wouldn't show any results. The query reportedly took 0.001 seconds to run, but then it just sat there "loading..." the table until I eventually got a message saying "Operation timed out."
Select *
From table
Where id = 'abc123'
Limit 100
Any thoughts on why this may be happening or how to fix it?
Thanks!

Related

Does Postgres lock more rows than the limit provided?

If the LIMIT is applied to the rows returned from a query, does this mean that more rows could be locked than are returned?
Like this:
select * from myTable where status = 'READY' limit 10 FOR UPDATE
If there are 1000 rows with a status of READY, does it lock all of them but only return 10?
I am seeing quite a costly LockRows node in my explain plan and am trying to understand why.
Thanks
From the documentation, it seems pretty clear that only the actual records returned by your select query would be locked:
FOR UPDATE causes the rows retrieved by the SELECT statement to be locked as though for update. This prevents them from being locked, modified or deleted by other transactions until the current transaction ends.
That being said, one possible explanation for why the LockRows operation seems so costly is that, in order to isolate the 10 records you want to lock, Postgres first has to do a sort to implement the LIMIT. That operation involves the entire table, so for a large table, and without an index to help, it could take some time.
Let's say this were your actual query:
select * from myTable where status = 'READY' order by some_col limit 10 FOR UPDATE
This query would benefit from the following index:
create index idx on myTable (status, some_col);
The first column in the index status would let Postgres discard records not matching the WHERE filter. After this, the index also covers some_col, which means Postgres could easily find the limit 10 records you want already in the correct order.
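If you want to verify which plan is chosen, a quick sketch (note that EXPLAIN ANALYZE actually executes the query, so it will take the row locks):
explain (analyze, buffers)
select * from myTable where status = 'READY' order by some_col limit 10 FOR UPDATE;
With the index in place, the plan should show an index scan on idx feeding the Limit and LockRows nodes, rather than a Sort over all the READY rows.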

PostgreSQL 9.6 deletes suddenly became slow

I have a database table where debug log entries are recorded. There are no foreign keys - it is a single standalone table.
I wrote a utility to delete a number of entries starting with the oldest.
There are 65 million entries, so I deleted them 100,000 at a time to give some progress feedback to the user.
There is a primary key column called id.
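A minimal sketch of such a batched delete of the oldest rows (the utility's exact statement isn't shown, so this is just one plausible form, using the id primary key):
delete from public.inettklog
where id in (select id from public.inettklog order by id limit 100000);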
All was going fine until it got to about 5,000,000 records remaining. Then it started taking over 1 minute to execute.
What is more, if I use pgAdmin and type the query in myself, but use an id that I know is less than the minimum id, it still takes over one minute to execute!
I.e.: delete from public.inettklog where id <= 56301001
And I know the min(id) is 56301002
Here is the result of an explain analyze
Your stats are way out of date. It thinks it will find 30 million rows, but instead finds zero. ANALYZE the table.
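Concretely, something along these lines (the table name is taken from the query above):
analyze public.inettklog;
If autovacuum hasn't kept up after deleting tens of millions of rows, a vacuum analyze on the table may also help, since it both cleans up dead tuples and refreshes the planner's statistics.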

The last updated data shows first in the Postgres select query?

I have a simple query that takes some results from the User model.
Query 1:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
LIMIT 20 OFFSET 0;
Result:
Then I did some update on the customer 341683 and ran the same query again; this time the result was different, i.e. the last updated row showed first. So is Postgres returning the last updated row first by default, or is something else happening here?
Without an order by clause, the database is free to return rows in any order, and will usually just return them in whichever way is fastest. It stands to reason the row you recently updated will be in some cache, and thus returned first.
If you need to rely on the order of the returned rows, you need to explicitly state it, e.g.:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
ORDER BY id -- Here!
LIMIT 20 OFFSET 0

How can you use 'FOR UPDATE SKIP LOCKED' in Postgres without locking rows in all tables used in the query?

I want to use Postgres's SELECT ... FOR UPDATE SKIP LOCKED functionality to ensure that two different users reading from a table and claiming tasks neither block each other nor pick up tasks already being read by the other user.
A join is used in the query to retrieve tasks. We do not want row-level locking on any table except the one that contains the main info. Sample query below - lock only the rows of the task table:
SELECT v.someid, v.info, v.parentinfo_id, v.stage
FROM task v, parentinfo pi
WHERE v.stage = 'READY_TASK'
  AND v.parentinfo_id = pi.id
  AND pi.important_info_number = (SELECT MAX(important_info_number) FROM parentinfo)
ORDER BY v.id
LIMIT 200
FOR UPDATE SKIP LOCKED;
Now if user A is retrieving some 200 rows of this table, user B should be able to retrieve another set of 200 rows.
EDIT: As per the comment below, the query will be changed to:
SELECT v.someid, v.info, v.parentinfo_id, v.stage
FROM task v, parentinfo pi
WHERE v.stage = 'READY_TASK'
  AND v.parentinfo_id = pi.id
  AND pi.important_info_number = (SELECT MAX(important_info_number) FROM parentinfo)
ORDER BY v.id
LIMIT 200
FOR UPDATE OF v SKIP LOCKED;
How best to place the ORDER BY so that the rows are ordered? While the order will be affected if multiple users invoke this command, some ordering of the rows being returned should still be maintained.
Also, does this ensure that multiple threads invoking the same SELECT query retrieve different sets of rows, or is the locking only done for UPDATE commands?
Just experimented with this a little bit - multiple SELECT queries will end up retrieving different sets of rows. Also, ORDER BY ensures the order of the final result obtained.
Yes,
FOR UPDATE OF table_name SKIP LOCKED
will lock rows only in table_name (in your query, the alias v, i.e. the task table).
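A minimal sketch of how the claim pattern behaves with two concurrent sessions (simplified to the task table alone; names follow the queries above):
-- Session A
BEGIN;
SELECT v.someid FROM task v
WHERE v.stage = 'READY_TASK'
ORDER BY v.id LIMIT 200
FOR UPDATE OF v SKIP LOCKED;
-- ... process the claimed rows, then COMMIT
-- Session B, while A's transaction is still open: the same statement
-- skips the 200 rows locked by A and returns the next 200 rows by id.
The row locks are taken by the SELECT itself, so concurrent readers running this query get disjoint sets of rows even if they never issue an UPDATE.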

insert from select misses some rows - Postgres

We have a table with around 7M records with type ferrari and want to do a schema migration. We used this script:
insert into new_car (id, name, type, colorType)
select id, name, type, 'red'
from old_car
where type = 'ferrari'
order by id asc
The script took around 50 minutes to execute, and after it completed we realised that the new_car table has about 2M fewer records than the old_car table.
While the script was executing, the old_car table still received inserts, updates, etc. concurrently.
Could this concurrency cause some sort of problem? What are the possible causes of the missing ~2M rows? (The old_car table didn't get 2M deletes while the query was running; maybe something like 100 or 200 deletes.)
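For reference, a rough way to quantify the gap is to compare the filtered counts directly (a sketch, assuming new_car only contains rows from this migration):
select
  (select count(*) from old_car where type = 'ferrari') as old_count,
  (select count(*) from new_car) as new_count;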