I am using a hash object for recursive lookup.
Example data
id1 id2
A B
B C
C D
Output
orig_id id1 id2
A A B
A B C
A C D
I am having no issues with my recursive lookup using a hash object.
Rather, my question is: is there a practical limit to the number of keys in a hash object?
More context: I have over 10 million records of id1/id2 pairs. These are randomly generated unique identifiers for each record, so for the recursive lookup I need every pair of ids in my hash object. My understanding is that 2^20 is the upper limit on the number of hash keys, but I do not receive any warnings or errors when I load my table of id pairs as a hash object.
Alternatively, if a hash object really is capped at 2^20 keys, is there another way to perform a recursive lookup? I've looked into the data access functions but could not find one like locatec or locaten that returns the row number of an observation satisfying a condition. Is there a function I'm missing that works like locatec but returns a row number?
Let's say I want to find rows in the table my_table that have the value 5 at the first position of the array column my_array_column. To prepare the table, I executed the following statements:
CREATE TABLE my_table (
id serial primary key,
my_array_column integer[]
);
CREATE INDEX my_table_my_array_column_index on "my_table" USING GIN ("my_array_column");
SET enable_seqscan TO off;
INSERT INTO my_table (my_array_column) VALUES ('{5,7,10}');
Now, the query can look like this:
select * from my_table where my_array_column[1] = 5;
This works, but it doesn't use the created GIN index. Is it possible to search for the value 5 at a specific position with an index?
I want to find rows in the table my_table that have the value 5 at the first position of the array column
A partial index would be most efficient for that definition:
CREATE INDEX my_table_my_array_special_idx ON my_table ((true))
WHERE my_array_column[1] = 5;
If only a small fraction of rows qualifies, a partial index is accordingly smaller. Plus, the actual index column only occupies minimum space (typically 8 bytes). And, on top of that, Postgres 13 or later can apply index deduplication to make the index smaller still.
Once the index is fully cached, its small size does not make it much faster, but it still helps.
And most writes do not have to manipulate the index, which may be the most important benefit, depending on the workload.
Oh, and Postgres collects statistics for a partial index. So you can expect the query planner to make a fully educated choice when that index is involved.
Related:
PostgreSQL partial index unused when created on a table with existing data
Index that is not used, yet influences query
It's applicable when the query repeats the same condition.
Typically, you have something useful as an index column on top of your declared purpose. But if you don't, just use any small constant (true in my example); anything < 8 bytes is equally good.
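For example (a minimal sketch reusing the names from the question), a query can use the partial index as long as its WHERE clause repeats the index predicate:
-- the planner can match this predicate against the partial index
SELECT id, my_array_column
FROM my_table
WHERE my_array_column[1] = 5;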
Minor disclaimer: The "first position" in a Postgres array does not necessarily have index 1. If non-standard array indexes are possible, consider:
...
WHERE (my_array_column[:])[1] = 5;
Use that expression in both the index definition and your queries.
See:
Normalize array subscripts for 1-dimensional array so they start with 1
You can index just the first position. You need an extra set of parentheses in the create statement to do that:
create index on my_table ((my_array_column[1]));
Or you could augment your query to work with your gin index, on the theory that an array can't have the first element be 5 unless at least one element is 5.
select * from my_table where my_array_column[1] = 5 and my_array_column @> ARRAY[5];
Of course this won't be very efficient if a lot of your arrays contain 5, but in some other spot in the array. It would have to recheck all of those "false matches" to eliminate them. So if you only care about the first element, the first index I showed is better. (Of course, if you only care about the first element, why use an array to start with?)
If you always look at the first position a regular B-Tree index will do:
create index on my_table ( (my_array_column[1]) );
If you don't know the position, then a GIN index is indeed needed, but you have to use an operator that is supported by a GIN index, e.g. the containment operator @>. For that you need a different query:
select *
from my_table
where my_array_column @> array[5];
That would find all rows where the array column contains the value 5.
But you should heed the advice given in the manual regarding the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
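A minimal sketch of that normalized design; the table my_table_element and the columns elem_pos / elem_value are illustrative names, not part of the question:
-- one row per array element; "value 5 at position 1" becomes a plain B-Tree lookup
CREATE TABLE my_table_element (
    my_table_id integer NOT NULL REFERENCES my_table (id),
    elem_pos    integer NOT NULL,
    elem_value  integer NOT NULL,
    PRIMARY KEY (my_table_id, elem_pos)
);
CREATE INDEX ON my_table_element (elem_value, elem_pos);

SELECT my_table_id
FROM my_table_element
WHERE elem_value = 5
AND elem_pos = 1;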
I'm joining table A against tables B and C. All three tables have similar columns and indexes, and B and C have about the same number of rows. But A and B have indexes on nvarchar columns while C has indexes on varchar columns.
Tested separately, joining on B is 30-60 times faster than joining on C. (4 seconds vs. 2-4 minutes.) Looking at the execution plan, B uses an index seek while C uses an index scan. The details for the join on C mention implicit conversion of the varchar columns, while the join on B mentions no such conversion. Is this why it's using a scan instead of a seek, and is this probably why it's so slow? (Another potential issue: the index scan on C has an estimated number of executions of 1, but the actual number of executions is around 8500.)
C is static historical data, so I could alter the columns and rebuild the indexes if it would help.
Yes, implicit conversions may result in index scans instead of seeks. Data is converted to the data type with the higher data type precedence; as you've seen in this case, the VARCHAR column of table C is converted to an NVARCHAR value. The implicit conversion protects against losing data during conversion: NVARCHAR columns can hold significantly more distinct characters than VARCHAR, so implicitly converting the VARCHAR values from table C ensures that they are all preserved.
Details on specific implicit conversion scenarios are further outlined here. If you have the option and this won't have any negative implications elsewhere, I'd suggest making this column in table C an NVARCHAR data type.
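A hedged sketch of that change for the static table C; the column name join_col, the index name IX_C_join_col, and the length 50 are placeholders rather than details from the question (the dependent index has to be dropped before the column type can change):
-- drop the dependent index, change the type, then recreate the index
DROP INDEX IX_C_join_col ON C;
ALTER TABLE C ALTER COLUMN join_col NVARCHAR(50) NOT NULL; -- keep the column's original NULL/NOT NULL setting
CREATE INDEX IX_C_join_col ON C (join_col);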
I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B, and C each hold exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while the indexes on B and C were still running after more than 10 hours, at which point I aborted them. I ran each CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why do the second and third index creations take so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
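If in doubt, a plain ANALYZE refreshes them (assuming the table is named transactions, as in the CREATE INDEX statements above):
ANALYZE transactions;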
Then execute the following query:
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<your table name here>';
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index means a full scan of the table, creating index entries as you go. With the stats you have shared below, that means:
Column A: it was detected as unique.
A single scan is enough, as the DB knows 1 record = 1 index entry.
Columns B & C: they were detected as having very few distinct values, and abs(correlation) is very low.
Each index entry takes an entire full scan of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a low correlation means the indexes will probably not be used (an index is useful only when entries are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: as records are inserted/updated/deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution 3:
Create partitions and enjoy partition pruning.
There has been quite a lot of work on partitioning in PostgreSQL recently; it could be worth a look (a sketch follows below).
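A minimal sketch of declarative hash partitioning (PostgreSQL 11 or later); choosing B as the partition key and using four partitions are assumptions for illustration only:
-- parent table partitioned by hash of B; equality lookups on B prune to one partition
CREATE TABLE transactions_part (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (B);

CREATE TABLE transactions_part_0 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE transactions_part_1 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE transactions_part_2 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE transactions_part_3 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);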
I have a counter column family, created with the CREATE TABLE command below (the KEY is a bigint that I use to filter when querying):
CREATE TABLE BannerCount (
KEY bigint PRIMARY KEY
) WITH
comment='' AND
comparator=text AND
read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
default_validation=counter AND
min_compaction_threshold=4 AND
max_compaction_threshold=32 AND
replicate_on_write='true' AND
compaction_strategy_class='SizeTieredCompactionStrategy' AND
compression_parameters:sstable_compression='SnappyCompressor';
But when I insert data into this column family and select with a WHERE clause to filter the data, the results I retrieve are very strange, like this:
Using the query:
select count(1) From BannerCount where KEY > -1
count
-------
71
Using the query:
select count(1) From BannerCount where KEY > 0;
count
-------
3
Using the query:
select count(1) From BannerCount ;
count
-------
122
What is happening with my queries? Can anyone tell me why I get these results?
To understand the reason for this, you should understand Cassandra's data model. You're probably using RandomPartitioner here, so each of these KEY values in your table are being hashed to token values, so they get stored in a distributed way around your ring.
So finding all rows whose key has a higher value than X isn't the sort of query Cassandra is optimized for. You should probably be keying your rows on some other value, and then using either wide rows for your bigint values (since columns are sorted) or put them in a second column, and create an index on it.
To explain in a little more detail why your results seem strange: CQL 2 implicitly turns "KEY >= X" into "token(KEY) >= token(X)", so that a querier can iterate through all the rows in a somewhat-efficient way. So really, you're finding all the rows whose hash is greater than the hash of X. See CASSANDRA-3771 for how that confusion is being resolved in CQL 3. That said, the proper fix for you is to structure your data according to the queries you expect to be running on it.
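For reference, in CQL 3 syntax the same scan has to be written with token() explicitly, which makes the hash ordering visible (this is a sketch against the table above, not something CQL 2 accepts verbatim):
-- counts rows whose partition-key token (hash) is greater than the token of 0,
-- which is not the same set of rows as KEY > 0
SELECT COUNT(*) FROM BannerCount WHERE token(KEY) > token(0);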
I'm querying to return all rows from a table except those that are in some list of values that is constant at query time. E.g. SELECT * FROM table WHERE id IN (%), where % is guaranteed to be a list of values, not a subquery. However, this list of values may be up to 1000 elements long in some cases. Should I limit it to a smaller sublist (50-100 elements is as low as I can go in this case), or will the performance gain be negligible?
I assume it's a large table, otherwise it wouldn't matter much.
Depending on table size and the number of keys, this may turn into a sequential scan. If there are many IN keys, Postgres often chooses not to use an index for them. The more keys, the bigger the chance of a sequential scan.
If you use another indexed column in WHERE, like:
select * from table where id in (%) and my_date > '2010-01-01';
It's likely to fetch all rows matching the indexed (my_date) columns, and then perform an in-memory scan on them.
Using a JOIN to a persistent or temporary table may help, but does not have to. It will still need to locate all the rows, either with a nested loop (unlikely for large data sets) or with a hash/merge join.
I would say the solution is:
Use as few IN keys as possible.
Use other criteria for indexing and querying whenever possible. If IN requires an in-memory scan of all rows, at least there will be fewer of them thanks to additional criteria.
Use a temporary table to JOIN against; it gives better performance and has no limits. An IN() with 1000 arguments will give you problems in any database. A sketch of the temp-table approach follows below.
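A minimal sketch of that temp-table variant; big_table, wanted_ids, and the sample values are illustrative names, not taken from the question:
-- load the constant list once, then join instead of using a long IN list
CREATE TEMP TABLE wanted_ids (id integer PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (2), (3); -- ...and the rest of the list

SELECT t.*
FROM big_table t
JOIN wanted_ids w ON w.id = t.id;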