Cassandra paging and hash numbers

I want to use a slice query in Cassandra like this:
create table users (KEY varchar PRIMARY KEY, data varchar);
insert into users (KEY, 'data') values ('1', 'one');
insert into users (KEY, 'data') values ('2', 'two');
insert into users (KEY, 'data') values ('3', 'three');
insert into users (KEY, 'data') values ('4', 'four');
select * from users;
3 | three
2 | two
1 | one
4 | four
select * from users LIMIT 1;
3 | three
select * from users WHERE KEY > '3' LIMIT 1;
2 | two
select * from users WHERE KEY > '2' LIMIT 1;
1 | one
select * from users WHERE KEY > '1' LIMIT 1;
4 | four
In this example the partitioner is ordered, but my partitioner is unordered, so I use a query like this:
select * from users WHERE token(KEY) > token('3') LIMIT 3;
If I want to access all rows, I have to start the queries from the key with the lowest hash number.
Is there any way to find the key with the lowest hash number? If not, is there a better way to page through the table's rows?
Thanks for your help :)
Edit:
Now I have another problem. The token function is only supported on the partition key, and in my column family the primary key is compound: (word, docid). So I have, for example, several rows with word = 'hi', and when I use a query like select * from users WHERE token(word) > token('hi') LIMIT 3; it starts from the last 'hi' in my column family, so some of the rows with word = 'hi' are skipped.

Your first query can omit the WHERE clause entirely, and it will return the first keys, ordered by hash. You can then use the last key it returned in the next query.
You can also use token(''), which evaluates to the lowest token, i.e.
select * from users WHERE token(KEY) > token('') LIMIT 3;
will return the first 3 keys ordered by token and is equivalent to
select * from users LIMIT 3;
You can also make use of the new automatic paging in Cassandra 2.0 (see CASSANDRA-4415).
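To illustrate manual token paging against the users table above, here is a rough sketch (the LIMIT of 3 and the key '2' used to resume are just placeholders for whatever the previous page returned):
-- first page: start from the lowest token
select * from users WHERE token(KEY) > token('') LIMIT 3;
-- note the KEY of the last row returned, e.g. '2', then resume from its token
select * from users WHERE token(KEY) > token('2') LIMIT 3;
-- repeat until a query returns fewer rows than the LIMIT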

Related

Subquery returning field from second table with between operator using value from first table

I have a problem doing a subquery in PostgreSQL to get a calculated column value; it reports:
[21000] ERROR: more than one row returned by a subquery used as an expression
Situation
I have two tables:
accounts
postal_code_to_state
Accounts table (subset of columns)
   name   | postal_code | state
----------+-------------+-------
 Cust One | 00020       | NULL
 Cust Two | 63076       | CD
Postal Code to State
 pc_from | pc_to | state
---------+-------+-------
      10 |    30 | AB
   63000 | 63100 | CD
The accounts table has rows where the state value may be unknown, but there is a postal code value. The postal code field is char (but that is incidental).
The postal_code_to_state table has integer pc_from and pc_to columns, i.e. the low and high ends of the postal code range for a state.
There is no common field to join on. To get the state from the postal_code_to_state table, the char field is cast to INT and the BETWEEN operator is used, e.g.
SELECT state
FROM postal_code_to_state
WHERE CAST('00020' AS INT) BETWEEN pc_from AND pc_to
This works OK; there is also a unique index on pc_from and pc_to.
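For reference, the unique index mentioned here would presumably look something like this (the index name is made up for illustration):
-- hypothetical definition of the combined unique index on the range columns
CREATE UNIQUE INDEX postal_code_to_state_range_idx
    ON postal_code_to_state (pc_from, pc_to);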
But I need to run a query selecting from the accounts table and populating the state column from the state column in the postal_code_to_state table using the postal_code from the accounts table to select the appropriate row.
I can't figure out why PostgreSQL is complaining about the subquery returning multiple rows. This is the query I am currently using:
SELECT id,
name,
postal_code,
state,
(SELECT state
FROM postal_code_to_state
WHERE CAST(accounts.postal_code AS INT) BETWEEN pc_from AND pc_to) AS new_state
FROM accounts
WHERE postal_code IS NOT NULL ;
If I use LIMIT 1 in the subquery it is OK, and it returns the correct state value from postal_code_to_state, but would like to have it working without need to do that.
UPDATE 2022-10-22
#Adrian - thanks for query to find duplicates, I had to change your query a little, the != 'empty' to != FALSE.
When I run it on the data I get this; groups of two rows (1 & 2, 3 & 4, etc.) show the overlapping ranges.
 state | pc_from | pc_to
-------+---------+-------
 CA    |    9010 |  9134
 OR    |    9070 |  9170
 UD    |   33010 | 33100
 PN    |   33070 | 33170
 TS    |   34010 | 34149
 GO    |   34070 | 34170
 CB    |   86010 | 86100
 IS    |   86070 | 86170
So if I run...
SELECT pc_from,
pc_to,
state
FROM postal_code_to_state
WHERE int4range(pc_from, pc_to) #> 9070;
I get...
 pc_from | pc_to | state
---------+-------+-------
    9010 |  9134 | CA
    9070 |  9170 | OR
So, from the PostgreSQL side, the problem is clear - obviously it is the data. On the point of the data, what is shown on a site that has Italian ZIP code information is interesting:
https://zip-codes.nonsolocap.it/cap?k=12071&b=&c=
This was one of the dupes I had already removed.
The exact same ZIP code is used in two completely different provinces (states) - go figure! Given that the ZIP code is meant to resolve down to the street level, I can't see how one code can be valid for two localities.
Try this query to find duplicates:
select
a_tbl.state, a_tbl.pc_from, a_tbl.pc_to
from
postal_code_to_state as a_tbl,
(select * from postal_code_to_state) as b_tbl
where
a_tbl.state != b_tbl.state
and
int4range(a_tbl.pc_from, a_tbl.pc_to, '[]') && int4range(b_tbl.pc_from, b_tbl.pc_to, '[]') != 'empty';
If there are duplicates, after clearing them then you can do:
alter table
postal_code_to_state
add
constraint exclude_test EXCLUDE USING GIST (int4range(pc_from, pc_to, '[]') WITH &&);
This will set up an exclusion constraint to prevent overlapping ranges.
So:
insert into postal_code_to_state values (10, 30, 'AB'), (63000, 63100, 'CD');
insert into postal_code_to_state values (25, 40, 'SK');
ERROR: conflicting key value violates exclusion constraint "exclude_test"
insert into postal_code_to_state values (31, 40, 'SK');
INSERT 0 1
select * from postal_code_to_state ;
pc_from | pc_to | state
---------+-------+-------
10 | 30 | AB
63000 | 63100 | CD
31 | 40 | SK
The combined unique index would not protect you against overlapping postal codes, only duplicates. First, I'd write the query like this
SELECT id, name, postal_code, coalesce(accounts.state, postal_code_to_state.state) state
FROM accounts
LEFT JOIN postal_code_to_state ON accounts.postal_code::Integer BETWEEN pc_from AND pc_to
WHERE accounts.state IS NOT NULL OR postal_code_to_state.state IS NOT NULL;
You could modify it to tell you which are overlapping
SELECT id, coalesce(accounts.state, postal_code_to_state.state) state
FROM accounts
LEFT JOIN postal_code_to_state ON accounts.postal_code::Integer BETWEEN pc_from AND pc_to
WHERE accounts.state IS NOT NULL OR postal_code_to_state.state IS NOT NULL
GROUP BY id,state
HAVING count(id) > 1;
I haven't tested any of this.

Fast new row insertion if a value of a column depends on previous value in existing row

I have a table cusers with a primary key:
primary key(uid, lid, cnt)
And I try to insert some values into the table:
insert into cusers (uid, lid, cnt, dyn, ts)
values
(A, B, C, (
select C - cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 1
), now())
on conflict do nothing
Quite often (around 98% of the time) a row cannot be inserted into cusers because it violates the primary key constraint, so the expensive select query does not need to be executed at all. But as far as I can see, PostgreSQL first evaluates the select query to compute the dyn column and only then rejects the row because of the (uid, lid, cnt) violation.
What is the best way to insert rows quickly in such a situation?
Another explanation
I have a system where one row depends on another. Here is an example:
(x, x, 2, 2, <timestamp>)
(x, x, 5, 3, <timestamp>)
Two columns contain an absolute value (2 and 5) and a relative value (2, and 5 - 2). Each time I insert a new row it should:
avoid duplicate rows (see the primary key constraint)
if the new row differs, compute the difference and put it into the dyn column (i.e. take the last inserted row for the user according to the timestamp and subtract the values).
Another solution I've found is to use returning uid, lid, ts on the insert to get the user ids that were actually inserted - that is how I know they differ from the existing rows. Then I update the inserted values:
update cusers
set dyn = (
select max(cnt) - min(cnt)
from (
select cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 2) last_two
)
where uid = A and lid = B and ts = TS
But that is not a fast approach either, as it has to scan the ts column to find the two last inserted rows for each user. I need a fast insert query, as I insert millions of rows at a time (but I do not write duplicates).
What could the solution be? Maybe I need a new index for this? Thanks in advance.
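To illustrate the "new index" idea, a composite index covering the lookup columns and the timestamp would let the ORDER BY ts DESC ... LIMIT subqueries use an index scan instead of scanning ts; a sketch (the index name is illustrative):
-- hypothetical supporting index for the "latest rows per (uid, lid)" lookups
CREATE INDEX cusers_uid_lid_ts_idx ON cusers (uid, lid, ts DESC);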

select maximum column name from different table in a database

I am comparing values from different tables to get the COLUMN_NAME of the MAXIMUM value.
Examples.
These are example tables: Fruit_tb, Vegetable_tb, State_tb, Foods_tb
Under Fruit_tb
fr_id fruit_one fruit_two
1 20 50
Under Vegetables_tb (v = Vegetables)
v_id v_one V_two
1 10 9
Under State_tb
stateid stateOne stateTwo
1 70 87
Under Food_tb
foodid foodOne foodTwo
1 10 3
Now here is the scenario, I want to get the COLUMN NAMES of the max or greatest value in each table.
You can maybe find out the row which contains the max value of a column. For example:
SELECT fr_id , MAX(fruit_one) FROM Fruit_tb GROUP BY fr_id;
In order to find out the max value of a table:
SELECT fr_id ,fruit_one FROM Fruit_tb WHERE fruit_one<(SELECT max(fruit_one ) from Fruit_tb) ORDER BY fr_id DESC limit 1;
A follow up SO for the above scenario.
Maybe you can use GREATEST in order to get the column name which has the max value. What I'm not sure about is whether you'll be able to retrieve the columns of different tables all at once. You can do something like this to retrieve from a single table:
SELECT CASE GREATEST(`id`,`fr_id`)
WHEN `id` THEN `id`
WHEN `fr_id` THEN `fr_id`
ELSE 0
END AS maxcol,
GREATEST(`id`,`fr_id`) as maxvalue FROM Fruit_tb;
Maybe this SO could help you. Hope it helps!
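For instance, applying the same CASE/GREATEST pattern to the two data columns of Fruit_tb from the question would look roughly like this (a sketch, not tested):
SELECT CASE GREATEST(`fruit_one`, `fruit_two`)
         WHEN `fruit_one` THEN 'fruit_one'
         WHEN `fruit_two` THEN 'fruit_two'
       END AS maxcol,
       GREATEST(`fruit_one`, `fruit_two`) AS maxvalue
FROM Fruit_tb;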

Why does usage of lower() change the order of the resultset?

I have a table where I store information about users. The table has the following structure:
CREATE TABLE PERSONS
(
ID NUMBER(20, 0) NOT NULL,
FIRSTNAME VARCHAR2(40),
LASTNAME VARCHAR2(40),
BIRTHDAY DATE,
CONSTRAINT PERSONEN_PK PRIMARY KEY
(ID)
ENABLE
);
After inserting some test data:
SET DEFINE OFF;
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('1','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('2','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('3','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('6','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
I want to select all duplicates of a given user. Let's use "Max Mustermann" for example:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
This gives me a result like this:
id first last birthday
=================================
1 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
I want to do a case-insensitive compare, so I change the query using lower (and trim) like this:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname;
Now surprise the order has changed!
id first last birthday
=================================
1 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
Why does the order change just by using lower() (same result when used without trim())!? I can get a stable ordering by adding the id column to the ORDER BY, but shouldn't lower() have no effect on the ordering?
Workaround by also using id column for ORDER BY:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname,p.id;
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname,p.id;
If the values to be ordered by are identical, then the DBMS is free to choose any order it considers correct (the same way it is free to choose any order if no order by is specified altogether).
Because all values of the columns in the order by are identical, the resulting order is not stable. The only way to get a stable order is to include a unique column as an additional order criterion to break ties - exactly what you did when you added the id column.
Why does the order change, just by using lower()
From a technical point of view, I'd guess that applying lower() changed the execution plan and therefore the access path to the data.
But again (just to make sure): ordering on identical values never guarantees a stable order!
There is no ordering without an order by clause. Sometimes it looks like there might be (group by fooled a lot of people in older releases), but it's only coincidental, and must not be relied upon. In your case you're ordering by some columns, but you expect duplicates within that ordering to be further ordered implicitly, which won't happen - or at least cannot be relied on.
In this case Oracle probably happens to be retrieving the rows for your first query in the order you inserted them purely as a side effect of how it's reading data from the blocks, and the order by sorts them within that set without actually changing them (or quite likely it's skipping the order by step internally if it realises it's pointless; the explain plan would tell you that).
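If you want to verify that, a quick sketch of checking the plan for the first query from the question:
EXPLAIN PLAN FOR
SELECT p.id, p.firstname, p.lastname, p.birthday
FROM persons p
WHERE p.firstname = 'Max'
  AND p.lastname = 'Mustermann'
  AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname, p.lastname;
-- then display the plan that was just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);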
If you change the order the records are created in:
...
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
...
then the result 'order' changes too:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
ID FIRSTNAME LASTNAME BIRTHDAY
---------- -------------------- -------------------- ---------
1 Max Mustermann 31-OCT-89
2 Max Mustermann 31-OCT-89
5 Max Mustermann 31-OCT-89
4 Max Mustermann 31-OCT-89
Once you introduce the function, things change enough for that happy accident to go out of the window, even if the records are inserted in id order (which has no relevance to the DB internally). lower() isn't changing the ordering; you just aren't getting lucky any more.
You cannot expect or rely on an order unless you fully specify it in the order by clause.

SQL basic full-text search

I have not worked much with TSQL or the full-text search feature of SQL Server so bear with me.
I have a table with an nvarchar column (Col) like this:
Col ... more columns
Row 1: '1'
Row 2: '1|2'
Row 3: '2|40'
I want to do a search to match similar users. So if I have a user that has a Col value of '1' I would expect the search to return the first two rows. If I had a user with a Col value of '1|2' I would expect to get Row 2 returned first and then Row 1. If I try to match users with a Col value of '4' I wouldn't get any results. I thought of doing a 'contains' by splitting the value I am using to query but it wouldn't work since '2|40' contains 4...
I looked up the documentation on using the 'FREETEXT' keyword but I don't think that would work for me since I essentially need to break up the Col values into words using the '|' as a break.
Thanks,
John
You should not store values like '1|2' in a single field to hold 2 values. If you have a maximum of 2 values, you should use 2 fields to store them. If you can have 0-many values, you should store them in a new table with a foreign key pointing to the primary key of your table.
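A rough sketch of that normalized design, with made-up table and column names:
-- each '|'-separated value becomes its own row, linked to the owning user row
CREATE TABLE UserValues (
    UserId INT          NOT NULL,  -- FK to the main table's primary key
    Val    NVARCHAR(10) NOT NULL,
    CONSTRAINT PK_UserValues PRIMARY KEY (UserId, Val)
);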
If you only have max 2 values in your table, you can find your data like this:
DECLARE @s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE @s IN(
PARSENAME(REPLACE(col, '|', '.'), 1),
PARSENAME(REPLACE(col, '|', '.'), 2)
--,PARSENAME(REPLACE(col, '|', '.'), 3) -- if col can contain 3
--,PARSENAME(REPLACE(col, '|', '.'), 4) -- or 4 values this can be used
)
Parsename can handle max 4 values. If 'col' can contain more than 4 values, use this:
DECLARE @s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE '|' + col + '|' like '%|' + @s + '|%'
You'd need to mix this in with a CASE for when there is no '|', but this returns the left- and right-hand sides:
select left('2|10', CHARINDEX('|', '2|10') - 1)
select right('2|10', len('2|10') - CHARINDEX('|', '2|10'))
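As an alternative sketch, assuming SQL Server 2016 or later (not stated in the question), the built-in STRING_SPLIT avoids the PARSENAME/CHARINDEX juggling:
DECLARE @s VARCHAR(3) = '1'
-- match rows where any '|'-separated piece of Col equals @s
SELECT t.*
FROM <table> AS t
WHERE EXISTS (SELECT 1
              FROM STRING_SPLIT(t.Col, '|') AS parts
              WHERE parts.value = @s)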