How do I verify in CQL that all the rows have successfully copied from a CSV to a Cassandra table? SELECT statements are not returning all results - pyspark

I am trying to understand Cassandra by playing with a public dataset.
I had inserted 1.5M rows from a CSV into a table on my local instance of Cassandra, WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }.
The table was created with one field as the partition key and one more field as part of the primary key.
I received confirmation that 1.5M rows were processed:
COPY Completed
But when I run SELECT or SELECT COUNT(*) on the table, I always get a maximum of 182 rows.

Secondly, the number of records returned with clustering columns seems to be higher than with single columns, which is not making sense to me. What am I missing from Cassandra's architecture and querying point of view?
Lastly, I have also tried reading the same Cassandra table from the pyspark shell, and it reads only 182 rows as well.

Your primary key is PRIMARY KEY (state, severity). With this primary key definition, all rows for accidents in the same state of same severity will overwrite each other. You probably only have 182 different (state, severity) combinations in your dataset.
You could include another clustering column to record the unique accident, like an accident_id.
This blog highlights the importance of the primary key, and has some examples:
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
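For illustration, a sketch of what such a table definition could look like; the table name accidents and the accident_id/uuid choice are assumptions, not taken from your schema:
CREATE TABLE accidents (
    state       text,
    severity    int,
    accident_id uuid,   -- one unique value per accident
    -- ... the rest of your accident columns ...
    PRIMARY KEY ((state), severity, accident_id)
);
With state as the partition key and (severity, accident_id) as clustering columns, each accident keeps its own row instead of overwriting other rows that share the same (state, severity) pair.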

Related

How do you manage UPSERTs on PostgreSQL partitioned tables for unique constraints on columns outside the partitioning strategy?

This question is for a database using PostgreSQL 12.3; we are using declarative partitioning and ON CONFLICT against the partitioned table is possible.
We had a single table representing application event data from client activity, so each row has a client_id int4 field and a dttm timestamp field. There is also an event_id text field and a project_id int4 field, which together formed the basis of a composite primary key. (While rare, it was possible for two event records to have the same event_id but different project_id values for the same client_id.)
The table became non-performant, and we saw that queries most often targeted a single client in a specific timeframe. So we shifted the data into a partitioned table: first by LIST (client_id) and then each partition is further partitioned by RANGE(dttm).
We are running into problems shifting our upsert strategy to work with this new table. We used to perform a query of INSERT INTO table SELECT * FROM staging_table ON CONFLICT (event_id, project_id) DO UPDATE ...
But since the columns that determine uniqueness (event_id and project_id) are not part of the partitioning strategy (dttm and client_id), I can't do the same thing with the partitioned table. I thought I could get around this by building UNIQUE indexes on each partition on (project_id, event_id) but the ON CONFLICT is still not firing because there is no such unique index on the parent table (there can't be, since it doesn't contain all partitioning columns). So now a single upsert query appears impossible.
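To make the failure concrete, here is a sketch of the statement described above; the table name events and the SET list are assumptions:
INSERT INTO events
SELECT * FROM staging_table
ON CONFLICT (event_id, project_id) DO UPDATE
    SET dttm = EXCLUDED.dttm;
-- Against the partitioned parent this fails with an error along the lines of
-- "there is no unique or exclusion constraint matching the ON CONFLICT specification",
-- because the parent cannot carry a unique index on (event_id, project_id) alone.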
I've found two solutions so far but both require additional changes to the upsert script that seem like they'd be less performant:
I can still do an INSERT INTO table_partition_subpartition ... ON CONFLICT (event_id, project_id) DO UPDATE ... but that requires explicitly determining the name of the partition for each row instead of just INSERT INTO table ... once for the entire dataset.
I could implement the "old way" UPSERT procedure: https://www.postgresql.org/docs/9.4/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE but this again requires looping through all rows.
Is there anything else I could do to retain the cleanliness of a single, one-and-done INSERT INTO table SELECT * FROM staging_table ON CONFLICT () DO UPDATE ... while still keeping the partitioning strategy as-is?
Edit: if it matters, concurrency is not an issue here; there's just one machine executing the UPSERT into the main table from the staging table on a schedule.

Adding a partition from another DB to existing partitioned table in Postgres 12

Short description of a problem
I need to build a partitioned table "users" with 2 partitions located on separate servers (Moscow and Hamburg); each partition is a table with these columns:
id - integer primary key with auto increment
region - smallint partition key, which equals either 100 for Hamburg or 200 for Moscow
login - unique character varying with length of 100.
I intended to make sequences for id as n*1000 + 100 for Hamburg and n*1000 + 200 for Moscow, so that just by looking at the primary key I will know which partition a record belongs to.
region is intended to be read-only and never change after creation, so no records will move between partitions.
SELECT queries must be able to return records from all partitions, and UPDATE queries must be able to modify records in all partitions; INSERT/DELETE queries must be able to add/delete records only in the local partition, so the data stored in the partitions is not completely isolated.
What was done
Using pgAdmin4
I created a "test" table on Hamburg server, added all column info, marked it as partitioned table with partition key region and partition type List.
I created a "hamburg" partition in this table, adding primary key constraint as id,region and unique key constraint as login,region.
I created a "moscow" table on Moscow server with the same column info as "test"
I added postgres_fdw extension to Hamburg server, created Foreign server pointing to DB on Moscow server and User mapping.
I added "moscow" foreign table to Hamburg server pointing to "moscow" table on Moscow server.
What is my problem
I couldn't figure out how to attach this foreign table as second partition to "test" table.
When I tried to attach the partition through the pgAdmin dialog in the "test" table's partition properties, it shows me an error: cannot unpack non-iterable Response object
When I tried to add the partition with the following query:
ALTER TABLE public.test ATTACH PARTITION public.moscow FOR VALUES IN (200);
It shows me an error:
ERROR: cannot attach foreign table "moscow" as partition of partitioned table "test"
DETAIL: Table "test" contains unique indexes.
SQL state: 42809
I removed the unique constraint from the login column, but it shows the same error.
When I create a partitioned table with the same properties and both partitions initially located on the same server, everything works well, except that Postgres checks login uniqueness per partition rather than across the whole table, but I assume that is a limitation of partitioning.
So, how can I attach a table located on the second server as a partition of a partitioned table located on the first one?
The error message is pretty clear: since you cannot create an index on a foreign table, PostgreSQL cannot create the partition of the unique index for the foreign partition. But that unique index is required to implement the constraint.
See this source comment:
/*
* If we're attaching a foreign table, we must fail if any of the indexes
* is a constraint index; otherwise, there's nothing to do here. Do this
* before starting work, to avoid wasting the effort of building a few
* non-unique indexes before coming across a unique one.
*/
Either drop the unique constraint or don't use foreign tables as partitions.
OK, I was finally able to add a foreign table as a partition to the partitioned table.
It was necessary to drop the primary key on id and the unique constraint on login for the partitioned table.
After that I was able to attach the foreign table as a partition to the partitioned table.
Later I added a primary key on id and a unique constraint on login for each local partition.
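Roughly, the sequence was the following (the constraint names here are assumptions; the actual names may differ):
-- on the Hamburg server
ALTER TABLE public.test DROP CONSTRAINT test_pkey;              -- primary key on (id, region)
ALTER TABLE public.test DROP CONSTRAINT test_login_region_key;  -- unique on (login, region)

ALTER TABLE public.test ATTACH PARTITION public.moscow FOR VALUES IN (200);

-- re-create the constraints only on the local partition
ALTER TABLE public.hamburg ADD PRIMARY KEY (id);
ALTER TABLE public.hamburg ADD CONSTRAINT hamburg_login_key UNIQUE (login);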
So in the end I have a globally unique id, as it is generated by sequences in each DB with values that never intersect. For login uniqueness I have to manually check the global table for an existing record before inserting.
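That manual check looks roughly like this (the literal values are placeholders, and the check is not safe against concurrent inserts):
-- check across both partitions via the partitioned parent
SELECT EXISTS (SELECT 1 FROM public.test WHERE login = 'new_login') AS login_taken;

-- only if login_taken is false; id is filled by its per-server sequence default
INSERT INTO public.test (region, login) VALUES (100, 'new_login');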
P.S. Hopefully, this partitioning mechanism in postgres is suitable for geographically distant regions.

Redshift: Null Sort Key Stats in SVV_TABLE_INFO when SORTKEY AUTO is configured

I wanted to confirm if the following is true or if I am missing something: if a table in Redshift is configured with SORTKEY AUTO (assuming the compound sort style, which is the default), then when we check the metadata view SVV_TABLE_INFO for our table, both the unsorted and vacuum_sort_benefit columns are null. Per this doc, unsorted indicates what percentage of the table is not sorted and vacuum_sort_benefit indicates the percentage of improvement we would receive if we ran a vacuum. Thus, if both are null, my guess is that the optimization strategy determined (given the number of queries run against the table and their predicates/complexity) that no sort key is needed. Has anyone else seen something similar, or can anyone confirm?
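For reference, this is the kind of check being run; my_table is a placeholder name:
SELECT "table", sortkey1, unsorted, vacuum_sort_benefit
FROM   svv_table_info
WHERE  "table" = 'my_table';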

Postgresql 11 Partitioning detail tables based on column in master table in foreign-key relationship

The improvements to declarative range-based partitioning in version 11 seem like they can really work for my use case, but I'm not completely sure how foreign keys work with partitions.
I have tables Files -< Segments -< Entries in which there are hundreds of Segments per File and hundreds of Entries per segment, so Entries is on the order of 10,000 times the size of Files. Files have a CreationDate field and customers will define a retention period so they'll drop old Entries. All of this clearly points to date-based partitions so it's quicker to query the most recent entries first, and easy to drop old ones.
The first issue I encounter is that when I try to create the Files table, it sounds like I have to include createdDate as part of the primary key in order to have it in the RANGE partition:
CREATE TABLE Files2
(
    FileId      BIGSERIAL PRIMARY KEY,
    createdDate timestamp with time zone,
    filepath    character varying COLLATE pg_catalog."default" NOT NULL
) PARTITION BY RANGE (createdDate)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;
ERROR: insufficient columns in PRIMARY KEY constraint definition
DETAIL: PRIMARY KEY constraint on table "files2" lacks column "createddate" which is part of the partition key.
If I drop the "PRIMARY KEY" from the definition of FileId, I don't get the error, but does that affect the efficiency of child lookups?
I also don't know how to declare the partitions for the child table. PARTITION BY RANGE (Files.createdDate) doesn't work.
Since it's only in version 11 that this use case would even be possible, I haven't found much information about it and would appreciate any pointers! Thanks!
I believe this is a new feature of version 11. There is a document from which you can get the following information:
"A primary key constraint must include a partition key column. Attempting to create a primary key constraint that does not contain partitioned columns results in an error"
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL11NewFeaturesGAen20181022-1.pdf
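For instance, a sketch of the table from the question with a composite primary key that includes the partition column (whether FileId alone stays unique enough for your child-table lookups is a separate design question):
CREATE TABLE Files2
(
    FileId      bigserial,
    createdDate timestamp with time zone NOT NULL,
    filepath    character varying COLLATE pg_catalog."default" NOT NULL,
    PRIMARY KEY (FileId, createdDate)
) PARTITION BY RANGE (createdDate);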

cassandra 2.0.9: best practices for write-heavy columns

I am a little confused by clustering in Cassandra. I have an application that is very write-heavy and update-heavy. With a traditional relational database, I'd partition data into two tables: one table for data that changes infrequently; and one table (with shorter rows) for the columns that change frequently:
For example:
create table user_def ( id int primary key, email list<varchar> );  -- stable
create table user_var ( id int primary key, state int );            -- changes all the time
But Cassandra seems to be optimized for accessing sparsely-populated columns, so I'm not sure there is any advantage in mimicking this approach for Cassandra schemas.
With Cassandra, is there any advantage in separating frequently-updated columns to a separate table/column-family (away from infrequently-updated columns) or should I combine all the columns together into one table/column-family? Do circumstances change if I have a compound primary key and clustering comes into play?
Cassandra treats primary keys like this:
The first key in the primary key (which can be a composite) is used to partition your data. This defines which node(s) your data is saved on (and replicated to). The other fields in the primary key are then used to sort entries within a partition. The whole partition is always stored on one node (and on its replica nodes) in its entirety, and each entry within a partition is sorted by those "other" fields of the primary key. [The first element of the primary key is called the partition key, while the other fields are called clustering keys.]
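For illustration, a hypothetical table with a compound primary key (the names are made up, not taken from the question):
CREATE TABLE user_state_history (
    id         int,        -- partition key: decides which node owns the rows
    updated_at timestamp,  -- clustering key: orders rows within the partition
    state      int,
    PRIMARY KEY ((id), updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
All rows with the same id live in one partition and are stored ordered by updated_at.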
Based on that, I'd say you might as well simply have a single table with id, state and email. It looks like you're using skinny rows, and I don't think you'd gain anything from creating the separate tables.
I had accepted ashic's answer until I came upon this:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
which states (for delete-heavy access):
...consider partitioning data with heavy churn rate into separate rows and deleting the entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore...
This falls under the 'queue' anti-pattern for the product.