Cassandra CQL3: count and group data imported from a CSV file

I'm new to NoSQL; my background is in SQL databases (MySQL).
In recent months I started working with big data and chose Cassandra as my NoSQL database.
This is my dev environment:
ubuntu 12.04 64 bit
cqlsh 4.1.1
Cassandra 2.0.6
CQL spec 3.1.1
Thrift protocol 19.39.0
My input is a daily CSV file with many columns, and I have to import just some of them. The structure of the CSV file:
user_id => text
col_A => int
col_B => int
col_C => int
other_col => do not import
.....
.....
.....
other_col => do not import
What is the condition for importing a CSV row?
The combination of the columns user_id + col_A + col_B + col_C must be unique.
So I thought of creating a table whose primary key is made of all of these columns:
CREATE TABLE unique_value (
user_id text,
col_A int,
col_B int,
col_C int,
PRIMARY KEY (user_id, col_A, col_B, col_C)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
During the import, only unique combinations of user_id + col_A + col_B + col_C from the CSV file will be inserted (rows sharing the same primary key simply overwrite each other), and that is what I want.
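(For example, such an import could be done with cqlsh's COPY command, assuming the daily file is first trimmed down to just these four columns; the file name below is only a placeholder.)
-- sketch: the CSV must contain exactly these four columns, in this order
COPY unique_value (user_id, col_A, col_B, col_C) FROM 'daily_trimmed.csv' WITH HEADER = true;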
After the import I need to query the table to get the total number of unique (non-duplicated) user_ids, grouped by the values of col_B. In SQL the query would be something like:
SELECT COUNT(b.user_id), b.col_B
FROM (
    SELECT COUNT(user_id) AS is_user_exclusive, user_id, col_B
    FROM unique_value
    GROUP BY user_id, col_B
    HAVING is_user_exclusive < 2
) AS b
GROUP BY b.col_B
but I still can't find the right CQL query, or more likely the right data model.
Do you have any hints?
Thank you in advance

Take a look at counters!
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_counter_t.html
You could create another table to do this counting:
create table mycounts (count counter, user_id text, col_b int, PRIMARY KEY (user_id, col_b));
Then, whenever you insert into the unique_value table, also bump the counter in the mycounts table (counter columns are modified with UPDATE, not INSERT). When you query, you just do select * from mycounts. I hope this helps!
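As a minimal sketch against the mycounts table above (the user_id and col_b values are just placeholders):
-- bump the counter alongside each insert into unique_value
UPDATE mycounts SET count = count + 1 WHERE user_id = 'some_user' AND col_b = 42;
-- read the accumulated counts back
SELECT * FROM mycounts;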

Related

Optimal approach to bulk insert of pandas dataframe into PostgreSQL table

I need to upload multiple Excel files to a PostgreSQL table, but they can overlap each other in several records, therefore I need to be aware of IntegrityErrors. I'm following two approaches:
cursor.copy_from: the fastest approach, but I don't know how to catch and handle all IntegrityErrors due to duplicate records.
streamCSV = StringIO()
streamCSV.write(invoicing_info.to_csv(index=None, header=None, sep=';'))
streamCSV.seek(0)
with conn.cursor() as c:
    c.copy_from(streamCSV, "staging.table_name", columns=dataframe.columns, sep=';')
    conn.commit()
cursor.execute: I can count and handle each exception, but it is very slow.
data = invoicing_info.to_dict(orient='records')
with cursor as c:
    for entry in data:
        try:
            c.execute(DLL_INSERT, entry)
            successful_inserts += 1
            connection.commit()
            print('Successful insert. Operation number {}'.format(successful_inserts))
        except psycopg2.IntegrityError as duplicate:
            duplicate_registers += 1
            connection.rollback()
            print('Duplicate entry. Operation number {}'.format(duplicate_registers))
At the end of the routine, I need to determine the following info:
print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))
How can I modify the first approach to control all exceptions? How can I optimize the second approach?
Since you have duplicate IDs in different Excel sheets, you first have to decide for yourself which Excel sheet's data you trust.
Since you are working with multiple tables anyway, and want to keep at least one row from each conflicting pair, you can always do the following:
create a temporary table for each Excel sheet
upload the data of each Excel sheet into its table (in bulk, like you do now)
make an insert from a select picking DISTINCT ON (id), in this manner:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2
FROM
(
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
With such an insert, PostgreSQL will keep an arbitrary row out of each set of rows sharing the same id.
In case you would like to trust the first file's records, you can add an ordering column:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2 ...
FROM
(
SELECT id, 1 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, 2 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, 3 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
ORDER BY id, ordering_column
For the initial count of objects:
SELECT sum(count)
FROM
(
SELECT count(*) as count FROM temp_table_for_excel_sheet1
UNION ALL
SELECT count(*) as count FROM temp_table_for_excel_sheet2
UNION ALL
SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data
(UNION ALL is used here because a plain UNION would collapse sheets that happen to have the same count.)
After finishing these bulk inserts you can run select count(*) FROM staging.table_name to get the total number of inserted records.
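For example, with the table name used above:
-- total number of rows that ended up in the final table
SELECT count(*) FROM staging.table_name;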
For the duplicate count you can run:
SELECT sum(count)
FROM
(
SELECT count(*) as count
FROM temp_table_for_excel_sheet2 WHERE id in (select id FROM temp_table_for_excel_sheet1)
UNION ALL
SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet1)
UNION ALL
SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet2)
) as data
If the Excel sheets contain duplicate records, Pandas seems a likely choice for identifying and eliminating the dupes: https://33sticks.com/python-for-business-identifying-duplicate-data/. Or is the issue that different records in different sheets have the same id/index? If so, a similar approach could work: use Pandas to isolate the ids used multiple times and then correct them with unique identifiers before attempting to upload to the SQL db.
For a bulk upload, I'd use an ORM. SQLAlchemy has some great info on bulk uploads: http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#bulk-operations, and there's a related discussion here: Bulk insert with SQLAlchemy ORM

How to remove duplicates in postgres (no unique id) [duplicate]

This question already has answers here:
How to delete duplicate rows without unique identifier
(10 answers)
Closed 5 years ago.
I'm having some difficulty removing duplicate rows. I thought user_id and time_id together would act as an identifier, but there were duplicates even for those.
user_id (text), time_id(bigint), value1 (numeric)
user_id; time_id; value1
aaa; 1; 3
aaa; 1; 3
aaa; 2; 4
baa; 3; 1
In this case, how do I remove the duplicates?
Since I have 16 distinct values in time_id and 15,000 distinct values in user_id, I tried something like this, but I do not have a unique id:
DELETE FROM tablename a
USING tablename b
WHERE a.unique_id < b.unique_id
AND a.user_id = b.user_id
AND a.time_id = 1 (repeat for each time_id up to 16)
Each table in Postgres has a few hidden system columns. One of them (ctid) is unique by definition and can be used in cases when a primary key is missing.
DELETE FROM tablename a
USING tablename b
WHERE a.ctid < b.ctid
AND a.user_id = b.user_id
AND a.time_id = b.time_id;
The problem is due to the lack of a primary key. Using hidden columns should not become a systematic approach. Once you delete the duplicates, you should create a primary key on (user_id, time_id), or create a new unique column for this purpose.
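A minimal sketch of that follow-up step, using the table and columns from the question:
-- once the duplicates are gone, enforce uniqueness going forward
ALTER TABLE tablename ADD PRIMARY KEY (user_id, time_id);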
Please use any advice on deletions with care, and make sure you have a way to undo it if needed. I think you need to add an auto-numbered column to assist in this endeavor:
alter table tablename add column is_uniq serial
Then I'd suggest using row_number() to help identify the rows you want to retain (where rn = 1) and those to be deleted (where rn > 1). Use the following as a guide:
select *
     , ROW_NUMBER() over (partition by user_id, time_id, value1 order by is_uniq) as rn
from tablename
I'm not sure if there are any other column(s) to use in the order by, but if there are, you can include them in the over clause as well.
Once you have the is_uniq column and the rn > 1 rows, you should be able to safely delete the unwanted rows.
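A minimal sketch of that final delete, assuming the is_uniq column added above (it keeps the first row of each duplicate group):
delete from tablename
where is_uniq in (
    -- flag every row after the first within each (user_id, time_id, value1) group
    select is_uniq
    from (
        select is_uniq
             , ROW_NUMBER() over (partition by user_id, time_id, value1 order by is_uniq) as rn
        from tablename
    ) numbered
    where rn > 1
);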
If you don't want to rely on ctid (personally, I do), you can add a unique column (such as a serial) and use that for identity purposes:
CREATE TABLE lutser
( user_id text not null
, time_i integer not null
, value integer not null
);
INSERT INTO lutser(user_id,time_i,value) VALUES
('aaa', 1, 3)
,('aaa', 1, 3)
,('aaa', 2, 4)
,('baa', 3, 1)
;
SELECT * FROM lutser;
ALTER TABLE lutser
ADD COLUMN seq serial NOT NULL UNIQUE
;
SELECT * FROM lutser;
DELETE FROM lutser del
WHERE EXISTS(
SELECT * FROM lutser x
WHERE x.user_id=del.user_id
AND x.time_i=del.time_i
AND x.seq < del.seq
);
ALTER TABLE lutser
ADD PRIMARY KEY (user_id,time_i)
;
SELECT * FROM lutser;

How to write a SQL query which selects rows where a column value changed from the previous row

CREATE TABLE status (
    id serial NOT NULL,
    val integer,
    plan smallint,
    time timestamp without time zone,
    CONSTRAINT data_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
ALTER TABLE status
    OWNER TO postgres;
-- Index: data_idx
CREATE INDEX data_idx
    ON status
    USING btree
    (time, id);
I have a table like this:
id  val   plan  time
1   8300  1     2011-01-01
2   8300  1     2011-01-02
3   8300  2     2011-01-03
4   9600  1     2011-01-04
5   9600  2     2011-01-05
How do I select the rows where the plan (sigplan) changed from the previous row for that val (siteId)?
In the example above, the query should return the rows
2011-01-03 (plan changed from 1 to 2 between 2011-01-01 and 2011-01-03 for 8300), and
2011-01-05 (plan changed from 1 to 2 between 2011-01-04 and 2011-01-05 for 9600).
The table contains a lot of data, so the query should be optimized.
SELECT siteId, sigplan, MAX(server_time) FROM traffview.status_data
GROUP BY siteId, sigplan
HAVING COUNT(1) > 1 AND MAX(server_time) > 'XXXXX' AND MAX(server_time) < 'XXXXX'
The annoying part is figuring out the id of the previous row with the same siteId. After that it is pretty easy by joining the table with itself.
SELECT t1.* FROM table t1, table t2
WHERE t1.sigplan != t2.sigplan
AND t2.id = (SELECT MAX(t3.id) FROM table t3 WHERE t3.id < t1.id AND t3.siteId = t1.siteId)
If the table is moderately (not extremely) large I would consider doing this in application code instead, or by storing the change flag in its own column when writing a new row. A subquery for each row in the table has very poor performance.
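A rough sketch of that second idea (storing the change flag when writing a new row), assuming a new boolean column; the column name plan_changed and the literal values are hypothetical:
-- hypothetical extra column holding the precomputed flag
ALTER TABLE status ADD COLUMN plan_changed boolean;

-- when writing a new row, compare against the latest existing row for the same val
INSERT INTO status (val, plan, time, plan_changed)
SELECT 9600, 2, now(),
       COALESCE((SELECT s.plan FROM status s
                 WHERE s.val = 9600
                 ORDER BY s.time DESC
                 LIMIT 1) <> 2, false);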
This version doesn't have a sub-query, but does assume that you have consecutive IDs.
SELECT t1.*
FROM traffview AS t1, traffview AS t2
WHERE
t1.siteId = t2.siteId
AND t1.sigplan <> t2.sigplan
AND t1.id - t2.id = 1
ORDER BY
t1.server_time
When you need to compare with previous rows, it is useful to use the LAG window function, which does the job for you:
SELECT sub.*
FROM (
    SELECT
        plan AS curr_plan,
        LAG(plan) OVER (PARTITION BY val ORDER BY time) AS prev_plan,
        val,
        time
    FROM status
) sub
WHERE
    sub.prev_plan IS NOT NULL AND sub.prev_plan <> sub.curr_plan;

Insert into table from select distinct query in postgresql

I have a table with 33 columns that has several duplicates, so I am trying to remove the duplicates this way, because this SELECT DISTINCT query returns the correct number of rows.
CREATE TABLE students (
school char(2),sex char(1),age int,address char(1),famsize char(3),
Pstatus char(1),Medu int,Fedu int,Mjob varchar,Fjob varchar,reason varchar,
guardian varchar,traveltime int,studytime int,failures char(1),
schoolsup varchar,famsup varchar,paid varchar,activities varchar,
nursery varchar,higher varchar,internet varchar,romantic varchar,
famrel int,freetime int,goout int,Dalc int,Walc int,
health int,absences int,id serial primary key)
I want to insert all the values from this SELECT DISTINCT query into a different empty table.
SELECT DISTINCT ("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")
FROM students;
Use create table ... as select ... if you want to create the new table:
create table new_table
as
SELECT DISTINCT school, sex, age, address, famsize, "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", reason, nursery, internet
FROM students;
Otherwise, just use an insert based on a select:
insert into empty_table (school, sex, age, address, famsize, "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", reason, nursery, internet)
SELECT DISTINCT school, sex, age, address, famsize, "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", reason, nursery, internet
FROM students;
Very important: do not put parentheses around the columns in the select list - that creates a single column with an anonymous record type.
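To illustrate the difference with a small sketch on the students table from the question (only three columns shown):
-- parentheses: one single column of an anonymous record type
SELECT DISTINCT (school, sex, age) FROM students;
-- no parentheses: separate columns, which is what the insert needs
SELECT DISTINCT school, sex, age FROM students;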
insert into destinationTable(dC1, dC2, dC3, dC4, dC5, dC6, dC7, dC8)
select sC1, sC2, sC3, sC4, sC5, sC6, sC7, sC8
from sourceTable
You can join the tables to get the 33 columns.

Copy data with new IDs

Is there any way to COPY some rows into the same table with new IDs?
My table is like this:
ID | data
1 | SOMETHING
2 | SOMETHING
3 | SOMETHING
I have old IDs: '{1,513,3,4,5}', and new ones: '{1338,7,512,9,10}', and I need to add row 1338 with the data from row 1, row 7 with the data from row 513, and so on (new[i] gets the data of old[i]).
Currently I am using a loop:
SELECT old_ids INTO oIds FROM vars_table WHERE sid = id;
FOR i IN 0..array_length(new_ids, 1) LOOP
    INSERT INTO ids(ID, data)
    SELECT new_ids[i], data
    FROM ids
    WHERE id = oIds[i]
      AND NOT EXISTS (SELECT 1 FROM ids WHERE id = new_ids[i]);
END LOOP;
Is there a better way to do this? Maybe in one query?
There is no need for a loop:
insert into the_table (id, data)
select id + 5, data
from the_table;
However the above requires you to know how many rows there are in the table. To take the current number of rows into account you can do:
insert into the_table (id, data)
select id + (select max(id) from the_table), data
from the_table;
Attention: the above is NOT safe in a multi-user environment. It should only be used if you are the only one doing this.
The best way to deal with this kind of data duplication is to define the ID column as serial and let Postgres deal with creating new values:
create table the_table (id serial not null, data text);
The initial data would then be inserted like this:
insert into the_table (data)
values ('foo'), ('bar'), ('foobar');
Duplicating the data is then as easy as:
insert into the_table (data)
select data
from the_table;