Delete row key from Cassandra CLI - NoSQL

I set my column family's gc_grace_seconds to 0, but the row key is still not deleted; it remains in my column family:
create column family workInfo123
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = true
and gc_grace = 0
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and default_time_to_live = 0
and speculative_retry = 'NONE'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor'}
and index_interval = 128;
See below the output of:
[default@winoriatest] list workInfo123;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: a
-------------------
RowKey: xx
2 Rows Returned.
Elapsed time: 17 msec(s).
I am using cassandra-cli. Should I change anything else?
UPDATE:
After using ./nodetool -host 127.0.0.1 compact:
[default@winoriatest] list workInfo123;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: xx
2 Rows Returned.
Elapsed time: 11 msec(s).
Why does xx remain?

When you delete a row in Cassandra, it does not get deleted straight away. Instead it is marked with a tombstone. The effect is that you still get a result for the key, but no columns will be delivered. The tombstone is required because:
1. Cassandra data files become read-only once they are "full"; the tombstone is added to the currently open data file containing the deleted row.
2. You have to give the cluster a chance to propagate the delete to all nodes holding a copy of the row.
For the row and its tombstone to be removed, a compaction is required. This process re-organizes the data files and, while doing so, prunes deleted rows, provided the GC grace period of the tombstone has passed. For single-node(!) clusters it is OK to set the grace period to 0, because the delete does not have to be propagated to any other node (one that might be down at the point in time you issued the delete).
If you want to enforce the removal of deleted rows, you can trigger a flush (sync memory with data files) and a major compaction via the nodetool utility. E.g.
./nodetool flush your_key_space the_column_family && ./nodetool compact your_key_space the_column_family
After the compaction completes, the deleted rows should truly be gone.
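For this particular question, a minimal end-to-end sketch (assuming the keyspace is winoriatest, as the CLI prompt above suggests, and that the row to remove is the one with key xx) would be to delete the row from cassandra-cli and then flush and compact the column family with nodetool:
[default@winoriatest] del workInfo123['xx'];
./nodetool -host 127.0.0.1 flush winoriatest workInfo123
./nodetool -host 127.0.0.1 compact winoriatest workInfo123
Because gc_grace is 0 on this column family, the tombstone becomes eligible for removal during that compaction, and a subsequent list workInfo123; should no longer show the xx row key.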

The default GC grace period is ten days (864000 seconds). To remove the row key immediately, run:
UPDATE COLUMN FAMILY column_family_name WITH GC_GRACE = 0;
Execute the above CLI statement, then follow it with the nodetool flush and compact operations.
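Applied to the column family from this question (a sketch; the keyspace name winoriatest is taken from the CLI prompt shown above), that sequence would look like:
UPDATE COLUMN FAMILY workInfo123 WITH GC_GRACE = 0;
./nodetool -host 127.0.0.1 flush winoriatest workInfo123 && ./nodetool -host 127.0.0.1 compact winoriatest workInfo123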

Related

Batch inserts and LAST_INSERT_ID with Slick and MariaDB

I'm trying to insert some data into a MariaDB database. I have two tables, and I have to insert the rows (using a batch insert) into the first table and use the IDs of the newly-inserted rows to perform a second batch insert into the second table.
I'm doing so in Scala using Alpakka Slick. For the purpose of this question, let's call tests the main table and dependent the second one.
At the moment, my algorithm is as follows:
1. Insert the rows into tests
2. Fetch the ID of the first row in the batch using SELECT LAST_INSERT_ID();
3. Knowing the ID of the first row and the number of rows in the batch, compute by hand the other IDs and use them for the insertion in the second table
This works pretty well with only one connection at a time. However, I'm trying to simulate a scenario with multiple attempts to write simultaneously. To do that, I'm using Scala parallel collections and Akka Stream Source as follows:
// three sources of 10 random Strings each
val sources = Seq.fill(3)(Source(Seq.fill(10)(Random.alphanumeric.take(3).mkString))).zipWithIndex
val parallelSources: ParSeq[(Source[String, NotUsed], Int)] = sources.par

parallelSources.map { case (source, i) =>
  source
    .grouped(ChunkSize) // performs batch inserts of a given size
    .via(insert(i))
    .zipWithIndex
    .runWith(Sink.foreach { case (_, chunkIndex) => println(s"Chunk $chunkIndex of source $i done") })
}
I'm adding an index to each Source just to use it as a prefix in the data I write to the DB.
Here's the code of the insert Flow I've written so far:
def insert(srcIndex: Int): Flow[Seq[String], Unit, NotUsed] = {
  implicit val insertSession: SlickSession = slickSession
  system.registerOnTermination(() => insertSession.close())
  Flow[Seq[String]]
    .via(Slick.flowWithPassThrough { chunk =>
      (for {
        // insert data into `tests`
        _ <- InsTests ++= chunk.map(v => TestProj(s"source$srcIndex-$v"))
        // fetch last insert ID and connection ID
        queryResult <- sql"SELECT CONNECTION_ID(), LAST_INSERT_ID();".as[(Long, Long)].headOption
        _ <- queryResult match {
          case Some((connId, firstIdInChunk)) =>
            println(s"Source $srcIndex, last insert ID $firstIdInChunk, connection $connId")
            // compute IDs by hand and write to `dependent`
            val depValues = Seq.fill(ChunkSize)(s"source$srcIndex-${Random.alphanumeric.take(6).mkString}")
            val depRows =
              (firstIdInChunk to (firstIdInChunk + ChunkSize))
                .zip(depValues)
                .map { case (index, value) => DependentProj(index, value) }
            InsDependent ++= depRows
          case None => DBIO.failed(new Exception("..."))
        }
      } yield ()).transactionally
    })
}
Where InsTests and InsDependent are Slick's TableQuery objects. slickSession creates a new session for each different insert and is defined as follows:
private def slickSession = {
  val db = Database.forURL(
    url = "jdbc:mariadb://localhost:3306/test",
    user = "root",
    password = "password",
    executor = AsyncExecutor(
      name = "executor",
      minThreads = 20,
      maxThreads = 20,
      queueSize = 1000,
      maxConnections = 20
    )
  )
  val profile = slick.jdbc.MySQLProfile
  SlickSession.forDbAndProfile(db, profile)
}
The problem is that the last insert IDs returned by the second step of the algorithm overlap. Every run of this app would print something like:
Source 2, last insert ID 6, connection 66
Source 1, last insert ID 5, connection 68
Source 0, last insert ID 7, connection 67
Chunk 0 of source 0 done
Chunk 0 of source 2 done
Chunk 0 of source 1 done
Source 2, last insert ID 40, connection 70
Source 0, last insert ID 26, connection 69
Source 1, last insert ID 27, connection 71
Chunk 1 of source 2 done
Chunk 1 of source 1 done
Chunk 1 of source 0 done
Where it looks like the connection is a different one for each Source, but the IDs overlap (Source 0 sees 7, Source 1 sees 5, Source 2 sees 6). It is correct that IDs start from 5, as I'm adding 4 dummy rows right after creating the tables (not shown in this question's code). Obviously, I see multiple rows in dependent with the same tests.id, which shouldn't happen.
It's my understanding that last insert IDs refer to a single connection. How is it possible that three different connections see overlapping IDs, considering that the entire flow is wrapped in a transaction (via Slick's transactionally)?
This happens with innodb_autoinc_lock_mode=1. As far as I've seen so far, it doesn't happen with innodb_autoinc_lock_mode=0, which makes sense, since InnoDB would lock tests until the whole batch insert terminates.
UPDATE after Georg's answer: For some other constraints in the project, I'd like the solution to be compatible with MariaDB 10.4, which, as far as I understand, doesn't feature INSERT...RETURNING. Additionally, Slick's ++= operator's support for returning is quite bad, as also reported here. I tested it on both MariaDB 10.4 and 10.5, and, according to the query logs, Slick does execute single INSERT INTO statements instead of a batch one. In my case, this is not quite acceptable, as I'm planning on writing several chunks of rows in a streaming fashion.
While I also understand that making assumptions about the auto-increment value being 1 is not ideal, we do have control over the Production setup and do not have multi-master replication.
You cannot generate subsequent values based on LAST_INSERT_ID():
1. There might be a second transaction running at the same time which was rolled back, so there will be a gap in your auto-incremented IDs.
2. Iterating over the number of rows by incrementing the LAST_INSERT_ID value will not work, since it depends on the value of the session variable @@auto_increment_increment (which, especially in multi-master replication, is not 1).
Instead, you should use RETURNING to get the IDs of the inserted rows:
MariaDB [test]> create table t1 (a int not null auto_increment primary key);
Query OK, 0 rows affected (0,022 sec)
MariaDB [test]> insert into t1 (a) values (1),(3),(NULL), (NULL) returning a;
+---+
| a |
+---+
| 1 |
| 3 |
| 4 |
| 5 |
+---+
4 rows in set (0,006 sec)
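As a side illustration of the @@auto_increment_increment point above (a sketch added here, not part of the original answer; it reuses the t1 table from the example), the step between generated values can be larger than 1:
MariaDB [test]> SET SESSION auto_increment_increment = 10;
MariaDB [test]> INSERT INTO t1 (a) VALUES (NULL), (NULL) RETURNING a;
The two returned values now differ by 10 rather than 1, so computing keys as LAST_INSERT_ID() plus a row offset would point at rows that were never inserted.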

Removing duplicated rows in PostgreSQL

I have thousands of lines of duplicate data in a PostgreSQL database. To find out which rows are duplicated, I am using this code:
SELECT "Date" FROM stockdata
group by "Date"
having count("Date")>1
This has again produced thousands of lines of the date column which have more than 1 entry. How can I remove the rows with a duplicated date so that just 1 entry of each duplicated item remains?
P.S. I cannot use a primary key when entering data.
Update
As per the comment: there is no primary key. Also, the Date is unique, thus there cannot be 2 or more of it.
The df looks like this:
Date High Low Open Close Volume Adj Close
0 2017-04-03 893.489990 885.419983 888.000000 891.510010 3422300 891.510010
1 2017-04-04 908.539978 890.280029 891.500000 906.830017 4984700 906.830017
2 2017-04-05 923.719971 905.619995 910.820007 909.280029 7508400 909.280029
3 2017-04-06 917.190002 894.489990 913.799988 898.280029 6344100 898.280029
4 2017-04-07 900.090027 889.309998 899.650024 894.880005 3710900 894.880005
... ... ... ... ... ... ... ...
12595 2022-03-28 1097.880005 1053.599976 1065.099976 1091.839966 34168700 1091.839966
12596 2022-03-29 1114.770020 1073.109985 1107.989990 1099.569946 24538300 1099.569946
12597 2022-03-30 1113.949951 1084.000000 1091.170044 1093.989990 19955000 1093.989990
12598 2022-03-31 1103.140015 1076.640015 1094.569946 1077.599976 16265600 1077.599976
12599 2022-04-01 1094.750000 1066.640015 1081.150024 1076.352783 11449987 1076.352783
12600 rows × 7 columns
The data is repeated a few times in places; however, the rows with the same date have the same data.
This data is not stock data (I am using it as a troubleshooting example) but comes from a Yokogawa data logger: https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview
There are redundancies in the system, and the earlier integrator had just dumped all the data into 1 database; thus, if a redundant logger comes online, the database gets multiple entries. I need to remove them so we can actually use the data. I don't have access to their software.
Further Update:
Using this code as suggested in the comments:
delete from stockdata s
using
(SELECT "Date" , max(ctid) as max_ctid from stockdata group by "Date") t
where s.ctid<>t.max_ctid
and s."Date"=t."Date";
It was able to do the job, but going forward, is this a dangerous solution for production?
This should do the trick:
DELETE FROM
    stockdata a
        USING stockdata b
WHERE
    a.id < b.id
    AND a."Date" = b."Date";
But be careful, this will immediately delete all duplicates. There is no way to restore them.
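Since the question's table has no id column, the same idea can be keyed on the physical row identifier ctid, as in the query already shown in the question's update. For a one-off cleanup, an alternative sketch (not from the original answer; the new table names are illustrative) is to rebuild the table with DISTINCT ON, which keeps exactly one row per "Date" without a self-join delete:
CREATE TABLE stockdata_dedup AS
SELECT DISTINCT ON ("Date") *
FROM stockdata
ORDER BY "Date";
-- verify the result, then swap the tables (hypothetical names; adjust to your schema)
ALTER TABLE stockdata RENAME TO stockdata_old;
ALTER TABLE stockdata_dedup RENAME TO stockdata;
Because the duplicated rows carry identical data, it does not matter which one DISTINCT ON keeps. Adding a unique constraint on "Date" afterwards would stop the redundant logger from reintroducing duplicates, at the cost of those inserts failing instead of being silently absorbed.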

DB2 Import - how to merge several CSV files

I have several CSV files with the same columns; however, the columns are in a different order.
I would like to merge, via "import", all these CSV files.
Please could you help with this import statement? How can I make this import statement match the column order?
With Db2 on Linux/Unix/Windows, you can use the IMPORT command or the LOAD command. Additionally, other ways are possible with the INGEST command.
With IMPORT or LOAD, there are two ways you can do it: either use "METHOD P" or specify the order of the target columns on the INSERT clause. There are two examples below.
The first example uses "METHOD P" for import. There are three CSV files whose three columns are in different orders, and a target table with three columns (a, b, c):
create table mytab(a integer not null, b integer not null, c integer not null)
DB20000I The SQL command completed successfully.
!cat 1a.csv
1,2,3
!cat 1b.csv
99,98,97
!cat 1c.csv
55,51,59
import from 1a.csv of del method p(1,2,3) insert into mytab
SQL3109N The utility is beginning to load data from file "1a.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
import from 1b.csv of del method p(3,2,1) insert into mytab
SQL3109N The utility is beginning to load data from file "1b.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
import from 1c.csv of del method p(2,1,3) insert into mytab
SQL3109N The utility is beginning to load data from file "1c.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
select * from mytab
A B C
----------- ----------- -----------
1 2 3
97 98 99
51 55 59
3 record(s) selected.
The second example specifies the order of the target columns on the INSERT clause to match the column order in each CSV file.
create table mynewtab(a integer not null, b integer not null, c integer not null)
DB20000I The SQL command completed successfully.
!cat 1a.csv
1,2,3
!cat 1b.csv
99,98,97
!cat 1c.csv
55,51,59
import from 1a.csv of del insert into mynewtab(a,b,c)
SQL3109N The utility is beginning to load data from file "1a.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
import from 1b.csv of del insert into mynewtab(c,b,a)
SQL3109N The utility is beginning to load data from file "1b.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
import from 1c.csv of del insert into mynewtab(b,a,c)
SQL3109N The utility is beginning to load data from file "1c.csv".
SQL3110N The utility has completed processing. "1" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
select * from mynewtab
A B C
----------- ----------- -----------
1 2 3
97 98 99
51 55 59
3 record(s) selected.

Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I have a local file movies.dat formatted as movie_id:movie_title:genre. For example:
1:movie1:Comedy
2:movie2:Drama
3:movie3:Horror
...
I create an external table using the following command.
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\:' -- need backslash!!
LOCATION '/exc103320/movies_copy'; -- name of the directory to copy the original file
Then, I load the data to the table by
LOAD DATA LOCAL INPATH 'movies.dat' OVERWRITE INTO TABLE movies;
When I run SELECT * FROM movies LIMIT 3;
I see the first 3 rows.
When I run SELECT movie_id FROM movies LIMIT 3; I get the following error
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1420729875693_6595, Tracking URL = http://cshadoop1.utdallas.edu:8088/proxy/application_1420729875693_6595/
Kill Command = /usr/local/hadoop-2.4.1/bin/hadoop job -kill job_1420729875693_6595
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-03-29 17:14:54,820 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1420729875693_6595 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cshadoop1.utdallas.edu:8088/cluster/app/application_1420729875693_6595
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Any idea why this happens?
I believe you don't need the backslash in the "ROW FORMAT DELIMITED FIELDS TERMINATED BY" clause.
Try the DDL statement like this and see if it works:
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ':'
LOCATION '/exc103320/movies_copy';
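Note that dropping and recreating an EXTERNAL table does not remove the data at its LOCATION, so the previously loaded file may still be in place. If it is not, reload and re-test using the statements from the question (repeated here as a quick check):
LOAD DATA LOCAL INPATH 'movies.dat' OVERWRITE INTO TABLE movies;
SELECT movie_id FROM movies LIMIT 3;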

SELECT and SET with ROWCOUNT

I have a stored procedure with which I fetch a defined number of rows. How can I change the value of a column in the fetched rows, such as 'has been fetched' = 1?
Any advice?
EDIT:
The SP looks something like this:
SET ROWCOUNT #numberOfRows
SELECT * FROM tableA where notSent = 1
I would like to change the 'notSent' column for all the rows that I fetch. Is this possible?
1) Don't use SELECT * in a stored procedure - always specifically list the fields required, as shown below - replace field1, field2, field3, etc. with the actual fields you want returned.
OK - Modified answer:
2) Set a flag on the rows you want to select and update, using a value that will not otherwise be programmatically set - e.g. -1. Then select these records, then update them to set the final value required. Doing it this way will avoid the possibility that you update a different set of records to those selected, due to inserts occurring halfway through the stored procedure's execution. You could also avoid this by using locks, but the approach below is going to be far healthier.
UPDATE tableA SET notSent = -1 WHERE notSent = 1
SELECT ... FROM tableA WHERE notSent = -1
UPDATE tableA SET notSent = 0 WHERE notSent = -1
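On SQL Server 2005 and later, the mark-and-read can also be collapsed into a single statement with UPDATE TOP ... OUTPUT (a sketch, assuming the #numberOfRows placeholder from the question is available as a variable @numberOfRows and that notSent = 1 marks rows still to be sent, as in the question):
UPDATE TOP (@numberOfRows) tableA
SET notSent = 0
OUTPUT inserted.*
WHERE notSent = 1;
Like SET ROWCOUNT, TOP without an ORDER BY picks an arbitrary set of qualifying rows, and OUTPUT cannot return results directly to the client if the table has certain triggers, so treat this as a starting point rather than a drop-in replacement.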