We are changing the partition key of one of our tables by removing one column from the partition key. Every record in this table also has a TTL, and we want to preserve the data in that table along with its TTL.
How can we do it?
We can create a new table with the desired schema and then copy the data from the old table to the new one. However, we lose the TTL in this process.
For further information - this Cassandra table is populated by an Apache Storm application that reads events from Kafka. We can re-hydrate the Kafka messages, but Kafka contains some unwanted messages that we don't want to process.
NOTE - The TTL is derived from a date column value, which never changes. Because of this, the TTL is always the same on all the columns.
Before going to a specific implementation, it's important to understand that TTL exists at the level of individual cells: it may be set on a single cell or on all cells in a row. When you perform an INSERT or UPDATE operation, you can apply only one TTL value to all columns specified in that query, so if you have two columns with different TTLs, you'll need to perform two queries, one per column, each with its own TTL.
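For illustration, a minimal sketch of that two-statement pattern using the Python driver (keyspace, table, and column names are hypothetical):

from cassandra.cluster import Cluster

# Hypothetical table: ks.tbl (id int PRIMARY KEY, col_a text, col_b text)
session = Cluster(["localhost"]).connect()

# Each statement carries exactly one TTL, so two columns with different
# TTLs on the same row require two separate statements.
session.execute(
    "INSERT INTO ks.tbl (id, col_a) VALUES (%s, %s) USING TTL 86400",
    (1, "value-a"),
)
session.execute(
    "UPDATE ks.tbl USING TTL 3600 SET col_b = %s WHERE id = %s",
    ("value-b", 1),
)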
Regarding tooling, there are two more or less ready-to-use options here:
Use DSBulk. This approach is described in detail in example 30.1 of this blog post. Basically, you unload the data to disk using a query that extracts the column values and their TTLs, and then load the data back with a query that builds a batch writing each column with its own TTL. From the example:
dsbulk unload -h localhost -query \
"SELECT id, petal_length, WRITETIME(petal_length) AS w_petal_length, TTL(petal_length) AS l_petal_length, .... FROM dsbulkblog.iris_with_id" \
-url /tmp/dsbulkblog/migrate
dsbulk load -h localhost -query \
"BEGIN BATCH INSERT INTO dsbulkblog.iris_with_id(id, petal_length) VALUES (:id, :petal_length) USING TIMESTAMP :w_petal_length AND TTL :l_petal_length; ... APPLY BATCH;" \
-url /tmp/dsbulkblog/migrate --batch.mode DISABLED
Use the Spark Cassandra Connector - it supports reading & writing data with TTL & WriteTime, but you'll need to write the code that does it yourself and correctly handle things such as collections, static columns, etc. (or wait for SPARKC-596 to be implemented).
I have a requirement to produce data from multiple MongoDB tables and push it to the same Kafka topic using the mongo-kafka connector. I also have to ensure that records with the same table key columns always go to the same partition every time, to guarantee message ordering.
For example:
tables --> customer, address
table key columns --> CustomerID (for table customer), AddressID (for table address)
For CustomerID = 12345, it will always go to partition 1
For AddressID = 54321, it will always go to partition 2
For a single table, the second requirement is easy to achieve using chained transformations. However, for multiple tables going to one topic, I am finding it difficult to achieve since each of these tables has different key column names.
Is there any way available to fulfil both requirements using the Kafka connector?
If you use the ExtractField$Key transform together with the IntegerConverter, all matching IDs should go to the same partition.
If you have two columns and one table, or end up with keys like {"CustomerID": 12345}, then you have a composite/object key, meaning the whole key will be hashed to compute the partition, not the ID itself.
You cannot route records to partitions based on specific fields without setting producer.override.partitioner.class in the connector config. In other words, you need to implement a partitioner that deserializes your data, parses out the values, and then computes and returns the appropriate partition.
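For illustration only, here is a minimal Python sketch of the logic such a partitioner would implement. The real thing has to be a Java class implementing Kafka's Partitioner interface, and the field-to-partition mapping below is a hypothetical one mirroring the example in the question:

import json

# Hypothetical mapping of key field -> partition, mirroring the example above
FIELD_TO_PARTITION = {"CustomerID": 1, "AddressID": 2}

def choose_partition(key_bytes: bytes, num_partitions: int) -> int:
    """Deserialize the record key, find whichever ID field is present,
    and return the partition pinned to that field; otherwise fall back
    to hashing the serialized key."""
    key = json.loads(key_bytes)
    for field, partition in FIELD_TO_PARTITION.items():
        if field in key:
            return partition % num_partitions
    return hash(key_bytes) % num_partitions

# A customer key is always routed to partition 1
print(choose_partition(b'{"CustomerID": 12345}', 6))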
Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column (timestamp) to a new column (date)?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration could be done via Spark + the Spark Cassandra Connector - it's the most flexible solution, and it can even be done on a single-node machine with Spark running in local master mode (the default). The code could look something like this (try it on test data first):
import pyspark.sql.functions as F

options = {"table": "tbl", "keyspace": "ks"}

# Read the table, cast the timestamp column to a date under the new column name,
# and write it back; append mode is needed to write into an existing table.
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
    .select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
    .write.format("org.apache.spark.sql.cassandra").options(**options).mode("append").save()
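For completeness, the new column has to exist before the job writes to it. A minimal sketch using the Python driver, with the contact point assumed and the keyspace, table, and column names taken from the example above:

from cassandra.cluster import Cluster

# Add the target column once before running the Spark job above;
# keyspace, table, and column names come from the example.
session = Cluster(["localhost"]).connect()
session.execute("ALTER TABLE ks.tbl ADD new_name date")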
P.S. You can use DSBulk, for example, but you need enough disk space to offload the data (although you only need the primary key columns + your timestamp column).
To add to Alex Ott's answer, there are validations in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable: once they are written to disk, they are never modified/edited/updated. They can only be compacted into new SSTables.
Some try to get around it by dropping the column from the table and then adding it back in with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you try to read the old data you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use ToDate to convert it. For example, the table Email has a column Date with the format 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
 date                            | convert
---------------------------------+------------
 2001-08-29 13:03:35.000000+0000 | 2001-08-29
I am working in a Microsoft Azure Databricks environment, using Spark SQL and PySpark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which is either NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, and from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (the status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. In addition to the valid-status records, the downstream job should also check whether a mapping is now found for the errored records and, if present, take those down further as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The ideal approach would be to first update the Delta table/lake with the correct mapping and set the status column to "available_for_reprocessing", and then have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html, and the update example there only covers a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records, and write the dataframe back into the Delta lake using the replaceWhere option. This is super inefficient, as we would be reading everything (hundreds of millions of records) and writing everything back just to process at most about 1000 records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.status = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
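If you're running this from PySpark rather than a SQL cell, the same statement can be submitted through spark.sql; a minimal sketch using the same hypothetical table and column names as above:

# Same UPDATE ... WHERE EXISTS statement, submitted through the SparkSession;
# table and column names are the hypothetical ones from the SQL above.
spark.sql("""
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  WHERE
    b.status = 'incorrect' AND
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
""")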
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
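Since the question mentions PySpark, the same merge can also be expressed with the Delta Lake Python API. A minimal sketch, assuming both tables are registered in the metastore and a Delta Lake version that provides DeltaTable.forName:

from delta.tables import DeltaTable

# Same tables as in the SQL MERGE above, addressed by name
maindept = DeltaTable.forName(spark, "deptdelta")
upddept = spark.table("updated_dept_location")

(maindept.alias("maindept")
    .merge(upddept.alias("upddept"), "upddept.dno = maindept.dno")
    .whenMatchedUpdate(set={
        "dname": "upddept.updated_name",
        "location": "upddept.updated_location",
    })
    .execute())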
When attempting a MERGE statement, BigQuery only scans the requested partitions UNTIL the DELETE clause is added, at which point it reverts to scanning the whole dataset (ballooning from 1 GB to >1 TB in this case).
Is there a way to use the full features of MERGE, including DELETE, without incurring the extra cost?
Generic sample that matches my effort below:
MERGE target_table AS t  ## all dates, partitioned on activity_date
USING source_table AS s  ## one date, only yesterday
ON t.field_a = s.field_a
AND t.activity_date >= DATE_ADD(DATE(current_timestamp(),'America/Los_Angeles'), INTERVAL -1 DAY)  ## use partition to limit to yesterday
WHEN MATCHED
THEN UPDATE SET
field_b = s.field_b
WHEN NOT MATCHED
THEN INSERT
(field_a, field_b)
VALUES
(field_a, field_b)
WHEN NOT MATCHED BY SOURCE
THEN DELETE
Based on the query you have provided, it is not expected behavior for the merge to be applied to the whole dataset. After the query has run, you should analyze your dataset and check its validity to ensure that the query only ran on the specific partitions.
If, after further inspection, no unexpected changes were made to your dataset, the 1 TB of data noted may simply be explained by BigQuery reading that data into memory as a side step in order to run the query.
However, to confirm, it is recommended to submit a ticket in the issue tracker with your BigQuery job ID so that BigQuery engineering can properly inspect the issue.
I'm trying to populate a prices and quotes database using AquaQ's TorQ. For this purpose I use the .loader.loadallfiles function. The difference is that prices is daily data while quotes is more intraday, e.g. FX rates.
I do the loading as follows:
/- check the location of database directory
hdbdir:hsym `$getenv[`KDBHDB]
/hdbdir:#[value;`hdbdir;`:hdb]
rawdatadir:hsym `$getenv[`KDBRAWDATA]
target:hdbdir;
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "prices"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`open`close`low`high`volume;"DSFFFFF";enlist ",";`prices;target;`date;`year;{[p;t] `date`sym`open`close`low`high`volume xcols update volume:"i"$volume from t}); rawdatadir];
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "quotes"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`bid`ask;"ZSFF";enlist ",";`quotes;target;`date;`year;{[p;t] `date`sym`bid`ask`mid xcols update mid:(bid+ask)%2.0 from t}); rawdatadir];
and this works fine. However, when loading the database I get errors when attempting to select from either table. The reason is that for some partitions there isn't any prices data or there isn't any quotes data, e.g. attempting:
quotes::`date`sym xkey select from quotes;
fails with an error saying that the partition for the year, e.g. hdb/2000/, doesn't exist, which is true: there are only prices for the year 2000 and no quotes.
As I see it, there are two possible solutions, but I don't know how to implement either:
Tell .loader.loadallfiles to create an empty schema for quotes and prices in partitions for which there isn't any data.
When loading the database, gracefully handle the case where there is no data for a given partition, i.e. have select from ... where ... ignore empty partitions.
Try using .Q.chk[`:hdb]
Where `:hdb is the filepath of your HDB
This fills in missing tables, which will then allow you to perform queries.
Alternatively you can use .Q.bv, where the wiki states:
If your table exists in the latest partition (so there is a prototype for the schema), then you could use .Q.bv[] to create empty tables on the fly at run-time without having to create those empties on disk.