Cassandra time series data model - NoSQL

Let's assume 10 devices (dev01, dev02, dev03, etc.).
Each device sends data at some interval and we collect that data, so our data schema is:
dev01 : int
signalname : string
signaltime : date/time [YY-MM-DD HHMMSS.mm]
Extradata : string
I want to push this data into Cassandra. Which is the best way to store it?
My queries are like:
1. Retrieve a device's current-day data, or data within some date range.
2. Current-day data for 5 devices.
I am not sure whether the following way of storing data in Cassandra is the best model.
Standard column family name: signalname
row key: dev01
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
column name: timeseries(20120801124205) [YYMMDD HHMMSS] (next second's data)
column value: JSON data
row key: dev02
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
column name: timeseries(20120801124205) [YYMMDD HHMMSS] (next second's data)
column value: JSON data
Or
Super column family: signalname
row key: Clientid1
super column name: dev01
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
super column name: dev02
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
row key: Clientid2
super column name: dev03
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
super column name: dev04
column name: timeseries(20120801124204) [YYMMDD HHMMSS]
column value: JSON data
Kindly help me out regarding this issue. Is there any other way?
Thanks & regards,
Kannadhasan

I see three issues with your approach here, which I will address below:
super column families,
Thrift vs. CQL3,
JSON data as cell values.
Before you go ahead: the use of super column families is discouraged. Read more here. Composite keys (as described below) are the way to go.
Also, you might need to read up on CQL3, since Thrift has been a legacy API since Cassandra 1.2.
Instead of storing JSON data, you can make use of native collection data types such as lists and maps. If you still want to work with JSON, there has been improved JSON support in Cassandra since version 2.2.
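A minimal sketch of what that can look like in CQL3 (the table and column names here are illustrative, not taken from the question):
CREATE TABLE signals (
    deviceid   text,
    signaltime timestamp,
    signalname text,
    extradata  map<text, text>,   -- native collection instead of an opaque JSON blob
    PRIMARY KEY (deviceid, signaltime)
);
-- since Cassandra 2.2, whole rows can also be written as JSON:
INSERT INTO signals JSON '{"deviceid": "dev01", "signaltime": "2012-08-01 12:42:04", "signalname": "temperature", "extradata": {"unit": "C"}}';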
In general, it is pretty straightforward to query per device and per time period:
your row key would be the device id and the column key a timeuuid;
to avoid hot spots, you could add "bucket" counters to the row key (creating a composite row/partition key) to rotate across the nodes;
you can then query for time ranges if you know the row/device id (see the CQL sketch below).
Alternatively, you could use your signal type as a row key (and a timeuuid/timestamp as the column key) if you want to query data for multiple devices (but one event type) at once. Read more on time series data in Cassandra in this blog entry.
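A CQL3 sketch of the composite-key layout described above, assuming a day-level bucket (table, column, and bucket names are illustrative):
CREATE TABLE device_events (
    deviceid   text,
    day        text,       -- the "bucket"; bounds partition size and spreads writes across nodes
    eventtime  timeuuid,   -- clustering column; keeps events sorted within a partition
    signalname text,
    extradata  text,
    PRIMARY KEY ((deviceid, day), eventtime)
);
-- current-day data for one device:
SELECT * FROM device_events WHERE deviceid = 'dev01' AND day = '2012-08-01';
-- a time range within that day:
SELECT * FROM device_events
WHERE deviceid = 'dev01' AND day = '2012-08-01'
  AND eventtime >= minTimeuuid('2012-08-01 12:00:00')
  AND eventtime <  maxTimeuuid('2012-08-01 13:00:00');
-- current-day data for 5 devices: run the same query once per device id
-- (recent Cassandra versions also allow an IN restriction on deviceid).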
Hope that helps!

Related

Creating a Postgres column which allows all data types

I want to create a logging table which tracks changes in a certain table, like so:
CREATE TABLE logging.zaak_history (
event_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
tstamp timestamp DEFAULT NOW(),
schemaname text,
tabname text,
columnname text,
operation text,
who text DEFAULT current_user,
new_val <any_type>,
old_val <any_type>
);
However, the column that I want to track can take different data types, such as text, boolean, and numeric. Is there a data type that supports this?
Currently I am thinking about storing it as jsonb, since the JSON formatting will take care of the data type, but I was wondering if there is a better way.
There is no Postgres data type that isn't strongly typed: the "any" data type that is available as a pseudo-type cannot be used for a column (it can only be used in functions, etc.).
You could store the binary representation of your data, because every type does have a binary representation.
Your approach of using JSON seems more flexible, as you can also store meta data (such as type information).
However, I recommend looking at how other people have solved the same issue for alternative ideas. For example, most wikis store a copy of the entire record for history, which is easy to reconstruct, can be referenced independently, and has no typing issues.
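For illustration, a minimal sketch of the jsonb variant discussed above (the inserted values are made up; only the new_val/old_val types differ from the question's schema):
CREATE TABLE logging.zaak_history (
    event_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tstamp     timestamp DEFAULT NOW(),
    schemaname text,
    tabname    text,
    columnname text,
    operation  text,
    who        text DEFAULT current_user,
    new_val    jsonb,
    old_val    jsonb
);
-- jsonb preserves the scalar type of the tracked value and leaves room for metadata:
INSERT INTO logging.zaak_history (schemaname, tabname, columnname, operation, new_val, old_val)
VALUES ('public', 'zaak', 'is_active', 'UPDATE', to_jsonb(true), to_jsonb(false));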

SQLAlchemy: Efficiently substitute integer code for string name when inserting data

What is the most efficient way to substitute an integer key from a lookup table for a string in my input data?
For example let's say I have a table that has country names in string format, with the primary key of the lookup table as a foreign key column on a second table, "cities". I have list of tuples containing data for the "cities" table, and one of those fields is the string name of the country.
So each time I input a row for a new city, I must select the PK from the lookup table where the "country_name" string column is equal to the input string. Then the integer PK for that country row needs to be put into the FK "country_id" column of the row being added to cities.
Is there a canonical way to do this in SQLAlchemy? The most obvious way would be to write a function that gets the appropriate PK with something like select(Country.country_id).where(Country.country_name == 'Ruritania').
But I wonder if SQLAlchemy has a more efficient way to do it, especially for the bulk insertion of records.
"Association Proxies" sound like what I want, but I don't understand them well enough to know how to use them in the context of bulk inserts. From what I have gathered so far, an ENUM data type would be too constraining as it cannot be updated easily, but I would consider such a solution if there is a way around that caveat.
Are there ways to make sure that values are not repeatedly read from the lookup table in a batch of operations?
I am using Postgres for my database.
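For reference, the lookup-then-insert pattern described above corresponds to plain SQL along these lines (table and column names are illustrative; SQLAlchemy Core's insert().from_select() can generate this kind of statement):
INSERT INTO cities (city_name, country_id)
SELECT 'Strelsau', c.country_id
FROM countries AS c
WHERE c.country_name = 'Ruritania';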

How to store data in PostgreSQL already sorted

Is there some way to hint PostgreSQL Database to store data in a sorted manner?
For example, I have this SQL schema:
create table DataHolder(
    name text,
    path ltree primary key
);
create table Datas(
    time int primary key,
    foreign_dataholder_id ltree references DataHolder(path)
);
And I can populate it like so:
insert into DataHolder(name, path) values ('A', 'Sec1.SubSec1');
insert into DataHolder(name, path) values ('B', 'Sec1.SubSec2');
insert into DataHolder(name, path) values ('C', 'Sec2.SubSec1');
...
insert into Datas (time, foreign_dataholder_id) values (1513889449, 'Sec1.SubSec1');
insert into Datas (time, foreign_dataholder_id) values (1513889451, 'Sec1.SubSec1');
...
So I have a DataHolder table that stores tree-like data containers such as Sec1.SubSec1, and for each of these containers I have a lot of Data.
But the crucial part is that the data always has a unix timestamp as its primary key, and it is always queried as a time-range region of interest, sorted in ascending order.
For example:
select * from Datas where foreign_dataholder_id = 'Sec1.SubSec1' and
time >= 1513889450 and time <= 1513889451 order by time asc;
Since I always query in that manner, it would be best, performance-wise, to tell PostgreSQL to store these data internally already sorted by Datas.time in ascending order, so it doesn't need to sort on every query I run.
Is that possible? If not, is there some other way to design these tables?
Edit:
I thought about some solutions. An array, for example, could work, since I can always guarantee it is sorted and the majority of my additions would be at its end, where no sorting is needed.
But since there are occasionally insertions in the middle of the table, that could be somewhat difficult or inefficient; I don't know.
Another idea would be a linked list, but I'm not aware whether PostgreSQL supports such a data type or how to use it.
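For what it's worth, the usual way to make a query like the one above cheap is a composite index rather than physically pre-sorted storage; a sketch, assuming the schema above (the index name is arbitrary):
create index datas_holder_time_idx on Datas (foreign_dataholder_id, time);
-- PostgreSQL can then return a (foreign_dataholder_id, time-range) scan already ordered by time.
-- Optionally, CLUSTER rewrites the existing rows in index order (a one-off operation, not maintained on later inserts):
cluster Datas using datas_holder_time_idx;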

How to select the last record from a time series in Cassandra?

I want to store some encoded 'data' into Cassandra, versioned by timestamp. My tentative schema is:
CREATE TABLE items (
item_id varchar,
timestamp timestamp,
data blob,
PRIMARY KEY (item_id, timestamp)
);
I would like to be able to return the list of items, returning only the latest (highest-timestamp) row for each item_id. Is this possible with this schema?
It is not possible to express such a query in a single CQL statement for this table, so the answer is no.
You can try creating another table, e.g. latest_items, and only storing the last update there, so the schema would be:
CREATE TABLE latest_items (
item_id varchar,
timestamp timestamp,
data blob,
PRIMARY KEY (item_id)
);
If your rows are inserted in timestamp order, the table would naturally contain only the latest row for each item. Then you can just run select * from latest_items limit 10000000;. This will of course be expensive, because you're fetching all rows, but given your requirements where you actually want all of them, there is no way to avoid it.
This second table involves duplicating your data, but this is a common theme with Cassandra. You can avoid duplicating the blob by storing it indirectly, i.e. as a path or URL or somesuch.
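A sketch of the resulting dual-write pattern, assuming the two tables above (the example item id, timestamp, and blob value are made up):
-- every write goes to both tables; the write to latest_items is an upsert that overwrites the previous "latest" row
INSERT INTO items        (item_id, timestamp, data) VALUES ('item-42', '2015-06-01 10:00:00', 0xCAFE);
INSERT INTO latest_items (item_id, timestamp, data) VALUES ('item-42', '2015-06-01 10:00:00', 0xCAFE);
-- reading the latest version of every item:
SELECT * FROM latest_items;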

XQuery: selection of data using non-XML columns

I have a DB2 table with the following structure:
CREATE TABLE DUMMY
(
ID CHARACTER(10) NOT NULL,
RECORD XML NOT NULL
)
I want to use XQuery to select data in the RECORD column on the basis of ID, and do some XQuery operations on the data present in the RECORD column.
E.g., I want to select the RECORD having ID 1.
You can use the db2-fn:sqlquery function to do this. See the documentation.
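A minimal sketch of how that can look (the embedded SQL is illustrative; any further XQuery navigation depends on the structure of the RECORD documents, which isn't shown in the question):
XQUERY
for $rec in db2-fn:sqlquery("SELECT RECORD FROM DUMMY WHERE ID = '1'")
return $rec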