How to ensure Spark does not read the same data twice from Cassandra - scala

I'm learning Spark and Cassandra. My problem is as follows.
I have a Cassandra table which records row data from a sensor:
CREATE TABLE statistics.sensor_row (
name text,
date timestamp,
value int,
PRIMARY KEY (name, date)
)
Now I want to aggregate these rows through a Spark batch job (i.e. daily).
So I could write
val rdd = sc.cassandraTable("statistics","sensor_row")
// and do map and reduce to get what I want, and perhaps write back to an aggregated table.
But my problem is that I will be running this code periodically, and I need to make sure I don't read the same data twice.
One thing I can do is delete the rows I have already read, which looks pretty ugly, or use a filter:
sensorRowRDD.where("date >'2016-02-05 07:32:23+0000'")
The second one looks much nicer, but then I need to record when the job was last run and continue from there. However, according to the DataStax driver's data locality, each worker will load data only from its local Cassandra node. Which means that instead of tracking a single global date, I need to track a date per Cassandra/Spark node. That still does not look very elegant.
Are there any better ways of doing this?

The DataFrame filters will be pushed down to Cassandra, so this is an efficient solution to the problem. But you are right to worry about the consistency issue.
One solution is to record not just a start date but an end date as well. When your job starts, it looks at the clock: it is 2016-02-05 12:00. Perhaps you have a few minutes of delay in collecting late-arriving data, and the clocks are not absolutely precise either. You decide to use a 10-minute delay and set your end time to 2016-02-05 11:50. You record this in a file/database. The end time of the previous run was 2016-02-04 11:48, so your filter is date > '2016-02-04 11:48' and date < '2016-02-05 11:50'.
Because the date ranges cover all time, you will only miss events that were saved into an already-processed range after that range was processed. You can increase the delay beyond 10 minutes if this happens too often.
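For illustration, a minimal sketch of that approach with the spark-cassandra-connector, assuming sc is an already configured SparkContext; the hard-coded lastEnd stands in for whatever file or table you read the previous run's end time back from:
import java.text.SimpleDateFormat
import com.datastax.spark.connector._
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ")
// previous run's end time -- in practice, read this back from your own checkpoint file/table
val lastEnd = fmt.parse("2016-02-04 11:48:00+0000")
// this run's end time: "now" minus a 10-minute delay for late-arriving data
val thisEnd = new java.util.Date(System.currentTimeMillis() - 10 * 60 * 1000L)
val rdd = sc.cassandraTable("statistics", "sensor_row")
  .where("date > ? AND date < ?", lastEnd, thisEnd) // both bounds are pushed down to Cassandra
// ... map/reduce as before and write the aggregates back ...
// finally, record thisEnd as the new start point, but only after the job has succeeded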

Related

TimescaleDB vs. InfluxDB for PLC logging

I am 100% new to logging to a database at all, so there will probably be some stupid questions here; I hope that's OK.
I would like to log data from a Beckhoff PLC controller into a DB that is placed on the same IPC as my PLC.
The Beckhoff PLC has a direct link function to both InfluxDB and to PostgreSQL, which TimescaleDB is based on, so the connection will work fine.
We would like to log data against time so we can go back and see what time certain things happened, and also query the database based on time.
I have been talking to different people and most of them recommend using TimescaleDB, so it would be great to hear the benefits of each and what you guys would recommend me to choose.
The data size we will log is pretty small.
We will have a structure of data that will contain about 10 INT registers, so 20 bytes.
We will write to the database every 1 second on the quick machines and sometimes only once every 20 minutes, but this part I will control in my PLC.
So putting data in the DB I believe will be pretty straightforward, but then I have some thoughts about what I would like to do and what is possible.
Is it possible to ask the DB for the amount (count), highest value, lowest value and mean value over the last 60 minutes, 24 hours, etc., and can the database return these values based on the time frame I give it in my query?
The resolution I log with, which is controlled from the PLC, only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both these databases, and is there any benefit in one of them? Maybe easier in one of them?
Is there, in one of these two databases, a possibility to not write to the HD / disk every time my PLC is putting data into it? Or will it write to the disk every time automatically? I did read about something called WAL, what is that, or will that use RAM to store the data before writing to the disk less often?
Is there any big difference in setting up these two databases?
I probably have more questions, but these are the main functions that I need in the system.
Many thanks
Is it possible to ask the DB for the amount (count), highest value, lowest value and mean value over the last 60 minutes, 24 hours, etc., and can the database return these values based on the time frame I give it in my query?
Yes! You can do that with plain SQL queries. Consider the following table structure:
CREATE TABLE conditions (
time TIMESTAMPTZ NOT NULL,
device INTEGER NOT NULL,
temperature FLOAT NOT NULL
);
SELECT * FROM create_hypertable('conditions', 'time');
ALTER TABLE conditions SET (timescaledb.compress, timescaledb.compress_orderby='time');
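For the aggregates themselves, an ordinary query over the hypertable is enough; for example, over the last 60 minutes (swap the interval for 24 hours as needed):
SELECT count(*)         AS amount,
       max(temperature) AS highest,
       min(temperature) AS lowest,
       avg(temperature) AS mean
FROM conditions
WHERE time > now() - INTERVAL '60 minutes';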
The resolution I log with, which is controlled from the PLC, only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both these databases, and is there any benefit in one of them? Maybe easier in one of them?
You can create a continuous aggregate, which is a fast way to keep your summarized data materialized.
CREATE MATERIALIZED VIEW conditions_hourly(time, device, low, high, average )
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) as time,
device,
min(temperature) as low,
max(temperature) as high,
AVG(temperature) as average
FROM conditions
GROUP BY 1,2;
And then you can add a retention policy to keep only the last 7 days of raw data.
SELECT add_retention_policy('conditions', INTERVAL '7 day');
And add a continuous aggregates policy that will keep your view up to date every hour:
SELECT add_continuous_aggregate_policy('conditions_hourly',
start_offset => INTERVAL '1 day',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
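Once that is in place, the hourly view can be queried like any other table, for example for the last 24 hours:
SELECT time, device, low, high, average
FROM conditions_hourly
WHERE time > now() - INTERVAL '24 hours'
ORDER BY time, device;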
Is there, in one of these two databases, a possibility to not write to the HD / disk every time my PLC is putting data into it? Or will it write to the disk every time automatically? I did read about something called WAL, what is that, or will that use RAM to store the data before writing to the disk less often?
In PostgreSQL you can use asynchronous commit: https://www.postgresql.org/docs/current/wal-async-commit.html
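The WAL is PostgreSQL's write-ahead log: changes are appended to this log first and only written into the data files later. With asynchronous commit the server acknowledges a transaction before its WAL record is flushed to disk, which greatly reduces per-insert disk waits at the cost of possibly losing the last few transactions in a crash. It can be switched on per session (or globally in postgresql.conf):
SET synchronous_commit TO OFF;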

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second so that we can have a real-time view of the sensors' measurements.
This, however, is very heavy on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week with such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted, inserting a new row that averages the deleted data for a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
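For illustration only, the cron-job idea could be a single transaction that inserts the 5-minute averages and then deletes the raw rows it covered; a rough sketch with hypothetical table and column names (telemetry_raw, telemetry_5min):
BEGIN;
-- now() is fixed for the whole transaction, so both statements see the same cutoff
INSERT INTO telemetry_5min (bucket, sensor_id, avg_value)
SELECT date_trunc('hour', ts) + INTERVAL '5 min' * floor(date_part('minute', ts) / 5) AS bucket,
       sensor_id,
       avg(value)
FROM telemetry_raw
WHERE ts < now() - INTERVAL '1 day'
GROUP BY 1, 2;
DELETE FROM telemetry_raw
WHERE ts < now() - INTERVAL '1 day';
COMMIT;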

Re-index data more than once in Apache Druid

I want to get last-one-hour and last-one-day aggregation results from Druid. Most of the queries I use are ad-hoc queries. I want to ask two questions:
1- Is it a good idea to ingest all raw data without rollup? Without rollup, can I re-index the data multiple times? For example, one task re-indexes the data to find unique user counts for each hour, and another task re-indexes the same data to find the total count for each 10 minutes.
2- If rollup is enabled to compute some basic summaries, this prevents getting information from the raw data (because it is summarized). When I want to re-index the data, some useful information may not be found. Is it good practice to enable rollup in streaming mode?
Whether to enable rollup depends on your data size. Normally we keep the data outside of Druid so it can be replayed and re-indexed again into different data sources. If you have a reasonable size of data, you can keep your segment granularity at hours/day/week/month, ensuring that each segment doesn't exceed the ideal segment size (500 MB recommended), and set the query granularity to none at index time, so you can do the unique and total count aggregations at query time.
You can actually set your query granularity at index time to 10 minutes and it can still give you uniques over 1 hour and the total count received in 1 hour.
Also, you can index the data into multiple data sources, if that's what you are asking. If you re-index data into the same data source, it will create duplicates and skew your results.
It depends on your use case. Rollup will give you better performance and space optimization in the Druid cluster. Ideally, I would suggest keeping your archived data separate in a replayable format so it can be reused.
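For reference, the rollup and granularity settings discussed above are set in the ingestion spec's granularitySpec; a rough, illustrative fragment for the "daily segments, query granularity NONE, no rollup" setup might look like:
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}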

Why not use a timestamp with an interleaved sort key?

I'm trying to figure out the different types of sort keys in Amazon Redshift and I encountered a strange warning here, which is not explained:
Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
And yet, in their own example, Amazon uses an interleaved key on a date column with good performance.
So, my question is - what's the explanation to this warning and should I take it seriously? More precisely - is there a problem with using interleaved key over a timestamp column?
I think it might have been explained later on when they describe issues around vacuuming/reindexing:
When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.
So if that is the only reason, then it just means you will have increased maintenance on the index.
From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
The key point in the original quote is not that the data is a date or timestamp, it's that it increases "monotonically", which in this context presumably means increasing sequentially, such as an event timestamp or an ID number.
Using a date (not timestamp) column as an interleaved sort key makes sense when you know that on average X rows are processed every day and you are going to filter on that column; if you are not going to use it in filters, leave it out.
Also a note on VACUUM: when the VACUUM process is in progress, it needs temporary space to be able to complete the task by sorting and then merging the data in chunks. Cancelling the VACUUM process mid-flight will leave that extra space unreclaimed, so if for some reason any VACUUM has ever been cancelled in your cluster, this can account for the space increase. See https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html#r_VACUUM_usage_notes; usage note 3, the last point, is of particular interest.
In my case the tables ended up growing very rapidly compared to the number of rows inserted, and I had to build automatic table re-creation using a deep copy.
A timestamp column goes down to hours, minutes, seconds and milliseconds, which makes the data costly to sort. Data with millisecond granularity needs far more zone-map entries to record where the data starts and ends within the dataset. The same is not true for a date column in the sort key: a date column needs far fewer zone-map entries to keep track of the data residing in the table.
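For reference, an interleaved sort key is declared at table creation time; a hypothetical example with a date column plus another commonly filtered column:
CREATE TABLE events (
  event_date  DATE        NOT NULL,
  customer_id INTEGER     NOT NULL,
  event_type  VARCHAR(32),
  payload     VARCHAR(256)
)
INTERLEAVED SORTKEY (event_date, customer_id);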

Random Lookup Methodology

I have a Postgres database with a table that contains rows I want to look up at pseudorandom intervals. Some I want to look up once an hour, some once a day, and some once a week. I would like the lookups to happen at pseudorandom intervals inside their time window, so the lookup I want to do once a day should happen at a different time each time it runs.
I suspect there is an easier way to do this, but here's the rough plan I have:
Have a settings column for each lookup item. When the script starts, it randomizes the epoch time for each lookup and sets it in the settings column, identifying the time for the next lookup. I then run a continuous loop with a wait 1 to see if the epoch time matches any of the requested lookups. Upon running a lookup, it recalculates when the next lookup should be.
My questions:
Even in the design phase, this looks like it's going to be a duct tape and twine routine. What's the right way to do this?
If, by chance, my idea is the right way to do this, is my idea of repeating the loop with a wait 1 the right way to go? If I had two lookups back to back, there's a chance I could miss one, but I can live with that.
Thanks for your help!
Add a column to the table for NextCheckTime. You could use either a timestamp or just an integer with the raw epoch time. Add a (non-unique) index on NextCheckTime.
When you add a row to the database, populate NextCheckTime by taking the current time, adding the base interval, and adding/subtracting a random factor (maybe 25% of the base interval, or whatever is appropriate for your situation). For example:
my $interval = 3600; # 1 hour in seconds
my $next_check = time + int($interval * (0.75 + rand 0.5));
Then in your loop, just SELECT * FROM table ORDER BY NextCheckTime LIMIT 1. Then sleep until the NextCheckTime returned by that (assuming it's not already in the past), perform the lookup, and update NextCheckTime as described above.
If you need to handle rows newly added by some other process, you might put a limit on the sleep. If the NextCheckTime is more than 10 minutes in the future, then sleep 10 minutes and repeat the SELECT to see if any new rows have been added. (Again, the exact limit depends on your situation.)
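In PostgreSQL terms, the schema side of that answer might look like the following (lookup_items and next_check_time are placeholder names):
ALTER TABLE lookup_items ADD COLUMN next_check_time TIMESTAMPTZ;
CREATE INDEX lookup_items_next_check_idx ON lookup_items (next_check_time);
-- polling query used inside the loop
SELECT * FROM lookup_items ORDER BY next_check_time LIMIT 1;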
How big is your data set? If it's a few thousand rows, then just randomizing the whole list and grabbing the first x rows is OK. As the size of your set grows, this becomes less and less scalable; the performance drops off at a non-linear rate. But if you only need to run this once an hour at most, then it's no big deal if it takes a minute or two, as long as it doesn't kill other processes on the same box.
If you have a gapless sequence, whether there from the beginning or added on, then you can use indexes with something like:
my $i   = int(rand($sizeofset));    # random index in 0 .. sizeofset-1
my $row = $dbh->selectrow_hashref(
    'SELECT * FROM table WHERE seqid = ?', undef, $i);    # $dbh is assumed to be an already connected DBI handle
and get good scalability to millions and millions of rows.