Re-indexing data more than once in Apache Druid - streaming

I want to get aggregation results for the last hour and the last day from Druid. Most of the queries I run are ad hoc. I have two questions:
1- Is it a good idea to ingest all raw data without rollup? Without rollup, can I re-index the data multiple times? For example, one task re-indexes the data to find unique user counts for each hour, and another task re-indexes the same data to find the total count for each 10 minutes.
2- If rollup is enabled to compute some basic summaries, it prevents getting information from the raw data (because it is summarized). When I want to re-index the data, some useful information may be missing. Is it good practice to enable rollup in streaming mode?

Whether to enable rollup depends on your data size. Normally we keep the raw data outside of Druid so that it can be replayed and re-indexed into different data sources. If your data is of a reasonable size, you can set the segment granularity to hour/day/week/month, making sure that each segment doesn't exceed the ideal segment size (500 MB recommended), and set the query granularity to none at index time, so that you can do the unique and total count aggregations at query time.
You can also set the query granularity at index time to 10 minutes, and it can still give you the uniques per hour and the total count received per hour.
Also, you can index the data into multiple data sources, if that's what you are asking. If you re-index data into the same data source, it will create duplicates and skew your results.
It depends on your use case. Rollup gives you better performance and space optimization in the Druid cluster. Ideally, I would suggest keeping your archived data separately, in a replayable format, so it can be reused.
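To illustrate why a 10-minute granularity at index time can still answer hourly questions, here is a minimal sketch in plain Python. It is not Druid's API: the function names and the sample events are made up, and a Python set stands in for the mergeable sketch column that a rolled-up data source would need in order to keep unique counts correct.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to(ts, minutes):
    """Truncate a datetime to the start of its N-minute bucket (N divides 60)."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def rollup_10min(events):
    """Simulate index-time rollup: one row per 10-minute bucket."""
    buckets = defaultdict(lambda: {"count": 0, "users": set()})
    for ts, user in events:
        row = buckets[floor_to(ts, 10)]
        row["count"] += 1
        row["users"].add(user)  # assumption: stand-in for a mergeable sketch
    return dict(buckets)

def hourly(rolled):
    """Query-time aggregation over the rolled-up rows only."""
    hours = defaultdict(lambda: {"count": 0, "users": set()})
    for bucket, row in rolled.items():
        hour = hours[floor_to(bucket, 60)]
        hour["count"] += row["count"]   # totals can simply be summed
        hour["users"] |= row["users"]   # uniques must be merged, not summed
    return {h: (v["count"], len(v["users"])) for h, v in hours.items()}

# Hypothetical sample events: (timestamp, user id)
events = [
    (datetime(2020, 1, 1, 9, 3), "a"),
    (datetime(2020, 1, 1, 9, 7), "b"),
    (datetime(2020, 1, 1, 9, 15), "a"),
    (datetime(2020, 1, 1, 9, 59, 30), "c"),
    (datetime(2020, 1, 1, 10, 1), "a"),
]
rolled = rollup_10min(events)
totals = hourly(rolled)
```

Note that summing per-bucket unique counts would over-count user "a", which is why the rolled-up rows have to carry something mergeable rather than a plain number.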

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, such that we can have a real time view of the sensors measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question on how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that would average the deleted data for a given interval. For example I would like to keep raw data (measurements every second) for the present day and aggregated data (average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
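The cron-job idea can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 so that it runs standalone, the table and column names (raw, agg_5min, ts, sensor, value) are hypothetical, and a PostgreSQL version would use its own timestamp functions instead of integer arithmetic on epoch seconds.

```python
import sqlite3

BUCKET = 300  # 5 minutes, in seconds

def compact(conn, cutoff):
    """Replace raw rows older than `cutoff` with 5-minute averages."""
    cutoff -= cutoff % BUCKET  # align so no bucket is split across runs
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "INSERT INTO agg_5min (bucket, sensor, avg_value, n) "
            "SELECT (ts / ?) * ?, sensor, AVG(value), COUNT(*) "
            "FROM raw WHERE ts < ? GROUP BY ts / ?, sensor",
            (BUCKET, BUCKET, cutoff, BUCKET),
        )
        conn.execute("DELETE FROM raw WHERE ts < ?", (cutoff,))

# Hypothetical demo data: one reading per second for 15 minutes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (ts INTEGER, sensor TEXT, value REAL)")
conn.execute(
    "CREATE TABLE agg_5min (bucket INTEGER, sensor TEXT, avg_value REAL, n INTEGER)"
)
conn.executemany(
    "INSERT INTO raw VALUES (?, ?, ?)",
    [(ts, "s1", float(ts)) for ts in range(900)],
)
compact(conn, cutoff=650)  # only the complete buckets before 600 are compacted
```

Running both statements in one transaction means a failure between the insert and the delete cannot lose or duplicate data. Storing the row count n next to the average also allows the 5-minute rows to be combined into coarser averages later.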

TimescaleDB: performance of a hypertable with append vs. midpoint inserts and indexing

I have some time-series data that I am about to import into TimescaleDB, as (time, item_id, value) tuples in a hypertable.
I have created an index:
CREATE INDEX ON time_series (item_id, timestamp DESC);
Does TimescaleDB have different performance characteristics when inserting rows in the middle of a time series vs. appending them at the end of the time series? I know this is an issue for some native PostgreSQL data structures, like BRIN indexes.
I am asking because for some item_ids I might have patchy data and I need to insert those values after other item_ids have filled the tip of time series. Basically, some items might be old data that is seriously behind the rest of the items.
I don't think it behaves differently.
In your case, insert performance will depend on the following:
How many indexes do you have on that table? Are they all really needed?
Do those indexes contain only the minimum required columns?
Use parallel insert/copy. See here for more info.
Insert rows in batches.
Configure your shared_buffers properly (25% of available RAM is recommended by the documentation).
But this tip is going to help you the most:
Write data in loose time order
When chunks are sized appropriately, the latest chunk(s) and their associated indexes are naturally maintained in memory. New rows inserted with recent timestamps will be written to these chunks and indexes already in memory.
If a row with a sufficiently older timestamp is inserted – i.e., it's an out-of-order or backfilled write – the disk pages corresponding to the older chunk (and its indexes) will need to be read in from disk. This will significantly increase write latency and lower insert throughput.
Particularly, when you are loading data for the first time, try to load data in sorted, increasing timestamp order.
Be careful if you're bulk loading data about many different servers, devices, and so forth:
Do not bulk insert data sequentially by server (i.e., all data for server A, then server B, then C, and so forth). This will cause disk thrashing as loading each server will walk through all chunks before starting anew.
Instead, arrange your bulk load so that data from all servers are inserted in loose timestamp order (e.g., day 1 across all servers in parallel, then day 2 across all servers in parallel, etc.)
source: TimeScale blog
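A minimal sketch of the "loose time order" advice, in plain Python with no database calls. It assumes each item's (or server's) rows are already sorted by time; the function names and sample rows are made up for illustration. The per-source streams are merged by timestamp and then cut into batches that could be handed to a batched INSERT or COPY.

```python
import heapq
from itertools import islice

def loose_time_order(per_source_rows):
    """Merge per-source streams (each sorted by time) into one time-ordered stream."""
    return heapq.merge(*per_source_rows, key=lambda row: row[0])

def batches(rows, size):
    """Yield lists of up to `size` rows, e.g. for a batched INSERT/COPY."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical (time, item_id, value) rows, one list per source
server_a = [(1, "a", 0.5), (4, "a", 0.7), (7, "a", 0.1)]
server_b = [(2, "b", 1.5), (3, "b", 1.7), (9, "b", 1.1)]

merged = list(loose_time_order([server_a, server_b]))
batch_sizes = [len(b) for b in batches(merged, 4)]
```

Because heapq.merge is lazy, the same approach works with generators that stream rows from files, so the full data set does not have to fit in memory.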

Is there a Clickhouse feature that simplifies this aggregation schema (every record for a week; 5 minute totals for a year)

I have a reporting Clickhouse database that stores a very large amount of DNS traffic logs (big enough that it's only practical to store 2 days of raw query logs). A main table stores records of individual DNS queries, then materialised views aggregate that down into useful chunks for graphing/reporting (eg ip_protocol by query_types by region).
Storing even those aggregates is very large (we can keep about 1 week of data).
So for long term (yearly) reporting I also aggregate that down into totals for each 5min block using materialised views with SummingMergeTree & toStartOfFiveMinute().
ie
[original data] -> [mat-view SummingMergeTree(counts), TTL 7 DAYS]
[original data] -> [mat-view SummingMergeTree(5minTimestamp, counts), TTL 1 YEAR].
That works, but it means our graphs have to hit two different views (there are a lot of views and a lot of graphs). And it just feels a bit clunky.
Clickhouse is so good at reporting I wonder if there's some aggregation method or storage engine that would let me store this in a single place. Something that could merge over "toSecond(..) if timestamp < 7 days old, then toStartOfFiveMinute(..)".
Is there some built-in feature, or structure I'm unaware of?
Thank you.
You may find TTL GROUP BY useful: https://kb.altinity.com/altinity-kb-queries-and-syntax/ttl-group-by-examples
It lets you reduce data within a single table.
Another option is GraphiteMergeTree: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/graphitemergetree/
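As a rough illustration of what this kind of in-table reduction does, here is a plain-Python simulation of the effect. It is not ClickHouse syntax: the 7-day cutoff and 5-minute buckets are taken from the question, while the function name age_out and the sample rows are made up.

```python
from collections import defaultdict

FIVE_MIN = 300            # seconds
WEEK = 7 * 24 * 3600      # seconds

def age_out(rows, now):
    """Keep recent rows as-is; sum older rows per (5-minute bucket, key)."""
    recent = []
    old = defaultdict(int)
    for ts, key, count in rows:
        if ts >= now - WEEK:
            recent.append((ts, key, count))
        else:
            old[(ts - ts % FIVE_MIN, key)] += count
    collapsed = [(ts, key, c) for (ts, key), c in sorted(old.items())]
    return collapsed + recent

# Hypothetical rows: (unix timestamp, key, count)
now = 100 * WEEK
rows = [
    (now - WEEK - 10, "udp", 1),  # older than a week
    (now - WEEK - 20, "udp", 2),  # older than a week, same 5-minute bucket
    (now - 5, "udp", 3),          # recent, kept at full resolution
]
reduced = age_out(rows, now)
```

Queries and graphs then read from one table and see fine-grained rows for the recent period and 5-minute totals for everything older.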

Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs), and it only needs to retain that data for a week or two. This data will be piped in at a constant rate (5,000 per second, for example), non-stop, 24 hours a day. After a week or two, the document should drop out of the index, never to be heard from again.
The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema could have 2 attributes that could be categorized.
The system needs to be indexed in near real-time as data is fed to it. A delay of 15 to 30 seconds is acceptable.
We prefer to stream data into the indexer/service with a constant stream of pipe data.
Lastly, a single stand-alone solution is preferred over any type of distributed setup (this will be part of a package to deploy and set up on local machines for testers).
I'm looking closely at Sphinx search engine with RT updates via the API as it checks off most of these. But, I am not seeing an easy way to expire documents after a certain length of time.
I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But that creates the problem of tracking large numbers of IDs in a separate datastore, which would need the same kind of 5,000-per-second inserts, and of deleting them when done.
I also have a concern about index fragmentation in Sphinx from mass-inserting, and from mass-deleting in the middle of inserting.
We would really prefer the search engine/indexer to handle the expiration itself.
I think I can use WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO as the where clause in the Sphinx API to gather the document IDs to delete. The problem is that if the system does not stay on top of the deletes, the total number of documents/search results will be in the tens of millions, maybe even billions, after a two-week timeframe if it has to gather a few days' worth of document IDs to delete. That's not a feasible query.
You can actually run
DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO
As a query to delete the old documents, which is much simpler :)
You will also need to call OPTIMIZE INDEX from time to time.
Both of these will have to be called on some sort of 'cron' schedule, as they won't be run automatically.
You might be better off not using Sphinx's DELETE function at all. When writing RT indexes, as soon as the RAM chunk is full it's written out as a disk chunk. So you end up with a number of disk chunks on the disk. The oldest documents will be in the oldest chunk, sequentially.
So to clear out the oldest documents, you could just dispose of the oldest chunks (on a rolling basis).
The problem is that Sphinx does not include a function to delete individual chunks.
You would need to shut down searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.
But in the more general sense, I'm not sure Sphinx will be able to keep up with a continuous stream of 5,000 documents per second (even ignoring deletes for a moment). Sphinx is generally designed for write-infrequently, read-frequently workloads. It builds a (for the most part) monolithic inverted index. This is great for querying, but is very hard to keep updated. It's not great for incremental updates.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
  all datasets have the same keys
  1 million keys (this may later be 10 or 20 million)
dataset columns:
  each dataset has the same columns
  10 to 20 columns
  most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
  a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application:
  user can choose which datasets they are interested in (anywhere from 15 to 1000)
  application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
  an entire dataset can be added, dropped, or replaced/updated
  would be cool to be able to add columns. But, if required, can just replace the entire dataset.
  never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
  currently two machines with 24 cores each
  eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
  partitions are nice, since can easily add/drop/replace partitions
  database is nice for filtering based on type_id
  databases aren't easy for writing parallel queries
  databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
  created a tab separated file per dataset for a particular type_id
  uploaded to hdfs
  map: retrieved a value/column for each key
  reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (datasets A and B taken together) by doing
sqrt(E[AB^2] - (E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|), and likewise E[AB] is (sum(A) + sum(B))/(|A|+|B|).
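The formula above can be sketched in a few lines of Python. The function names and sample values are made up for illustration. Each dataset is reduced once to (count, sum, sum of squares); any selection of datasets can then be combined by adding those triples, without re-reading the rows. This gives the population standard deviation, and the E[X^2] - (E[X])^2 form can lose precision when the mean is large relative to the spread.

```python
import math

def partial(values):
    """Per-dataset summary: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Combine two summaries by adding them component-wise."""
    return tuple(x + y for x, y in zip(a, b))

def mean_stddev(stats):
    """Mean and population stddev from a (count, sum, sum of squares) summary."""
    n, s, ss = stats
    mean = s / n
    variance = max(ss / n - mean * mean, 0.0)  # clamp tiny negative rounding error
    return mean, math.sqrt(variance)

# Hypothetical values of one column for one key in two datasets
A = [1.0, 2.0, 3.0]
B = [4.0, 5.0]

combined = merge(partial(A), partial(B))
mean, stddev = mean_stddev(combined)
```

With summaries like these stored per key, column and dataset, a request for 15 to 1000 datasets becomes a sum over that many small triples per key instead of a scan of the raw values.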
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give an idea of what cluster size you need.
If I understand you correctly and you only need to aggregate on a single column at a time, you can store your data differently for better results.
In HBase that would look something like:
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, over which you can compute avg etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since it doesn't sound like you need to join, you should be fine. You can have the indexes set up to find the data set and then do the math either in SQL or in the app.