Handle range-based queries - key-value

We came across a case where we want to retrieve data from a time series. Let's say we have time-based data: [“t1-t2” : {data1}, “t2-t3” : {data2}, “t3-t4” : {data3}]
With this kind of data, we want to look up the exact data for a given time. For example, for a given time t1.5 the result should be data1, and for t2.6 it should be data2.
To solve this, we are planning to store the data in a sorted map in Aerospike, as shown below: {“t1” : {data1}, “t2” : {data2}, “t3” : {data3}}
When a client asks for t1.5, we must return data1. To achieve this, we implemented a UDF at the server level that does a binary search for the nearest key at or below the given input (i.e. t1.5), which returns t1's value, i.e. data1.
Is there a better way of achieving this? The UDF incurs a cost at the server level for every request, and a UDF doing a binary search has to load all the data into memory; can we avoid that?
We are planning to use Aerospike for this. Is there a better data store to handle such queries?

Thinking aloud… Storing t1-t2, t2-t3 is redundant on t2. Just store t1; t2 is inferred from the next key:value, i.e. { t1:data, t2:data, … }. Store the map key-sorted (map policy). You must know the maximum difference between any ‘t1’ and ‘t2’. Then:
- Build a secondary index on MAPKEY, type numeric (this essentially does the bulk of the sort work for you upfront, in RAM).
- Search for records where t is between t-maxdiff and t+maxdiff ==> a small set of records, and pass these to your UDF.
- Invoke the UDF on this small subset of records to return the data. This will be a very simple UDF.
Note: UDFs are limited to 128 concurrent executions at any given time.

I'm not sure I understand the problem. First, you should be inserting into a K-ordered map, where the key is the timestamp (at millisecond, second, or another resolution). The value would be a map of the attributes.
To get back any range of time you'd use a get-by-key-interval operation (for example, the Python client's Client.map_get_by_key_range). You can figure out how to build the range, but it's simply everything between two timestamps.
Don't use a UDF for this, it is not going to perform as well or scale as the native map/list operations would.
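For concreteness, here is a rough sketch using the Python client's map operation helpers; the namespace, set, bin name, and the MAX_GAP_MS bound on the gap between consecutive keys are assumptions, and the exact shape returned for MAP_RETURN_KEY_VALUE can vary by client version:

import aerospike
from aerospike_helpers.operations import map_operations as mop

MAX_GAP_MS = 60000  # assumed upper bound on the gap between consecutive keys
MAP_POLICY = {'map_order': aerospike.MAP_KEY_ORDERED}

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
key = ('test', 'timeseries', 'sensor-1')  # hypothetical namespace/set/record key

# Write: one map entry per interval start, keyed by the millisecond timestamp.
client.operate(key, [mop.map_put('events', 1700000000000, {'v': 'data1'}, MAP_POLICY)])

def lookup(ts_ms):
    # Return the value whose interval covers ts_ms, i.e. the nearest key at or below it.
    _, _, bins = client.operate(key, [
        mop.map_get_by_key_range('events', ts_ms - MAX_GAP_MS, ts_ms + 1,
                                 aerospike.MAP_RETURN_KEY_VALUE)
    ])
    entries = bins['events']  # key-ordered (key, value) pairs; format may vary by client version
    return entries[-1][1] if entries else None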

Related

InfluxDB: Query From Multiple Series

I have an influxdb instance, in which an upstream server is logging measurements into. I have multiple series of the shape: web.[domain].[status], for example: web.www.foobar.com.2xx, web.www.quux.com.3xx etc. There are two "variables" encoded into the series name: the domain and the status (2xx, 3xx, etc. are already aggregated).
Now I'd like to see how many of those requests I get. One possibility would be to just list the series:
select sum("value") from "web.www.quux.com.2xx","web.www.quux.com.3xx",...
But this is neither practical (too many) nor actually feasible (new domains are added and removed all the time), so I need a more flexible approach.
Is there some kind of wildcard syntax allowed in the from clause? The documentation doesn't mention any. Or is there another way to approach this?
You should avoid this kind of measurement naming convention:
https://docs.influxdata.com/influxdb/v1.8/concepts/schema_and_data_layout/#avoid-encoding-data-in-measurement-names
Avoid encoding data in measurement names
InfluxDB queries merge data that falls within the same measurement; it’s better to differentiate data with tags than with detailed measurement names. If you encode data in a measurement name, you must use a regular expression to query the data, making some queries more complicated or impossible.
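As a hedged sketch of what that looks like with the v1 Python client (host, database, and field names are assumptions): write a single "web" measurement with the domain and status as tags, and the flexible query becomes a plain GROUP BY on the tag; with the existing encoded series names, a regular expression in the FROM clause is the usual workaround.

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='metrics')  # hypothetical DB

# Preferred schema: one "web" measurement, domain and status class as tags.
client.write_points([{
    'measurement': 'web',
    'tags': {'domain': 'www.quux.com', 'status': '2xx'},
    'fields': {'value': 1}
}])

# Flexible query: just group by the tag, no series enumeration needed.
client.query('SELECT sum("value") FROM "web" WHERE time > now() - 1h GROUP BY "domain"')

# With the existing encoded names, a regex in FROM is the fallback.
client.query('SELECT sum("value") FROM /^web\\..*\\.2xx$/ WHERE time > now() - 1h')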

Can we change data type of dimension post ingestion in Druid

We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How does Druid support schemaless input? Let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify each new dimension, identify its data type, and ingest it. Is there any way to achieve this?
How does Druid support data type changes? Let's say, after ingesting 100 GB of data, there is a need to change a dimension's data type from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not get a substantial overview of either use case.
For question 1, I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1) Consider the values to be zeros and do not try to parse string values. (This seems to be the current behaviour.)
2) Try to parse string values, and consider the values to be zero if they are not parseable, or null, or multi-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr + how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x" even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
For question 2, I think a reindex of the data is necessary:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec; Druid will treat every column that is not the timestamp as a dimension.
More detail about this approach can be found here:
Druid Schema less Ingestion
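As a rough illustration of that approach (only the dimensionsSpec fragment is shown, written as a Python dict, and the excluded column name is hypothetical): leaving the dimensions list empty is what enables the schemaless behaviour, and discovered columns are ingested as string dimensions.

# Sketch of the relevant dimensionsSpec fragment; the rest of the ingestion spec
# (dataSource, parser, granularity, ioConfig) is omitted here.
dimensions_spec = {
    "dimensions": [],                      # empty => schemaless: non-timestamp columns become string dimensions
    "dimensionExclusions": ["userAgent"],  # hypothetical column you do not want as a dimension
}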
2) For the 2nd question, you can make changes to the schema, and Druid will create new segments with the new data type while your old segments still use the old data type.
If you want all of your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing all segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html
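For completeness, a sketch of what the type change itself might look like in the reindexing task's dimensionsSpec (again written as a Python dict, with hypothetical column names): the dimension is declared with its new type explicitly, and the reindex task reads the existing segments back as its input as described in the update-existing-data doc above.

# Hypothetical dimensionsSpec fragment for a reindex that switches "userId" to long;
# plain string entries remain string dimensions.
dimensions_spec = {
    "dimensions": [
        {"name": "userId", "type": "long"},   # was a string dimension before the reindex
        "country",                            # unchanged string dimension
    ]
}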

Spark: groupBy taking a lot of time

In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the structure below:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
1) apply filter on this RDD and scope out relevant rows: Filter
2) apply filter on this RDD and scope out relevant rows: Filter
3) join the RDDs: Join
4) apply a map phase to compute investment: Map
5) apply a GroupBy phase to group the data according to the desired view: GroupBy
6) apply a map phase to aggregate the data as per the grouping achieved in the above step (say, view data across time period) and also create new objects based on the result set desired to be collected: Map
7) collect the result: Collect
So if the user wants to view investment across time periods, then the below list is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken per operation, GroupBy was taking 90% of the time taken to execute the whole DAG.
IMO, we can replace the GroupBy and the subsequent Map operations with a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(T,T,T) where T will be <CustomTuple, Map<String, Double>>.
Or maybe after step 3 in the above DAG I run another map function that returns an RDD of just the metric that needs to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it will be able to help me in this case.
Secondly, my application will receive requests on varying keys. With my current RDD design, each request would require me to repartition or re-group my RDD on that key, which means that for each request, grouping/re-partitioning would take 95% of the time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging as the current performance of application without Spark is 10 times better than performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that the JOIN was taking a lot of time. Maybe that's why groupBy was taking time. [EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead, it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations let you aggregate before and after the shuffle (in the map phase and in the reduce phase, to use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know if you only want to make a sum or whether you want to apply some other process to this data, but either option can be done in this operation).
mergeValue: In this operation you describe how a CustomTuple is added to a CustomTupleCombiner (again, I am presupposing a simple sum operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, and the operation would be something like CustomTupleCombiner.sum(CustomTuple), which sets CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: In this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which will be something like CustomTupleCombiner1.Map.keys.foreach( k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k) ), or something along those lines.
The pasted code is not tested (it may not even compile, because I wrote it in vim), but I think it might work for your scenario.
I hope this will be useful.
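As a rough stand-in for that code, here is a PySpark sketch of the same combineByKey pattern; the record shape, key format, and metric names are assumptions made for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="combineByKeySketch")

# (key, metrics) pairs; the key stands in for whatever CustomTuple identifies (week, city, ...).
rdd = sc.parallelize([
    ("week1-cityA", {"Inv": 20.0, "GRP": 30.0}),
    ("week1-cityA", {"Inv": 5.0, "GRP": 10.0}),
    ("week2-cityB", {"Inv": 7.0}),
])

def create_combiner(metrics):
    # Start the per-key accumulator from the first value seen for that key.
    return dict(metrics)

def merge_value(acc, metrics):
    # Fold one more value into the accumulator (runs map-side, before the shuffle).
    for name, value in metrics.items():
        acc[name] = acc.get(name, 0.0) + value
    return acc

def merge_combiners(acc1, acc2):
    # Merge accumulators coming from different partitions (after the shuffle).
    for name, value in acc2.items():
        acc1[name] = acc1.get(name, 0.0) + value
    return acc1

aggregated = rdd.combineByKey(create_combiner, merge_value, merge_combiners)
print(aggregated.collect())  # e.g. [('week1-cityA', {'Inv': 25.0, 'GRP': 40.0}), ...]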
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of hashCode() will have to calculate the hashCode() of those maps as well, causing an iteration over all of their elements to, in turn, calculate the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables (see the sketch below).
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling, by tuning your hashing between two consecutive group-by's.
Keep shuffling Node local, by choosing a Partitioner that will have an affinity of keeping partitions local during consecutive use.
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on Wikipedia.
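Since the question is Java, the following Python sketch is only an illustration of that second option's principle (field names mirror CustomTuple above): keep the hash computation on the cheap scalar fields, while equality still compares everything, which satisfies the hash contract.

class CustomKey:
    # Hash uses only the cheap scalar fields, so partitioning does not walk the maps.
    def __init__(self, time_period, source_key, hierarchy_map, granular_map):
        self.time_period = time_period
        self.source_key = source_key
        self.hierarchy_map = hierarchy_map
        self.granular_map = granular_map

    def __hash__(self):
        return hash((self.time_period, self.source_key))

    def __eq__(self, other):
        # Equality still looks at everything; equal objects keep equal hashes,
        # which is all the contract requires.
        return (self.time_period == other.time_period
                and self.source_key == other.source_key
                and self.hierarchy_map == other.hierarchy_map
                and self.granular_map == other.granular_map)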

How to stream data in KDB?

I have access to a realtime KDB server that has tables with new data arriving every millisecond.
Currently, I'm just using a naive method which is basically like:
.z.ts:{
  newData: getNewData[]; / get data arriving in the last second
  data::data uj newData;
  };
\t 100
to ensure that my data (named data) is constantly updated.
However, the uj is very slow (probably due to constant reallocation of memory) and polling is just plain awkward.
I've heard KDB is meant to be good at handling this kind of streaming tick data, so is there a better way? Perhaps some push-based method without need for uj?
Rather than polling, use kdb+tick, the publish-subscribe architecture for kdb+.
Official Manual: https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick
If there is a realtime database, presumably there's a tickerplant feeding it. You can subscribe to the tickerplant:
.u.sub[`;`];
That means subscribe to all tables, all symbols. The result of the call is a list whose 0th element is the table name and whose 1st element is the current data the tickerplant holds for that table (usually empty or a small number of rows). The tickerplant will then cache the handle to your kdb instance and keep sending it data. BUT it assumes there is an upd function on your kdb instance that can handle the request.
upd:{[t;x] t insert x}
OR
upd:insert
(same thing)
The upd function is called with a table symbol name (t) and the data to insert into it (x).
So a good straightforward implementation overall would be:
upd:insert;
t:.u.sub[`;`];  / subscribe; result is (table name; initial snapshot held by the tickerplant)
(t 0) set t 1;  / install the initial snapshot under the table's own name

What's more efficient: a Core Data fetch, or manipulating/creating arrays?

I have a Core Data application and I would like to get results from the DB based on certain parameters. For example, I might want to grab only the events that occurred in the last week, or the events that occurred in the last month. Is it better to do a fetch for the whole entity and then work with that result array to create arrays for each situation, or is it better to use predicates and make multiple fetches?
The answer depends on a lot of factors. I'd recommend perusing the documentation's description of the various store types. If you use the SQLite store type, for example, it's far more efficient to make proper use of date range predicates and fetch only those in the given range.
Conversely, say you use a non-standard attribute like searching for a substring in an encrypted string - you'll have to pull everything in, decrypt the strings, do your search, and note the matches.
On the far end of the spectrum, you have the binary store type, which means the whole thing will always be pulled into memory regardless of what kind of fetches you might do.
You'll need to describe your managed object model and the types of fetches you plan to do in order to get a more specific answer.