Differences between Change Capture stage and Difference stage in DataStage

I'm trying to implement SCD Type 1 loading using the Change Capture and Difference stages in DataStage. Both of these jobs are working fine without any errors, but I would like to know what the differences between these two stages are, and which one will deliver better performance.
I have tried several test cases to find the differences; a few that I found were:
In the Change Capture stage, both inputs need to have the same number of columns and the same column names with similar data types, but that is not the case in the Difference stage.
Could someone help me figure out the actual important differences between these two stages? (Any related web links are welcome.)
Thank you.

Performance is not the point - they should be comparable - more important is the functionality point of view.
The DIFFERENCE stage returns the "before" columns; "after" columns are only returned in some situations (if they are not part of the key or value columns and are named differently).
The CHANGE CAPTURE stage returns the "after" columns; "before" columns are only available for deletes.

Try the Slowly Changing Dimension stage. It does a lot more of the work for you.

ObjectBox dynamic queries

e.g. the end user makes selections from two of five possible filters, the last three filters being left as ‘all’.
Rather than me creating queries for every possible combination of the 5 filters (2^5 = 32 different queries in total), what is the most efficient syntax for handling this?
Should I use .and to chain the queries together, and then can I specify ‘all’ for any which are not required?
A query builder can be used to build the query according to the selected filters, i.e. adding each query criterion only inside an if condition that checks the corresponding filter.
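As a minimal sketch in Dart, assuming the ObjectBox query builder API and an invented Item entity with category and status properties (Item_ would come from the generated objectbox.g.dart):

    // Sketch only: 'Item', 'category' and 'status' are made-up names for illustration.
    List<Item> findItems(Box<Item> box, String? selectedCategory, String? selectedStatus) {
      Condition<Item>? condition;
      if (selectedCategory != null) {
        condition = Item_.category.equals(selectedCategory);
      }
      if (selectedStatus != null) {
        final c = Item_.status.equals(selectedStatus);
        condition = condition == null ? c : condition.and(c);
      }
      // Filters left as 'all' (null here) simply never contribute a criterion.
      final query = (condition == null ? box.query() : box.query(condition)).build();
      final results = query.find();
      query.close();
      return results;
    }

This way unwanted filters are skipped entirely instead of being matched against a dummy condition.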
I solved this as follows: use a ternary operator ?:, and in the second branch query the property with .notNull().
This gives the result of 'all', effectively ignoring this part of the query.
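Roughly, and again with made-up property names, the workaround looks like this; .notNull() matches every stored value, so that branch behaves like 'all':

    // Hypothetical sketch of the ternary workaround described above.
    final condition = (selectedCategory != null
            ? Item_.category.equals(selectedCategory)
            : Item_.category.notNull()) // matches everything, i.e. 'all'
        .and(selectedStatus != null
            ? Item_.status.equals(selectedStatus)
            : Item_.status.notNull());
    final query = box.query(condition).build();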
This is a hack, but it works. It is obviously an expensive solution, as the ideal would be to skip over unwanted filters completely.
Note to developer: 'if' cannot be used within the query structure in Dart. Thanks for finding time to respond; hopefully my additional info helps.

Can we change data type of dimension post ingestion in Druid

We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How Druid supports schemaless input: let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions and their data types and ingest them. Is there any way to achieve this?
How Druid supports data type changes: let's say that along the way (say after ingesting 100 GB of data) there is a need to change the data type of a dimension from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not get a substantial overview of either use case.
For question 1 I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1. Consider the values to be zero and do not try to parse string values. (This seems to be the current behaviour.)
2. Try to parse string values, and consider the values to be zero if they are not parseable, or null, or multi-value.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr and how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x", even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
For question 2 I think a reindex of the data is necessary:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec, and Druid will treat every column that is not the timestamp as a dimension.
More detail about this approach can be found here:
Druid Schema less Ingestion
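As a minimal sketch, a schemaless dimensionsSpec simply leaves the dimension list empty; the exclusion column names below are only an example:

    "dimensionsSpec": {
      "dimensions": [],
      "dimensionExclusions": ["timestamp", "value"]
    }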
2) For the 2nd question, you can make changes to the schema and Druid will create new segments with the new data type, while your old segments will still use the old data type.
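For instance (the dimension name here is made up), a later ingestion spec could declare the dimension with an explicit type, and segments created from that point on will use it:

    "dimensionsSpec": {
      "dimensions": [
        { "type": "long", "name": "userId" }
      ]
    }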
In case you want all your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing all segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html

Make a Class Schedule Report

How can I make my Crystal Report look like the attached image? I have had no success creating it with a crosstab.
The short answer is that Crystal Reports isn't really equipped to handle the format you're dealing with. And here's why:
Let's assume for a moment you've already figured out how to interpret your query into something usable. Since we aren't using a Cross Table, the best you could hope for would be setting a Details section for each individual time slot and arranging a large number of formulas into a grid shape:
The problem is that every Formula would need to be unique, interpreting whether there is a Class at that Time and Date, and which Class it is. There would be up to 168 of those formulas and you'd have to manually go in and modify each one to check for its own unique combination of Date and Time, which defeats the whole purpose of using a computer - to make repeated tasks easier.
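Each of those formulas would look something like this sketch in Crystal formula syntax (the {Schedule.*} field names are invented for the example), hard-coded to a single day and time slot:

    // Formula for the Monday 08:00 cell only; every other cell needs its own copy with different literals.
    If {Schedule.DayOfWeek} = "Monday"
       And {Schedule.StartTime} <= Time(8, 0, 0)
       And {Schedule.EndTime} > Time(8, 0, 0)
    Then {Schedule.ClassName}
    Else ""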
Plus you'll have difficulty with the formatting: You'd need to program every "cell" to use a unique set of colors based on the displayed Class. That part is technically doable, but there's no way to "merge the cells" when classes last longer than a half hour. You'd end up with something like this:
So don't torture yourself trying to make this happen in Crystal. Even with all the time and effort it would take to formulate the grid, there's no good way to make it look like your screenshot.
That said, it looks as though you managed to put a schedule together in Excel. Is there any reason you can't use Excel instead? It's a much more powerful tool, and a cursory Google search suggests it can handle queries as well.

SSAS Cube Semi-Additive measure: LastChild across Time but MAX across a few others and SUM across Portfolio

I have SQL Server 2008 R2 Standard Edition. The cube has one measure called "AUM". Basically this measure is only additive across one dimension, Portfolio.
Across Time I need to pick LastChild, Across Security I need to pick Max and Across Portfolio I need to pick SUM.
How should I create the measure? What should the Aggregation property be to achieve all 3 types of calculation?
Currently we have written SCOPE statements for Security and Time to override the default SUM behavior. This works great, but as the number of members in the Security and Time dimensions grows, the SSRS reporting queries slow down a lot.
I am currently testing new persisted measures, changing the aggregation property and combining them with some additional CREATE MEMBER statements, to see if I can avoid the SCOPE statements.
Any kind of help would be great. Thanks.
It's a nice problem. Some thoughts:
There is a problem with the order in which you evaluate your tuple if the aggregation is not associative. I'd take care with SCOPE (when you're evaluating a tuple that does not fall inside it). Check the presentations from Chris Webb (a nice SSAS guru).
The order of your aggregation is important: LastChild(Max(Sum(tuple))) is not Max(LastChild(Sum(tuple))). I'd go for a calculated member if performance is a problem:
First calculate the LastChild over your time dimension; here you can use any aggregation method with a NonEmpty. Once you have that, use another measure to properly get the max.
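A rough MDX sketch of that two-step idea (all dimension, hierarchy and level names below are placeholders, adjust them to your cube; the plain SUM across Portfolio is left to the normal aggregation):

    -- Sketch only: [Date].[Calendar] and [Security].[Security] are invented names.
    CREATE MEMBER CURRENTCUBE.[Measures].[AUM Last Period] AS
        ( [Measures].[AUM],
          ClosingPeriod([Date].[Calendar].[Date], [Date].[Calendar].CurrentMember) );

    CREATE MEMBER CURRENTCUBE.[Measures].[AUM Report] AS
        MAX( [Security].[Security].[Security].Members, [Measures].[AUM Last Period] );

Note that ClosingPeriod gives the last period regardless of data; swapping in Tail(NonEmpty(...)) would give the last non-empty one instead.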
P.S.: I think in SSAS you can define special aggregation methods (somewhere there is a use case for that).
hope it helps

Couchbase: Synchronizing views with a bucket

I've got a question for you Couchbase pros: is it possible to synchronize a subset of documents (e.g. the documents within a view) with another bucket?
So that the other bucket's documents are always a direct subset of the "master" bucket?
If so, isn't that too expensive in terms of performance? Or does Couchbase have any functionality to only create deep links to the documents instead of copying them?
Alternatively: is it possible to write views on views?
Thank you in advance!
--- EDIT ----
Let's say I want to have two sets (buckets) of documents S1 and S2. S2 is a subset of S1. Each set contains the same views V1, V2 and V3, since I want to be able to query any of them with the same logic/interface. In my case set S2 is built per user/company/store/whatever; in production there will be something like 1000-ish subsets S2. To stay abstract, let's call them S2a, S2b and S2c.
The selection of documents to be contained in any subset is done by a filtering instance (for example a view). Let's call these filtering instances F1 for filtering S1 to S2, hence F1a, F1b and F1c.
So with my current knowledge of Couchbase this results in the following design/view architecture: I've got the three "base" views V1, V2 and V3, and to realize S2a, S2b and S2c I must create the design views S2aV1, S2aV2, S2aV3, S2bV1, S2bV2, etc. (9 views).
One could say "Well, choose your keys wisely and you can avoid the sub views", but in my opinion this isn't that easy because of the following circumstances: in the worst case the filter parameters change every minute and contain many WHERE IN constraints, which (from my current point of view) could not be handled efficiently by querying k/v lists.
This leads to the following thoughts and the question I initially asked: if I use the same views in any subset (defined by a filter), shouldn't it be possible to build up an entity that helps me handle complex filtering? For example, a function that is called at runtime while generating the view output? This could look like /design/view?filter=F1 or something like that.
Or do you have any other ideas to solve this problem? Or should I use SQL since it's more capable of handling frequently changing filters?
Generally speaking, for most models you don't really need to have bucket "subsets". Is there a particular reason you are trying to do this, and why you would want that data broken out? You can also query your views, or instead of a view on a view you can just make a separate view that maps/filters further based on your needs (i.e. does the same job as a view on a view).
We are working on an Elasticsearch integration. Maybe that is better for your use case.
I think what you want to do is write a view on your original bucket, and then copy the key/values from that view, to be documents in a new bucket.
It shouldn't be hard to write an automated framework for managing this so that you can keep the derived data up to date in near real time.
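A filter view like F1a could be as small as a map function along these lines (the document fields are invented for the example); an external worker then reads the view results and writes or deletes the matching documents in the S2a bucket to keep it in sync:

    // Map function on the master bucket S1: emit the ids of documents belonging to subset S2a.
    function (doc, meta) {
      if (doc.type === "order" && doc.company === "company_a") {
        emit(meta.id, null);
      }
    }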