Azure Data Flow: compare two strings with lookup - azure-data-factory

I'm using Azure Data Flow to do some transformations on my data, but I'm facing a challenge.
I have a use case with two streams that share some common data, and what I'm looking for is to output the data the two streams have in common.
I match on some common fields (product_name (string) and brand (string)); I don't have an ID.
To do the matching I picked the Lookup activity and tried to compare the brand in the two streams, but the result is not correct, because for example:
left stream: brand = Estēe Lauder
right stream: brand = Estée Lauder
To me these are the same brand, just formatted differently. I wanted to use a 'like' operator, but the Lookup activity does not support it, so I'm comparing with the '==' operator.
Is there a way to work around this problem?

If you use the Exists transformation instead of Lookup, you will have much more flexibility because you can use custom expressions including regex matching. Also, you can look at using fuzzy matching functions in the Exists expression like soundex(), rlike(), etc.
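For example, the Exists custom expression could normalize both values before comparing them, or compare them phonetically. A sketch, assuming the two incoming streams are named LeftStream and RightStream and both carry a brand column (lower(), trim(), and soundex() are Data Flow expression functions):
lower(trim(LeftStream@brand)) == lower(trim(RightStream@brand)) || soundex(LeftStream@brand) == soundex(RightStream@brand)
Note that soundex() is a phonetic match, so treat it as a loose filter rather than an exact equality test.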

Related

In ObjectBox for Flutter, is there a way to compare two properties?

I'm new to using ObjectBox, so I've been experimenting with its query system to familiarize myself with it. One of the queries I've been unable to do is a query comparing two properties. Ignoring the errors they throw, these are some examples of what I'm looking to do:
// Get objects where first number is bigger than second number
boxA.query(ObjectA_.firstNumber.greaterThan(ObjectA_.secondNumber))
// Get parent objects where one of its children has a specific value from the parent
parentBox.query().linkMany(ParentObject_.children, ChildObject_.name.equals(ParentObject_.favoriteChild));
I know based on this question that it's possible in Java using filters, but I also know that query filters are not in ObjectBox for Dart. One of the workarounds I've been testing is querying for one property, getting the values, and using each value to query for the second property. But that becomes unsustainable at even moderately sized amounts of data.
If anyone knows of a "proper" way to do this without the use of Java filters, that would be appreciated. Otherwise, if there's a more performant workaround than the one I came up with, that would be great too.
There is no query filter API for Dart in ObjectBox, because Dart already has the where API.
E.g. for a result list write results.where((a) => a.firstNumber >= a.secondNumber).
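Put together, a minimal sketch of that approach, using the standard query()/build()/find() calls and the ObjectA class from the question:
// fetch everything once, then compare the two properties in Dart
final query = boxA.query().build();
final matches =
    query.find().where((a) => a.firstNumber > a.secondNumber).toList();
query.close();
This still loads all objects into memory, but it avoids the per-value re-querying of the workaround described in the question.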

Flutter/Supabase stream filters - why the one field filter limit?

Using the "flutter_supabase" package, I've been trying to add a dynamic filtered stream to my Flutter app, and have found that an exception is thrown if more than filter is applied.
Why the one field limit? Is there any way around this?
In my case, I want to apply two filter fields to the stream, and the exact fields are applied dynamically based on user selections.
Supabase has advised that this is a limit set by their realtime streaming layer.
Workarounds:
a) Build logical views to represent the different filters.
b) Add a 'where' clause to the stream results and filter at the client end, as in the sketch below.
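A sketch of workaround b) with the supabase_flutter client; the events table, its columns, and the primary key here are assumptions:
final supabase = Supabase.instance.client;
final stream = supabase
    .from('events')
    .stream(primaryKey: ['id'])
    .eq('category', selectedCategory) // the single server-side filter
    .map((rows) => rows
        // the second filter is applied client-side on each emitted snapshot
        .where((row) => row['status'] == selectedStatus)
        .toList());
Since both filter values are plain Dart values here, they can be swapped dynamically based on user selections.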

How can I apply different windows to one PCollection at once?

My use case is that the elements in my PCollection should be put into windows of different lengths (specified in the Row itself), while the subsequent operations, like the GroupBy, are the same, so I don't want to split up the PCollection at this point.
So what I'm trying to do is basically this:
windowed_items = (
    items
    | 'windowing' >> beam.WindowInto(
        window.SlidingWindows(lambda row: int(row.WINDOW_LENGTH), 60))
)
However, when building the pipeline I get the error TypeError: '<=' not supported between instances of 'function' and 'int'.
An alternative to applying different windows to one PCollection would be to split/branch the PCollection, based on the defined window, into multiple PCollections and apply the respective window to each. However, this would mean hardcoding the windowing for every allowed value, and in my case that is potentially a huge number of values, which is why I want to avoid it.
From the error I'm getting (though I couldn't find this stated explicitly in the docs), I understand that the SlidingWindows parameters have to be provided when building the pipeline and cannot be determined at runtime. Is this correct? Is there some workaround to apply different windows to one PCollection at once, or is it simply not possible? If so, are there any alternatives to the approach I outlined above?
I believe that custom session windowing is what you are looking for. However, it's not supported in the Python SDK yet.
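Until that lands, the branching alternative from the question can at least be written compactly when the set of allowed lengths is known up front. A sketch with Partition (the lengths and names are illustrative):
import apache_beam as beam
from apache_beam.transforms import window

ALLOWED_LENGTHS = [60, 300, 900]  # assumed fixed set of window lengths in seconds

def by_window_length(row, num_partitions):
    # route each row to the branch that matches its declared window length
    return ALLOWED_LENGTHS.index(int(row.WINDOW_LENGTH))

branches = items | beam.Partition(by_window_length, len(ALLOWED_LENGTHS))

windowed = [
    branch | f'windowing_{length}s' >> beam.WindowInto(
        window.SlidingWindows(size=length, period=60))
    for branch, length in zip(branches, ALLOWED_LENGTHS)
]
Each element of windowed can then go through the same GroupBy logic, at the cost of one branch per allowed length.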

Can we change the data type of a dimension post-ingestion in Druid?

We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How does Druid support schemaless input? Let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions and their data types, and to ingest them. Is there any way to achieve this?
How does Druid support data type changes? Let's say that at some point (say, after ingesting 100 GB of data) we need to change the data type of a dimension from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not find a substantial overview of either use case.
For question 1, I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1. Consider the values zeros and do not try to parse string values. This seems to be the current behaviour.
2. Try to parse string values, and consider the values zero if they are not parseable, or null, or multi-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr and how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x", even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
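To make that concrete, the two aggregator forms the issue contrasts look like this (a sketch; "x_sum" is an assumed output name):
{ "type": "longSum", "name": "x_sum", "fieldName": "x" }
{ "type": "longSum", "name": "x_sum", "expression": "x" }
The first goes through a direct column selector (behavior (1)); the second goes through Parser/Expr (behavior (2)).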
For question 2, I think a reindex of the data is necessary:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec, and Druid will treat every column that is not the timestamp as a dimension.
More detail about this approach can be found here:
Druid Schemaless Ingestion
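As a sketch, schemaless discovery is turned on by leaving the dimensions list empty in the spec's dimensionsSpec; the excluded column names below are assumptions:
"dimensionsSpec": {
  "dimensions": [],
  "dimensionExclusions": ["timestamp", "value"]
}
Every remaining input column is then ingested as a string dimension, which also fits the ingest-as-string advice from the first answer.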
2) For the 2nd question, you can make changes to the schema, and Druid will create new segments with the new data type, while your old segments will still use the old data type.
If you want all your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html

What's more efficient: a Core Data fetch, or manipulating/creating arrays?

I have a Core Data application and I would like to get results from the database based on certain parameters, for example only the events that occurred in the last week, or the events that occurred in the last month. Is it better to fetch the whole entity and then derive arrays from that result array for each situation, or to use predicates and make multiple fetches?
The answer depends on a lot of factors. I'd recommend perusing the documentation's description of the various store types. If you use the SQLite store type, for example, it's far more efficient to make proper use of date range predicates and fetch only those in the given range.
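For example, with the SQLite store a last-week fetch could push the date filter down into a predicate. A sketch in Swift, where the Event entity, its date attribute, and the context are assumptions:
import CoreData

// context is an assumed NSManagedObjectContext backed by the SQLite store
let request = NSFetchRequest<NSManagedObject>(entityName: "Event")
let weekAgo = Calendar.current.date(byAdding: .day, value: -7, to: Date())!
request.predicate = NSPredicate(format: "date >= %@", weekAgo as NSDate)
let lastWeeksEvents = try context.fetch(request)
Only the matching rows are materialized, instead of the whole entity.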
Conversely, say you need something non-standard, like searching for a substring in an encrypted string: you'll have to pull everything in, decrypt the strings, do your search, and note the matches.
On the far end of the spectrum, you have the binary store type, which means the whole thing will always be pulled into memory regardless of what kind of fetches you might do.
You'll need to describe your managed object model and the types of fetches you plan to do in order to get a more specific answer.