Kafka Connect - json path - regex condition not working - apache-kafka

I have an application where I use a connector to save data to a database.
I want to filter the messages saved by removing the ones that have a certain property very long.
My messages are like this :
{
field_a : value,
field_b : value,
field_c : possible very long value
}
So, I used in the kafka connector the Confluent Filter like this :
transforms: filterSpam
transforms.filterSpam.type: io.confluent.connect.transforms.Filter$Value
transforms.filterSpam.filter.condition: $[?(#.field_c =~ /^.{32000,}$/)]
transforms.filterSpam.filter.type: exclude
transforms.filterSpam.missing.or.null.behavior: include
For some reason the filter is not working. All messages pass through.
I tried also with the negation :
$[?(!(#.field_c =~ /^.{1,32000}$/))]
In this case, the very long were filtered out, but also some of the shorter ones were.
I do not understand where the issue is coming from. Any help ?

Actually, I needed to update my regex knowledge.
The issue was related to the fact that the string field on which I tried to apply the regex sometimes was multiline.
Thanks to this I managed to use a proper validation of the size of this field.
The final solution is :
transforms.filterSpam.filter.condition: $[?(#.field_c =~ /(\s)^.{32000,}$/)]

Related

Filtering column strings that contain substring

I am working on an if else in the Tmap, and one of the conditions is if a column contains a substring.
I am unsure exactly how to go about this being fairly new to talend.
This is the current syntax that I am using.
row16.Location.contains("clos")?"Pending":""
I have not been able to find any good examples of the correct way to go about this, other than the one above.
Talend uses Java as an underlying language, so you need to use the ternary operator of Java:
row16.Location.contains("clos") ? "Pending" : ""
But make sure you first check row16.Location for null, otherwise you'll get a NullPointerException if Location is null :
row16.Location != null && row16.Location.contains("clos") ? "Pending" : ""

Elasticsearch mongodb river script in index doesn't work

I'm trying to change few fields strings using javascript.
For example take only the last part of the URL taken from mongo through the river so in elasticsearch I'll have only the end of it.
When creating the index (using curl) I added under "options" the following script:
"script": "ctx.document.shorturl = ctx.document.url.substr(-4);delete ctx.document.url;
I tried some manipulations such as adding \"...\" or use ctx['doc']['url'] and others but nothing seems to work.
I always get only url field with the full url (shorturl is not created at all).
Can anyone suggest what is the right syntax to make it work?
Another thing I need to do is combine to fields - lat & long, to one "location" field in order to use it in Kibana, can anyone suggest the right script for that? (create new field called "location" which contain both field "lat" & "long" with comma between them).
Thanks.
You did substring(-4), hence it will return the whole string. You should use substring(4) instead:
ctx.document.shorturl = ctx.document.url.substr(4);delete ctx.document.url;

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)

Get statuscode text in C#

I'm using a plugin and want to perform an action based on the records statuscode value. I've seen online that you can use entity.FormattedValues["statuscode"] to get values from option sets but when try it I get an error saying "The given key was not present in the dictionary".
I know this can happen when the plugin cant find the change for the field you're looking for, but i've already checked that this does exist using entity.Contains("statuscode") and it passes by that fine but still hits this error.
Can anyone help me figure out why its failing?
Thanks
I've not seen the entity.FormattedValues before.
I usually use the entity.Attributes, e.g. entity.Attributes["statuscode"].
MSDN
Edit
Crm wraps many of the values in objects which hold additional information, in this case statuscode uses the OptionSetValue, so to get the value you need to:
((OptionSetValue)entity.Attributes["statuscode"]).Value
This will return a number, as this is the underlying value in Crm.
If you open up the customisation options in Crm, you will usually (some system fields are locked down) be able to see the label and value for each option.
If you need the label, you could either do some hardcoding based on the information in Crm.
Or you could retrieve it from the metadata services as described here.
To avoid your error, you need to check the collection you wish to use (rather than the Attributes collection):
if (entity.FormattedValues.Contains("statuscode")){
var myStatusCode = entity.FormattedValues["statuscode"];
}
However although the SDK fails to confirm this, I suspect that FormattedValues are only ever present for numeric or currency attributes. (Part-speculation on my part though).
entity.FormattedValues work only for string display value.
For example you have an optionset with display names as 1, 2, 3,
The above statement do not recognize these values because those are integers. If You have seen the exact defintion of formatted values in the below link
http://msdn.microsoft.com/en-in/library/microsoft.xrm.sdk.formattedvaluecollection.aspx
you will find this statement is valid for only string display values. If you try to use this statement with Integer values it will throw key not found in dictionary exception.
So try to avoid this statement for retrieving integer display name optionset in your code.
Try this
string Title = (bool)entity.Attributes.Contains("title") ? entity.FormattedValues["title"].ToString() : "";
When you are talking about Option set, you have value and label. What this will give you is the label. '?' will make sure that the null value is never passed.

Is there any Open topic Classifier API for demo purposes

Is there an Open Classifier/Categorizer API which returns categories/topics for a given entity string. For example for an input string like : "Barack Obama" it should return topics like : "Politics" similarly for "Albert Einsten" it should emit "Physicist, Nobel Laureate" etc.
try texlexan. it gave me some really good results but with slightly longer input texts.