Case-insensitive filtering using Google Guava

Currently I am using the following piece of code to create a filter on a map and obtain a filtered result set:
final Map filteredMap = Maps.filterKeys(mymap, Predicates.containsPattern("^Xyz"));
However, Guava's Predicates.containsPattern does case-sensitive matching.
How can I use containsPattern to do case-insensitive matching?

Use
Predicates.contains(Pattern.compile("^Xyz", Pattern.CASE_INSENSITIVE))
as the predicate instead. See the core Java Pattern class and Guava's Predicates.contains.
EDIT (after OP's comment): yes, you can write
Predicates.containsPattern("(?i)^Xyz")
(see Pattern's documentation: "Case-insensitive matching can also be enabled via the embedded flag expression (?i)."), but in my opinion it's less self-explanatory, and the compiled Pattern from the first example can be cached in a private static final constant when the predicate is used repeatedly, which can improve performance.
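For reference, a minimal self-contained sketch of the cached-pattern variant (the map contents below are made up purely for illustration):

import java.util.Map;
import java.util.regex.Pattern;

import com.google.common.base.Predicates;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.Maps;

public class CaseInsensitiveFilter {

    // Compiled once and reused; CASE_INSENSITIVE makes "^Xyz" also match "xyz..." or "XYZ..."
    private static final Pattern STARTS_WITH_XYZ =
            Pattern.compile("^Xyz", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Map<String, Integer> mymap = ImmutableMap.of(
                "XyzOne", 1,
                "xyzTwo", 2,
                "AbcThree", 3);

        // Keeps only the entries whose key matches the case-insensitive pattern
        Map<String, Integer> filteredMap =
                Maps.filterKeys(mymap, Predicates.contains(STARTS_WITH_XYZ));

        System.out.println(filteredMap); // {XyzOne=1, xyzTwo=2}
    }
}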

Related

Serilog FilterExpression to check if all string Properties of a LogEvent meet a length constraint?

In my appsettings.json, I want to filter Serilog log events to include only those where all scalar properties with string values meet a certain length constraint. In the C# approach, the predicate would be
logEvent => logEvent.Properties.Values
.OfType<ScalarValue>()
.Select(x => x.Value)
.OfType<string>()
.All(x => x.Length <= 128);
In a JSON approach, based on the docs,
I suspect there may be a hack with regular expressions like
Contains(#Properties[*], /^.{0,128}$/)
or maybe
Length(#Properties[*]) <= 128
but apparently neither of these works.
Any ideas how to check that all string properties are within the length limit?
The above filter expressions do not work because the Serilog filter expression compiler treats Properties specially: at some point in the internal filter expression compilation pipeline, Properties['somekey'] gets replaced with somekey.
This is logical, because Properties['somekey'] is in effect an access to a property named somekey. The wildcards ? and * are not exempt from this rule.
This explains why the examples in the question do compile internally to some kind of Func<FilterExpression, object>, but fail to produce the results I expected.

Algolia search results for partial string matches

I'm trying to do a pretty basic implementation of partial matching. For instance, I'd like 'ia hu' to return 'Ian Hunter'. I've got first and last name split, so we're indexing first, last and combined.
I was reading the suggestion here, but it just isn't a very elegant or feasible way to solve this: https://www.algolia.com/doc/faq/troubleshooting/how-can-i-make-queries-within-the-middle-of-a-word.
I don't think we should have to generate a ton of substring combos for first and last name to get this to return results.
Has anyone implemented a more elegant solution?
In this specific use case (matching "Ian Hunter" with "ia hu"), you can turn on prefix matching for all words with queryType=prefixAll (see the documentation).
This will not allow infix matching, so "an hu" or "ia un" will not match "Ian Hunter". This cannot therefore be considered a general solution to your question. However, in practice, prefix matching tends to be what people use instinctively; infix matching is relatively rare in my experience.

Extractor composition with dependent sub-patterns

Is it possible to define an extractor that is composed such that a sub-pattern depends on a previously matched sub-pattern?
Consider matching a date pattern, where valid "days" depends on the matched "month".
This is to avoid a guard to compare values bound by the sub-patterns, and also to avoid providing an overly-customized extractor.
Sample syntax:
case r"\d{4}-$month\d{2}-${day filter month.allows}\d{2}" => s"$month $day"
Perhaps you can formulate it under the aegis of this behavior:
https://issues.scala-lang.org/browse/SI-796
That is, before they fix it.

Why don't Scala collections have any human-readable methods like .append, .push, etc.?

Scala collections have a bunch of readable and almost-readable operators like :+ and +:, but why aren't there any human-readable synonyms like append?
Scala's mutable buffer collections mix in the BufferLike trait, which defines an append method.
Immutable collections do not have the BufferLike trait, and hence only define the other methods, which do not change the collection in place but generate a new one.
Symbolic method names allow the combination with the assignment operation =.
For instance, if you have a method ++ which creates a new collection, you can automatically use ++= to assign the new collection to some variable:
var array = Array(1,2,3)
array ++= Array(4,5,6)
// array is now Array(1,2,3,4,5,6)
This is not possible without symbolic method names.
In fact, they often have human-readable synonyms:
foldLeft is equivalent to /:
foldRight is equivalent to :\
The remaining ones are addition operators, which are quite human readable as they are:
++ is equivalent to Java's addAll
:+ is append
+: is prepend
The position of the colon indicates the receiver instance: the colon always sits on the side of the collection.
Finally, some weird operators are legacies of other functional programming languages, such as :: for list construction (from SML) and ! for actor messaging (from Erlang).
Is it any different than any other language?
Let's take Java. What's the human-readable version of +, -, * and / on int? Or, let's take String: what's the human-readable version of +? Note that concat is not the same thing -- it doesn't accept non-String parameters.
Perhaps you are bothered by it because in Java -- unlike, say, C++ -- things use either exclusively non-alphabetic operators or exclusively alphabetic method names, with the exception of String's +.
The Scala standard library does not set out to be Java friendly. Instead, adapters are provided to convert between Java and Scala collections.
Attempting to provide a Java friendly API would not only constrain the choice of identifiers (or mandate that aliases should be provided), but also limit the way that generics and function types were used. Substantially more testing would be required to validate the design.
On the same topic, I remember some debate as to whether the 2.8 collections should implement java.util.Iterable.
http://scala-programming-language.1934581.n4.nabble.com/How-to-set-the-scale-for-scala-BigDecimal-s-method-td1948885.html
http://www.scala-lang.org/node/2177

Lucene.Net Underscores causing token split

I've scripted a MS SQL Server database's tables, views and stored procedures into a directory structure that I am then indexing with Lucene.Net. Most of my table, view and procedure names contain underscores.
I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I receive hits for tIr and for InvoiceBtnWtn01, rather than for just tIr_InvoiceBtnWtn01.
I think the issue is the tokenizer is splitting on _ (underscore) since it is punctuation.
Is there a (simple) way to remove underscores from the punctuation list or is there another analyzer that I should be using for sql and programming languages?
Yes, the StandardAnalyzer splits on underscore. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to use different analyzers for each field - you might want to keep some of the standard analyzer's functionality for everything except table/column name.
WhitespaceAnalyzer only does whitespace splitting though. It won't lowercase your tokens, for example. So you might want to make your own analyzer which combines WhitespaceTokenizer and LowercaseFilter, or look into LowercaseTokenizer.
EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):
// Chains together the standard tokenizer, standard filter, and lowercase filter
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter;
    }
}
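If you instead go the WhitespaceTokenizer + LowerCaseFilter route suggested above, a rough Java sketch might look like this (this assumes an older Lucene 2.9/3.x-style API where Analyzer.tokenStream is overridden; the class name is made up):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Splits only on whitespace, so underscores survive, then lowercases each token
class WhitespaceLowercaseAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}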