Lucene.Net Underscores causing token split

Lucene.Net Underscores causing token split - tsql

I've scripted a MsSqlServer databases tables,views and stored procedures into a directory structure that I am then indexing with Lucene.net. Most of my table, view and procedure names contain underscores.
I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I recieve hits back for tIr and for InvoiceBtnWtn01, rather than for just tIr_InvoiceBtnWtn01.
I think the issue is the tokenizer is splitting on _ (underscore) since it is punctuation.
Is there a (simple) way to remove underscores from the punctuation list or is there another analyzer that I should be using for sql and programming languages?

Yes, the StandardAnalyzer splits on underscore. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to use different analyzers for each field - you might want to keep some of the standard analyzer's functionality for everything except table/column name.
WhitespaceAnalyzer only does whitespace splitting though. It won't lowercase your tokens, for example. So you might want to make your own analyzer which combines WhitespaceTokenizer and LowercaseFilter, or look into LowercaseTokenizer.
EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):
// Chains together standard tokenizer, standard filter, and lowercase filter
class MyAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
StandardFilter standardFilter = new StandardFilter(baseTokenizer);
LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
return lcFilter;
}
}

Related

How to use MDriven OclPs to find all objects matching a list of strings?

In an application I receive a list of strings. Then I want to use OclPs to find all objects where a specific attribute equals any of the strings in the list. E.g. if we have Person objects and receive a list of last names, find all persons whose last name appears in the list.
Although this can surely be done in MDriven's in-memory OCL engine, I can't seem to achieve this in the more limited OclPs (which translates the OCL to SQL and evaluates it as such in the database).
Attempt 1: First assign the list of names to vNames (collection of strings), then:
Person.allInstances->select(p|vNames->exists(n|n = p.LastName))
This gives error "Loop variables can only have class type, not System.String".
Attempt 2: First assign a "|" separated string of the sought names, including leading and trailing "|", to vNames, then:
Person.allInstances->select(p|vNames.SqlLike('%|' + p.LastName + '|%'))
This gives error saying strings cant be added in Firebird. But Firebird does support string concatenation using the || operator.
Trying with .Contains(...) instead of .SqlLike(...) says it's not supported in OclPs. Besides, it would find persons with a last name that is CONTAINED in any of the sought names, i.e. an incorrect search.
I'm out of ideas...

In this case when you have a long list of strings and you want a list of objects I think you best option is to use SQL with sqlpassthroughobjects:
https://wiki.mdriven.net/index.php/OCLOperators_sqlpassthroughobjects
Person.sqlpassthroughobjects('select personid from person where lastname in ('+vNames->collect(n|'\''+n+'\'')->asCommaList+')')

oclPS only implements a quite small subset of OCL because its converting the OCL to sql.
For example, collections of "non-objects" can't be used.
Person.allInstances->select(p|vNames.SqlLike('%|' + p.LastName + '|%'))
Instead, look at sqlpassthroughobjects here
https://wiki.mdriven.net/index.php/OCLOperators_sqlpassthroughobjects
You could also insert the names as objects of a class into the database and then use oclPS.

I suggest that you use a genuine OCL tool. You seem to be demonstrating that MDriven OclPs is not OCL.

Cool class and method names wrapped in ``: class `This is a cool class` {}?

I just found some scala code which has a strange class name:
class `This is a cool class` {}
and method name:
def `cool method` = {}
We can use a sentence for a class or method name!
It's very cool and useful for unit-testing:
class UserTest {
def `user can be saved to db` {
// testing
}
}
But why we can do this? How to understand it?

This feature exists for the sake of interoperability. If Scala has a reserved word (with, for example), then you can still refer to code from other languages which use it as a method or variable or whatever, by using backticks.
Since there was no reason to forbid nearly arbitrary strings, you can use nearly arbitrary strings.

As #Rex Kerr answered, this feature is for interoperablility. For example,
To call a java method,
Thread.yield()
you need to write
Thread.`yield`()
since yield is a keyword in scala.

The Scala Language Specification:
There are three ways to form an identifier. First, an identifier can
start with a letter which can be followed by an arbitrary sequence of
letters and digits. This may be followed by underscore ‘_’ characters
and another string composed of either letters and digits or of
operator characters. Second, an identifier can start with an operator
character followed by an arbitrary sequence of operator characters.
The preceding two forms are called plain identifiers. Finally, an
identifier may also be formed by an arbitrary string between
back-quotes (host systems may impose some restrictions on which
strings are legal for identifiers). The identifier then is composed of
all characters excluding the backquotes themselves.

Strings wrapped in ` are valid identifiers in Scala, not only to class names and methods but to functions and variables, too.

To me it is just that the parser and the compiler were built in a way that enables that, so the Scala team implemented it.
I think that it can be cool for a coder to be able to give real names to functions instead of getThisIncredibleItem or get_this_other_item.
Thanks for your questions which learnt me something new in Scala!

Is string concatenation in scala as costly as it is in Java?

In Java, it's a common best practice to do string concatenation with StringBuilder due to the poor performance of appending strings using the + operator. Is the same practice recommended for Scala or has the language improved on how java performs its string concatenation?

Scala uses Java strings (java.lang.String), so its string concatenation is the same as Java's: the same thing is taking place in both. (The runtime is the same, after all.) There is a special StringBuilder class in Scala, that "provides an API compatible with java.lang.StringBuilder"; see http://www.scala-lang.org/api/2.7.5/scala/StringBuilder.html.
But in terms of "best practices", I think most people would generally consider it better to write simple, clear code than maximally efficient code, except when there's an actual performance problem or a good reason to expect one. The + operator doesn't really have "poor performance", it's just that s += "foo" is equivalent to s = s + "foo" (i.e. it creates a new String object), which means that, if you're doing a lot of concatenations to (what looks like) "a single string", you can avoid creating unnecessary objects — and repeatedly recopying earlier portions from one string to another — by using a StringBuilder instead of a String. Usually the difference is not important. (Of course, "simple, clear code" is slightly contradictory: using += is simpler, using StringBuilder is clearer. But still, the decision should usually be based on code-writing considerations rather than minor performance considerations.)

Scalas String concatenation works the same way as Javas does.
val x = 5
"a"+"b"+x+"c"
is translated to
new StringBuilder()).append("ab").append(BoxesRunTime.boxToInteger(x)).append("c").toString()
StringBuilder is scala.collection.mutable.StringBuilder. That's the reason why the value appended to the StringBuilder is boxed by the compiler.
You can check the behavior by decompile the bytecode with javap.

I want to add: if you have a sequence of strings, then there is already a method to create a new string out of them (all items, concatenated). It's called mkString.
Example: (http://ideone.com/QJhkAG)
val example = Seq("11111", "2222", "333", "444444")
val result = example.mkString
println(result) // prints "111112222333444444"

Scala uses java.lang.String as the type for strings, so it is subject to the same characteristics.

Scala multiline string placeholder

This question is related to ( Why is there no string interpolation in Scala? ), but deals more specifically with multi-line strings.
I've just about bought into Martin's suggestion for simple string placeholder where
msg = "Hello {name}!"
can be be represented without much difference in Scala today like this:
msg = "Hello"+name+"!"
However, I don't think that approach holds with multi-line strings. And, in some cases it may be encouraging other poor practices in favor of readability. Note that in the Scala Play ANORM database mapping how the framework tries to preserve readability in plain SQL (using placeholders), but at the expense of duplicating the {countryCode} variable name and in a non-type-safe way, see...
.on("countryCode" -> "FRA")
SQL(
"""
select * from Country c
join CountryLanguage l on l.CountryCode = c.Code
where c.code = {countryCode};
"""
).on("countryCode" -> "FRA")
Additionally, assuming no change in Scala to address this, what would be the implication of using inline XML? How would performance, memory, etc. with something like:
val countryCode = "FRA"
SQL(<c>
select * from Country c
join CountryLanguage l on l.CountryCode = c.Code
where c.code = {countryCode};
</c>.text)

A scala.xml.Elem would be constructed which had the string contents represented as an ArrayBuffer, chopped up for every { } substitution. I'm certainly no authority but I believe what would happen is that there's a little extra overhead in construction the object and then getting the children and concatenating them together at runtime but, at least in this example, as soon as it's passed to the SQL function which then extracts the string it wants (or perhaps this would be done with an implicit) the Elem object would be discarded so there'd be a little extra memory usage, but only briefly.
But in the bigger picture, I don't think it's performance that would hinder the adoption of this solution but I guess a lot of people would be uncomfortable abusing XML in this way by using a made-up tag. The problem would be with other users reading the code later trying to figure out the semantic meaning of the tag... only to find there isn't one.

The example you give is almost certainly not doing string concatenation, it's creating parameterized SQL statements (probably via JDBC's PreparedStatement).
Ironically, the lack of easy string concatenation is probably slightly encouraging best practices in this case (although I certainly wouldn't use that as an argument either way on the topic).

If you are coming to this question from the future, multi-line string interpolation is now a thing.
val when = "now"
println(s"""this is $when a thing.""")
// this is now a thing

Can anyone explain the below code please

`where client.name.ToLower().Contains(name.ToLower())

Now it's clearer. It's a (badly done) case insensitive search for the name in the client.name. True if name is contained in client.name. Badly done because using international letters (clearly "international letters" don't exist. I mean letters from a culture different from you own. The classical example is the Turkish culture. Read this: http://www.i18nguy.com/unicode/turkish-i18n.html , the part titled Turkish Has An Important Difference), you can break it. The "right" way to do it is: client.name.IndexOf(name, StringComparison.CurrentCultureIgnoreCase) != -1. Instead of StringComparison.CurrentCultureIgnoreCase you can use StringComparison.InvariantCultureIgnoreCase. If you have to use tricks like the ToLower, it has been suggested that it's better to ToUpper both sides of the comparison (but it's MUCH better to use StringComparison.*)

Looks like LINQ to me.
I'm not really up-to-date on .NET these days, but I'd read that as looking for client objects whose name property is a case-insensitive match with the ToString property of the client variable, while allowing additional characters before or after, much like WHERE foo is like '%:some_value%' in SQL. If I'm right, btw, client is a terrible variable name in this instance.

This is a strange piece of code. It would be good to know a bit more about the client object. Essentially it is checking if the case insensitive name value on the client object contains the case insensitive value of the client object (as a string). So if the client name contains the string name of the class itself essentially.

.ToLower() returns the same string you call it on in all lowercase letters. Basically, this statement returns true if name.ToLower() is embedded anywhere within client.name.ToLower().
//If:<br/>
client.name = "nick, bob, jason";
name = "nick";
//Then:<br/>
client.name.ToLower().Contains(name.ToLower());
//would return true