How do I group list items in different languages?

I have an application that displays user data in a sorted list. The list has an index, which in English is the letters A-Z. Tapping a letter in the index jumps to the items starting with that letter. This works for English-like languages, but fails completely for languages that use different character sets (such as Chinese).
I can use ICU to collate the list of items to the correct order, but how can I find a correct set of indexes for other languages? Note that I don't know the entire list ahead of time, so generating the index from the data is not possible.
The indexes could be prepared for each supported language, but in that case, where would I find such lists?

The “index characters” information in CLDR exists for such purposes:
“The index characters are an ordered list of characters for use as a UI "index", that is, a list of clickable characters (or character sequences) that allow the user to see a segment of a larger "target" list.”
(http://www.unicode.org/reports/tr35/#Character_Elements)
I’m afraid such information isn’t in ICU yet, but if you need this for a few languages only, you could copy the data from
http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.indexCharacters.html
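For illustration, once you've copied an index list for a locale, you can bucket incoming items with an ICU collator. Here is a rough Python sketch using PyICU, with the Swedish index characters copied by hand from that chart:

import icu

# Swedish index characters: A-Z plus Å, Ä, Ö, in collation order
index_chars = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ') + ['Å', 'Ä', 'Ö']
collator = icu.Collator.createInstance(icu.Locale('sv'))

def bucket_for(item):
    # pick the last index character that does not sort after the item;
    # anything sorting before 'A' falls into the first bucket
    result = index_chars[0]
    for ch in index_chars:
        if collator.compare(ch, item) <= 0:
            result = ch
    return result

print(bucket_for('Älg'))  # Ä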

Partially matching a post code with Algolia

I've loaded a dataset into an Algolia search index. Each item in the index is a shop with a catchment area (the catchment area is just an array of UK Postcodes that a store covers). For example:
['DS4 6', 'DS4 7', 'DS5 8', 'DS6 9' ... ]
The search feature works up to a point. If people search for "DS4", Algolia returns several stores, but most people type their full postcode (for example, DS4 8XX), and that returns nothing, even though "DS4" is indexed several times.
Is there a configuration in Algolia to search for the first part of a word, even when a person has 'typed past it'?
To clarify a bit further: I could store every individual postcode in a catchment area, but there are millions and millions of them. A full UK postcode is something like "DS4 7EN"; the two extra characters on the end represent a street. I store only the first part of a postcode, e.g. "DS4 7", because it seems excessive to store everything when I only really care about the wider area, i.e. DS4, DS5, CV43, AB2 (and so on).
I could also probably use a places API and geocode the address. But I already have this catchment-area postcode data, so it seems a shame not to use it if I can.
Algolia, like most search engines, supports prefix search to allow search-as-you-type results; the InstantSearch libraries leverage this to update results live as the user types. Without prefix search, you would have to wait for the user to type an entire word before displaying any meaningful result.
In your case, since the catchment areas are what's indexed (e.g., DS4 6), when a user types DS4 6XX, no records match the query, because the query acts as a filter on the records based on their searchable attributes.
That said, I see two possible workarounds that you can implement.
The first solution is to use the removeWordsIfNoResults index setting and set it to lastWords. This removes the last word of the query when there are no results: for instance, with the query DS4 6XX, it removes 6XX to keep just DS4 and retrieves the items matching that query. Note that this solution relies on DS4 6XX being two words (separated by a space); it won't work with DS46XX.
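With the v2 Python API client, for instance, that looks something like this (the credentials and index name are placeholders):

from algoliasearch.search_client import SearchClient

client = SearchClient.create('YourAppID', 'YourAdminAPIKey')
index = client.init_index('shops')  # placeholder index name

# when a query returns nothing, drop trailing words until it matches,
# so "DS4 6XX" falls back to "DS4" and hits the catchment areas
index.set_settings({'removeWordsIfNoResults': 'lastWords'})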
The second solution is to change the structure of the records and add the full postcode to each item of the index. Since these are shops, I believe that should be possible. This way your users can search for both the full postcode DS4 6XX and the catchment area DS4 6. Unless I misunderstood your problem, I don't see the need to store the full list of postcodes associated with a catchment area.

Italicize a specified string inside of a field in FileMaker Pro

I administer a simple FileMaker Pro 12 database for a company. The current project we are working on requires us to italicize proper names. For example, if the database were a movie database, I would have the following caption:
Wendy,
Peter Pan
At the moment, all captions like these are stored in one field. I would normally use two fields to separate the proper name from the character name, but doing so at this point would be very time-consuming. I would like to make a script that italicizes the proper names in this field by looping through an array of proper names and, when a match is found, italicizing that name. Normally I could do this easily in another language with a foreach loop over a string array, but FileMaker's scripting language is foreign to me. Is there a simple solution someone can point me toward?
You could probably loop through the list of proper names (where is it, and in what form?) and set the field to a calculation using:
Substitute ( field ; searchString ; TextStyleAdd ( searchString ; italic ) )
where searchString is the current value of the inner loop. The outer loop is, of course, looping through all found records. Hard to be more specific with so few details.
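If you want to wrap that in a script, here is a rough sketch of the steps, assuming the proper names live in a return-delimited list in a field (the table and field names are made up):

Set Variable [ $names ; Value: YourTable::gProperNamesList ]
Go to Record/Request/Page [ First ]
Loop
  Set Variable [ $i ; Value: 1 ]
  Loop
    Exit Loop If [ $i > ValueCount ( $names ) ]
    # italicize the current name wherever it appears in the caption
    Set Field [ YourTable::Caption ; Substitute ( YourTable::Caption ; GetValue ( $names ; $i ) ; TextStyleAdd ( GetValue ( $names ; $i ) ; Italic ) ) ]
    Set Variable [ $i ; Value: $i + 1 ]
  End Loop
  Go to Record/Request/Page [ Next ; Exit after last ]
End Loop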
That said, IMHO it would take no more time and effort to fix the real problem here once and for all, i.e. to separate the two facts into two individual fields.
Note also that there is an assumption here that the proper names match only themselves; for example, in a caption like "Peter Pan, Peter Pan", the above method would italicize both occurrences, including the character name that should stay plain.

elasticsearch array field of keywords - how to index it

I've got input that is analogous to tags, where there are a couple of strings per record, and they should be thought of as keywords, not to be tokenized or broken up or analyzed in any particular way. I want it to show up in faceting "as-is", including spaces, slashes, dashes and ampersands.
I don't think I need multi_field here. There is one input value per record, "keyPhrases", but that value is a simple JSON array of strings.
I want elasticsearch to insert into the facets each of the values, and tag the record with all of the phrases.
Usually there are only one, two, or three phrases per record, but there could be more. The set of keyPhrases is fairly small: around 30, or at most 50. They could be thought of as "categories".
The faceting keeps breaking up the input strings and lowercasing them, even though I'm trying to specify not_analyzed, the keyword tokenizer, the keyword analyzer, and things like that.
I have other fields that keep their spacing and capitalization as I want in the returned facets; however, those fields are not_analyzed and also have store: true, and they have exactly one string input per record, as opposed to many.
I could just take the top 1 keyPhrase per record and flatten it, but ideally all the tags would work and be available as facets.
Any ideas on how to do this?
Well, this is embarrassing.
My strict mapping wasn't actually committed to the server at the time I was trying this.
(I was dropping the index and recreating it with each new mapping, and hadn't realized that this wasn't the final mapping, so my intended mapping was getting loaded and then dropped.)
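For anyone who hits the same wall, here is a minimal sketch of the recreate-and-map step with the Python client, using the pre-2.x string mapping syntax from the question (the index and type names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# drop and recreate the index so the mapping actually takes effect
es.indices.delete(index='records', ignore=404)
es.indices.create(index='records', body={
    'mappings': {
        'record': {
            'properties': {
                # an array of strings maps exactly like a single string;
                # not_analyzed keeps spaces, slashes, dashes, ampersands
                # and capitalization intact in facets
                'keyPhrases': {'type': 'string', 'index': 'not_analyzed'}
            }
        }
    }
})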

How to form an Endeca query where a field must start with certain letters

Is it possible to form an Endeca query that retrieves records where a field starts with certain letters? Say, get all users whose name begins with 'A'? I tried range filters, but they support only numeric fields, and also wildcard search, but nothing has worked well so far.
Creating a dimension is one way of approaching the problem, as Paul Lemke mentioned. Wildcard search is not an option because of the performance overhead and the irrelevant records it returns.
But we solved it using a couple of other alternatives.
Create a new property for the object called "StartWith", store the first letter of the object in it, and make it searchable. We found this easier than creating a dimension.
There is a problem: letters like 'A' are usually stop words in Endeca. There are a couple of workarounds for this.
Get the ASCII value of the first letter and store that numerical value in the property. One more advantage of this approach is that we can use range filters. But you can't cover "starts with 'AB'" kinds of requirements this way.
Prepend some characters, like ^^^My name, and search for ^^^M. The advantage of this approach is that you can cover conditions like "starts with 'AB'". A rough sketch of both workarounds follows.
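Here is a sketch of that record preparation in Python (the field names are made up for illustration):

def add_search_helpers(record):
    name = record['FirstName']
    # workaround 1: numeric first-letter property; dodges stop words
    # and also works with range filters
    record['FirstLetterOrd'] = ord(name[0].upper())
    # workaround 2: prefixed copy, so a search for ^^^AB matches names
    # starting with "AB" without stop-word trouble
    record['PrefixedName'] = '^^^' + name
    return record

print(add_search_helpers({'FirstName': 'Abigail'}))
# {'FirstName': 'Abigail', 'FirstLetterOrd': 65, 'PrefixedName': '^^^Abigail'}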
Endeca, at its current version (6.1), does not have a search filter that works like a "startswith" function in other programming languages.
I do have two options that might possibly get you close:
If you are truly only looking for the first letter, you can set up a dimension value for each letter of the alphabet (A, B, C, ...). You can then refine on each letter and see only the values that start with A, B, C, etc. The only downside is that you can only filter on the dimension values you set up: if you added "A", you couldn't filter anything that starts with "AB". You could go down the line and add "AB", "BA", "CA", and so on, but that would get unwieldy very fast.
If you want something closer to a "startswith" function the only other option is to use a wildcard search. Basically you would do a property search like this: N=0&Ntk=Username&Ntt=ab*
The trick with wildcard searching is that it matches across multiple words in that property. So assume you had a data set with these values:
Smithers Smith
Larry Smith
Jenna-Smith
Doing a search for sm* would actually return all 3 results, because "sm" appears in each last name. Even the one with the dash would be returned, since Endeca thinks that is a separate word. (It might be possible to turn that off, though; I'm not sure.)
So basically it comes down to this: stick one word in a property, set that property to allow wildcard search, then run a "blah*" search against that property, and you should have the results you're looking for.
Have you tried the First relevance ranking module, which is supposed to rank based on proximity to the beginning of the field?
It sounds similar to what you are looking for, and together with a wildcard it may produce your intended results.

Is it possible to perform a Sphinx search on one string attribute?

sql_query=SELECT id,headline,summary,body,tags,issues,published_at
FROM sphinx_search
I am working on the search feature of my web site, and I am using Sphinx, Perl, and Sphinx::Search. As long as I search in all the attributes and don't restrict the search to just one, everything goes well. However, when the user searches for a specific tag, I can't just return the result of a fuzzy search; I want to use the power of Sphinx to search only on tags or issues, and sometimes the user may want to search on headline and issues.
How can I perform such a task?
You need to put it in extended match mode:
https://metacpan.org/module/JJSCHUTZ/Sphinx-Search-0.27.2/lib/Sphinx/Search.pm#SetMatchMode
Then you can use the extended query syntax:
http://sphinxsearch.com/docs/current.html#extended-syntax
which includes the field search operator:
@tags keyword1
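Putting it together in Perl, a minimal sketch (the host and port are assumptions; Query defaults to searching all indexes):

use Sphinx::Search;

my $sph = Sphinx::Search->new();
$sph->SetServer('localhost', 9312);       # default searchd port
$sph->SetMatchMode(SPH_MATCH_EXTENDED2);  # enable extended query syntax
# search only the tags field; use @(tags,issues) to search both
my $results = $sph->Query('@tags keyword1');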
(Be careful with Sphinx: the word "attribute" has a specific meaning, namely values attached to the document, useful for sorting/grouping/filtering and for returning with the result set, whereas I think you are talking about fields. All the columns from the sql_query that you don't mark as attributes are fields, and are full-text searchable.)
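For example, in sphinx.conf you could mark the timestamp column as an attribute while the text columns stay full-text fields (a sketch; the source name and directive choice are assumptions based on the sql_query above):

source sphinx_search
{
    # ... sql connection settings ...
    sql_query = SELECT id, headline, summary, body, tags, issues, published_at \
        FROM sphinx_search
    # published_at becomes an attribute (sortable/filterable, not full-text
    # searched); headline, summary, body, tags and issues remain fields
    sql_attr_timestamp = published_at
}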