Solr search error when dealing with Arabic string - unicode

I've been struggling with Solr search for Arabic for several days and have run some experiments. Here is a simple illustration of the problem.
After I store an Arabic sentence (for now just the single word السوري) in the database and have Solr index it, then query it with q=*:*&wt=python (without the wt part the response is garbled characters), the response is:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
The actual word I stored for indexing is encoded another way:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
As you can tell, there is a one-to-one correspondence \xd8↔\u00d8, but I don't know what this encoding is called, so I cannot convert it. And when I do the search as <>/select/?q=السوري&wt=python, the response is:
{'responseHeader':{'status':0,'QTime':0,'params':{'wt':'python','q':u'\u0627\u0644\u0633\u0648\u0631\u064a'}},'response':{'numFound':0,'start':0,'docs':[]}}
No docs are found, and it seems to use a third encoding, u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take it and call encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my Python code or in the MySQL database, it appears as form1:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
When it is indexed by Solr, it turns into form2:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
And when I query from the browser (Google Chrome) with <>/select/?q=السوري&wt=python, it becomes form3:
'\u0627\u0644\u0633\u0648\u0631\u064a'
(which can be converted back to form1 with encode('utf8')). But since the forms are different, the search matches nothing.
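For reference, the correspondence between the forms looks like ordinary mojibake: form1 is just form3 encoded as UTF-8, and form2 is what you get if those UTF-8 bytes are decoded as Windows-1252 (cp1252), presumably somewhere on the way into the index. A quick Python 2 check (reusing the literals above; where the mis-decoding happens is only an assumption):
# -*- coding: utf-8 -*-
form1 = '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'   # UTF-8 bytes of السوري
form2 = u'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
form3 = u'\u0627\u0644\u0633\u0648\u0631\u064a'
print form1.decode('utf-8') == form3    # True: form1 is form3 encoded as UTF-8
print form1.decode('cp1252') == form2   # True: form2 is form1 mis-decoded as cp1252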
Therefore, these three different encodings may be the core of the problem. Could anyone help me figure it out and solve the search problem?
Thanks in advance.

Related

Mozilla DeepSpeech STT suddenly can't spell

I am using DeepSpeech for speech-to-text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
    "deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav",
    shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are rife with misspellings that make me think I am now getting a character-level model where I used to get a word-level model. The errors are in a direction that suggests the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't reliably get real-world sentences, because the scorer is what matches the output of the acoustic model to words and word combinations present in the textual language model it contains. And bear in mind that each scorer comes with specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should still be able to take the scorer argument; otherwise update to 0.9.0, which has it as well. Maybe your environment has changed in some way; I would start over in a new directory and virtualenv.
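If you stay with the subprocess/CLI approach from your question, the simplest fix is to keep passing the scorer on the command line (the file names are assumptions based on your 0.8.1 call):
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm',
     '--scorer', 'deepspeech-0.8.2-models.scorer',
     '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")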
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
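For a fuller picture, here is a minimal sketch using the DeepSpeech Python package with an external scorer (file names are placeholders; the alpha/beta override is optional since the scorer ships with defaults):
import wave
import numpy as np
from deepspeech import Model

ds = Model('deepspeech-0.8.2-models.pbmm')
ds.enableExternalScorer('deepspeech-0.8.2-models.scorer')
# ds.setScorerAlphaBeta(lm_alpha, lm_beta)  # only if you have tuned values

with wave.open('audio/2830-3980-0043.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))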

Query string parsing as number when it should be a string

I am trying to send a search input to a REST service. In some cases the form input is a long string of numbers (example: 1234567890000000000123456789). I am getting a 500 error, and it looks like something is trying to convert the string to a number. The data type in the source database is a string.
Is there something that can be done in building the query string that will force the input to be interpreted as a string?
The service is an implementation of ArcGIS server.
More information on this issue, per request.
To test, I have been using a client form provided with the service installation.
I have attempted to add single and double quotes, plus wildcard characters, in the form entry. The form submission does not error, but no results are found. If I shorten the number ("1234"), or add some alphanumeric characters ("1234A"), the form submission does not error.
The problem surfaced after a recent upgrade to 10.1. I have looked for information that would tie this to a known problem, but have not found anything yet.
In terms of forcing the input to be interpreted as a string, you enclose the input in single quotes (e.g., '1234567890000000000123456789'). Though if you are querying a field of type string, then you need to enclose all search strings in single quotes, and in that case none of your unquoted queries should be working. So it's a little hard to tell from the information you've provided exactly what you are doing and what might be going wrong. Can you provide more detail and/or code?
Are you formatting a where clause that you are using in a Query object via one of Esri's client-side APIs (such as the JavaScript API)? In that case, for fields of data type string you definitely need to enclose the search text in single quotes. For example, if the field you are querying were called 'FIELD', this is how you'd format the where clause:
FIELD = '1234'
or
FIELD Like '1234%'
for a wildcard search. If you are trying to enter query criteria directly into the Query form of a published ArcGIS Server service/layer, then there too you need to enclose the search in single quotes, as in the above examples.
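If you are hitting the service's REST endpoint yourself, the same rule applies to the where parameter. A small sketch in Python (the URL, layer id, and field name here are placeholders, not taken from the question):
import requests

url = "https://myserver/arcgis/rest/services/MyService/MapServer/0/query"
params = {
    "where": "FIELD = '1234567890000000000123456789'",  # single quotes force a string comparison
    "outFields": "*",
    "f": "json",
}
resp = requests.get(url, params=params)
resp.raise_for_status()
print(resp.json().get("features", []))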
According to an Esri help technician, this is a known bug.

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have encountered some problems with it.
The Whoosh query parser returns NullQuery for words like "C#" and "C++" that contain meta-characters, and also for some other short words. These words are used in the title and body of some documents, so I am not using the keyword type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:
'\w+[#+.\w]*'
This makes tokenizing of the fields succeed, and searching goes well too.
But when I use queries like "some query++*" or "some##*", the parsed query becomes a single Every query, just '*'. I also found that this is not related to my analyzer; it is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
Now I am using the following code:
from whoosh.util import rcompile
# for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
This solves my first problem, yes. But the main problem is in searching: I don't want to let users search using the Every query or *, yet when I parse queries like C++* I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
    name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)
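Putting both suggestions together (the custom token pattern from the question plus minsize=1), here is a small end-to-end sketch; the index directory and document text are made up for illustration:
import os
from whoosh import analysis, fields, index, qparser

# keep '#', '+' and '.' inside tokens, and allow single-character terms
analyzer = analysis.StandardAnalyzer(expression=r'\w+[#+.\w]*', minsize=1)
schema = fields.Schema(title=fields.TEXT(stored=True, analyzer=analyzer))

os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"Learning C# and C++")
writer.commit()

with ix.searcher() as searcher:
    parser = qparser.QueryParser("title", ix.schema)
    results = searcher.search(parser.parse(u"C#"))
    print([hit["title"] for hit in results])   # the document above should be found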

TermQuery not returning on a known search term, but WildcardQuery does

Am hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)
I'll skip most of the surrounding irrelevant code and cut right to the chase. I have a Lucene index, to which I am adding the following field (variables replaced by their literal values):
document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7",
Field.Store.YES , Field.Index.UN_TOKENIZED));
Later, when I search my index (using other types of queries), I am able to verify that this field does indeed appear in my index - like when looping through all Fields returned by Document.GetFields()
Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7
So far so good :-)
Now the real problem is: why can I not use a TermQuery to search against this value and actually get a result?
This code produces 0 hits:
// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.
// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
I've checked field lengths, I've checked that I am using the same Analyzer and so on and I am still on square 1 as to why this is.
Can anyone point me in a direction I should be looking?
I finally figured out what was going on. I'm expanding the tags for this question as it, much to my surprise, actually turned out to be an issue with the CMS this particular problem exists in. In summary, the problem came down to this:
The field is stored UN_TOKENIZED, meaning Lucene will store it exactly "as-is"
The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
The behaviour I expected from this was, that my query (having already been prepared) would go - unaltered - to the Lucene API
Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception - all the Term arguments are passed through a LowercaseStrategy()
As I indexed an UPPERCASE term (UN_TOKENIZED) and Sitecore changes my PreparedQuery to lowercase, 0 results are returned
I'm not going to start an argument about whether this is "by design" or a design flaw in the implementation of the Lucene wrapper API - I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)
A further lesson from this: storing the field as TOKENIZED will eliminate this problem too, as the StandardAnalyzer by default lowercases all tokens.

Single barcode with Code128B and Code128C with iTextSharp

I want to generate a barcode mixing Code128B and Code128C with the iTextSharp DLL. Do you know how to do that? I currently only know how to do it with a single code set.
For example, I want to generate a barcode with the value 8L1 91450 883421 0550 001065,
where "8L1 91450" is in Code128B and "883421 0550 001065" is in Code128C.
Thanks
Barcode128 will actually switch from B to C automatically if and when it can, but it sounds like you don't want this. For the control that you're looking for, you'll need to set your barcode's CodeType property to Barcode.CODE128_RAW and manually set the raw values.
There are a couple of posts out there that give the basic idea, but unfortunately they tend to assume too much knowledge of iText or too much knowledge of barcodes.
I'm not a barcode expert either, but the basic idea is to create a string that starts with Barcode128.START_B, then the first part of your text, then Barcode128.START_C and then the second part. In raw mode, however, the text isn't ASCII. You can use this site to look up the Code 128 character values for various ASCII characters; basically, instead of sending the letter L you'd send (char)44.
Hopefully this gets you started at least.