What's the correct interpretation of this byte string?

In a friend's music directory, I came across this path and filename:
Ministry/Κî•Î¦Î‘Î›Î—Îžî˜ (Psalm 69)/Ministry - Κî•Î¦Î‘Î›Î—Îžî˜ (Psalm 69) - 06 - Scarecrow.mp3
You can google Ministry Κî•Î¦Î‘Î›Î—Îžî˜ and get results. If I feed it into a url encoder, I get %C2%9Ai%C2%95i%C2%A6i%C2%91i%C2%9Bi%C2%97i%C2%9Ei%C2%98.
It's clearly mangled in some way by traversing multiple incorrect encode/decode cycles. What is it supposed to be? How did you get that answer?
I've tried various paper and pencil scribblings with UTF-8, but can't figure out anything that makes sense.

It is supposed to be ΚΕΦΑΛΗΞΘ, which is the title of the Ministry album commonly known as Psalm 69. The mangled string in your path is what the UTF-8 encoding of ΚΕΦΑΛΗΞΘ looks like when it is interpreted as Windows-1252.
That round trip is close, but not identical, to your Κî•Î¦Î‘Î›Î—Îžî˜, which has î in place of two of the Îs. Given the nature and position of those changes, my guess is that a TitleCase conversion also happened somewhere along the way.
I got there by way of an educated guess, testing, and Remy's helpful comment.
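For anyone who wants to check the round trip, here is a minimal Python sketch (the only assumption is the encoding names; cp1252 is Python's name for Windows-1252):
original = "ΚΕΦΑΛΗΞΘ"
# Encode the Greek title as UTF-8, then misread those bytes as Windows-1252:
mangled = original.encode("utf-8").decode("cp1252")
print(mangled)  # the Î-riddled string seen in the file name
# Reversing the mistake recovers the Greek:
print(mangled.encode("cp1252").decode("utf-8"))  # ΚΕΦΑΛΗΞΘ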

Related

Mozilla Deep Speech STT suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
    "deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav",
    shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the call, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short answer: the scorer matches the letter output from the audio to actual words. You shouldn't leave it out.
Long answer: if you leave out the scorer argument, you won't be able to detect real-world sentences, because the scorer is what maps the output of the acoustic model onto the words and word combinations present in its textual language model. Bear in mind that each scorer also ships with specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should still be able to take the scorer argument; otherwise update to 0.9.0, which has it as well. Perhaps your environment has changed in some way. I would start over in a new dir and venv.
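As a minimal sketch, restoring the scorer in the 0.8.2 command-line call from your snippet would look like this (subprocess and myfile as in your code; the scorer file name assumes you downloaded the matching 0.8.2 scorer):
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm',
     '--scorer', 'deepspeech-0.8.2-models.scorer',
     '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")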
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
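In context, that might look roughly like the sketch below (the file names are assumptions, ds is a deepspeech.Model, and the right lm_alpha/lm_beta values are published alongside each scorer release):
from deepspeech import Model

ds = Model("deepspeech-0.8.2-models.pbmm")                 # acoustic model
ds.enableExternalScorer("deepspeech-0.8.2-models.scorer")  # language model / scorer
# Optional: override with the scorer-specific values from the release notes.
# ds.setScorerAlphaBeta(lm_alpha, lm_beta)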

Grok filter for a time counter HH:MM

I'm quite new to ELK and Grok-filtering, and I'm struggling with parsing this particular pattern in my grok filter.
I've used the grok debugger to try and solve this, but although I like the tool, I just get confused by the custom patterns.
Eventually, I hope to parse lots of log files sent by filebeat to logstash, then send the parsed logs to elasticsearch and display with kibana or some similar visualization tool.
The lines that I need to parse follow this pattern:
1310 2017-01-01 16:48:54 [325:51] [326:49] [359:57] Some log info text
The first four digits are a log type identifier and will be used for grouping. I've called the field "LogLineID".
The date is formatted YYYY-MM-DD HH:MM:SS, and is parsed ok. I called the field "LogDate".
But now the problem begins. Within the square brackets, I have counters, formatted as MM:SS if you like. I cannot for the life of me find a way to sort these out, but I need to compare these times, hence I want to store them as minutes and seconds, not just numbers.
The first is a counter "TimeSpent",
the second is a counter "TimeStarted" and
the third is a counter "TimeSinceDown".
Then, last, comes the info text, which I've managed to grok with simply applying %{GREEDYDATA:LogInfo}.
I notice that the number of minutes can go far beyond the standard 60 minutes of an hour, so I may be barking up the wrong tree trying to parse these with date patterns such as TIMESTAMP_ISO8601, but then I don't really know how else to do this.
So, I came this far:
%{NUMBER:LogLineID} %{TIMESTAMP_ISO8601:LogDate}
and, as mentioned, was able (by cutting away the square-bracket parts) to parse the log info text with
%{GREEDYDATA:LogInfo}
to create a field LogInfo.
But that's where I'm stuck. Could someone please help me figure out the rest?
Massive thanks in advance.
PS! I also found %{NUMBER:duration}, but as far as I could tell it only handles values with a dot, not a colon.
A grok expression can help you solve this problem.
But first I want to make sure: do you mean that [325:51] [326:49] [359:57] are the three components you want to fetch, and that the result should look like:
TimeSpent: 325:51
TimeStarted: 326:49
TimeSinceDown: 359:57
If I've understood correctly, you can go one of two ways (both are sketched below):
define your own custom pattern file and add the pattern there, or
use the expression directly in the filter part of your logstash conf file.
Hope this helps.
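Here is a hedged sketch of both options; the pattern name COUNTER, the patterns directory, and the field names are mine, not from the question:
# ./patterns/counters (custom pattern file)
COUNTER %{NUMBER}:%{NUMBER}

# logstash .conf, filter section
filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{NUMBER:LogLineID} %{TIMESTAMP_ISO8601:LogDate} \[%{COUNTER:TimeSpent}\] \[%{COUNTER:TimeStarted}\] \[%{COUNTER:TimeSinceDown}\] %{GREEDYDATA:LogInfo}" }
  }
}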
Ah, there was a space. Actually, I was misleading myself and everybody else in my question, as it was not actually that log line that was causing problems. I just took the first one, not realizing where the problem really was. The line causing problems had a space within the brackets, like [ 42:31], and in some places there are even two spaces. The way I managed to solve this was to include a %{SPACE} between the \[ and the %{NUMBER}:
%{NUMBER:LogLineID} %{TIMESTAMP_ISO8601:LogDate} \[%{SPACE}%{NUMBER:TimeSpentMinutes}\:%{NUMBER:TimeSpentSeconds}\] \[%{SPACE}%{NUMBER:TimeStartedMinutes}\:%{NUMBER:TimeStartedSeconds}\] \[%{SPACE}%{NUMBER:TimeSinceDownMinutes}\:%{NUMBER:TimeSinceDownSeconds}\] %{GREEDYDATA:LogText}
I still haven't solved the merging of minutes and seconds, but this I can also handle in a later stage.
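For that later stage, one hedged possibility (assuming a Logstash 5+ ruby filter and the field names from the pattern above; the target field name is illustrative) would be to convert each counter to total seconds:
filter {
  ruby {
    code => "event.set('TimeSpentTotalSeconds', event.get('TimeSpentMinutes').to_i * 60 + event.get('TimeSpentSeconds').to_i)"
  }
}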
Thanks to Lin Don for showing an interest in my problem, and sorry for not replying sooner.
Hope the solution will help others (or even myself) if they're stuck on the same kind of problem.
Note to myself: Read the logs more carefully before grok'ing.. :)

Single barcode with Code128B and Code128C with iTextSharp

I wish to generate a barcode mixing Code128B and Code128C with the iTextSharp DLL. Do you know how to do that? I currently only know how to do it with a single code set.
For example, I wish to generate a barcode with the value 8L1 91450 883421 0550 001065
where "8L1 91450" is in code128B and "883421 0550 001065" is in code128C.
Thanks
Barcode128 will actually switch from B to C automatically if and when it can, but it sounds like you don't want this. For the control that you're looking for, you'll need to set your barcode's CodeType property to Barcode.CODE128_RAW and set the raw values manually.
There are a couple of posts out there that give the basic idea, but unfortunately they tend to assume too much knowledge of iText or too much knowledge of barcodes.
I'm not a barcode expert either, but the basic idea is to create a string that starts with Barcode128.START_B, then the first part of your text, then Barcode128.START_C, and then the second part. In raw mode, however, the text isn't ASCII. You can use this site to get the character codes for various ASCII values; basically, instead of sending the letter L you'd send (char)44.
Hopefully this gets you started at least.

what is the difference between doc.Content.Text and doc.Range(start, end).Text

Could you please explain what the difference is between doc.Content.Text and doc.Range(start, end).Text?
Actually, if I extract a string like
doc.Content.Text.SubString(start, lenofText)
and if I do the same with
doc.Range(start, start + lenofText)
I get the correct result with doc.Content.Text but an incorrect result with doc.Range ... do you know the reason? I need to find a piece of text and then convert it to a hyperlink, but doc.Range does not give me the correct results...
Your description is a little vague (for instance, how are the results not correct?), but a document is actually comprised of as many as 17 story parts (which include things like the main text story [the document area], footers, headers, footnotes, and comments). Content refers specifically to the main text story, while doc.Range is broader and can include more than one story. If the results are not correct because the text looks offset by a certain number of characters, the range may be counting other stories. If you want to limit the results to the body text, specify one of the following:
doc.Content
doc.StoryRanges(wdMainTextStory)
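To make the distinction concrete, here is a rough sketch that drives the same Word object model from Python via pywin32 rather than your .NET interop (the file path is illustrative, and wdMainTextStory is the WdStoryType value 1):
import win32com.client as win32

wdMainTextStory = 1  # WdStoryType enumeration value for the main text story
word = win32.Dispatch("Word.Application")
doc = word.Documents.Open(r"C:\temp\example.docx")  # illustrative path
# Content covers only the main text story:
print(len(doc.Content.Text))
# StoryRanges(wdMainTextStory) is another way to get just the body text:
print(len(doc.StoryRanges(wdMainTextStory).Text))
doc.Close(False)
word.Quit()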

Double-metaphone errors

I'm using Lawrence Philips' Double Metaphone algorithm with great success, but I have found the odd "unexpected result" for some combinations.
Does anyone else have additions or changes to the algorithm that they wouldn't mind sharing, or just combinations that they've found do not work as expected?
e.g. I had issues between:
Peashill and Bushley. (both match with PXL)
Rockliffe and Rockcliffe (RKLF and RKKL)
All Soundex, Metaphone and variant schemes are occasionally going to give results that aren't identical to what you expect. This is unavoidable - they can be regarded as more or less simple hash algorithms with special information preserving properties, and will sometimes produce collisions when you'd rather they didn't, and will sometimes produce differences when you'd rather they didn't.
One possible way of improving things is using 'synonym rings'. This basically produces lists of words that should be regarded as synonyms, independent of the spelling. I encountered them in the context of name matching (a sketch of the lookup follows the list below). For example, variants on Chaudri included:
CHAUDARY
CHAUDERI
CHAUDERY
CHAUDHARY
CHAUDHERI
CHAUDHERY
CHAUDHRI
CHAUDHRY
CHAUDHURI
CHAUDHURY
CHAUDHY
CHAUDREY
CHAUDRI
CHAUDRY
CHAUDURI
CHAWDHARY
CHAWDHRY
CHAWDHURY
CHDRY
CHODARY
CHODHARI
CHODHOURY
CHODHRY
CHODREY
CHODRY
CHODURY
CHOUDARI
CHOUDARY
CHOUDERY
CHOUDHARI
CHOUDHARY
CHOUDHERY
CHOUDHOURY
CHOUDHRI
CHOUDHRY
CHOUDHURI
CHOUDHURY
CHOUDREY
CHOUDRI
CHOUDRY
CHOUDURY
CHOUWDHRY
CHOWDARI
CHOWDARY
CHOWDHARY
CHOWDHERY
CHOWDHRI
CHOWDHRY
CHOWDHURI
CHOWDHURRYY
CHOWDHURY
CHOWDORY
CHOWDRAY
CHOWDREY
CHOWDRI
CHOWDRURY
CHOWDRY
CHOWDURI
CHOWDURY
CHUDARY
CHUDHRY
CHUDORY
COWDHURY
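As a rough illustration of how a synonym ring can sit in front of a phonetic match, here is a Python sketch; the ring contents are abbreviated and the names of the structures and function are mine:
# Each ring is a set of spellings that should be treated as the same name.
SYNONYM_RINGS = [
    {"CHAUDRI", "CHAUDHRY", "CHOWDHURY", "CHOUDHARY"},  # abbreviated ring for illustration
]

# Build a lookup from every variant to one stable representative per ring.
CANONICAL = {}
for ring in SYNONYM_RINGS:
    representative = min(ring)
    for variant in ring:
        CANONICAL[variant] = representative

def normalize(name):
    """Map a name onto its ring's representative before (or instead of) phonetic hashing."""
    key = name.strip().upper()
    return CANONICAL.get(key, key)

# Two spellings in the same ring compare equal regardless of what Double Metaphone says.
print(normalize("Chowdhury") == normalize("Chaudhry"))  # True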
Regular Metaphone returns a difference between Peashill and Bushley:
Peashill PXL
Bushley BXL