Grok filter for a time counter HH:MM - date

I'm quite new to ELK and Grok-filtering, and I'm struggling with parsing this particular pattern in my grok filter.
I've used the grok debugger to try and solve this, but although I like the tool, I just get confused by the custom patterns.
Eventually, I hope to parse lots of log files sent by Filebeat to Logstash, then send the parsed logs to Elasticsearch and display them with Kibana or some similar visualization tool.
The lines that I need to parse follow this pattern:
1310 2017-01-01 16:48:54 [325:51] [326:49] [359:57] Some log info text
The first four digits are a log type identifier, and will be used for grouping. I've called the field "LogLineID".
The date is formatted YYYY-MM-DD HH:MM:SS, and is parsed ok. I called the field "LogDate".
But now the problem begins. Within the square brackets, I have counters, formatted as MM:SS if you like. I cannot for the life of me find a way to sort these out, but I need to compare these times, hence I want to store them as minutes and seconds, not just numbers.
The first is a counter "TimeSpent",
the second is a counter "TimeStarted" and
the third is a counter "TimeSinceDown".
Then, last, comes the info text, which I've managed to grok with simply applying %{GREEDYDATA:LogInfo}.
I notice that the number of minutes can be far higher than the usual 60 minutes in an hour, so I may be barking up the wrong tree trying to parse it with date patterns such as TIMESTAMP_ISO8601, but then, I don't really know how else to do this.
So, I came this far:
%{NUMBER:LogLineID} %{TIMESTAMP_ISO8601:LogDate}
and was, as mentioned, able (by cutting away the square-bracket parts) to parse the log info text with
%{GREEDYDATA:LogInfo}
to create a field LogInfo.
But that's where I'm stuck. Could someone please help me figure out the rest?
Massive thanks in advance.
PS: I also found %{NUMBER:duration}, but as far as I could tell it only parses values with a dot, not a colon.

A grok regex expression can solve this problem.
But first I want to make sure: are [325:51] [326:49] [359:57] the three components that you want to extract, so that the result looks like this?
TimeSpent: 325:51
TimeStarted: 326:49
TimeSinceDown: 359:57
If I have understood correctly, you can use either of the following approaches:
Define your own custom pattern file and add the pattern to it.
Use the expression directly in the filter section of your Logstash conf file.
Hope this helps.

Ah, there was a space. Actually, I was misleading myself and everybody else in my question, as it was not that log line that was causing problems. I just took the first one, not realizing where the problem really was. The line causing problems had a space within the brackets, like this: [ 42:31]. Some lines even have two spaces, so the way I managed to solve this was to include a %{SPACE} between the \[ and the %{NUMBER}:
%{NUMBER:LogLineID} %{TIMESTAMP_ISO8601:LogDate} \[%{SPACE}%{NUMBER:TimeSpentMinutes}\:%{NUMBER:TimeSpentSeconds}\] \[%{SPACE}%{NUMBER:TimeStartedMinutes}\:%{NUMBER:TimeStartedSeconds}\] \[%{SPACE}%{NUMBER:TimeSinceDownMinutes}\:%{NUMBER:TimeSinceDownSeconds}\] %{GREEDYDATA:LogText}
I still haven't solved the merging of minutes and seconds, but this I can also handle in a later stage.
Thanks to Lin Don for showing an interest in my problem, and sorry for not replying sooner.
Hope the solution will help others (or even myself) if they're stuck on the same kind of problem.
Note to myself: Read the logs more carefully before grok'ing.. :)
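For the remaining step of merging minutes and seconds, here is a minimal sketch in Python, run outside Logstash and not a Logstash configuration. It assumes a regular expression that roughly mirrors the grok pattern above, with \s* standing in for %{SPACE}, and it stores each counter as a total number of seconds so the values can be compared directly. The names parse_line, counter and the *TotalSeconds fields are hypothetical.

```python
import re

# Counter names taken from the question; everything else is illustrative.
COUNTER_NAMES = ("TimeSpent", "TimeStarted", "TimeSinceDown")


def counter(name):
    """Regex for one bracketed counter such as [325:51] or [ 42:31]."""
    # \s* tolerates the leading spaces that %{SPACE} handles in grok.
    return rf"\[\s*(?P<{name}Minutes>\d+):(?P<{name}Seconds>\d+)\]"


LINE_RE = re.compile(
    r"^(?P<LogLineID>\d+) "
    r"(?P<LogDate>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    + " ".join(counter(name) for name in COUNTER_NAMES)
    + r" (?P<LogText>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in COUNTER_NAMES:
        minutes = int(fields.pop(name + "Minutes"))
        seconds = int(fields.pop(name + "Seconds"))
        # Minutes may exceed 59, so total seconds is the safest unit to compare.
        fields[name + "TotalSeconds"] = minutes * 60 + seconds
    return fields


if __name__ == "__main__":
    sample = "1310 2017-01-01 16:48:54 [325:51] [326:49] [359:57] Some log info text"
    print(parse_line(sample))
```

Storing total seconds avoids date parsing altogether, which matters here because the minute part can be far larger than 59.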

Related

Mozilla DeepSpeech STT suddenly can't spell

I am using DeepSpeech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't get real-world sentences, because it is the scorer that matches the output of the acoustic model to the words and word combinations present in the textual language model it contains. Also bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise, update to 0.9.0, which has it as well. Maybe your environment has changed somehow. I would start over in a new directory and a fresh venv.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
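If you stay with the command-line call from the question, a minimal sketch of passing the scorer in list form could look like this. It reuses only the --model, --scorer and --audio flags that appear in the question. The helper names build_deepspeech_command and transcribe are hypothetical, and the 0.8.2 scorer file name in the usage line is an assumption based on the 0.8.1 naming.

```python
import subprocess


def build_deepspeech_command(model_path, audio_path, scorer_path=None):
    """Build the argument list for the deepspeech CLI (hypothetical helper)."""
    cmd = ["deepspeech", "--model", model_path]
    if scorer_path is not None:
        # Without a scorer, the output is the raw acoustic-model spelling.
        cmd += ["--scorer", scorer_path]
    cmd += ["--audio", audio_path]
    return cmd


def transcribe(model_path, audio_path, scorer_path=None):
    """Run the CLI and return the transcription as a string."""
    cmd = build_deepspeech_command(model_path, audio_path, scorer_path)
    byte_encoding = subprocess.check_output(cmd)
    return byte_encoding.decode("utf-8").rstrip("\n")


if __name__ == "__main__":
    # File names below are assumptions; adjust them to your download.
    print(build_deepspeech_command(
        "deepspeech-0.8.2-models.pbmm",
        "audio/2830-3980-0043.wav",
        scorer_path="deepspeech-0.8.2-models.scorer",
    ))
```

Passing the arguments as a list, as in the second call in the question, avoids shell=True and the quoting problems that come with it.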

LibreOffice: italics breaking off (messes with replace function)

I have been trying to use the automatic replace function to place entries starting with italics on their own lines, so for example
lähteä60*F【动】
1. 离开, 出发, 走掉líkāi, chūfā, zǒu diào(poistua). Vieraat lähtevät.客人走了kèrén zǒule.Juna lähtee raiteelta kaksi. 火车两点离站huǒchē liǎng diǎn lí zhàn.
would turn into
1. 离开, 出发, 走掉líkāi, chūfā, zǒu diào(poistua).
Vieraat lähtevät.客人走了kèrén zǒule.
Juna lähtee raiteelta kaksi. 火车两点离站huǒchē liǎng diǎn lí zhàn.
However, for some reason unknown to me, the replace function (cf. screenshot) breaks up the lines like this:
lähteä60*F【动】
1. 离开, 出发, 走掉líkāi, chūfā, zǒu diào(poistua).
Vieraat lähtev
ä
t.客人走了kèrén zǒule.
Juna lähtee raiteelta kaksi. 火车两点离站huǒchē liǎng diǎn lí zhàn.
(so the first line "Vieraat lähtevät.客人走了kèrén zǒule." is broken up.)
As far as I can tell, it should all be italics, so I have no idea why it breaks up, and I have no way to examine what the problem is. Saving to different formats doesn't seem to help either. There are thousands of pages of this stuff, so the automatic function is really required.
A small sample file of the stuff can be downloaded from here:
http://shakki.info/test.docx.

How to sscanf this string?: "+CPMS: \"ME\",18,255,\"ME\",18,255,\"ME\",18,255"

So, I'm developing an application in C and I need to sscanf a string.
+CPMS: \"ME\",18,255,\"ME\",18,255,\"ME\",18,255
I need to get the number between the first and second commas, 18 in this example, but it can be from 0 to 255.
I'm trying to create the placeholder to get this but I can't seem to make it work.
I've tried lots of things, but I can't understand why:
sscanf(pointer, "+CPMS: \"%*s\",%d", &intPointer);
doesn't work.
Can anyone help me?
Thank you.
Well, I'm going to answer my own question.
sscanf(pointer, "+CPMS: \"%*2s\",%d", &intPointer);
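It looks like I needed to specify the number of characters to skip. Without a width, %*s consumes everything up to the next whitespace, so it swallowed the closing quote and the rest of the string, and the literal \" after it could never match.
Hope it helps someone else.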

Negation of osm class or type

If you search for an airport (aeroway=aerodrome) around Brescia, Italy, you will also receive a hit for a military airfield, which happens to be tagged as an aerodrome as well (it's tagged: aeroway=aerodrome, landuse=military, military=airfield). To avoid this I want to search for aeroway=aerodrome but exclude [military]. I've tried [! military] and [military~"^$"]. Any suggestions?
This particular case may be rare, I realize, but the concept of negating multi-classed elements is useful, and multi-classed elements are not a rare occurrence. In general, the classes seem to be complementary, not conflicting, so it's not an issue. I also realize that I can weed out conflicting hits with some back-end processing. I just wasn't expecting a military airfield to appear alongside a commercial aerodrome.
In any case, here is a shortened version of my query. I include node, way and relation in the full query:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][! military]; ... ) out etc
or:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][military~%22^$%22]; ... ) out etc
If you try to run it, you'll need to include way and relation.
Also, as you can see I don't exactly ask for aeroway=aerodrome. I include terminal and variations on heliport. My experience has been that some aerodromes are tagged only as "terminal", so if you're looking for an airport, asking for "aerodrome" isn't enough.
The correct syntax for negation is as follows:
[military !~ ".*"]
Please see the documentation on the OSM wiki for details.
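As a rough sketch of how the negated filter fits into the full request, the following Python builds the query for node, way and relation and URL-encodes it. The endpoint, bounding box and aeroway filter come from the question. The function names and the plain "out;" statement are assumptions, since the question elides its own output statement.

```python
from urllib.parse import urlencode

OVERPASS_URL = "http://overpass-api.de/api/interpreter"
BBOX = "45.400861,9.868469,45.641408,10.542755"  # bounding box from the question


def build_query(bbox=BBOX):
    """Overpass QL matching aeroway features that have no military tag."""
    # [military !~ ".*"] only matches elements without a military key.
    tag_filter = '[aeroway~"aero|term|heli"][military !~ ".*"]'
    statements = "".join(
        f"{kind}{tag_filter};" for kind in ("node", "way", "relation")
    )
    # "out;" is an assumption; substitute the output statement you actually use.
    return f"[out:json][timeout:25][bbox:{bbox}];({statements});out;"


def build_url(query):
    """Return the full request URL with the query percent-encoded."""
    return OVERPASS_URL + "?" + urlencode({"data": query})


if __name__ == "__main__":
    print(build_url(build_query()))
```

Letting urlencode handle the escaping removes the need to write %22 by hand for every quotation mark.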

grok pattern match for access log event

I am new to ELK and grok pattern matching. I am trying to build a grok pattern for my access log event, and I am getting a grokparsefailure message.
Here is my event log:
111.22.333.44 2015-09-15 14:27:02 POST /test/service/testservice 200 359 0.016
Grok pattern (after some research I came up with this):
%{IP:client}%{DATESTAMP_EVENTLOG:logeventtime}%{WORD:method}%{URIPATHPARAM:request}%{NUMBER:HTTPStatus}%{NUMBER:bytes}%{NUMBER:duration}
I suspected the issue might be with the date match above, so I tried removing the space between the date and time, and that did not work either. I then removed the date and time altogether and tried matching the remainder, and that gave the same error. I am at a loss as to where the issue is. Any input would be helpful. Thanks!
Start here: http://grokdebug.herokuapp.com/discover?#
%{HOST:client} %{TIMESTAMP_ISO8601:event_date_time} %{WORD:http_method} %{URIPATHPARAM} %{NUMBER:status} %{NUMBER:bytes} %{NUMBER:duration}
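One visible difference from the pattern in the question is that this one has literal spaces between the fields. Grok patterns compile down to a regular expression, so those separators have to be matched explicitly. A rough Python analogue, as a sketch only: the character classes are simplified stand-ins for the grok patterns, and the field name "request" is borrowed from the question because the answer leaves %{URIPATHPARAM} unnamed.

```python
import re

# Simplified stand-ins for the grok patterns; note the single spaces between fields.
ACCESS_RE = re.compile(
    r"^(?P<client>\S+) "
    r"(?P<event_date_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<http_method>[A-Z]+) "
    r"(?P<request>\S+) "
    r"(?P<status>\d{3}) "
    r"(?P<bytes>\d+) "
    r"(?P<duration>\d+(?:\.\d+)?)$"
)


def parse_access_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = ACCESS_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be aggregated later.
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    fields["duration"] = float(fields["duration"])
    return fields


if __name__ == "__main__":
    line = "111.22.333.44 2015-09-15 14:27:02 POST /test/service/testservice 200 359 0.016"
    print(parse_access_line(line))
```

Removing the spaces from the expression makes the sample line fail to match, which is the same kind of failure Logstash reports as a grokparsefailure.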