My server runs an application called 'logstash'. It receives log entries from multiple servers and uploads them to MongoDB as JSON documents. Works like a charm.
Example:
{
u'syslog_message': u'[10724525.839722] [UFW BLOCK] IN=venet0 OUT= MAC= SRC=1.2.3.4 DST=9.8.7.6 LEN=52 TOS=0x08 PREC=0x20 TTL=50 ID=55384 PROTO=TCP SPT=349 DPT=123 WINDOW=14600 RES=0x00 SYN URGP=0 ',
u'received_from': u'1.3.5.7:1234',
u'#version': u'1',
u'#timestamp': datetime.datetime(2014, 11, 20, 15, 9, 55),
u'syslog_timestamp': u'Nov 20 15:09:55',
u'syslog_facility': u'user-level',
u'syslog_severity': u'notice',
u'host': u'2.4.6.8:2468',
u'syslog_program': u'kernel',
u'syslog_hostname': u'server01',
u'received_at': u'2014-11-20 20:09:55 UTC',
u'message': u'<4>Nov 20 15:09:55 server01 kernel: [10724525.839722] [UFW BLOCK] IN=venet0 OUT= MAC= SRC=1.2.3.4 DST=2.3.4.5 LEN=52 TOS=0x08 PREC=0x20 TTL=50 ID=55384 PROTO=TCP SPT=1234 DPT=543 WINDOW=14600 RES=0x00 SYN URGP=0 ',
u'_id': ObjectId('546e4a93e98673fe8f11a4d2'),
u'type': u'syslog',
u'syslog_severity_code': 5,
u'syslog_facility_code': 1
}
But the data is not exactly how I want it. I need to convert strings to dates, add some fields based on other fields, and apply other 'transformations' to each document that is loaded into a specific collection.
What is the standard way to handle this, and where is this process documented?
Logstash has a number of filter plugins that can be used to add, delete, and modify message fields; the Logstash documentation lists them all. Judging by your example message above, a number of filters are already in place. It sounds like you'll at least need an additional date filter and a mutate filter to accomplish what you're outlining.
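As a rough sketch only (the date patterns and the added field name are assumptions based on your sample document, not a tested configuration), the filter section could look something like this:
filter {
  # Parse the syslog timestamp string into a proper date on the event
  date {
    match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
  }
  # Add or rewrite fields derived from existing ones
  mutate {
    add_field => { "received_host" => "%{syslog_hostname}" }
  }
}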
I have a question: how can I configure the telegraf.conf file to collect logs from the "zimbra.log" file?
I have tried the config below, but it does not work.
I want to send these logs to Grafana.
Here is an example line from "zimbra.log":
Oct 1 10:20:46 webmail postfix/smtp[7677]: BD5BAE9999: to=user#mail.com, relay=mo94.cloud.mail.com[92.97.907.14]:25, delay=0.73, delays=0.09/0.01/0.58/0.19, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 4C25fk2pjFz32N5)
And I do not understand exactly how "grok_patterns =" works:
[[inputs.tail]]
files = ["/var/log/zimbra.log"]
from_beginning = false
grok_patterns = ['%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST} %{DATA:program}(?:\[%{POSINT}\])?: %{GREEDYDATA:message}']
name_override = "zimbra_access_log"
grok_custom_pattern_files = []
grok_custom_patterns = '''
TS_UNIX %{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{HOUR}:%{MINUTE}:%{SECOND}
TS_CUSTOM %{MONTH}%{SPACE}%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}
'''
grok_timezone = "Local"
data_format = "grok"
I have copied your example line into a log file called Prueba.txt, which contains the following lines:
Oct 3 00:52:32 webmail postfix/smtp[7677]: BD5BAE9999: to=user#mail.com, relay=mo94.cloud.mail.com[92.97.907.14]:25, delay=0.73, delays=0.09/0.01/0.58/0.19, dsn=2.0.0, status=sent (250 2.0$
Oct 13 06:25:01 webmail systemd-logind[949]: New session 229478 of user zimbra.
Oct 13 06:25:02 webmail zmconfigd[27437]: Shutting down. Received signal 15
Oct 13 06:25:02 webmail systemd-logind[949]: Removed session c296.
Oct 13 06:25:03 webmail sshd[28005]: Failed password for invalid user julianne from 120.131.2.210 port 10570 ssh2
I have been able to parse the data with this configuration of the inputs.tail plugin:
[[inputs.tail]]
files = ["Prueba.txt"]
from_beginning = true
data_format = "grok"
grok_patterns = ['%{TIMESTAMP_ZIMBRA} %{GREEDYDATA:source} %{DATA:program}(?:\[%{POSINT}\])?: %{GREEDYDATA:message}']
grok_custom_patterns = '''
TIMESTAMP_ZIMBRA (\w{3} \d{1,2} \d{2}:\d{2}:\d{2})
'''
name_override = "log_frames"
You need to match the input string with regular expressions. For that there are some predefined patterns, such as GREEDYDATA = .*, that you can use to match your input (other examples are NUMBER (?:%{BASE10NUM}) and BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))). You can also define your own patterns in grok_custom_patterns. Take a look at this page listing some patterns: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Apx-GrokPatterns/GrokPatterns_title.html
In this case I defined a TIMESTAMP_ZIMBRA pattern that matches inputs such as Oct 3 00:52:32 and Oct 03 00:52:33 alike.
Here is the metric as collected by Prometheus:
# HELP log_frames_delay Telegraf collected metric
# TYPE log_frames_delay untyped
log_frames_delay{delays="0.09/0.01/0.58/0.19",dsn="2.0.0",host="localhost.localdomain",message="BD5BAE9999:",path="Prueba.txt",program="postfix/smtp",relay="mo94.cloud.mail.com[92.97.907.14]:25",source="webmail",status="sent (250 2.0.0 Ok: queued as 4C25fk2pjFz32N5)",to="user#mail.com"} 0.73
P.S.: Ensure that Telegraf has read access to the log files.
I would appreciate some help with windowing in Apache Beam 2.13.0.
I am using Python 3.7.3:
[ywatanabe#fedora-30-00 bdd100k-to-es]$ python3 -V
Python 3.7.3
I want to do exactly what this example does, but against bounded data: group events by each trigger firing and pass them to the next transform.
8.4.1.2. Discarding mode
If our trigger is set to discarding mode, the trigger emits the following values on each firing:
First trigger firing: [5, 8, 3]
Second trigger firing: [15, 19, 23]
Third trigger firing: [9, 13, 10]
Referencing that example, I wrote my code as below:
es = (gps | 'window:gps' >> WindowInto(
                FixedWindows(1 * 60),
                trigger=Repeatedly(
                    AfterAny(
                        AfterCount(1000000),
                        AfterProcessingTime(1 * 60)
                    )
                ),
                accumulation_mode=AccumulationMode.DISCARDING
            )
          | 'bulk:gps' >> beam.ParDo(BulkToESFn(esHost), tag_gps))
However, with the above code it looks like the trigger fires almost every millisecond instead of every minute or every 1,000,000 events.
2019-07-15 20:13:20,401 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287487.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.786611081350365, 'lon': -122.3994713602353}, 'altitude': 16.06207275390625, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3399999141693115}
2019-07-15 20:13:20,403 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287488.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78659459994027, 'lon': -122.39945105706596}, 'altitude': 15.888671875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3299999237060547}
2019-07-15 20:13:20,406 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287489.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78657796009011, 'lon': -122.39943055871701}, 'altitude': 15.741912841796875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.549999952316284}
Do I need any other windowing option for this case?
I think the window and trigger strategies take effect at the GroupByKey (GBK) step:
https://beam.apache.org/documentation/programming-guide/#windowing
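As a rough sketch (not tested; the key function, step names, and the idea of feeding grouped values into BulkToESFn are my assumptions), you could key the records and add a GroupByKey after the WindowInto so the trigger actually applies before the bulk write:
# Assumes the same imports as in your snippet (WindowInto, FixedWindows,
# Repeatedly, AfterAny, AfterCount, AfterProcessingTime, AccumulationMode).
es = (gps
      | 'window:gps' >> WindowInto(
            FixedWindows(1 * 60),
            trigger=Repeatedly(AfterAny(AfterCount(1000000),
                                        AfterProcessingTime(1 * 60))),
            accumulation_mode=AccumulationMode.DISCARDING)
      # Key each document (hypothetical key), then group so the trigger fires here.
      | 'key:gps' >> beam.Map(lambda doc: (doc['track_id'], doc))
      | 'group:gps' >> beam.GroupByKey()
      # Downstream now receives (key, iterable_of_docs) per window/firing,
      # so BulkToESFn would have to be adapted to bulk-write the whole group.
      | 'bulk:gps' >> beam.ParDo(BulkToESFn(esHost), tag_gps))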
Alternatively, in your case I think you can implement BulkToESFn so that it buffers data and only writes to ES when the count exceeds a predefined value:
import apache_beam as beam

class BulkToESFn(beam.DoFn):
    def __init__(self, es_host, batch_size=1000000):
        # es_host is accepted so the pipeline's BulkToESFn(esHost) call still works.
        self.es_host = es_host
        self.batch_size = batch_size
        self.batch = []

    def start_bundle(self):
        # Start each bundle with an empty buffer.
        self.batch = []

    def process(self, element, *args, **kwargs):
        # Buffer elements; only write once the batch is large enough.
        self.batch.append(element)
        if len(self.batch) >= self.batch_size:
            self._flush()

    def finish_bundle(self):
        # Write whatever is still buffered when the bundle ends.
        self._flush()

    def _flush(self):
        if self.batch:
            writeToES(self.batch)  # your existing bulk-write helper
            self.batch = []
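With that, the original pipeline can stay as it is: the batching happens inside the DoFn, and finish_bundle ensures a partially filled batch is still written out when the bounded input is exhausted.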
I've got an app that saves lots of sensor_events. I'd like to fetch the results for a given date and then map them into 15-minute chunks. This is not the same as doing a GROUP BY in Postgres, since that would only return aggregates and I need the specific events.
What I'm thinking is: given a day, I take beginning_of_day and split it into 15-minute chunks that become the keys of a hash of arrays, i.e.:
def return_time_chunk_hash
  t = Date.today
  st = t.beginning_of_day
  times = Hash.new
  while st < t.end_of_day
    times[st.to_formatted_s(:time)] = Array.new
    st = st + 15.minutes
  end
  return times
end
From that I would compare each sensor_event's created_at, find which bucket it belongs to, and plop it in there. Once I've got it that way I know whether a chunk has any events (.count) and, if so, can do all the data manipulation on those specific events.
Does this seem nutty? Is there a simpler way I'm not seeing?
Update:
I liked the way jgraft was thinking, but thought it wouldn't work since I'd have to do multiple queries based on the block column flag. Then I thought of Enumerable's group_by, so I tried something like this in the actual SensorEvent model:
def chunk
  Time.at((self.created_at.to_f / 15.minutes).floor * 15.minutes).to_formatted_s(:time)
end
This allows me to get all the sensor events I need as usual (e.g. @se = SensorEvent.where(sensor_id: 10)), and then I can do @se.group_by(&:chunk) to get those individual events grouped into a hash, e.g.:
{"13:30"=>
[#<SensorEvent:0x007ffac0128438
id: 25006,
force: 0.0,
xaccel: 502.0,
yaccel: 495.0,
zaccel: 616.0,
battery: 0.0,
position: 25.0,
created_at: Thu, 18 Jun 2015 13:33:37 EDT -04:00,
updated_at: Thu, 18 Jun 2015 15:51:32 EDT -04:00,
deviceID: "D36330135FE3D36",
location: 3,
sensor_id: 10>,
#<SensorEvent:0x007ffac0128140
id: 25007,
force: 0.0,
xaccel: 502.0,
yaccel: 495.0,
zaccel: 616.0,
battery: 0.0,
position: 27.0,
created_at: Thu, 18 Jun 2015 13:39:46 EDT -04:00,
updated_at: Thu, 18 Jun 2015 15:51:32 EDT -04:00,
deviceID: "D36330135FE3D36",
location: 3,
sensor_id: 10>,
.........
The trouble is, of course, that not every time chunk gets created, since there may have been no event to spawn it; also, being a hash, it's not sorted in any way:
res.keys
=> ["13:30",
"13:45",
"14:00",
"13:00",
"15:45",
"16:00",
"16:15",
"16:45",
"17:00",
"17:15",
"17:30",
"14:15",
"14:30",
I have to do calculations on the chunks of events; I might keep a master TIMECHUNKS array to compare / lookup in order...
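One possible way around both issues (just a sketch, reusing return_time_chunk_hash and the chunk method from above; not tested) is to merge the grouped events into the full hash of empty buckets:
buckets = return_time_chunk_hash            # every "HH:MM" key, each with an empty array
buckets.merge!(@se.group_by(&:chunk))       # fill in the buckets that actually have events
# Ruby hashes keep insertion order, so the keys stay chronological,
# and chunks with no events are still present with [] as their value.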
Why not just add a column to the sensor_events table that specifies which block each event belongs to? It would basically give you your buckets, since you could do a query like:
SensorEvent.where(date: Date.today, block: 1)
and return a relation of the data in an array-esque format.
You could just add an after_create callback to the SensorEvent model that sets the block column:
class SensorEvent < ActiveRecord::Base
  after_create :set_block

  private

  def set_block
    value = ((4 * created_at.hour) + (created_at.min.to_f / 15).ceil).to_i
    self.update_columns(block: value)
  end
end
I am writing an Icinga plugin to check whether the SMTP server we have contracted from a third party gets blacklisted.
The service uses an unknown number of SMTP relays. I need to download all the "Received" sections of the headers and parse them to get the different IPs of the SMTP relays.
I am trying to use Mail::IMAPClient, and I can perform some operations on the account (log in, choose a folder, search the messages, etc.), but I haven't found a way to get the whole header or the sections of it that I need.
I don't mind using a different module if needed.
You could try using the parse_headers method. According to the example in the documentation, you can use it like this:
$hashref = $imap->parse_headers(1,"Date","Received","Subject","To");
And then you get a hash reference that maps field names to references to arrays of values, like this:
$hashref = {
"Date" => [ "Thu, 09 Sep 1999 09:49:04 -0400" ] ,
"Received" => [ q/
from mailhub ([111.11.111.111]) by mailhost.bigco.com
(Netscape Messaging Server 3.6) with ESMTP id AAA527D for
<bigshot#bigco.com>; Fri, 18 Jun 1999 16:29:07 +0000
/, q/
from directory-daemon by mailhub.bigco.com (PMDF V5.2-31 #38473)
id <0FDJ0010174HF7#mailhub.bigco.com> for bigshot#bigco.com
(ORCPT rfc822;big.shot#bigco.com); Fri, 18 Jun 1999 16:29:05 +0000 (GMT)
/, q/
from someplace ([999.9.99.99]) by smtp-relay.bigco.com (PMDF V5.2-31 #38473)
with ESMTP id <0FDJ0000P74H0W#smtp-relay.bigco.com> for big.shot#bigco.com; Fri,
18 Jun 1999 16:29:05 +0000 (GMT)
/] ,
"Subject" => [ qw/ Help! I've fallen and I can't get up!/ ] ,
"To" => [ "Big Shot <big.shot#bigco.com> ] ,
};
That should give you all the Received headers in a single array.
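As a rough sketch (untested; the search criterion and the IP regex are assumptions), you could then loop over the messages and collect the relay IPs from those Received lines:
# Collect relay IPs from the Received headers of all matching messages.
my @ids = $imap->search('ALL');    # or whatever search you already use
my %relay_ips;
for my $id (@ids) {
    my $headers = $imap->parse_headers($id, "Received");
    for my $received (@{ $headers->{Received} || [] }) {
        # Grab anything that looks like a bracketed IPv4 address.
        while ($received =~ /\[(\d{1,3}(?:\.\d{1,3}){3})\]/g) {
            $relay_ips{$1} = 1;
        }
    }
}
print "$_\n" for sort keys %relay_ips;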
I need to read 8760 small files (365 days * 24 hours = 8760; about 60 KB each), aggregate the values, and take the average of some of them.
Earlier, I used the code below to read *.csv files:
for a = 1:365
    for b = 1:24
        s1 = int2str(a);
        s2 = int2str(b);
        s3 = strcat('temperature_humidity', s1, '_', s2);
        data = load(s3);
        % Code for aggregation, etc.
    end
end
I was able to run this code. However, the file names are now a little different and I am not sure how to read these files.
Files are named like this:
2005_M01_D01_0000(UTC-0800)_L00_NOX_1HR_CONC.DAT
where M = Month, so the values are 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12;
D = Day, so the values are 01, 02, 03, ..., 31;
Hours is in this format: 0000, 0100, 0200, ..., 1800, ..., 2300.
Please take a look at the attached image for the file name format. I need to read these files. Please help me.
Thank you very much.
I would use dir:
files = dir('*.DAT')
Or you can construct the file names with sprintf (note the zero-padded fields):
name = sprintf('%d_M%02d etc.', ...)
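For example, a sketch of the second approach, building each name from the pattern in the question (the fixed year 2005, the (UTC-0800) part, and the _L00_NOX_1HR_CONC.DAT suffix are taken from your example name and may need adjusting):
for m = 1:12
    for d = 1:31
        for h = 0:23
            fname = sprintf('2005_M%02d_D%02d_%02d00(UTC-0800)_L00_NOX_1HR_CONC.DAT', m, d, h);
            if exist(fname, 'file')   % skip day numbers that do not exist in a given month
                data = load(fname);
                % Code for aggregation, etc.
            end
        end
    end
end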