How to trigger events by count in windowing for apache beam? - apache-beam

I would appreciate help with windowing in Apache Beam 2.13.0. I am using Python 3.7.3:
[ywatanabe#fedora-30-00 bdd100k-to-es]$ python3 -V
Python 3.7.3
I want to do exactly what this example from the documentation does, but against bounded data: group the events emitted by each trigger firing and pass them to the next transform.
8.4.1.2. Discarding mode
If our trigger is set to discarding mode, the trigger emits the following values on each firing:
First trigger firing: [5, 8, 3]
Second trigger firing: [15, 19, 23]
Third trigger firing: [9, 13, 10]
Referencing the example, I have written my code as below:
es = (gps | 'window:gps' >> WindowInto(
          FixedWindows(1 * 60),
          trigger=Repeatedly(
              AfterAny(
                  AfterCount(1000000),
                  AfterProcessingTime(1 * 60)
              )
          ),
          accumulation_mode=AccumulationMode.DISCARDING
      )
      | 'bulk:gps' >> beam.ParDo(BulkToESFn(esHost), tag_gps))
However, with the above code it looks like the trigger fires almost every millisecond instead of every minute or every 1,000,000 events:
2019-07-15 20:13:20,401 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287487.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.786611081350365, 'lon': -122.3994713602353}, 'altitude': 16.06207275390625, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3399999141693115}
2019-07-15 20:13:20,403 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287488.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78659459994027, 'lon': -122.39945105706596}, 'altitude': 15.888671875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3299999237060547}
2019-07-15 20:13:20,406 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287489.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78657796009011, 'lon': -122.39943055871701}, 'altitude': 15.741912841796875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.549999952316284}
Do I need any other windowing options for this case?

I think the window strategy and trigger strategy only take effect at the GroupByKey (GBK) step.
https://beam.apache.org/documentation/programming-guide/#windowing
In your case, I think you can implement the DoFn (BulkToESFn) so that it buffers data and only writes to ES when the count exceeds a predefined value:
from apache_beam import DoFn

class BulkToESFn(DoFn):
    def __init__(self, es_host, batch_size=1000000):
        # es_host kept so this matches BulkToESFn(esHost) in your pipeline.
        self.es_host = es_host
        self.batch_size = batch_size
        self.batch = []

    def process(self, element, *args, **kwargs):
        # Buffer elements and only write once the batch is large enough.
        self.batch.append(element)
        if len(self.batch) >= self.batch_size:
            self._flush()

    def finish_bundle(self):
        # Flush whatever is left at the end of each bundle.
        self._flush()

    def _flush(self):
        if self.batch:
            writeToES(self.batch)  # placeholder for the actual bulk write to self.es_host
            self.batch = []
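For reference, if you want the trigger firings themselves to produce the batches (as in the documentation example you quoted), the pipeline needs a GroupByKey downstream of the WindowInto, because that is where the window/trigger strategy is applied. A rough sketch under that assumption; the keying step and the handling of the grouped values are mine, not taken from your code:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterAny, AfterCount, AfterProcessingTime, Repeatedly)

es = (gps
      | 'window:gps' >> beam.WindowInto(
            FixedWindows(1 * 60),
            trigger=Repeatedly(AfterAny(AfterCount(1000000),
                                        AfterProcessingTime(1 * 60))),
            accumulation_mode=AccumulationMode.DISCARDING)
      # Assumed key: put everything under one key so each firing yields one batch.
      | 'key:gps' >> beam.Map(lambda doc: (None, doc))
      | 'group:gps' >> beam.GroupByKey()
      # Each element is now (key, iterable_of_docs) per trigger firing, so the
      # downstream DoFn receives whole batches instead of single documents.
      | 'bulk:gps' >> beam.ParDo(BulkToESFn(esHost), tag_gps))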

Related

How to use regex in CF template for conditions

I am using conditions in a troposphere CF template, but unfortunately there are more than 10 conditions and AWS CF supports only 10 of them. Each condition checks whether the app name starts with a particular name. Is there a way to use a regex in a condition, so I can write only one condition instead of 10, stating: do something if the name starts with appname*?
I am adding a condition for each role, but since AWS supports only 10, I can't add more than that.
conditions = {
    "RoleEqualCollectors01": Equals(
        Ref(ThorRole),
        "collectors01",
    ),
    ...,
    ...,
    "RoleEqualCollectors22": Equals(
        Ref(ThorRole),
        "collectors22",
    ),
    "Collector": Or(
        Condition("RoleEqualCollectors01"),
        ...,
        ...,
        Condition("RoleEqualCollectors22")
    ),
Is there a way I can specify it like this:
conditions = {
    "RoleEqualCollectors": Equals(
        Ref(ThorRole),
        "collectors*",
    ),
    "Collector": Or(
        Condition("RoleEqualCollectors*"),
    ),
I just found out that AWS has a limit on the Fn::Or condition: it needs a minimum of 2 and a maximum of 10 conditions. There is a workaround: I created three separate Or conditions and then a Final_Or condition that combines all of them.
or1: Fn::Or for conditions 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
or2: Fn::Or for conditions 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
or3: Fn::Or for conditions 21, 22
Final_Or: Fn::Or for or1, or2, or3
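In troposphere, that nested-Or workaround could look roughly like the sketch below. It reuses ThorRole and the RoleEqualCollectors01..22 names from the question; the group names (CollectorOr1..3) are just illustrative, and since Fn::Or accepts between 2 and 10 arguments, the 22 role checks are split into three groups before being combined:

from troposphere import Condition, Equals, Or, Ref

# One Equals condition per role, named as in the question.
role_conditions = {
    "RoleEqualCollectors%02d" % i: Equals(Ref(ThorRole), "collectors%02d" % i)
    for i in range(1, 23)
}

# Split the 22 checks into Or groups of at most 10 conditions each.
or_groups = {
    "CollectorOr1": Or(*[Condition("RoleEqualCollectors%02d" % i) for i in range(1, 11)]),
    "CollectorOr2": Or(*[Condition("RoleEqualCollectors%02d" % i) for i in range(11, 21)]),
    "CollectorOr3": Or(*[Condition("RoleEqualCollectors%02d" % i) for i in range(21, 23)]),
}

conditions = {**role_conditions, **or_groups}
# Final Or over the three groups, equivalent to the original 22-way "Collector".
conditions["Collector"] = Or(
    Condition("CollectorOr1"),
    Condition("CollectorOr2"),
    Condition("CollectorOr3"),
)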

How do you get directory listing sorted by date in Elixir?

How do you get directory listing sorted by date in Elixir?
File.ls/1 gives a list sorted by filename only.
No other function in the File module seems relevant for this.
Maybe there's a built-in function I don't know about, but you can make your own by using File.stat!/2:
File.ls!("path/to/dir")
|> Enum.map(&{&1, File.stat!(Path.join("path/to/dir", &1)).ctime})
|> Enum.sort(fn {_, time1}, {_, time2} -> time1 <= time2 end)
Example output:
[
{"test", {{2019, 3, 9}, {23, 55, 48}}},
{"config", {{2019, 3, 9}, {23, 55, 48}}},
{"README.md", {{2019, 3, 9}, {23, 55, 48}}},
{"_build", {{2019, 3, 9}, {23, 59, 48}}},
{"test.xml", {{2019, 3, 23}, {22, 1, 28}}},
{"foo.ex", {{2019, 4, 20}, {4, 26, 5}}},
{"foo", {{2019, 4, 21}, {3, 59, 29}}},
{"mix.exs", {{2019, 7, 27}, {8, 45, 0}}},
{"mix.lock", {{2019, 7, 27}, {8, 45, 7}}},
{"deps", {{2019, 7, 27}, {8, 45, 7}}},
{"lib", {{2019, 7, 27}, {9, 5, 36}}}
]
Edit:
As pointed out in a comment, the System.cmd approach below assumes you're already in the directory you want to list. If that's not the case, you can specify the directory with the :cd option, like so:
System.cmd("ls", ["-lt"], cd: "path/to/dir")
You can also make use of System.cmd/3 to achieve this.
In particular, you want to use the "ls" command with the "-t" flag, which sorts by modification date, and maybe "-l", which provides extra information.
Therefore you can use it like this:
# To simply get the filenames sorted by modification date
System.cmd("ls", ["-t"])
# Or with extra info
System.cmd("ls", ["-lt"])
This will return a tuple containing a String with the results and a number with the exit status.
So, if you just call it like that, it will produce something like:
iex> System.cmd("ls", ["-t"])
{"test_file2.txt\ntest_file1.txt\n", 0}
Having this, you can do lots of things, even pattern match over the exit code to process the output accordingly:
case System.cmd("ls", ["-t"]) do
  {contents, 0} ->
    # You can for instance retrieve a list with the filenames
    String.split(contents, "\n")

  {_contents, exit_code} ->
    # Or provide an error message
    {:error, "Directory contents could not be read. Exit code: #{exit_code}"}
end
If you don't want to handle the exit code and just care about the contents you can simply run:
System.cmd("ls", ["-t"]) |> elem(0) |> String.split("\n")
Notice, however, that this will include an empty string at the end, because the output string ends with a newline ("\n").

Unable To Plot Graph From PostgreSQL Query Results In Dash App

I am attempting to write a simple piece of code that plots a bar graph of some fruit names on the x-axis vs. the corresponding sales units. The aim of this code is just to understand how to query Postgres results from a Heroku-hosted database through a Dash app.
Below is the code,
from dash import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objs as go
import psycopg2
import os

DATABASE_URL = os.environ['DATABASE_URL']
conn = psycopg2.connect(DATABASE_URL, sslmode='require')
cur = conn.cursor()

cur.execute("SELECT fruits FROM pgrt_table")
fruits1 = cur.fetchall()
#print(fruits1)

cur.execute("SELECT sales FROM pgrt_table")
sales1 = cur.fetchall()

app = dash.Dash()

app.layout = html.Div(children=[
    html.H1(
        children='Hello Dash'
    ),
    html.Div(
        children='''Dash: A web application framework for Python.'''
    ),
    dcc.Graph(
        id='example-graph',
        figure=go.Figure(
            data=[
                go.Bar(
                    x=fruits1, y=sales1, name='SF'),
                #{'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
            ],
            #'layout':{
            #    'title': 'Dash Data Visualization'
            #}
        )
    )
])

if __name__ == '__main__':
    app.run_server(debug=True)
The output is below:
[Screenshot: output of the above code]
The corresponding output is just the axes with no bars. The connection to the database is working, since printing fruits1 or sales1 gives me the values from the columns in Postgres. The only issue is the plotting.
NOTE: This question has been heavily modified, since the previous draft was extremely vague and had no code to show.
Example of what your fetchall() calls return:
fruits1 = [('apple',), ('banana',),
('mango',), ('pineapple',),
('peach',), ('watermelon',)]
The output of your database cannot be used directly; you need to unpack the tuples first:
xData = [fruit[0] for fruit in fruits1]
# gives ['apple', 'banana', 'mango', 'pineapple', 'peach', 'watermelon']
yData = [sales[0] for sales in sales1]
You have to assign your data to the go.Bar object:
go.Bar(x=xData, y=yData, name='SF')
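Putting it together, the dcc.Graph in your layout would then look roughly like this (a sketch reusing fruits1 and sales1 from your code; the layout title comes from the block you commented out):

import dash_core_components as dcc
import plotly.graph_objs as go

# Unpack the single-column tuples returned by cur.fetchall()
xData = [fruit[0] for fruit in fruits1]   # e.g. ['apple', 'banana', ...]
yData = [sales[0] for sales in sales1]

dcc.Graph(
    id='example-graph',
    figure=go.Figure(
        data=[go.Bar(x=xData, y=yData, name='SF')],
        layout=go.Layout(title='Dash Data Visualization'),
    ),
)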

creating arrays of results based upon chunks of time

I've got an app that has lots of sensor_events being saved; I'd like to get results by date, then map those into 15-minute chunks of time... This is not the same as doing a GROUP BY in Postgres, as that would only return something averaged, and I need the specific events...
What I'm thinking is: given a day, I get the beginning_of_day and split it into 15-minute chunks that become the keys to a hash of arrays, i.e.
def return_time_chunk_hash
  t = Date.today
  st = t.beginning_of_day
  times = Hash.new
  while st < t.end_of_day
    times[st.to_formatted_s(:time)] = Array.new
    st = st + 15.minutes
  end
  return times
end
From that I would compare each sensor_event's created_at date, find which bucket it belongs to, and plop it in there. Once I've got it that way, I know whether or not a chunk has any events (.count) and, if so, can do all the data manipulation on those specific events.
Does this seem nutty? Is there a simpler way I'm not seeing?
Update:
I liked the way jgraft was thinking, but thought it wouldn't work, as I'd have to do multiple queries based upon the group column flag. But then I thought of Enumerable's group_by, so I tried something like this in the actual SensorEvent model:
def chunk
  Time.at((self.created_at.to_f / 15.minutes).floor * 15.minutes).to_formatted_s(:time)
end
This allows me to get all the sensor events I need as usual (i.e. @se = SensorEvent.where(sensor_id: 10)), but then I can do @se.group_by(&:chunk) and I get those individual events grouped into a hash, i.e.:
{"13:30"=>
  [#<SensorEvent:0x007ffac0128438
    id: 25006,
    force: 0.0,
    xaccel: 502.0,
    yaccel: 495.0,
    zaccel: 616.0,
    battery: 0.0,
    position: 25.0,
    created_at: Thu, 18 Jun 2015 13:33:37 EDT -04:00,
    updated_at: Thu, 18 Jun 2015 15:51:32 EDT -04:00,
    deviceID: "D36330135FE3D36",
    location: 3,
    sensor_id: 10>,
   #<SensorEvent:0x007ffac0128140
    id: 25007,
    force: 0.0,
    xaccel: 502.0,
    yaccel: 495.0,
    zaccel: 616.0,
    battery: 0.0,
    position: 27.0,
    created_at: Thu, 18 Jun 2015 13:39:46 EDT -04:00,
    updated_at: Thu, 18 Jun 2015 15:51:32 EDT -04:00,
    deviceID: "D36330135FE3D36",
    location: 3,
    sensor_id: 10>,
   .........
The trouble is, of course, that not every chunk of time will be present, since there may have been no event to spawn it; also, being a hash, it's not sorted in any way:
res.keys
=> ["13:30",
"13:45",
"14:00",
"13:00",
"15:45",
"16:00",
"16:15",
"16:45",
"17:00",
"17:15",
"17:30",
"14:15",
"14:30",
I have to do calculations on the chunks of events; I might keep a master TIMECHUNKS array to compare / lookup in order...
Why not just add a column to the sensor_events table that specifies which block it belongs to? That would basically give you an array, as you could do a query like:
SensorEvent.where(date: Date.today, block: 1)
and return a relation of the data in an array-esque format.
You could just add an after_create callback to the SensorEvent model that sets the block column:
class SensorEvent < ActiveRecord::Base
  after_create :set_block

  private

  def set_block
    value = ((4 * created_at.hour) + (created_at.min.to_f / 15).ceil).to_i
    self.update_columns(block: value)
  end
end

Data transformation during load

An application called 'logstash' runs on my server. It receives log entries from multiple servers and uploads them as JSON documents to MongoDB. It works like a charm.
Example:
{
  u'syslog_message': u'[10724525.839722] [UFW BLOCK] IN=venet0 OUT= MAC= SRC=1.2.3.4 DST=9.8.7.6 LEN=52 TOS=0x08 PREC=0x20 TTL=50 ID=55384 PROTO=TCP SPT=349 DPT=123 WINDOW=14600 RES=0x00 SYN URGP=0 ',
  u'received_from': u'1.3.5.7:1234',
  u'#version': u'1',
  u'#timestamp': datetime.datetime(2014, 11, 20, 15, 9, 55),
  u'syslog_timestamp': u'Nov 20 15:09:55',
  u'syslog_facility': u'user-level',
  u'syslog_severity': u'notice',
  u'host': u'2.4.6.8:2468',
  u'syslog_program': u'kernel',
  u'syslog_hostname': u'server01',
  u'received_at': u'2014-11-20 20:09:55 UTC',
  u'message': u'<4>Nov 20 15:09:55 server01 kernel: [10724525.839722] [UFW BLOCK] IN=venet0 OUT= MAC= SRC=1.2.3.4 DST=2.3.4.5 LEN=52 TOS=0x08 PREC=0x20 TTL=50 ID=55384 PROTO=TCP SPT=1234 DPT=543 WINDOW=14600 RES=0x00 SYN URGP=0 ',
  u'_id': ObjectId('546e4a93e98673fe8f11a4d2'),
  u'type': u'syslog',
  u'syslog_severity_code': 5,
  u'syslog_facility_code': 1
}
But the data is not exactly how I want it to be. I need to convert strings to dates, add some elements based on other elements, and apply more 'transformations' to each document that is loaded into a specific collection.
What is the default way to handle this and where is this entire process documented?
Logstash has a number of filter plugins that can be used to add, delete, and modify message fields; the Logstash documentation lists them all. Judging by your example message above, I'd say there are already a number of filters in place. It sounds like you'll at least need an additional date filter and a mutate filter to accomplish what you're outlining.