How do I track state across runners in a Dataflow job? - streaming

I'm currently creating a streaming Dataflow job that carries out computation only if there is an increment in the "Ring" column of my data.
My Dataflow code:
job = (p | "Read" >> beam.io.ReadFromPubSub(topic=topic)
         | "Parse Json" >> beam.Map(json.loads)
         | "ParDo Divisors" >> beam.ParDo(UpdateDelayTable()))
Data flowing in from Pub/Sub:
[
    {..., "Ring": 1},
    {..., "Ring": 1},
    {..., "Ring": 1},
    {..., "Ring": 2},
    ...
]
I want my pipeline to track the current ring number and trigger a function only when the ring number has incremented. How should I go about doing this?

Pub/Sub
There is no guarantee that {"Ring": 2} will be received/sent by Pub/Sub after {"Ring": 1}.
It seems you first have to enable ordered message delivery for Pub/Sub, and also make sure the publisher sends the Ring data in increasing order.
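For the publishing side, a rough sketch with the google-cloud-pubsub Python client might look like this (the project, topic and key names are placeholders; the subscription must also be created with message ordering enabled, and publishing with ordering keys may require a regional endpoint):

from google.cloud import pubsub_v1

# Sketch only: enable ordered publishing and give all Ring messages the same
# ordering key so Pub/Sub preserves their relative order.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

for ring in (1, 1, 1, 2):
    data = b'{"Ring": %d}' % ring
    publisher.publish(topic_path, data, ordering_key="ring")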
Dataflow
Then to achieve it with Dataflow, you can use stateful processing.
But be mindful that the "state" of "Ring" is per key (and per window). To do what you want, all the elements need to have the same key and fall into the same window (global window in this case). It's going to be a very "hot" key.
Example code:
import apache_beam as beam
from apache_beam.coders import coders
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class RingFn(beam.DoFn):
    RING_STATE = ReadModifyWriteStateSpec(
        name='Ring', coder=coders.VarIntCoder())

    def process(self, element, ring=beam.DoFn.StateParam(RING_STATE)):
        current_ring = ring.read() or 0
        if element['Ring'] > current_ring:
            print('Carry out your computation here!')
            ring.write(element['Ring'])

# Usage
pcoll | beam.ParDo(RingFn())

# Check your keys if you are not sure what they are.
pcoll | beam.Keys() | beam.Map(print)
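Also note that stateful processing in the Python SDK works on keyed elements, so in practice you would key every message with one constant key before the ParDo. A minimal sketch under that assumption (the step names are illustrative); with a keyed input, process receives a (key, value) pair, so inside RingFn the dict would be read from element[1]:

import json
import apache_beam as beam

# Sketch only: one constant key -> one state cell tracking the ring number
# for the whole stream (this is the "hot" key mentioned above).
tracked = (p
           | "Read" >> beam.io.ReadFromPubSub(topic=topic)
           | "Parse Json" >> beam.Map(json.loads)
           | "Key" >> beam.Map(lambda msg: (None, msg))
           | "Track Ring" >> beam.ParDo(RingFn()))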

Related

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it and write it into Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
  try (PreparedStatement statement =
      connection.prepareStatement(
          query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    statement.setFetchSize(fetchSize);
    parameterSetter.setParameters(context.element(), statement);
    try (ResultSet resultSet = statement.executeQuery()) {
      while (resultSet.next()) {
        context.output(rowMapper.mapRow(resultSet));
      }
    }
  }
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver loads the entire result within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
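For illustration only, and in Beam's Python SDK rather than Scio since that is what the first question in this thread uses, the row-number-as-timestamp idea could look roughly like this, assuming each element already carries its row number (e.g. emitted by the custom JDBC source):

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

ROWS_PER_FILE = 500  # illustrative

def with_row_timestamp(indexed_row):
    index, row = indexed_row
    # Pretend each row arrived one "second" after the previous one.
    return TimestampedValue(row, index)

# Sketch only: `indexed_rows` is assumed to be a PCollection of
# (row_number, row) pairs produced by the custom source.
windowed = (indexed_rows
            | "FakeTimestamps" >> beam.Map(with_row_timestamp)
            | "ChunkRows" >> beam.WindowInto(FixedWindows(ROWS_PER_FILE)))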
elementCountAtLeast is a lower bound, so producing only one pane is a valid choice for a runner to make.
You have a couple of options when doing this for a batch pipeline:
1. Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")

pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or a source that supports splitting. To my knowledge, JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect (sketched below), which enables parallel processing after the data has been read from the database.
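As a rough illustration of where that Reshuffle sits (shown with Beam's Python SDK since that is what the first question in this thread uses; the read transform is a placeholder):

import apache_beam as beam

# Sketch only: `read_rows_from_jdbc` stands in for whatever JDBC read is used,
# and `p` is the Beam pipeline object.
rows = (p
        | "ReadJdbc" >> read_rows_from_jdbc
        | "Fanout" >> beam.Reshuffle()  # breaks fusion so later steps can run in parallel
        | "Format" >> beam.Map(lambda row: "|".join(map(str, row))))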
2. Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
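For comparison, a hedged sketch of the same batching idea in Beam's Python SDK; GroupIntoBatches operates on keyed elements, so a constant key is added and then dropped (the names are illustrative):

import apache_beam as beam

# Sketch only: `rows` is the PCollection of formatted row strings from the
# earlier steps; batch ~500 of them at a time under a single dummy key.
batched = (rows
           | "Key" >> beam.Map(lambda row: (None, row))
           | "Batch" >> beam.GroupIntoBatches(500)
           | "DropKey" >> beam.Map(lambda kv: kv[1]))  # each element is now a batch of rows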
There are a few other ways to do this, each with its pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

How to make ReadAllFromText not Block Beam Pipeline?

I would like to implement a very simple Beam pipeline:
read Google Cloud Storage links to text files from a Pub/Sub topic -> read each text file line by line -> write to BigQuery.
Apache Beam has a pre-implemented PTransform for each of these steps.
So the pipeline would be:
Pipeline | ReadFromPubSub("topic_name") | ReadAllFromText() | WriteToBigQuery("table_name")
However, ReadAllFromText() somehow blocks the pipeline. Creating a custom PTransform that returns a random line after reading from Pub/Sub, and writing it to the BigQuery table, works normally (no blocking). Adding a fixed window of 3 seconds or triggering on each element doesn't solve the problem either.
Each file is around 10MB and 23K lines.
Unfortunately, I cannot find documentation on how ReadAllFromText is supposed to work. It would be really weird if it tried to block the pipeline until it has read all the files. I would expect the function to push each line into the pipeline as soon as it reads it.
Is there any known reason for the above behavior? Is it a bug or am I doing something wrong?
Pipeline code:
pipeline_options = PipelineOptions(pipeline_args)

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromPubSub(subscription=source_dict["input"]) \
              | 'window' >> beam.WindowInto(window.FixedWindows(3, 0)) \
              | ReadAllFromText(skip_header_lines=1)

    elements = lines | beam.ParDo(SplitPayload())

    elements | WriteToBigQuery(source_dict["output"], write_disposition=BigQueryDisposition.WRITE_APPEND)

...

class SplitPayload(beam.DoFn):
    def process(self, element, *args, **kwargs):
        timestamp, id, payload = element.split(";")

        return [{
            'timestamp': timestamp,
            'id': id,
            'payload': payload
        }]

Can I adjust the graph-on-parent dimensions of a pd subpatch by sending messages to it (dynamic patching)?

I'm currently building an abstraction which creates its parts dynamically based on the argument to the abstraction. There is one main element and several elements of another type (as many as the argument dictates) that will be attached to that main element. All need to be visible in the host patch. I know that this in itself will work fine, but I need to be able to expand the graph-on-parent frame dynamically according to how many of these variable elements are created.
I have looked here: https://puredata.info/docs/tutorials/TipsAndTricks#undocumented-pd-internal-messages but this is notoriously undocumented and I wonder which of these messages work and are appropriate...mycnv perhaps?
You are looking for the donecanvasdialog message sent to the subpatch:
donecanvasdialog <xunits> <yunits> <gopmode> <xfrom> <yfrom> <xto> <yto> <width> <height> <xoffset> <yoffset> 1
Usually you'll leave most elements at their default values and only change width and height:
donecanvasdialog 0 0 1 0 -1 1 1 $1 $2 100 100 1
To find out these things yourself, you can always run Pd in debug mode to intercept the Pd <-> Pd-GUI communication and see how Pd does it internally (2 shows the GUI->Pd messages, and 1 shows the Pd->GUI messages):
$ pd -nrt -d 2
And of course, all this requires a disclaimer: these messages are internal messages. There are no promises that this message will work in the next bugfix release. (Unlike the object-creation messages, which are engraved in the Pd patch format, these messages only ever occur within a running Pd and are therefore indeed considered private and can change if the need ever arises.)

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily activated computations, which would perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
    splittedRDD = arg.map { split the string };
    assRDD = splittedRDD.map { associate to a table };
    flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail };
    externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
    enrichRDD = flaggedRDD.map { enrich with external data };
    externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
The final 2 methods I found were these:
In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformation, the other gets the .foreachRDD{ outside action } (a rough sketch follows below). There is no major downside to this, as it is just one more RDD on a worker node.
Extracting the {outside action} from the transformation and keeping track of the already-sent mails: filter out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.
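A rough PySpark-style sketch of what solution 1's branching could look like; every helper name here (split_line, associate_to_table, add_boolean_flag, send_mail_for, save_static_method, enrich_with_external_data) is a placeholder for the question's own logic:

# Sketch only: `lines` is the existing DStream of message values.
flagged = (lines
           .map(split_line)            # split the string
           .map(associate_to_table)    # associate to a table
           .map(add_boolean_flag))     # add the boolean parameter

# One copy of the stream carries the outside action (sending the mail) ...
flagged.foreachRDD(lambda rdd: send_mail_for(rdd.collect()))

# ... while the other goes on with the saves and the enrichment.
flagged.foreachRDD(lambda rdd: save_static_method(rdd.collect(), "first_dir"))
enriched = flagged.map(enrich_with_external_data)
enriched.foreachRDD(lambda rdd: save_static_method(rdd.collect(), "second_dir"))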

Intersystems Cache - Maintaining Object Code to ensure Data is Compliant with Object Definition

I am new to using InterSystems Caché and face an issue where I am querying data stored in Caché, exposed by classes which do not seem to accurately represent the data in the underlying system. The data stored in the globals is almost always larger than what is defined in the object code.
As such I get errors like the one below very frequently.
Msg 7347, Level 16, State 1, Line 2
OLE DB provider 'MSDASQL' for linked server 'cache' returned data that does not match expected data length for column '[cache]..[namespace].[tablename].columname'. The (maximum) expected data length is 5, while the returned data length is 6.
Does anyone have any experience with implementing some type of quality process to ensure that the object definitions (SQL mappings) are maintained in such a way that they can accommodate the data which is being persisted in the globals?
Property columname As %String(MAXLEN = 5, TRUNCATE = 1) [ Required, SqlColumnNumber = 2, SqlFieldName = columname ];
In this particular example the system has the column defined with a max len of 5, however the data stored in the system is 6 characters long.
How can I proactively monitor and repair such situations?
/*
I did not create these object definitions in cache
*/
It's not completely clear what "monitor and repair" would mean for you, but:
How much control do you have over the database side? Caché runs code for a data type when converting from a global to ODBC, using the LogicalToODBC method of the data-type class. If you change the property types from %String to your own class, AppropriatelyNamedString, then you can override that method to automatically truncate, if that's what you want to do. It is possible to change all the %String property types programmatically using the %Library.CompiledClass class.
It is also possible to run code within Cache to find records with properties that are above the (somewhat theoretical) maximum length. This obviously would require full table scans. It is even possible to expose that code as a stored procedure.
Again, I don't know what exactly you are trying to do, but those are some options. They probably do require getting deeper into the Cache side than you would prefer.
As far as preventing the bad data in the first place, there is no general answer. Cache allows programmers to directly write to the globals, bypassing any object or table definitions. If that is happening, the code doing so must be fixed directly.
Edit: Here is code that might work in detecting bad data. It might not work if you are doing certain funny stuff, but it worked for me. It's kind of ugly because I didn't want to break it up into methods or tags. This is meant to run from a command prompt, so it would probably have to be modified for your purposes.
{
    S ClassQuery=##CLASS(%ResultSet).%New("%Dictionary.ClassDefinition:SubclassOf")
    I 'ClassQuery.Execute("%Library.Persistent") b q
    While ClassQuery.Next(.sc) {
        If $$$ISERR(sc) b Quit
        S ClassName=ClassQuery.Data("Name")
        I $E(ClassName)="%" continue
        S OneClassQuery=##CLASS(%ResultSet).%New(ClassName_":Extent")
        I '$IsObject(OneClassQuery) continue //may not exist
        try {
            I 'OneClassQuery.Execute() D OneClassQuery.Close() continue
        }
        catch
        {
            D OneClassQuery.Close()
            continue
        }
        S PropertyQuery=##CLASS(%ResultSet).%New("%Dictionary.PropertyDefinition:Summary")
        K Properties
        s sc=PropertyQuery.Execute(ClassName) I 'sc D PropertyQuery.Close() continue
        While PropertyQuery.Next()
        {
            s PropertyName=$G(PropertyQuery.Data("Name"))
            S PropertyDefinition=""
            S PropertyDefinition=##CLASS(%Dictionary.PropertyDefinition).%OpenId(ClassName_"||"_PropertyName)
            I '$IsObject(PropertyDefinition) continue
            I PropertyDefinition.Private continue
            I PropertyDefinition.SqlFieldName=""
            {
                S Properties(PropertyName)=PropertyName
            }
            else
            {
                I PropertyName'="" S Properties(PropertyDefinition.SqlFieldName)=PropertyName
            }
        }
        D PropertyQuery.Close()
        I '$D(Properties) continue
        While OneClassQuery.Next(.sc2) {
            B:'sc2
            S ID=OneClassQuery.Data("ID")
            Set OneRowQuery=##class(%ResultSet).%New("%DynamicQuery:SQL")
            S sc=OneRowQuery.Prepare("Select * FROM "_ClassName_" WHERE ID=?") continue:'sc
            S sc=OneRowQuery.Execute(ID) continue:'sc
            I 'OneRowQuery.Next() D OneRowQuery.Close() continue
            S PropertyName=""
            F S PropertyName=$O(Properties(PropertyName)) Q:PropertyName="" d
            . S PropertyValue=$G(OneRowQuery.Data(PropertyName))
            . I PropertyValue'="" D
            .. S PropertyIsValid=$ZOBJClassMETHOD(ClassName,Properties(PropertyName)_"IsValid",PropertyValue)
            .. I 'PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has invalid value of "_PropertyValue
            .. //I PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has VALID value of "_PropertyValue
            D OneRowQuery.Close()
        }
        D OneClassQuery.Close()
    }
    D ClassQuery.Close()
}
The simplest solution is to increase the MAXLEN parameter to 6 or larger. Caché only enforces MAXLEN and TRUNCATE when saving. Within other Caché code this is usually fine, but unfortunately ODBC clients tend to expect this to be enforced more strictly. The other option is to write your SQL like SELECT LEFT(columnname, 5)...
The simplest solution, which I use for all Integration Services packages for example, is to create a query that casts all nvarchar or char data to the correct length. In this way, my data never fails due to truncation.
Optional:
First run a query like:
SELECT MAX(DATALENGTH(mycolumnname)) FROM cachenamespace.tablename
Your new query:
SELECT CAST(mycolumnname AS VARCHAR(6)) AS mycolumnname,
       CONVERT(VARCHAR(8000), memo_field) AS memo_field
FROM cachenamespace.tablename
Your pain of getting the data will be lessened but not eliminated.
If you use any type of OLE DB provider, or if you use an OPENQUERY in SQL Server, the casts must occur in the query sent to the InterSystems Caché db, not in the outer query that retrieves data from the inner OPENQUERY.