What does expand method signify in beam? - apache-beam

Is expand like a template that expands to a directed acyclic graph? DoFn seems like something that runs on an individual task or process.

It basically allows for refactoring and re-using code throughout your pipeline.
You can combine several pipeline steps (including DoFn if you wish so) in a PTransform (see official WordCount Example from beam and here, where they actually use expand) which simplifies unit testing, naming and makes for an overall cleaner DAG. The latter can be seen in the following screenshot, which I stole from the official Dataflow documentation page here (<- in this link they also show the corresponding Python and Java Code)

Related

CDK constructs: passing values to constructors vs using the context values

I wrote a small cdk construct that parses the logs in a cloudwatch log group via a lambda and sends a mail when a pattern is matched. This allows a developer to be notified via an sns topic, should an error appear in the logs.
The construct needs to know which log group to monitor, and which pattern to look for. These are currently passed in as parameters to its constructor. The user of my small construct library is supposed to use this construct as part of his stack. However, one could also define them as parameters, or better yet given what the docs say values in a context - basically using this construct in a standalone app.
Would this be an appropriate use of the context? What else it is useful for?
It's hard to say a definitive answer, but I would recommend always passing in properties to a construct explicitly on the constructor.
A) This creates consistency with the rest of the constructs.
B) Nothing is 'hidden' in your construct definition.
The only thing I've generally found context useful for is passing in parameters from the CLI, but even that is pretty rare and there are often better ways to do it.

How to use stubsPerConsumer with restdocs

How do I use the stubsPerConsumer feature when creating a stub from a producer with restdocs?
If this is not supported, is it possible to generate the asciidoc snippets from the groovy DSL contract?
Update
It looks like baseClassMappings is not supported when using spring-cloud-contract with restdocs. Has anyone found a clever way to get this to work using the assembly-plugin (that doesn't require a lot of manual setup for each consumer)?
Currently, it's not supported on the producer side with rest docs out of the box. We treat rest docs as a way to do the producer contract approach. Theoretically what you could do is to create different output snippet directories. Instead of for example target/snippets you could do target/snippets/myconsumer. Then with the assembly plugin you would just pick the target/snippets. At least that's how theory would work.
As for the contracts and adocs you can check out this: https://github.com/spring-cloud-samples/spring-cloud-contract-samples/blob/master/beer_contracts/src/test/java/docs/GenerateAdocsFromContractsTests.java . It's a poor man's version of going through all of the contracts and generation of adoc documentation from them.

Text Preprocessing in Spark-Scala

I want to apply preprocessing phase on a large amount of text data in Spark-Scala such as Lemmatization - Remove Stop Words(using Tf-Idf) - POS tagging , there is any way to implement them in Spark - Scala ?
for example here is one sample of my data:
The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.
after preprocessing:
perfect fit iPod photo great sound great price use everywhere very useful
and they have POS tags e.g (iPod,NN) (photo,NN)
there is a POS tagging (sister.arizona) is it applicable in Spark?
Anything is possible. The question is what YOUR preferred way of doing this would be.
For example, do you have a stop word dictionary that works for you (it could just simply be a Set), or would you want to run TF-IDF to automatically pick the stop words (note that this would require some supervision, such as picking the threshold at which the word would be considered a stop word). You can provide the dictionary, and Spark's MLLib already comes with TF-IDF.
The POS tags step is tricky. Most NLP libraries on the JVM (e.g. Stanford CoreNLP) don't implement java.io.Serializable, but you can perform the map step using them, e.g.
myRdd.map(functionToEmitPOSTags)
On the other hand, don't emit an RDD that contains non-serializable classes from that NLP library, since steps such as collect(), saveAsNewAPIHadoopFile, etc. will fail. Also to reduce headaches with serialization, use Kryo instead of the default Java serialization. There are numerous posts about this issue if you google around, but see here and here.
Once you figure out the serialization issues, you need to figure out which NLP library to use to generate the POS tags. There are plenty of those, e.g. Stanford CoreNLP, LingPipe and Mallet for Java, Epic for Scala, etc. Note that you can of course use the Java NLP libraries with Scala, including with wrappers such as the University of Arizona's Sista wrapper around Stanford CoreNLP, etc.
Also, why didn't your example lower-case the processed text? That's pretty much the first thing I would do. If you have special cases such as iPod, you could apply the lower-casing except in those cases. In general, though, I would lower-case everything. If you're removing punctuation, you should probably first split the text into sentences (split on the period using regex, etc.). If you're removing punctuation in general, that can of course be done using regex.
How deeply do you want to stem? For example, the Porter stemmer (there are implementations in every NLP library) stems so deeply that "universe" and "university" become the same resulting stem. Do you really want that? There are less aggressive stemmers out there, depending on your use case. Also, why use stemming if you can use lemmatization, i.e. splitting the word into the grammatical prefix, root and suffix (e.g. walked = walk (root) + ed (suffix)). The roots would then give you better results than stems in most cases. Most NLP libraries that I mentioned above do that.
Also, what's your distinction between a stop word and a non-useful word? For example, you removed the pronoun in the subject form "I" and the possessive form "my," but not the object form "me." I recommend picking up an NLP textbook like "Speech and Language Processing" by Jurafsky and Martin (for the ambitious), or just reading the one of the engineering-centered books about NLP tools such as LingPipe for Java, NLTK for Python, etc., to get a good overview of the terminology, the steps in an NLP pipeline, etc.
There is no built-in NLP capability in Apache Spark. You would have to implement it for yourself, perhaps based on a non-distributed NLP library, as described in marekinfo's excellent answer.
I would suggest you to take a look in spark's ml pipeline. You may not get everything out of the box yet, but you can build your capabililties and use pipeline as a framework..

BDD in Scala - Does it have to be ugly?

I've used lettuce for python in the past. It is a simple BDD framework where specs are written in an external plain text file. Implementation uses regex to identify each step, proving reusable code for each sentence in the specification.
Using scala, either with specs2 or scalatest I'm being forced to write the the specification alongside the implementation, making it impossible to reuse the implementation in another test (sure, we could implement it in a function somewhere) and making it impossible to separate the test implementation from the specification itself (something that I used to do, providing acceptance tests to clients for validation).
Concluding, I raise my question: Considering the importance of validating tests by clients, is there a way in BDD frameworks for scala to load the tests from an external file, raising an exception if a sentence in the test is not implemented yet and executing the test normally if all sentences have been implemented?
I've just discovered a cucumber plugin for sbt. Tests would be implemented under test/scala and specifications would be kept in test/resources as plain txt files. I'm just not sure on how reliable the library is and if it will have support in the future.
Edit:
The above is a wrapper for the following plugin wich solves perfectly the problem and supports Scala.
https://github.com/cucumber/cucumber-jvm
This is all about trade-offs. The cucumber-style of specifications is great because it is pure text, that easily editable and readable by non-coders.
However they are also pretty rigid as specifications because they impose a strict format based on features and Given-When-Then. In specs2 for example we can write any text we want and annotate only the lines which are meant to be actions on the system or verification. The drawback is that the text becomes annotated and that pending must be explicitly specified to indicate what hasn't been implemented yet. Also the annotation is just a reference to some code, living somewhere, and you can of course use the usual programming techniques to get reusability.
BTW, the link above is an interesting example of trade-off: in this file, the first spec is "uglier" but there are more compile-time checks that the When step uses the information from a Given step or that we don't have a sequence of Then -> When steps. The second specification is nicer but also more error-prone.
Then there is the issue of maintaining the regular expressions. If there is a strict separation between the people writing the features and the people implementing them, then it's very easy to break the implementation even if nothing substantial changes.
Finally, there is the question of version control. Who owns the document? How can we be sure that the code is in sync with the spec? Who refactors the specification when required?
There is no, by far, perfect solution. My own conclusion is that BDD artifacts should be in the hand of developers and verified by the other stakeholders, reading the code directly if it's readable or reading an html/pdf output. And if the BDD artifacts are owned by developers they might as well use their own tools to make their life easier with verification (using a compiler when possible) and maintenance (using automated refactorings).
You said yourself that it is easy to make the implementation reusable by the normal methods Scala provides for this kind of stuf (methods, functions, traits, classes, types ...), so there isn't really a problem there.
If you want to give a version without code to your customer, you can still give them the code files, and if they can't ignore a little syntax, you probably could write a custom reporter writing all the text out to a file, maybe even formatted with as html or something.
Another option would be to use JBehave or any other JVM based framework, they should work with Scala without a problem.
Eric's main design criteria was sustainability of executable specification development (through refactoring) and not initial convenience due to "beauty" of simple text.
see http://etorreborre.github.io/specs2/
The features of specs2 are:
Concurrent execution of examples by default
ScalaCheck properties
Mocks with Mockito
Data tables
AutoExamples, where the source code is extracted to describe the example
A rich library of matchers
Easy to create and compose
Usable with must and should
Returning "functional" results or throwing exceptions
Reusable outside of specs2 (in JUnit tests for example)
Forms for writing Fitnesse-like specifications (with Markdown markup)
Html reporting to create documentation for acceptance tests, to create a User Guide
Snippets for documenting APIs with always up-to-date code
Integration with sbt and JUnit tools (maven, IDEs,...)
Specs2 is quite impressive in both design and implementation.
If you look closely you will see the DSL can be extended while you keep the typesafe-ty and strong command of domain code under development.
He who leaves aside the "is more ugly" argument and tries this seriously will find power.
Checkout the structured forms and snippets

What exactly is a source filter?

Whenever I see the term source filter I am left wondering as to what it refers to.
Aside from a formal definition, I think an example would also be helpful to drive the message home.
A source filter is a module that modifies some other code before it is evaluated. Therefore the code that is executed is not what the programmer sees when it is written. You can read more about source filters (in the Perl context) at perldoc perlfilter. Some examples are Smart::Comments which allows the programmer to leave debugging commands in comments in the code and employ them only if desired, another is PDL::NiceSlice which is sugar for slicing PDL objects.
Edit:
For more information on usage (should you wish to brave the beast), read the documentation for Filter::Simple which is a typical way to create filters.
Alternatively there is a new and different way to muck about with the source: Devel::Declare lets you interact with Perl's own parser, letting you do many of the same type of thing as a source filter, but without the source filter. This can be "safer" in some respect, yet it has a more limited scope.
A source filter is a form of module which affects the way in which a file use-ing it will be parsed. They are commonly used to simulate syntactical features which Perl does not have natively -- for instance, the Switch source filter was often used to simulate switch statements before Perl's given { } construction was available.
Source filters work by taking the text of the module as input, performing some processing on it, and outputting the filtered source code. For a simple example of how a source filter is implemented, as well as more details, see the perldoc page for perlfilter.
They are pre-processors. They change the source code before it reaches the Perl compiler. You can do scary things with them, in effect implementing your own language, with all the effects this has on readability (for others), robustness (writing parsers is hard) and maintainability (debugging gets tricky when your idea of what the source code is differs from what compiler and runtime think it is).