Apache Beam: PTransform vs PValue - apache-beam

Given a PTransform<PCollection<X>, PCollection<Y>> for arbitrary types X and Y, what exactly is the transform and what exactly is the PValue in this example? Does a PValue define the last vertex in a graph?

PValue is a common base class for the various things that can be inputs and outputs of a PTransform. PCollection is the most common example. Other examples are the trivial PBegin and PDone, and PCollectionTuple (a transform can return multiple PCollections, as ParDo.withOutputTags does). It is also possible to define custom PValues, though that is very rarely needed unless you're a library author; e.g. see here.

Related

Division by zero depending on parameter

I am using the FixedRotation component and get a division by zero error. This happens in a translated expression of the form
var = nominator/fixedRotation.R_rel_inv.T[1,3]
because T[1,3] is 0 for the chosen parameters:
n={0,1,0}
angle=180 deg.
It seems that OpenModelica keeps the variable symbolic and tries to stay generic, but in this case that leads to a division by zero because it chooses to put T[1,3] in the denominator.
How can I tell the compiler to treat the evaluated value of T[1,3] as if it were hard coded at compile time? Inside fixedRotation, R_rel is not defined with Evaluate=true...
Should I use a custom version of this block? (When I copy and paste the source code into a new model and set the parameters R_rel and R_rel_inv to Evaluate=true, the simulation works without division by zero)...
BUT is there a modifier to tell from outside that a parameter shall be Evaluate=true without the need to make a new model?
Any other way to prevent division by zero?
Try propagating the parameter at a higher level and setting annotation(Evaluate=true) on this.
For example:
model A
parameter Real a=1;
end A;
model B
parameter Real aPropagated = 2 annotation(Evaluate=true);
A Ainstance(a=aPropagated);
end B;
I don't understand how the Evaluate annotation should help here. The denominator is obviously zero, and that is what actually needs to be handled.
To solve division by zero, there are various possibilities (e.g. set a particular value for that case or add a small offset to the denominator; you can find examples in the Modelica Standard Library). You can also consider the physical meaning of the equation and handle it accordingly.
Since the denominator depends on a parameter, you can also add an assert() to warn the user about a wrong parameter value.
Btw. R_rel_inv is protected and thus shall not be used. Use R_rel instead. Also, when dealing with rotation matrices, using the functions from Modelica.Mechanics.MultiBody.Frames is preferable.
And: whether to use a custom version or your own implementation depends on your preferences. The custom version is maintained by the community; your own version is in your hands.
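To see why the denominator vanishes for these parameters, here is a small Python sketch (not Modelica). It assumes the usual axis-angle formula T = e*e' + (I - e*e')*cos(angle) - skew(e)*sin(angle), which is my understanding of what Frames.planarRotation computes, and it takes the inverse rotation to be the transpose. The names planar_rotation and safe_divide are hypothetical helpers for illustration only. For n={0,1,0} the element T[1,3] reduces to -sin(angle), so it is zero (up to rounding) at 180 deg.

```python
import math

def planar_rotation(e, angle):
    """Rotation matrix for a rotation by `angle` (rad) about the unit axis `e`.

    Assumed formula: T = e*e' + (I - e*e')*cos(angle) - skew(e)*sin(angle).
    """
    c, s = math.cos(angle), math.sin(angle)
    # Skew-symmetric (cross-product) matrix of e.
    skew = [[0.0, -e[2], e[1]],
            [e[2], 0.0, -e[0]],
            [-e[1], e[0], 0.0]]
    return [[e[i] * e[j]
             + ((1.0 if i == j else 0.0) - e[i] * e[j]) * c
             - skew[i][j] * s
             for j in range(3)]
            for i in range(3)]

def safe_divide(numerator, denominator, eps=1e-9):
    """Divide, but fail loudly when the denominator is (numerically) zero.

    Plays the role of an assert() on the parameter-dependent denominator.
    """
    if abs(denominator) < eps:
        raise ValueError("denominator is zero for the chosen parameters")
    return numerator / denominator

# Parameters from the question: n = {0, 1, 0}, angle = 180 deg.
T = planar_rotation([0.0, 1.0, 0.0], math.radians(180.0))

# Modelica indices are 1-based, so T[1,3] is T[0][2] here.
t13 = T[0][2]        # equals -sin(angle): zero up to floating-point rounding
t13_inv = T[2][0]    # T[1,3] of the transposed (inverse) matrix, +sin(angle)

print("T[1,3]     =", t13)
print("T_inv[1,3] =", t13_inv)
```

Calling safe_divide(nominator, t13_inv) with these parameters raises an error instead of silently dividing by a value that is effectively zero, which is the same idea as guarding the equation with assert() in the model.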

Parameterized Modules (SystemVerilog)

I have read the book "Digital Design and Computer Architecture" by David Harris and I have a question about the SystemVerilog examples in it. After the book introduces the parameterized construct, #(parameter ...), it is used in almost every example.
For example, the "subtractor" module from this book:
module subtractor #(parameter N = 8)
(input logic [N - 1:0] a,b,
output logic [N - 1:0] y);
assign y = a - b;
endmodule
What's the reason for using N in this code?
Can't we just write the following?
input logic [7:0] a,b,
output logic [7:0] y);
Moreover, such parameters are used in almost every later example in the book, but I see no reason for them. We can set the number of bits directly in the square brackets without any additional parameters.
So, what is the reason for coding it this way?
The use of parameters serves a number of purposes.
It is always a better programming practice to use a symbolic name associated with a value than using a literal value directly. DATA_WIDTH instead of N would have been a more appropriate example. This documents the meaning of the value.
When a change to that value is needed, you have a single place to make that change, and less chance that you'll miss a change, or change an unintended value.
The use of parameters allows you to re-use the same code in many different places by creating a template and then overriding the parameter values as needed.
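The re-use point can be illustrated with a rough software analogy in Python (not SystemVerilog). The factory make_subtractor below is a hypothetical stand-in for the parameterized module: the default n=8 plays the role of parameter N = 8, and passing another width plays the role of overriding the parameter at instantiation, e.g. subtractor #(.N(16)).

```python
def make_subtractor(n=8):
    """Return a function that models an n-bit subtractor.

    The result wraps modulo 2**n, like an n-bit `assign y = a - b;`.
    """
    mask = (1 << n) - 1  # n ones, e.g. 0xFF for n = 8

    def subtract(a, b):
        return (a - b) & mask

    return subtract

# One definition, several widths: only the parameter changes.
sub8 = make_subtractor()      # default width, like N = 8
sub16 = make_subtractor(16)   # overridden width, like N = 16

print(sub8(3, 5))    # wraps within 8 bits
print(sub16(3, 5))   # wraps within 16 bits
```

With a hard-coded [7:0], the 16-bit version would need a second, nearly identical copy of the module.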

Current version of the modelica translator can only handle array of components with fixed size

I created a part with the AC library, and when I tried to simulate the model I got an error that says "Current version of the modelica translator can only handle array of components with fixed size".
I'm not sure what it means. Has anyone else had the same issue?
Thank you
Consider the following simple model:
model M
parameter Integer n(start=3, fixed=false);
initial algorithm
n := n;
end M;
It has a parameter n which can be changed before simulation starts. And array dimensions need to be parameter expressions. So you would think that the following model would be legal:
model M2
Real arr[n] = fill(1, n);
parameter Integer n(start=3, fixed=false);
initial algorithm
n := n;
end M2;
But it isn't since Modelica tools will expand the number of equations and variables to get a fixed number. (According to the language specification, n is a structural parameter; it is not well defined what restrictions these have - most Modelica tools seem to require them to behave like constants which means only fixed=true parameters with a binding equation that depends only on other structural parameters or constants).

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
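As a quick check of that relationship, here is a short Python sketch using the two LocalLDAModel numbers from the question. It assumes both values are in natural-log units; the token count is not given in the question, so it is only inferred here from the ratio of the two numbers.

```python
import math

# Values reported by the local model in the question.
log_likelihood = -3672913.268234148    # localModel.logLikelihood(docsWithFeatures)
log_perplexity = 0.36729132682898674   # localModel.logPerplexity(docsWithFeatures)

# log-perplexity = -log-likelihood / token count, so the token count
# implied by the two reported numbers is their ratio.
implied_tokens = -log_likelihood / log_perplexity

# Plain perplexity is the exponential of the log-perplexity.
perplexity = math.exp(log_perplexity)

# Dividing by ln(2) converts nats per token into bits per token.
bits_per_token = log_perplexity / math.log(2.0)

print("implied token count:", round(implied_tokens))
print("perplexity:", perplexity)
print("bits per token:", bits_per_token)
```

The implied token count comes out at about ten million, which is consistent with the two local-model results being the same quantity on different scales.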

Use linkage with custom distance

I would like to use the linkage function in MATLAB with a custom distance.
My distance function is in the form:
Distance = pdist(matrix,@mydistance);
so given a
matrix = rand(132,18)
Distance will be a vector [1x8646];
D_matrix = squareform(Distance,'tomatrix');
is a 132x132 matrix containing all the pairwise distances between the rows of matrix
How can I embed mydistance in linkage?
You can use a call to linkage like this:
Z = linkage(Data,'single','@mydistance')
where 'single' can also be any of the other cluster merge methods as described here: http://www.mathworks.com/help/stats/linkage.html.
In other words, just put your function handle in a string and pass it as the 3rd argument to linkage. You cannot use the 'savememory' option of linkage while using a custom distance function, however. This is causing me some frustration with my 300,000 x 6 dataset. I think the solution will be to project it to some space where Euclidean distance is defined and meaningful, but we'll see how that goes.
Besides using
tree = linkage(Data,'single','@mydistance')
like Imperssonator suggests, you can also use
dissimilarity = pdist(Data,@mydistance);
tree = linkage(dissimilarity,'single');
The latter has the benefit of allowing Data to be an object array with @mydistance using objects as arguments.
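To make the sizes in the question concrete, here is a plain Python sketch (not MATLAB) of what the pdist / squareform step produces. The mydistance function is a hypothetical stand-in (city-block distance), and condensed_distances and to_square are illustrative helper names, not library functions. For 132 rows there are 132*131/2 = 8646 pairs, which is the length of the distance vector that then gets passed to linkage.

```python
import random
from itertools import combinations

def mydistance(u, v):
    """Hypothetical custom metric: city-block (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))

def condensed_distances(rows, metric):
    """Pairwise distances in the order (0,1), (0,2), ..., (n-2,n-1)."""
    return [metric(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]

def to_square(condensed, n):
    """Expand the condensed vector into a symmetric n-by-n matrix."""
    square = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(combinations(range(n), 2), condensed):
        square[i][j] = square[j][i] = d
    return square

# Same shape as matrix = rand(132,18) in the question.
random.seed(0)
matrix = [[random.random() for _ in range(18)] for _ in range(132)]

distance = condensed_distances(matrix, mydistance)
d_matrix = to_square(distance, len(matrix))

print(len(distance))                    # number of row pairs
print(len(d_matrix), len(d_matrix[0]))  # square matrix dimensions
```

Because the clustering only needs the condensed vector, the rows themselves can be any objects the custom metric understands, which is the benefit of the two-step pdist-then-linkage form.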