Sample rate issue - trace

If the service has a lot of traffic, tracing each invocation and saving the corresponding link log costs a lot, but once the sample rate is set, some trace with errors may be missed. If I want to keep some rate collection, and the trace with errors must be traced, is this possible in spring-cloud-sleuth? If not, is there a solution?

Spring Cloud Sleuth pushes the sampling decision down to the tracer implementation, you need to create either a SamplerFunction or a Sampler to do this, please see the docs:
Sleuth: Sampling
Sleuth: Brave Sampling
Brave: Sampling
Update:
Sampling is an early decision because tracing has some overhead and early sampling can save your app from that.
If you want to filter out your spans at the end, you can setup an always sampler and a SpanHandler where you can access the Span (error, tags, etc.) and you can also keep/drop them based on this (you can inject a sampler to your handler and apply its logic to the rest of your spans).

Related

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID PCollection<KV<ID,Measurement>> and something like a changelog stream of additional information for that ID PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information on the other hand is only updated when a user performs manual re-configuration. We can't tell often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be quite easily implemented with a KStream-KTable-Join. With Beam, however, all my approaches so far seem not to work. I already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updating irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one updated information for each ID.
Idea 2: CoGroupByKey with global window and as non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream as otherwise my panes would quickly become too large. This simply does not work. When I configure my pipeline that way, also the additional information stream discards changes. Setting both trigger to accumulating it works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but also this solution is not really scalable - at least if I don't miss something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can than be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that perform the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the amount of IDs and the size of additional info. Thus, using a side input seems also not to work here.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.

Getting real time statistics in Omnet++

In:
https://docs.omnetpp.org/tutorials/tictoc/part5/
and
https://doc.omnetpp.org/omnetpp/manual/#sec:simple-modules:declaring-statistics
it's shown how network statistics can be processed after a simulation.
Is it possible to get network parameters dynamically?
TL;DR: Use signals (not statistics) and hook up your own simple module on these signals and compute the required statistics in that module.
You cannot access the value of #statistics in your code, and there is a reason for this as this would be an anti pattern. NED based statistics were introduced as a method to add calculations and measurements to your model without modifying your models behavior and code. This means that statistics are NOT considered part of a model, but rather they are considered as a configuration. Changing a statistics (i.e. deciding that you want to measure something else) should never change the behavior of your model. That's why the actual value of a given statistic is not exposed (easily) to the C++ code. You could dig them out, but it is highly discouraged.
Now, this does not mean that what you want to achieve is not legitimate but the actual statistics gathering must be an integral part of your model. I.e. you should not aim for using built-in statistics, but rather create an explicit statistics gathering submodule that should hook up on the necessary signals (https://doc.omnetpp.org/omnetpp/manual/#sec:simple-modules:subscribing-to-signals) and do the actual statistics computation you need in its C++ code. After that, other modules are free to access this information and make decisions based on that.

Chaos engineering best practice

I studied the principles of chaos, and looks for some opensource project, such as chaosblade which is open sourced by Alibaba, and mangle, by vmware.
These tools are both fault injection tools, and do nothing to analysis on the tested system.
According to the principles of chaos, we should
1.Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2.Hypothesize that this steady state will continue in both the control group and the experimental group.
3.Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4.Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
so how we do step 4? Should we use monitoring system to monitor some major metrics, to check the status of the system after fault injection.
Is there any good suggestions or best practice?
so how we do step 4? Should we use monitoring system to monitor some major metrics, to check the status of the system after fault injection.
As always the answer is it depends.... It depends how do you want to measure your hypothesis, it depends on the hypothesis itself and it depends on the system. But normally it makes totally sense to introduce metrics to improve/increase the observability.
If your hypothesis is like Our service can process 120 requests in a second, even if one node fails. Then you could do it via metrics to measure that yes, but you could also measure it via the requests you send and receive the responses back. It is up to you.
But if your Hypothesis is I get a response for an request which was send before a node goes down. Then it makes more sense to verify this directly with the requests and response.
At our project we use for example chaostoolkit, which lets you specify the hypothesis in json or yaml and related action to prove it.
So you can say I have a steady state X and if I do Y, then the steady state X should be still valid. The toolkit is also able to verify metrics if you want to.
The Principles of Chaos are a bit above the actual testing, they reflect the philosophy of designed vs actual system and system under injection vs baseline, but are a bit too abstract to apply in everyday testing, they are a way of reasoning, not a work process methodology.
I'm think the control group vs experiment wording is one especially doubtful part - you stage a test (injection) in a controlled environment and try to catch if there is a user-facing incident, SLA breach of any kind or a degradation. I do not see where there is a control group out there if you test on a stand or dedicated environment.
We use a very linear variety of chaos methodology which is:
find failure points in the system (based on architecture, critical user scenarios and history of incidents)
design choas test scenarios (may be a single attack or more elaborate sequence)
run tests, register results and reuse green for new releases
start tasks to fix red tests, verify the solutions when they are available
One may say we are actually using the Principles of Choas in 1 and 2, but we tend to think of choas testing as quite linear and simple process.
Mangle 3.0 released with an option for analysis using resiliency score. Detailed documentation available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

how to track errors in FPGA/ASIC development using post place'n' route and/or post synthesis simulation?

I am a bit confused on the usefulness of post PnR and/or post synthesis simulations for FPGA/ASIC development. If the synthesis or PnR process complete successfully in the design flow, is there any chance that the respective 'post' simulation will reveal errors in the design? Could someone give an example?
In typical design flow post Synthesis and/or post PnR simulations are not useful and the aim should be to avoid them.
Post synthesis simulation can only unearth bugs in the, well, synthesis tool which are extremely rare in established FPGA tools. Checking these should not be an integral part of any design flow.
Albeit there are some very rare cases where the PnR tools might make e.g. technology mapping error or fail to give a warning from design rule violation, at minimum 99% of the cases that reveal problems in Post PnR simulation are due to design error, most typically clock domain crossing, memory access race conditions a good, but already very rare, second.
Therefore, the emphasis should be in adhering to the design rules and having rigorous design methodology to avoid the problems rather than trying to catch them in the post PnR simulation.
To your question - if there is no negative slack and the design rule check is ok, there should not be anything more that either of the post simulations can reveal.
One practical use for post PnR simulation is when you have complex design that is failing occasionally due to timing variation of an external component or mistake in I/O constraints, but you don't have a clue about the error mechanism. Combination of integrated logic analyzer and post PnR simulation can help the trickiest of situations to find out the root cause.
Post-PnR simulations don't only verify the functionality, but also the timing. The timing information of the circuit can be dumped to the simulation in several formats, however the most popular one is Standard Delay Format (SDF), which is published as IEEE 1497.
What kind of errors can we catch then?
It is hard to catch some unwanted glitches in RTL simulations. If some outputs are generated by a combinational logic, post-PnR simulations are more important than ever.
There may be some mistakes in the synthesis and/or PnR constraints. It is always better to double check everything.
Synthesis/PnR tools may have bugs. Logic Equivalence Checking (LEC) can also catch bugs, but it performs for functionality only.
Post PnR simulations are what is called as Gate Level Simulation in industry. This is of two types timing and non timing. This kind of simulation is used to detect
Timing paths, not checked by STA or timing closure.
Bugs in power and reset operation as HFNS (High Fanout Net synthesis) and CTS (Clock tree synthesis) may have caused irregularities in the reset of some resettable flops causing them to deliver x to the next logic in the path causing an x-propagation.
Bugs in DFT logic which was not checked during RTL simulation and might have been removed during PnR.
x on logic path due to relaibility issues for clock domain cross paths skipped by STA
Mostly stable process in terms of translating the logic from mapped to PAR. But, of course, if pedantic, you could use a LEC for both syn->map and map-> PAR.
Post PAR Sims could be useful though if you have issues in the lab, maybe because you didn't fully constrain your design for timing, and need to simulate with the back-annotated SDF, as someone else mentioned above. This, of course, does not help you though, wrt to IO,h if you haven't created models with timing in your TB and/or constrained your IO properly as provided to you by the Board designer.
I think it is best-practice to run regression suite at least once against the PAR netlist with back-annotated SDF. It costs you nothing, and provides one more confidence data point.

With Bluemix Retrieve&Rank, How do we implement a system to continuously learn?

With reference to the Web page below, using the Retrieve & Rank service of IBM Bluemix, we are creating a bot that can respond to inquiries.
Question:
After learning the ranker once, based on the user's response to the inquiry, how can we construct a mechanism to continuously learn and improve response accuracy?
Assumption:
Because there was no API of R&R service to continuously learn from the inquiry response result of the user, tuning the GroundTruth file,
I suppose that it is necessary to periodically perform such a process as training the ranker again.
Tuning contents of assumed GT file:
If there is a new question, add a set of questions and answers
Increase or decrease the relevance score of responses if there is something that could not be answered well by existing question
(If bot answered incorrectly, lower the score, if there is a useful answer, raise the score)
In order to continuously learn, you will want to do the following:
capture new examples i.e. each user input and corresponding result
review those examples and create new ranker examples, adjust relevance scores, etc
add those new examples to the ranker
retrain the ranker using your new and existing examples
NOTE: Be sure to validate that new updates to the ranker data improves the overall system performance. k-fold validation is a great way to measure this.
All in all, learning is a continuous process that should be repeated indefinitely or until system performance is deemed suffcient.