Create data as a load in JMeter - Kubernetes

I'm working on Kubernetes on Microsoft Azure with real data. Now, I need to generate a sample of data in JMeter and then use it as a workload to stress the CPU of the Tea-Store microservices on Kubernetes. Any hint or source about how to do that, and which types of files work with JMeter?

If you want a specific answer you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config allows reading CSV files into JMeter Variables so each virtual user on each iteration reads the next line from the CSV file
__CSVRead() function does more or less the same, however it can be declared/used at runtime so you can have a dynamic filename/path and you decide when to proceed to the next column/row
JDBC Request sampler allows reading test data from a database or creating test data in a database
__StringFromFile() function reads the next line from a file each time it is called
__FileToString() function reads a whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - the number of the current thread
__time() and __timeShift() - the current timestamp in various formats, plus the possibility to generate dates in the future or past
__Random() - generates a random number
__RandomString() - generates a random string out of the provided characters
__UUID() - generates a unique GUID-like structure
__groovy() - for everything else; it executes arbitrary Groovy code and returns the result (a combined sketch follows below)
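For illustration only, here is a minimal sketch of how a few of these options could be combined in a JSR223 Sampler (Groovy accepts plain Java syntax); variable names like orderId are made up for the example, and in request fields you can usually reference the functions directly, e.g. ${__UUID()} or ${__RandomString(10,abcdefgh)}, without any scripting at all:

// JSR223 Sampler body - "vars" is JMeter's JMeterVariables object, "ctx" is the JMeterContext
String id = java.util.UUID.randomUUID().toString();   // same idea as __UUID()
long ts = System.currentTimeMillis();                  // same idea as __time()
int threadNo = ctx.getThreadNum();                     // same idea as __threadNum()
vars.put("orderId", id);
vars.put("timestamp", String.valueOf(ts));
vars.put("threadNo", String.valueOf(threadNo));
// downstream samplers can then reference ${orderId}, ${timestamp} and ${threadNo}

Values read by the CSV Data Set Config are exposed the same way, as ${yourColumnName}, so CSV-driven and generated values can be mixed freely in the same request.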

In addition to Dmitri's great answer I would like to add my few cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially at
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range of workload scenarios (e.g. concurrency levels, heap sizes, message sizes, etc.). Running the tests manually for each of these scenarios is time-consuming and likely to cause errors. Therefore it is important to automate the performance tests prior to executing them. We automate our performance tests using a shell script: start_performance_test.sh.
This script can give you an idea for something similar. Also, the article as a whole introduces you to JMeter usage with some examples.

Related

What is the fastest way to persist large complex data objects in PowerShell for a short-term period?

Case in point - I have a build which invokes a lot of REST API calls and processes the results. I would like to split the monolithic step that does all that into 3 steps:
initial data acquisition - gets data from the REST API. Plain objects, no reference loops or duplicate references
data massaging - enriches the data from (1) with all kinds of useful information. May result in duplicate references (the same object is referenced from multiple places) or reference loops.
data processing
The catch is that there is a lot of data, and converting it to JSON takes too much time for my taste. I have not checked the Export-CliXml cmdlet, but I think it would be slow too.
If I wrote the code in C# I would use some kind of binary serialization, which should be sophisticated enough to handle reference loops and duplicate references.
Please note that the serialization would write to the build staging directory and would be deserialized almost immediately, as soon as the next step runs.
I wonder what my options are in PowerShell?
EDIT 1
I would like to clarify what I mean by steps. This is a build running on a CI build server. Each step runs in a separate shell and is reported individually on the build page. There is no memory sharing between the steps. The only way to communicate between the steps is either through build variables or the file system. Of course, using a database is also possible, but it is overkill.
Build variables are set using certain API and are exposed to the subsequent steps as environment variables. As such they are quite limited in length.
So I am talking about communicating through the file system. I am sacrificing performance here for the sake of build granularity - instead of having one monolithic step I want to have 3 smaller steps. This way the build is more transparent and communicates clearly what it is doing. But I have to temporarily persist payloads between steps. If it is possible to minimize the overhead, then the benefits are worth it. If the performance is going to degrade significantly, then I will keep the monolithic step.

Does parameters variation not update the built-in database?

I notice that whenever I run a ParametersVariation model, the built-in database does not update... I have PLE, so there is no way for me to write my own database. I am currently able to pull data from various logs present in the database, but only from a normal simulation run. Is there a way to have the parameters variation write its data to the database after each simulation run?
I am currently running this code in 'After simulation run':
Database myFile = new Database(this, "A DB from Excel", "C:/Users/Downloads/DataExport.xlsx"); // external Excel file wrapped as a Database object
ModelDatabase modelDB = getEngine().getModelDatabase(); // the model's built-in database
modelDB.exportToExternalDB("flowchart_stats_time_in_state_log", myFile.getConnection(), "Sheet", false, true); // export the built-in log table to the "Sheet" worksheet
The export works perfectly. But the data never changes, and this is confirmed by exporting a distribution from a histogram that changes with every simulation run. For this export, it's the same data as was written to the database by the last standard (non-ParametersVariation) simulation run.
Model log database tables aren't produced for multi-run experiments. It's not specifically stated anywhere, but they're designed more for testing/debugging (single runs of) models.
(Also, notice that the log tables don't have columns specifying a run ID or similar, so there's no way that you would have been able to distinguish rows for different runs anyway, even if there were rows written in multi-run experiments.)
Unfortunately, because they are one of the only ways to 'automatically' produce certain forms of output data (like the contents of datasets or histograms) many people try to use them for that (even though they have a pretty un-useful 'internal' format). In general you should write to your own internal database tables for any persistent outputs, where you can also govern whether you store outputs for multiple runs or not (which will require you to calculate some form of unique run IDs and use those in columns to differentiate outputs per run, plus have logic or UI elements to determine when the table data is cleared for a new run and when it isn't).
NB: Note that the kinds of data the model log tables (like flowchart_stats_time_in_state_log which you mention) create can in virtually all cases be determined and created 'manually' via your own model code. That table in particular has a large amount of detail on what's happened in each block and, in any given case, it's probably only a fraction of that data (or a simplification/aggregation of it) that you really want/need.
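To make the 'own table plus run ID' suggestion above concrete, here is a rough, hypothetical sketch. It assumes you have created an internal table (here called my_run_stats with columns run_id, category, value), that the built-in database exposes a JDBC Connection via getModelDatabase().getConnection() (check the API docs for your AnyLogic version), and that someStatistic is whatever value you collect in your model:

String runId = java.util.UUID.randomUUID().toString(); // or build one from the experiment's iteration index
try {
    java.sql.Connection con = getEngine().getModelDatabase().getConnection(); // assumed accessor, see note above
    java.sql.PreparedStatement ps = con.prepareStatement(
        "INSERT INTO my_run_stats (run_id, category, value) VALUES (?, ?, ?)");
    ps.setString(1, runId);
    ps.setString(2, "time_in_state"); // example category label
    ps.setDouble(3, someStatistic);   // hypothetical value collected in the model
    ps.executeUpdate();
    ps.close();
} catch (Exception e) {
    error(e.toString()); // surface the problem in the AnyLogic UI
}

With a run_id column in place you can keep rows from multiple runs apart and decide for yourself when the table gets cleared.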

Is Apache Beam the right tool for feature pre-processing?

So this is a bit of a weird question as it isn't related to how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature processing tasks using its Python API. The documentation is pretty big and complex, but I went through most of it, even built a small working pipeline, and it is still not clear whether this would be the right tool for me.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Apache Beam can run this pipeline every 5 minutes, read whatever new input data was generated and update the counts from the previous time it ran. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can be either a bounded (batch) or an unbounded (streaming) pipeline. In the first case, it will stop after all of your bounded data has been processed; in the second case, it will run continuously and expect new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default it is assumed to be a bounded source (a limited number of files), but it could be an unbounded source as well if you expect more new files to arrive and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools, like Apache Airflow (or just unix crontab). So, it all depends on your needs and the type of data source. I could probably give more specific advice if you could share more details of your data pipeline - the type of your data source and environment, an example of input and output results, how often your input data can be updated and so on.
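As a purely illustrative sketch of the streaming approach for the "count per category over the last 30 minutes, updated every 5 minutes" requirement, here is roughly what it could look like with Beam's Java SDK (the equivalent transforms exist in the Python SDK); events is assumed to be an unbounded PCollection<String> of category names read from some streaming source such as Pub/Sub or Kafka:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// 30-minute windows that slide every 5 minutes, then a count per category in each window
PCollection<KV<String, Long>> countsPerCategory =
    events
        .apply("30MinWindowsEvery5Min",
            Window.<String>into(SlidingWindows.of(Duration.standardMinutes(30))
                                              .every(Duration.standardMinutes(5))))
        .apply("CountPerCategory", Count.perElement());
// each window's result can then be written wherever the features are served from

If you would rather stick with a batch pipeline, the same counting logic works, but the "every 5 minutes" part would then come from the external scheduler (Airflow, cron) mentioned above.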

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parallel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for the transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
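For anyone curious what that side-input pattern looks like in code, here is a rough sketch in Beam's Java SDK (purely illustrative - Dataprep's generated code is not visible to us); refData is assumed to be the smaller reference PCollection<KV<String, String>> and mainData the large PCollection<KV<String, String>> being enriched:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// materialize the reference data as an in-memory map on every worker
PCollectionView<java.util.Map<String, String>> refView =
    refData.apply("RefAsMap", View.asMap());

PCollection<String> enriched =
    mainData.apply("JoinWithRef", ParDo.of(new DoFn<KV<String, String>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            java.util.Map<String, String> ref = c.sideInput(refView); // the whole map must fit in worker memory
            String extra = ref.get(c.element().getKey());             // may be null if there is no match
            c.output(c.element().getValue() + "," + extra);
        }
    }).withSideInputs(refView));

The important part is the View.asMap() step: the whole reference set is shipped to each worker, which is exactly what blows up when the reference data gets too large.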
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Clarifications on ElectricCommander and tutorials

I was searching for tutorials on Electric Cloud over the net but found nothing. I also could not find good blogs dealing with it. Can somebody point me in the right direction for this?
Also, we are planning on using Electric Cloud for executing Perl scripts in parallel. We are not going to build software. We are trying to test our hardware in parallel by executing the same Perl script in parallel using ElectricCommander. But I think ElectricCommander might not be the right tool given its cost. Can you suggest some of the pros and cons of using ElectricCommander for this, and any other features which might be useful for our testing?
Thanks...
RE #1: All of the ElectricCommander documentation is located in the Electric Cloud Knowledge Base at https://electriccloud.zendesk.com/entries/229369-documentation.
ElectricCommander can also be a valuable application to drive your tests in parallel. Here are just a few aspects for consideration:
Subprocedures: With EC, you can just take your existing scripts, drop them into a procedure definition and call that procedure multiple times (concurrently) in a single procedure invocation. If you want, you can further decompose your scripts into more granular subprocedures. This will drive reuse, lower cost of administration, and it will enable your procedures to run as fast as possible (see parallelism below).
Parallelism: Enabling a script to run in parallel is literally as simple as checking a box within EC. I'm not just referring to running 2 procedures at the same time without risk of data collision. I'm referring to the ability to run multiple steps within a procedure concurrently. Coupled with the subprocedure capability mentioned above, this enables your procedures to run as fast as possible, as you can nest subprocedures within other subprocedures and enable everything to run in parallel where the tests will allow it.
Root-cause Analysis: Tests can generate an immense amount of data, but often only the failures, warnings, etc. are relevant (tell me what's broken). EC can be configured to look for very specific strings in your test output and will produce diagnostics based on that configuration. So if your test produces a thousand lines of output, but only 5 lines reference errors, EC will automatically highlight those 5 lines for you. This makes it much easier for developers to quickly identify the root cause.
Results Tracking: ElectricCommander's properties mechanism allows you to store any piece of information that you determine to be relevant. These properties can be associated with any object in the system whether it be the procedure itself or the job that resulted from the invocation of a procedure. Coupled with EC's reporting capabilities, this means that you can produce valuable metrics indicating your overall project health or throughput without any constraint.
Defect Tracking Integration: With EC, you can automatically file bugs in your defect tracking system when tests fail or you can have EC create a "defect triage report" where developers/QA review the failures and denote which ones should be auto-filed by EC. This eliminates redundant data entry and streamlines overall software development.
In short, EC will behave exactly the way you want it to. It will not force you to change your process to fit the tool. As far as cost goes, Electric Cloud provides a version known as ElectricCommander Workgroup Edition for cost-sensitive customers. It is available for a small annual subscription fee and is something that you may want to follow up on.
I hope this helps. Feel free to contact your account manager or myself directly if you have additional questions (dfarhang#electric-cloud.com).
Maybe you could execute the same perl script on several machines by using r-commands, or cron, or something similar.
To further address the parallel aspect of your question:
The command-line interface lets you write scripts to construct procedures, including this kind of subprocedure with parallel steps. So you are not limited, in the number of parallel steps, to what you wrote previously: you can write a procedure which dynamically sizes itself to (for example) the number of steps you would like to run in parallel, or the number of resources you have to run steps in parallel.