Scalable approach to maintaining data files for simulations in Gatling

I have been part of building a framework with Gatling, and we have been wondering whether there is a scalable way to maintain CSV files for a large number of simulations.
What we want to achieve:
A single data file might be used by multiple simulations
A single CSV file may contain multiple values, where some of the values are used by one simulation and other sets of values are used by other simulations
Be able to have a default data file for each simulation
Be able to override the default data file if required and pass a new data file, for example from the command line.
In some cases, if I am running 10 simulations, I want to be able to override the input data file for just one of those simulations.
Given the above requirements, and from my knowledge of JMeter, I am thinking of the following approach:
Hardcode the input data file for each simulation through a feeder.
Have the framework read environment variables and use them to override the default data files for the respective simulations. But this would mean having a lot of environment variables in the framework, i.e. one env variable per simulation, so that we can override the input file for just the simulation we want (see the sketch below):
String envVarToGetInputDataFileNameForEachSimulation = System.getenv(key_of_variable);
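To make the override idea concrete, here is a minimal sketch in Gatling's Scala DSL, assuming a hypothetical system property (mySimulation.dataFile), environment variable (MY_SIMULATION_DATA_FILE), default file path, and CSV column (username); it illustrates the approach rather than prescribing an implementation:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class MySimulation extends Simulation {

  // Default data file for this simulation, overridable per run via a JVM
  // system property (-DmySimulation.dataFile=...) or an environment variable.
  private val dataFile: String =
    Option(System.getProperty("mySimulation.dataFile"))
      .orElse(sys.env.get("MY_SIMULATION_DATA_FILE"))
      .getOrElse("data/my_simulation_default.csv")

  private val feeder = csv(dataFile).circular

  private val scn = scenario("MySimulation")
    .feed(feeder)
    .exec(http("request").get("/endpoint?user=${username}"))

  setUp(scn.inject(atOnceUsers(1)))
    .protocols(http.baseUrl("http://localhost:8080"))
}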
Limitations and questions:
Is hardcoding CSV files for the respective simulations a good idea?
When the number of simulations grows, do the enterprise frameworks you build for your organisations usually maintain that many CSV files, or do they use some other approach?
If I want to override a default file as stated in point 2 of the approach, is it a good idea to use one environment variable per simulation, or is there a better way? With this approach, won't we end up creating too many environment variables as the number of simulations grows?
To folks who have built large Gatling frameworks and managed a large number of input files: can you please suggest a better, more scalable way of doing this?

Related

Create data as a load on J-Meter

I'm working with Kubernetes on Microsoft Azure with real data. I need to generate a sample of data in JMeter and then use it as a workload to stress the CPU of the Tea-Store microservices on Kubernetes. Any hints or sources on how to do that, and on which types of files work with JMeter?
If you want a specific answer, you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config allows reading CSV files into JMeter Variables, so each virtual user on each iteration reads the next line from the CSV file
The __CSVRead() function does more or less the same; however, it can be declared/used at runtime, so you can have a dynamic filename/path and you decide when to proceed to the next column/row
The JDBC Request sampler allows reading test data from a database or creating test data in a database
The __StringFromFile() function reads the next line from a file each time it is called
The __FileToString() function reads the whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - number of current thread
__time() and __timeShift() - current timestamp in various formats plus possibility to generate dates in future or past
__Random() - generate a random number
__RandomString() - generate a random string out of provided characters
__UUID() - generate unique GUID-like structure
__groovy() - for everything else, it executes arbitrary Groovy code and returns the result
In addition to Dmitri's great answer, I would like to add my few cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially at
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range of workload scenarios (e.g. concurrency levels, heap sizes, message sizes, etc.). Running the tests manually for each of these scenarios is time-consuming and likely to cause errors. Therefore it is important to automate the performance tests prior to executing them. We automate our performance tests using a shell script: start_performance_test.sh.
This script can give you an idea for something similar. Overall, the article also introduces JMeter usage with some examples.

Dymola: Avoid "Not enough storage for initial variable data" for large Modelica model

I am trying to simulate a large Modelica model in Dymola. This model uses several records that define time series input data (data with 900 second intervals for 1 year), which it reads via the CombiTimeTable model.
If I limit the records to only contain the data for 2 weeks (also 900 second intervals), the model simulates fine.
With the yearly data, the translation seems to run successfully, but simulation fails. The dslog file contains the message Not enough storage for initial variable data.
This happens on a Windows 10 system with 8 GB RAM as well as on a Windows 7 system with 32 GB RAM.
Is there any way to avoid this error and get the simulation to run? Thanks in advance!
The recommended way is to have the time series data not within the records (that is, in your model or library) but in external data files. The CombiTimeTable supports reading from both text files and MATLAB MAT files at simulation run-time. You will also benefit from shorter translation times.
You can still organize your external files relative to your library by means of Modelica URIs, since the CombiTimeTable (as well as the other table blocks) already calls the loadResource function. The recommended way is to place these files in a Resources directory of your Modelica package.

spark: convert dataframe to svm labeled point

When referring to the Spark ml/mllib docs, they all start from an SVM-stored example. This is really frustrating me, since there doesn't seem to be a straightforward way to go from a standard RDD[Row] or DataFrame (taken from a "table" select) to this notation without first storing it.
This is just an inconvenience when dealing with 3 features or so, but when you scale that up to lots and lots of features, it implies you will be doing a lot of typing and searching.
I ended up with something like this: (where "train" is a random split of a dataset w/ features stored in a table)
val trainLp = train.map(row => LabeledPoint(row.getInt(0).toDouble, Vectors.dense(row(8).asInstanceOf[Int].toDouble,row(9).asInstanceOf[Int].toDouble,row(10).asInstanceOf[Int].toDouble,row(11).asInstanceOf[Int].toDouble,row(12).asInstanceOf[Int].toDouble,row(13).asInstanceOf[Int].toDouble,row(14).asInstanceOf[Int].toDouble,row(15).asInstanceOf[Int].toDouble,row(18).asInstanceOf[Int].toDouble,row(21).asInstanceOf[Int].toDouble,row(27).asInstanceOf[Int].toDouble,row(28).asInstanceOf[Int].toDouble,row(29).asInstanceOf[Int].toDouble,row(30).asInstanceOf[Int].toDouble,row(31).asInstanceOf[Double],row(32).asInstanceOf[Double],row(33).asInstanceOf[Double],row(34).asInstanceOf[Double],row(35).asInstanceOf[Double],row(36).asInstanceOf[Double],row(37).asInstanceOf[Double],row(38).asInstanceOf[Double],row(39).asInstanceOf[Double],row(40).asInstanceOf[Double],row(41).asInstanceOf[Double],row(42).asInstanceOf[Double],row(43).asInstanceOf[Double])))
This is a nightmare to maintain, since these rows tend to change pretty often.
And here I'm only in the stage of getting labeled points, I'm not even at a svm stored version of this data.
What am I missing here that could potentially save me days of misery?
EDIT:
I got one step closer to the solution using something called a VectorAssembler to build up my vector, sketched below.
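A minimal sketch of that approach, assuming hypothetical column names ("col_a", "col_b", "col_c"); VectorAssembler is part of Spark ML:

import org.apache.spark.ml.feature.VectorAssembler

// Collect the chosen feature columns into a single vector column named "features".
// The input column names below are placeholders for your actual schema.
val assembler = new VectorAssembler()
  .setInputCols(Array("col_a", "col_b", "col_c"))
  .setOutputCol("features")

val assembled = assembler.transform(train)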
Usually, CSV files are raw, unfiltered sources of information. Often they provide the original source of information.
In order to build a model, you usually want to go through a data cleansing, data preparation, data wrangling (and maybe more "data x" wording) phase before you build your model. This phase usually takes a big piece of the model building, and usually requires exploration of data. Typically, a process of transformation and feature selection (and creation) occurs between the original data and the data that builds the model.
If your CSV files don't need any of these preliminary phases - good for you!
You can always make configuration files that can keep track of certain columns or column indexes that build your model.
If your DataFrame comes from a "select", I guess what you can do to improve legibility and maintainability is to use column names instead of index numbers.
df.select($"my_col_1", $"my_col_2", .. )
and then operate through
row.getAs[String]("my_col_1")
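Putting those two snippets together, a hedged sketch of building labeled points by name rather than by index (the column names "label" and "my_col_*" are placeholders, and the getAs[...] type should match each column's actual type):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the feature column names in one place (or load them from a config file).
val featureCols = Seq("my_col_1", "my_col_2", "my_col_3")

// Mirrors the train.map(...) usage from the question above.
val trainLp = train.map { row =>
  val features = featureCols.map(c => row.getAs[Int](c).toDouble).toArray
  LabeledPoint(row.getAs[Int]("label").toDouble, Vectors.dense(features))
}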

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles, thousands to millions of them.
I have not yet come across any CAD tools that have =rand() functions that can be entered into dimensions. Is one way perhaps to have a CAD program import a CSV file of these randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet will represent a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities or an export-capable tool of your choosing to generate the configurations according to your desired distribution. I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, but you haven't really told us what you're trying to accomplish so I can't give you specific recommendations anyways.
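As an illustration of generating those configurations externally, here is a small sketch (in Scala, purely as an example; the column names, value ranges, and output file name are assumptions) that writes a CSV of random rectangle parameters which a design-table-style import could consume:

import java.io.PrintWriter
import scala.util.Random

object RandomRectangles {
  def main(args: Array[String]): Unit = {
    val n = 10000                       // number of configurations to generate
    val rnd = new Random(42)            // fixed seed so the run is reproducible
    val out = new PrintWriter("rectangles.csv")
    out.println("config_name,length_mm,width_mm,angle_deg")
    for (i <- 1 to n) {
      val length = 10 + rnd.nextDouble() * 90  // 10..100 mm
      val width  = 5 + rnd.nextDouble() * 45   // 5..50 mm
      val angle  = rnd.nextDouble() * 360      // 0..360 degrees
      out.println(f"rect_$i%d,$length%.3f,$width%.3f,$angle%.3f")
    }
    out.close()
  }
}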
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected someone to want thousands or millions of anything. I would suggest sneaking up on the problem. Try doubling the number of items, check this works as expected and then repeat this process again and again until either you successfully get to millions or until you find the CAD tool won't be able to handle this.

FORTRAN: Best way to store large amount of data which is readable in MATLAB

I am working on developing an application in Fortran where I have points defining quadrilateral panels on the surface of an object. I am calculating various parameters on these quadrilateral panels for a number of frequencies.
The output file should look like:
FREQUENCY,PANEL_NUMBER,X1,Y1,Z1,X2,Y2,Z2,X3,Y3,Z3,X4,Y4,Z4,AREA,PRESSURE,....
0.01,1,....
0.01,2,....
0.01,3,....
.
.
.
.
0.01,2000,....
0.02,1,....
0.02,2,....
.
.
.
0.02,2000,...
.
.
I am expecting a maximum of 300,000 rows with 30 columns. Data types are composed of integer, real and complex numbers. I want to store this file and later read the file in MATLAB to create a 3D geometry which I will color based on pressure at each panel.
The problem is, as you can see from the file structure, there is a lot of data. I am currently writing this as a CSV file and the size is about 26 GB.
I do not want to use a database to handle this. Could anyone suggest what file format I should use to write this data from Fortran?
Thanks for your help,
Amitava
Store the data in the native format of the computer rather than in a human-readable file in which the numbers have been converted to base 10 and characters. This will produce the smallest file and the fastest to process. On the Fortran open statement, use form='unformatted', access='stream'. The first causes the file to be unformatted, the second causes Fortran not to include its usual record-length information, which is Fortran specific. This omission makes the file more portable to other languages. Someone else can help better with how to read the file in MATLAB; I found this on the web: http://www.mathworks.com/help/matlab/import_export/importing-binary-data-with-low-level-i-o.html
UPDATE: This approach has several assumptions. It might not work easily if you wish to transport the file between different types of computers. Your question implies that you want many rows of identical structure; identical rows simply map to a file with that number of identical records. It seems that you want to read the entire file, in which case a sequential file is appropriate. If you wish to read "random" records, a Fortran direct-access file might be useful. With the simplicity of identical records, using a native file format seems easy. If you want self-documentation or portability across computers (different numeric representations), a file format such as HDF or FITS would be useful.
I second #steabert's mention of NetCDF and there's also HDF5 (on which the NetCDF 4 format is built). However, it does depend on what you mean by "data types": they are best used with regular/rigid data layouts and NetCDF's support for Fortran derived types can be painful at times.
Possible advantages for cases with large lumps of data are transparent data compression; data checksumming; and possibly more natural random access (that is, no need to compute seek positions based on array index) compared with Fortran stream access. That's on top of the usual benefits of a self-documenting and portable file format.
MATLAB has inbuilt support for reading these files, and recent versions also support the OPeNDAP framework so you wouldn't even need to have the file on the same (or multiple) machine(s).
Of course, disadvantages: extra software; extra skills development (especially for HDF5); and increased code complexity on the Fortran side.