What is the fastest way to persist large, complex data objects in PowerShell for a short period? - powershell

Case in point: I have a build that makes a lot of REST API calls and processes the results. I would like to split the monolithic step that does all of that into three steps:
1. Initial data acquisition - gets the data from the REST API. Plain objects, no reference loops or duplicate references.
2. Data massaging - enriches the data from (1) with all kinds of useful information. May result in duplicate references (the same object referenced from multiple places) or reference loops.
3. Data processing.
The catch is that there is a lot of data, and converting it to JSON takes more time than I would like. I have not checked the Export-CliXml cmdlet, but I think it would be slow too.
If I wrote the code in C#, I would use some kind of binary serialization, which should be sophisticated enough to handle reference loops and duplicate references.
Please note that the serialized data would be written to the build staging directory and deserialized almost immediately, as soon as the next step runs.
I wonder what my options are in PowerShell.
EDIT 1
I would like to clarify what I mean by steps. This is a build running on a CI build server. Each step runs in a separate shell and is reported individually on the build page. There is no memory sharing between the steps. The only way to communicate between steps is either through build variables or the file system. Of course, using a database is also possible, but it is overkill.
Build variables are set through a dedicated API and are exposed to subsequent steps as environment variables; as such, they are quite limited in length.
So I am talking about communicating through the file system. I am sacrificing performance here for the sake of build granularity - instead of having one monolithic step, I want to have three smaller steps. This way the build is more transparent and communicates clearly what it is doing. But I have to temporarily persist payloads between steps. If the overhead can be kept small, the benefits are worth it. If performance degrades significantly, I will keep the monolithic step.
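To make the hand-off concrete, this is the kind of thing I have in mind for the file-system route. Export-Clixml is the built-in candidate I have not timed yet; the endpoint, staging path, and -Depth value below are placeholders:

    # Step 1: acquisition - plain objects, persisted to the staging directory
    $stagingDir = 'C:\build\staging'                                  # placeholder for the path the build server provides
    $raw = Invoke-RestMethod -Uri 'https://api.example.test/items'    # placeholder endpoint
    $raw | Export-Clixml -Path (Join-Path $stagingDir 'raw.clixml') -Depth 5

    # Step 2: massaging - rehydrate, enrich, persist again
    # (CliXml should record repeated objects as RefId references, so duplicates
    #  and loops survive, but only within the -Depth limit)
    $data = Import-Clixml -Path (Join-Path $stagingDir 'raw.clixml')
    # ...enrich $data here...
    $data | Export-Clixml -Path (Join-Path $stagingDir 'enriched.clixml') -Depth 5

    # Step 3: processing - rehydrate once more and do the real work
    $data = Import-Clixml -Path (Join-Path $stagingDir 'enriched.clixml')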

Related

Create data as a load on JMeter

I'm working on Kubernetes on Microsoft Azure with real data. Now I need to generate a sample of data in JMeter and then use it as a workload to stress the CPU of the Tea-Store microservices on Kubernetes. Any hint or source on how to do that, and which types of files work with JMeter?
If you want a specific answer, you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config allows reading CSV files into JMeter Variables, so that each virtual user reads the next line from the CSV file on each iteration
The __CSVRead() function does more or less the same, but it can be declared/used at runtime, so you can have a dynamic filename/path and you decide when to proceed to the next column/row
The JDBC Request sampler allows reading test data from a database or creating test data in a database
The __StringFromFile() function reads the next line from a file each time it is called
The __FileToString() function reads a whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - the number of the current thread
__time() and __timeShift() - the current timestamp in various formats, plus the possibility of generating dates in the future or the past
__Random() - generates a random number
__RandomString() - generates a random string out of the provided characters
__UUID() - generates a unique GUID-like value
__groovy() - for everything else; it executes arbitrary Groovy code and returns the result
In addition to Dmitri's great answer, I would like to add my few cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range of workload scenarios (e.g. concurrency levels, heap sizes, message sizes, etc.). Running the tests manually for each of these scenarios is time-consuming and likely to cause errors. Therefore it is important to automate the performance tests prior to executing them. We automate our performance tests using a shell script: start_performance_test.sh.
This script may give you an idea for something similar. Also, the article as a whole introduces JMeter usage with some examples.

Incrementing hundreds of counters at once: Redis or MongoDB?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US - # of events
    iOS - # of events that occurred in the US
    Android - # of events that occurred in the US
CA - # of events
    iOS - # of events that occurred in CA
    Android - # of events that occurred in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
This means that, in order to maintain the relational counters for something like the above, I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked, and I also want to do it in such a manner that lookups on the data remain super quick. As such, I have been looking into using Redis or MongoDB for this.
Questions:
Is there a better way to do this than counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in MongoDB to increment 100+ counters at the same time in one operation be viable and not slow? The upside here is that I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to Redis, doing a zincrby for each of the applicable counters for the event?
Thanks
Depending on how your key structure is laid out, I would recommend pipelining the zincr commands. You have an easy "commit" trigger - the request. If you iterate over your parameters and zincr each key, then at the end of the request issue the execute command, it will be very fast. I've implemented a system like you describe as both a cgi and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
With that, I was able to process something like 150,000-200,000 increments per second on the Redis side with a single process, which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a DB cleanup process. I then had a cron job that would do set operations to "roll up" stats into hourly, daily, and weekly buckets using variants of the aforementioned key pattern. I bring these ideas up as ways you can take advantage of the built-in capabilities of Redis to make the reporting side simpler. There are other ways of doing it, but this pattern seems to work well.
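Translated into code, the per-request flow is: queue every increment, then send the whole batch in one go. The sketch below uses PowerShell and the StackExchange.Redis client purely for illustration (not the stack I used); the DLL path is a placeholder, and the client assembly plus its dependencies are assumed to be available:

    # Placeholder path - assumes the StackExchange.Redis assembly and its dependencies can be loaded
    Add-Type -Path 'C:\libs\StackExchange.Redis.dll'

    $mux = [StackExchange.Redis.ConnectionMultiplexer]::Connect('localhost')
    $db  = $mux.GetDatabase()

    $window = (Get-Date).ToString('yyyy-MM-dd:HH:mm')               # one sorted set per minute window
    $params = @{ country = 'US'; os = 'iOS'; device = 'iPhone' }    # parsed from the incoming request

    $batch = $db.CreateBatch()                                      # nothing is sent yet
    $tasks = foreach ($p in $params.GetEnumerator()) {
        $batch.SortedSetIncrementAsync($window, "$($p.Key):$($p.Value)", 1)
    }
    $batch.KeyExpireAsync($window, [TimeSpan]::FromDays(7)) | Out-Null   # let Redis drop old windows
    $batch.Execute()                                                # everything goes out in one pipelined burst
    [System.Threading.Tasks.Task]::WaitAll([System.Threading.Tasks.Task[]]$tasks)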
As noted by eyossi, the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real-time system, the concurrency may well be an issue. If it is an "end of day" log-parsing system, it is unlikely to trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regards to keeping reads fast in Redis, I would consider setting up a read-only Redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it, it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the Redis instance, you might consider running a 32-bit Redis server to keep memory usage down. A 32-bit instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64-bit Redis isn't taking too much memory, feel free to use it. As always, test your own usage patterns to validate.
In Redis you could use MULTI to increment multiple keys at the same time.
I have had some bad experiences with MongoDB; I have found that it can be really tricky when you have a lot of writes going to it...
You can look at this link for more info, and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which may already be improved in version 2.x - I didn't check it).
On the other hand, I have had a good experience with Redis; I am using it for a lot of reads/writes and it works great.
You can find more information about how I am using Redis (to get a feeling for the amount of concurrent reads/writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipeline than multi if you don't need the atomic feature.

How to build a local environment with large databases

I have two storages (PostgreSQL, MongoDB), and since I need to develop the application locally on my computer (ideally offline), I need the data from those storages to be copied to my HDD.
Anyway, these are massive databases with hundreds of gigabytes of data.
I don't need all the data stored there, just a sample of it, so I can run my app locally against that data. Both storages have capable tools for data export (pg_dump, mongodump, mongoexport, etc.).
But I don't know how to easily and effectively export a small sample of the data. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited to a certain number of rows, there would still be trouble with triggers, functions, indexes, etc.
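To illustrate the kind of thing I am after - a schema-only dump plus hand-picked row samples - here is a rough sketch; the database, table, and collection names are placeholders:

    # 1. Schema only - tables, indexes, functions, triggers - no data, so it is fast
    pg_dump --schema-only --dbname $sourceDb --file .\schema.sql

    # 2. Row samples for the handful of tables the app actually needs
    psql --dbname $sourceDb -c "\copy (SELECT * FROM orders ORDER BY id DESC LIMIT 10000) TO 'orders.csv' WITH CSV HEADER"

    # 3. MongoDB side: mongoexport accepts --query and --limit for the same kind of sampling
    mongoexport --db app --collection events --limit 10000 --out .\events.json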
I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
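To make the shape concrete, the reset-before-every-test loop can be wired up like this; the sketch uses Pester and psql just for illustration, and the database, schema, and function names are made up:

    # Reset the real schema before every test via the single reset procedure
    Describe 'stored function tests' {
        BeforeEach {
            # test.reset_all_data() stands in for the MERGE-style reset procedure
            & psql --dbname app_test --quiet -c 'SELECT test.reset_all_data();'
        }

        It 'computes an order total from the known seed data' {
            $total = & psql --dbname app_test --tuples-only --no-align -c 'SELECT app.order_total(42);'
            $total.Trim() | Should -Be '123.45'
        }
    }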
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components, and probably performance too. For IT, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.
The direction you seem to be going in is going to be difficult because, as you've already noticed, it's difficult to handle that volume of data and it's difficult to generate representative data sets (most CI systems actually use production data that has been "cleaned" of sensitive information).
Which is why most of the places I've worked have not gone that way.
Just copy it all. Several hundred gigabytes is not very much by today's standards; you can buy a 2000 GB disk for $80.
If you test your code on a small data sample, how will you know whether your code is efficient enough for the full database?
Just remember to encrypt it with a strong password if it leaves your company building.

Clarifications on ElectricCommander and tutorials

I was searching for tutorials on Electric Cloud around the net but found nothing. I also could not find good blogs dealing with it. Can somebody point me in the right direction for this?
We are also planning on using Electric Cloud to execute Perl scripts in parallel. We are not going to build software; we are trying to test our hardware in parallel by executing the same Perl script in parallel using ElectricCommander. But I think ElectricCommander might not be the right tool given its cost. Can you suggest some of the pros and cons of using ElectricCommander for this, and any other features which might be useful for our testing?
Thanks...
RE #1: All of the ElectricCommander documentation is in the Electric Cloud Knowledge Base at https://electriccloud.zendesk.com/entries/229369-documentation.
ElectricCommander can also be a valuable application to drive your tests in parallel. Here are just a few aspects for consideration:
Subprocedures: With EC, you can just take your existing scripts, drop them into a procedure definition and call that procedure multiple times (concurrently) in a single procedure invocation. If you want, you can further decompose your scripts into more granular subprocedures. This will drive reuse, lower cost of administration, and it will enable your procedures to run as fast as possible (see parallelism below).
Parallelism: Enabling a script to run in parallel is literally as simple as checking a box within EC. I'm not just referring to running 2 procedures at the same time without risk of data collision; I'm referring to the ability to run multiple steps within a procedure concurrently. Coupled with the subprocedure capability mentioned above, this enables your procedures to run as fast as possible, as you can nest subprocedures within other subprocedures and enable everything to run in parallel where the tests allow it.
Root-cause Analysis: Tests can generate an immense amount of data, but often only the failures, warnings, etc. are relevant (tell me what's broken). EC can be configured to look for very specific strings in your test output and will produce diagnostics based on that configuration. So if your test produces a thousand lines of output but only 5 lines reference errors, EC will automatically highlight those 5 lines for you. This makes it much easier for developers to quickly get to the root cause.
Results Tracking: ElectricCommander's properties mechanism allows you to store any piece of information that you determine to be relevant. These properties can be associated with any object in the system whether it be the procedure itself or the job that resulted from the invocation of a procedure. Coupled with EC's reporting capabilities, this means that you can produce valuable metrics indicating your overall project health or throughput without any constraint.
Defect Tracking Integration: With EC, you can automatically file bugs in your defect tracking system when tests fail or you can have EC create a "defect triage report" where developers/QA review the failures and denote which ones should be auto-filed by EC. This eliminates redundant data entry and streamlines overall software development.
In short, EC will behave exactly the way you want it to. It will not force you to change your process to fit the tool. As far as cost goes, Electric Cloud provides a version known as ElectricCommander Workgroup Edition for cost-sensitive customers. It is available for a small annual subscription fee and is something you may want to follow up on.
I hope this helps. Feel free to contact your account manager or myself directly if you have additional questions (dfarhang#electric-cloud.com).
Maybe you could execute the same perl script on several machines by using r-commands, or cron, or something similar.
To further address the parallel aspect of your question:
The command-line interface lets you write scripts to construct procedures, including this kind of subprocedure with parallel steps. So you are not limited, in the number of parallel steps, to what you wrote previously: you can write a procedure which dynamically sizes itself to (for example) the number of steps you would like to run in parallel, or the number of resources you have to run steps in parallel.

One big call vs. multiple smaller TSQL calls

I have an ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with option 2, which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10%, performance loss? We are using C# 3.5 and SQL Server 2008 with stored procedures and ADO.NET.
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls the "getStuff" proc, and the client wrapper method for "updateStuff" expects type "Thing". So: one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my application, around ten or twelve calls are made to the server to get the data I need. Some of the data fields are varchar(max) and varbinary(max) fields (pictures, large documents, videos, and sound files). All of my calls are synchronous - i.e., while the data is being requested, the user (and the client-side program) has no choice but to wait. He may only want to read or view the data, which only makes sense when it is ALL there, not just partially there. The process, I believe, is slower this way, so I am developing an alternative approach based on asynchronous calls to the server from a DLL library that raises events to the client to announce progress. The client is programmed to handle the DLL events and set a variable on the client side indicating which calls have been completed. The client program can then do what it must to prepare the data received in call #1 while the DLL proceeds asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait if necessary (I am hoping this will be a short wait or no wait at all). In this manner, both the server-side and client-side software get the job done more efficiently.
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse.
But consider this: for small requests, the round-trip latency might cost more than the work the request actually does. You have to find the right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start pursuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arises to do so. This means that you should analyze your anticipated usage patterns and determine what the typical frequency of use for this process will be, and what user-interface latency will result from the present design. If the user will receive feedback from the app in less than a few (2-3) seconds, and the load this process puts on the server is not inordinate, then don't worry about it. If, on the other hand, the user is waiting an unacceptable amount of time for a response (subjective but definitely measurable), or if the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques will make the most sense, or be the most cost-effective, depends on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse.
Personally, I would go with one larger round trip.
This will definitely be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.
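For concreteness, consuming several result sets from that single round trip is just a NextResult() loop. The sketch below is in PowerShell because it is easy to paste and run, but these are the same System.Data.SqlClient classes you would call from C# 3.5; the connection string and stored procedure name are invented:

    $conn = New-Object System.Data.SqlClient.SqlConnection 'Server=.;Database=App;Integrated Security=SSPI'
    $cmd  = $conn.CreateCommand()
    $cmd.CommandType = [System.Data.CommandType]::StoredProcedure
    $cmd.CommandText = 'dbo.GetDashboardData'     # hypothetical proc returning several result sets

    $conn.Open()
    try {
        $reader = $cmd.ExecuteReader()
        do {
            while ($reader.Read()) {
                # map the current row into whatever object this result set represents
            }
        } while ($reader.NextResult())            # advance to the next result set - still one round trip
        $reader.Close()
    }
    finally {
        $conn.Close()
    }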