azure stream analytics group multiple inputs to one output

I am working with Azure Stream Analytics and want to group multiple inputs that share the same timestamp into a single output.
How can I modify my code so that it prints, for example, in the first line:
{"HO","KSPF","07","03","#300","46091240","2017-05-31T10:17:34:4430000Z"}

You can probably aggregate using Collect over a very small tumbling window, such as 1 millisecond, and then write a JavaScript UDF that builds a single result event from the array accumulated by the Collect aggregate.
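The idea of collecting every event in a tiny time window and then flattening the collected array into one event can be illustrated outside Stream Analytics. The following is only a plain-Python simulation of that logic, not a Stream Analytics query or UDF; the field names `time` and `value` are assumptions, since the question does not show the input schema.

```python
from collections import defaultdict


def merge_by_timestamp(events):
    """Group events that share a timestamp and merge them into one record each."""
    grouped = defaultdict(list)
    for event in events:
        # Plays the role of the Collect aggregate: gather values per window.
        grouped[event["time"]].append(event["value"])
    # Plays the role of the UDF: flatten each collected array into one row,
    # with the shared timestamp appended as the last element.
    return [values + [ts] for ts, values in sorted(grouped.items())]


# Hypothetical input: one event per value, all carrying the same timestamp.
ts = "2017-05-31T10:17:34:4430000Z"
events = [
    {"time": ts, "value": v}
    for v in ["HO", "KSPF", "07", "03", "#300", "46091240"]
]

merged = merge_by_timestamp(events)
print(merged[0])
```

In the real job, the grouping step would be done by the windowed Collect aggregate and the flattening step by the JavaScript UDF.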

Related

ADF - what's the best way to execute one from a list of Data Flow activities based on a condition

I have 20 file formats and one Data Flow activity that maps to each of them. Based on the file name, I know which Data Flow activity to execute. Is a "Switch" activity the only way to handle this? Is there another way? For example, can I parameterize which data flow to execute by a variable name?

Unfortunately, there is no option to run one data flow out of a list based on an input condition.
To perform data migration and transformation for multiple tables, you can use the same data flow and parameterize it, either by providing the table names at runtime or by using a control table that holds all the table names and calling the Data Flow activity inside a ForEach. In the sink settings, use the merge schema option.

Azure Data Flow generic curation framework

I want to create a data curation framework using Data Flow that relies on generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10 and 100) and write to a sink as curated tables:
For each raw data feed, I need to validate the expected schema (based on a parameterized file name).
For each raw data feed, I need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.).
Using the Python SDK, create the Data Factory and mapping data flow pipelines from the Data Flow Script prepared with the parameters provided (for schema validation).
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues into a Log Analytics workspace, and tears down the resources on specific schedules.
Has anyone done something like this? What is the best approach for the above, please?
My overall goal is to reduce the time needed to validate and curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts, scheduled to run them on generic data pipelines at specific times of the day.
many thanks
CK

To validate the schema, you can keep a reference dataset that has the same schema (first row) as your main dataset. Then use a “Get Metadata” activity for each dataset to retrieve its structure.
You can then use an “If Condition” activity to check that the structures of both datasets match, using the equals logical function.
If the structures match, the next required activity (such as copying the dataset to another container) is performed.
The script you want to run on the ingested dataset can be executed with a “Custom” activity. You again need to create a linked service and its corresponding dataset for the script that validates the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling is handled by triggers in Azure Data Factory. A schedule trigger will start your pipeline automatically at the times you specify.
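The structure comparison described above can be sketched in plain Python, for example as the core of a validation script. This is a minimal illustration, assuming each structure is a list of column descriptions with `name` and `type` keys; the column names are hypothetical. Unlike a strict equality check, this version ignores column order and reports what differs, which may be handy for logging issues.

```python
def compare_schemas(reference, candidate):
    """Compare two column lists and return (ok, problems)."""
    ref = {col["name"]: col["type"] for col in reference}
    cand = {col["name"]: col["type"] for col in candidate}
    problems = []
    for name, expected in ref.items():
        if name not in cand:
            problems.append(f"missing column: {name}")
        elif cand[name] != expected:
            problems.append(
                f"type mismatch for {name}: expected {expected}, got {cand[name]}"
            )
    for name in cand:
        if name not in ref:
            problems.append(f"unexpected column: {name}")
    return (not problems, problems)


# Hypothetical reference and raw-feed structures.
reference = [{"name": "id", "type": "Int64"}, {"name": "city", "type": "String"}]
raw_feed = [{"name": "id", "type": "String"}, {"name": "country", "type": "String"}]

ok, problems = compare_schemas(reference, raw_feed)
print(ok, problems)
```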

Azure Data Flow (Pass output of one data flow to another in the pipeline)

I have a requirement to pass the Select transformation output from one data flow directly to another.
Example:
I have a data flow with a Select transformation as the final step.
I have another data flow that needs to take the above Select transformation output as its input.
Currently, I store the output of the first data flow in a table and read the data from that table in the second data flow, which takes a long time to execute. I want to avoid storing it in the table.
Thanks,
Karthik

Data flows require your logic to terminate with a Sink when you execute them from a pipeline, so you must persist your output somewhere in Azure. The next pipeline activity can read from that output dataset.

JMeter to record results on hourly basis

I have a JMeter project with multiple GET and POST requests and assertions for these. I use the Aggregate results and View results tree listeners, but neither of these can store results on an hourly basis. I tried the JMeterPlugins-Standard and JMeterPlugins-Extras packages and the jp#gc - Graphs Generator listener, but all of them use aggregated data instead of hourly data. I would like to get the number of successful and failed requests/assertions per hour; a bar chart would probably be most suitable for this purpose.

I'm going to suggest an unconventional, design-level solution: name your samplers dynamically with the hour (or the date and hour), so that the name changes every hour and each hour's samples appear in a separate category.
The code for such a name is (note `HH` for the 24-hour clock; `hh` is the 12-hour clock, so morning and afternoon hours would share a name):
${__time(dd:HH,)} the rest of sampler name
Each such sampler then appears as a separate row in the Aggregate Report (I simulated it with minutes/seconds, but the same happens with days/hours, just on a larger scale).
Pros and cons of this approach:
Pros:
Simple: you can aggregate anything by hour, minute, or any other time slice while the test is running, rather than by analysis after execution.
Not listener-dependent: it can be used with pretty much any listener or visualizer.
Cons:
If you also want overall stats, you have to sum up every sub-category. So it alters the data, but in a way that it can still be added back together relatively easily.
Calculating __time before every sampler is not entirely free from a performance perspective, but I don't think it adds visible overhead to a script.
You could get the same data by properly aggregating the JTL or CSV file (whichever you use) after execution, so it gives you nothing that cannot be achieved with standard methods.
The script needs altering to make this happen. If you have hundreds of samplers, it's going to take a while. And if you want to change back...
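For comparison, the post-execution alternative mentioned in the list above (aggregating the results file) could look roughly like this. It is a minimal sketch, assuming a CSV-format JTL with a header row that includes the default `timeStamp` (epoch milliseconds) and `success` columns; the sample rows are made up, and hours are bucketed in UTC.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def hourly_counts(jtl_file):
    """Count successful and failed samples per hour from a CSV JTL file object."""
    counts = Counter()
    for row in csv.DictReader(jtl_file):
        # timeStamp is in epoch milliseconds; bucket it by UTC hour.
        ts = datetime.fromtimestamp(int(row["timeStamp"]) / 1000, tz=timezone.utc)
        hour = ts.strftime("%Y-%m-%d %H:00")
        outcome = "ok" if row["success"].lower() == "true" else "failed"
        counts[(hour, outcome)] += 1
    return counts


# Made-up sample data standing in for a real results file.
sample = io.StringIO(
    "timeStamp,elapsed,label,success\n"
    "1496225854000,120,GET /home,true\n"
    "1496226600000,340,POST /login,false\n"
    "1496229454000,95,GET /home,true\n"
)

counts = hourly_counts(sample)
for (hour, outcome), n in sorted(counts.items()):
    print(hour, outcome, n)
```

With a real file, pass `open("results.jtl", newline="")` instead of the in-memory sample; the resulting counts can then be fed to any charting tool to draw a bar chart per hour.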

You might want to use the Filter Results Tool, which has --start-offset and --end-offset parameters, so you can "cut" your results file into "interesting" pieces and plot them according to your requirements.
You can install the Filter Results Tool using the JMeter Plugins Manager.
Also be aware that, according to JMeter Best Practices, you should:
Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
Don't use "View Results Tree" or "View Results in Table" listeners during the load test, use them only during scripting phase to debug your scripts.
You can get whatever information you need from the .jtl results file; you can specify the test results location via the -l command-line argument.

To get summarized results per hour, add Generate Summary Results to your test plan:
Generates a summary of the test run so far to the log file and/or standard output
Set the interval in jmeter.properties to suit your needs, for example 1 hour (3600 seconds):
summariser.interval=3600
You will then get a summary of your requests every hour.

You can try the JMeter Backend Listener. It integrates with Graphite and InfluxDB. After storing the results in one of these time-series databases, you can display them in a Grafana dashboard. Grafana has its own filtering for showing results on an hourly, daily, or monthly basis, and so on.

How to pass output from a DataStage parallel job as input to another job?

My requirement is:
Parallel Job1: I extract data from a table when the row count is more than 0.
Parallel Job2 should be triggered in the sequencer only when the row count from the source query in Job1 is greater than 0.
I want to achieve this without creating any intermediate file in Job1.

So basically you want to take information from a data stream (of your Job1) and use it as a parameter in the sequence "above" it.
In your case, you want to decide at sequence level whether to run the subsequent jobs (if more than 0 rows are returned) or not.
There are two options for that:
Job1 writes the information to a file that is a value file of a parameter set. These files are stored in a fixed directory. The parameter from the value file can then be used in your sequence to decide on further processing. Details for parameter sets can be found here.
You could use a server job for Job1 and set a user status (BASIC function DSSetUserStatus) in a transformer. This is also passed back to the sequence and can be referenced in subsequent stages of the sequence. See the documentation; you will find plenty of other information on the internet about this topic as well.
There are more solutions to this problem, or let us call it a challenge. Another way might be a script called at sequence level that queries the database, which avoids Job1...
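The last idea, a script that queries the database directly, might be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for the real source, and the table name `source_table` and the query are hypothetical. How the sequence consumes the result (exit code or printed output) depends on how the script is invoked.

```python
import sqlite3


def row_count(conn, query):
    """Return the number of rows produced by the given source query."""
    cur = conn.execute(f"SELECT COUNT(*) FROM ({query})")
    return cur.fetchone()[0]


# In-memory stand-in for the real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER)")

source_query = "SELECT id FROM source_table"
empty_count = row_count(conn, source_query)    # no rows yet

conn.execute("INSERT INTO source_table VALUES (1)")
filled_count = row_count(conn, source_query)   # one row now

# The sequence would run Job2 only when the count is greater than 0.
print(empty_count > 0, filled_count > 0)
```

A real script would connect to the actual source database with the appropriate driver and could signal the result to the caller, for example by exiting with a non-zero code when no rows are found.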