ADF - Data flow limiting number of rows on group by

I have a data flow that reads data from an Excel file with >100,000 rows.
I've added a RowNumber column using a Surrogate Key transformation.
I've then added another column called BatchNumber with the following expression, so that each row is assigned to a batch:
ceil(RowNumber/$batchSize)
Then I've added a "group by" step using the BatchNumber value, so that the rows are grouped into batches of $batchSize rows.
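The intent is the same as this SQL sketch (dbo.SourceRows is just a stand-in for my Excel source; the real row number comes from the surrogate key):

-- Assign each row a row number, then a batch number for a batch size of 100
SELECT t.*,
       CEILING(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) / 100.0) AS BatchNumber
FROM dbo.SourceRows AS t;

Rows 1-100 land in batch 1, rows 101-200 in batch 2, and so on, so grouping by BatchNumber should give one group per 100 rows.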
My issue is that no matter what batch size I choose, the total number of rows output is always 1,000. For example:
Where $batchSize = 100, I get 10 batches of 100
Where $batchSize = 50, I get 20 batches of 50
I've tried running the pipeline using the activity runtime.

In the Data Factory data flow debug settings, there is a limit on how many rows are used for the debug preview dataset. By default, it is 1,000 rows. Only the number of rows you have specified as the limit in your debug settings will be queried by the data preview.
Turn on Data Flow Debug and click on Debug Settings.
Set the row limit to whatever you need, e.g. 100,000, and click Save.
The debug preview dataset will then include that many rows, although the data preview only shows a maximum of 100 columns.

I don't know what the limit is or where it is documented, but the issue appears to occur when using the Cache sink type.
I changed it to a dataset sink and wrote the data out to files, and it exported everything I was expecting.

Related

How do I make the trigger run after all the data is inserted by the batch class?

I want to use an Apex Batch class to insert 10,000 records into an object called A, and use an after insert trigger to update the Weight field of all 10,000 records to 100 if the largest Weight value among them is 100.
But right now, if the batch size is 500, the largest Weight value out of those 500 records is applied only to those 500 records.
Then, for the next 500 records, the largest Weight value in that chunk is applied to those 500 records.
For example, if the largest Weight value in the first 500 records is 50:
Weight field value for records 1-500: 50
If the largest Weight value in the next 500 records is 100:
Weight field value for records 501-1,000: 100
What I actually want is: if there are 10,000 records, take the largest Weight value across all 10,000 records and update the Weight field of every record with it.
How shall I do it?
Here's the code for the trigger I wrote.
trigger myObjectTrigger on myObject_status__c (after insert) {
    // Re-query the newly inserted records so they can be updated
    List<myObject_status__c> objectStatusList = [SELECT Id, Weight FROM myObject_status__c
                                                 WHERE Id IN :Trigger.newMap.keySet()
                                                 ORDER BY Weight DESC];
    // Highest Weight currently stored on the object
    Decimal maxWeight = [SELECT Id, Weight FROM myObject_status__c ORDER BY Weight DESC LIMIT 1].Weight;
    for (Integer i = 0; i < objectStatusList.size(); i++) {
        objectStatusList[i].Weight = maxWeight;
    }
    update objectStatusList;
}
A trigger will not know whether the batch is still going on. A trigger works on a scope of at most 200 records at a time and normally sees only those. There are ways around that (a static variable, perhaps?), but even then it would be limited to the batch's size, i.e. whatever came into a single execute(). So if you're running in chunks of 500, not even a static variable in the trigger would help you.
A couple of ideas:
How exactly do you know it'll be 10K? Are you inserting them based on another record? Are you using the "Iterator" variant of batch? Could you "prescan" the records you're about to insert, figure out the max weight, and then apply it as you insert, eliminating the need for an update?
If it's never going to be bigger than 10K (and there are no side effects, no DML running on update), you could combine Database.Stateful and the finish() method: keep updating the max value as you go through the execute() calls, then in finish() update the records one last time. That's cutting it close to the limits, though.
Can you "daisy chain"? Submit another batch from this batch's finish(), passing the same records and the max you figured out.
Can you stamp the records inserted in the same batch run with the same value, for example by putting the batch job's Id into a hidden field? Then have another batch (daisy chained?) that looks for them, finds the max in the given range, and applies it to any records that share the batch job Id but don't have the value applied yet.
Set the weight in the finish() method of the batch class; it runs once all the batches have finished. Track the max weight in a variable on the class (the class will need to implement Database.Stateful for the value to persist across execute() calls).

Change the config settings per column in a Grafana table

I have a table in Grafana that has several columns and uses the gauge display mode. Setting the min/max values for these columns is troublesome. If the column is a percentage value, the max is always known, so it can be hard-coded as 100 or 1. But for a column displaying, say, database sizes, the max is not known and will change.
I am running Grafana 8.1.2, so I tried the new 'Config from query results' transform for the first time. This works fine for altering the values of a single column, but not for more than one.
[Screenshot: Grafana table]
As you can see in the attached picture I have set the max for the database size using the new transform but I also need to be able to set the max for the log size column too.
The dashboard has 2 queries in it, both against MSSQL Server. Query A returns the results in table format and Query B returns the config settings: [Screenshot: query result]
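For reference, Query B is just a one-row query along these lines, whose value the transform maps to the column's Max (the table and column names here are hypothetical):

SELECT MAX(size_mb) AS max_db_size
FROM   database_sizes;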
I've then got the transform set up as follows: [Screenshot: transform setup]
Is there a way to set the min/max settings for multiple columns using this new transform that I'm missing, or some other technique to do it? Unfortunately (for me) Grafana seems to favour time series data, so it isn't as configurable for table data.

Data Factory / Data Flow - conditional split based on a number of records

I need to split a huge dataset into multiple files, and each file must not have more than 100,000 rows.
I don't know if this is possible with Data Flow and the conditional split?
If you simply want to split by a fixed number of rows, here is a simple test I created.
Declare a parameter inside the data flow to store the row count of your source dataset. If your source dataset is Azure SQL, you can use a Lookup activity to get the max row number. If your source dataset is Azure Storage, you can use an Azure Function activity to get the max row number. Then pass the value to the parameter.
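For example, the Lookup activity could run a simple count query like this (dbo.SourceTable is just a placeholder for your table; if the table already has an incremental Row_No column, MAX(Row_No) works just as well):

SELECT COUNT(*) AS RowCount
FROM dbo.SourceTable;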
Here, for the test, I set a static default value.
Then we can set the Number of partitions expression to $RowCount/10 if you want 10 rows per file.
We can set the file names after the split here.
My source dataset contains 50 rows, so ADF splits it into 5 files. Judging by the Id column, each file has taken 10 rows of data at random.
You can achieve this with 2 data flows: one to get the row count and another to do the partitioning. In the future this should also be achievable in a single data flow using a cache sink.

Get past row limitation

My report processes millions of records. When the number of rows gets too high, I get this error:
The number of rows or columns is too big. Try limiting the number of unique group values.
Details: The number of rows or columns exceeds its limit, 65535.
How can I work around (or increase) this limit?
This error is pretty straightforward. 65535 is 0xFFFF in hexadecimal, so once you hit that limit there are no more vacancies and the hotel is closed. Solutions include:
Reduce the number of rows displayed by using grouping in your crosstab or whatever.
Reduce the amount of incoming data to your report with record selection (parameters).
Perform the dependent calculations in a custom SQL statement, generated as a temporary table in your report. You can then pass the results into your report as fields, rather than having to print millions of lines (see the sketch below).
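As a rough sketch of that last idea (the table and column names here are purely hypothetical), the report would consume a pre-aggregated query instead of the raw detail rows:

SELECT region,
       product,
       SUM(amount) AS total_amount,
       COUNT(*)    AS line_count
FROM   sales_lines
GROUP  BY region, product;

The report then receives one row per region/product combination, which keeps the row count comfortably under the 65,535 limit.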

See length (count) of query results in workbench

I just started using MySQL Workbench (6.1). The default limit for queries is 1,000, and that's fine, I want to keep it.
But the results from the action output message will therefore always say "1000 rows returned".
Is there a setting to see the number of records the query would have returned had there been no limit? For sanity-checking query results?
I know this is late by a few years, but I think you're asking for a way to see the total row count at the bottom of the results pane, like in SQL Server. In SQL Server, you could also check the messages pane, which would say how many rows were returned. I was looking for exactly what you're asking for as well, and it seems there is no way to get it. If your table has a numeric ID that increases in order, you could order by ID descending and look at the biggest number there. That is what I've decided to do.
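For example, assuming the table has an auto-increment id column (my_table and id are just placeholders):

SELECT id FROM my_table ORDER BY id DESC LIMIT 1;

Bear in mind this only approximates the row count when the ids are contiguous and the query isn't filtered.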
The result is not always "1000 rows returned". If there are fewer records than that, you will get the actual count. If you want to know the total number of rows in a table, do a select count(*) on the table. Alternatively, you can switch off the automatic limit and have all records returned by MySQL Workbench, but that can be time- and memory-consuming for large tables.
I think removing the row limit will help. By default, MySQL Workbench limits the result set to 1,000 rows, but you can always disable the limit. Check out https://superuser.com/questions/240291/how-to-remove-1000-row-limit-in-mysql-workbench-queries for how to do that.
You can run a second query to check that:
select count(*) from (your original query) as t;
This will return the total number of rows in the actual result.
You can use the SQL COUNT function. It returns the total number of rows a query returns.
A sample query:
select count(*) from tableName where field1 = value1
In Workbench, in the dropdown menu at the top, set the limit to "Don't Limit". Then run the query to extract data from the table. Under the output pane below, the total count of the query results will be displayed in the Message column.