I have an ASP.NET application in which I am trying to insert rows into Google BigQuery through streaming (tabledata.insertAll()). I am doing this with an HTTP POST request, and in the request body I am supplying data with the following structure:
{
  "kind": "bigquery#tableDataInsertAllRequest",
  "rows": [
    {
      "insertId": string,
      "json": {
        (key): (value)
      }
    }
  ]
}
When I pass more than 100 rows (such as 101) in the request body, it gives me a 400 Bad Request error. But when I pass 100 rows or fewer, it works fine with no errors.
Is there any limit on the number of rows when using streaming?
The tabledata.insertAll() method has the following limits:
Maximum row size: 100 KB
Maximum data size of all rows, per insert: 1 MB
Maximum rows per second: 100 rows per second, per table, with occasional allowed bursts of up to 1,000 rows per second. If you exceed 100 rows per second for an extended period of time, throttling might occur.
Visit https://developers.google.com/bigquery/streaming-data-into-bigquery for the streaming quota policy.
Update from https://developers.google.com/bigquery/streaming-data-into-bigquery :
The following limits apply for streaming data into BigQuery.
Maximum row size: 20 KB
Maximum data size of all rows, per insert: 1 MB
Maximum rows per second: 10,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors. For additional support up to 100,000 rows per second, per table, please contact a sales representative.
Maximum bytes per second: 10 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
If you stay within these limits, I think you will stop getting this kind of error.
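To stay within those limits from application code, one option is to split the rows into smaller chunks and send one insertAll request per chunk. The sketch below is in Python rather than ASP.NET, just to illustrate the idea; PROJECT_ID, DATASET_ID, TABLE_ID and ACCESS_TOKEN are placeholders you would replace with your own values.
import time
import uuid
import requests

URL = ("https://www.googleapis.com/bigquery/v2/projects/PROJECT_ID"
       "/datasets/DATASET_ID/tables/TABLE_ID/insertAll")
HEADERS = {"Authorization": "Bearer ACCESS_TOKEN"}

def stream_rows(rows, chunk_size=100):
    """Send rows to insertAll in chunks small enough for the per-request limits."""
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        body = {
            "kind": "bigquery#tableDataInsertAllRequest",
            "rows": [{"insertId": str(uuid.uuid4()), "json": row}
                     for row in chunk],
        }
        response = requests.post(URL, headers=HEADERS, json=body)
        response.raise_for_status()
        time.sleep(1)  # crude pacing so the sustained rate stays near the quota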
Related
I want to use an Apex Batch class to insert 10,000 records into an object called A, and use an after insert trigger to update the Weight field of all 10,000 records to 100 if the largest Weight value among them is 100.
But currently, if the batch size is 500, the largest Weight value among those 500 records is applied only to those 500 records.
For the next 500 records, the largest Weight value among that group is applied to those 500 records.
For example, if the largest Weight among the first 500 records is 50:
Weight field value for records 1-500: 50
If the largest Weight among the next 500 records is 100:
Weight field value for records 501-1000: 100
What I want is: if there are 10,000 records, the largest Weight value out of all 10,000 records should be used.
I want to update the Weight field of all the records to that value.
How can I do this?
Here's the code for the trigger I wrote.
trigger myObjectTrigger on myObject_status__c (after insert) {
    // Re-query the records inserted in this trigger context
    List<myObject_status__c> objectStatusList = [SELECT Id, Weight FROM myObject_status__c WHERE Id IN :Trigger.newMap.keySet()];
    // Highest Weight across the whole object, not just this chunk
    Decimal maxWeight = [SELECT Weight FROM myObject_status__c ORDER BY Weight DESC LIMIT 1].Weight;
    for (Integer i = 0; i < objectStatusList.size(); i++) {
        objectStatusList[i].Weight = maxWeight;
    }
    update objectStatusList;
}
A trigger will not know whether the batch is still running. A trigger works on a scope of at most 200 records at a time and normally sees only those. There are ways around it (create some static variable?), but even then it would be limited to whatever the batch's size is, i.e. what came into a single execute(). So if you're running in chunks of 500, not even a static variable in the trigger would help you.
Couple ideas:
How exactly do you know it'll be 10K? Are you inserting them based on another record? Are you using the "Iterator" variant of batch? Could you "prescan" the records you're about to insert, figure out the max weight, then apply it as you insert, eliminating the need for the update?
If it's never going to be bigger than 10K (and there are no side effects, no DML running on update), you could combine Database.Stateful and the finish() method. Keep updating the max value as you go through the execute() calls, then in finish() update the records one last time. Cutting it real close, though.
Can you "daisy chain"? Submit another batch from this batch's finish(), passing the same records and the max you figured out.
Can you stamp the records inserted in the same batch run with the same value, for example by putting the batch job's Id into a hidden field? Then have another batch (daisy chained?) that looks for them, finds the max in the given range, and applies it to any records that share the batch job Id but don't have the value applied yet.
Set the weight in the finish() method of the batch class; it runs once all batches have finished. Track the maximum weight seen so far in a variable on the class (using Database.Stateful so it persists across batches).
I have a dataflow that reads data from an Excel file with more than 100,000 rows.
I've added a RowCount column using a Surrogate Key:
I've then added another column called BatchNumber with the following expression, so that each row is assigned to a batch
ceil(RowNumber/$batchSize)
Then I've added a "group by" step using the BatchNumber value, so that the rows are grouped into batches of $batchSize.
My issue is that no matter what batch size I choose, the total number of rows output is always 1,000. For example:
Where $batchSize = 100, I get 10 batches of 100
Where $batchSize = 50, I get 20 batches of 50
I've tried running the pipeline using the activity runtime.
In Data Factory Dataflow debug settings, there is a limit on how many rows are used for the debug preview dataset. By default, it is 1,000 rows. Only the number of rows you have specified as the limit in your debug settings will be queried by the data preview.
Turn on Dataflow Debug and click on Debug Settings.
Set the row limit to whatever you want, e.g. 100,000, and click Save.
The debug preview dataset will then use that many rows, although the debug preview only shows a maximum of 100 columns.
I don't know what the limit is or where it is documented, but it appears the issue was with using the Cache sink type.
I changed it to a dataset and output the data to files, and it exported everything I was expecting.
My report processes millions of records. When the number of rows gets too high, I get this error:
The number of rows or columns is too big. Try limiting the number of unique group values.
Details: The number of rows or columns exceeds its limit, 65535.
How can I work around (or increase) this limit?
This error is pretty straightforward. 65535 is 0xFFFF in hexadecimal, so once you hit that limit there are no more vacancies and the hotel is closed. Solutions include:
Reduce the number of rows displayed by using grouping in your crosstab or whatever.
Reduce the amount of incoming data to your report with Record Selection. (Parameters)
Perform the dependent calculations in a custom SQL statement, generated as a temporary table in your report. You can then pass the results into your report as fields, rather than having to print millions of lines.
I constructed the query this way:
...
val request = new SearchAnalyticsQueryRequest()
request.setStartDate(from)
request.setEndDate(to)
request.setDimensions(List("query", "page").asJava)
request.setRowLimit(5000)
request.setStartRow(0)
webmasters.searchanalytics().query(site, request)
The result has 3,343 rows.
I tried to implement paging, and for testing purposes set rowLimit to 1000,
expecting to get 1,000 rows, then another 1,000, then another 1,000, and finally 343 rows,
based on this documentation: https://developers.google.com/webmaster-tools/v3/how-tos/search_analytics
If your query has more than 5,000 rows of data, you can request data in batches of 5,000 rows at a time by sending multiple queries and incrementing the startRow value each time. Count the number of retrieved rows; if you get less than the number of rows requested, you have retrieved all the data. If your request ends exactly on the data boundary (for example, there are 5,000 rows and you requested startRow=0 and rowLimit=5000), on your next call you will get an empty response.
But I got only 559 rows!
When I set rowLimit to 100, I got 51 rows!
What am I doing wrong? :)
I noticed the same behavior; it looks like data sampling.
You can get more results (and therefore more accurate metrics) by fetching data day by day through multiple queries, instead of a single query spanning the whole date range.
Hope it helps!
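For reference, here is a rough sketch of that day-by-day approach combined with startRow paging. It is written in Python with the google-api-python-client webmasters v3 service rather than the Scala client from the question; "service" is assumed to be an already-authorized client and "site" a verified property URL.
from datetime import date, timedelta

def fetch_day(service, site, day, row_limit=5000):
    """Page through a single day with startRow until a short page comes back."""
    rows, start_row = [], 0
    while True:
        body = {
            "startDate": day.isoformat(),
            "endDate": day.isoformat(),
            "dimensions": ["query", "page"],
            "rowLimit": row_limit,
            "startRow": start_row,
        }
        response = service.searchanalytics().query(siteUrl=site, body=body).execute()
        batch = response.get("rows", [])
        rows.extend(batch)
        if len(batch) < row_limit:  # fewer rows than requested: no more data
            return rows
        start_row += row_limit

def fetch_range(service, site, start, end):
    """Fetch each day separately to avoid the sampling seen with wide date ranges."""
    all_rows, day = [], start
    while day <= end:
        all_rows.extend(fetch_day(service, site, day))
        day += timedelta(days=1)
    return all_rows

# e.g. fetch_range(service, site, date(2015, 1, 1), date(2015, 1, 31))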
I tested two scenarios, a single huge collection vs. multiple small collections, and found a huge difference in query performance. Here is what I did.
Case 1: I created a Product collection containing 10 million records for 10 different product types, exactly 1 million records per product type, and created an index on ProductType. When I ran a sample query with the condition ProductType=1 and ProductPrice>100 and limit(10), to return 10 records of ProductType=1 whose price is greater than 100, it took about 35 milliseconds when the collection has a lot of products whose price is more than 100, and the same query took about 8000 milliseconds (8 seconds) when there are very few products in ProductType=1 whose price is greater than 100.
Case 2: I created 10 different product collections, one per ProductType, each containing 1 million records. In collection 1, which contains the records for ProductType 1, when I ran the same sample query with the condition ProductPrice>100 and limit(10), to return 10 records whose price is greater than 100, it took about 2.5 milliseconds when the collection has a lot of products whose price is more than 100, and the same query took about 1500 milliseconds (1.5 seconds) when there are very few products whose price is greater than 100.
So why is there such a big difference? The only difference between case one and case two is one huge collection vs. multiple smaller collections, but I created the index on ProductType in the first case. I guess the performance difference is caused by that index, but I need it there, otherwise performance would be even worse. I expected some slowdown in the first case because of the index, but I didn't expect it to be about 10 times slower.
So, 8000 milliseconds vs. 1500 milliseconds, one huge collection vs. multiple small collections. Why?
Separating the collections gives you a free index without any real overhead. There is overhead for an index scan, especially if the index is not really helping you cut down on the number of results it has to scan (if you have a million results in the index, but you have to scan them all and inspect them, it's not going to help you much).
In short, separating them out is a valid optimization, but you should make your indexes better for your queries before you actually decide to take that route, which I consider a drastic measure (an index on product price might help you more in this case).
Using explain() can help you understand how queries work. Some basics are: You want a low nscanned to n ratio, ideally. You don't want scanAndOrder = true, and you don't want BasicCursor, usually (this means you're not using an index at all).
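As a small illustration of that last point, here is a pymongo sketch (field names taken from the question; the database/collection names and connection string are made up) that adds a compound index covering both the equality part and the range part of the query, then inspects the plan. On recent MongoDB versions explain() reports IXSCAN vs. COLLSCAN stages instead of the old BasicCursor / nscanned output, but the idea is the same.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
products = client["shop"]["products"]               # hypothetical db/collection names

# Compound index: equality on ProductType first, then the range on ProductPrice
products.create_index([("ProductType", ASCENDING), ("ProductPrice", ASCENDING)])

query = {"ProductType": 1, "ProductPrice": {"$gt": 100}}
top10 = list(products.find(query).limit(10))

# Check how the query executes: you want an index scan (IXSCAN), not a full
# collection scan, with keys examined close to the number of documents returned.
plan = products.find(query).limit(10).explain()
print(plan["queryPlanner"]["winningPlan"])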