I'm using the BigQuery Java API to run ~1000 copy jobs simultaneously (with scala.concurrent.Future) with WriteDisposition WRITE_APPEND, but I'm getting
com.google.cloud.bigquery.BigQueryException: API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table
I thought this was caused by too much concurrency, so I tried using Monix's Task to limit the parallelism to at most 20:
def execute(queries: List[Query]): Future[Seq[Boolean]] = {
  val tasks: Iterator[Task[List[Boolean]]] = queries.map(q => BqApi.copyTable(q, destinationTable))
    .sliding(20, 20)
    .map(Task.gather(_))
  val results: Task[List[Boolean]] = Task.sequence(tasks)
    .map(_.flatten.toList)
  results.runAsync
}
where BqApi.copyTable executes the query, copies the result to the destination table, and then returns a Task[Boolean].
The same exception still happens.
But if I change the WriteDisposition to WRITE_TRUNCATE, the exception goes away.
Can anyone help me understand what happens under the hood, and why the BigQuery API behaves like this?
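For reference, here is a minimal Java sketch of what a query-to-destination-table job like BqApi.copyTable might boil down to, using the com.google.cloud.bigquery client. The method shape and table handling are assumptions, since BqApi is the asker's own wrapper; only the write disposition and the query-to-table pattern come from the question.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BqCopyExample {

    private static final BigQuery BIGQUERY = BigQueryOptions.getDefaultInstance().getService();

    // Runs the query and appends its result to the destination table,
    // reporting whether the job finished without errors.
    static boolean copyTable(String sql, TableId destinationTable) throws InterruptedException {
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
            .setDestinationTable(destinationTable)
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND) // switching to WRITE_TRUNCATE makes the error go away
            .build();

        Job job = BIGQUERY.create(JobInfo.of(config)).waitFor();
        return job != null && job.getStatus().getError() == null;
    }
}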
This message is encountered when a query exceeds the maximum response size. Since copy jobs use jobs.insert, you may be hitting the maximum row size listed in the query jobs limits. I suggest filing a BigQuery bug on its issue tracker to properly describe this behavior of the Java API.
16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json)
JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we found is that most of the time the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, printing the timeout exception over and over again. However, sometimes we get very lucky and it works, and sometimes it fails even earlier at the activity worker with the same exception. We believe this is because the data is too big (about 5MB) and cannot be sent within the timeout (judging from the log, we guess it's set to 2s). If we call child.run with small fake data, it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions for my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I should not return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows of records, put the 100 ids in an array, and pass that back to the workflow. The workflow then creates one child workflow per id. Each child workflow then calls another activity with the id to load the data from the DB. But that activity has to pass the huge JSON to the child workflow; is that OK? And for the RestActivitiesWorker making 100 inserts into the DB, what if it fails in the middle?
I guess it boils down to the fact that our workflow is trying to work directly with huge JSON. We are trying to load huge JSON (5-30MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields for some different logic, and finally save it in our DB. How should we do this with Temporal?
Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
Use an external blob store to save the large data and pass a reference to it as a parameter (a minimal sketch of this follows below).
Cache the data in a worker process or even on the host disk and route activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
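Here is a rough sketch of the first workaround with the Cadence Java client. The activity interface, blob-store helper, workflow names, and timeouts are all hypothetical, and exact builder/option names may differ between client versions. The idea: the REST activity uploads each huge JSON to a blob store and returns only small keys, the parent fans out child workflows over those keys with Async.function, and each child hands its key to an activity that downloads, transforms, and saves the record, so the 5MB payloads never pass through workflow inputs, outputs, or history.

import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical activities: only small strings cross the Cadence data plane.
public interface BlobActivities {
    // Calls the external REST API, uploads each huge JSON to a blob store
    // (e.g. S3/GCS), and returns only the blob keys.
    List<String> fetchAndStoreItems();

    // Downloads one item by key, applies the transformations, and saves the
    // result to the DB; returns a small status string.
    String processItem(String blobKey);
}

public interface ParentWorkflow {
    @WorkflowMethod
    void run();
}

public interface ChildWorkflow {
    @WorkflowMethod
    String run(String blobKey);
}

public class ParentWorkflowImpl implements ParentWorkflow {

    private final BlobActivities activities = Workflow.newActivityStub(
        BlobActivities.class,
        new ActivityOptions.Builder()
            .setScheduleToCloseTimeout(Duration.ofMinutes(10))
            .build());

    @Override
    public void run() {
        List<String> blobKeys = activities.fetchAndStoreItems();

        // Fan out one child workflow per key; each argument is a few bytes.
        List<Promise<String>> results = new ArrayList<>();
        for (String key : blobKeys) {
            ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
            results.add(Async.function(child::run, key));
        }
        for (Promise<String> result : results) {
            result.get(); // wait for all children to finish
        }
    }
}

public class ChildWorkflowImpl implements ChildWorkflow {

    private final BlobActivities activities = Workflow.newActivityStub(
        BlobActivities.class,
        new ActivityOptions.Builder()
            .setScheduleToCloseTimeout(Duration.ofMinutes(10))
            .build());

    @Override
    public String run(String blobKey) {
        // The heavy lifting happens inside the activity; the child workflow
        // only ever sees the key and a small result.
        return activities.processItem(blobKey);
    }
}

Worker registration and activity implementations are omitted; the point is that failures of the 100 uploads or 100 stores are retried by the activity framework rather than by passing partial blobs through workflow history.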
I am trying to read a MongoDB collection with 234 million records in Spark. I want only one field.
case class Linkedin_Profile(experience: Array[Experience])
case class Experience(company: String)

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val rdd = MongoSpark.load(sc, ReadConfig(Map("uri" -> mongo_uri_linkedin)))
val company_DS = rdd.toDS[Linkedin_Profile]()
val count_udf = udf((x: scala.collection.mutable.WrappedArray[String]) =>
  x.filter(_ != null).groupBy(identity).mapValues(_.size))
val company_ColCount = company_DS.select(explode(count_udf($"experience.company")))
company_ColCount.rdd.saveAsTextFile("/dbfs/FileStore/chandan/intermediate_count_results.csv")
The job runs for 1 hour with half of the jobs completed, but after that it gives an error:
com.mongodb.MongoCursorNotFoundException:
Query failed with error code -5 and error message
'Cursor 8962537864706894243 not found on server cluster0-shard-01-00-i7t2t.mongodb.net:37017'
on server cluster0-shard-01-00-i7t2t.mongodb.net:37017
I tried changing the configuration as below, but to no avail.
System.setProperty("spark.mongodb.keep_alive_ms", "7200000")
Please suggest how to read this large collection.
The config property spark.mongodb.keep_alive_ms is meant to control the life of the client. See the docs here.
The issue you're experiencing seems to be related to server-side configuration. According to what's documented on this issue:
By specifying the cursorTimeoutMillis option, administrators can configure mongod or mongos to automatically remove idle client cursors after a specified interval. The timeout applies to all cursors maintained on a mongod or mongos, may be specified when starting the mongod or mongos, and may be modified at any time using the setParameter command.
So, try starting your mongod daemon with specified cursorTimeoutMillis, such as:
mongod --setParameter cursorTimeoutMillis=10800000
This command tries to instruct the server to keep cursors valid for 3 hours.
Although this may in theory get rid of the annoyance, it is still a good idea to get the reads to complete faster. You might want to look into limiting the dataset you load into Spark to what you really need. There may be many options worth looking into for tuning the read speed.
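If you can administer the deployment but cannot restart it, the quoted documentation also says the parameter can be changed at runtime with setParameter. A small sketch of issuing that command through the MongoDB Java driver (the connection string is a placeholder):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class CursorTimeoutConfig {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://<admin-user>@<host>:27017")) {
            // Equivalent of running setParameter against the admin database,
            // raising the idle-cursor timeout to 3 hours.
            Document result = client.getDatabase("admin").runCommand(
                new Document("setParameter", 1).append("cursorTimeoutMillis", 10800000L));
            System.out.println(result.toJson());
        }
    }
}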
Yes, by specifying the cursorTimeoutMillis option you can avoid this.
But if you are not the administrator, you can cache the MongoRDD with an action first and then do the rest of the processing in the Spark environment (a sketch of this follows below).
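For example, here is a minimal Java sketch of that idea, assuming the MongoDB Spark connector's Java API; the URI and the choice of storage level are placeholders. It pushes a $project down to MongoDB so only experience.company is fetched, then persists the RDD and forces it with an action so the long-running aggregation no longer keeps a Mongo cursor alive.

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import java.util.Collections;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import org.bson.Document;

public class CompanyFieldLoader {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("linkedin-company-count")
            .set("spark.mongodb.input.uri", "<mongo_uri_linkedin>"); // placeholder
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Project down to the one field on the server, so far less data
        // streams through the cursor.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc).withPipeline(
            Collections.singletonList(
                Document.parse("{ $project: { \"experience.company\": 1, \"_id\": 0 } }")));

        // Persist and force materialization with an action; later stages then
        // read from Spark storage instead of a long-lived Mongo cursor.
        JavaRDD<Document> cached = rdd.persist(StorageLevel.MEMORY_AND_DISK());
        System.out.println("records cached: " + cached.count());

        // ... run the per-company counting and save the result from `cached` ...

        jsc.stop();
    }
}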
I have a Grails application. I'm using Amazon DynamoDB for a specific requirement; the table is accessed, and entries are added, by a different application. Now I need to copy all the information from the DynamoDB table to a PostgreSQL table. There are over 10,000 records in DynamoDB, but the throughput is
Read capacity units : 100
Write capacity units : 100
In BuildConfig.groovy I have defined the plugin
compile ":dynamodb:0.1.1"
In Config.groovy I have the following configuration
grails {
    dynamodb {
        accessKey = '***'
        secretKey = '***'
        disableDrop = true
        dbCreate = 'create'
    }
}
The code I have looks something like this
class book {
    Long id
    String author
    String name
    Date publishedDate

    static constraints = {
    }

    static mapWith = "dynamodb"

    static mapping = {
        table 'book'
        throughput read: 100
    }
}
When I try something like book.findAll() I get the following error
AmazonClientException: Unable to unmarshall response (Connection reset)
And when I tried to reduce the number of records with something like book.findAllByAuthor() (which would also return thousands of records), I get the following error
Caused by ProvisionedThroughputExceededException: Status Code: 400, AWS Service: AmazonDynamoDB, AWS Request ID: ***, AWS Error Code: ProvisionedThroughputExceededException, AWS Error Message: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
I need to get all the records from DynamoDB despite the throughput restriction and save them in a Postgres table. Is there a way to do so?
I'm very new to this area, thanks in advance for the help.
After some research I came across Google Guava. But even to use Guava's RateLimiter, there isn't a fixed number of requests I would need to send, or a known duration for how long it would take. So I'm looking for a solution that suits the requirement I have.
Your issue is probably not connected with Grails at all. The returned error message says: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API
So you should either consider increasing the throughput level (for this option you have to pay more) or adjust your queries to stay within the current limits.
Check out also this answer: https://stackoverflow.com/a/31484168/2166188
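If GORM's findAll keeps blowing through the provisioned capacity, one option is to bypass the plugin for this one-off export and scan the table in pages directly with the AWS SDK, throttling with Guava's RateLimiter (which the question already mentions). A rough sketch, assuming the AWS SDK for Java v1 and a hypothetical saveBatchToPostgres helper; the rate, page size, and table name are placeholders:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ReturnConsumedCapacity;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import com.google.common.util.concurrent.RateLimiter;
import java.util.List;
import java.util.Map;

public class DynamoToPostgres {

    // Stay well under the 100 provisioned read capacity units.
    private static final RateLimiter READ_LIMITER = RateLimiter.create(25.0);

    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> lastKey = null;
        do {
            ScanRequest request = new ScanRequest()
                .withTableName("book")
                .withLimit(100)                       // small pages keep the per-call cost low
                .withExclusiveStartKey(lastKey)
                .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL);

            ScanResult result = dynamo.scan(request);
            saveBatchToPostgres(result.getItems());   // hypothetical JDBC batch insert

            // Throttle: pay for the capacity this page consumed before asking for the next one.
            double consumed = result.getConsumedCapacity().getCapacityUnits();
            READ_LIMITER.acquire(Math.max(1, (int) Math.ceil(consumed)));

            lastKey = result.getLastEvaluatedKey();
        } while (lastKey != null && !lastKey.isEmpty());
    }

    private static void saveBatchToPostgres(List<Map<String, AttributeValue>> items) {
        // Map DynamoDB attributes to columns and batch-insert into PostgreSQL here.
    }
}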
At the top of every minute my code uploads between 20 and 40 files total (from multiple machines, about 5 files in parallel until they are all uploaded) to Google Cloud Storage. I frequently get 429 - Too Many Requests errors, like the following:
java.io.IOException: Error inserting: bucket: mybucket, object: work/foo/hour/out/2015/08/21/1440191400003-e7ba2b0c-b71b-460a-9095-74f37661ae83/2015-08-21T20-00-00Z/
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1583)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:474)
... 3 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
"code" : 429,
"errors" : [ {
"domain" : "usageLimits",
"message" : "The total number of changes to the object mybucket/work/foo/hour/out/2015/08/21/1440191400003-e7ba2b0c-b71b-460a-9095-74f37661ae83/2015-08-21T20-00-00Z/ exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"reason" : "rateLimitExceeded"
} ],
"message" : "The total number of changes to the object mybucket/work/foo/hour/out/2015/08/21/1440191400003-e7ba2b0c-b71b-460a-9095-74f37661ae83/2015-08-21T20-00-00Z/ exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:471)
... 3 more
I have some retry logic, which helps a bit, but even after some exponential backoff and up to 3 retries, I still often get the error.
Strangely, when I go to the Google Developers Console -> APIs & auth -> APIs -> Cloud Storage API -> Quotas, I see Per-user limit 102,406.11 requests/second/user. When I look at the Usage tab, it shows no usage.
What am I missing? How do I stop getting rate limited when uploading files to GCS? Why is my quota so high and my usage reported as 0?
Judging by your description of multiple machines all taking an action at the same moment, I suspect all of your machines are attempting to write exactly the same object name at the same moment. GCS limits the number of writes per second against any one single object (1 per second).
Since it looks like your object names end in a slash, like they're meant to be a directory (work/foo/hour/out/2015/08/21/1440191400003-e7ba2b0c-b71b-460a-9095-74f37661ae83/2015-08-21T20-00-00Z/ ), is it possible you meant to end them with some unique value or a machine name or something but left that bit off?
That error happens when you attempt to update the same object too frequently. From https://cloud.google.com/storage/docs/concepts-techniques#object-updates:
There is no limit to how quickly you can create or update different objects in a bucket. However, a single particular object can only be updated or overwritten up to once per second.
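A minimal Java sketch of that fix, using the com.google.cloud.storage client rather than the Hadoop connector from the stack trace (the naming scheme, retry count, and delays are placeholders): give every machine and file its own object name so concurrent writers never touch the same object, and back off on any residual 429s.

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class GcsUploader {

    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    // Give every machine/file its own object name so no two writers ever
    // update the same object (GCS allows roughly one update per second per object).
    static String uniqueObjectName(String prefix, String machineId, String fileName) {
        return prefix + "/" + machineId + "/" + UUID.randomUUID() + "-" + fileName;
    }

    static void uploadWithBackoff(String bucket, String objectName, Path file) throws Exception {
        long delayMs = 1000;
        for (int attempt = 1; ; attempt++) {
            try {
                BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, objectName)).build();
                STORAGE.create(blob, Files.readAllBytes(file));
                return;
            } catch (StorageException e) {
                if (e.getCode() != 429 || attempt >= 5) {
                    throw e;
                }
                // Exponential backoff with a little jitter before retrying the 429.
                Thread.sleep(delayMs + (long) (Math.random() * 500));
                delayMs *= 2;
            }
        }
    }
}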
Does Facebook's Graph API have some sort of limit on the size of the JSON object that is returned from its queries?
When I request a lot of a user's friends' information, I sometimes get an error code of 1 - unknown error. This happens when I run the following query for a user who has a lot of Facebook friends (200 and up)
me/friends/?fields=id,name,gender,birthday,cover,significant_other,languages,education,work,
checkins.limit(1).fields(place,id,created_time),
likes.limit(5).fields(id,name,created_time),
statuses.limit(5).fields(message,updated_time),
movies.limit(5).fields(name,created_time,id),
music.limit(5).fields(name,created_time,id),
books.limit(5).fields(name,created_time,id),
games.limit(5).fields(name,created_time,id),
interests.limit(5).fields(name)
I tried this on the Graph Explorer and it returned this error
{
"error": "Request failed"
}
If I run the same request with fewer friends (125 or so), I get back all the data I expect.
It seems like the error is happening because the number of bytes in the JSON that is returned is larger than some threshold, but I haven't seen anything in the docs to corroborate this.
What would cause this error to happen? Has anyone faced this issue before? Any ideas on how to mitigate this?
Solutions I've Considered
Limit the number of friends returned, and if the error still occurs, lower that limit for the next batch, and if the error occurs again, lower the limit further, etc. (see the sketch after this list) - this solution isn't ideal but will probably work for most cases
Split up the queries into multiple requests - this approach would increase the API calls significantly (risking throttling) since it is no longer part of one paged request
Use FQL instead of Graph API - I haven't done enough research into this, but I believe that I would have to query each entity (likes, checkins, etc) one at a time which would increase the API calls significantly and risk throttling
In the end, all of these solutions are still subject to the same unknown error to some degree, since I can't predict the size of the object that will be returned (a status message could be a few words or a few paragraphs). It would be ideal to get a handle on why this error is happening before going off and implementing a workaround.
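For completeness, here is a rough Java sketch of the first approach. The endpoint, trimmed field list, starting limit, and floor are placeholders, and error handling is reduced to retrying with a smaller page whenever the request fails:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FriendsFetcher {

    // Trimmed field list; the full list from the question works the same way.
    private static final String FIELDS =
        "id,name,gender,birthday,statuses.limit(5).fields(message,updated_time)";

    // Approach 1: start with a page size and halve it whenever the request
    // fails, until it succeeds or the page size bottoms out.
    static String fetchFriendsPage(String accessToken) throws IOException {
        for (int limit = 100; limit >= 5; limit /= 2) {
            String url = "https://graph.facebook.com/me/friends"
                + "?fields=" + URLEncoder.encode(FIELDS, StandardCharsets.UTF_8.name())
                + "&limit=" + limit
                + "&access_token=" + accessToken;
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try {
                if (conn.getResponseCode() == 200) {
                    return new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
                }
                // "Request failed" / unknown error: fall through and retry with a smaller page.
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("Graph API request kept failing even with a small limit");
    }
}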
Thanks in advance!