How to improve CloudKit server latency when uploading data - cloudkit

I am having a hard time uploading data to my CloudKit container in a series of 'modify records' operations. I have an 'uploader' function in my app that can populate the CloudKit private database with a lot of user data. I batch the records into multiple CKModifyRecordsOperations, with at most 300 records in each operation, before I upload them. Even with a modest amount of data (less than 50MB), a simple upload can take dozens of minutes. This includes robust retry logic that reads the CKErrorRetryAfterKey value from any timed-out operations and replays them after the delay (which happens frequently).
I checked the CloudKit dashboard, and in the container's telemetry section the 'server latency' seems very high (over 100,000 at the 95th percentile). It also reports an 'average request size' of 150KB across the last few days of my testing, which doesn't seem like a lot, but the server response time averages 10 seconds per operation! This seems super slow.
I've tried throttling the requests so only 20 modify operations are sent at a time, but it doesn't seem to help. I have 'query' indexes on the 'recordName' field for each recordType, and 'query, searchable, sortable' indexes on some of the custom fields on the recordTypes (though not all). The CKModifyRecordsOperations' configurations have 'qualityOfService' set to 'userInitiated'. But none of this seems to help. I'm not sure what else I can try to improve the 'upload' times (downloading records seems to happen as expected).
Is there anything else I can try to improve the time it takes to upload a few thousand records? Or is it out of my control?
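For readers who want to see the flow described above in one place, here is a minimal sketch of the batching and retry-after logic. It is written in Python rather than Swift and is not CloudKit code: `send_batch` and `RetryAfter` are hypothetical stand-ins for submitting one CKModifyRecordsOperation and for an error carrying the CKErrorRetryAfterKey delay. It sends batches one at a time for simplicity, whereas the question sends up to 20 concurrently.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")

MAX_RECORDS_PER_OPERATION = 300  # batch size quoted in the question


class RetryAfter(Exception):
    """Hypothetical error: the server asked us to wait `delay` seconds."""

    def __init__(self, delay: float) -> None:
        super().__init__(f"retry after {delay}s")
        self.delay = delay


def chunked(records: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def upload_all(
    records: Sequence[T],
    send_batch: Callable[[Sequence[T]], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every batch in order, replaying a batch after the server's delay.

    Returns the number of batches sent successfully.
    """
    sent = 0
    for batch in chunked(records, MAX_RECORDS_PER_OPERATION):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(batch)
                sent += 1
                break
            except RetryAfter as err:
                if attempt == max_attempts:
                    raise  # give up on this batch after the last attempt
                sleep(err.delay)  # honor the server-provided delay
    return sent
```

Passing `sleep` as a parameter is only a convenience for testing the retry path without real waiting.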

Related

Trace cause of Firestore reads

I have been seeing an excessive number of Firestore reads over the past few weeks. My system was generally processing about 60k reads per day. About 3 weeks ago it jumped to roughly 10 million a day, and the past 2 days have hit over 40 million reads in a single day! My user base has not grown and my code has not changed, so there is no reason for this spike. I suspect an endpoint is being hit by someone outside the scope of my application who may be trying to penetrate the system or retrieve records. I have reached out to Firestore repeatedly for help with this, as it is becoming a huge loss every day this happens, but they are unable to assist me.
Is there a way to trace an origin of read requests or more importantly see counts for which collections or documents are being read? This must be traceable somehow as Firestore bills you per read but I cannot seem to find it.
There is currently no source IP address tracking with Cloud Firestore. All reads fall under the same bucket, which is that "they happened".
If you're building a mobile app, now would be a good time to use security rules to limit which authenticated users can read and write which parts of your database, so that it can't be accessed without limit from anywhere on the internet.

Why is saving data from an API to CSV faster than uploading it to a MongoDB database

My question revolves around understanding the following two procedures (particularly performance and code logic) that I used to collect trade data from the US Census Bureau API. I already collected the data, but I ended up writing two different ways of requesting and saving it, and my questions concern those.
Summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited my procedure using tiny-async-pool (which caps how many calls of a given function run concurrently) so that I would not request too much at once, hit a timeout, or overload my database with queries. Simply put, the bottleneck I was facing was the database: the API requests returned rather quickly (1-15 secs depending on body size), but saving each array item to its own mongodb document took from 100 ms to 700 ms (the returned data was a nested array of anywhere from a few hundred to over one hundred thousand items, with at most 10 values in each inner array). To avoid redoing the same queries after an error, I also checked my database before making each query to see if that query was already complete. In the end I did not follow this method, since it was very error prone and susceptible to timeouts if the data was very large (I even set the timeout to 10 minutes in the request options).
Second way: npm request and save data to csv
I used the same approach as the first method for the requests and concurrency; however, I saved each query to its own csv file. To avoid redoing successful queries after an error, I also checked whether the file already existed and, if so, skipped that query. This approach was error free: I ran it and after a few hours had all the data saved. Writing to csv was insanely fast, much more so than using mongodb.
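The "one file per query, skip if it exists" check from this second method can be sketched roughly as follows. This is in Python rather than the JavaScript I actually used, and `fetch_rows` is a hypothetical callable standing in for the API request. Writing to a temporary `.part` file and renaming it afterwards is an extra precaution that was not part of the original approach; it keeps an interrupted write from leaving a file that looks complete.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Sequence


def save_query_to_csv(
    query_id: str,
    fetch_rows: Callable[[str], Iterable[Sequence[object]]],
    out_dir: Path,
) -> bool:
    """Write one query's rows to <out_dir>/<query_id>.csv.

    Returns False (without calling fetch_rows) if the file already exists.
    """
    target = out_dir / f"{query_id}.csv"
    if target.exists():
        return False  # finished on an earlier run; skip the request

    out_dir.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    with tmp.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(fetch_rows(query_id))
    tmp.replace(target)  # rename only after the write has completed
    return True
```

Each call does a single sequential file write with no index to maintain, which is one plausible reason it felt so much faster than saving one document per array item.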
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used javascript because that's where I learned api requests and async operations, even though I will do most of my data analysis with python and pandas. I first tried the database method mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques I still could not get it to work properly. I resorted to the csv method, which was a) much less code to write, b) fewer checks, c) faster, and d) more reliable.
My final questions are these:
Why was the csv approach better than the database approach? Any counter arguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?

Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs), and it only needs to retain that data for a week or two. This data will be piped in at a constant rate (5,000 documents per second, for example), non-stop, 24 hours a day. After a week or two, a document should drop out of the index, never to be heard from again.
The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema could have 2 attributes that could be categorized.
The system needs to be indexed in near real-time as data is fed to it. A delay of 15 to 30 seconds is acceptable.
We prefer to stream data into the indexer/service with a constant stream of pipe data.
Lastly, a single stand-alone solution is preferred over any type of distributed setup (this will be part of a package to deploy and set up on local machines for testers).
I'm looking closely at Sphinx search engine with RT updates via the API as it checks off most of these. But, I am not seeing an easy way to expire documents after a certain length of time.
I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But that creates the problem of tracking large numbers of IDs in a separate datastore, which would need to handle the same 5,000-per-second inserts and delete the IDs when done.
I also have a concern about Sphinx index fragmentation from mass-inserting and from mass-deleting in the middle of inserting.
We would really prefer the search engine/indexer to handle the expiration itself.
I think I can use WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO as the where clause in the Sphinx API in order to gather the Document IDs to delete. The problem is that if the system does not stay on top of the deletes and has to gather a few days' worth of document IDs, the number of documents/search results will be in the tens of millions, maybe even billions, after a two-week timeframe. That's not a feasible query.
You can actually run
DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO
as a query to delete the old documents, which is much simpler :)
You will also need to call OPTIMIZE INDEX from time to time.
Both of these will have to be called on some sort of 'cron' schedule, as they won't be run automatically.
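As a rough sketch of what that scheduled job could send, the snippet below only builds the two statements as strings. The index name `rt` and the `timestamp` attribute are the ones used in the query above; the two-week retention constant and the function names are assumptions for illustration, and actually running the statements through a SphinxQL connection is left out.

```python
import time
from typing import List, Optional

RETENTION_SECONDS = 14 * 24 * 60 * 60  # two weeks; adjust to the real window


def build_expiry_statement(index: str = "rt", now: Optional[float] = None) -> str:
    """Return a DELETE statement for documents older than the retention window."""
    current = time.time() if now is None else now
    cutoff = int(current) - RETENTION_SECONDS
    return f"DELETE FROM {index} WHERE timestamp < {cutoff}"


def build_maintenance_statements(
    index: str = "rt", now: Optional[float] = None
) -> List[str]:
    """Statements for one scheduled run: expire old rows, then optimize."""
    return [build_expiry_statement(index, now), f"OPTIMIZE INDEX {index}"]
```

The cutoff is computed at run time, so the same job can be scheduled as often as needed to stay on top of the deletes.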
You might be better off not using Sphinx's DELETE function at all. When writing RT indexes, as soon as the RAM chunk is full it's written out as a disk chunk. So you end up with a number of disk chunks on the disk. The oldest documents will be in the oldest chunk, sequentially.
So to clear out the oldest documents, you could just dispose of the oldest chunks (on a rolling basis).
The problem is that Sphinx does not include a function to delete individual chunks.
You would need to shut down searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.
But in a more general sense, I'm not sure Sphinx will be able to keep up with a continuous stream of 5,000 documents per second (even ignoring deletes for a moment). Sphinx is generally designed for write-infrequently, read-frequently workloads. It builds a (for the most part) monolithic inverted index. This is great for querying, but is very hard to keep updated. It's not great for incremental updates.

Drools is very slow when we integrate with Talend ETL and process millions of records

We have used around 30 rules with multiple conditions in each. We are under the assumption that Drools takes one record, compares it against the rules, and then gives the output for each one. The time taken to process 1 million records is around 4 hours. Can we not process the records in batches, that is, in larger numbers, to reduce the processing time? Please help with this issue. Thanks for the response.
Inserting 1M facts in one batch is a very bad strategy (unless you need to find combinations out of the lot). The documentation makes it clear that all work (at least in 5.x) is done during inserts and modifications. (6.x is reportedly different, but it's still bad practice to needlessly fill your memory up with objects galore.)
Simply insert, and after some suitable number, call fireAllRules() and process (transmit,...) the results. Make sure that no "dead stock" remains in Working Memory from such a batch - this would also slow you down.
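The insert, fire, process, clean-up cycle can be sketched as below. This is Python pseudocode for the pattern, not the Drools API: `session` is a hypothetical object with `insert`, `fire_all_rules` and `delete` methods that mirror the roles of the corresponding Drools session calls, and `handle_results` is a hypothetical callback that transmits whatever the rules produced. The batch size of 1000 is an arbitrary assumption.

```python
from typing import Any, Callable, Iterable, List


def _flush(
    session: Any, handles: List[Any], handle_results: Callable[[Any], None]
) -> int:
    """Fire the rules for the current batch, hand off results, empty memory."""
    session.fire_all_rules()
    handle_results(session)
    for handle in handles:
        session.delete(handle)  # leave no "dead stock" in working memory
    count = len(handles)
    handles.clear()
    return count


def process_in_batches(
    records: Iterable[Any],
    session: Any,
    handle_results: Callable[[Any], None],
    batch_size: int = 1000,
) -> int:
    """Insert records in batches, firing rules after each batch.

    Returns the total number of records processed.
    """
    handles: List[Any] = []
    processed = 0
    for record in records:
        handles.append(session.insert(record))
        if len(handles) == batch_size:
            processed += _flush(session, handles, handle_results)
    if handles:  # final partial batch
        processed += _flush(session, handles, handle_results)
    return processed
```

If the rules themselves insert derived facts, those would need to be removed as well; the sketch only removes the facts it inserted.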

Large mongo update queue burst issue

I'm doing some user analytics tracking with mongo. I'm averaging about 200 updates a second to documents (around 400k of them) keyed on a user's email address. There are 3 shards, split alphabetically by email. It works pretty well except for the daily user state change scripts, which burst the requests to about 6k per second.
This causes a tailspin effect where it overloads the mongo queue and it never seems to catch up again. Scripts fail, bosses get angry, etc. They also won't allow the scripts to be throttled. Since they are update operations and not inserts, they cannot be submitted in bulk. The options I see are:
1) Finding a way to allocate a large queue to mongo so it can wait for low points and get the data updated
2) Writing a custom throttling solution
3) Finding a more efficient indexing strategy (currently just indexing the email address)
Pretty much anything is on the table.
Any help is greatly appreciated
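For option 2, one common shape for a custom throttle is a token bucket placed in front of the update calls. The sketch below is a minimal single-threaded version in Python; the class name, the `rate` and `capacity` parameters, and where it would sit relative to the scripts are all assumptions for illustration, and nothing in it is MongoDB-specific.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._sleep = sleep
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last refill."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        self._refill()
        while self._tokens < 1:
            # Sleep just long enough for the missing fraction to accumulate.
            self._sleep((1 - self._tokens) / self.rate)
            self._refill()
        self._tokens -= 1
```

Each update would call `acquire()` first, so a 6k-per-second burst gets spread out at the configured rate instead of piling up in the queue. The class is not thread-safe as written; sharing it between workers would need a lock.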