How to find the size of a partition in Cassandra using ScalarDB

I am using the Scalar DB library, which adds ACID support to Cassandra. How can I get the size of a partition using Scalar DB?
With the Cassandra Java driver, I would call something like the following to get the size:
val resultSet = session.execute(partitionSizeQuery)
val resultSetSize = resultSet.all.size
What is the equivalent in ScalarDB?
I get Optional[Result] when I call
val result = transaction.get(getQuestion)
then I call .get to get the value of result if there is a value (after checking result.isPresent):
val resultGet = result.get
logger.trace(s"result is ${resultGet}")
I suppose the above will give me only one row.
I thought of using Scan as well, since it gives List[Result], but it is not clear from the documentation whether I'll get all the results or only up to some system/configurable limit.
How do I get the size of a partition even if I won't get all the rows in one go?

You need to do a scan and get the List<Result>.
It retrieves all the results from the database, but if the partition is big, doing so is not recommended.
You can limit the scan range by using withStart and withEnd API of Scan.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Whether only the specified range is retrieved could depend on the underlying implementation, but in the Cassandra case only the specified range is retrieved.
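A minimal sketch in Scala of what that could look like, based on the Scan API referenced above (the namespace, table, and column names here are hypothetical, and the clustering-key bounds are only illustrative):

import com.scalar.db.api.{DistributedTransaction, Result, Scan}
import com.scalar.db.io.{IntValue, Key}
import scala.collection.JavaConverters._

// Counts the rows of one partition (or a clustering-key sub-range of it).
def partitionSize(transaction: DistributedTransaction, questionId: Int): Int = {
  val scan = new Scan(new Key(new IntValue("question_id", questionId)))
    .withStart(new Key(new IntValue("answer_id", 0)))    // optional: narrow the range
    .withEnd(new Key(new IntValue("answer_id", 1000)))   // instead of the whole partition
  scan.forNamespace("qa").forTable("answers")            // hypothetical namespace/table

  val results: java.util.List[Result] = transaction.scan(scan)
  results.asScala.size                                   // rows returned for this range
}

Note that counting this way still pulls every row in the range back to the client, which is why keeping the start/end range small is recommended for big partitions.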

Related

Read full collection through spark mongo connector with sequential disk access?

I want to read a full MongoDB collection into Spark using the Mongo Spark connector (Scala API) as efficiently as possible in terms of disk I/O.
After reading the connector docs and code, I understand that the partitioners are all designed to compute the minimum and maximum boundaries of an indexed field. My understanding is (and my tests using explain show) that each cursor will scan the index for document keys within the computed boundaries and then fetch the corresponding documents.
My concern is that this index-scan approach will result in random disk reads and ultimately more IOPS than necessary. In my case, the problem is accentuated because the collection is larger than the available RAM (I know that's not recommended). Wouldn't it be orders of magnitude faster to use a natural-order cursor to read the documents as they are stored on disk? How can I accomplish this?
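For reference, a minimal sketch of the kind of read being discussed, assuming the MongoDB Spark connector 2.x Scala API (the connection URI, database/collection, and the partitioner choice below are placeholders, not a recommendation):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

object FullCollectionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("full-collection-read")
      // Placeholder URI, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycoll")
      // The partitioner computes index-based boundaries; each Spark task then
      // opens a cursor bounded by those values, which is the index-scan
      // behaviour the question is concerned about.
      .config("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
      .getOrCreate()

    val df = MongoSpark.load(spark)
    println(df.count())

    spark.stop()
  }
}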

Handling DB query "IN" with a list of values exceeding DB capacity

I am querying a CosmosDB with a huge list of ids, and I get an exception saying I have exceeded the permissible limit of 256 characters.
What is the best way to handle such huge queries?
The only way I can think of is to split the list and execute in batches.
Any other suggestions ?
If you're querying data this way then your model is likely not optimal. I would look to remodel your data such that you can query on another property shared by the items you are looking for (within the partition as well).
Note that this could also be achieved by using Change Feed to copy the data into another container with a different partition key and a new property that groups the data together. Whether you do this will depend on how often you run this query and whether it is cheaper than running this query in multiple batches.
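If you do fall back to splitting the list, a generic sketch of the batching approach (in Scala, not tied to a particular CosmosDB SDK; the batch size and the runQuery callback are placeholders you would wire up to your client):

// Splits the id list into chunks small enough to stay under the query limit
// and runs one IN query per chunk, concatenating the results.
def queryInBatches[T](ids: Seq[String],
                      batchSize: Int,
                      runQuery: String => Seq[T]): Seq[T] =
  ids.grouped(batchSize).toSeq.flatMap { batch =>
    val inClause = batch.map(id => s"'$id'").mkString(", ")
    runQuery(s"SELECT * FROM c WHERE c.id IN ($inClause)")
  }

// Example usage with a stubbed query runner that just prints the query text.
val allIds = (1 to 1000).map(i => s"id-$i")
val rows = queryInBatches(allIds, 100, q => { println(q); Seq.empty[String] })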

Fetching millions of records from Cassandra using Spark in Scala: performance issue

I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for SELECT COUNT(*) from the table. I need this count, and similar other counts, for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are 10 petabytes of data, for example, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which resolves the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over-count) and increments a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work like this, query the system.size_estimates table of each node, and split each range according to its size such that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans in the SELECT * RDDs, so you just have to replicate that.
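A rough sketch of that approach using the DataStax Java driver 3.x from Scala (the keyspace, table, and partition-key column names are hypothetical, and the per-size splitting of ranges is omitted; in a real cluster you would also query system.size_estimates on every node, since it only describes the local node's ranges):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object TokenRangeCount {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // 1. Read the token ranges this node knows about for the table.
    val ranges = session
      .execute("SELECT range_start, range_end FROM system.size_estimates " +
               "WHERE keyspace_name = 'ks' AND table_name = 'data'")
      .all().asScala
      .map(r => (r.getString("range_start"), r.getString("range_end")))

    // 2. Issue one token-restricted COUNT(*) per range and sum the results.
    val total = ranges.map { case (start, end) =>
      session
        .execute(s"SELECT COUNT(*) FROM ks.data " +
                 s"WHERE token(id) > $start AND token(id) <= $end")
        .one().getLong(0)
    }.sum

    println(s"approximate row count: $total")
    cluster.close()
  }
}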
The easiest, and probably safer and more accurate (but less efficient), option is to use Spark to just read the entire data set and then count it, without using an aggregation function.
How long does it take to run this query directly, without Spark? I think it is not possible to parallelize COUNT queries, so you won't benefit from using Spark for such queries.

Want to retrieve records from MongoDB batchwise

I am trying to retrieve records from MongoDB, approximately 50,000 of them, but when I execute the query Java runs out of heap space and the server crashes.
Following is my code:
List<FormData> forms = ds.find(FormData.class).field("formId").equal(formId).asList();
Can anyone help me with the syntax to fetch records from MongoDB in batches?
Thanks in advance
I am not sure if the Java implementation has this, but in the C# version there is a setBatchSize method.
Using that I could do:
foreach (var item in coll.find(...).setBatchSize(1000)) {
    // process item; the driver fetches the next batch of 1000 transparently
}
This will fetch all items the find matches, but not all at once; rather, 1000 in each batch. Your code will not see this "batching" as it is all handled within the enumeration. Once the loop tries to get the 1001st item, another batch will be fetched from the MongoDB server.
This should lessen the heap space problem.
http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
You could still have other problems depending on what you do within the loop, but that will be under your control.
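The Java driver has an equivalent knob, DBCursor.batchSize(...). A sketch in Scala against the legacy driver API (the host, database, collection, and formId value are placeholders):

import com.mongodb.{BasicDBObject, MongoClient}

object BatchedFetch {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("localhost", 27017)            // placeholder host/port
    val coll   = client.getDB("mydb").getCollection("FormData") // placeholder db/collection

    val cursor = coll.find(new BasicDBObject("formId", "someFormId"))
      .batchSize(1000) // the server returns 1000 documents per round trip

    while (cursor.hasNext) {
      val doc = cursor.next() // only the current batch is held in memory
      println(doc.get("_id"))
    }
    cursor.close()
    client.close()
  }
}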
Fetching 50k entries doesn't sound like a good idea with any database.
Depending on your use case, you might want to change your query or work with limit and offset:
ds.find(FormData.class)
.field("formId").equal(formId)
.limit(20).offset(0)
.asList();
Note that a range-based query is more efficient than working with limit and offset.
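A sketch of that range-based alternative, assuming a Morphia 1.x-style query API (org.mongodb.morphia) and an ObjectId _id on the mapped entity; instead of skipping with offset, you remember the last _id you saw and ask for everything after it:

import org.bson.types.ObjectId
import org.mongodb.morphia.Datastore
import scala.collection.JavaConverters._

// e.g. nextPage(ds, classOf[FormData], formId, None, 20) for the first page,
// then pass the last _id of that page as lastSeen for the next one.
def nextPage[T](ds: Datastore, entity: Class[T], formId: String,
                lastSeen: Option[ObjectId], pageSize: Int): List[T] = {
  var query = ds.find(entity).field("formId").equal(formId)
  // The range filter replaces offset: start strictly after the last _id already read.
  lastSeen.foreach(id => query = query.field("_id").greaterThan(id))
  query.order("_id").limit(pageSize).asList().asScala.toList
}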

Morphia: is there a difference between fetch and asList performance-wise

We are using Morphia 0.99 and Java driver 2.7.3. I would like to know whether there is any difference between fetching records one by one using fetch and retrieving the results with asList (assume that there is enough memory to retrieve the records through asList).
We iterate over a large collection; while using fetch I sometimes encounter a "cursor not found" exception on the server during the fetch operation, so I need to execute another command to continue. What could be the reason for this?
1) Fetch the record.
2) Do some calculation on it.
3) Save it back to the database again.
4) Fetch another record and repeat the steps until there aren't any more records.
So which one would be faster: fetching records one by one, or retrieving results in bulk using asList? Or is there no difference between them in the Morphia implementation?
Thanks for the answers
As far as I understand the implementation, fetch() streams results from the DB while asList() will load all query results into memory. So they will both get every object that matches the query, but asList() will load them all into memory while fetch() leaves it up to you.
For your use case, neither would be faster in terms of CPU, but fetch() should use less memory and not blow up in case you have a lot of DB records.
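A small sketch of that streaming style for the fetch/compute/save loop described in the question, again assuming a Morphia 1.x-style API (org.mongodb.morphia); only the record currently being processed is held in memory:

import org.mongodb.morphia.Datastore

def recomputeAll[T](ds: Datastore, entity: Class[T], update: T => Unit): Unit = {
  val it = ds.find(entity).fetch().iterator() // cursor-backed; nothing is buffered up front
  while (it.hasNext) {
    val record = it.next()
    update(record)   // do some calculation on it
    ds.save(record)  // save it back to the database
  }
}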
Judging from the source-code, asList() uses fetch() and aggregates the results for you, so I can't see much difference between the two.
One very useful difference would be if the following two conditions applied to your scenario:
You were using offset and limit in the query.
You were changing values on the object such that it would no longer be returned in the query.
So say you were doing a query on awesome=true, and you were using offset and limit to do multiple queries, returning 100 records at a time to make sure you didn't use up too much memory. If, in your iteration loop, you set awesome=false on an object and saved it, it would cause you to miss updating some records.
In a case like this, fetch() would be a better approach.