Retrieve large results ~ 1 billion using Typesafe Slick - postgresql

I am working on a cron job which needs to query Postgres on a daily basis. The table is huge ~ trillion records. On an average I would expect to retrieve about a billion records per execution. I couldn't find any documentation on using cursors or pagination for Slick 2.1.0 An easy approach I can think of is, get the count first and loop through using drop and take. Is there a better and efficient way to do this?

Map reduce using akka, postgresql-async, first count then distribute with offset+limit query to actors then map the data when needed then reduce the resul to elasticsearch or other store?

Related

How to get the lastest N records in a Mongodb collection that has millions of records?

Basically, I have the same question with this post: How to get the last N records in mongodb?
I saw the solution using sort() but would it be inefficient because it requires loading all of the data into memory, which can be slow and resource-intensive?
Would there be another way to solve this without having to use sort()?

Is it bad to have just 1 chunk size in Spring Batch?

I have to process a file which has records with same ID and different dates. If a specific ID has multiple records with the different dates, it has to sum all of it. Currently, my solution is writing by one chunk and and letting SQL query to do the summation part because I don't have a way to know if multiple entries of same ID are in the same chunk. Is there a huge performance effect of doing it this way especially that I am working on 100k worth of data?
Is there a huge performance effect of doing it this way especially that I am working on 100k worth of data?
Yes, this could impact the performance of your step since each item will be processed in its own transaction. With 100k you would have 100k transactions, whereas if chunk-size=1000 for example, you would have only 100 transactions.
The chunk-oriented processing model is not really suitable to what you are trying to do, as items with the same ID could span different chunks. A common technique for this kind of requirement is to load your data in a temporary table (which could be a very fast step if done against sqlite for example) and then run your aggregation SQL query against that table.

Fetching millions of records from cassandra using spark in scala performance issue

I have tried single node cluster and 3 node cluster on my local machine to fetch 2.5 million entries from cassandra using spark but in both scenarios it is takes 30 seconds just for SELECT COUNT(*) from table. I need this and similarly other counts for real time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If theres 10 petabytes in data for example this query would require reading 10 petabytes off disk, bring it into memory, stream it to coordinator which will resolve the tombstones/deduplication (you cant just have each replica send a count or you will massively under/over count it) and increment a counter. This is not going to work in a 5 second timeout. You can use aggregation functions over smaller chunks of the data but not in a single query.
If you really want to make this work like this, query the system.size_estimates table of each node, and for each range split according to the size such that you get an approximate max of say 5k per read. Then issue a count(*) for each with a TOKEN restriction for each of the split ranges and combine value of all those queries. This is how spark connector does its full table scans in the SELECT * rrds so you just have to replicate that.
Easiest and probably safer and more accurate (but less efficient) is to use spark to just read the entire data set and then count, not using an aggregation function.
How much does it take to run this query directly without Spark? I think that it is not possible to parallelize COUNT queries so you won't benefit from using Spark for performing such queries.

The fastest way to get count of processed dataframe records in Spark

I'm currently developing a Spark Application that uses dataframes to compute and aggregates specific columns from a hive table.
Aside from using count() function in dataframes/rdd. Is there a more optimal approach to get the number of records processed or number of count of records of a dataframe ?
I just need to know if there's something needed to override a specific function or so.
Any replies would be appreciated. I'm currently using Apache spark 1.6.
Thank you.
Aside from using count() function in dataframes/rdd, is there a more optimal
approach to get the number of records processed or number of count of records
of a dataframe?
Nope. Since an RDD may have an arbitrarily complex execution plan, involving JDBC table queries, file scans, etc., there's no apriori way to determine its size short of counting.

MongoDB: what is faster: single find() query or many find_one()?

I have the following problem connected to the MongoDB database design. Here is my situation:
I have a collection with about 50k documents (15kB each),
every document have a dictionary storing data samples,
my query always gets all the data from the document,
every query uses an index,
the collection have only one index (based on a single datetime field),
in most cases, I need to get data from many documents (typically 25 < N < 100),
it is easier for me to perform many SELECT queries over a single one,
I have a lot of updates in my databases, much less than SELECT ones,
I use the WiredTiger engine (the newest version of MongoDB),
server instance and web application are on the same machine.
I have two possibilities for making a SELECT query:
perform a single query retrieving all documents I am interested in,
perform N queries, everyone gets a single document, where typically 25 < N < 100 (what about a different scenario when 100 < N < 1k or 1k < N < 10k?)
So the question is if there is any additional overhead when I perform many small queries over a single one? In relational databases making many queries is a very bad practice - but in NoSQL? I am asking about a general practice - should I avoid that much queries?
In the documentation, I read that the number of queries is not important but the number of searches over documents - is that true?
Thanks for help ;)
There is a similar question like the one you asked : Is it ok to query mongodb multiple times
IMO, for your use-case i.e. 25<N<100, one should definitely go with batching.
In case of Single queries :
Looping in a single thread will not suffice, you'll have to make parallel requests which would create additional overhead
creates tcp/ip overhead for every request
there is a certain amount of setup and teardown for each query creating and exhausting cursors which would create unnecessary overhead.
As explained in the answer above, there appears be a sweet-spot for how many values to batch up vs. the number of round trips and that depends on your document type as well.
In broader terms, anything 10<N<1000 should go with batching and the remaining records should form part of other batches but querying single document at a time would definitely create unnecessary overhead.
The problem when you perform small queries over one query is network overhead that is the network latency roundtrip.
For a single request in a batch processing it may be not much, but if you make multiple requests like these or use this technique on frontend it will decrease performance.
Also you may need to preprocess the data like sorting aggregating it manually.