Fetching millions of records from Cassandra using Spark in Scala: performance issue - scala

I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for SELECT COUNT(*) from the table. I need this count, and similar counts, for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()

Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are 10 petabytes of data, for example, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which must resolve tombstones and deduplicate (you can't just have each replica send a count, or you will massively under- or over-count) while incrementing a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work, query the system.size_estimates table on each node, and split each token range according to its size so that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans in the SELECT * RDDs, so you just have to replicate that.
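The range-splitting-and-combining logic can be sketched in plain Python (a simulation only: the ranges and row estimates would really come from system.size_estimates, and count_range stands in for a per-range COUNT(*) query with a TOKEN restriction):

```python
def split_range(start, end, estimated_rows, max_rows_per_read=5000):
    """Split one token range into sub-ranges of roughly max_rows_per_read rows."""
    n_splits = max(1, -(-estimated_rows // max_rows_per_read))  # ceiling division
    step = (end - start) // n_splits
    bounds = [start + i * step for i in range(n_splits)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def count_table(size_estimates, count_range):
    """size_estimates: list of (range_start, range_end, estimated_rows) tuples.
    count_range(a, b): stand-in for
    SELECT COUNT(*) FROM data WHERE TOKEN(pk) > a AND TOKEN(pk) <= b."""
    total = 0
    for start, end, estimated in size_estimates:
        for sub_start, sub_end in split_range(start, end, estimated):
            total += count_range(sub_start, sub_end)
    return total
```

Each sub-query is small enough to finish well inside the read timeout, and the sub-counts add up to the full table count because the token ranges partition the ring.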
The easiest approach, and probably the safer and more accurate one (though less efficient), is to use Spark to read the entire data set and then count it, without using an aggregation function.

How long does it take to run this query directly, without Spark? I think it is not possible to parallelize a single COUNT query, so you won't benefit from using Spark for such queries.

Related

value count in pyspark is taking too much time

Spark is very efficient at reading through a billion-row dataset within 4 seconds, but counting the distinct values in a DataFrame is pretty slow and less efficient; it takes more than 5 minutes even for a small set of data. I have tried these approaches:
value1 = df.where(df['values'] == '1').count()
or
df.groupBy("values").count().orderBy("value_count").show()
Both return the correct result, but time is of the essence here.
I know that count is a lazy operator, but is there an alternative approach to solve this problem?
TIA
count() is a function that makes Spark literally count through the rows. Operations like count(), distinct(), etc. will obviously take time due to the nature of those operations, and are advised against in a distributed environment.
Spark's data structures do not support indexing, so count() requires close to a full scan of the data.
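One practical consequence: if you need counts for several values, compute them all in one pass instead of issuing a filtered count() per value. In plain Python, as a stand-in for df.groupBy("values").count() (the sample data is made up):

```python
from collections import Counter

# One scan produces every group's count at once, the same idea as
# df.groupBy("values").count() versus a separate filtered count() per value.
rows = ["1", "2", "1", "3", "1", "2"]  # stand-in for the 'values' column
counts = Counter(rows)

print(counts["1"])  # prints 3
```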

The fastest way to get count of processed dataframe records in Spark

I'm currently developing a Spark application that uses DataFrames to compute and aggregate specific columns from a Hive table.
Aside from using the count() function on DataFrames/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a DataFrame?
I just need to know whether I need to override a specific function or something like that.
Any replies would be appreciated. I'm currently using Apache Spark 1.6.
Thank you.
Aside from using the count() function on DataFrames/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a DataFrame?
Nope. Since an RDD may have an arbitrarily complex execution plan, involving JDBC table queries, file scans, etc., there's no a priori way to determine its size short of counting.
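What you can do is make the count a by-product of a pass you were already making rather than a second scan; in Spark this is typically an accumulator incremented inside a map. A plain-Python sketch of the idea (the wrapper class is illustrative, not a Spark API):

```python
class CountingIterator:
    """Wraps an iterable and counts elements as they stream through,
    analogous to incrementing a Spark accumulator inside a map()."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._it)
        self.count += 1
        return item

records = CountingIterator(range(100))
processed = [x * 2 for x in records]  # the "real" processing pass
print(records.count)  # prints 100; no extra counting pass was needed
```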

Timeseries storage in MongoDB

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day; I don't do any cross-sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is best:
1. each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
2. each sensor corresponds to a collection; I have a fixed number of 1000 collections that grow every day by about 100,000 documents each
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring 1, but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurements plus startDate and nextDate fields. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think that having so many collections is not a good idea.
What do you think?
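The bucketing scheme described in the update can be sketched in Python, with hypothetical field names (sensor, startDate, nextDate, samples) matching the description:

```python
def bucket_measurements(sensor_id, measurements, bucket_size=128):
    """Group consecutive (timestamp, value) readings from one sensor into
    bucket documents of up to bucket_size samples each, so a day's data
    becomes ~780 documents per sensor instead of 100,000."""
    docs = []
    for i in range(0, len(measurements), bucket_size):
        chunk = measurements[i:i + bucket_size]
        docs.append({
            "sensor": sensor_id,
            "startDate": chunk[0][0],   # first timestamp in the bucket
            "nextDate": chunk[-1][0],   # last timestamp covered
            "samples": chunk,
        })
    return docs
```

With this shape, an index on (sensor, startDate) covers the per-day, per-sensor queries regardless of whether you go with one collection per sensor or one shared collection.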
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data that lives in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data volume is significant. Have you considered ways to reduce it? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time t. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
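The same-value-run compression described here can be sketched in a few lines (a simulation only; the reconstruction assumes a stepped sensor whose value holds until the next stored change):

```python
def compress(readings):
    """readings: (timestamp, value) pairs sorted by time.
    Keep a reading only when the value differs from the previous one."""
    out = []
    for ts, val in readings:
        if not out or out[-1][1] != val:
            out.append((ts, val))
    return out

def value_at(compressed, t):
    """Value at time t: the last stored change at or before t."""
    result = None
    for ts, val in compressed:
        if ts <= t:
            result = val
        else:
            break
    return result
```

A set-point logged every second but changed only a few times a day compresses by several orders of magnitude this way; a continuously varying sensor would instead need the midpoint-skipping plus interpolation approach.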
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I'm only interested in getting data from a given sensor on a given day. I don't do any cross sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of our projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, granulating the data over multiple MongoDB collections, changing replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded under unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in read performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always use a Spark (map-reduce) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping a legacy technology is fine as long as it suits you well, but if not, it's more effective to change the technology stack.
If I understand correctly, you plan to create collections on the fly, i.e. at 12 AM each day you will have new collections. I think MongoDB is the wrong choice for this: in MongoDB there is no way to query documents across collections, so you would have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017 and do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so that you get optimal performance that does not degrade over time.
Approach 1 is not ideal; the key to speeding things up is divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place each signal in its own collection and shard the signals over nodes to speed up reads. Multiple collections (signals) can live on the same node.
How this will help
Signal processing usually operates over a time span, e.g. processing a signal over 3 days; in that case you can read from 3 nodes in parallel for that signal and do the processing in parallel with Apache Spark.
Cross-signal processing: most signal processing algorithms analyze the same period across 2 or more signals (e.g. cross-correlation), and since those signals are fetched in parallel, that will also be fast; the pre-processing of individual signals can be parallelized as well.

Retrieve large results ~ 1 billion using Typesafe Slick

I am working on a cron job which needs to query Postgres on a daily basis. The table is huge, about a trillion records, and on average I would expect to retrieve about a billion records per execution. I couldn't find any documentation on using cursors or pagination in Slick 2.1.0. An easy approach I can think of is to get the count first and loop through the results using drop and take. Is there a better, more efficient way to do this?
Map-reduce using Akka and postgresql-async: first count, then distribute offset+limit queries to actors, then map the data as needed, then reduce the result to Elasticsearch or another store?
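That count-then-distribute plan can be sketched in Python, with fetch_page standing in for a Slick query using drop(offset).take(limit) and a thread pool standing in for the actors:

```python
from concurrent.futures import ThreadPoolExecutor

TABLE = list(range(1000))  # stand-in for the Postgres table

def fetch_page(offset, limit):
    # stand-in for a Slick-style query with .drop(offset).take(limit)
    return TABLE[offset:offset + limit]

def fetch_all(total, page_size=100, workers=4):
    """Derive (offset, limit) windows from the count, fetch them
    concurrently, and combine the pages in order."""
    windows = [(off, page_size) for off in range(0, total, page_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(lambda w: fetch_page(*w), windows)
    return [row for page in pages for row in page]
```

One caveat with OFFSET on a table this size: Postgres still has to walk past the skipped rows, so deep offsets get progressively slower. Keyset pagination (WHERE id > last_seen ORDER BY id LIMIT n) keeps each page cheap if the table has a suitable ordered key.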

Returning aggregations from HBase data

I have an HBase table of about 150k rows, each containing 3700 columns.
I need to select multiple rows at a time, and aggregate the results back, something like:
row[1][column1] + row[2][column1] ... + row[n][column1]
row[1][column2] + row[2][column2] ... + row[n][column2]
...
row[1][columnn] + row[2][columnn] ... + row[n][columnn]
I can do this using a scanner, but the issue, I believe, is that the scanner is like a cursor: it does not distribute the work over multiple machines at the same time, but rather gets data from one region, then hops to another region for the next set of data, and so on, when my results span multiple regions.
Is there a way to scan in a distributed fashion (an option, or creating multiple scanners, one per region's worth of data [this might be a can of worms in itself]), or is this something that must be done in a map/reduce job? If it's an M/R job, will it be "fast" enough for real-time queries? If not, are there good alternatives for doing these types of aggregations in real time with a NoSQL-type database?
What I would do in such cases is keep another table with the aggregation summaries. That is, when row[m] is inserted into table 1, I would save into table 2, against column1 (which is the row key of table 2), its sum or other aggregate results: average, standard deviation, max, min, etc.
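A small sketch of that write-time aggregation in Python, with illustrative structures (the summary "table 2" is just a dict here; in HBase it would be a second table keyed by column name):

```python
summary = {}  # column name -> running aggregates: the "table 2" above

def insert_row(row):
    """Called for every insert into table 1: update the running
    aggregates for each column so reads never scan the raw rows."""
    for column, value in row.items():
        agg = summary.setdefault(
            column, {"count": 0, "sum": 0, "min": value, "max": value})
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)

def average(column):
    agg = summary[column]
    return agg["sum"] / agg["count"]
```

Count, sum, min, and max compose incrementally like this; averages and standard deviations can be derived from stored sums (and sums of squares), but medians and other rank statistics cannot, so choose the pre-computed aggregates to match the queries you actually need.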
Another approach would be to index the data into a search tool such as Lucene, Solr, or Elasticsearch and run aggregation searches there. Here are some examples in Solr.
Finally, a scan spanning multiple regions, or an M/R job, is not designed for real-time queries (unless the cluster is designed for it, i.e. oversized relative to the data requirements).
Hope it helps.