What is the best practice for querying a big Spark result? - postgresql

I'm trying to build a recommendation engine for an online shop with ca. 50,000 articles. I created the frequent itemsets and association rules with FPGrowth in Apache Spark.
My first attempt was to put the data (65G rows) into a PostgreSQL database as int arrays; with a GIN index, the performance should have been OK. But when the number of rows is high, a query takes minutes, while low row counts take only milliseconds.
What are the best practices for querying a big Spark result?
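For context, here is a minimal PySpark sketch of the rule-generation step described above, assuming a DataFrame with one basket of article IDs per order; the table contents, column names, and thresholds are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("shop-recommendations").getOrCreate()

# Assumed input: one row per order, with the list of article ids in that order.
baskets = spark.createDataFrame(
    [(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])],
    ["order_id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.1)
model = fp.fit(baskets)

# Frequent itemsets and association rules -- this is the (potentially huge)
# result the question is about storing and querying.
model.freqItemsets.show()
model.associationRules.show()
```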

Related

SparkSQL with a dataset of several GB in size

I didn't find the answer to this question on the web or in other questions, so I'm trying here:
The size of my dataset is several GB (~5 GB to ~15 GB).
I have multiple tables, some of which contain ~50M rows.
I'm using PostgreSQL, which has its own query optimizations (parallel workers and indexing).
About 50% of my queries take advantage of the indexing and multiple workers to finish faster.
Some of my queries use joins.
I read that SparkSQL is intended for huge datasets.
If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
Does a 15 GB dataset fit better with SparkSQL or PostgreSQL?
When is it best to choose SparkSQL over PostgreSQL?
If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
-> If your data does not have a lot of skew, SparkSQL will give better performance in terms of query speed, as the query would run on the Spark cluster.
Does a 15 GB dataset fit better with SparkSQL or PostgreSQL?
-> SparkSQL is simply a query engine built into Apache Spark, so it will process the data and let you create views in memory, but that is it; once the application terminates, the view is removed (see the sketch after this answer).
PostgreSQL, on the other hand, is, and I quote, a DATABASE. It lets you query data and store the results in its own native format.
Now, coming to your question, 15 GB of data is not a lot to process for either engine, and your query performance will depend on the data model.
When is it best to choose SparkSQL over PostgreSQL?
-> Choose SparkSQL when you wish to perform ad-hoc queries and the dataset sizes are in the terabyte range.
Choose PostgreSQL when you wish to store transactional data, or datasets that are simply used to drive BI tools, custom UIs, or applications.
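As a sketch of what the answer describes, the following PySpark snippet pulls a PostgreSQL table into Spark over JDBC and queries it as an in-memory view that disappears when the application ends; the connection URL, table name, and credentials are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-vs-postgres").getOrCreate()

# Read one of the ~50M-row tables from PostgreSQL over JDBC.
# URL, table and credentials are placeholders; requires the PostgreSQL JDBC driver.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .option("numPartitions", 8)             # parallel JDBC reads
    .option("partitionColumn", "order_id")  # must be a numeric or date column
    .option("lowerBound", 1)
    .option("upperBound", 50_000_000)
    .load()
)

# Register an in-memory view; it exists only for the lifetime of this Spark app.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, COUNT(*) AS n FROM orders GROUP BY customer_id").show()
```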

How to improve performance of writing data to MongoDB using Spark?

I use PySpark to run a heavy iterative computing job that writes data to MongoDB. Each iteration may have 0.01 to 1 billion records in an RDD or DataFrame to compute (this part is simple and relatively fast) and about 100,000 records to write to MongoDB. The problem is that the procedure seems to get stuck in the MongoSpark job on every iteration (see image below). I don't know what is going on in this job. The computing part seems to have already finished (see the job "runJob at PythonRDD.scala"), yet MongoDB receives no data for most of this job's duration. By my estimate, writing 100,000 records directly to MongoDB should only take a tiny amount of time.
Can you explain what takes up most of this job's time, and how to improve its performance?
Thanks for your help.
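To frame the question, here is a hedged sketch of the write step being described, using the MongoDB Spark connector; the connector format string, output URI, and the coalesce factor are assumptions, not details from the original job:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iterative-mongo-write")
    # Output URI is a placeholder; requires the mongo-spark-connector package.
    .config("spark.mongodb.output.uri", "mongodb://mongo-host:27017/mydb.results")
    .getOrCreate()
)

# 'results_df' stands in for the ~100,000 rows produced by one iteration.
results_df = spark.range(100_000).withColumnRenamed("id", "value")

# A common tuning step: collapse the many partitions left over from the heavy
# computation, so the connector performs a few large bulk writes instead of
# thousands of tiny ones, each opening its own MongoDB connection.
(
    results_df.coalesce(8)
    .write.format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .save()
)
```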

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results, since we allow filters to be customized on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
from 1 to 50 clients connected to the application at the same time
dataset growing at a rate of 10M rows/day
typical queries are SELECTs with the aggregate functions COUNT and AVG, plus 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open-source implementation called Mondrian that can run over a variety of databases (including Redshift, AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans - you should not have any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement considering the number of queries you are going to have.
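A small sketch of how one might check the plan for those redistribution/broadcast steps, assuming a plain PostgreSQL-protocol connection via psycopg2 and a made-up query and schema (Redshift speaks the PostgreSQL wire protocol):

```python
import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="report", password="secret",
)

query = """
    SELECT o.customer_id, COUNT(*) AS orders, AVG(o.total) AS avg_total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY o.customer_id
"""

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + query)
    plan = "\n".join(row[0] for row in cur.fetchall())

# DS_DIST_* / DS_BCAST_* steps mean rows are being redistributed or broadcast
# between nodes, i.e. the join does not match the distribution keys.
for marker in ("DS_DIST_BOTH", "DS_DIST_INNER", "DS_BCAST_INNER"):
    if marker in plan:
        print("warning: plan contains", marker)
```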
The requirement to be able to filter/aggregate on arbitrary columns out of 100 can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift optimisations. However, these are scalability problems - you can increase the number of nodes to match your performance; you just might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering, and it can store petabytes of data - so it's OK if you store data in excess.
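As a sketch of the pre-aggregation idea, a summary table can be rebuilt periodically and queried instead of the raw fact table; the table names, columns, and keys below are made up for illustration:

```python
import psycopg2

# Placeholders for the cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="secret",
)
conn.autocommit = True

# Build (or rebuild) a daily summary once, so the frequent dashboard queries
# hit millions of pre-aggregated rows instead of billions of raw events.
with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS daily_summary")
    cur.execute("""
        CREATE TABLE daily_summary
        DISTKEY (customer_id)
        SORTKEY (event_date)
        AS
        SELECT event_date, customer_id,
               COUNT(*)   AS events,
               AVG(total) AS avg_total
        FROM events
        GROUP BY event_date, customer_id
    """)
```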
So, in summary, I don't think your use case is unsuitable based on just the definition you provided. It might require work, and the details depend on the exact usage patterns.

NoSQL read- and write-intensive big data table

I have 10 different queries and a total of 40 columns.
I am looking for solutions among the available big data NoSQL databases that can handle read- and write-intensive jobs (multiple queries with SLAs).
I tried HBase, but it is fast only for row-key (scan) searches; for other queries (not running on the row key), response time is quite high. Duplicating the data under different row keys is the only option for quick responses (sketched below), but maintaining 10 different tables for 10 queries is not a good idea.
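For illustration, a minimal sketch of that duplication workaround using the happybase client; the table name, column family, and key layouts are hypothetical:

```python
import happybase

connection = happybase.Connection('hbase-host')  # assumed host
table = connection.table('orders')               # assumed table

record = {b'cf:user_id': b'42', b'cf:article_id': b'1001', b'cf:qty': b'3'}

# The same record is written twice, once keyed by user id and once keyed by
# article id, so each access pattern becomes a cheap prefix scan.
table.put(b'user#42#1001', record)
table.put(b'article#1001#42', record)

# Query pattern "by user" is now a fast prefix scan on its own key layout.
for key, data in table.scan(row_prefix=b'user#42#'):
    print(key, data)
```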
Please suggest the alternatives.
Have you tried Druid? It is inspired by Dremel, the precursor of Google BigQuery.
From the documentation:
Druid is a good fit for products that require real-time data ingestion of a single, large data stream. Especially if you are targeting no-downtime operation and are building your product on top of a time-oriented summarization of the incoming data stream. When talking about query speed it is important to clarify what "fast" means: with Druid it is entirely within the realm of possibility (we have done it) to achieve queries that run in less than a second across trillions of rows of data.

Which NoSQL solution fits my application: HBase, Hypertable, or Cassandra?

I have an application with 100 million records and growing. I want to scale out before it hits the wall.
I have been reading about NoSQL technologies that can handle big data efficiently.
My needs:
There are more reads than writes, but writes are also significant in number (read:write = 4:3).
Can you please explain the differences among HBase, Hypertable, and Cassandra? Which one fits my requirements?
Both HBase and Hypertable require Hadoop. If you're not using Hadoop anyway (i.e. you don't need to solve map/reduce-related problems), I'd go with Cassandra, as it is stand-alone.
If you already have data, Hive is the best solution for your application. Or, if you are developing the app from scratch, look at the links below, which give an overview of the NoSQL world:
http://nosql-database.org/
http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape/