SparkSQL with a dataset several GBs in size - PostgreSQL

I didn't find the answer to this question on the web or in other questions, so I'm trying here:
The size of my dataset is several GB (~5GB to ~15GB).
I have multiple tables, some of them containing ~50M rows.
I'm using PostgreSQL, which has its own query optimizations (parallel workers and indexing).
50% of my queries take advantage of the indexing and multiple workers to finish faster.
Some of my queries use joins.
I read that SparkSQL is intended for huge datasets.
If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
Is a 15GB dataset a good fit for SparkSQL or PostgreSQL?
When is it best to choose SparkSQL over PostgreSQL?

If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
-> If your data does not have a lot of skew, SparkSQL will give you better query speeds, since the query runs distributed across the Spark cluster.
Is a 15GB dataset a good fit for SparkSQL or PostgreSQL?
-> SparkSQL is simply a query engine built into Apache Spark: it will process the data and let you create in-memory views, but that is it. Once the application terminates, the views are gone.
PostgreSQL, on the other hand, is, and I quote, a DATABASE. It will let you query data and store the results in its own native format.
Now coming to your question: 15GB of data is not a lot to process for either engine, and your query performance will depend mostly on the data model.
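As a rough illustration of the in-memory view point, here is a minimal PySpark sketch (the connection details, table name and partition bounds are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath) that pulls a PostgreSQL table into Spark over JDBC and queries it with SparkSQL:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-via-sparksql").getOrCreate()

# Load a PostgreSQL table into Spark over JDBC; the read is parallelised across 8 partitions
orders = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 50000000)
    .option("numPartitions", 8)
    .load())

# The "view" lives only for the lifetime of this Spark application
orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) FROM orders GROUP BY status").show()

Nothing is persisted by Spark here; the results would have to be written back somewhere (Parquet, a Hive table, or PostgreSQL itself) to outlive the application.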
When is it best to choose SparkSQL over PostgreSQL?
-> Choose SparkSQL when you wish to perform ad-hoc queries and the dataset sizes are in the terabyte range.
Choose PostgreSQL when you wish to store transactional data, or datasets that are simply used to drive BI tools, custom UIs, or applications.

Related

Writing wide table (40,000+ columns) to Databricks Hive Metastore for use with AutoML

I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately this job does not finish within an 'acceptable' time (10 hours). I cancel it and hence don't have an error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved:
With a better compute instance?
With a better script?
Not at all, and if so, is there another approach?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 Workers, 112-896 GB Memory, 32-256 Cores (Standard_DS5_v2)
1 Driver, 56 GB Memory, 16 Cores (same as worker)
Runtime: 10.4.x-scala2.12
To give an impression of the timings I've added a table below.
columns | time (mins)
10      | 1.94
100     | 1.92
200     | 3.04
500     | 9.87
1000    | 25.91
5000    | 938.4
The data type of the remaining columns is Integer.
As far as I know, I'm writing the table in the same environment that I'm working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps are on the same cloud machine.
Hope this helps!
I think your case is not related to Spark resource configuration or the network connection; it's related to the design of Spark itself.
Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your dataframe. Looking at your experiment, the time consumed grows much faster than linearly as the column count increases. Although it's about reading CSV rather than writing a table, you can check this post for a good explanation of why Spark is not good at handling wide dataframes: Spark csv reading speed is very slow although I increased the number of nodes
Although I haven't used Azure AutoML before, based on your dataset and goal, I think you can try:
Using a Python pandas DataFrame together with a Hive connection library, to see if there is any performance improvement
Concatenating all your columns into a single array/vector column before you write to Hive (a rough sketch follows below)
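For the second suggestion, a rough PySpark sketch of what that could look like (the response column name "label" and the target table name are placeholders, not from your schema):
from pyspark.sql import functions as F

# Pack all integer feature columns into one array column, keeping the response column as-is
feature_cols = [c for c in df.columns if c != "label"]
packed = df.select("label", F.array(*feature_cols).alias("features"))

# Writing ~2 columns instead of ~45,000 avoids the very wide schema that Spark struggles with
packed.write.mode("overwrite").saveAsTable("WIDE_TABLE_PACKED")

Whether the AutoML GUI can consume an array column directly is something you'd have to verify; the point is only to avoid the 45,000-column schema at write time.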

What is the best practice for querying a big Spark result?

I'm trying to build a recommendation engine for an online shop with ca. 50,000 articles. I created the frequent itemsets and association rules with FPGrowth on Apache Spark.
My first try was to put the data (65G rows) into a database (PostgreSQL) as intarrays; with a GIN index the performance should be OK. But when the number of rows is high, the queries take minutes, while small amounts take only milliseconds.
What are the best practices for querying a big Spark result?
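For reference, a minimal PySpark sketch of the pipeline described above, plus one way of laying the rules out for fast per-item lookup (the transactions DataFrame with an array column "items", the bucket count and the table name are illustrative assumptions, not a recommendation):
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import functions as F

# transactions: DataFrame with one array column "items" per basket (assumed to exist)
fp = FPGrowth(itemsCol="items", minSupport=0.001, minConfidence=0.1)
model = fp.fit(transactions)

# associationRules has columns antecedent, consequent, confidence, ...
rules = model.associationRules

# Explode the antecedent so each rule is keyed by a single item, then bucket by that item
# so a lookup for one article only has to read the matching bucket of the table
(rules.withColumn("item", F.explode("antecedent"))
    .write.mode("overwrite")
    .bucketBy(128, "item").sortBy("item")
    .saveAsTable("rules_by_item"))

spark.sql("SELECT * FROM rules_by_item WHERE item = 42 ORDER BY confidence DESC LIMIT 10").show()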

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results, since we allow customized filters on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
from 1 to 50 clients connected to the application at the same time
dataset growing at a rate of 10M rows/day
typical queries are SELECTs with aggregate functions (COUNT, AVG) and 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match that structure 100%. Look at the explain plans - you should not see any redistribution or broadcasting, and no leader-node activity (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to have.
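As an illustration of the explain-plan check, a hedged sketch (psycopg2 against the Redshift endpoint; connection details and the query are placeholders) that flags the redistribution/broadcast join attributes Redshift prints in its plans:
import psycopg2

conn = psycopg2.connect(host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="reader", password="secret")

query = """SELECT c.segment, COUNT(*), AVG(e.amount)
           FROM events e JOIN customers c ON c.id = e.customer_id
           GROUP BY c.segment"""

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + query)
    plan = "\n".join(row[0] for row in cur.fetchall())

# These attributes indicate rows being redistributed or broadcast between nodes;
# ideally the plan only contains DS_DIST_NONE / DS_DIST_ALL_NONE
for marker in ("DS_BCAST_INNER", "DS_DIST_INNER", "DS_DIST_OUTER", "DS_DIST_BOTH"):
    if marker in plan:
        print("join redistribution detected:", marker)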
The requirement to filter/aggregate on any of 100 arbitrary columns can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. However, these are scalability problems - you can increase the number of nodes to match the performance you need; you just might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should also consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering. And it can store petabytes of data, so it's OK if you store some data redundantly.
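For the pre-aggregation point, a hedged sketch of a daily rollup (CTAS executed via psycopg2; the table, columns and keys are made-up examples) that ad-hoc dashboard queries could hit instead of the raw fact table:
import psycopg2

rollup_sql = """
CREATE TABLE events_daily
  DISTKEY (customer_id)
  SORTKEY (event_date)
AS
SELECT customer_id,
       DATE_TRUNC('day', event_ts) AS event_date,
       COUNT(*)                    AS event_count,
       SUM(amount)                 AS amount_sum   -- keep SUM and COUNT so AVG can be recomputed exactly
FROM events
GROUP BY 1, 2;
"""

with psycopg2.connect(host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
                      port=5439, dbname="analytics", user="loader", password="secret") as conn:
    with conn.cursor() as cur:
        cur.execute(rollup_sql)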
So, in summary, I don't think your use case is unsuitable based on the definition you provided. It might require some work, and the details depend on the exact usage patterns.

NoSQL read- and write-intensive big data table

I have 10 different queries and a total of 40 columns.
I am looking for a big data NoSQL database that can handle read- and write-intensive jobs (multiple queries with an SLA).
I tried HBase, but it is fast only for row-key (scan) searches; for other queries (not running on the row key) the response time is quite high. Duplicating the data with different row keys is the only option for quick responses, but for 10 queries, making 10 different tables is not a good idea.
Please suggest alternatives.
Have you tried Druid? It is inspired by Dremel, the precursor of Google BigQuery.
From the documentation:
Druid is a good fit for products that require real-time data ingestion of a single, large data stream. Especially if you are targeting no-downtime operation and are building your product on top of a time-oriented summarization of the incoming data stream. When talking about query speed it is important to clarify what "fast" means: with Druid it is entirely within the realm of possibility (we have done it) to achieve queries that run in less than a second across trillions of rows of data.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns hold numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since I can easily add/drop/replace partitions
the database is nice for filtering based on type_id
databases don't make it easy to write parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the combined data AB by computing sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
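In other words, if each dataset stores only (count, sum, sum of squares) per column, the combined mean and stddev can be merged cheaply at query time. A tiny plain-Python sketch of that merge (the numbers are made up):
import math

def merge_stats(parts):
    # parts: iterable of (count, total, sum_of_squares), one tuple per dataset
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    sum_sq = sum(p[2] for p in parts)
    mean = total / n                      # E[X]
    variance = sum_sq / n - mean ** 2     # E[X^2] - (E[X])^2
    return mean, math.sqrt(max(variance, 0.0))

# e.g. precomputed offline, one tuple per selected dataset for one column
a = (1000000, 5200000.0, 31000000.0)
b = (1000000, 4900000.0, 29500000.0)
print(merge_stats([a, b]))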
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - you can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time,
you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, over which you can compute the avg etc. quite easily.
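A hedged sketch of that layout using the happybase client (the host, table name, column family and the way the type_id is folded into the row key are all illustrative assumptions):
import happybase

connection = happybase.Connection("hbase-host")

# One HBase table per original data column; here, a hypothetical metric "price"
table = connection.table("metric_price")

# One row per key, with the filter field (type_id) folded into the row key,
# and one column per dataset under the column family "d"
table.put(b"7#key_000123", {b"d:dataset_0042": b"12.5", b"d:dataset_0043": b"13.1"})

# Reading one row returns this metric's values for that key across all datasets;
# pick the datasets the user selected and aggregate in the application
row = table.row(b"7#key_000123")
wanted = [b"d:dataset_0042", b"d:dataset_0043"]
values = [float(row[c]) for c in wanted if c in row]
print(sum(values) / len(values))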
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like your datasets need joins, you should be fine. You can have indexes set up to find the datasets, and then do the math either in SQL or in the application.
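A hedged sketch of that approach against PostgreSQL (psycopg2; the table, columns and index are made-up examples), pushing the per-key aggregation into SQL:
import psycopg2

conn = psycopg2.connect(host="db-host", dbname="metrics", user="app", password="secret")

with conn, conn.cursor() as cur:
    # One big table measurements(dataset_id, type_id, key, col1, ..., colN)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_meas_ds_type ON measurements (dataset_id, type_id)")

    # Aggregate one column over the datasets the user selected, filtered by type_id
    cur.execute(
        """
        SELECT key, AVG(col1) AS avg_col1, STDDEV_POP(col1) AS stddev_col1
        FROM measurements
        WHERE dataset_id = ANY(%s) AND type_id = %s
        GROUP BY key
        """,
        ([42, 43, 44], 7),
    )
    results = cur.fetchall()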