Please help me to get the optimized performance while reading data from redshift.
Option 1: I unload the data from table to a S3 folder and then read it as a dataframe
Option 2: I read the table directly with sqlContext.read.
My data volume is currently small but is expected to grow in the coming months. When I tried both options, they took almost the same time.
Option 1:
unload ('select * from sales_hist')
to 's3://mybucket/tickit/unload/sales_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
hist_output_table_df = spark.read.format(config['reader_format'])\
.option('header', config['reader_header'])\
.option('delimiter', config['reader_delimiter'])\
.csv(s3_directory + config['reader_path'])
reader_path is the same directory that the data was unloaded to above.
Option 2:
hist_output_table_df = sqlContext.read.\
format("com.databricks.spark.redshift")\
.option("url",jdbcConnection)\
.option("tempdir", tempS3Dir)\
.option("dbtable", table_name)\
.option("aws_iam_role",aws_role).load()
Is there a cost implication between the two approaches?
The Spark Redshift driver used by sqlContext does an UNLOAD behind the scenes. That's why you must provide a tempS3Dir - that's where it unloads to.
So the performance will be roughly the same but I would suggest using sqlContext because it is more encapsulated.
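To make the comparison concrete, here is a minimal sketch of Option 2 that sends a query instead of a whole table, so less data is unloaded to the temp directory. The helper names and the saletime column are made up for illustration, and it assumes the spark-redshift data source accepts a "query" option in place of "dbtable"; the other option names are the ones from the question. As far as I know the connector does not clean up the files it unloads to tempdir, so an S3 lifecycle rule on that prefix is worth considering on the cost side.

```python
def redshift_read_options(jdbc_url, temp_s3_dir, iam_role, table=None, query=None):
    """Build reader options for either a whole table or a query (exactly one)."""
    if (table is None) == (query is None):
        raise ValueError("pass exactly one of table or query")
    options = {
        "url": jdbc_url,
        "tempdir": temp_s3_dir,      # the connector UNLOADs to this prefix
        "aws_iam_role": iam_role,
    }
    if query is not None:
        options["query"] = query     # assumption: supported instead of dbtable
    else:
        options["dbtable"] = table
    return options


def load_from_redshift(spark, options):
    """Run the read; needs a live SparkSession with the connector installed."""
    return (spark.read
            .format("com.databricks.spark.redshift")
            .options(**options)
            .load())


opts = redshift_read_options(
    "jdbc:redshift://example-host:5439/dev",      # hypothetical JDBC URL
    "s3://mybucket/tmp/spark-redshift/",          # hypothetical temp prefix
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
    # saletime is a hypothetical column used only for illustration
    query="select * from sales_hist where saletime >= '2019-01-01'",
)
print(sorted(opts))
```

With a SparkSession available, `load_from_redshift(spark, opts)` would return the dataframe; nothing is read until an action is called on it.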
Related
My company is implementing Azure Data Explorer (ADX) as a backend. They also want to use Databricks for Data Science projects including data exploration. I'm in charge of connecting Databricks to ADX.
I first tried Azure Kusto package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
...
df = dataframe_from_result_table(RESPONSE.primary_results[0])
Full steps here
Functionally this works well.
But it completely loses the lazy loading feature of both ADX and Databricks/Spark.
I think that is because df is just a Pandas dataframe. Also, if I convert it to a Hive table, the data gets persisted, which we don't want: we need fresh online data, not a local copy.
The next thing I've tried was to have this loaded in a spark dataframe. I've tried the following code (after installing the relevant libraries)
df = spark.read. \
format("com.microsoft.kusto.spark.datasource"). \
option("kustoCluster", KUSTO_URI). \
option("kustoDatabase",KUSTO_DATABASE). \
option("kustoQuery", "some_table_in_adx"). \
option("kustoAadAppId",CLIENT_ID). \
option("kustoAadAppSecret",CLIENT_SECRET). \
option("kustoAadAuthorityID", AAD_TENANT_ID).load()
which again loads the data into the Spark dataframe without any issue.
However, performance-wise it is far from the direct query in ADX. A count on a table of 600 thousand records takes less than a second in ADX, while it takes more than 20 seconds in the Databricks notebook on a DS3_V2 (14 GB, 4 cores).
Before even considering saveAsTable or createOrReplaceTempView, I wonder why I'm experiencing this performance issue. So my questions are:
Does this connection use the lazy loading (I know doing a count is not the right way to check that)?
If not, is there any way to have lazy loading instead of loading the full table into dataframes before doing operations?
What would happen if I create a Hive table from the Spark dataframe: will it copy the data, or will it still be a virtual table pointing to ADX?
Thanks for your help
For the Python part -
A Pandas dataframe is not a Spark dataframe, so it is not lazily computed. To use the two together you may use Spark parallelize.
For the Spark ADX connector -
This is indeed lazy loading. Nothing is evaluated until an evaluation method is requested, like the count.
If the count is done with Spark syntax, i.e. spark.read.kusto...count(), then all the data is first brought into Spark and the count is computed there, so 20 seconds sounds legit. To compare with a direct query, simply change the value of "kustoQuery" to "some_table_in_adx | count", which runs the count on the ADX side and sends only the final integer result to Spark.
The connector can read either with a simple Kusto query or with a distributed export command. Setting the readMode option to ForceSingleMode makes it run a simple Kusto query from the driver node, as explained in the docs.
As for the Hive table: it operates over dataframes, so once you create a dataframe from a Kusto read, table operations on it are lazily evaluated as well.
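The "count on the ADX side" suggestion can be sketched as below. The option names are the ones used in the question; the helper functions and the placeholder credentials are made up for illustration, and the actual read needs a cluster with the Kusto Spark connector installed.

```python
def kusto_read_options(cluster, database, query, app_id, app_secret, tenant_id):
    """Collect the connector options used in the question into one dict."""
    return {
        "kustoCluster": cluster,
        "kustoDatabase": database,
        "kustoQuery": query,
        "kustoAadAppId": app_id,
        "kustoAadAppSecret": app_secret,
        "kustoAadAuthorityID": tenant_id,
    }


def count_on_adx(table):
    """KQL that makes ADX compute the count and return a single row."""
    return f"{table} | count"


def read_kusto(spark, options):
    """Run the read; requires a SparkSession with the Kusto connector."""
    return (spark.read
            .format("com.microsoft.kusto.spark.datasource")
            .options(**options)
            .load())


# Placeholder values, for illustration only.
opts = kusto_read_options(
    "KUSTO_URI", "KUSTO_DATABASE", count_on_adx("some_table_in_adx"),
    "CLIENT_ID", "CLIENT_SECRET", "AAD_TENANT_ID",
)
print(opts["kustoQuery"])
```

Calling `read_kusto(spark, opts)` would then give a one-row dataframe holding the count, instead of pulling all 600 thousand records into Spark first.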
I am trying to read data from a Postgres table using Spark. Initially I was reading the data on a single thread, without using lowerBound, upperBound, partitionColumn and numPartitions. The data that I'm reading is huge, around 120 million records, so I decided to read it in parallel using partitionColumn. I am able to read the data, but it takes more time with 12 parallel threads than with a single thread. I am unable to figure out how I can see the 12 SQL queries that get generated to fetch the data in parallel for each partition.
The code that I am using is:
val query = s"(select * from db.testtable) as testquery"
val df = spark.read
.format("jdbc")
.option("url", jdbcurl)
.option("dbtable", query)
.option("partitionColumn","transactionbegin")
.option("numPartitions",12)
.option("driver", "org.postgresql.Driver")
.option("fetchsize", 50000)
.option("user","user")
.option("password", "password")
.option("lowerBound","2019-01-01 00:00:00")
.option("upperBound","2019-12-31 23:59:00")
.load
df.count()
Where and how can I see the 12 parallel queries that are getting created to read the data in parallel on each thread?
I am able to see that 12 tasks are created in the Spark UI, but I cannot find a way to see which 12 separate queries are generated to fetch data in parallel from the Postgres table.
Is there any way I can push the filter down so that it reads only this year's worth of data, in this case 2019?
The SQL statement is logged at the "info" log level (see here), so you need to change Spark's log level to "info" to see the SQL. It also logs the WHERE condition on its own (as here).
You can also view the SQL in your PostgreSQL database using the pg_stat_statements view, which requires a separate extension to be installed. There is also a way to log the SQL statements and see them, as mentioned here.
I suspect the parallel read is slow for you because there is no index on the "transactionbegin" column of your table. The partitionColumn should be indexed; otherwise each of the parallel sessions scans the entire table, which will choke the database.
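To get a feel for what the per-partition queries look like without turning on logging, here is a simplified sketch of how the stride-based WHERE clauses are derived from lowerBound, upperBound and numPartitions. It is an approximation written from memory of Spark's JDBC source, not its actual code; the function name is made up and it only covers non-negative integer bounds (timestamp bounds such as the ones in the question are not handled here). Note that the first and last partitions are open-ended: the bounds only decide the stride and do not filter rows, so restricting the read to 2019 would need a WHERE clause inside the dbtable subquery.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clause Spark appends for each JDBC partition.

    Returns one predicate string per partition (None means no WHERE clause).
    Assumes non-negative integer bounds.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")

    # Width of each partition's value range.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None

        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULLs in the partition column end up in the first partition.
            predicates.append(f"{upper} or {column} is null")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for clause in jdbc_partition_predicates("id", 0, 100, 4):
    print(clause)
```

Each printed clause would be appended to the dbtable subquery, roughly as SELECT ... FROM (subquery) WHERE clause, one query per task.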
This will not show the multiple queries exactly, but it will show the execution plan that Spark has optimized based on your queries. It may not be perfect, depending on the stages you have to execute.
You can write your DAG as DataFrame operations and, before actually calling an action, call the explain() method on it. Reading the plan can be challenging because it is upside down: the source is at the bottom. It may seem a little unusual at first, so start with basic transformations and go step by step if you're reading one for the first time.
I use count to get the number of records in an RDD and get 13673153, but after I convert the RDD to a DataFrame and insert it into Hive, counting again gives 13673182. Why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the documentation here.
When a Spark Action runs, Spark records, as part of the SparkContext, which files were in scope for processing. So, if the DAG needs to recover and reprocess that Action, the same results are obtained. By design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables it looks at the list of files that it will use for the Action.
In the case of a failure (node failure, etc.) it will recompute data from the generated DAG. If it needs to go back and re-compute as far as the start of reading from Hive itself, then it will know which files to use - i.e the same files, so that same results are gotten instead of non-deterministic outcomes. E.g. think of partitioning aspects, handy that same results can be recomputed!
It's that simple. It's by design. Hope this helps.
I have a little experience with Hive and am currently learning Spark with Scala. I am curious to know whether Hive on Tez is really faster than SparkSQL. I searched many forums for test results, but they compare older versions of Spark and most were written in 2015. The main points are summarized below:
ORC will do the same as parquet in Spark
Tez engine will give better performance like Spark engine
Joins are better/faster in Hive than Spark
I feel like Hortonworks supports more for Hive than Spark and Cloudera vice versa.
sample links :
link1
link2
link3
Initially I thought Spark would be faster than anything because of its in-memory execution. After reading some articles, I understand that Hive is also being improved with new concepts like Tez, ORC, LLAP, etc.
We are currently running PL/SQL on Oracle and migrating to big data since volumes are increasing. My requirements are ETL batch processing; the data involved in every weekly batch run is described below. The data will grow considerably soon.
Input/lookup data are csv/text formats and updating into tables
Two input tables which have 5 million rows and 30 columns
30 lookup tables are used to generate each column of the output table, which contains around 10 million rows and 220 columns.
Multiple joins are involved, both inner and left outer, since many lookup tables are used.
Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
Hope we can perform multiple inner and left outer join in Spark
The best way to implement a solution to your problem is as follows.
For loading the data into the tables, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using some kind of windowing functions, and register the results in Hive. Since the data is populated from various lookup tables during ingestion, you can write that logic programmatically in Scala.
But at the end of the day, there needs to be a query engine that is very easy to use. As your Spark program registers the tables with Hive, you can use Hive.
Hive supports three execution engines:
Spark
Tez
Mapreduce
Tez is mature; Spark is evolving with various commits from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more mature in the industry.
In short, use Spark for the daily data processing and register the results with Hive.
Create business users in Hive.
I have TBs of structured data in a Greenplum DB. I need to run what is essentially a MapReduce job on my data.
I found myself reimplementing at least the features of MapReduce just so that this data would fit in memory (in a streaming fashion).
Then I decided to look elsewhere for a more complete solution.
I looked at Pivotal HD + Spark because I am using Scala and Spark benchmarks are a wow-factor. But I believe the datastore behind this, HDFS, is going to be less efficient than Greenplum. (NOTE the "I believe". I would be happy to know I am wrong but please give some evidence.)
So to keep with the Greenplum storage layer I looked at Pivotal's HAWQ which is basically Hadoop with SQL on Greenplum.
There are a lot of features lost with this approach. Mainly the use of Spark.
Or is it better to just go with the built-in Greenplum features?
So I am at the crossroads of not knowing which way is best. I want to process TBs of data that fits the relational DB model well, and I would like the benefits of Spark and MapReduce.
Am I asking for too much?
Before posting my answer, I want to rephrase the question based on my understanding (to make sure I understand the question correctly) as follows:
You have TBs of data that fits the relational DB model well, and you want to query the data using SQL most of the time (I think that's why you put it into Greenplum DB), but sometimes you want to use Spark and MapReduce to access the data because of their flexibility.
If my understanding is correct, I strongly recommend that you give HAWQ a try. Some features of HAWQ make it fit your requirements perfectly (Note: I may be biased, since I am a developer of HAWQ).
First of all, HAWQ is a SQL on Hadoop database, which means it employs HDFS as its datastore. HAWQ doesn't keep with the Greenplum DB storage layer.
Secondly, it is hard to argue against the claim that "HDFS is going to be less efficient than Greenplum". But the performance difference is not as significant as you might think. We have done some optimizations for accessing HDFS data. One example is that, if we find one data block is stored locally, we read it directly from disk rather than through normal RPC calls.
Thirdly, there is a feature with HAWQ named HAWQ InputFormat for MapReduce (which Greenplum DB doesn't have). With that feature, you can write Spark and MapReduce code to access the HAWQ data easily and efficiently. Different from the DBInputFormat provided by Hadoop (which would make the master become the performance bottleneck, since all the data goes through the master first), HAWQ InputFormat for MapReduce lets your Spark and MapReduce code access the HAWQ data stored in HDFS directly. It is totally distributed, and thus is very efficient.
Lastly, of course, you still can use SQL to query your data with HAWQ, just like what you do with Greenplum DB.
Have you tried using the Spark JDBC connector to read the Greenplum data?
Use the partition column, lower bound, upper bound, and numPartitions to split the Greenplum table across multiple Spark workers.
For example:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkGreenplumApp extends App {
  val conf = new SparkConf().setAppName("SparkGreenplumTest")
  val sparkContext = new SparkContext(conf)
  val sqlContext = new SQLContext(sparkContext)

  // Greenplum is read through the PostgreSQL JDBC driver.
  // "tablename" is a placeholder: the original query had no FROM clause.
  // The partition column (id) is selected in the subquery because Spark
  // applies its per-partition range predicates on top of this subquery.
  val df = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:postgresql://servername:5432/databasename?user=username&password=*******",
    "dbtable" -> "(select id, col, col2, col3 from tablename where datecol > '2017-01-01' and datecol < '2017-02-02') as events",
    "partitionColumn" -> "id",
    "lowerBound" -> "100",
    "upperBound" -> "500",
    "numPartitions" -> "2",
    "driver" -> "org.postgresql.Driver")).load()
}