I need to save AWS Glue data to an Athena table in the minimum possible time.
I have successfully saved AWS Glue data (i.e. a DynamicFrame) to an Athena table, but for a 17 GB table it takes around 19-20 minutes. With 100 DPUs, that seems too long. Currently I am using this method:
getCatalogSink(database: String, tableName: String, redshiftTmpDir: String = "", transformationContext: String = "") : DataSink
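For context, in a Python Glue script the equivalent catalog write looks roughly like the sketch below (glue_context and the DynamicFrame dyf are assumed to already exist, and the database/table names are placeholders):

# Write the DynamicFrame to the table registered in the Glue Data Catalog
# (which Athena then queries). All names here are placeholders.
glue_context.write_dynamic_frame.from_catalog(
    frame=dyf,
    database="my_database",
    table_name="my_table",
    transformation_ctx="write_sink",
)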
Is there any way to speed up the process, or is the current time reasonable?
Thanks in advance.
I have problems understanding the concept of Delta Lake. Example:
I read a parquet file:
taxi_df = (spark.read.format("parquet").option("header", "true").load("dbfs:/mnt/randomcontainer/taxirides.parquet"))
Then I save it using saveAsTable:
taxi_df.write.format("delta").mode("overwrite").saveAsTable("taxi_managed_table")
I read back the managed table I just stored:
taxi_read_from_managed_table = (spark.read.format("delta").option("header", "true").load("dbfs:/user/hive/warehouse/taxi_managed_table/"))
... and when I check the type, it shows "pyspark.sql.dataframe.DataFrame", not DeltaTable:
type(taxi_read_from_managed_table) # returns pyspark.sql.dataframe.DataFrame
Only after I convert it explicitly using the following command do I get the type DeltaTable:
taxi_delta_table = DeltaTable.convertToDelta(spark, "parquet.`dbfs:/user/hive/warehouse/taxismallmanagedtable/`")
type(taxi_delta_table) #returns delta.tables.DeltaTable
Does that mean that the table in step 4 is not a Delta table and won't provide the automatic optimizations offered by Delta Lake?
How do you establish whether something is part of Delta Lake or not?
I understand that Delta Live Tables only work with delta.tables.DeltaTable objects. Is that correct?
When you use spark.read...load(), it returns a Spark DataFrame object that you can use to process the data. Under the hood this DataFrame uses the Delta Lake table. The DataFrame abstracts the data source, so you can work with different sources and apply the same operations.
On the other hand, DeltaTable is a specific object that exposes only Delta-specific operations. You don't even need to run convertToDelta to get one - just use the DeltaTable.forPath or DeltaTable.forName functions to obtain an instance.
P.S. If you saved the data with .saveAsTable(my_name), then you don't need to use .load; just use spark.read.table(my_name).
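For illustration, a minimal sketch of both approaches, using the table name and path from the question:

from delta.tables import DeltaTable

# DataFrame view of the data, for general processing:
taxi_df = spark.read.table("taxi_managed_table")

# DeltaTable handle, for Delta-specific operations - no conversion needed:
dt = DeltaTable.forName(spark, "taxi_managed_table")
# ...or equivalently by path:
# dt = DeltaTable.forPath(spark, "dbfs:/user/hive/warehouse/taxi_managed_table/")

dt.history().show()   # example of a Delta-specific operation
print(type(dt))       # <class 'delta.tables.DeltaTable'>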
I am new to Databricks. I have a requirement where, in the silver layer after transformation, I have to take the max(load_date) from my dataset and update that value in a storage account (Transient folder). A .csv file is already available in the Transient folder, where I have to overwrite the max(load_date) value every time my notebook runs.
For now I am doing this by creating an empty DataFrame, assigning the max date to it, and then loading it into the file, but it does not seem to work that way.
Any idea how to do this in an efficient way?
Create an empty SQL table, then assign the max date from the DataFrame using the merge operation.
I have stored my max date in the kDF DataFrame:
kDF.write.option("mergeSchema","true").format("delta").mode("append").saveAsTable("test3")
After the merge operation, the max date will be stored in the test3 table, which lives at the location /mnt/defaultDatalake/filedemo.
Now you can check the data:
df1 = spark.read.format("delta").option("header", "true").load("/mnt/defaultDatalake/filedemo")
display(df1)
Then perform the write operation to Azure Data Lake Gen2:
# Connect Gen2 with Azure Databricks
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    "<access_key>"
)
Then overwrite the data into Gen2:
df1.coalesce(1).write.format('csv').mode("overwrite").save("abfss://demo12@vamblob.dfs.core.windows.net/vam")
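If the goal is simply to land the single max(load_date) value as a CSV in the Transient folder, a more direct sketch could look like the following (assuming the silver dataset is available as silver_df with a load_date column; the container and storage account names are placeholders):

from pyspark.sql import functions as F

# Single-row DataFrame holding the max load_date from the silver dataset
# (silver_df and the load_date column name are assumptions for illustration).
max_date_df = silver_df.select(F.max("load_date").alias("max_load_date"))

# Write it out as one CSV file, overwriting the previous value in the
# Transient folder (placeholder container/storage account names).
(max_date_df.coalesce(1)
    .write.format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("abfss://<container>@<storage_account>.dfs.core.windows.net/Transient"))

Note that Spark writes a directory containing a part-*.csv file rather than a single named file; if you need an exact file name you would have to rename the part file afterwards (for example with dbutils.fs.mv).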
I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.
In my Glue jobs, I'm used to loading my tables like this:
my_table_df = sparkSession.table("myTable")
However, this time I want to access a table from another database in the same data source (AwsDataCatalog). So I do something that works well:
my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")
I am just looking for a better way to write the same thing, without using a SQL query, in one line, just by specifying the database for this operation. Something that would look like:
sparkSession.database("anotherDatabase").table("myOtherTable")
Any suggestion would be welcome.
You can use the DynamicFrameReader for that. It will return a DynamicFrame, and you can simply call .toDF() on that DynamicFrame to convert it into a native Spark DataFrame.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)

# Read the table from the Glue Data Catalog and convert it to a Spark DataFrame
data_source = glue_context.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table_name"
).toDF()
Hi, I have 90 GB of data in a CSV file. I'm loading this data into a temp table and then from the temp table into an ORC table using an insert-select statement, but converting and loading the data into ORC format takes about 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (via insert-select).
I'm using spark-submit as:
spark-submit \
  --class class-name \
  application-jar
Or can I add any extra parameters to spark-submit to improve performance?
Scala code (sample):
// imports
import org.apache.spark.sql.SparkSession

object sample_1 {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val sparksession = SparkSession.builder.enableHiveSupport().getOrCreate()
    val a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
    val b1 = sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
First of all, you don't need to store the data in a temp table in order to write it into the Hive table later. You can read the file directly and write the output using the DataFrameWriter API. This removes one step from your code.
You can write it as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val df = spark.read.csv(filePath) // add header or delimiter options if needed
df.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, outputFormat will be orc, outputDB will be your Hive database, and outputTableName will be your Hive table name.
I think that with the above technique your write time will reduce significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your DataFrame while writing. This will make the write operation faster. However, you need to choose the partition columns carefully so that you don't end up creating a lot of partitions.
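For example, a minimal sketch of a partitioned write (shown in PySpark syntax for illustration; the partition column event_date and the database/table names are placeholders):

# Partitioned write: one sub-directory per distinct value of the partition
# column. "event_date" is an illustrative, reasonably low-cardinality column.
(df.write
    .mode("append")
    .format("orc")
    .partitionBy("event_date")
    .saveAsTable("my_hive_db.my_orc_table"))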
I have a Snowflake stored procedure which exports data to S3 based on dynamic input parameters. I am trying to set this up via Tableau, so that I can use Tableau parameters and call the Snowflake stored procedure from Tableau. Is this possible in any way?
While there's no straightforward solution, you could accomplish this task with a series of Snowflake facilities:
Create a task that monitors information_schema.query_history() every X minutes.
Have this task check for queries executed under a Tableau session.
If any of these queries have a parameter set by your Tableau dashboard that indicates the user wants to export these results, then do so.
You can check that a session was initiated by Tableau by searching the query history for ALTER SESSION SET QUERY_TAG = { "tableau-query-origins": { "query-category": "Data" } }.
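As a rough illustration of the same monitoring idea, the sketch below polls the query history from an external Python script with the Snowflake connector instead of a native Snowflake task; all names (connection details, the export_to_s3 procedure, and the convention used to detect the export request) are assumptions:

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Look at recent history for queries tagged by Tableau.
cur.execute("""
    SELECT query_id, query_text
    FROM table(information_schema.query_history(result_limit => 1000))
    WHERE start_time > dateadd('minute', -5, current_timestamp())
      AND query_tag ILIKE '%tableau-query-origins%'
""")

for query_id, query_text in cur.fetchall():
    # Apply your own convention for spotting the "please export" signal that
    # the Tableau dashboard embeds in the query (illustrative check).
    if "EXPORT_FLAG" in query_text.upper():
        cur.execute("CALL export_to_s3()")  # hypothetical stored procedure

cur.close()
conn.close()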