I have created a Databricks cluster with which I read data from ADLS and then create a series of temp views in order to do some data transformation with Spark SQL. In the trial version, I was able to create tables from the final temp view, which I can no longer do now that I have a paid version of Databricks (I know, that's ironic...).
Another problem I've spotted is that in the "Spark UI" -> "SQL" tab, the command "Show DATABASES" shows up multiple times and runs for as long as 10-15 minutes, sometimes even after I have finished with all the temp views and written the output data to ADLS. I can't make any sense of it, unfortunately. Has anybody else faced this problem and can help?
I'll attach a photo of what I always see in the "Data" view in Databricks.
I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'); however, I am running into some issues doing an incremental load following the initial load. In particular, I have not been able to find a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
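For reference, a minimal PySpark sketch of the pattern described above; 'orders' is the table from the question, while the updates DataFrame, its path, and the order_id merge key are illustrative assumptions:

# Read the existing table and the incremental data (path and key are placeholders).
current_df = spark.table("orders")
updates_df = spark.read.parquet("/path/to/incremental")

# Emulate an upsert: keep rows that are not being updated, then add the new/updated rows.
merged_df = (
    current_df
    .join(updates_df, on="order_id", how="left_anti")
    .unionByName(updates_df)
)

# This is the step that fails, because 'orders' is still the source of current_df:
# AnalysisException: Cannot overwrite table 'orders' that is also being read from.
merged_df.write.saveAsTable("orders", mode="overwrite", format="parquet")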
A solution indicated by this suggests creating a temporary table and then inserting from that, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this, Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.
I am using Databricks on Azure.
PySpark reads data that's dumped in Azure Data Lake Storage (ADLS).
Every now and then, when I try to read the data from ADLS like so:
spark.read.format('delta').load('/path/to/adls/mounted/interim_data.delta')
it throws the following error
AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table.
The data definitely exists; the folder contents and files show up when I run:
%fs ls /path/to/adls/mounted/interim_data.delta
Right now the only fix is to re-run the script that populated the above interim_data.delta table, which is not a viable fix.
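For what it's worth, a small diagnostic sketch (an assumption on my part, not from the original script) that checks whether the mounted path currently looks like a Delta table before reading it:

from delta.tables import DeltaTable

path = "/path/to/adls/mounted/interim_data.delta"

# isDeltaTable returns False when the _delta_log folder is missing or not yet
# visible at the mount point, which is when the AnalysisException above appears.
if DeltaTable.isDeltaTable(spark, path):
    df = spark.read.format("delta").load(path)
else:
    print(f"{path} has no readable _delta_log yet; check the mount and the writer job")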
Make sure you have copied the data in delta format correctly.
Below is the standard command to do so:
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
df.write.format("delta").save()
You access data in Delta tables either by specifying the path on DBFS ("/mnt/delta/events") or the table name ("events"). Make sure the path or table name is in the correct format. Please refer to the example below:
val events = spark.read.format("delta").load("/mnt/delta/events")
Refer to https://learn.microsoft.com/en-us/azure/databricks/delta/quick-start#read-a-table to learn more about Delta Lake.
Feel free to ask if you have any further questions.
I am answering my own question...
TLDR: Root cause of the issue: frequent remounting of ADLS
There was a section of code that remounted ADLS Gen2 to ADB (Azure Databricks). When other teams ran their scripts, the remounting took 20-45 seconds, and as the number of scripts running on the high-concurrency cluster increased, it was only a matter of time before one of us hit the issue where a script tried to read data from ADLS while it was being remounted...
That is why the failure appeared to be intermittent...
Why was this remounting hack in place? It was put in place because we faced an issue with data not showing up in ADB even though it was visible in ADLS Gen2, and the only way to fix this back then was to force a remount to make that data visible in ADB.
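A minimal sketch of the kind of guard that avoids the remount when the mount point already exists (the source URL, mount point, and configs below are placeholders, not our actual setup):

# Only mount if the mount point is not already present, so concurrent jobs do not
# keep remounting the path while other scripts are reading from it.
mount_point = "/mnt/adls"          # placeholder mount point
configs = {}                       # fill in your OAuth / service-principal settings

already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())

if not already_mounted:
    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
else:
    # refreshMounts makes newly written ADLS data visible without a full remount.
    dbutils.fs.refreshMounts()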
I have a DAG that loads data into a number of raw tables. There is a control table that stores the list of tables and when they were last updated by the DAG. This is all managed by a different team. I am trying to create a DAG to run a query on one of the raw tables and load it into a persistent table. I would like to run my DAG as soon as I see a timestamp in the control table that is later than the one I have already processed.
I am new to Cloud Composer; can you please let me know how I can accomplish this?
Thanks
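A minimal Cloud Composer (Airflow) sketch of one common pattern for this: a PythonSensor polls the control table and the load task runs once it sees a timestamp newer than the last one processed. The control-table query, the table names, and the last-processed bookkeeping below are placeholders, not from the question:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

LAST_PROCESSED = datetime(2021, 1, 1)  # in practice, read this from your own state table


def get_last_update(table_name):
    # Placeholder: query the control table here (e.g. via a database hook) and
    # return the last-updated timestamp recorded for table_name.
    raise NotImplementedError


def control_table_updated():
    # The sensor succeeds once the control table shows a newer timestamp.
    return get_last_update("raw_table_x") > LAST_PROCESSED


def load_persistent_table():
    # Placeholder: run the query that loads the raw table into the persistent table.
    pass


with DAG(
    dag_id="load_persistent_table",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",  # check every 15 minutes
    catchup=False,
) as dag:
    wait_for_update = PythonSensor(
        task_id="wait_for_control_table_update",
        python_callable=control_table_updated,
        poke_interval=300,   # re-check the control table every 5 minutes
        timeout=60 * 60,     # give up after an hour
        mode="reschedule",   # free the worker slot between pokes
    )

    load = PythonOperator(
        task_id="load_into_persistent_table",
        python_callable=load_persistent_table,
    )

    wait_for_update >> load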
I am running into an issue with a pipeline I have set up that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves it as gzipped files. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options, as most of the tables are small and I wanted to keep it flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as .gz files into blob storage, but two tables in particular run for a long time (approx. 4 to 6 hours) without any data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that, the run time was as described above, but still no data was written. This affects only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at Teradata Viewpoint (the admin tool), checked the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
From the issue described, it looks like the table's data size is more than a single blob copy can handle (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Link
Just in case someone else comes across this, the way I solved this was to create a new data store connection called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but to just accept an "item().TableName" value.
This data source contains two main values. The first is the @dataset().TeradataName property:
Dataset property
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the two main activities remain the same: I have a Lookup and then a ForEach activity (where the ForEach gets the item values).
However, in the Copy activity inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This then enabled me to select the "Table" option, and because I am using a table instead of a query I can use the "Hash" partition option. I left it blank because, according to the Microsoft documentation, it will automatically find the Primary Key to use for this.
The only issue I ran into was that if a table does not have a Primary Key, that item will fail and will need to be run through a different process or handled manually outside of this job.
Because of this change, the files that previously just hung there and did not copy now copy successfully into our blob storage account.
Hope this helps someone else who wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.
I have created a job where I have a text input file which is 7.5 GB in size and contains 65 million records, and now I want to push that data into an Amazon Redshift table.
But after processing 5.6 million records it's no longer moving.
What could be the issue? Is there any limitation with tFileOutputDelimited? The job has been running for 3 hours.
Below is the job I have created to push data into the Redshift table.
tFileInputDelimited(.text) --- tMap ---> tFileOutputDelimited(csv)
                                                   |
                                                   |
tS3Put(copy output file to S3) ------> tRedshiftRow(createTempTable) --> tRedshiftRow(COPY to Temp)
The limitation comes from the tMap component; it's not a good choice for dealing with large amounts of data. In your case, you have to enable the "Store temp data" option to overcome tMap's memory consumption limitation.
It's well described in the Talend Help Center.
It looks like tFileOutputDelimited(csv) is creating the problem; a single file can't handle more than a certain amount of data, though I'm not sure. Try to find a way to load only a portion of the parent input file at a time and commit it to Redshift. Repeat the process until your parent input file is completely processed.
Use AWS Glue to push your file data from S3 to Redshift. AWS Glue can easily push large data into Redshift without any issue.
Steps:
1: Create a connection to your Redshift cluster
2: Create a database and two tables:
a: Data-from-S3 (this will be used to crawl the file data from S3)
b: data-to-redshift (add the Redshift connection)
3: Create a job:
a: In Data source, select the "Data-from-S3" table
b: In Data target, select the "data-to-redshift" table
4: Run the job.
Note: You can also automate this with a Lambda function and an SNS trigger.
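If it helps, here is a minimal Glue job sketch of the flow above; the database and table names ("mydb", "data_from_s3", "data_to_redshift") are placeholders to adjust to your own catalog:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue boilerplate: resolve the job name and the S3 temp dir Glue passes in.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled S3 data from the Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="data_from_s3")

# Write to the Redshift-backed catalog table; Glue stages files in TempDir on S3
# and issues a COPY into Redshift behind the scenes.
glueContext.write_dynamic_frame.from_catalog(
    frame=source,
    database="mydb",
    table_name="data_to_redshift",
    redshift_tmp_dir=args["TempDir"])

job.commit()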
You can use the COPY command option to load large data into AWS Redshift; if the COPY command doesn't support the .txt file, then we need a CSV file. Processing 65 million records at once will create issues, so we need to split the load and run it in parts: create 65 iterations and process 1 million records at a time. To implement this, use tLoop and set the values inside the component, and use tLoop's global variables as the header and limit values of the tFileInputDelimited component.
Job:
tLoop -----> tFileInputDelimited ----> tMap (if needed) --------> tFileOutputDelimited
Also enable the "Store temp data" option to handle the memory issue.