Azure Data Lake and export SQL query with pyspark - pyspark

Looking to use databases I have stored in Azure Data Lake. I can run a SQL query in a notebook cell, for example (with PySpark set as the language):
%%sql
SELECT *
FROM db1.table1
All I'm trying to do now is add another notebook cell / line of code that exports the result of the above SELECT statement as a DataFrame (and subsequently as a CSV).

# Run the same query through spark.sql and write the result out as a single CSV file
df = spark.sql("""SELECT * FROM db1.table1""")
df.coalesce(1).write.csv("path/df1.csv")

I would suggest creating a global temp view. That way you can use the view from any notebook you want, as long as your cluster is not terminated.
You can create a global temp view as below:
df.createOrReplaceGlobalTempView("temp_view")
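From another notebook attached to the same cluster, you can then read the view back via the global_temp database (a minimal sketch; temp_view is the name used above):
# Global temp views are registered under the global_temp database
result_df = spark.sql("SELECT * FROM global_temp.temp_view")
display(result_df)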
Please see the official Databricks documentation below:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-view.html

Related

DataBricks: Ingesting CSV data to a Delta Live Table in Python triggers "invalid characters in table name" error - how to set column mapping mode?

First off, can I just say that I am learning DataBricks at the time of writing this post, so I'd like simpler, cruder solutions as well as more sophisticated ones.
I am reading a CSV file like this:
df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)
Then I'm saving it as a Delta Live Table like this:
df1.write.format("delta").save("table_path")
The CSV headers have characters in them like space and & and /, and I get the error:
AnalysisException:
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema.
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.
For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html
Or you can use alias to rename it.
The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created using ALTER TABLE, but does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?
Is there a better way to get CSV into a new table?
UPDATE:
Reading the docs here and here, and inspired by Robert's answer, I tried this first:
spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")
Still no luck; I get the same error. It's interesting how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table.
Thanks to Hemant on the Databricks community forum, I have found the answer.
(df1.write.format("delta")
    .option("delta.columnMapping.mode", "name")
    .option("path", "table_path")
    .saveAsTable("new_table"))
Now I can either query it with SQL or load it into a Spark dataframe:
SELECT * FROM new_table;
delta_df = spark.read.format("delta").load("table_path")
display(delta_df)
SQL Way
This method does the same thing but in SQL.
First, create a CSV-backed table for your CSV file:
CREATE TABLE table_csv
USING CSV
OPTIONS (path '/path/to/file.csv', 'header' 'true', 'mode' 'FAILFAST');
Then create a Delta table using the CSV-backed table:
CREATE TABLE delta_table
USING DELTA
TBLPROPERTIES ("delta.columnMapping.mode" = "name")
AS SELECT * FROM table_csv;
SELECT * FROM delta_table;
I've verified that if I omit the TBLPROPERTIES clause, I get the same error as I did when using Python.
I guess the Python answer would be to run this SQL with spark.sql from Python; that way I could embed the CSV path variable in the SQL, as sketched below.
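A minimal sketch of that idea, assuming a hypothetical csv_path variable and the table names from the SQL above:
csv_path = "/path/to/file.csv"  # hypothetical path variable
spark.sql(f"""
    CREATE TABLE table_csv
    USING CSV
    OPTIONS (path '{csv_path}', 'header' 'true', 'mode' 'FAILFAST')
""")
spark.sql("""
    CREATE TABLE delta_table
    USING DELTA
    TBLPROPERTIES ("delta.columnMapping.mode" = "name")
    AS SELECT * FROM table_csv
""")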
You can set the option in the Spark Configuration of the cluster you are using. That is how you enable the mode at runtime.
You could also set the config at runtime like this:
spark.conf.set("spark.databricks.<name-of-property>", <value>)
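For this particular case, Delta also documents a per-session defaults pattern for table properties; a hedged sketch (the exact property name is my assumption from that pattern, and column mapping may additionally require newer reader/writer protocol versions):
# Assumed from the spark.databricks.delta.properties.defaults.<table-property> pattern; verify for your runtime
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")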

How to create a BQ external table on top of a delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a delta table which uses Google Cloud Storage as its storage layer. On the delta table we perform DML, which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the delta files exist. However, it pulls in even the deleted records, since a BQ external table cannot read delta's transaction log, which says which parquet files to consider and which ones to skip.
Is there a way to expose the latest snapshot of the delta table (GCS location) in BQ as an external table, other than programmatically copying the data from delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1: As Oliver suggested, generate symlink_format_manifest files; you can either trigger the generation every time you update the table, or add a table property as stated here to automatically create those files whenever the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2: Create an external table that points to the delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3: Create another external table pointing to the _symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4: Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that in case of schema changes you have to add the new fields to your table definition, either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a newline-delimited list of parquet files that make up the current delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we define another external table by treating this manifest file as a CSV file (it is actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which points to the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage or Google Drive.
So at this point we have the list of parquet files required for loading the latest snapshot, and the ability to filter the files we want to read using the _FILE_NAME column. The view simply encodes this procedure for getting the latest snapshot. Whenever our delta table gets updated, the manifest and the delta log table will point to the newest data, so we will always get the latest snapshot without any additional loading or data duplication.
One last word: it is a known fact that queries against external tables are more expensive (execution cost) than against BQ managed tables, so it is worth experimenting with both dual writes, as Oliver suggested, and the external table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than maintaining an external table like this.
I'm also developing this kind of pipeline, where we dump our delta lake files in GCS and present them in BigQuery. Generating a manifest file from your GCS delta files will give you the latest snapshot based on the version currently set on your delta files. Then you need to create a custom script to parse that manifest file to get the list of files, and then run a bq load mentioning those files.
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
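A rough sketch of such a custom script, assuming an unpartitioned table with a single manifest file (the bucket, paths, and target table are placeholder assumptions):
from google.cloud import storage
import subprocess

# Hypothetical locations; adjust to your bucket, manifest path, and target dataset.table
bucket_name = "test-delta"
manifest_blob = "_symlink_format_manifest/manifest"
target_table = "test.delta_snapshot"

# The manifest is a newline-delimited list of gs:// URIs for the current delta version's parquet files
client = storage.Client()
manifest_text = client.bucket(bucket_name).blob(manifest_blob).download_as_text()
parquet_uris = [line.strip() for line in manifest_text.splitlines() if line.strip()]

# Replace the BQ table's contents with exactly those files
subprocess.run(
    ["bq", "load", "--replace", "--source_format=PARQUET", target_table, ",".join(parquet_uris)],
    check=True,
)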
The workaround below might work for small datasets:
Have a separate BQ table.
Read the delta lake files into a DataFrame and then overwrite the BigQuery table with it, as sketched below.
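With the spark-bigquery connector, a minimal sketch of that approach might look like this (the connector being installed, the staging bucket, and the table names are assumptions to adapt):
# Read the current snapshot through the Delta reader, then overwrite the managed BQ table
delta_df = spark.read.format("delta").load("gs://test-delta/")
(delta_df.write.format("bigquery")
    .option("table", "test.delta_snapshot")
    .option("temporaryGcsBucket", "my-temp-bucket")  # hypothetical staging bucket for indirect writes
    .mode("overwrite")
    .save())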

Insert into hive from dataframe is not working

I am trying to insert the records from a dataframe into a Hive table using the command below. The command succeeds but the target table is not loaded with any records.
mergerdd.write.mode("append").insertInto("db.tablename")
I expect the records to be loaded into the Hive table.
Please check my solution. It worked for me.
df.repartition(1).write.format("csv").insertInto('db.tablename', overwrite=True)      # CSV
df.repartition(1).write.format("orc").insertInto('db.tablename', overwrite=True)      # ORC
df.repartition(1).write.format("parquet").insertInto('db.tablename', overwrite=True)  # Parquet
This approach works for me via spark.sql:
df.coalesce(num_output_files).createOrReplaceTempView(temp_table_name)
spark.sql(f"insert into {db}.{tablename} select * from {temp_table_name}")
Also, is mergerdd an RDD or a Spark DataFrame?
Here is another way of achieving what you are trying to do:
df.write.mode("append").saveAsTable("db.tablename")
I use this all the time without any problems.
Hope that helps.

create_dynamic_frame_from_catalog returning zero results

I'm trying to create a Glue dynamic frame from an Athena table, but I keep getting an empty data frame.
The Athena table is part of my Glue Data Catalog.
The create_dynamic_frame.from_catalog call doesn't raise any error; as a sanity check, I tried loading a nonexistent table and it did complain.
I know the Athena table has data, since querying the exact same table in Athena returns results.
The table is an external, partitioned JSON table on S3.
I'm using pyspark as shown below:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
)
# Print out information about this data; I'm getting zero here
print("Count:", raw_data_df.count())
# Also getting nothing here
raw_data_df.printSchema()
Is anyone facing the same issue? Could this be a permissions issue or a Glue bug, since no errors are raised?
There are several poorly documented features/gotchas in Glue, which can be frustrating.
I would suggest investigating the following configurations of your Glue job:
Does the S3 bucket name have the aws-glue-* prefix?
Put the files in an S3 folder and make sure the crawler's table definition points to the folder rather than to an actual file.
I have also written a blog on LinkedIn about other Glue gotchas if that helps.
Do you have subfolders under the path your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not read the data recursively. Either put the data at the root of where the table points, or add additional_options = {"recurse": True} to your from_catalog call, as shown in the sketch below.
credit: https://stackoverflow.com/a/56873939/5112418
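In a Glue job, that call might look roughly like this (the database and table names are the placeholders from the question above):
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
    additional_options={"recurse": True},  # also read files in subfolders under the table location
)
print("Count:", raw_data_df.count())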

How can I ensure Redshift Unload Copy columns are in correct order?

I am trying to use the UnloadCopyUtility to migrate an instance to an encrypted instance, but some of the tables fail because it tries to insert the values into the wrong columns. Is there a way I can ensure the columns are mapped to the values correctly? I can adjust the Python script locally if need be.
I feel this should be possible with the UnloadCopy utility as well.
But here I'm trying to give a more generic solution, without the UnloadCopy utility, so that it may be helpful to others as an alternative.
In the UNLOAD command you can specify the columns, like C1,C2,C3,...
Use the same column sequence in the COPY command while loading the data into Redshift.
UNLOAD command example:
unload ('select C1,C2,C3,... from venue') to 's3://mybucket/tickit/unload/venue_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' parallel off;
COPY command example with the specific column sequence of the files unloaded above:
copy table (C1,C2,C3,...) from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;