Time-travel in a Managed Table with Pyspark

The Databricks documentation states:
all tables created in Databricks are Delta tables, by default.
I create a table with
df.write.saveAsTable("table_name")
With the SQL API I can time-travel:
%sql
SELECT * FROM table_name VERSION AS OF 0
How can I time-travel with Python?
I am looking for something like
spark.table("mytab2").versionAsOf(3)

Simplest way:
spark.table("mytab2@v3") # as of version
or
spark.table("mytab2@20221012093243000") # as of timestamp
Reference: Table batch reads and writes / @ syntax. The same page also describes a DataFrameReader option, although for that you need to provide the explicit DBFS path to the Delta table, so it is a bit less convenient.

This syntax also works:
spark.read.format("delta").option("versionAsOf", "0").table("mytab2")
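
If you build these names in code, a small helper keeps the suffix format in one place. This is only a sketch: delta_table_ref is a hypothetical name, not part of any Databricks API, and it assumes the documented suffix forms, @v<version> and @<yyyyMMddHHmmssSSS>.

```python
from datetime import datetime
from typing import Optional


def delta_table_ref(table: str, version: Optional[int] = None,
                    timestamp: Optional[datetime] = None) -> str:
    """Build a table name with a Delta '@' time-travel suffix (hypothetical helper)."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"{table}@v{version}"
    # The suffix uses yyyyMMddHHmmssSSS. strftime's %f yields six fractional
    # digits (microseconds), so drop the last three to keep milliseconds.
    return f"{table}@{timestamp.strftime('%Y%m%d%H%M%S%f')[:-3]}"


print(delta_table_ref("mytab2", version=3))
print(delta_table_ref("mytab2", timestamp=datetime(2022, 10, 12, 9, 32, 43)))
```

In a notebook the result would be passed straight to spark.table(...), for example spark.table(delta_table_ref("mytab2", version=3)).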

Related

How to create a temporary table in snowflake based on pyspark dataframe

I can read the snowflake table in pyspark dataframe using sqlContext
sql = "select * from table1"
df = (sqlContext.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("query", sql)
    .load()
)
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?

Just save as usual, using the Snowflake format:
snowflake_options = {
...
'sfDatabase': 'dbabc',
'dbtable': 'tablexyz',
...
}
(df
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.save()
)

I don't believe this can be done. At least not the way you want.
Technically you can create a temporary table, but I have not found a way to keep it alive across statements. If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it returns a Java object indicating the temp table was created successfully; but as soon as you run any further statement against it, you get errors saying it no longer exists. A temporary table lives only as long as the Snowflake session that created it, so we would need a way to access and keep that session open through the JVM API. That said, I think doing so would run contrary to the Spark paradigm.
If you really need the performance boost of running transformations in Snowflake instead of pulling everything into Spark, keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create permanent or transient tables that act as "temporary" ones, designing Snowflake queries that insert directly into them, and cleaning them up (DROP) when you are done.
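
The create-then-clean-up option can be sketched as a context manager, so the DROP runs even when something in between fails. scratch_table and run_query are hypothetical names; run_query stands for any callable that sends one SQL string to Snowflake, for example a wrapper around the Utils.runQuery call shown above. The demo records the statements in a list instead of contacting Snowflake.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def scratch_table(run_query: Callable[[str], object], name: str,
                  columns: str) -> Iterator[str]:
    """Create a transient table, yield its name, and always drop it afterwards."""
    run_query(f"create transient table {name} ({columns})")
    try:
        yield name
    finally:
        # Runs even if the body raises, so the table does not linger.
        run_query(f"drop table if exists {name}")


# Demo with a recorder in place of a real Snowflake call.
issued = []
with scratch_table(issued.append, "tmp_table", "id int, value text") as table:
    issued.append(f"insert into {table} select 1, 'a'")
print(issued)
```

With the connector, run_query could be something like lambda q: spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, q), which I have not tested beyond the single call above.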

Synapse spark select column with space

I am trying to read a Synapse table that has spaces in its column names.
Reading the table works as long as I select only columns without spaces or special characters:
%%spark
val df = spark.read.synapsesql("<Pool>.<schema>.<table>").select("TYPE", "Year").limit(100)
df.show()
OUTPUT:
+------+----+
|  TYPE|Year|
+------+----+
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
When I start selecting columns with spaces I am getting errors. I have tried many variants:
.select(col("""`Country Code`"""))
.select(col("`Country Code`"))
.select(col("""[Country Code]"""))
.select(col("Country Code"))
.select($"`Country Code`")
.select("`Country Code`")
.select("""`Country Code`""")
All of these return this error:
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'Country'.
If I omit the backticks in select, for example:
.select("[Country Code]")
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name '[Country Code]'.
With backticks, Spark in Synapse takes only the first word as the column name.
Has anyone run into this?

The select on its own will work; adding show (or any other action such as count) will not. There does seem to be an issue with the Synapse synapsesql API. The Invalid column name 'Country' error comes from the SQL engine, because there seems to be no way to pass square brackets back to it. Parquet files also do not support spaces in column names, so that is probably connected.
The workaround is simply not to use spaces in column names. Fix up the tables in an earlier Synapse pipeline step if required. I'll look into it further, but there may be no other answer.
If you want to rename existing columns in the database you can use sp_rename, e.g.
EXEC sp_rename 'dbo.countries.country Type', 'countryType', 'COLUMN';
This code has been tested on a Synapse dedicated SQL pool.
That particular API (spark.read.synapsesql) unfortunately cannot handle views. You would have to materialise the view, e.g. using a CTAS in a prior Synapse pipeline step. The API is useful for simple patterns (get table -> process -> put back) but is pretty limited. You can't even manage table distribution (hash, round_robin, replicate), indexing (clustered columnstore, clustered index, heap) or partitioning, but you never know, they might add that one day. I'll be keeping an eye out during the next MS conference anyway.

I created a function to run a query over JDBC. Thanks to this I was able to read from a view. I have added sample code showing how to get the password from Key Vault using TokenLibrary.
def spark_query(db, query):
    jdbc_hostname = "<synapse_db>.sql.azuresynapse.net"
    user = "<spark_db_client>"
    password = "<strong_password>"
    # password_from_kv = TokenLibrary.getSecret("<Linked_Key_Vault_Service_Name>", "<Key_Vault_Key_Name>", "<Key_Vault_Name>")
    return spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{jdbc_hostname}:1433;databaseName={db};user={user};password={password}") \
        .option("query", query) \
        .load()
Then I created a VIEW with column names that contain no spaces:
CREATE VIEW v_my_table
AS
SELECT [Country code] as country_code from my_table
Then I granted access to <spark_db_client>:
GRANT SELECT ON v_my_table to <spark_db_client>
After all this preparation I was able to read from the VIEW and save the result to a Spark database:
query = """
SELECT country_code FROM dbo.v_my_table
"""
df = spark_query(db="<my_database>", query=query)
spark.sql("CREATE DATABASE IF NOT EXISTS spark_poc")
df.write.mode("overwrite").saveAsTable("spark_poc.my_table")
df.createOrReplaceTempView("my_table")  # registerTempTable is deprecated since Spark 2.0
Names in angle brackets, like <placeholder_variables>, are placeholders.
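
For tables with many such columns, the view definition can be generated rather than typed by hand. A minimal sketch, assuming a simple naming rule (lower-case, non-alphanumeric runs become underscores); snake_case and view_ddl are hypothetical helpers, and the resulting aliases are not checked against T-SQL reserved words.

```python
import re
from typing import Iterable


def snake_case(column: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", column.strip()).strip("_").lower()


def view_ddl(view: str, table: str, columns: Iterable[str]) -> str:
    """Build a CREATE VIEW statement that aliases every column to a space-free name."""
    # Square brackets quote identifiers in T-SQL; a literal ']' is escaped as ']]'.
    select_list = ", ".join(
        f"[{c.replace(']', ']]')}] AS {snake_case(c)}" for c in columns
    )
    return f"CREATE VIEW {view} AS SELECT {select_list} FROM {table}"


print(view_ddl("v_my_table", "my_table", ["Country code", "Region Name"]))
```

The generated statement would then be run once in the dedicated SQL pool, before reading the view through the JDBC function above.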

Execute Postgresql Stored Procedure in PySpark

I am working with PySpark in AWS Glue.
I want to execute a stored procedure/function on a PostgreSQL database.
Is it possible?
What is the syntax? Is any special package needed?
Ankur

You can try using a module like pg8000 to run the function.
You can also try calling the Postgres function the same way you would select data from a table, using the Spark read function with jdbc as the format. Since Glue uses PySpark in the back end, I would expect that giving the function call instead of a table name does the trick. Just remember to add the JDBC driver to your Glue job.
For example, you can do this in Spark:
jdbcDF = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("driver", "org.postgresql.Driver")
    .option("query", "SELECT * from function()")
    .option("user", "user")
    .option("password", "password")
    .load()
)
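
If several functions need to be read this way, the reader options can be assembled in one place. A sketch only: function_read_options is a hypothetical helper, and the URL, function name and credentials are placeholders. It covers functions that return rows; Spark places the query text inside a subquery, so a bare CALL of a procedure would not fit this route and would need a direct connection such as pg8000.

```python
from typing import Dict


def function_read_options(url: str, function_call: str,
                          user: str, password: str) -> Dict[str, str]:
    """Return JDBC reader options that select from a set-returning function."""
    return {
        "url": url,
        "driver": "org.postgresql.Driver",
        # Spark embeds this text as a subquery, so it has to be a SELECT.
        "query": f"SELECT * FROM {function_call}",
        "user": user,
        "password": password,
    }


opts = function_read_options("jdbc:postgresql://host:5432/db",
                             "my_function(1)", "user", "password")
print(opts["query"])
```

In the Glue job this would be used as spark.read.format("jdbc").options(**opts).load().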

Is there a way to describe an external/spectrum table via redshift?

In AWS Athena you can write
SHOW CREATE TABLE my_table_name;
and see a SQL-like query that describes how to build the table's schema. It works for tables whose schema are defined in AWS Glue. This is very useful for creating tables in a regular RDBMS, for loading and exploring data views.
Interacting with Athena in this way is manual, and I would like to automate the process of creating regular RDBMS tables that have the same schema as those in Redshift Spectrum.
How can I do this through a query that can be run via psql? Or is there another way to get this via the aws-cli?

Redshift Spectrum does not support the SHOW CREATE TABLE syntax, but there are system tables that deliver the same information. I have to say, it is not as useful as the ready-to-use SQL returned by Athena.
The tables are
svv_external_schemas - gives you information about glue database mapping and IAM roles bound to it
svv_external_tables - gives you the location information, and also data format and serdes used
svv_external_columns - gives you the column names, types and order information.
Using that data, you could reconstruct the table's DDL.
For example, to get the list of columns and their types in CREATE TABLE format, one can do:
select distinct
    listagg(columnname || ' ' || external_type, ',\n')
        within group ( order by columnnum ) over ()
from svv_external_columns
where tablename = '<YOUR_TABLE_NAME>'
    and schemaname = '<YOUR_SCHEMA_NAME>'
The query gives you output similar to:
col1 int,
col2 string,
...
*) I am using the listagg window function rather than the aggregate function, because apparently the listagg aggregate function can only be used with user-defined tables. Bummer.
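
If you fetch the rows of svv_external_columns into a script instead (via psql or any client), the same reconstruction is a few lines of Python. This is a sketch: external_table_columns_ddl is a hypothetical helper, it emits only the column list, and the external types (such as string) are passed through unchanged, so they may still need mapping to the target database's types.

```python
from typing import Iterable, Tuple


def external_table_columns_ddl(table: str,
                               columns: Iterable[Tuple[int, str, str]]) -> str:
    """Render a CREATE TABLE skeleton from (columnnum, columnname, external_type) rows."""
    ordered = sorted(columns)  # tuples sort by columnnum first
    body = ",\n".join(f"    {name} {ext_type}" for _, name, ext_type in ordered)
    return f"CREATE TABLE {table} (\n{body}\n);"


rows = [(2, "col2", "string"), (1, "col1", "int")]
print(external_table_columns_ddl("my_table", rows))
```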

I had been doing something similar to @botchniaque's answer in the past, but recently stumbled across a solution in the AWS-Labs' amazon-redshift-utils code package that seems to be more reliable than my hand-spun queries:
amazon-redshift-utils: v_generate_external_tbl_ddl
If you don't have the ability to create a view backed with the ddl listed in that package, you can run it manually by removing the CREATE statement from the start of the query. Assuming you can create it as a view, usage would be:
SELECT ddl
FROM admin.v_generate_external_tbl_ddl
WHERE schemaname = '<external_schema_name>'
-- Optionally include specific table references:
-- AND tablename IN ('<table_name_1>', '<table_name_2>', ..., '<table_name_n>')
ORDER BY tablename, seq
;

Redshift has since added SHOW EXTERNAL TABLE:
SHOW EXTERNAL TABLE external_schema.table_name [ PARTITION ]
SHOW EXTERNAL TABLE my_schema.my_table;
https://docs.aws.amazon.com/redshift/latest/dg/r_SHOW_EXTERNAL_TABLE.html

Trim/whitespace issue when load data from Db2 source to Postgresql DB using Talend Open source

We are seeing an issue with table values that are populated from DB2 (source) into Postgres (target).
I have included all the job details for each component here.
With the above approach, once the data has been populated, we run the queries below in the Postgres DB:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_gssn_cd='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_cntry_cd='847' ;
No records are returned. However, when we run the same queries with trim, as below, they work:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_gssn_cd)='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_cntry_cd)='847' ;
Below is what we have tried to overcome this, with no luck:
Used a tMap between the source and target components.
Used trim in the source component under Advanced settings.
Changed the data type of cust_cntry_cd in the Postgres DB from char(5) to character varying, which allows values without any length restriction.
Please suggest what we are missing, as we have this issue in almost all tables that have character/varchar columns.
We are using Talend Open Studio (TOS).

The data type is probably character(5) in DB2.
That means that the trailing spaces are part of the column and will be migrated. You have to compare with
cust_cntry_cd = '847  '
or cast the right argument to character(5):
cust_cntry_cd = CAST ('847' AS character(5))
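
To see why the untrimmed comparison finds nothing, here is the padding reproduced in plain Python. It is an illustration only, assuming the source column is character(5); pad_char is a hypothetical helper that mimics the blank-padding of a fixed-width column.

```python
def pad_char(value: str, length: int) -> str:
    """Blank-pad a value on the right, as a fixed-width character(n) column stores it."""
    if len(value) > length:
        raise ValueError(f"{value!r} does not fit in character({length})")
    return value.ljust(length)


stored = pad_char("847", 5)       # what lands in the varchar target column
print(repr(stored))               # '847  '
print(stored == "847")            # False: the trailing blanks are significant
print(stored.rstrip() == "847")   # True: comparable to trim() in the query
```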

Maybe you could delete all spaces in the advanced settings of the tDB2Input component.
