Create a database in PySpark using Python APIs only - pyspark

I'm unable to locate any API to create a database in PySpark; all I can find is a SQL-based approach.
The Catalog API doesn't mention a Python method to create a database.
Is this even possible? I am trying to avoid using SQL.

It might be a suboptimal solution, but you can create a database by passing a CREATE DATABASE statement to SparkSession's sql method, like this:
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
It's not a pure PySpark API, but this way you don't have to switch context to SQL completely just to create a database :)
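A slightly fuller sketch of the same idea (the session setup and database name are placeholders): only the creation itself goes through SQL, and everything else can stay in the Python Catalog API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-db-example").getOrCreate()

# The Catalog API has no create method, so the single SQL statement does the creation.
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")

# Inspection and switching can stay in the Python Catalog API.
print(spark.catalog.listDatabases())        # test_db should appear here
spark.catalog.setCurrentDatabase("test_db")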

Related

How to create custom aggregate functions using SQLAlchemy ORM?

SQLAlchemy 1.4 ORM using an AsyncSession, Postgres backend, Python 3.6
I am trying to create a custom aggregate function using the SQLAlchemy ORM. The SQL would look something like:
COUNT({group_by_function}), {group_by_function} AS {aggregated_field_name}
I've been searching for information on this.
I know this can be created internally within the Postgres db first, and then used by SA, but this will be problematic for the way the codebase I'm working with is set up.
I know SQLAlchemy-Utils has functionality for this, but I would prefer not to use an external library.
The most direct post on this topic I can find says "The creation of new aggregate functions is backend-dependent, and must be done directly with the API of the underlying connection." But that is from quite a few years ago, and I thought there might have been updates since.
Am I missing something in the SA ORM docs that discusses this or is this not supported by SA, full stop?
You can try something like this query:
from sqlalchemy import func

query = (
    db.session.query(Model)
    .with_entities(
        Model.id,
        func.sum(Model.number).label('total_sum'),
    )
    .group_by(Model.id)
)
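If the aggregate really does have to exist on the PostgreSQL side first, one way to keep that step inside the Python codebase (rather than running it by hand in the db) is to issue the DDL through SQLAlchemy's connection API instead of the ORM. A minimal sketch, with a placeholder engine URL and a deliberately trivial aggregate built on the built-in int4pl function:
from sqlalchemy import create_engine, text

# Placeholder connection URL.
engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")

# CREATE AGGREGATE is PostgreSQL-specific DDL, so it is sent as raw SQL.
create_aggregate = text("""
    CREATE AGGREGATE my_sum (int4) (
        sfunc = int4pl,
        stype = int4,
        initcond = '0'
    )
""")

with engine.begin() as conn:
    conn.execute(create_aggregate)
Once the aggregate exists in the database, it can be called from the ORM with func.my_sum(Model.number) just like any built-in function.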

Can Pyspark Use JDBC to Pass Alter Table

I would like to pass an alter table command to my PostgreSQL database after I load data from a Databricks notebook using pyspark. I know that I can pass a query using spark.read.jdbc but in this case I would like to add a unique constraint once the data has loaded. The purpose is to speed up the data load process into the db by reducing the time to create the unique index.
Spark is a framework for data processing, so its API is mostly built around read and write operations against data sources. In your case you have DDL statements to execute, and Spark isn't meant to perform such operations.
The better option is to keep the DDL operation as a separate step after the data processing in Spark SQL. There you can add one more PostgreSQL job to perform such operations.
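As a sketch of that separate step (assuming the psycopg2 driver is available; the connection details, table, and constraint names are placeholders), you could run the ALTER TABLE directly against PostgreSQL once the Spark write has finished:
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="your-db-host",
    dbname="your_database",
    user="your_user",
    password="your_password",
)
try:
    # The with-block commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(
            "ALTER TABLE my_table ADD CONSTRAINT my_table_id_key UNIQUE (id);"
        )
finally:
    conn.close()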
I was experiencing this exact problem in Redshift. After reviewing the doc on JDBC connections, it looks like you can do something like this:
%sql
ALTER TABLE <jdbcTable> {SOME ACTIONS}
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
  dbtable "<jdbcDatabase>.atable",
  user "<jdbcUsername>",
  password "<jdbcPassword>"
)

Is it possible to create an on-the-fly (soft) dblink in postgresql

Wondering if it is possible to create a dblink in PostgreSQL that is not saved as an object in the db, but rather exists only in memory for the duration of a session of a function or running code? Then use it to connect and do queries.
New to PostgreSQL and not sure how to search for this.
Any examples on how to do this?
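If the dblink extension is the kind of thing being asked about, its named connections are already session-scoped: dblink_connect() creates a connection that lives only for the current database session and is never stored as a database object. A minimal sketch from Python (connection strings, table, and column names are placeholders):
import psycopg2

conn = psycopg2.connect("dbname=localdb user=me")  # placeholder local connection
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS dblink;")

# The named connection exists only for this session; nothing is persisted.
cur.execute("SELECT dblink_connect('tmp_link', 'host=remotehost dbname=remotedb user=me');")
cur.execute("SELECT * FROM dblink('tmp_link', 'SELECT id, name FROM some_table') AS t(id int, name text);")
print(cur.fetchall())

cur.execute("SELECT dblink_disconnect('tmp_link');")
conn.commit()
conn.close()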

Initiating schema in PostgreSQL RDS?

I'm trying to build an app using Node/Express and RDS PostgreSQL on the back-end to get some more experience with these two technologies. More specifically, I'm looking to build this using the node-postgres package and without the aid of an ORM. I currently have a .sql file in my app that contains the desired schema.
What would be considered "best practice" when implementing a schema for the first time? For example, is it considered better to import a schema via the command line, use something like pgAdmin, or throw a bunch of "CREATE TABLEs" into queries through node-postgres?
Thanks in advance for the help!
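If the command-line route is the one chosen, applying an existing schema file to the RDS instance is typically a single psql call (the host, user, and database names here are placeholders):
psql -h <rds-endpoint> -U <user> -d <database> -f schema.sql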

Transact SQL - Information Schema

Is there a way to query an Information Schema from DB2 and dump the contents (table structure only) into another database? I'm looking for a way to create a shell model of a DB2 database schema in a SQL Server database.
Thanks
You can use db2look to get the table structure (DDL) out of db2.
Once you've got it, however, I'm afraid you'll have to manually replace any DB2-specific syntax (data types, storage parameters, etc.) with its corresponding SQL Server syntax.
It doesn't look like Microsoft's Migration Tool works for db2 yet. :(
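For reference, a typical db2look invocation to extract just the DDL (the database, schema, and output file names are placeholders) looks something like:
db2look -d <database> -z <schema> -e -o schema_ddl.sql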