How to execute arbitrary SQL from a PySpark notebook using SQLContext? - ibm-cloud

I'm trying a basic test case of reading data from dashDB into Spark and then writing it back to dashDB again.
Step 1. First, within the notebook, I read the data:
sqlContext = SQLContext(sc)
dashdata = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="GOSALES.BRANCH"
).cache()
Step 2. Then from dashDB I create the target table:
DROP TABLE ****.FROM_SPARK;
CREATE TABLE ****.FROM_SPARK AS (
SELECT *
FROM GOSALES.BRANCH
) WITH NO DATA
Step 3. Finally, within the notebook I save the data to the table:
from pyspark.sql import DataFrameWriter
writer = DataFrameWriter(dashdata)
dashdata = writer.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="****.FROM_SPARK"
)
Question: Is it possible to run the SQL in step 2 from pyspark? I couldn't see how this could be done from the pyspark documentation. I don't want to use vanilla Python for connecting to dashDB because of the effort involved in setting up the library.

Use ibmdbpy. See this brief demo.
With as_idadataframe() you can upload DataFrames into dashDB as a table.
I've added the key steps here, as Stack Overflow discourages link-only answers:
Step 1: Add a cell containing:
#!pip install --user future
#!pip install --user lazy
#!pip install --user jaydebeapi
#!pip uninstall --yes ibmdbpy
#!pip install ibmdbpy --user --no-deps
#!wget -O $HOME/.local/lib/python2.7/site-packages/ibmdbpy/db2jcc4.jar https://ibm.box.com/shared/static/lmhzyeslp1rqns04ue8dnhz2x7fb6nkc.zip
Step 2: Then, from another notebook cell:
from ibmdbpy import IdaDataBase
idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')
....
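A minimal sketch of step 3 using as_idadataframe() (not from the linked demo; the table name is a placeholder, and since as_idadataframe() takes a pandas DataFrame, the Spark DataFrame is collected to the driver first, which is only suitable for small tables like GOSALES.BRANCH):
from ibmdbpy import IdaDataBase
idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')
# collect the Spark DataFrame to pandas on the driver
pandas_df = dashdata.toPandas()
# upload it to dashDB as a new table; "FROM_SPARK" is a placeholder name
idadb.as_idadataframe(pandas_df, "FROM_SPARK")
idadb.commit()
idadb.close()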

Yes, you can create a table in dashDB from the notebook.
Below is the code in Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import java.sql.Connection
import java.sql.DriverManager
import java.sql.SQLException
import com.ibm.db2.jcc._
import java.io._
val jdbcClassName="com.ibm.db2.jcc.DB2Driver"
val url="jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;" // enter the host IP from the connection settings
val user="<username>"
val password="<password>"
Class.forName(jdbcClassName)
val connection = DriverManager.getConnection(url, user, password)
val stmt = connection.createStatement()
stmt.executeUpdate("CREATE TABLE COL12345(" +
"month VARCHAR(82))")
stmt.close()
connection.commit()
connection.close()
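For the original question (doing this from PySpark rather than Scala), the same JDBC approach can be driven through py4j without installing any extra Python database library. A minimal sketch, assuming the DB2 JDBC driver jar is already available to the driver JVM, with placeholders for host and credentials:
from py4j.java_gateway import java_import
# expose java.sql.DriverManager through Spark's py4j gateway
java_import(sc._gateway.jvm, "java.sql.DriverManager")
url = "jdbc:db2://<dashdb server name>:50000/BLUDB"
conn = sc._gateway.jvm.DriverManager.getConnection(url, "<dashdb user>", "<dashdb pw>")
stmt = conn.createStatement()
# run the DDL from step 2 directly from the notebook
stmt.executeUpdate("CREATE TABLE ****.FROM_SPARK AS (SELECT * FROM GOSALES.BRANCH) WITH NO DATA")
stmt.close()
conn.close()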

Related

Pyspark in google colab

I am trying to use PySpark on Google Colab. Every tutorial follows a similar method:
!pip install pyspark
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
# Import a Spark function from library
from pyspark.sql.functions import col
But I get an error in
----> 4 spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
RuntimeError: Java gateway process exited before sending its port number
I tried installing Java using something like this:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
as suggested by the tutorials, but nothing seems to work.
This worked for me, so I am posting it in case someone needs it.
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf
# configure Spark (expose the UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")
# create the context
sc = pyspark.SparkContext(conf=conf)
# create the session
spark = pyspark_sql.SparkSession.builder.getOrCreate()
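To confirm the session actually works, a quick sanity check (just a throwaway DataFrame, nothing Colab-specific) can follow:
# build and display a tiny DataFrame to verify the session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()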

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks and I'm using imports to import functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
    "spark.driver.memory", "32g").getOrCreate()
The imported function "x" takes as an argument a string that was read as a pyspark dataframe, as such:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would this work, and will the package update if I change old_file again?
Thanks
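One common workaround (a sketch, not from this thread: shipping old_file.py to the cluster with SparkContext.addPyFile so both the driver and the executors can import it; the path below is hypothetical) looks like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('workflow').getOrCreate()
# distribute old_file.py to the cluster; the DBFS path is a placeholder
spark.sparkContext.addPyFile("/dbfs/FileStore/code/old_file.py")
from old_file import x  # import only after the file has been shipped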

How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands?
I need to ingest data into a target database and then run some ALTER commands right after.
So after doing extensive research and also opening a case with AWS support, they told me it is not possible from a Python shell or Glue PySpark job at the moment. But I just tried something creative and it worked! The idea is to use py4j, which Spark already relies on, and utilize the standard java.sql package.
Two huge benefits of this approach:
You can define your database connection as a Glue data connection and keep the JDBC details and credentials there without hardcoding them in the Glue code. My example below does that by calling glueContext.extract_jdbc_conf('your_glue_data_connection_name') to get the JDBC URL and credentials defined in Glue.
If you need to run SQL commands on a database that Glue supports out of the box, you don't even need to use or pass a JDBC driver for that database - just make sure you set up a Glue connection for it and add that connection to your Glue job; Glue will upload the proper database driver jars.
Remember that the code below is executed by the driver process and cannot be executed by Spark workers/executors.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
logger = glueContext.get_logger()
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# dw-poc-dev spark test
source_jdbc_conf = glueContext.extract_jdbc_conf('your_glue_database_connection_name')
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url'), source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))
print(conn.getMetaData().getDatabaseProductName())
# call stored procedure, in this case I call sp_start_job
cstmt = conn.prepareCall("{call dbo.sp_start_job(?)}")
cstmt.setString("job_name", "testjob")
results = cstmt.execute()
conn.close()
I finally got this working after a couple of hours so hopefully the following will be helpful. My script is heavily influenced by the earlier responses, thank you.
Prerequisites:
You will want the Glue connection configured and tested before attempting any scripts.
When setting up your AWS Glue job, use Spark, Glue version 2.0 or later, and Python version 3.
I recommend configuring this job with just 2 workers to save on cost; the bulk of the work is going to be done by the database, not by Glue.
The following is tested with an AWS RDS PostgreSQL instance, but is hopefully flexible enough to work for other databases.
The script needs 3 parameters updated near the top of the script (glue_connection_name, database_name, and stored_proc).
The JOB_NAME, connection string, and credentials are retrieved by the script and do not need to be supplied.
If your stored proc will return a dataset then replace executeUpdate with executeQuery.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glue_connection_name = '[Name of your glue connection (not the job name)]'
database_name = '[name of your postgreSQL database]'
stored_proc = '[Stored procedure call, for example public.mystoredproc()]'
#Below this point no changes should be necessary.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_job_name = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(glue_job_name, args)
job.commit()
logger = glueContext.get_logger()
logger.info('Getting details for connection ' + glue_connection_name)
source_jdbc_conf = glueContext.extract_jdbc_conf(glue_connection_name)
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/' + database_name, source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))
logger.info('Connected to ' + conn.getMetaData().getDatabaseProductName() + ', ' + source_jdbc_conf.get('url') + '/' + database_name)
stmt = conn.createStatement()
rs = stmt.executeUpdate('call ' + stored_proc)
logger.info("Finished")
I modified the code shared by mishkin, but it did not work for me. After troubleshooting a bit, I realized that for me the connection from the catalog did not work, so I had to modify it manually and tweak the code a little. It is working now, but it throws an exception at the end because it cannot convert the Java result to a Python result. I used a workaround, so use with caution.
Below is my code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#source_jdbc_conf = glueContext.extract_jdbc_conf('redshift_publicschema')
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
print('Trying to connect to DB')
conn = sc._gateway.jvm.DriverManager.getConnection('jdbc:redshift://redshift-cluster-2-url:4000/databasename', 'myusername', 'mypassword')
print('Trying to connect to DB success!')
print(conn.getMetaData().getDatabaseProductName())
# call stored procedure, in this case I call sp_start_job
stmt = conn.createStatement()
# cstmt = conn.prepareCall("call dbname.schemaname.my_storedproc();")
print('Call to proc trying ')
# cstmt.setString("job_name", "testjob")
try:
    rs = stmt.executeQuery('call mySchemaName.my_storedproc()')
except:
    print("An exception occurred but proc has run")
# results = cstmt.execute()
conn.close()
It depends. If you are using Redshift as a target, you have the option of specifying pre and post actions as part of the connection options, and you can specify ALTER actions there. For the other target types, however, you might need to use a Python module such as pg8000 (in the case of Postgres) or similar.
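For the Redshift case, a hedged sketch of those pre/post actions when writing with from_jdbc_conf (the DynamicFrame, connection name, table, and SQL statements are all placeholders) might look like:
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=my_dynamic_frame,  # placeholder DynamicFrame
    catalog_connection="my_redshift_connection",
    connection_options={
        "dbtable": "public.target_table",
        "database": "mydb",
        # SQL run before and after the load; placeholders for illustration
        "preactions": "TRUNCATE TABLE public.target_table;",
        "postactions": "ALTER TABLE public.target_table OWNER TO etl_user;"
    },
    redshift_tmp_dir=args["TempDir"]
)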
If you attach a connection object to the Glue job, you can easily get the connection settings:
import boto3
glue_client = boto3.client('glue')
getjob = glue_client.get_job(JobName=args["JOB_NAME"])
connection_settings = glue_client.get_connection(Name=getjob['Job']['Connections']['Connections'][0])
conn_name = connection_settings['Connection']['Name']
df = glueContext.extract_jdbc_conf(conn_name)
Thanks mishkin for sharing the script. I got the below error when I followed the script for Redshift:
An error occurred while calling z:java.sql.DriverManager.getConnection. [Amazon]JDSI Required
setting ConnSchema is not present in connection settings
It looks like source_jdbc_conf.get('url') does not include the database name in the JDBC URL, so I ended up appending the database name to the JDBC URL.
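In practice that just means a small tweak before opening the connection (the database name is a placeholder):
db_name = 'mydatabase'  # placeholder; whichever database the script should hit
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/' + db_name, source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))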

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark, but it throws the error 'No module named 'pyspark''. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()```
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env and I tried pip install pyspark but I already have it.
If you are using Zepl, they have their own specific way of importing. This makes sense: since the notebooks run in their cloud, they need their own syntax, which distinguishes it from plain Python. For instance, %spark.pyspark:
%spark.pyspark
from pyspark.sql import SparkSession

Databricks-connect with Python + Scala UDF in JAR file not working locally

I'm trying to use a JAR file from Python (using databricks-connect) in VS Code.
I already checked the path to the jar file.
I have the following code as example:
import datetime
import time
from pyspark.sql import SparkSession
from pyDataHub import LoadProcessorBase, ProcessItem
from pyspark.sql.functions import col, lit, sha1, concat, udf, array
from pyspark.sql import functions
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType
from pyspark import SparkContext
from pyspark.sql.functions import sha1, upper
from pyspark.sql.column import Column, _to_java_column, _to_seq
spark = SparkSession \
    .builder \
    .config("spark.jars", "/users/Phill/source/jar/DataHub_Core_Functions.jar") \
    .getOrCreate()
sc = spark.sparkContext
def PhillHash(col):
    f = sc._jvm.com.narato.datahub.core.HashContentGenerator.getGenerateHashUdf()
    return upper(sha1(Column(f.apply(_to_seq(sc, [col], _to_java_column)))))
sc._jsc.addJar("/users/Phill/source/jar/DataHub_Core_Functions.jar")
spark.range(100).withColumn("test", PhillHash("id")).show()
Any help would be appreciated, because I'm out of options here...
The error I get is the following:
Exception has occurred: TypeError 'JavaPackage' object is not callable
Add the jar to a dbfs location and update the path accordingly. The workers cannot connect to your local filesystem.
Also make sure you are running version 5.4 of the databricks runtime (or higher).
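A hedged sketch of what that change could look like (the DBFS path is a placeholder; upload the jar first, for example with `databricks fs cp` or the workspace UI):
from pyspark.sql import SparkSession
jar_path = "dbfs:/FileStore/jars/DataHub_Core_Functions.jar"  # placeholder DBFS location
spark = SparkSession.builder.config("spark.jars", jar_path).getOrCreate()
sc = spark.sparkContext
# point addJar at the DBFS copy so the cluster, not your laptop, serves the jar
sc._jsc.addJar(jar_path)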