How can I add class to classpath in pyspark? - pyspark

I am trying to create a temporary function like this:
spark.sql("create temporary function ptyUnprotectStr as 'com.protegrity.hive.udf.ptyUnprotectStr'")
Error I get is:
AnalysisException: "Can not load class 'com.protegrity.hive.udf.ptyUnprotectStr' when regisitering the function 'ptyUnprotectStr' please make sure it is on the classpath"
So how can I add 'com.protegrity.hive.udf.ptyUnprotectStr' to the classpath? I am using Pyspark in Jupyter on Anaconda enterprise.

Related

How to list and delete files faster in Databricks - using pyspark

I want to implement efficient file listing and deletion on Databricks using pyspark. The following link has an implementation in Scala, is there an equivalent pyspark version?
https://kb.databricks.com/en_US/data/list-delete-files-faster
You can use dbutils, the DataBricks file utility APIs.
To delete a file or a directory:
dbutils.fs.rm("dbfs:/filepath")
To delete all files from a dir, and optionally delete the dir, I use a custom written util function:
def empty_dir(dir_path, remove_dir=False):
listFiles = dbutils.fs.ls(dir_path)
for _file in listFiles:
if _file.isFile():
dbutils.fs.rm(_file.path)
if remove_dir:
dbutils.fs.rm(dir_path)

Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve these parameters from within Jupyter notebook code. I've tried doing so in two ways, but none of them worked for me:
Method A. as Arguments and then tried to retrieve them using sys.argv[].
Method B. as Spark configuration and then tried to retrieve them using sc.getConf().getAll().
I suspect that either:
I am not specifying parameters correctly
or using a wrong way to retrieve them in Jupyter Notebook code
or parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for the Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity of HDInsight cluster, you must either give the entryFilePath as a .jar file or .py file. When we follow this, we can successfully pass arguments which can be utilized using sys.argv.
The following is an example of how you can pass arguments to python script.
The code inside nb1.py (sample) is as shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
#using agrument from pipeline to create table.
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table will be created (name of this table is passed as an argument from pipeline Spark activity). We can query this table in HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
So, please ensure that you are passing arguments to python script (.py file) but not a python notebook.

sequence files from sqoop import

I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile() but I'm not sure what to pass as keyClass and value Class. I tried tried using org.apache.hadoop.io.Text, org.apache.hadoop.io.LongWritable for keyClass and valueClass
but it did not work. I am using pyspark for reading the files.
in python its not working however in SCALA it works:
You need to do following steps:
step1:
If you are importing as sequence file from sqoop, there is a jar file generated, you need to use that as ValueClass while reading sequencefile. This jar file is generally placed in /tmp folder, but you can redirect it to a specific folder (i.e. to local folder not hdfs) using --bindir option.
example:
sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export --
username retail_user --password itversity --table customers -m 1 --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2:
Also, you need to download the jar file from below link:
http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3:
Suppose, customers table is imported using sqoop as sequence file.
Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar
step4: Now run below commands inside the spark-shell
scala> import org.apache.hadoop.io.LongWritable
scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")
scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
You can use SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (aka keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4
$ pyspark --packages seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()

zeppelin-0.7.3 Interpreter pyspark not found

I get the below error when I use pyspark via Zeppelin.
The python & spark interpreters work and all environment variables are set correctly.
print os.environ['PYTHONPATH']
/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/pyspark.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/pyspark:/x01/spark_u/zeppelin/interpreter/python/py4j-0.9.2/src:/x01/spark_u/zeppelin/interpreter/lib/python
zepplin-env.sh is set with the below vars
export PYSPARK_PYTHON=/usr/local/bin/python2
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"
See the below log file
INFO [2017-11-01 12:30:42,972] ({pool-2-thread-4}
RemoteInterpreter.java[init]:221) - Create remote interpreter
org.apache.zeppelin.spark.PySparkInterpreter
org.apache.zeppelin.interpreter.InterpreterException:
paragraph_1509038605940_-1717438251's Interpreter pyspark not
found
Thank you in advance
I found a workaround for the above issue.The interpreter not found issue does not happen when I create note inside a directory.The issue only happens when I use notes at toplevel.Addionally I foud out that this issue does not happen in 0.7.2 version
Ex :
enter image description here

Postgres PL/JAVA: java.lang.ClassNotFoundException error after loading JAR file in database

I am getting the java.lang.ClassNotFoundException: error inside Postgres when running a function that calls a JAR file I have loaded. I have installed and configured PL/JAVA (including the delivered examples) in my database and can run the examples to success. I am not attempting to load/install my first JAR, but I am doing something wrong.
My host controls the OS version: CentOS 6.8. Postgres is version 8.4.
I am attempting to install my own very simple java class, which is a derivative of the delivered example Parameters.addOne class. All my code is in /tmp. Here are the steps I've followed:
Doug.java:
package com.msmetric;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Time;
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.logging.Logger;
public class Doug {
public static int addOne(int value) {
return value + 1;
}
}
Compile Doug.java using 'javac Doug.java' succeeds.
Create JAR file with Doug.class file in it using 'jar -cvf Doug.jar Doug.class. This works fine.
Now I load the JAR file into Postgres (public schema), change the classpath, create the function that calls the JAR, then attempt to run at psql prompt.
Run sqlj.install_jar from psql:
select sqlj.install_jar('file:/tmp/Doug.jar','Doug',false);
Set the classpath inside Postgres (from psql prompt postgres=#):
select sqlj.set_classpath('public','Doug');
Create the function that calls the JAR. This create function code is taken directly from the examples.ddr file that came with PL/JAVA. I simply changed org.postgres to com.msmetric.
create or replace function addone(int) returns int as 'com.msmetric.Doug.addOne(java.lang.Integer)' language java;
Now with the JAR loaded and function created, I attempt to run it. This function should simply add 1 to the number provided.
select addone(3);
Results:
ERROR: java.lang.ClassNotFoundException: com.msmetric.Doug
Thoughts?
I'm very sorry I didn't see your question sooner. Underneath all the exotic details (PostgreSQL, PL/Java, schemas, classpaths...), there's just a bit of basic Java going on here: if a jar file contains a class Doug.class in package com.msmetric, its path within the jar has to reflect that: it has to be com/msmetric/Doug.class. Otherwise, it won't be found.
You can set up that whole structure step by step:
javac Doug.java
mkdir com
mkdir com/msmetric
mv Doug.class com/msmetric/
jar -cvf Doug.jar com/msmetric/Doug.class
Or, you can let javac do more of the work for you:
mkdir classes
javac -d classes Doug.java
jar -cvf Doug.jar -C classes .
When you give javac a -ddirectory option, instead of just writing class files next to their .java sources, it will put them all in their proper places under the directory you named, and then you can just tell jar to change into that directory and slurp them all up (don't overlook the . at the end of that jar command).
Once you fix that, if you retry your original steps, you'll see that you now get a different error:
ERROR: Unable to find static method com.msmetric.Doug.addOne with signature (Ljava/lang/Integer;)I
That happens because you declared the function in Doug.java with int addOne(int value) (that is, taking a primitive int argument), but you declared it in SQL with returns int as 'com.msmetric.Doug.addOne(java.lang.Integer)' taking an Integer object.
Once you correct that:
create or replace function addone(int) returns int as 'com.msmetric.Doug.addOne(int)' language java;
you'll be able to see:
# select addone(3);
addone
--------
4
(1 row)
If you happen to see this belated answer, may I ask what version of PL/Java you are using? That's one detail you didn't mention. If it is older than 1.5.0, there are newer features that can help you out. For one, you can just annotate that function:
#Function
public static int addOne(int value) {
return value + 1;
}
and have javac spit out not only the Doug.class file but also a pljava.ddr file with your SQL function declaration already written correctly (no mixing up argument types!). There is a way to include that .ddr file into the jar you create so that you can just call sqlj.install_jar with the last parameter true so it runs the commands in the .ddr and your functions are ready to use. There's a Hello, world example in the docs that shows more of how it's done.
Cheers,
-Chap