How to use Python logging in Databricks PySpark for a UDF? - pyspark

When I register a Python function as a PySpark UDF, the log messages produced inside it are not written to my log file. Please help me capture those logs in a file.
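For context, a minimal sketch of the setup described above might look like the following (the log file path, function, and data are hypothetical). Note that the UDF body runs on the executors, so records logged inside it typically end up in the executor logs rather than in a file handler configured on the driver, which is usually why they seem to go missing.

import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Driver-side logging configuration (hypothetical file path).
logging.basicConfig(filename="/dbfs/tmp/udf_example.log", level=logging.INFO)

spark = SparkSession.builder.appName("UdfLoggingExample").getOrCreate()

def clean_value(value):
    # This runs in the executor Python workers; these records go to the
    # executor logs, not to the driver's file handler configured above.
    logging.getLogger("udf_logger").info("processing value: %s", value)
    return value.strip() if value is not None else None

clean_udf = udf(clean_value, StringType())

df = spark.createDataFrame([(" a ",), ("b",)], ["col"])
df.withColumn("cleaned", clean_udf("col")).show()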

Related

How to get Nats messages using pySpark (no Scala)

I found two libraries for working with NATS, for Java and Scala (https://github.com/Logimethods/nats-connector-spark-scala, https://github.com/Logimethods/nats-connector-spark). Writing a separate connector in Scala and sending its output to PySpark feels wrong. Is there any other way to connect PySpark to NATS?
Spark-submit version 2.3.0.2.6.5.1100-53
Thanks in advance.
In general, there is no clean way :( The only option I found is to use a Python NATS connector that sends its output to a socket, and have PySpark process the data received on that socket.
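For illustration, the PySpark side of that workaround could be a simple socket stream like the sketch below (the host and port are placeholders; a separate process subscribing to NATS would have to write each message to that socket).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal sketch of the socket-based workaround: PySpark consumes a TCP
# socket that a separate NATS-subscriber process writes messages to.
sc = SparkContext(appName="NatsViaSocket")
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical relay endpoint
lines.pprint()

ssc.start()
ssc.awaitTermination()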

Can Spark jobs be scheduled through Airflow

I am new to Spark and need to clarify some doubts I have.
Can I schedule Spark jobs through Airflow?
My Airflow (Spark) jobs process raw CSV files stored in an S3 bucket, transform them into Parquet format, store the result back in S3, and finally load the fully processed data into Presto/Hive. End users connect to Presto and query the data to create visualisations.
Can this processed data be stored in Hive only, or in Presto only, so that the user can connect to Presto or Hive accordingly and query the database?
Well, you can always use the SparkSubmitOperator to schedule and submit your Spark jobs, or you can use the BashOperator and run the spark-submit command from it; a sketch of the SparkSubmitOperator approach is shown below.
As for your second question: after Spark has created the Parquet files, you can use Spark (the same Spark instance) to write the data into Hive or Presto.
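For illustration, a minimal Airflow DAG using the SparkSubmitOperator might look like the sketch below. It assumes the apache-airflow-providers-apache-spark package is installed and that a 'spark_default' connection exists; the DAG id, schedule, and script path are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Minimal sketch: one task that spark-submits a PySpark script.
# The connection id and the application path below are placeholders.
with DAG(
    dag_id="csv_to_parquet_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="csv_to_parquet",
        application="/opt/jobs/csv_to_parquet.py",
        conn_id="spark_default",
    )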

Not able to execute Pyspark script using spark action in Oozie - Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog

I am facing the error below while running a Spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My PySpark script runs fine when executed as a normal Spark job, but fails when executed via Oozie.
PySpark program:

from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I created a workflow.xml and job.properties taking reference from the LINK.
I copied all the Spark- and Hive-related configuration files into the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this PySpark program as a jar file in an Oozie Spark action.
Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog' means the catalog jar it is trying to find is not in the Oozie sharelib's spark directory.
Please add the following property to your job.properties file:
oozie.action.sharelib.for.spark=hive,spark,hcatalog
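For reference, the relevant part of job.properties could look roughly like this (the name node, resource manager, and application path values are placeholders; only the sharelib line comes from the recommendation above).

# job.properties sketch; nameNode/jobTracker values and the application
# path are placeholders -- only the sharelib line is from the answer above.
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=hive,spark,hcatalog
oozie.wf.application.path=${nameNode}/user/hadoop/workflows/pyspark-app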
Also, could you please post the whole log?
And if possible, could you run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.

connect to memsql from pyspark shell

Is it possible to connect to memsql from pyspark?
I heard that MemSQL recently built the Streamliner infrastructure on top of PySpark to allow for custom Python transformations.
But does this mean I can run pyspark or submit a Python Spark job that connects to MemSQL?
Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or perform transformations during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell
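As a rough illustration of querying MemSQL from a PySpark application: the MemSQL Spark connector documented at the links above is the supported route, but since MemSQL speaks the MySQL wire protocol, a plain JDBC read also works as a sketch (the host, credentials, table name, and driver availability below are assumptions).

from pyspark.sql import SparkSession

# Rough sketch of reading a MemSQL table over JDBC from PySpark.
# Host, port, credentials, and table name are placeholders; the MySQL
# JDBC driver jar must be available on the Spark classpath.
spark = SparkSession.builder.appName("MemSQLJdbcExample").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://memsql-host:3306/mydb")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "my_table")
    .option("user", "root")
    .option("password", "secret")
    .load()
)
df.show()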

How can I change SparkContext.sparkUser() setting (in pyspark)?

I am new to Spark and PySpark.
I use PySpark; after my RDD processing, I tried to save the result to HDFS using the saveAsTextFile() function.
But I get a 'permission denied' error message, because PySpark tries to write to HDFS
using my local account, 'kjlee', which does not exist on the HDFS system.
I can check the Spark user name with SparkContext().sparkUser(), but I can't find how to change it.
How can I change the Spark user name?
There is an environment variable for this: HADOOP_USER_NAME.
So simply use export HADOOP_USER_NAME=anyuser, or in PySpark you can use os.environ["HADOOP_USER_NAME"] = "anyuser".
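For example, a minimal PySpark sketch of this might look as follows (the user name and HDFS path are placeholders; the environment variable has to be set before the SparkContext, and therefore the JVM, is created).

import os

# Set the Hadoop user before the SparkContext (and its JVM) is created;
# the user name and HDFS output path below are placeholders.
os.environ["HADOOP_USER_NAME"] = "hdfsuser"

from pyspark import SparkContext

sc = SparkContext(appName="HadoopUserExample")
rdd = sc.parallelize(["a", "b", "c"])
rdd.saveAsTextFile("hdfs:///user/hdfsuser/output")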
In Scala this can be done with System.setProperty:

import org.apache.spark.sql.SparkSession

System.setProperty("HADOOP_USER_NAME", "newUserName")

val spark = SparkSession
  .builder()
  .appName("SparkSessionApp")
  .master("local[*]")
  .getOrCreate()

println(spark.sparkContext.sparkUser)