pyspark equivalent of the flag "isLocal" from the Scala API

Scala has a flag isLocal. If it is true, we know Spark is running in local mode; otherwise it is running on a cluster. Is there any pyspark alternative to this, or do we simply check sc.master?

It's not available in the Python API, but you can call isLocal on the underlying Java SparkContext:
sc._jsc.isLocal()

Related

Use pyspark on yarn cluster without creating the context

I'll try my best to explain myself. I'm using JupyterHub to connect to my university's cluster and write some code. Basically I'm using pyspark, but since I've always used a "yarn kernel" (I'm not sure of the exact term) I've never defined the Spark context or the Spark session myself. Now, for some reason, it doesn't work anymore, and when I try to use spark this error appears:
Code ->
df = spark.read.csv('file:///%s/.....
Error ->
name 'spark' is not defined
This has already happened to me before, but back then I solved it just by installing another version of pyspark. Now I don't know what to do.

Scala code execution on master of spark cluster?

The Spark application uses some API calls which do not use the Spark session. I believe that when a piece of code doesn't use Spark, it gets executed on the master node!
Why do I want to know this?
I am getting a Java heap space error while trying to POST some files using API calls, and I believe that if I upgrade the master node and increase driver memory it can be solved.
I want to understand how this type of application is executed on the Spark cluster?
Is my understanding right or am I missing something?
It depends: closures/functions passed to the built-in transform function, code in any UDFs you create, and code in foreachBatch (and maybe a few other places) will run on the workers. All other code runs on the driver.

configuring scheduling pool in spark using zeppelin, scala and EMR

In pyspark I'm able to change to a fair scheduler within zeppelin (on AWS EMR) by doing the following:
conf = sc.getConf()
conf.set('spark.scheduler.allocation.file',
'/etc/spark/conf.dist/fairscheduler.xml.template')
sc.setLocalProperty("spark.scheduler.pool", 'production')
However, if I try something similar in a Scala cell, things continue to run in the FIFO pool:
val conf = sc.getConf()
conf.set("spark.scheduler.allocation.file",
"/etc/spark/conf.dist/fairscheduler.xml.template")
sc.setLocalProperty("spark.scheduler.pool", "FAIR")
I've tried so many combinations, but nothing has worked. Any advice is appreciated.
I ran into a similar issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option from my Spark config. It might be that your Scala Spark interpreter is set up with spark.scheduler.pool but your Python one isn't.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.

How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x

I had some problems running an "s3-dist-cp" command from my pyspark script, as I needed to move some data from S3 to HDFS for better performance, so I am sharing this here.
import os
os.system("/usr/bin/s3-dist-cp --src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ --dest=/de_pulse/ --groupBy='.*(additional).*' --targetSize=64 --outputCodec=none")
Note: please make sure that you give the full path to s3-dist-cp (e.g. /usr/bin/s3-dist-cp).
Alternatively, I think we can use the subprocess module.
If you're running a pyspark application, you'll have to stop the Spark application first; otherwise the s3-dist-cp step will hang because the pyspark application is still holding the cluster's resources.
spark.stop()  # stop the SparkSession (and its SparkContext)
os.system("/usr/bin/s3-dist-cp ...")
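A sketch of the subprocess variant, which surfaces the exit code and captured output; `echo` stands in for `/usr/bin/s3-dist-cp` below, since that binary only exists on EMR nodes:

```python
import subprocess

result = subprocess.run(
    ["echo", "--src=s3://aiqdatabucket/aiq-inputfiles/", "--dest=/de_pulse/"],
    capture_output=True, text=True,
    check=True,  # raises CalledProcessError on a non-zero exit code
)
print(result.stdout)
```

Unlike os.system, this lets you inspect stdout/stderr of the copy job and fail fast if it errors out.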

How to set hadoop configuration values from pyspark

The Scala version of SparkContext has the property
sc.hadoopConfiguration
I have successfully used that to set Hadoop properties (in Scala)
e.g.
sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")
However the python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values into the Hadoop Configuration used by the PySpark context?
sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')
should work
You can set any Hadoop properties using the --conf parameter while submitting the job.
--conf "spark.hadoop.fs.mapr.trace=debug"
Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:
fileLines = sc.newAPIHadoopFile(
    'dev/*',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()