Reading a CSV file in PySpark (1.6.0)

Maybe the question is trivial, but I am getting errors while reading a CSV file from a local directory in PySpark.
I tried:
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark import SparkContext as sc
mydata = sc.textFile("/home/documents/mydata.csv")
newdata = mydata.map(lambda line: line.split(","))
But I am getting an error like:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)
Now my question is: I have imported SparkContext just before that, so why am I getting this error? Please point out what I am missing.

You do not get sc simply by importing SparkContext as sc:
In interactive usage (i.e. pyspark shell), sc is already initialized, so sc.textFile() should work fine
In self-contained applications, you should initialize sc first:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
where the arguments in SparkContext() matter - see the provided links for more details.
Finally, Spark 1.x cannot natively read CSV files into DataFrames - you will need the external spark-csv package. You may find a relevant blog post I wrote some time ago for Spark 1.5 useful...
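For reference, a minimal sketch of reading a CSV through the spark-csv data source in Spark 1.x (assuming pyspark was launched with something like --packages com.databricks:spark-csv_2.10:1.5.0, and that the file has a header row):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "CSV example")
sqlContext = SQLContext(sc)

# read the CSV through the spark-csv data source into a DataFrame
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       # first line contains column names
      .option("inferSchema", "true")  # guess column types
      .load("/home/documents/mydata.csv"))
df.show()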

Related

Name sc is not defined

I am just trying to execute sc.version inside the pyspark shell, but I am getting an error that sc is not defined.
>>> sc.version()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined
If I run SparkContext.getOrCreate():
>>> SparkContext.getOrCreate()
<pyspark.context.SparkContext object at 0x7f206aa8cfd0>
I still do not get the output of sc.version(). What is the problem?
A few things:
Inside the pyspark shell you automatically only have access to the Spark session (which can be referenced as spark).
To get the SparkContext, you can get it from the Spark session via sc = spark.sparkContext, or use the getOrCreate() method as mentioned by @Smurphy0000 in the comments.
version is an attribute of the SparkContext, not a method. To get the version from the SparkContext (sc in this case), use version = sc.version. The version can also be read from the session directly as version = spark.version.
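Put together, a quick sketch of what this looks like in the shell (the version strings below are only illustrative):
>>> sc = spark.sparkContext        # or: sc = SparkContext.getOrCreate()
>>> sc.version                     # an attribute, so no parentheses
'2.1.1'
>>> spark.version
'2.1.1'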

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in PyCharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something is wrong with how I linked PyCharm to pyspark.
(Screenshots: Environment Variable, Project Structure, Project Interpreter)
When you start pyspark from the command line, a SparkSession object and a SparkContext are available to you as spark and sc, respectively.
To use them in PyCharm, you have to create these variables yourself first:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
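For completeness, a minimal self-contained script combining this with the example from the question (PyCharm setup aside):
from pyspark.sql import SparkSession

# create the session and context that the pyspark shell normally provides
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

l = [('Alice', 1)]
print(spark.createDataFrame(l).collect())  # [Row(_1='Alice', _2=1)]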
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark, but it throws the error 'No module named pyspark'. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env, and I tried pip install pyspark, but I already have it installed.
If you are using Zepl (hosted Zeppelin), it has its own specific way of selecting the interpreter for a paragraph. This makes sense: since the notebook runs in the cloud, the interpreter directive distinguishes that environment from plain Python. For instance, start the paragraph with %spark.pyspark:
%spark.pyspark
from pyspark.sql import SparkSession
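Putting it together, a sketch of a complete Zeppelin/Zepl paragraph (assuming the Spark interpreter group is bound to the note):
%spark.pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()
print(spark.version)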

Dataproc: functools.partial no attribute '__module__' error for pyspark UDF

I am using GCP/Dataproc for some Spark/GraphFrames calculations.
On my private Spark/Hadoop standalone cluster,
I have no issue using functools.partial when defining a PySpark UDF.
But now, with GCP/Dataproc, I have the issue shown below.
Here is some basic setup to check whether partial works or not.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
def power(base, exponent):
    return base ** exponent
In the main function, functools.partial works well in ordinary cases as we expect:
# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)
But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,
testSquareUDF = F.udf(partial(power, exponent=2),T.FloatType())
testdf = inputdf.withColumn('pxsquare',testSquareUDF('px'))
I get this error message:
Traceback (most recent call last):
File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
testSquareUDF = F.udf(square,T.FloatType())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].
=========
I had no issue of this kind with my standalone cluster.
My Spark cluster version is 2.1.1.
The GCP Dataproc one is 2.2.x.
Can anyone see what prevents me from passing the partial function to the UDF?
As discussed in the comments, the issue was with Spark 2.2. Since Spark 2.3 is also supported by Dataproc, simply using --image-version=1.3 when creating the cluster fixes it.
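For example, something along these lines when creating the cluster (the cluster name is a placeholder and other flags are omitted; only --image-version matters here):
gcloud dataproc clusters create my-cluster --image-version=1.3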

1: error: ';' expected but 'import' found

I am running this code in Zeppelin, and I am getting the following error message:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="PythonSQL")
hive_context = HiveContext(sc)
bank = hive_context.table("default.invites_orc")
bank.show()
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
sc.stop()
:1: error: ';' expected but 'import' found.
from pyspark import SparkContext
^
The Spark Interpreter group currently has 4 interpreters, as listed here:
https://zeppelin.incubator.apache.org/docs/0.5.0-incubating/interpreter/spark.html
The default interpreter is %spark, and the default is selected based on the order of the interpreters listed in the zeppelin.interpreters property of the zeppelin-site.xml config file.
The current order of interpreters in your zeppelin-site.xml (zeppelin.interpreters property) will be this ...
org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter
Modify this to ...
org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkInterpreter
and restart Zeppelin (zeppelin-daemon.sh restart)
This will make %pyspark the default interpreter.
OR
You can write it like this:
%pyspark
from pyspark import SparkContext
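For example, a sketch of the question's code as a %pyspark paragraph (note that the Zeppelin pyspark interpreter usually provides sc already, so you do not need to create a new SparkContext):
%pyspark
from pyspark.sql import HiveContext

# sc is already provided by the %pyspark interpreter
hive_context = HiveContext(sc)
bank = hive_context.table("default.invites_orc")
bank.show()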