Could not read CSV file in PySpark

I am new to PySpark and have worked through some initial tutorials. When I try to load a CSV file from my local machine into Spark using a Jupyter Notebook, the error below pops up. My Java version is 8.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('sql based spark data analysis') \
    .config('spark.some.config.option', 'some-value') \
    .getOrCreate()

df = spark.read.csv('C:/Users/sitaram/Downloads/creditcardfraud/creditcard.csv')
My error is as follows:
Py4JJavaError: An error occurred while calling o55.csv.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:65
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive

Please try C://Users//sitaram//Downloads//creditcardfraud//creditcard.csv
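For reference, a minimal sketch of the suggested read, using the double-slash path from the answer (the header and inferSchema options are added here for convenience and are not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('sql based spark data analysis') \
    .getOrCreate()

# Path written with doubled forward slashes, as suggested above;
# plain forward slashes (C:/Users/...) also work on Windows.
df = spark.read.csv(
    'C://Users//sitaram//Downloads//creditcardfraud//creditcard.csv',
    header=True,       # assumption: first row holds column names
    inferSchema=True   # assumption: let Spark infer column types
)
df.show(5)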

Related

SQLException: No suitable driver found for jdbc:phoenix:(hosts)

I am running a spark-submit command that will do some database work via a Scala class.
spark-submit \
    --verbose \
    --class mycompany.MyClass \
    --conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
    --conf spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf \
    --master yarn \
    --driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
    --jars /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
    --files /home/hadoop/mydir/dev-test.conf \
    --num-executors 1 \
    --executor-memory 3g \
    --driver-memory 3g \
    --queue default \
    /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar \
    <<args to MyClass>>
When I run this, I get an exception:
Caused by: java.sql.SQLException: No suitable driver found for jdbc:phoenix:host1,host2,host3:2181:/hbase;
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:422)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:414)
Here are the relevant parts of my Scala code:
val conf: SerializableHadoopConfiguration =
  new SerializableHadoopConfiguration(sc.hadoopConfiguration)

Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

val tableRowKeyPairs: RDD[(Cell, ImmutableBytesWritable)] =
  df.rdd.mapPartitions(partition => {
    val configuration = conf.get()
    val partitionConn: JavaConnection = QueryUtil.getConnection(configuration)
    // ...
  })
My spark-submit command includes /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar using the --jars argument. When I search that file for "org.apache.phoenix.jdbc.PhoenixDriver", I find it:
$ jar -tf /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar | grep -i driver
...
org/apache/phoenix/jdbc/PhoenixDriver.class
...
So why can't my program locate the driver?
I was able to get the program to find the driver by adding the following argument to the spark-submit command shown in the question:
--conf "spark.executor.extraClassPath=/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar"
This StackOverflow article has great explanations for what the various arguments do.
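For completeness, the same executor classpath setting can also be supplied when the session is built; below is a minimal PySpark sketch (assuming a PySpark entry point rather than the Scala class from the question, and reusing the jar path passed via --jars above). For a job launched with spark-submit, the --conf form shown in the answer remains the more reliable place for classpath settings.

from pyspark.sql import SparkSession

# Hypothetical programmatic equivalent of the --conf fix above.
spark = (
    SparkSession.builder
    .appName('phoenix-driver-classpath')
    .config('spark.executor.extraClassPath',
            '/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar')
    .getOrCreate()
)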

syntax error line 1 at position 15 unexpected 'copy' pyspark

I am trying to run a COPY command in PySpark, as below. How do I get rid of this error?
spark.write.format("snowflake") \
.options(**config.sfparams) \
.option("query", "copy into people_data from (select $1:Company_ID::varchar as Company_ID from #company_stage/pitchbook/"+config.todays_date+"/Company/")\
.load()
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o49.load.
: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
syntax error line 1 at position 15 unexpected 'copy'.
COPY doesn't work in PySpark. I created an external table to read the data directly from S3 and enabled autorefresh=true.
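A minimal sketch of what reading through the connector can look like once the external table is in place (the table name people_data_ext is a made-up example; config.sfparams is the same connection-options dict as in the question):

from pyspark.sql import SparkSession
import config  # same module as in the question, holding sfparams

spark = SparkSession.builder.appName('snowflake-external-table-read').getOrCreate()

# Read from a (hypothetical) external table instead of pushing COPY INTO
# through the connector's "query" option.
df = (
    spark.read.format("snowflake")
    .options(**config.sfparams)
    .option("dbtable", "people_data_ext")  # assumed external table name
    .load()
)
df.show(5)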

Pyspark: SPARK_HOME may not be configured correctly

I'm trying to run PySpark from a notebook in a conda environment.
$ which python
inside the environment 'env' returns:
/Users/<username>/anaconda2/envs/env/bin/python
and outside the environment:
/Users/<username>/anaconda2/bin/python
My .bashrc file has:
export PATH="/Users/<username>/anaconda2/bin:$PATH"
export JAVA_HOME=`/usr/libexec/java_home`
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2
export PYTHONPATH=$SPARK_HOME/libexec/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
But still, when I run:
import findspark
findspark.init()
I'm getting the error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
Any ideas?
Full traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
142 try:
--> 143 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
144 except IndexError:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
/var/folders/dx/dfb8h2h925l7vmm7y971clpw0000gn/T/ipykernel_72686/1796740182.py in <module>
1 import findspark
2
----> 3 findspark.init()
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
144 except IndexError:
145 raise Exception(
--> 146 "Unable to find py4j, your SPARK_HOME may not be configured correctly"
147 )
148 sys.path[:0] = [spark_python, py4j]
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
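The traceback shows what findspark is doing: it globs for a py4j-*.zip under $SPARK_HOME/python/lib and raises when nothing matches. A quick check along those lines (the paths are the ones from the .bashrc above; with Homebrew the actual distribution usually sits under libexec, which is why the second candidate is worth testing):

import glob
import os

# SPARK_HOME as exported in .bashrc, plus the libexec layout Homebrew uses
for candidate in ('/usr/local/Cellar/apache-spark/3.1.2',
                  '/usr/local/Cellar/apache-spark/3.1.2/libexec'):
    matches = glob.glob(os.path.join(candidate, 'python', 'lib', 'py4j-*.zip'))
    print(candidate, '->', matches)

# If only the libexec path matches, point findspark there explicitly, e.g.:
# import findspark
# findspark.init('/usr/local/Cellar/apache-spark/3.1.2/libexec')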
EDIT:
If I run the following in the notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
I get the error:
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0

Reading a BigQuery table into a Spark RDD on GCP Dataproc: why is the class missing for use in newAPIHadoopRDD?

Just over a week ago, I was able to read a BigQuery table into an RDD for a Spark job running on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then I have been encountering missing-class issues, despite no changes having been made on my side.
I have attempted to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, but I cannot find any information on whether this class is now excluded from gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar.
The job submission request is as follows:
gcloud dataproc jobs submit pyspark \
    --cluster $CLUSTER_NAME \
    --jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
    --bucket gs://$BUCKET_NAME \
    --region europe-west2 \
    --py-files $PYSPARK_PATH/main.py
The PySpark code breaks at the following point:
bq_table_rdd = spark_context.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
where conf is a Python dict structured as follows:
conf = {
    'mapred.bq.project.id': project_id,
    'mapred.bq.gcs.bucket': gcs_staging_bucket,
    'mapred.bq.temp.gcs.path': input_staging_path,
    'mapred.bq.input.project.id': bq_input_project_id,
    'mapred.bq.input.dataset.id': bq_input_dataset_id,
    'mapred.bq.input.table.id': bq_input_table_id,
}
When the output indicates that the code has reached the spark_context.newAPIHadoopRDD call above, the following is printed to stdout:
class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory
Traceback (most recent call last):
File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
sc)
File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
This had not been an issue as recently as last week. I am concerned that even the hello world example on the GCP website is not stable in the short term. If anyone could shed some light on this issue, it would be greatly appreciated. Thanks.
I reproduced the problem:
$ gcloud dataproc clusters create test-cluster --image-version=1.4
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
then exactly the same error happened:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
I noticed there was a new release 1.0.0 on Aug 23:
$ gsutil ls -l gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-**
...
4038762 2018-10-03T20:59:35Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.8.jar
4040566 2018-10-19T23:32:19Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
14104522 2019-06-28T21:08:57Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC1.jar
14104520 2019-07-01T20:38:18Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC2.jar
14149215 2019-08-23T21:08:03Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar
14149215 2019-08-24T00:27:49Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
Then I tried version 0.13.9, and it worked:
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
It is a problem with 1.0.0; there is already an issue filed on GitHub. We'll fix it and improve the tests.

"ERROR 6000, Output location validation failed" using PIG MongoDB-Hadoop Connector on EMR

I get an "output location validation failed" exception in my pig script on EMR.
It fails when saving data back S3.
I use this simple script to narrow the problem:
REGISTER /home/hadoop/lib/mongo-java-driver-2.13.0.jar
REGISTER /home/hadoop/lib/mongo-hadoop-core-1.3.2.jar
REGISTER /home/hadoop/lib/mongo-hadoop-pig-1.3.2.jar
example = LOAD 's3://xxx/example-full.bson'
USING com.mongodb.hadoop.pig.BSONLoader();
STORE example INTO 's3n://xxx/out/example.bson' USING com.mongodb.hadoop.pig.BSONStorage();
This is the stack trace produced:
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias example
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1637)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1091)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:543)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:317)
at org.apache.pig.PigServer.compilePp(PigServer.java:1382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1299)
at org.apache.pig.PigServer.access$400(PigServer.java:124)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1632)
... 13 more
Caused by: org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 26 more
To set up the MongoDB connector I used this bootstrap script:
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-core-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-pig-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-hive-1.3.2.jar
cp /home/hadoop/lib/mongo* /home/hadoop/hive/lib
cp /home/hadoop/lib/mongo* /home/hadoop/pig/lib
The error suggests that the output directory does not exist.
The obvious solution would be to create the output directory.
For a quick check, it is also possible to make the output directory the same as the input directory. If the directory actually does exist, it may be a permissions issue.