I followed this guide to read data stored in a Google Cloud Storage bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector
It worked fine. The following command
hadoop fs -ls gs://the-bucket-you-want-to-list
gave me the expected results. But when I tried reading data with PySpark using
rdd = sc.textFile("gs://crawl_tld_bucket/"),
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can I get this working?
To access Google Cloud Storage you have to include the Cloud Storage connector jar:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
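If you don't want to pass the jar on every invocation, you can register it once in conf/spark-defaults.conf instead. A minimal sketch, assuming the same jar location as above (adjust the path and connector version to your installation):
spark.jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
# Register the gs:// filesystem implementations shipped with the connector
spark.hadoop.fs.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS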
Related
I'm trying to connect to an Aurora DB using a JDBC jar file downloaded locally. This is my code:
spark = SparkSession.builder.master("local").appName("PySpark_app").config("spark.driver.memory", "16g").config("spark.jars", "D:/spark/jars/postgresql-42.2.5.jar")\
.getOrCreate()
spark_df = spark.read.format("jdbc").option("url", "postgresql://aws-som3l1nk.us-west-2.rds.amazonaws.com") \
.option("driver", "org.postgresql.Driver").option("user", "username")\
.option("password", "password").option("query",query).load()
After trying to read the data I get this error:
An error occurred while calling o401.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(Unknown Source)...
I'm not sure if I'm missing something, or whether the jar file is being picked up at all.
I'm working on a local machine, and trying to install packages using pyspark --packages fails as well!
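One thing worth checking, though it is only a guess from the error alone: org.postgresql.Driver has to be on the driver's classpath when the connection is opened, and passing the jar on the launch command line is usually more reliable than setting spark.jars inside the builder after the JVM has started. A sketch of such a launch (the script name is a placeholder); note also that the JDBC URL would normally need a jdbc:postgresql:// prefix:
spark-submit --jars D:/spark/jars/postgresql-42.2.5.jar --driver-class-path D:/spark/jars/postgresql-42.2.5.jar your_pyspark_script.py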
I'm trying to read CSV files from an S3 bucket located in the Mumbai region. I'm trying to read the files using DataStax dse spark-submit.
I tried changing the hadoop-aws version to various other versions. Currently, the hadoop-aws version is 2.7.3.
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = spark.read.csv("s3a://bucket_path/csv_name.csv")
Upon executing, the following is the error I get:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 8C7D34A38E359FCE, AWS Error Code: null, AWS Error Message: Bad Request
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
Your signature V4 option is not being applied.
Add the Java option when you run spark-submit or spark-shell:
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
Or, set the system property such as:
System.setProperty("com.amazonaws.services.s3.enableV4", "true");
Thank you for all the help. I figured out from Lamanus's answer that the signature V4 option was not being applied even after adding it with
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
So I added the following line and now the code works perfectly fine.
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
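The reason the hadoopConfiguration setting alone has no effect is that com.amazonaws.services.s3.enableV4 is read by the AWS SDK as a JVM system property, not from the Hadoop configuration. A minimal sketch of the working order of operations, reusing the settings from the question (the bucket path and credentials are placeholders):
import com.amazonaws.SDKGlobalConfiguration

// Must be set before the first S3A filesystem is created in this JVM
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)

val df = spark.read.csv("s3a://bucket_path/csv_name.csv")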
I'm trying to perform some read/write operations with HBase using Spark. When I run my Spark code using the spark-submit command
bin/spark-submit --master local[*] --class com.test.driver.Driver /home/deb/computation/target/computation-1.0-SNAPSHOT.jar "function=avg" "signals=('.tagname_qwewf')" "startTime=2018-10-10T13:51:47.135Z" "endTime=2018-10-10T14:36:11.073Z"
it's executing without any error.
But when I try to do the same from IntelliJ, I get the errors below:
18/12/17 01:51:45 ERROR HbaseConnector: An exception while reading dataframe from HBase
18/12/17 01:51:45 ERROR HbaseConnector: Can't get the location for replica 0
18/12/17 01:51:45 ERROR Driver: No historical data found for signals in the expression.
Any suggestions on how to resolve this issue?
I'm using Spark 2.3.0 and Hadoop 2.9.1.
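Not confirmed by the question, but a common cause of "Can't get the location for replica 0" when a job works under spark-submit and fails from the IDE is that hbase-site.xml is not on the IDE run classpath, so the HBase client falls back to a default localhost ZooKeeper quorum. A sketch of setting the quorum explicitly instead (the hostnames and port are placeholders):
import org.apache.hadoop.hbase.HBaseConfiguration

// Equivalent to what hbase-site.xml would normally provide; values are placeholders
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk-host1,zk-host2,zk-host3")
hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")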
I'm trying to load a CSV file located in HDFS with Spark:
scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("hdfs://127.0.0.1:50075/filesHDFS/data.csv")
But I get the following error:
2018-11-14 11:47:58 WARN FileStreamSink:66 - Error while looking for metadata directory.
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "Desktop-Presario-CQ42-Notebook-PC/127.0.0.1"; destination host is: "localhost":50070;
Instead of using 127.0.0.1, use the default FS name. You can find it in the core-site.xml file under the property fs.defaultFS.
It should solve your problem.
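For example, if core-site.xml contains something like this (the host and port are illustrative; 8020 and 9000 are common RPC defaults, whereas 50070 and 50075 are the HTTP web UI ports, which the RPC client cannot talk to, hence the protobuf error):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
then the load call becomes:
val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").schema(schema).load("hdfs://localhost:9000/filesHDFS/data.csv")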
I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar: the prebuilt Spark 2.2.0 distribution ships Hadoop 2.7.x client jars, and mixing Hadoop versions on the classpath leads to errors like the NoClassDefFoundError and IllegalAccessError above.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, you will be able to load data from the S3 bucket in the shell.
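Alternatively, --packages with the matching hadoop-aws version also works, since hadoop-aws 2.7.3 pulls the compatible aws-java-sdk 1.7.4 transitively. A quick check, using the same path as the question (credentials still come from conf/spark-defaults.conf):
$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
scala> val p = spark.read.textFile("s3a://sparkcookbook/person")
scala> p.show(5)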