reading google bucket data in spark - pyspark

I have followed this blog post to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector
It worked fine. The following command
hadoop fs -ls gs://the-bucket-you-want-to-list
gave me the expected results. But when I try to read data with pyspark using
rdd = sc.textFile("gs://crawl_tld_bucket/")
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can I get this done?

To access Google Cloud Storage you have to include the Cloud Storage connector jar:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
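If you build the session yourself instead of passing --jars on the command line, here is a minimal PySpark sketch (the jar path is the same placeholder as above; the fs.gs.* class names are the standard ones shipped with the connector):
# Minimal sketch: put the GCS connector on the classpath and register the gs:// scheme,
# which fixes "java.io.IOException: No FileSystem for scheme: gs".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-read")
    .config("spark.jars", "/path/to/gcs/gcs-connector-latest-hadoop2.jar")  # same jar as the --jars example
    .getOrCreate()
)

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

rdd = spark.sparkContext.textFile("gs://crawl_tld_bucket/")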

Related

Connecting to Aurora DB with a Postgres JDBC driver jar in local mode in PySpark

I'm trying to connect to an Aurora DB using a JAR file downloaded locally. This is my code:
spark = SparkSession.builder.master("local").appName("PySpark_app").config("spark.driver.memory", "16g").config("spark.jars", "D:/spark/jars/postgresql-42.2.5.jar")\
.getOrCreate()
spark_df = spark.read.format("jdbc").option("url", "postgresql://aws-som3l1nk.us-west-2.rds.amazonaws.com") \
.option("driver", "org.postgresql.Driver").option("user", "username")\
.option("password", "password").option("query",query).load()
After trying to read the data I get this error:
An error occurred while calling o401.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(Unknown Source)...
I'm not sure if I'm missing something, or whether the JAR file is even being picked up. I'm working on a local machine, and trying to install packages with pyspark --packages fails as well.
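For comparison, a minimal sketch of how that JDBC read is usually written (host, database, credentials, and the jar path are placeholders; note in particular the jdbc:postgresql:// prefix on the URL, which the snippet above omits):
# Sketch only: placeholders throughout. The key points are the jdbc:postgresql:// URL scheme
# and making sure the path given to spark.jars points at a jar that actually exists.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local")
    .appName("PySpark_app")
    .config("spark.jars", "D:/spark/jars/postgresql-42.2.5.jar")
    .getOrCreate()
)

query = "SELECT 1"  # placeholder query

spark_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://aws-som3l1nk.us-west-2.rds.amazonaws.com:5432/mydb")  # placeholder host/db
    .option("driver", "org.postgresql.Driver")
    .option("user", "username")
    .option("password", "password")
    .option("query", query)
    .load()
)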

Read Files from S3 bucket to Spark Dataframe using Scala in Datastax Spark Submit giving AWS Error Message: Bad Request

I'm trying to read CSV files from an S3 bucket located in the Mumbai region, using DataStax dse spark-submit.
I have tried changing the hadoop-aws version to various other versions; currently the hadoop-aws version is 2.7.3.
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = spark.read.csv("s3a://bucket_path/csv_name.csv")
Upon executing, the following is the error I'm getting:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 8C7D34A38E359FCE, AWS Error Code: null, AWS Error Message: Bad Request
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
Your signature V4 option is not being applied.
Add the Java options when you run spark-submit or spark-shell:
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
Or set the system property directly:
System.setProperty("com.amazonaws.services.s3.enableV4", "true");
Thank you for all the help. I figured out from Lamanus's answer that the signature V4 option was not applied even after adding it with
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
So I added the following lines and now the code works perfectly fine:
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
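For anyone driving the same read from PySpark, here is a minimal sketch of the equivalent setup (keys and file names stay as the placeholders from the question; the V4 flag is a JVM system property, so it still has to be passed as a Java option at submit time):
# PySpark sketch of the same setup; pass the V4 flag to both JVMs when submitting, e.g.:
#   spark-submit \
#     --conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
#     --conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
#     read_s3.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")         # Mumbai-region endpoint, as in the question
hconf.set("fs.s3a.access.key", "ACCESS_KEY_ID")                     # placeholder
hconf.set("fs.s3a.secret.key", "SECRET_ACCESS_KEY")                 # placeholder
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.csv("s3a://bucket_path/csv_name.csv")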

ERROR HbaseConnector: Can't get the location for replica 0

I'm trying to perform some read/write operations with HBase using Spark. When I run my Spark code using the spark-submit command
bin/spark-submit --master local[*] --class com.test.driver.Driver /home/deb/computation/target/computation-1.0-SNAPSHOT.jar "function=avg" "signals=('.tagname_qwewf')" "startTime=2018-10-10T13:51:47.135Z" "endTime=2018-10-10T14:36:11.073Z"
it executes without any error.
But when I try to do the same from IntelliJ, I get the errors below:
18/12/17 01:51:45 ERROR HbaseConnector: An exception while reading dataframe from HBase
18/12/17 01:51:45 ERROR HbaseConnector: Can't get the location for replica 0
18/12/17 01:51:45 ERROR Driver: No historical data found for signals in the expression.
Any suggestions on how to resolve this issue?

Read CSV from HDFS with Spark/Scala

I'm using Spark 2.3.0 and Hadoop 2.9.1.
I'm trying to load a CSV file located in HDFS with Spark:
scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("hdfs://127.0.0.1:50075/filesHDFS/data.csv")
But I get the following error:
2018-11-14 11:47:58 WARN FileStreamSink:66 - Error while looking for metadata directory.
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "Desktop-Presario-CQ42-Notebook-PC/127.0.0.1"; destination host is: "localhost":50070;
Instead of using 127.0.0.1, use the default FS name. You can find it in the core-site.xml file under the property fs.defaultFS.
That should solve your problem.
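A sketch of what that looks like in practice (shown in PySpark for consistency with the rest of this page; the path mirrors the question, and 50070/50075 are HTTP web-UI ports rather than the NameNode RPC port an hdfs:// URI needs):
# Sketch: take the NameNode address from fs.defaultFS instead of hard-coding 127.0.0.1 and a web-UI port.
default_fs = spark.sparkContext._jsc.hadoopConfiguration().get("fs.defaultFS")

dataframe = (
    spark.read
    .option("header", "true")
    .csv(default_fs + "/filesHDFS/data.csv")   # same file path as in the question
)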

How to use s3 with Apache spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, when you try to load data from the S3 bucket in the shell, you will be able to do so.
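The same jar pairing works if you drive it from PySpark instead of the Scala shell; a minimal sketch (jar paths, bucket, and credentials are the placeholders already used in the question):
# Sketch: start a session with the Hadoop-2.7-matched jar pair and read the bucket from the question.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar")
    .getOrCreate()
)

# Credentials can also be set here instead of conf/spark-defaults.conf (values are placeholders).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "access-key")
hconf.set("fs.s3a.secret.key", "secret-key")

p = spark.read.text("s3a://sparkcookbook/person")
p.show(5)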