Spark unable to read parquet file from partitioned S3 bucket - scala

I have my S3 bucket partitioned like this:
bucket
|-- 2018
|-- 2019
    |-- 01
    |-- 02
        |-- 01
            |-- files.parquet
...
It works fine when I read using this command (Spark 2.1.1):
val dfo = sqlContext.read.parquet("s3://bucket/2019/04/03/*")
but it hits an error when I try to add a partition variable to the path:
val dfo = sqlContext.read.parquet("s3://bucket/2019/04/day=03/*")
or
val dfo = sqlContext.read.parquet("s3://bucket/y=2019/m=04/day=03")
Error:
Name: org.apache.spark.sql.AnalysisException
Message: Path does not exist: s3://bucket/2019/04/day=03/*;
StackTrace: at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:377)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
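For what it's worth, Spark's partition discovery only recognizes directories that are literally named key=value, so with the layout above the path s3://bucket/2019/04/day=03 really does not exist on S3, which is exactly what the AnalysisException reports. A rough sketch of the two usual workarounds (the column names year/month/day are illustrative, not part of the original layout):
import org.apache.spark.sql.functions.lit

// Option 1: keep the numeric layout, read the concrete path, and attach the
// partition values as ordinary columns.
val dfo = sqlContext.read
  .parquet("s3://bucket/2019/04/03/*")
  .withColumn("year", lit(2019))
  .withColumn("month", lit(4))
  .withColumn("day", lit(3))

// Option 2: if the data is rewritten under key=value directories such as
// s3://bucket/year=2019/month=04/day=03/, partition discovery kicks in and the
// columns can be pruned with an ordinary filter:
// val dfo2 = sqlContext.read.parquet("s3://bucket/").where("year = 2019 AND month = 4 AND day = 3")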

Related

File domain patron error when running spark streaming

I faced this error after my application had been running for several hours.
My Spark application reads a stream from a streaming Hudi table (a Hudi table that is constantly updated) and writes it to a parquet directory. Another stream reads that same parquet output and writes to a second Hudi table. The flow is as follows:
Hudi -> stream 1 -> parquet -> stream 2 -> hudi
The error appears when stream 2 reads from the parquet output (a rough sketch of that leg is shown below). The underlying storage is OneFS.
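A minimal sketch of what "stream 2" looks like, assuming a recent Hudi version that registers the "hudi" data source name; the paths, schema, and table/key names here are placeholders, not the actual configuration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("stream2-sketch").getOrCreate()

// File streams need an explicit schema; this one is purely illustrative.
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)
  .add("value", DoubleType)

// Read the intermediate parquet directory written by stream 1 as a file stream.
val parquetStream = spark.readStream
  .schema(schema)
  .parquet("hdfs://path/intermediate")

// Write the stream into a second Hudi table.
parquetStream.writeStream
  .format("hudi")
  .option("hoodie.table.name", "target_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("checkpointLocation", "hdfs://path/checkpoints/stream2")
  .outputMode("append")
  .start("hdfs://path/target_hudi_table")
The error reported by stream 2: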
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Failed to get file domain patron for path /path/_temporary. Error: Name: _temporary Status: STATUS_OBJECT_NAME_NOT_FOUND
=== Streaming Query ===
Identifier: [id = 72e6b29c-a641-47ff-82fc-ccd8146a4226, runId = 22af0796-7cb4-4599-b29f-ee95bda27cb3]
Current Committed Offsets: {FileStreamSource[hdfs://path]: {"logOffset":60}}
Current Available Offsets: {FileStreamSource[hdfs://path]: {"logOffset":60}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileStreamSource[hdfs://path]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:356)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Failed to get file domain patron for path path/_temporary. Error: Name: _temporary Status: STATUS_OBJECT_NAME_NOT_FOUND
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:578)
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2086)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:944)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:927)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:872)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:868)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1696)
at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:220)
at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
at org.apache.spark.sql.execution.streaming.FileStreamSource.allFilesUsingInMemoryFileIndex(FileStreamSource.scala:248)
at org.apache.spark.sql.execution.streaming.FileStreamSource.fetchAllFiles(FileStreamSource.scala:301)
at org.apache.spark.sql.execution.streaming.FileStreamSource.fetchMaxOffset(FileStreamSource.scala:128)
at org.apache.spark.sql.execution.streaming.FileStreamSource.latestOffset(FileStreamSource.scala:325)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$3(MicroBatchExecution.scala:394)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:385)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:128)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:

Problems in the configuration between hadoop and spark

I have a problem in my program that I do not have in spark-shell.
When I call:
FileSystem.get(spark.sparkContext.hadoopConfiguration)
in spark-shell everything works perfectly, but when I use it in my code, it doesn't read core-site.xml. I can get it to work when I use:
val conf = new Configuration()
conf.addResource(new Path("path to conf/core-site.xml"))
FileSystem.get(conf)
This solution is not acceptable, since I need to use the Hadoop configuration without passing it explicitly.
In both cases (spark-shell and the program) the master is set to spark://x.x.x.x:7077.
How can I configure Spark to use the Hadoop configuration?
Code:
val HdfsPrefix: String = "hdfs://"
val path: String = "/tmp/"

def getHdfs(spark: SparkSession): FileSystem = {
  //val conf = new Configuration()
  //conf.addResource(new Path("/path to/core-site.xml"))
  //FileSystem.get(conf)
  FileSystem.get(spark.sparkContext.hadoopConfiguration)
}

val dfs = getHdfs(session)

data.select("name", "value").collect().foreach { x =>
  val os = dfs.create(new Path(HdfsPrefix + path + x.getString(0)))
  val content: String = x.getString(1)
  os.write(content.getBytes)
  os.hsync()
}
Error log:
Wrong FS: hdfs:/tmp, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:428)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:690)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:446)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at com.bbva.ebdm.ocelot.io.hdfs.HdfsIO$HdfsOutputFile$$anonfun$write$1.apply(HdfsIO.scala:116)
at com.bbva.ebdm.ocelot.io.hdfs.HdfsIO$HdfsOutputFile$$anonfun$write$1.apply(HdfsIO.scala:115)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at com.bbva.ebdm.ocelot.io.hdfs.HdfsIO$HdfsOutputFile.write(HdfsIO.scala:115)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseApp$$anonfun$exec$1.apply(SparkSqlBaseApp.scala:33)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseApp$$anonfun$exec$1.apply(SparkSqlBaseApp.scala:31)
at scala.collection.immutable.Map$Map3.foreach(Map.scala:161)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseApp$class.exec(SparkSqlBaseApp.scala:31)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$2$$anonfun$apply$2$$anon$1.exec(SparkSqlBaseAppTest.scala:47)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$3.apply(SparkSqlBaseAppTest.scala:49)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$3.apply(SparkSqlBaseAppTest.scala:47)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(SparkSqlBaseAppTest.scala:47)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(SparkSqlBaseAppTest.scala:47)
at wvlet.airframe.Design.runWithSession(Design.scala:169)
at wvlet.airframe.Design.withSession(Design.scala:182)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(SparkSqlBaseAppTest.scala:47)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:454)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FunSpec.withFixture(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:451)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:464)
at org.scalatest.FunSpec.runTest(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:497)
at org.scalatest.FunSpec.runTests(FunSpec.scala:1630)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:501)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest.org$scalatest$BeforeAndAfterAll$$super$run(SparkSqlBaseAppTest.scala:31)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at com.bbva.ebdm.ocelot.templates.spark_sql.SparkSqlBaseAppTest.run(SparkSqlBaseAppTest.scala:31)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1346)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
You need to put hdfs-site.xml and core-site.xml on the Spark classpath, i.e. the classpath of your program when you run it:
https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
According to the docs:
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
hdfs-site.xml, which provides default behaviors for the HDFS client.
core-site.xml, which sets the default filesystem name.
The location of these configuration files varies across Hadoop versions, but a common location is inside of /etc/hadoop/conf. Some tools create configurations on-the-fly, but offer a mechanism to download copies of them.
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.
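As a quick sanity check (a minimal sketch; the expected value depends on your cluster), you can print the filesystem that the Hadoop configuration resolves to:
// If core-site.xml is being picked up, this should print your HDFS URI
// (e.g. hdfs://namenode:8020) rather than the local default file:///
println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))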
The problem was ScalaTest: it does not read core-site.xml when Maven compiles the project, but spark-submit picks it up correctly once the project is built.

Relative path in absolute URI: txt Spark mac

I am running Spark on a Mac (Jupyter notebook), not Windows. I am trying to read a txt file:
val text = sc.textFile("shakespeare.txt")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()
I get the following error:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: Module 3:%20Apache%20Spark
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 37 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: Module 3:%20Apache%20Spark
at java.base/java.net.URI.checkPath(URI.java:1941)
at java.base/java.net.URI.<init>(URI.java:757)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 61 more
Could you help me fix it?
Thank you
Give the complete path to where the text file is located on your Mac,
e.g.: "/user/name/shakespeare.txt"
For multiple text files:
sc.textFile("/user/name/*")
val text = sc.textFile("/user/name/shakespeare.txt")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()
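The "Module 3:%20Apache%20Spark" fragment in the error suggests the notebook's working directory name contains a colon, which Hadoop's Path misparses as a URI scheme when it resolves the relative path. An absolute path avoids that resolution; an explicit file:// URI works as well (a sketch with a placeholder username):
// Hypothetical absolute URI; replace <name> with your macOS username and
// point it at wherever shakespeare.txt actually lives.
val text = sc.textFile("file:///Users/<name>/shakespeare.txt")
val result = text.filter(_.contains("Music")).count()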

spark dealing with carbondata

Below is the code snippet I'm trying to use to create a CarbonData table in S3. However, in spite of setting the AWS credentials in the Hadoop configuration, it still complains that the secret key and access key are not set. What is the issue here?
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("s3n://url")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId","<accesskey>")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<secretaccesskey>")
carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string,name string,city string,age Int) STORED BY 'carbondata'")
Last command yields error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Spark Version : 2.2.1
Command used to start spark-shell:
$SPARK_PATH/bin/spark-shell --jars /localpath/jar/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2.jar,/localpath/jar/spark-avro_2.11-4.0.0.jar --packages com.amazonaws:aws-java-sdk-pom:1.9.22,org.apache.hadoop:hadoop-aws:2.7.2,org.slf4j:slf4j-simple:1.7.21,asm:asm:3.2,org.xerial.snappy:snappy-java:1.1.7.1,com.databricks:spark-avro_2.11:4.0.0
UPDATE:
I found that S3 support is only available in 1.4.0 RC1, so I built RC1 and tested the code below against it. I still seem to be running into issues; any help is appreciated.
Code:
import org.apache.spark.sql.CarbonSession._
import org.apache.hadoop.fs.s3a.Constants.{ACCESS_KEY, ENDPOINT, SECRET_KEY}
import org.apache.spark.sql.SparkSession
import org.apache.carbondata.core.constants.CarbonCommonConstants
object sample4 {

  def main(args: Array[String]) {
    val (accessKey, secretKey, endpoint) = getKeyOnPrefix("s3n://")
    //val rootPath = new File(this.getClass.getResource("/").getPath
    //  + "../../../..").getCanonicalPath
    val path = "/localpath/sample/data1.csv"

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("S3UsingSDKExample")
      .config("spark.driver.host", "localhost")
      .config(accessKey, "<accesskey>")
      .config(secretKey, "<secretkey>")
      //.config(endpoint, "s3-us-east-1.amazonaws.com")
      .getOrCreateCarbonSession()

    spark.sql("Drop table if exists carbon_table")

    spark.sql(
      s"""
         | CREATE TABLE if not exists carbon_table(
         | shortField SHORT,
         | intField INT,
         | bigintField LONG,
         | doubleField DOUBLE,
         | stringField STRING,
         | timestampField TIMESTAMP,
         | decimalField DECIMAL(18,2),
         | dateField DATE,
         | charField CHAR(5),
         | floatField FLOAT
         | )
         | STORED BY 'carbondata'
         | LOCATION 's3n://bucketName/table/carbon_table'
         | TBLPROPERTIES('SORT_COLUMNS'='', 'DICTIONARY_INCLUDE'='dateField, charField')
      """.stripMargin)
  }

  def getKeyOnPrefix(path: String): (String, String, String) = {
    val endPoint = "spark.hadoop." + ENDPOINT
    if (path.startsWith(CarbonCommonConstants.S3A_PREFIX)) {
      ("spark.hadoop." + ACCESS_KEY, "spark.hadoop." + SECRET_KEY, endPoint)
    } else if (path.startsWith(CarbonCommonConstants.S3N_PREFIX)) {
      ("spark.hadoop." + CarbonCommonConstants.S3N_ACCESS_KEY,
        "spark.hadoop." + CarbonCommonConstants.S3N_SECRET_KEY, endPoint)
    } else if (path.startsWith(CarbonCommonConstants.S3_PREFIX)) {
      ("spark.hadoop." + CarbonCommonConstants.S3_ACCESS_KEY,
        "spark.hadoop." + CarbonCommonConstants.S3_SECRET_KEY, endPoint)
    } else {
      throw new Exception("Incorrect Store Path")
    }
  }

  def getSparkMaster(args: Array[String]): String = {
    if (args.length == 6) args(5)
    else if (args(3).contains("spark:") || args(3).contains("mesos:")) args(3)
    else "local"
  }
}
Error:
18/05/17 12:23:22 ERROR SegmentStatusManager: main Failed to read metadata of load
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: Empty key
I also tried the sample code at the link below (with the s3, s3n, and s3a protocols as well):
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala
Ran as:
S3Example.main(Array("accesskey","secretKey","s3://bucketName/path/carbon_table","https://bucketName.s3.amazonaws.com","local"))
Error stacktrace:
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error: Empty key
  at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:175)
  at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
  at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
  at com.sun.proxy.$Proxy21.retrieveINode(Unknown Source)
  at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:340)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
  at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.isFileExist(AbstractDFSCarbonFile.java:426)
  at org.apache.carbondata.core.datastore.impl.FileFactory.isFileExist(FileFactory.java:201)
  at org.apache.carbondata.core.statusmanager.SegmentStatusManager.readTableStatusFile(SegmentStatusManager.java:246)
  at org.apache.carbondata.core.statusmanager.SegmentStatusManager.readLoadMetadata(SegmentStatusManager.java:197)
  at org.apache.carbondata.core.cache.dictionary.ManageDictionaryAndBTree.clearBTreeAndDictionaryLRUCache(ManageDictionaryAndBTree.java:101)
  at org.apache.spark.sql.hive.CarbonFileMetastore.dropTable(CarbonFileMetastore.scala:460)
  at org.apache.spark.sql.execution.command.table.CarbonCreateTableCommand.processMetadata(CarbonCreateTableCommand.scala:148)
  at org.apache.spark.sql.execution.command.MetadataCommand.run(package.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
  at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:107)
  at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:96)
  at org.apache.spark.sql.CarbonSession.withProfiler(CarbonSession.scala:144)
  at org.apache.spark.sql.CarbonSession.sql(CarbonSession.scala:94)
  at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$S3Example$.main(<console>:68)
  at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
  at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:36)
  at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:38)
  at $line26.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:40)
  at $line26.$read$$iw$$iw$$iw$$iw.<init>(<console>:42)
  at $line26.$read$$iw$$iw$$iw.<init>(<console>:44)
  at $line26.$read$$iw$$iw.<init>(<console>:46)
  at $line26.$read$$iw.<init>(<console>:48)
  at $line26.$read.<init>(<console>:50)
  at $line26.$read$.<init>(<console>:54)
  at $line26.$read$.<clinit>(<console>)
  at $line26.$eval$.$print$lzycompute(<console>:7)
  at $line26.$eval$.$print(<console>:6)
  at $line26.$eval.$print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
  at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
  at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
  at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
  at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
  at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
  at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
  at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
  at org.apache.spark.repl.Main$.doMain(Main.scala:74)
  at org.apache.spark.repl.Main$.main(Main.scala:54)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.jets3t.service.S3ServiceException: Request Error: Empty key
  at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
  at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:163)
Are any of the arguments that I'm passing wrong?
I'm able to access the S3 path using the AWS CLI:
aws s3 ls s3://bucketName/path
so the path exists in S3.
You can try it using this example: https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala
You have to provide the AWS credential properties to Spark first; only after that do you create the CarbonSession.
If you have already created the SparkContext without the AWS properties, it will not pick them up even if you pass them to the carbon context afterwards.
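A minimal sketch of that ordering, assuming the s3a connector and placeholder credentials (the option keys here are the standard Hadoop s3a properties, not Carbon-specific ones):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// The credentials have to be on the configuration before the session (and its
// SparkContext) is created; setting them afterwards is too late.
val carbon = SparkSession.builder()
  .master("local")
  .appName("carbon-s3-sketch")
  .config("spark.hadoop.fs.s3a.access.key", "<accesskey>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secretkey>")
  .getOrCreateCarbonSession("s3a://bucketName/carbon-store")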
Hi Vikas, looking at your exception, "Empty key" simply means that your access key and secret key are not bound in the carbon session: in the S3 implementation, the logic is that if either key is not provided by the user, its value is taken as empty.
So, to make things easy:
First build the CarbonData jar using this command:
mvn -Pspark-2.1 clean package
Then execute spark-submit with this command:
./spark-submit --jars file:///home/anubhav/Downloads/softwares/spark-2.2.1-bin-hadoop2.7/carbonlib/apache-carbondata-1.4.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar --class org.apache.carbondata.examples.S3Example /home/anubhav/Documents/carbondata/carbondata/carbondata/examples/spark2/target/carbondata-examples-spark2-1.4.0-SNAPSHOT.jar local
Replace my jar paths with yours and it should work; it's working for me.

Error while running spark on standalone cluster

I'm trying to run a simple piece of Spark code on a standalone cluster. Below is the code:
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("even-numbers").setMaster("spark://sumit-Inspiron-N5110:7077")
    sc = SparkContext(conf)
    inp = sc.parallelize([1, 2, 3, 4, 5])
    even = inp.filter(lambda x: (x % 2 == 0)).collect()
    for i in even:
        print(i)
but I'm getting an error stating "Could not parse Master URL":
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Could not parse Master URL: '<pyspark.conf.SparkConf object at 0x7fb27e864850>'
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2760)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/01/07 16:59:47 INFO ShutdownHookManager: Shutdown hook called
18/01/07 16:59:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0d71782f-617f-44b1-9593-b9cd9267757e
I also tried setting the master as 'local', but it didn't work. Can someone help?
And yes, the command to run the job is
./bin/spark-submit even.py
Replace the following line:
sc = SparkContext(conf)
with
sc = SparkContext(conf=conf)
and you should have it solved. SparkContext's first positional parameter is the master URL, so a SparkConf passed positionally gets stringified into the master setting (which is exactly what the "Could not parse Master URL: '<pyspark.conf.SparkConf object ...>'" message shows); it has to be passed as the conf keyword argument.