Error while running spark on standalone cluster - pyspark

I'm trying to run a simple Spark code on standalone cluster. Below is the code:
from pyspark import SparkConf,SparkContext
if __name__ == "__main__":
conf = SparkConf().setAppName("even-numbers").setMaster("spark://sumit-Inspiron-N5110:7077")
sc = SparkContext(conf)
inp = sc.parallelize([1,2,3,4,5])
even = inp.filter(lambda x: (x % 2 == 0)).collect()
for i in even:
print(i)
but, I'm getting error stating " Could not parse Master URL":
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Could not parse Master URL: '<pyspark.conf.SparkConf object at 0x7fb27e864850>'
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2760)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/01/07 16:59:47 INFO ShutdownHookManager: Shutdown hook called
18/01/07 16:59:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0d71782f-617f-44b1-9593-b9cd9267757e
I also tried setting the master as 'local', but it didn't work. Can someone help?
And yes, the command to run the job is
./bin/spark-submit even.py

Replace your following line
sc = SparkContext(conf)
with
sc = SparkContext(conf=conf)
you should have it solved.

Related

Intermittent org.apache.spark.SparkException: Could not find spark-version-info.properties in K8s

My application uses K8s cronjob to schedule the application run, consequently creating a pod for each occurrence.
In most of the cases, the application runs well, but in some of them, it fails with the following error:
java.lang.ExceptionInInitializerError
2022-12-23T10:45:32.899555393Z at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2022-12-23T10:45:32.899572625Z at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2022-12-23T10:45:32.899576309Z at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2022-12-23T10:45:32.899590250Z at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
2022-12-23T10:45:32.899601166Z at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
2022-12-23T10:45:32.899604720Z at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
2022-12-23T10:45:32.899608169Z at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
2022-12-23T10:45:32.899610934Z at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
2022-12-23T10:45:32.899613557Z at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
2022-12-23T10:45:32.899629628Z at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
2022-12-23T10:45:32.899633296Z at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
2022-12-23T10:45:32.899637988Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661)
[...]
2022-12-23T10:45:32.899959490Z Caused by: java.lang.ExceptionInInitializerError
2022-12-23T10:45:32.899972035Z at org.apache.spark.package$.<init>(package.scala:93)
2022-12-23T10:45:32.899985422Z at org.apache.spark.package$.<clinit>(package.scala)
2022-12-23T10:45:32.899990812Z at org.apache.spark.SparkContext.$anonfun$new$1(SparkContext.scala:183)
2022-12-23T10:45:32.899996770Z at org.apache.spark.internal.Logging.logInfo(Logging.scala:54)
2022-12-23T10:45:32.899998834Z at org.apache.spark.internal.Logging.logInfo$(Logging.scala:53)
2022-12-23T10:45:32.900029297Z at org.apache.spark.SparkContext.logInfo(SparkContext.scala:73)
2022-12-23T10:45:32.900031985Z at org.apache.spark.SparkContext.<init>(SparkContext.scala:183)
2022-12-23T10:45:32.900033999Z at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2526)
2022-12-23T10:45:32.900049879Z at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
2022-12-23T10:45:32.900055477Z at scala.Option.getOrElse(Option.scala:189)
2022-12-23T10:45:32.900061375Z at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
[...]
2022-12-23T10:45:32.900111374Z at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2022-12-23T10:45:32.900134359Z at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
2022-12-23T10:45:32.900137775Z at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
2022-12-23T10:45:32.900145849Z at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
2022-12-23T10:45:32.900159046Z at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
2022-12-23T10:45:32.900178955Z at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
2022-12-23T10:45:32.900183168Z at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
2022-12-23T10:45:32.900206355Z at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
2022-12-23T10:45:32.900214981Z at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
2022-12-23T10:45:32.900227137Z at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
2022-12-23T10:45:32.900377715Z Caused by: org.apache.spark.SparkException: Could not find spark-version-info.properties
2022-12-23T10:45:32.900381131Z at org.apache.spark.package$SparkBuildInfo$.<init>(package.scala:62)
2022-12-23T10:45:32.900396317Z at org.apache.spark.package$SparkBuildInfo$.<clinit>(package.scala)
2022-12-23T10:45:32.900399396Z ... 25 more
How spark sesssion is being build
lazy val spark: SparkSession = SparkSession.builder
.appName("My application")
.master("local") //Error occur here
.getOrCreate()
Spark version: 2.4.8
Scala version: 2.12
Anyone already had a similar problem?

Spark to read Cassandra

Hi I am trying to extract data from Cassandra using AWS Glue and writing PySpark Code. Below is the code and gave me error. Please suggest me how i can import classes/drivers.
I want to extract from Cassandra and create files into S3 Buckets.
#from awsglue.transforms import sys
import sys
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
sparkSession = glueContext.spark_session
#Use the CData JDBC driver to read Cassandra data from the Customer table into a DataFrame ##Note the populated JDBC URL and driver class name
#source_df = sparkSession.read.format("jdbc").option("url","jdbc:cassandra:RTK=5246...;Database=MyCassandraDB;Port=7000;Server=db-datastax02c-dc2.stage.impello.co.uk;")\.option("dbtable","reads_by_received_date").option("driver","cdata.jdbc.cassandra.CassandraDriver").load()*/
#df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).load()
glueJob = Job(glueContext)
glueJob.init(args['JOB_NAME'], args)
testdf = sparkSession.read.format("org.apache.spark.sql.cassandra")\
.option("spark.cassandra.connection.host", "server")\
.options(table="reads_by_received_date",keyspace="keyspace")\
.option("spark.cassandra.auth.username", "username") \
.option("spark.cassandra.auth.password", "username") \
.load()\
#.select(*)\
#.where( "received_year in (2020)")\
#.cache()
##Convert DataFrames to AWS Glue's DynamicFrames Object
dynamic_dframe = DynamicFrame.fromDF(testdf, glueContext, "dynamic_df")
##Write the DynamicFrame as a file in CSV format to a folder in an S3 bucket.
datatransfer = glueContext.write_dynamic_frame.from_options(frame = dynamic_dframe\
, connection_type = "s3"\
, connection_options = {"path": "s3://bucket/"}\
, format = "csv"\
, transformation_ctx = "datasink4"
)
glueJob.commit()
Error:
Aug 28, 2020, 4:43:27 PM Pending execution
Traceback (most recent call last): File "/tmp/CassandraToS3", line 27, in <module> .option("spark.cassandra.auth.password", "password") \ File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load return self._df(self._jreader.load()) File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o75.load. : java.io.IOException: Failed to open native connection to Cassandra at {} :: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)] at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:181) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169) at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32) at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69) at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57) at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:89) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111) at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$.forSystemLocalPartitioner(TokenFactory.scala:98) at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:680) at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)] at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141) at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149) at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:633) at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:144) at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:175) ... 25 more Suppressed: com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.fail(ProtocolInitHandler.java:342) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.writeListener(ChannelHandlerRequest.java:87) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:183) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:76) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.send(ProtocolInitHandler.java:183) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler.onRealConnect(ProtocolInitHandler.java:118) at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.lambda$connect$0(ConnectInitHandler.java:57) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more Suppressed: com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9042 Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) at com.datastax.oss.driver.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) Caused by: java.nio.channels.ClosedChannelException at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:921) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:897) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:740) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:726) at com.datastax.oss.driver.shaded.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:763) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:788) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:806) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:75) ... 20 more
AWS Glue does not provide native library support to Cassandra. You need to get Cassandra connector and follow the steps mentioned in ETL jobs against non-native JDBC data sources.
Once you have the jar downloaded from here then you can pass to your job and use it in your pyspark script.

pyspark on emr with boto3, copy of s3 object result with Futures timed out after [100000 milliseconds]

I have a pyspark application that will transform csv to parquet and before this happen I'm copying some S3 object from a bucket to another.
pyspark with spark 2.4, emr 5.27, maximizeResourceAllocation set to true
I have various csv files size, from 80kb to 500mb.
Nonetheless, my EMR cluster (it doesn't fail on local with spark-submit) fails at 70% completion on a file that is 166mb (a previous at 480mb succeeded).
The job is simple:
def organise_adwords_csv():
s3 = boto3.resource('s3')
bucket = s3.Bucket(S3_ORIGIN_RAW_BUCKET)
for obj in bucket.objects.filter(Prefix=S3_ORIGIN_ADWORDS_RAW + "/"):
key = obj.key
copy_source = {
'Bucket': S3_ORIGIN_RAW_BUCKET,
'Key': key
}
key_tab = obj.key.split("/")
if len(key_tab) < 5:
print("continuing from length", obj)
continue
file_name = ''.join(key_tab[len(key_tab)-1:len(key_tab)])
if file_name == '':
print("continuing", obj)
continue
table = file_name.split("_")[1].replace("-", "_")
new_path = "{0}/{1}/{2}".format(S3_DESTINATION_ORDERED_ADWORDS_RAW_PATH, table, file_name)
print("new_path", new_path) <- the last print will end here
try:
s3.meta.client.copy(copy_source, S3_DESTINATION_RAW_BUCKET, new_path)
print("copy done")
except Exception as e:
print(e)
print("an exception occured while copying")
if __name__=='__main__':
organise_adwords_csv()
print("copy Final done") <- never printed
spark = SparkSession.builder.appName("adwords_transform") \
...
but, in the stdout, no errors / exception are showing.
In stderr logs:
19/10/09 16:16:57 INFO ApplicationMaster: Waiting for spark context initialization...
19/10/09 16:18:37 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/10/09 16:18:37 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
)
19/10/09 16:18:37 INFO ShutdownHookManager: Shutdown hook called
I'm completely blind, I don't understand what is failing / why.
How can I figure that out? On local it works like a charm (but super slow of course)
Edit:
After many tries I can confirm that the function:
s3.meta.client.copy(copy_source, S3_DESTINATION_RAW_BUCKET, new_path)
make the EMR cluster timeout, even tho it processed 80% of the files already.
Does anyone have a recommendation about this?
s3.meta.client.copy(copy_source, S3_DESTINATION_RAW_BUCKET, new_path)
This will fail for any source object larger than 5 GB. please use multipart upload in AWS. See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload

spark dealing with carbondata

Below is the code snippet I'm trying to use to create a carbondata table in S3. However, inspite of setting the aws credentials in hadoopconfiguration, it still complains about secret key and access key not being set. What is the issue here?
import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("s3n://url")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId","<accesskey>")
carbon.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<secretaccesskey>")
carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string,name string,city string,age Int) STORED BY 'carbondata'")
Last command yields error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively)
Spark Version : 2.2.1
Command used to start spark-shell:
$SPARK_PATH/bin/spark-shell --jars /localpath/jar/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2/apache-carbondata-1.3.1-bin-spark2.2.1-hadoop2.7.2.jar,/localpath/jar/spark-avro_2.11-4.0.0.jar --packages com.amazonaws:aws-java-sdk-pom:1.9.22,org.apache.hadoop:hadoop-aws:2.7.2,org.slf4j:slf4j-simple:1.7.21,asm:asm:3.2,org.xerial.snappy:snappy-java:1.1.7.1,com.databricks:spark-avro_2.11:4.0.0
UPDATE:
Found that S3 support is only available in 1.4.0 RC1. So I built RC1 and tested the below code against the same. But still I seem to be running into issues. Any help appreciated.
Code:
import org.apache.spark.sql.CarbonSession._
import org.apache.hadoop.fs.s3a.Constants.{ACCESS_KEY, ENDPOINT, SECRET_KEY}
import org.apache.spark.sql.SparkSession
import org.apache.carbondata.core.constants.CarbonCommonConstants
object sample4 {
def main(args: Array[String]) {
val (accessKey, secretKey, endpoint) = getKeyOnPrefix("s3n://")
//val rootPath = new File(this.getClass.getResource("/").getPath
// + "../../../..").getCanonicalPath
val path = "/localpath/sample/data1.csv"
val spark = SparkSession
.builder()
.master("local")
.appName("S3UsingSDKExample")
.config("spark.driver.host", "localhost")
.config(accessKey, "<accesskey>")
.config(secretKey, "<secretkey>")
//.config(endpoint, "s3-us-east-1.amazonaws.com")
.getOrCreateCarbonSession()
spark.sql("Drop table if exists carbon_table")
spark.sql(
s"""
| CREATE TABLE if not exists carbon_table(
| shortField SHORT,
| intField INT,
| bigintField LONG,
| doubleField DOUBLE,
| stringField STRING,
| timestampField TIMESTAMP,
| decimalField DECIMAL(18,2),
| dateField DATE,
| charField CHAR(5),
| floatField FLOAT
| )
| STORED BY 'carbondata'
| LOCATION 's3n://bucketName/table/carbon_table'
| TBLPROPERTIES('SORT_COLUMNS'='', 'DICTIONARY_INCLUDE'='dateField, charField')
""".stripMargin)
}
def getKeyOnPrefix(path: String): (String, String, String) = {
val endPoint = "spark.hadoop." + ENDPOINT
if (path.startsWith(CarbonCommonConstants.S3A_PREFIX)) {
("spark.hadoop." + ACCESS_KEY, "spark.hadoop." + SECRET_KEY, endPoint)
} else if (path.startsWith(CarbonCommonConstants.S3N_PREFIX)) {
("spark.hadoop." + CarbonCommonConstants.S3N_ACCESS_KEY,
"spark.hadoop." + CarbonCommonConstants.S3N_SECRET_KEY, endPoint)
} else if (path.startsWith(CarbonCommonConstants.S3_PREFIX)) {
("spark.hadoop." + CarbonCommonConstants.S3_ACCESS_KEY,
"spark.hadoop." + CarbonCommonConstants.S3_SECRET_KEY, endPoint)
} else {
throw new Exception("Incorrect Store Path")
}
}
def getSparkMaster(args: Array[String]): String = {
if (args.length == 6) args(5)
else if (args(3).contains("spark:") || args(3).contains("mesos:")) args(3)
else "local"
}
}
Error:
18/05/17 12:23:22 ERROR SegmentStatusManager: main Failed to read metadata of load
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: Empty key
I also tried against the sample code in (tried s3,s3n,s3a protocols as well):
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala
Ran as:
S3Example.main(Array("accesskey","secretKey","s3://bucketName/path/carbon_table","https://bucketName.s3.amazonaws.com","local"))
Error stacktrace:
org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: Request Error: Empty key at
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:175)
at
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy21.retrieveINode(Unknown Source) at
org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:340)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) at
org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.isFileExist(AbstractDFSCarbonFile.java:426)
at
org.apache.carbondata.core.datastore.impl.FileFactory.isFileExist(FileFactory.java:201)
at
org.apache.carbondata.core.statusmanager.SegmentStatusManager.readTableStatusFile(SegmentStatusManager.java:246)
at
org.apache.carbondata.core.statusmanager.SegmentStatusManager.readLoadMetadata(SegmentStatusManager.java:197)
at
org.apache.carbondata.core.cache.dictionary.ManageDictionaryAndBTree.clearBTreeAndDictionaryLRUCache(ManageDictionaryAndBTree.java:101)
at
org.apache.spark.sql.hive.CarbonFileMetastore.dropTable(CarbonFileMetastore.scala:460)
at
org.apache.spark.sql.execution.command.table.CarbonCreateTableCommand.processMetadata(CarbonCreateTableCommand.scala:148)
at
org.apache.spark.sql.execution.command.MetadataCommand.run(package.scala:68)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.(Dataset.scala:183) at
org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:107)
at
org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:96)
at
org.apache.spark.sql.CarbonSession.withProfiler(CarbonSession.scala:144)
at org.apache.spark.sql.CarbonSession.sql(CarbonSession.scala:94) at
$line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$S3Example$.main(:68) at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:31)
at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:36) at
$line26.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:38) at
$line26.$read$$iw$$iw$$iw$$iw$$iw.(:40) at
$line26.$read$$iw$$iw$$iw$$iw.(:42) at
$line26.$read$$iw$$iw$$iw.(:44) at
$line26.$read$$iw$$iw.(:46) at
$line26.$read$$iw.(:48) at
$line26.$read.(:50) at
$line26.$read$.(:54) at
$line26.$read$.() at
$line26.$eval$.$print$lzycompute(:7) at
$line26.$eval$.$print(:6) at $line26.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) at
scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) at
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) at
scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) at
scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415) at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at
scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909) at
org.apache.spark.repl.Main$.doMain(Main.scala:74) at
org.apache.spark.repl.Main$.main(Main.scala:54) at
org.apache.spark.repl.Main.main(Main.scala) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused
by: org.jets3t.service.S3ServiceException: Request Error: Empty key
at org.jets3t.service.S3Service.getObject(S3Service.java:1470) at
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:163)
Is any of the arguments that I'm passing wrong.
I'm able to access the s3 path using aws cli:
aws s3 ls s3://bucketName/path
exists in S3.
You can try it using this example https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/S3Example.scala
You have to provide aws credentials properties to spark first after that you will be creating carbonSession.
If you have already created sparkContext without aws properties being provided. Then it do not pick up those properties even after you give it to carbonContext.
hi vikas looking at your exception empty key simply means that your acesss key and secret key is not binded in carbon session because when we give the s3 implementation we write the logic that if any of key is not provide by user then it then their value should be taken as empty
so to make things easy
first build the carbon data jar using this command
mvn -Pspark-2.1 clean package
then execute spark submit with this command
./spark-submit --jars file:///home/anubhav/Downloads/softwares/spark-2.2.1-bin-hadoop2.7/carbonlib/apache-carbondata-1.4.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar --class org.apache.carbondata.examples.S3Example /home/anubhav/Documents/carbondata/carbondata/carbondata/examples/spark2/target/carbondata-examples-spark2-1.4.0-SNAPSHOT.jar local
replace my jar path with yours and see it should work,its working for me

How to connect (Py)Spark to Postgres database using JDBC

I have followed instructions from this posting to read data from an existing Postgres database with table named "objects" as defined and created by the Objects class in SQLalchemy. In my Jupyter notebook, my code is
from pyspark import SparkContext
from pyspark import SparkConf
from random import random
#spark conf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
properties = {
"driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://PG_USER:PASSWORD#PG_SERVER_IP/db_name'
df = sqlContext.read.jdbc(url=url, table='objects', properties=properties)
the last line results in the following:
Py4JJavaError: An error occurred while calling o25.jdbc.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:158)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
so it looks like it can't resolve the table. How do I test from here to make sure that I am connected to the database properly?
Problems with name resolving are indicated by org.postgresql.util.PSQLException and don't result in NPE. The source of the issue is actually a connection string and in particular the way you provide user credentials. At first glance it looks like a bug but if you're looking for a quick solution you can either use URL properties:
url = 'jdbc:postgresql://PG_SERVER_IP/db_name?user=PG_USER&password=PASSWORD'
or properties argument:
properties = {
"user": "PG_USER",
"password": "PASSWORD",
"driver": "org.postgresql.Driver"
}