Unable to write dataframe to Cassandra using PySpark

I am trying to write a dataframe to Cassandra using PySpark, but it's throwing me an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o74.save.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 3.0 failed 4 times, most recent failure: Lost task 6.3
in stage 3.0 (TID 24, ip-172-31-11-193.us-west-2.compute.internal,
executor 1): java.lang.NoClassDefFoundError:
com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:153)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:209)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Below is my code for the write:
DataFrame.write.format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="student1", keyspace="university") \
    .save()
I have added the Spark Cassandra Connector package mentioned below in spark-defaults.conf:
spark.jars.packages datastax:spark-cassandra-connector:2.4.0-s_2.11
I am able to read data from Cassandra, but the issue is with the write.

I am not an expert in Spark, but this might help:
These errors are commonly thrown when the Spark Cassandra Connector or
its dependencies are not on the runtime classpath of the Spark
Application. This is usually caused by not using the prescribed
--packages method of adding the Spark Cassandra Connector and its dependencies to the runtime classpath.
Source:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#why-cant-the-spark-job-find-spark-cassandra-connector-classes-classnotfound-exceptions-for-scc-classes
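For example, a hedged sketch of a submit command using --packages, reusing the connector coordinates from your spark-defaults.conf line; the script name and Cassandra host are placeholders, and --packages makes Spark resolve the connector together with its transitive dependencies (the missing com/twitter/jsr166e/LongAdder class appears to come from one of those dependencies):
spark-submit \
  --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 \
  --conf spark.cassandra.connection.host=<cassandra-host> \
  your_write_job.py
The same --packages flag also works with the pyspark shell for interactive testing.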

Related

PySpark - Insert overwrite - File already exists exception

I am trying to extract data from Redshift and insert it into S3 using a newly created Glue table on that S3 location.
Versions:
PySpark - 2.4.0
EMR - emr-5.21.0
My write looks something like this:
date_filtered_df.coalesce(int(args.numpartitions)) \
.write \
.mode("overwrite") \
.format("parquet") \
.insertInto("{}.{}_stg".format(args.database, args.table))
The table is newly created just before this insert and it points to a completely empty location.
But the code is erroring with:
19/12/09 14:37:48 WARN TaskSetManager: Lost task 576.1 in stage 1.0 (TID 3125, ip-172-31-31-203.ec2.internal, executor 401): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://bucketname/trusted/databasename/dw/2019-12-08/tbalename/.hive-staging_hive_2019-12-09_12-49-15_136_2979557082816709535-1/-ext-10000/sales_dt=20051130/biz_unit_code=CS/geo_code=AMER/part-00576-84bcde9a-4fc0-402b-8aab-8d71e06f8c43.c000
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more
Any help is appreciated.

pyspark-mongodb collection read command won't execute

I have the following versions installed:
spark 2.1.0,
scala 2.11.6,
mongoDB 3.2.17
I tried to start the PySpark shell with the following command:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
After this I started the Spark session as follows:
from pyspark.sql import SparkSession
my_spark = SparkSession.builder.appName("myApp").config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mycollection.dummy").config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mycollection.dummy").getOrCreate()
I wrote to a collection in MongoDB and it executed successfully,
but when I try to read the collection using the command
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://127.0.0.1/mycollection.dummy").load()
it shows the following error:
17/10/13 10:43:33 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2;
at com.mongodb.spark.sql.MongoInferSchema$.com$mongodb$spark$sql$MongoInferSchema$$compatibleType(MongoInferSchema.scala:135)
at com.mongodb.spark.sql.MongoInferSchema$$anonfun$3.apply(MongoInferSchema.scala:78)
at com.mongodb.spark.sql.MongoInferSchema$$anonfun$3.apply(MongoInferSchema.scala:78)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
You seem to have an inconsistency in the dataframe reading line. First of all, you initialize my_spark but then use spark. You also use "uri" instead of "spark.mongodb.input.uri". Try the line below:
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("spark.mongodb.input.uri","mongodb://127.0.0.1/mycollection.dummy").load()
Otherwise provide more code to check as a whole.
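For reference, here is a minimal end-to-end PySpark sketch of that read path, using the session settings and collection from the question; it assumes the mongo-spark-connector package is already on the classpath, and the trailing printSchema/show calls are illustrative only:
from pyspark.sql import SparkSession

# Build the session with the connector's input/output URIs, as in the question.
my_spark = SparkSession.builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mycollection.dummy") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mycollection.dummy") \
    .getOrCreate()

# Read with the spark.mongodb.input.uri option, as suggested above.
df = my_spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", "mongodb://127.0.0.1/mycollection.dummy") \
    .load()

df.printSchema()
df.show(5)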

Spark-Application to Local Directory

PROBLEM
Spark application error due to "Mkdirs failed to create".
I'm using Spark 1.6.3 and I am unable to save output to my local directory:
java.io.IOException: Mkdirs failed to create file:/home/zooms/output/sample1/sample1.txt/_temporary/0/_temporary/attempt_201709251225_0005_m_000000_10
(exists=false, cwd=file:/grid/1/hadoop/yarn/local/usercache/zooms/appcache/application_1504506749061_0086/container_e01_1504506749061_0086_01_000003)
Updated logs
17/09/25 13:39:02 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 10, worker3.hdp.example.com): java.io.IOException: Mkdirs failed to create file:/home/zooms/output/sample1/sample1.txt/_temporary/0/_temporary/attempt_201709251339_0005_m_000000_10 (exists=false, cwd=file:/grid/1/hadoop/yarn/local/usercache/zooms/appcache/application_1504506749061_0099/container_e01_1504506749061_0099_01_000003)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:930)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:823)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1191)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Code:
val output = "file:///home/zooms/output/sample1/sample1.txt"
result.coalesce(1).saveAsTextFile(output)
SOLUTION
Make sure that the whole cluster has access to the local or specific directory.
In my case, the cluster or the Spark executors didn't have access to the specific directory.
Here's the answer to my question:
since I'm running in cluster mode or client mode, the workers won't be able to create the directory on each node unless you define it. Use
spark-submit -v --master local ...
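A hedged sketch of that submit command; the jar path and main class are placeholders, and --master local[*] runs the driver and executors in a single JVM on the submitting machine, so it can create file:///home/zooms/output/... locally:
spark-submit -v \
  --master local[*] \
  --class com.example.SampleApp \
  /path/to/sample-app.jar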
References:
Writing files to local system with Spark in Cluster mode
Why does Spark job fails to write output?

Spark 2.0.2 nested K-means in rdds/nested rdds or dataframes or datasets

I am trying to run lots of k-means computations in parallel. I have rooms and lots of data for each, and I would like to calculate clusters for each room. So I have
roomsSignals[(room: String, signals: List[org.apache.spark.mllib.linalg.Vector])]
roomsSignals.map { l =>
  val data = sc.parallelize(l.signals)
  val clusterCenters = 2
  val model = KMeans.train(data, clusterCenters, 5)
  model.clusterCenters.map { r => r.toJson.toString }.mkString(",")
}.collect.foreach(println)
Which gives me the error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 in stage 18.0 failed 4 times, most recent failure: Lost task 33.3 in stage 18.0 (TID 1284, 192.168.181.122): java.lang.NullPointerException
at $anonfun$1.apply(<console>:77)
at $anonfun$1.apply(<console>:76)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
Unfortunately, it is not possible: nesting of this kind is not supported by Spark at all.
Either train independent distributed models by iterating over roomsSignals.collect, or use a local library of your choice to build the models inside a distributed structure.
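As an illustration of the second suggestion (keep the groups distributed and run a local k-means inside each one), here is a minimal sketch. It is written in PySpark rather than the question's Scala purely for illustration, assumes NumPy and scikit-learn are available on the executors, and all names and sample data are made up:
import numpy as np
from sklearn.cluster import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-room-kmeans").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: one record per room, each holding that room's signal vectors.
rooms_signals = sc.parallelize([
    ("room-a", [[1.0, 2.0], [1.1, 1.9], [8.0, 8.0], [8.2, 7.9]]),
    ("room-b", [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]),
])

def local_centers(signals, k=2):
    # Runs entirely inside a single task, so no nested Spark job is created.
    model = KMeans(n_clusters=k, n_init=10).fit(np.array(signals))
    return model.cluster_centers_.tolist()

for room, centers in rooms_signals.mapValues(local_centers).collect():
    print(room, centers)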

Spark: is there a way to print out classpath of both spark-shell and spark?

I can run a Spark job successfully in the spark-shell, but when it is packaged and run through spark-submit, I'm getting a NoSuchMethodError.
This indicates to me some sort of mismatch of classpaths. Is there a way I can compare the two classpaths? Some sort of logging statement?
Thanks!
15/05/28 12:46:46 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at com.ldamodel.LdaModel$$anonfun$5$$anonfun$apply$5.apply(LdaModel.scala:22)
at com.ldamodel.LdaModel$$anonfun$5$$anonfun$apply$5.apply(LdaModel.scala:22)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at com.ldamodel.LdaModel$$anonfun$5.apply(LdaModel.scala:22)
at com.ldamodel.LdaModel$$anonfun$5.apply(LdaModel.scala:22)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I think this should work:
import java.lang.ClassLoader
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Without modifying the code:
SPARK_PRINT_LAUNCH_COMMAND=true /usr/lib/spark/bin/spark-shell
Also works with spark-submit.
This should do the trick without requiring any code changes:
--conf 'spark.driver.extraJavaOptions=-verbose:class'
--conf 'spark.executor.extraJavaOptions=-verbose:class'
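For example, a hedged spark-submit invocation passing those options (the script name is a placeholder); the classes loaded on the driver and executors, and the jars they came from, then appear in the respective logs:
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-verbose:class' \
  --conf 'spark.executor.extraJavaOptions=-verbose:class' \
  my_job.py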
Alternatively, run /opt/spark/bin/compute-classpath.sh.