How to persist a scala list to mongodb using spark - mongodb

So I have Spark code that fetches some documents from MongoDB, does some transformations, and tries to store them back to MongoDB.
The problem happens when I try to persist a List object using the following functions:
First I generate some tuples using this function:
val usersRDD = rdd.flatMap( breakoutFileById ).distinct().groupByKey().mapValues(_.toList)
Then I convert the tuple fields to Documents using a custom mapToDocument function and call the saveToMongoDB function:
usersRDD.map( mapToDocument ).saveToMongoDB()
I'm getting the following error message:
org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class scala.collection.immutable.$colon$colon.
at org.bson.codecs.configuration.CodecCache.getOrThrow(CodecCache.java:46)
at org.bson.codecs.configuration.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:63)
at org.bson.codecs.configuration.ChildCodecRegistry.get(ChildCodecRegistry.java:51)
at org.bson.codecs.DocumentCodec.writeValue(DocumentCodec.java:174)
at org.bson.codecs.DocumentCodec.writeMap(DocumentCodec.java:189)
at org.bson.codecs.DocumentCodec.encode(DocumentCodec.java:131)
at org.bson.codecs.DocumentCodec.encode(DocumentCodec.java:45)
at org.bson.codecs.BsonDocumentWrapperCodec.encode(BsonDocumentWrapperCodec.java:63)
at org.bson.codecs.BsonDocumentWrapperCodec.encode(BsonDocumentWrapperCodec.java:29)
at com.mongodb.connection.InsertCommandMessage.writeTheWrites(InsertCommandMessage.java:101)
at com.mongodb.connection.InsertCommandMessage.writeTheWrites(InsertCommandMessage.java:43)
at com.mongodb.connection.BaseWriteCommandMessage.encodeMessageBodyWithMetadata(BaseWriteCommandMessage.java:129)
at com.mongodb.connection.RequestMessage.encodeWithMetadata(RequestMessage.java:160)
at com.mongodb.connection.WriteCommandProtocol.sendMessage(WriteCommandProtocol.java:212)
at com.mongodb.connection.WriteCommandProtocol.execute(WriteCommandProtocol.java:101)
at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:67)
at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:37)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.insertCommand(DefaultServerConnection.java:115)
at com.mongodb.operation.MixedBulkWriteOperation$Run$2.executeWriteCommandProtocol(MixedBulkWriteOperation.java:455)
at com.mongodb.operation.MixedBulkWriteOperation$Run$RunExecutor.execute(MixedBulkWriteOperation.java:646)
at com.mongodb.operation.MixedBulkWriteOperation$Run.execute(MixedBulkWriteOperation.java:401)
at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:179)
at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:168)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:230)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:221)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
at com.mongodb.Mongo.execute(Mongo.java:781)
at com.mongodb.Mongo$2.execute(Mongo.java:764)
at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:323)
at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:311)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1$$anonfun$apply$2.apply(MongoSpark.scala:132)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1$$anonfun$apply$2.apply(MongoSpark.scala:132)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1.apply(MongoSpark.scala:132)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1.apply(MongoSpark.scala:131)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:186)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:184)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:154)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:184)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:131)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:130)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I remove the list (i.e., do not include it as a field in the document) in the mapToDocument function, everything works. I have already searched the internet for similar problems and couldn't find any solution that fits.
Does anyone have a clue how to solve this?
Thanks in advance

From the unsupported types section in the documentation:
Some Scala types (e.g. Lists) are unsupported and should be converted
to their Java equivalent. To convert from Scala into native types
include the following import statement to use the .asJava method.
import scala.collection.JavaConverters._
import org.bson.Document
val documents = sc.parallelize(
  Seq(new Document("fruits", List("apples", "oranges", "pears").asJava))
)
MongoSpark.save(documents)
The reason they are unsupported is that the Mongo Spark Connector uses the Mongo Java Driver underneath, as there's no point in using the Scala async driver in this context. However, it does mean that for RDDs you have to map to supported Java types. When using Datasets these conversions are done for you automatically.
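Applied to the code in the question, here is a minimal sketch of what the mapToDocument conversion could look like. The tuple shape and field names are assumptions, since the original mapToDocument isn't shown:
import scala.collection.JavaConverters._
import org.bson.Document

// Hypothetical element shape: (userId, list of values), as produced by groupByKey().mapValues(_.toList)
def mapToDocument(entry: (String, List[String])): Document = {
  val (userId, values) = entry
  new Document("userId", userId)
    .append("values", values.asJava) // asJava turns the Scala List into a java.util.List the Java driver can encode
}
The existing usersRDD.map(mapToDocument).saveToMongoDB() call can then stay as it is; only the list-valued field needs the .asJava conversion.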

Related

Schema capitalization (uppercase) problem when reading with Spark

Using Scala here:
val df = spark.read.format("jdbc").
  option("url", "<host url>").
  option("dbtable", "UPPERCASE_SCHEMA.table_name").
  option("user", "postgres").
  option("password", "<password>").
  option("numPartitions", 50).
  option("fetchsize", 20).
  load()
The database I'm calling with the code above has many schemas, and they are all in uppercase letters (UPPERCASE_SCHEMA).
No matter how I try to denote that the schema is in all caps, Spark converts it to lowercase, which then fails against the actual DB.
I've tried making it a variable, explicitly denoting that it is all uppercase, etc., in multiple languages, but no luck.
Would anyone know a workaround?
When I went into the actual DB (Postgres) and temporarily changed the schema to all lowercase, it worked absolutely fine.
Try setting spark.sql.caseSensitive to true (it is false by default):
spark.conf.set("spark.sql.caseSensitive", true)
You can see its definition in the source code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833
In addition, you can see in the JDBCWriteSuite how it affects the JDBC connector:
https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
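As a rough sketch of how this fits together with the read in the question (the URL, credentials, and table name are still the question's placeholders, not working values), the setting goes before the JDBC read:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", true) // keep UPPERCASE_SCHEMA from being lowercased

val df = spark.read.format("jdbc")
  .option("url", "<host url>")
  .option("dbtable", "UPPERCASE_SCHEMA.table_name")
  .option("user", "postgres")
  .option("password", "<password>")
  .load()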

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
.option("header", "true")
.option("delimiter", ",")
.csv("sample.csv")
test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
+----+-----------+
|Type|Provider Id|
+----+-----------+
|   A|        asd|
|   A|        bsd|
|   A|        csd|
|   B|        rrr|
+----+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered if it was because the column name has a space, so I tried using backticks and removing or replacing the space, but nothing seemed to work. It also runs fine with Spark 3.x libraries, but fails with Spark 2.1.x (and I need to use 2.1.x).
Additionally, I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error was the opposite: Provider Id shows, but Type now throws the exception.
Any suggestions?
test.printSchema()
You can use the output of printSchema() to see exactly how Spark read your column in, then use that name in your code.
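A minimal sketch of that approach, building on the test DataFrame from the question (the substring lookup is only an illustration, assuming the real header carries hidden characters around 'Provider Id'):
test.printSchema() // reveals the header exactly as Spark parsed it (BOM, stray spaces, etc.)

// Look the column up by substring and reuse the name exactly as Spark stored it
val providerCol = test.columns.find(_.contains("Provider")).get
test.select(providerCol).show()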

How to output nested Row from Beam SQL (SqlTransform)?

I want to output a Row with a nested Row from Beam SQL (SqlTransform), but I'm failing to get it to work.
Questions:
What is the proper way to output a Row with a nested Row from SqlTransform? (The Row type is described in the docs, so I believe it's supported.)
If this is a bug or missing feature, is the problem in Beam itself, or is it runner-dependent? (I'm currently using DirectRunner, but will use DataflowRunner in the future.)
Version info:
OS: macOS 10.15.7 (Catalina)
Java: 11.0.11 (AdoptOpenJDK)
Beam SDK: 2.32.0
Here's what I've tried, with no luck.
With Calcite dialect
SELECT ROW(foo, bar) as my_nested_row FROM PCOLLECTION
I was expecting this to output a row with the following schema:
Field{name=my_nested_row, description=, type=ROW<foo STRING NOT NULL, bar INT64 NOT NULL> NOT NULL, options={{}}}
but the row is actually flattened into scalar fields like:
Field{name=my_nested_row$$0, description=, type=STRING NOT NULL, options={{}}}
Field{name=my_nested_row$$1, description=, type=INT64 NOT NULL, options={{}}}
With ZetaSQL dialect
SELECT STRUCT(foo, bar) as my_nested_row FROM PCOLLECTION
I got an error:
java.lang.UnsupportedOperationException: Does not support expr node kind RESOLVED_MAKE_STRUCT
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:363)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:323)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromComputedColumnWithFieldList (ExpressionConverter.java:375)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.retrieveRexNode (ExpressionConverter.java:203)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:45)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:29)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertNode (QueryStatementConverter.java:102)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convert (QueryStatementConverter.java:89)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertRootQuery (QueryStatementConverter.java:55)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel (ZetaSQLPlannerImpl.java:98)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal (ZetaSQLQueryPlanner.java:197)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel (ZetaSQLQueryPlanner.java:185)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery (BeamSqlEnv.java:111)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:482)
at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:363)
at dev.tmshn.playbeam.Main.main (Main.java:29)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:829)
Unfortunately Beam SQL does not yet support nested rows, mainly due to a lack of support in Calcite (and therefore a corresponding lack of support for the ZetaSQL implementation). See this similar question focused on Dataflow.
On the bright side, the Jira issue tracking this support seems to be resolved for 2.34.0, so proper support is likely upcoming.

Spark Streaming, reading from Socket: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String

I am on Windows 10, trying to read multiple text lines, separated by '\n', from a TCP socket source (for test purposes so far) using Spark Streaming (Spark 2.4.4). Words should be counted and the current word count regularly displayed on the console. This is a standard Spark Streaming test, found in several books and web posts, but it seems to fail with the socket source:
Text strings are sent from a Java program like:
serverOutSock = new ServerSocket(9999);
// Establish connection; wait for Spark to connect
sockOut = serverOutSock.accept();
// Set UTF-8 as format
sockOutput = new OutputStreamWriter(sockOut.getOutputStream(),"UTF-8");
// Multiple Java Strings are now written (thousands of them) like
sockOutput.write(string+'\n');
On the Spark receiving side, the Scala code looks like:
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
import org.apache.spark.sql.streaming.Trigger
val socketDF = spark.readStream.format("socket").option("host","localhost").option("port",9999).load
val words = socketDF.as[String].flatMap(_.split(" ")).coalesce(1)
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
.trigger(Trigger.Continuous("1 second"))
.outputMode("complete")
.format("console")
.start
.awaitTermination
So, I would like to get a once-per-second write-out of the current word count on the console.
But I get an error:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
and nothing seems to be processed by Spark from the source (due to a cast exception on the source input?). At least nothing is written to the console. What could be the reason for this?
Full stack trace follows:
Exception in thread "null-4" java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.streaming.continuous.shuffle.RPCContinuousShuffleWriter.write(RPCContinuousShuffleWriter.scala:51)
at org.apache.spark.sql.execution.streaming.continuous.ContinuousCoalesceRDD$$anonfun$4$$anon$1.run(ContinuousCoalesceRDD.scala:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have tried removing coalesce(1) and replacing the Continuous trigger with a ProcessingTime trigger. This makes the error go away, but the console printout becomes:
Batch: 0
+-----+-----+
|value|count|
+-----+-----+
+-----+-----+
That is, no output, even though many words are indeed injected into the socket. Also, this output is shown only once, and much later than after 1 second.

Apache Spark streaming mapping object and printing attribute

I'm reading from a text file, parsing each line to JSON, and attempting to print one of the attributes:
val msgData = ssc.textFileStream(dataDir)
val msgs = msgData.map(MessageParser.parse)
msgs.foreach(msg => println(msg.my_attribute))
However, I get the following error on compilation:
value my_attribute is not a member of org.apache.spark.rdd.RDD[com.imgzine.analytics.messages.Message]
What am I missing?
Thanks
Spark Streaming discretizes a stream of data by creating micro-batch containers. These are called 'DStreams' and contain a collection of RDDs.
Translated to your case, you need to operate on the content of the RDD, not the DStream:
msgs.foreach(rdd => rdd.foreach(elem => println(elem.my_attribute)))
DStreams offer a helper method to print the first elements (10 by default) of each RDD:
dstream.print()
Of course, that will just invoke .toString on the objects contained in the RDD and print the result, which may not be what you want for my_attribute.
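Putting it together, here is a minimal self-contained sketch using foreachRDD (the standard DStream method for this); the Message case class and MessageParser below are stand-ins for the ones in the question:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stand-ins for com.imgzine.analytics.messages.Message and MessageParser
case class Message(my_attribute: String)
object MessageParser { def parse(line: String): Message = Message(line) }

val ssc = new StreamingContext(new SparkConf().setAppName("demo").setMaster("local[*]"), Seconds(10))
val msgs = ssc.textFileStream("/tmp/dataDir").map(MessageParser.parse)

// Operate on each RDD inside the DStream, then on each element inside the RDD
msgs.foreachRDD(rdd => rdd.foreach(msg => println(msg.my_attribute)))

ssc.start()
ssc.awaitTermination()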