Spark Read multiple paths with automatic partition discovery - scala

I'm trying to read some Avro files into a DataFrame from multiple paths.
Let's say my path is "s3a://bucket_name/path/to/file/year=18/month=11/day=01".
Under this path there are two more partition levels, say country=XX/region=XX.
I want to read multiple dates at once without explicitly naming the country and region partitions. In addition, I want country and region to appear as columns in the DataFrame.
sqlContext.read.format("com.databricks.spark.avro").load("s3a://bucket_name/path/to/file/year=18/month=11/day=01")
This line works perfectly well when I read only one path: it detects the country and region partitions and infers their schema.
When I try to read multiple dates, say
val paths = Seq("s3a://bucket_name/path/to/file/year=18/month=11/day=01", "s3a://bucket_name/path/to/file/year=18/month=11/day=02")
sqlContext.read.format("com.databricks.spark.avro").load(paths:_*)
I get this error:
18/12/03 03:13:53 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/12/03 03:13:53 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3a://bucket_name/path/to/file/year=18/month=11/day=02
s3a://bucket_name/path/to/file/year=18/month=11/day=01
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:106)
at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:621)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:526)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:525)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:524)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.sources.HadoopFsRelation.partitionColumns(interfaces.scala:578)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:637)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:39)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
at $iwC$$iwC$$iwC.<init>(<console>:38)
at $iwC$$iwC.<init>(<console>:40)
at $iwC.<init>(<console>:42)
at <init>(<console>:44)
at .<init>(<console>:48)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1064)
at org.apache.spark.repl.Main$.main(Main.scala:35)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Obviously I can't use basePath because the paths don't share one. I also tried appending /* to the end of each path; that actually works, but it completely ignores the country and region partitions.
I can read the paths one by one and union them, but I feel like I'm missing something.
Any idea why it works only for a single path, and how to make it work for multiple paths?

I really wish all error messages were as clear as this one: If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
Is the relative path year=18/month=11/day=01 due to partitioning, or did you just use the same naming convention?
If the former is correct, then you should just read s3a://bucket_name/path/to/file/ and use predicates to filter the desired dates. Or, as suggested by the error, you could try sqlContext.read.option("basePath","s3a://bucket_name/path/to/file/").format("com.databricks.spark.avro").load(paths:_*), where basePath is the common root above the partition directories.
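For reference, that basePath route might look roughly like this (a sketch, assuming Spark 1.x with the spark-avro package and that the listed directories really are Hive-style partitions under one common root):
val paths = Seq(
  "s3a://bucket_name/path/to/file/year=18/month=11/day=01",
  "s3a://bucket_name/path/to/file/year=18/month=11/day=02"
)
// With basePath pointing at the table root, partition discovery should treat
// year/month/day (and the nested country/region) as partition columns.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .option("basePath", "s3a://bucket_name/path/to/file/")
  .load(paths: _*)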
If the latter is true, then you should query each path separately and apply unionAll on the DataFrames (as the error message suggests). Perhaps treating year/month/day as partition columns would work as well in this case, even though you didn't use partitionBy when you wrote the data...
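The load-separately-and-union route from the error message would be roughly this (same assumptions as above, Spark 1.x DataFrame API):
val dfs = paths.map(p => sqlContext.read.format("com.databricks.spark.avro").load(p))
val combined = dfs.reduce(_ unionAll _)
Note that year/month/day still won't show up as columns this way; the input_file_name/regexp_extract approach below is one way to add them.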

Old question, but this is what I ended up doing in a similar situation
spark.read.parquet(paths: _*)
  .withColumn("year", regexp_extract(input_file_name, "year=(.+?)/", 1))
  .withColumn("month", regexp_extract(input_file_name, "month=(.+?)/", 1))
  .withColumn("day", regexp_extract(input_file_name, "day=(.+?)/", 1))
This works when you have a static partition structure. Who's up for the challenge of extending it to a dynamic one (i.e. parsing out an arbitrary partition structure of the form 'x=y/z=c' and converting it into columns)?
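One possible direction for the dynamic case, sketched under the assumption that every file path carries the same key=value segments (untested):
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
import spark.implicits._
val df = spark.read.parquet(paths: _*)
// grab one file path and pull out the partition keys it contains
val samplePath = df.select(input_file_name()).as[String].first()
val keys = "([^/=]+)=([^/]+)".r.findAllMatchIn(samplePath).map(_.group(1)).toList
// add one column per discovered key
val withPartitionCols = keys.foldLeft(df) { (acc, key) =>
  acc.withColumn(key, regexp_extract(input_file_name(), s"$key=(.+?)/", 1))
}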

Related

How to output nested Row from Beam SQL (SqlTransform)?

I want to output a Row with a nested Row from Beam SQL (SqlTransform), but I'm failing to do so.
Questions:
What is the proper way to output a Row with a nested Row from SqlTransform? (The Row type is described in the docs, so I believe it's supported.)
If this is a bug or missing feature, is the problem in Beam itself, or is it runner-dependent? (I'm currently using DirectRunner, but plan to use DataflowRunner in the future.)
Version info:
OS: macOS 10.15.7 (Catalina)
Java: 11.0.11 (AdoptOpenJDK)
Beam SDK: 2.32.0
Here's what I've tried, with no luck.
With Calcite dialect
SELECT ROW(foo, bar) as my_nested_row FROM PCOLLECTION
I was expecting this to output a row with the following schema:
Field{name=my_nested_row, description=, type=ROW<foo STRING NOT NULL, bar INT64 NOT NULL> NOT NULL, options={{}}}
but the row is actually split into scalar fields like:
Field{name=my_nested_row$$0, description=, type=STRING NOT NULL, options={{}}}
Field{name=my_nested_row$$1, description=, type=INT64 NOT NULL, options={{}}}
With ZetaSQL dialect
SELECT STRUCT(foo, bar) as my_nested_row FROM PCOLLECTION
I got an error
java.lang.UnsupportedOperationException: Does not support expr node kind RESOLVED_MAKE_STRUCT
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:363)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromResolvedExpr (ExpressionConverter.java:323)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.convertRexNodeFromComputedColumnWithFieldList (ExpressionConverter.java:375)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ExpressionConverter.retrieveRexNode (ExpressionConverter.java:203)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:45)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.ProjectScanConverter.convert (ProjectScanConverter.java:29)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertNode (QueryStatementConverter.java:102)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convert (QueryStatementConverter.java:89)
at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertRootQuery (QueryStatementConverter.java:55)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel (ZetaSQLPlannerImpl.java:98)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal (ZetaSQLQueryPlanner.java:197)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel (ZetaSQLQueryPlanner.java:185)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery (BeamSqlEnv.java:111)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand (SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:482)
at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:363)
at dev.tmshn.playbeam.Main.main (Main.java:29)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:829)
Unfortunately, Beam SQL does not yet support nested rows, mainly due to a lack of support in Calcite (and a corresponding lack of support in the ZetaSQL implementation). See this similar question focused on Dataflow.
On the bright side, the Jira issue tracking this support appears to be resolved for 2.34.0, so proper support is likely coming soon.

Spark Streaming, reading from Socket: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String

I am on Windows 10, trying to read multiple text lines separated by '\n' from a TCP socket source (for test purposes so far) using Spark Streaming (Spark 2.4.4). Words should be counted and the current word count regularly displayed on the console. This is a standard Spark streaming test, found in several books and web posts, but it seems to fail with the socket source.
Text strings are sent from a Java program like:
serverOutSock = new ServerSocket(9999);
// Establish connection; wait for Spark to connect
sockOut = serverOutSock.accept();
// Set UTF-8 as format
sockOutput = new OutputStreamWriter(sockOut.getOutputStream(),"UTF-8");
// Multiple Java Strings are now written (thousands of them) like
sockOutput.write(string+'\n');
On the Spark receiving side, the Scala code looks like:
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val socketDF = spark.readStream.format("socket").option("host","localhost").option("port",9999).load
val words = socketDF.as[String].flatMap(_.split(" ")).coalesce(1)
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
  .trigger(Trigger.Continuous("1 second"))
  .outputMode("complete")
  .format("console")
  .start
  .awaitTermination
So I would like the current word count written to the console once a second.
But I get an error:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
and nothing seems to be processed by Spark from the source (due to the cast exception on the source input?). At least nothing is written to the console. What can be the reason for this?
Full stack trace follows:
Exception in thread "null-4" java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.streaming.continuous.shuffle.RPCContinuousShuffleWriter.write(RPCContinuousShuffleWriter.scala:51)
at org.apache.spark.sql.execution.streaming.continuous.ContinuousCoalesceRDD$$anonfun$4$$anon$1.run(ContinuousCoalesceRDD.scala:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have tried removing coalesce(1) and replacing the Continuous trigger with a ProcessingTime trigger. This makes the error go away, but the console printout becomes:
Batch: 0
+-----+-----+
|value|count|
+-----+-----+
+-----+-----+
That is, there is no output, even though many words are indeed injected into the socket. Also, this output is shown only once, and much later than after 1 second.
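For reference, the ProcessingTime variant described above would look roughly like this (a sketch; Trigger comes from org.apache.spark.sql.streaming):
import org.apache.spark.sql.streaming.Trigger
val query = wordCounts.writeStream
  .trigger(Trigger.ProcessingTime("1 second"))
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()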

How to persist a scala list to mongodb using spark

So I have some Spark code that fetches documents from MongoDB, does some transformation, and tries to store them back to MongoDB.
The problem happens when I try to persist a List object using the following functions:
First I generate some tuples using this function:
val usersRDD = rdd.flatMap( breakoutFileById ).distinct().groupByKey().mapValues(_.toList)
Then I convert the tuple fields to Documents using a custom mapToDocument function and call saveToMongoDB:
usersRDD.map( mapToDocument ).saveToMongoDB()
I'm getting the following error message:
org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class scala.collection.immutable.$colon$colon.
at org.bson.codecs.configuration.CodecCache.getOrThrow(CodecCache.java:46)
at org.bson.codecs.configuration.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:63)
at org.bson.codecs.configuration.ChildCodecRegistry.get(ChildCodecRegistry.java:51)
at org.bson.codecs.DocumentCodec.writeValue(DocumentCodec.java:174)
at org.bson.codecs.DocumentCodec.writeMap(DocumentCodec.java:189)
at org.bson.codecs.DocumentCodec.encode(DocumentCodec.java:131)
at org.bson.codecs.DocumentCodec.encode(DocumentCodec.java:45)
at org.bson.codecs.BsonDocumentWrapperCodec.encode(BsonDocumentWrapperCodec.java:63)
at org.bson.codecs.BsonDocumentWrapperCodec.encode(BsonDocumentWrapperCodec.java:29)
at com.mongodb.connection.InsertCommandMessage.writeTheWrites(InsertCommandMessage.java:101)
at com.mongodb.connection.InsertCommandMessage.writeTheWrites(InsertCommandMessage.java:43)
at com.mongodb.connection.BaseWriteCommandMessage.encodeMessageBodyWithMetadata(BaseWriteCommandMessage.java:129)
at com.mongodb.connection.RequestMessage.encodeWithMetadata(RequestMessage.java:160)
at com.mongodb.connection.WriteCommandProtocol.sendMessage(WriteCommandProtocol.java:212)
at com.mongodb.connection.WriteCommandProtocol.execute(WriteCommandProtocol.java:101)
at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:67)
at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:37)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.insertCommand(DefaultServerConnection.java:115)
at com.mongodb.operation.MixedBulkWriteOperation$Run$2.executeWriteCommandProtocol(MixedBulkWriteOperation.java:455)
at com.mongodb.operation.MixedBulkWriteOperation$Run$RunExecutor.execute(MixedBulkWriteOperation.java:646)
at com.mongodb.operation.MixedBulkWriteOperation$Run.execute(MixedBulkWriteOperation.java:401)
at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:179)
at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:168)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:230)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:221)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
at com.mongodb.Mongo.execute(Mongo.java:781)
at com.mongodb.Mongo$2.execute(Mongo.java:764)
at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:323)
at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:311)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1$$anonfun$apply$2.apply(MongoSpark.scala:132)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1$$anonfun$apply$2.apply(MongoSpark.scala:132)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1.apply(MongoSpark.scala:132)
at com.mongodb.spark.MongoSpark$$anonfun$save$1$$anonfun$apply$1.apply(MongoSpark.scala:131)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:186)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:184)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:154)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:184)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:131)
at com.mongodb.spark.MongoSpark$$anonfun$save$1.apply(MongoSpark.scala:130)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I remove the list (i.e. do not put it as a field in the document) in the mapToDocument function, everything works. I have already searched the internet for similar problems and couldn't find any solution that fits.
Does anyone have a clue how to solve it?
Thanks in advance.
From the unsupported types section in the documentation:
Some Scala types (e.g. Lists) are unsupported and should be converted
to their Java equivalent. To convert from Scala into native types
include the following import statement to use the .asJava method.
import scala.collection.JavaConverters._
import org.bson.Document
import com.mongodb.spark.MongoSpark // needed for MongoSpark.save below
val documents = sc.parallelize(
  Seq(new Document("fruits", List("apples", "oranges", "pears").asJava))
)
MongoSpark.save(documents)
The reason they are unsupported is that the Mongo Spark Connector uses the Mongo Java Driver underneath, as there's no point in using the Scala async driver in this context. However, it does mean that for RDDs you have to map to supported Java types. When using Datasets, these conversions are done for you automatically.
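Applied to the code in the question, the conversion would presumably go inside mapToDocument, roughly like this (the field names and the shape of the tuples here are placeholders, not taken from the original code):
import scala.collection.JavaConverters._
import org.bson.Document
// hypothetical shape of the (key, values) pairs produced by groupByKey/mapValues
def mapToDocument(pair: (String, List[String])): Document = {
  val (userId, files) = pair
  new Document("userId", userId)
    .append("files", files.asJava) // convert the Scala List to a java.util.List before saving
}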

javax.mail throws IndexOutOfBoundsException when index do exists

Package used: com.sun.mail:javax.mail:1.5.6 (from Maven).
I wrote a Scala program that uses javax.mail to deal with emails. In the first part I get some mail ids via message.getMessageNumber, and later, when I try to retrieve the mails by these ids, an IndexOutOfBoundsException is thrown. Nothing changed on the mail server during the process.
Here is the code where I get the message ids:
val Final = new AndTerm(Subject, Size)
//val FinalTerm = new AndTerm(From)
val messages = inbox.search(Final).map { message =>
  val date = trim(message.getSubject)
  (date, message.getMessageNumber)
}.filter(_._1.isDefined).map(_._2)
inbox.close(true)
store.close
And here is the code where the exception is thrown:
// opened another store and Folder with the same name
val ContentType = messages.map(id => inbox.getMessage(id).getContentType())
inbox.close(true)
store.close
The Exception Message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: 416 > 64
at com.sun.mail.imap.IMAPFolder.checkRange(IMAPFolder.java:513)
at com.sun.mail.imap.IMAPFolder.getMessage(IMAPFolder.java:1770)
at EmailReader.MessageByNumber(EmailReader.scala:67)
at Main$$anonfun$main$1.apply(Main.scala:43)
at Main$$anonfun$main$1.apply(Main.scala:41)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at Main$.main(Main.scala:40)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The 416 > 64 gives me a hint that maybe there is some server-side limitation. Is that true?
It looks like you're closing the Folder after fetching the Message objects. Message numbers (and Message objects) are only valid while the folder is open.
I believe the numbers mean you passed a message number of 416 while the currently open folder only contains 64 messages. getMessage(n) retrieves the message at position n in the currently open folder, so a message number saved from a previous session is not directly translatable to a position in the newly opened folder (and may not exist at all).
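A sketch of the first suggestion, i.e. doing all the work while the folder is still open (reusing Final and trim from the question, untested):
val results = inbox.search(Final).flatMap { message =>
  trim(message.getSubject).map { date =>
    // read everything needed from the message before the folder is closed
    (date, message.getMessageNumber, message.getContentType)
  }
}
inbox.close(true)
store.close()
If identifiers have to survive across sessions, IMAP UIDs (UIDFolder.getUID / getMessagesByUID) are the stable alternative to message numbers.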

Openjpa-2.2.2-r422266:1468616 nonfatal user error caused by ArgumentException and InvalidDataAccessApiUsageException

We have a web application that runs on Tomcat and contains the following code to find a schema entity:
@Override
public Schema findSchemaByCategoryAndDomainId(String category, Integer domainId) throws Exception
{
    return schemaDao.findByCategoryAndDomainId(category, domainId);
}
The database and ORM we use are as follows:
DataBase: PostgreSQL v9.4 (Windows version)
OpenJPA: version 2.2.2
SpringDataJPA: version 1.3.0.RELEASE
It works fine at the beginning, but after a large number of queries have been made (about 112,000 in 4 hours) it starts to fail and throws the exception below:
[ERROR][datacollection.service.DataCollectionServiceImpl.postProbeData():711][16/06/15 11:53:27.147]
Exception:
org.springframework.dao.InvalidDataAccessApiUsageException: Parameter ParameterExpression<Integer> for query "null" exceeds the number of 2 bound parameters with following values "{ParameterExpression<Integer>=0, ParameterExpression<String>=PROBE_DATA}". This can happen if you have declared but missed to bind values for one or more parameters.; nested exception is <openjpa-2.2.2-r422266:1468616 nonfatal user error> org.apache.openjpa.persistence.ArgumentException: Parameter ParameterExpression<Integer> for query "null" exceeds the number of 2 bound parameters with following values "{ParameterExpression<Integer>=0, ParameterExpression<String>=PROBE_DATA}". This can happen if you have declared but missed to bind values for one or more parameters.
at org.springframework.orm.jpa.EntityManagerFactoryUtils.convertJpaAccessExceptionIfPossible(EntityManagerFactoryUtils.java:384)
at org.springframework.orm.jpa.DefaultJpaDialect.translateExceptionIfPossible(DefaultJpaDialect.java:122)
at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.translateExceptionIfPossible(AbstractEntityManagerFactoryBean.java:417)
at org.springframework.dao.support.ChainedPersistenceExceptionTranslator.translateExceptionIfPossible(ChainedPersistenceExceptionTranslator.java:59)
at org.springframework.dao.support.DataAccessUtils.translateIfNecessary(DataAccessUtils.java:213)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:147)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.data.jpa.repository.support.LockModeRepositoryPostProcessor$LockModePopulatingMethodIntercceptor.invoke(LockModeRepositoryPostProcessor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
at com.sun.proxy.$Proxy104.findByCategoryAndDomainId(Unknown Source)
at devicemanage.service.appdeploy.AdminServiceImpl.findSchemaByCategoryAndDomainId(AdminServiceImpl.java:2080)
at devicemanage.service.appdeploy.AdminServiceImpl.findSchemaOIdByCategoryAndDomainId(AdminServiceImpl.java:2086)
at devicemanage.service.appdeploy.AdminServiceImpl$$FastClassBySpringCGLIB$$76519d18.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:649)
at devicemanage.service.appdeploy.AdminServiceImpl$$EnhancerBySpringCGLIB$$7f8fa940.findSchemaOIdByCategoryAndDomainId(<generated>)
at datacollection.service.DataCollectionServiceImpl.postProbeData(DataCollectionServiceImpl.java:683)
...
Caused by: <openjpa-2.2.2-r422266:1468616 nonfatal user error> org.apache.openjpa.persistence.ArgumentException: Parameter ParameterExpression<Integer> for query "null" exceeds the number of 2 bound parameters with following values "{ParameterExpression<Integer>=0, ParameterExpression<String>=PROBE_DATA}". This can happen if you have declared but missed to bind values for one or more parameters.
at org.apache.openjpa.kernel.ExpressionStoreQuery$AbstractExpressionExecutor.toParameterArray(ExpressionStoreQuery.java:423)
at org.apache.openjpa.datacache.QueryCacheStoreQuery$QueryCacheExecutor.toParameterArray(QueryCacheStoreQuery.java:481)
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:857)
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:794)
at org.apache.openjpa.kernel.DelegatingQuery.execute(DelegatingQuery.java:542)
at org.apache.openjpa.persistence.QueryImpl.execute(QueryImpl.java:286)
at org.apache.openjpa.persistence.QueryImpl.getResultList(QueryImpl.java:302)
at org.apache.openjpa.persistence.QueryImpl.getSingleResult(QueryImpl.java:330)
at org.springframework.data.jpa.repository.query.JpaQueryExecution$SingleEntityExecution.doExecute(JpaQueryExecution.java:123)
at org.springframework.data.jpa.repository.query.JpaQueryExecution.execute(JpaQueryExecution.java:55)
at org.springframework.data.jpa.repository.query.AbstractJpaQuery.doExecute(AbstractJpaQuery.java:95)
at org.springframework.data.jpa.repository.query.AbstractJpaQuery.execute(AbstractJpaQuery.java:85)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.invoke(RepositoryFactorySupport.java:312)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:98)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:266)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:136)
... 52 more
We have debugged it, and the parameters String category and Integer domainId both look correct. After we restart Tomcat, it works fine again. Is this a JPA issue, or do we need to tune the JPA or database settings?
After we upgraded the spring-data-jpa library to version 1.9.0.RELEASE or later, the issue was solved. We verified this under a test with 8 threads issuing asynchronous queries, one every 200 ms.