Lost executor on simple sparksql join query


I am running a simple Spark SQL query that joins two datasets of around 500 GB each, so about 1 TB of data in total.
val adreqPerDeviceid = sqlContext.sql("select count(Distinct a.DeviceId) as MatchCount from adreqdata1 a inner join adreqdata2 b ON a.DeviceId=b.DeviceId ")
adreqPerDeviceid.cache()
adreqPerDeviceid.show()
The job works fine through the data-loading stage (10k tasks assigned).
200 tasks are then assigned at the .cache line, and that is where it fails. I know I am not caching a large amount of data, just a single number, so why does it fail here?
Below are the error details:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at comScore.DayWiseDeviceIDMatch$.main(DayWiseDeviceIDMatch.scala:62)
at comScore.DayWiseDeviceIDMatch.main(DayWiseDeviceIDMatch.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Whenever you join huge datasets, for example to compute an aggregate over the join of two datasets, the cluster needs at least (Dataset1 + Dataset2) worth of hard disk space, not RAM, because the shuffle data for the join is written to local disk. With that much disk available, the job will succeed.

Most likely the unique device IDs don't fit in the RAM of a single executor. The 200 tasks come from the default value of spark.sql.shuffle.partitions; try sqlContext.setConf("spark.sql.shuffle.partitions", "500") to get 500 tasks instead of your current 200. If the query still performs badly, double it again.
Having the data sorted by the key you're joining on may also help the query run better.
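To see why raising spark.sql.shuffle.partitions helps, here is a rough back-of-the-envelope sketch. It assumes, as a worst case, that the full 1 TB is shuffled and spread evenly across partitions; the real shuffle is probably much smaller because only DeviceId is needed, and skewed keys would make some tasks larger. The object and method names are made up for illustration.

```scala
object ShufflePartitionEstimate {
  // Average shuffle data one task has to handle, assuming an even spread.
  def perTaskGiB(totalGiB: Double, partitions: Int): Double =
    totalGiB / partitions

  def main(args: Array[String]): Unit = {
    val totalGiB = 1024.0 // assumption: the whole ~1 TB input is shuffled
    for (p <- Seq(200, 500, 1000, 2000)) {
      println(f"$p%5d partitions -> ${perTaskGiB(totalGiB, p)}%.2f GiB per task")
    }
  }
}
```

With the default of 200 partitions each task would face several GiB under this assumption, which is easily more than one executor core's share of memory; doubling the partition count halves that figure.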

Related

Why do year and month functions result in long overflow in Spark?

I'm trying to create year and month columns from a column named logtimestamp (of type TimestampType) in Spark. The data source is Cassandra. I am using spark-shell to perform these steps. Here is the code I have written:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.{col, month, year}
import org.apache.spark.sql.types._
// Read the source table from keyspace "cw"
val logsDF = spark.read.cassandraFormat("tableName", "cw").load()
// Derive year and month from the timestamp column
val newlogs = logsDF.withColumn("year", year(col("logtimestamp")))
  .withColumn("month", month(col("logtimestamp")))
// Append the result to the target table
newlogs.write.cassandraFormat("tableName_v2", "cw")
  .mode("Append").save()
But these steps do not succeed; I end up with the following error:
java.lang.ArithmeticException: long overflow
at java.lang.Math.multiplyExact(Math.java:892)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:205)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromJavaTimestamp(DateTimeUtils.scala:166)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$TimestampConverter$.toCatalystImpl(CatalystTypeConverters.scala:327)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$TimestampConverter$.toCatalystImpl(CatalystTypeConverters.scala:325)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:252)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:242)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:426)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:34)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:21)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.$anonfun$getIterator$2(CassandraScanPartitionReaderFactory.scala:110)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.next(CassandraScanPartitionReaderFactory.scala:66)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I thought it had something to do with null values in the table, so I ran the following:
scala> logsDF.filter("logtimestamp is null").show()
But this gave the same long overflow error.
How can there be an overflow in Spark but not in Cassandra when both use 8-byte timestamps?
What could be the issue here, and how do I extract the year and month from the timestamp correctly?

It turned out that one of the Cassandra tables had a timestamp value that was greater than the highest value Spark allows but not large enough to overflow in Cassandra. The timestamp had been manually edited to get around the upserting that Cassandra does by default, and this produced some very large values during development.
I ran a Python script to find this out.
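The mismatch can be reproduced without Spark. Cassandra stores a timestamp as milliseconds in a 64-bit long, while the stack trace above shows Spark converting to microseconds with Math.multiplyExact in millisToMicros, so any millisecond value beyond Long.MaxValue / 1000 overflows. The following is a minimal sketch of that arithmetic; the object and helper names are hypothetical, not part of Spark or the connector.

```scala
import scala.util.Try

object TimestampOverflowCheck {
  // A millisecond value must survive multiplication by 1000 to be
  // representable as microseconds in a signed 64-bit Long.
  val MaxSafeMillis: Long = Long.MaxValue / 1000L
  val MinSafeMillis: Long = Long.MinValue / 1000L

  // True when the millisecond value can be converted without overflow.
  def fitsInSparkMicros(millis: Long): Boolean =
    millis >= MinSafeMillis && millis <= MaxSafeMillis

  def main(args: Array[String]): Unit = {
    val normal = System.currentTimeMillis()
    val tooLarge = MaxSafeMillis + 1 // hypothetical bad value, valid as a plain long

    println(s"normal fits:   ${fitsInSparkMicros(normal)}")
    println(s"tooLarge fits: ${fitsInSparkMicros(tooLarge)}")
    // Same multiplication as the failing frame; yields a Failure for tooLarge.
    println(Try(Math.multiplyExact(tooLarge, 1000L)))
  }
}
```

A check like fitsInSparkMicros could be applied to the raw millisecond values read outside Spark to locate the offending rows, which also explains why even the "is null" filter failed: the conversion happens while rows are read, before any filter runs.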

How to filter a Dataframe with information from other Dataframe using command filter

I have a big DataFrame with a lot of information from different devices, identified by their IDs. I would like to filter this DataFrame down to the IDs that appear in a second DataFrame. I know I can easily do this with join, but I would like to try it with filter.
I'm also trying this because I've read that filter is more efficient than join. Could someone shed some light on that?
Thank you
I've tried this:
val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect()))
But I get the following error:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Unsupported component type class java.lang.Object in arrays;
=== Streaming Query ===
Identifier: [id = 0d89d684-d794-407d-a03c-feb3ad6a78c2, runId = b7b774c0-ce83-461e-ac26-7535d6d2b0ac]
Current Committed Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":270902}}}
Current Available Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":271296}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [value2#21.meterid AS meterid#23]
+- Project [jsontostructs(StructField(meterid,StringType,true), cast(value#8 as string), Some(Europe/Paris)) AS value2#21]
+- StreamingExecutionRelation KafkaV2[Subscribe[MeterEEM]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.sql.AnalysisException: Unsupported component type class java.lang.Object in arrays;
at org.apache.spark.sql.catalyst.expressions.Literal$.componentTypeToDataType(literals.scala:117)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:70)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
at org.apache.spark.sql.functions$.lit(functions.scala:110)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.Column.isin(Column.scala:796)
at StreamingProcedure.MetersEEM$.meterEemCalcs(MetersEEM.scala:28)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:23)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:15)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:532)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:531)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more

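I've assumed that the data in the Id column is of Integer type.
// Assumes the SparkSession is named spark; needed for .as[Int] and the $"..." syntax
import spark.implicits._
// Collect the IDs to the driver as a typed Array[Int]
val list = DF2.select("Id").as[Int].collect()
// Expand the array into varargs for isin
val DfFiltered = DF1.filter($"Id".isin(list: _*))
DfFiltered.collect()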

The book High Performance Spark explains:
Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration as they may require large network transfers or even create datasets beyond our capability to handle. In core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn’t able to re-order or push down filters.
So choosing filter instead of join seems a good choice when the list of IDs from the second DataFrame is small enough to collect to the driver. For a large second DataFrame, a join is still the safer option.

You can simply add : _* to your code so that the collected array is expanded into separate arguments for isin; this works fine.
scala> df.filter(col("a").isin(df2.rdd.map(r => r(0)).collect(): _*)).show()
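The reason the original call fails is plain Scala varargs behaviour rather than anything Spark-specific: isin takes a variable number of arguments, so an array passed without : _* arrives as one single argument, which Spark then tries to turn into an array literal of java.lang.Object. A minimal sketch, using a hypothetical countArgs method as a stand-in for any varargs method:

```scala
object VarargsDemo {
  // Stand-in for a varargs method such as Column.isin(list: Any*).
  def countArgs(values: Any*): Int = values.length

  def main(args: Array[String]): Unit = {
    // Same static type that rdd.map(r => r(0)).collect() produces.
    val ids: Array[Any] = Array(1, 2, 3)

    println(countArgs(ids))     // the whole array is one argument
    println(countArgs(ids: _*)) // each element is its own argument
  }
}
```

The first call sees one argument (the array itself), the second sees three, which is what isin needs.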

How to write to Kafka from Spark with a changed schema without getting exceptions?

I'm loading parquet files from Databricks to Spark:
val dataset = context.session.read.parquet(parquetPath)
Then I perform some transformations like this:
// var because df is reassigned further down
var df = dataset.withColumn(
  columnName,
  concat_ws("", col(data.columnName), lit(textToAppend)))
When I try to save it as JSON to Kafka (not back to parquet!):
df = df.select(
  lit("databricks").alias("source"),
  struct("*").alias("data"))
val server = "kafka.dev.server" // some url
df = dataset.selectExpr("to_json(struct(*)) AS value")
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .option("topic", topic)
  .save()
I get the following exception:
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file dbfs:/mnt/warehouse/part-00001-tid-4198727867000085490-1e0230e7-7ebc-4e79-9985-0a131bdabee2-4-c000.snappy.parquet. Column: [item_group_id], Expected: StringType, Found: INT32
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:310)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at com.databricks.sql.io.parquet.NativeColumnReader.readBatch(NativeColumnReader.java:448)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:330)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:167)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:299)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This only happens if I'm trying to read multiple partitions. For example in the /mnt/warehouse/ directory I have a lot of parquet files each representing data from a datestamp. If I read only one of them I don't get exceptions but if I read the whole directory this exception happens.
I get this when I do a transformation, like above where I change the data type of a column. How can I fix this? I'm not trying to write back to parquet but to transform all files from the same source schema to a new schema and write them to Kafka.

There seems to be an issue with the parquet files. The item_group_id column does not have the same data type in all files: some files store it as String and others as Integer. The source code of the exception SchemaColumnConvertNotSupportedException gives this description:
Exception thrown when the parquet reader find column type mismatches.
A simple way to replicate the problem can be found among the tests for Spark on github:
Seq(("bcd", 2)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet(s"$path/parquet")
Seq((1, "abc")).toDF("a", "b").coalesce(1).write.mode("append").parquet(s"$path/parquet")
spark.read.parquet(s"$path/parquet").collect()
Of course, this will only happen when reading multiple files at once, or as in the test above where more data has been appended. If a single file is read then there will not be a mismatch issue between the datatypes of a column.
The easiest way to fix the problem would be to make sure that the column types of all files are correct while writing the files.
The alternative is to read all the parquet files separately, change the schemas to match, and then combine them with union. An easy way to do this is to adjust the schemas:
// Specify the files and read as separate dataframes
val files = Seq(...)
val dfs = files.map(file => spark.read.parquet(file))
// Specify the schema (here the schema of the first file is used)
val schema = dfs.head.schema
// Create new columns with the correct names and types
val newCols = schema.map(c => col(c.name).cast(c.dataType))
// Select the new columns and merge the dataframes
val df = dfs.map(_.select(newCols: _*)).reduce(_ union _)

You can find the instructions at this link.
It presents the different ways to write data to a Kafka topic.

Hive crashing on where clause

I am trying to get a hive-hadoop-mongo setup to work. I have imported the data into mongodb from a json file, then I created both internal and external tables in hive that connect to mongo:
CREATE EXTERNAL TABLE reviews(
user_id STRING,
review_id STRING,
stars INT,
date1 STRING,
text STRING,
type STRING,
business_id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.reviews');
This part works fine because a select all query (select * from reviews) outputs everything like it should. But when I do one with a where clause (select * from reviews where stars=4 for example) hive crashes.
I have the following jars being added when I start up hive:
add jar mongo-hadoop.jar;
add jar mongo-java-driver-3.3.0.jar;
add jar mongo-hadoop-hive-2.0.1.jar;
And if it is relevant in any sense, I am using Amazon's EMR cluster for this, and I'm connected through ssh.
Thanks for all the help
Here is the error hive throws out:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression(Ljava/lang/String;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getFilter(HiveMongoInputFormat.java:134)
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getRecordReader(HiveMongoInputFormat.java:103)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:691)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:329)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:455)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Create the table as shown below and check again.
CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];
Instead of using a StorageHandler to read, serialize, deserialize, and output the data from Hive objects to BSON objects, the individual components are listed separately. This is because using a StorageHandler has too many negative effects when dealing with the native HDFS filesystem.

I see
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
and you are querying the column stars which is not mapped.

I ran into this problem on our cluster.
The cluster's Hive version is higher than the version mongo-hive was built against (which is 1.2.1).
The old method org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression has been moved to org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpression.
You need to rebuild the jar yourself against your cluster's Hive version.

sparkR-Mongo connector query to subdocument

I am using the Mongo-Spark connector. All the examples in the documentation (https://docs.mongodb.com/spark-connector/sparkR/) work fine, but if I run a query against a document that has subdocuments it fails; apparently SQL is not ready for this query:
result <- sql(sqlContext, "SELECT DOCUMENT.SUBDOCUMENT FROM TABLE")
ERROR:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast INT32 into a ConflictType (value: BsonInt32{value=171609012})
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:79)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:38)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:36)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:36)
at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:108)
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:74)
Before that, I registered the table as follows:
registerTempTable(schema, "TABLE")
I guess the key problem is how to register a Mongo subdocument as a table.
Does anyone have a solution?

Solution: a given field must have the same type in every document. I had the same field stored as String in some documents and as Double in others, which is why the table could be registered but could not be processed.
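One way to spot such fields before querying is to list, for each field, the set of value types that occur across documents. The sketch below does this in plain Scala on in-memory maps that stand in for documents; the object name, helper names and sample data are all hypothetical, and with real data the documents would first have to be read from MongoDB.

```scala
object FieldTypeAudit {
  type Doc = Map[String, Any]

  // For every field name, collect the simple class names of its values.
  def typesPerField(docs: Seq[Doc]): Map[String, Set[String]] =
    docs
      .flatMap(doc => doc.map { case (field, value) => field -> value.getClass.getSimpleName })
      .groupBy(_._1)
      .map { case (field, pairs) => field -> pairs.map(_._2).toSet }

  // Fields that appear with more than one type are the problematic ones.
  def conflictingFields(docs: Seq[Doc]): Set[String] =
    typesPerField(docs).collect { case (field, types) if types.size > 1 => field }.toSet

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: "price" is a Double in one document, a String in the other.
    val docs: Seq[Doc] = Seq(
      Map("name" -> "a", "price" -> 10.5),
      Map("name" -> "b", "price" -> "10.5")
    )
    println(conflictingFields(docs))
  }
}
```

Fields reported here are the ones to convert to a single type in the collection before registering the table.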
