I'm trying to create year and month columns from a column named logtimestamp (of type TimestampType) in Spark. The data source is Cassandra. I am using spark-shell to perform these steps; here is the code I have written:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.types._
var logsDF = spark.read.cassandraFormat("tableName", "cw").load()
var newlogs = logsDF.withColumn("year", year(col("logtimestamp")))
.withColumn("month", month(col("logtimestamp")))
newlogs.write.cassandraFormat("tableName_v2", "cw")
.mode("Append").save()
But these steps do not succeed, and I end up with the following error:
java.lang.ArithmeticException: long overflow
at java.lang.Math.multiplyExact(Math.java:892)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:205)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromJavaTimestamp(DateTimeUtils.scala:166)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$TimestampConverter$.toCatalystImpl(CatalystTypeConverters.scala:327)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$TimestampConverter$.toCatalystImpl(CatalystTypeConverters.scala:325)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:252)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:242)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:426)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:34)
at com.datastax.spark.connector.datasource.UnsafeRowReader.read(UnsafeRowReaderFactory.scala:21)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.$anonfun$getIterator$2(CassandraScanPartitionReaderFactory.scala:110)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
at com.datastax.spark.connector.datasource.CassandraPartitionReaderBase.next(CassandraScanPartitionReaderFactory.scala:66)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I thought it was something to do with null values in the table, so I ran the following:
scala> logsDF.filter("logtimestamp is null").show()
But this too gave the same long overflow error.
How come there is an overflow in Spark but not in Cassandra, when both have timestamps of 8 bytes?
What could be the issue here and how do I extract year and month from timestamp correctly?
It turns out the Cassandra table had a timestamp value that was greater than the highest value allowed by Spark, but not large enough to overflow in Cassandra. The timestamp had been manually edited to get around the upserting that Cassandra does by default, but this led to some large values being formed during development.
I ran a Python script to find this out.
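For reference, a rough sketch of how such out-of-range rows could also be located from spark-shell itself (instead of an external script), assuming the connector's RDD API can hand the raw value back as a java.util.Date; the keyspace and table names are the ones from the question:
import com.datastax.spark.connector._

// Spark keeps timestamps as microseconds in a signed 64-bit long, while Cassandra
// stores milliseconds, so any value above Long.MaxValue / 1000 milliseconds
// overflows in the millisToMicros call shown in the stack trace.
val maxMillisSparkCanHold = Long.MaxValue / 1000L

sc.cassandraTable("cw", "tableName")
  .filter { row =>
    // read the raw value as java.util.Date to stay outside Spark SQL's conversion
    row.get[Option[java.util.Date]]("logtimestamp")
      .exists(_.getTime > maxMillisSparkCanHold)
  }
  .collect()          // only the offending rows reach the driver
  .foreach(println)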
I have a big DataFrame with a lot of information from different devices, identified by their IDs. What I would like to do is filter this DataFrame with the IDs that are in a second DataFrame. I know that I can easily do it with a join, but I would like to try it with filter.
Also, I'm trying it because I've read that filter is more efficient than join; could someone shed some light on that?
Thank you
I've tried this:
val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect()))
But I get the following error:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Unsupported component type class java.lang.Object in arrays;
=== Streaming Query ===
Identifier: [id = 0d89d684-d794-407d-a03c-feb3ad6a78c2, runId = b7b774c0-ce83-461e-ac26-7535d6d2b0ac]
Current Committed Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":270902}}}
Current Available Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":271296}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [value2#21.meterid AS meterid#23]
+- Project [jsontostructs(StructField(meterid,StringType,true), cast(value#8 as string), Some(Europe/Paris)) AS value2#21]
+- StreamingExecutionRelation KafkaV2[Subscribe[MeterEEM]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.sql.AnalysisException: Unsupported component type class java.lang.Object in arrays;
at org.apache.spark.sql.catalyst.expressions.Literal$.componentTypeToDataType(literals.scala:117)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:70)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
at org.apache.spark.sql.functions$.lit(functions.scala:110)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.Column.isin(Column.scala:796)
at StreamingProcedure.MetersEEM$.meterEemCalcs(MetersEEM.scala:28)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:23)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:15)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:532)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:531)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
I've made the assumption that the data in the Id column is of an Integer datatype.
// Collect the Ids as a typed Array[Int] so isin receives Int literals instead of java.lang.Object
val list = DF2.select("Id").as[Int].collect()
val DfFiltered = DF1.filter($"Id".isin(list: _*))
DfFiltered.collect()
In the book High Performance Spark, it's explained that:
Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration as they may require large network transfers or even create datasets beyond our capability to handle. In core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn't able to re-order or push down filters.
So choosing filter instead of join seems like a good choice.
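As a side note beyond the quoted passage (my own sketch, not from the book or from the answer above): the same filtering can also be expressed as a left semi join, which keeps the work inside Spark instead of collecting the IDs to the driver:
// Keeps only the rows of DF1 whose Id also appears in DF2, computed as a join
// inside Spark rather than collecting the Ids to the driver (DF1/DF2 as in the question).
val DfFiltered = DF1.join(DF2.select("Id"), Seq("Id"), "left_semi")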
You can simply add (: _*) to your code and this will work perfectly fine.
scala> val DfFiltered = df.filter(col("a").isin(df2.rdd.map(r => r(0)).collect():_*)).show()
I'm loading parquet files from Databricks to Spark:
val dataset = context.session.read().parquet(parquetPath)
Then I perform some transformations like this:
val df = dataset.withColumn(
  columnName,
  concat_ws("", col(data.columnName), lit(textToAppend)))
When I try to save it as JSON to Kafka (not back to parquet!):
df = df.select(
lit("databricks").alias("source"),
struct("*").alias("data"))
val server = "kafka.dev.server" // some url
df = dataset.selectExpr("to_json(struct(*)) AS value")
df.write()
.format("kafka")
.option("kafka.bootstrap.servers", server)
.option("topic", topic)
.save()
I get the following exception:
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file dbfs:/mnt/warehouse/part-00001-tid-4198727867000085490-1e0230e7-7ebc-4e79-9985-0a131bdabee2-4-c000.snappy.parquet. Column: [item_group_id], Expected: StringType, Found: INT32
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:310)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at com.databricks.sql.io.parquet.NativeColumnReader.readBatch(NativeColumnReader.java:448)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:330)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:167)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:299)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This only happens if I'm trying to read multiple partitions. For example, in the /mnt/warehouse/ directory I have a lot of parquet files, each representing data for a datestamp. If I read only one of them I don't get exceptions, but if I read the whole directory this exception happens.
I get this when I do a transformation, like above where I change the data type of a column. How can I fix this? I'm not trying to write back to parquet but to transform all files from the same source schema to a new schema and write them to Kafka.
There seems to be an issue with the parquet files. The item_group_id column is not of the same data type in all of the files; some files have the column stored as String and others as Integer. From the source code of the exception SchemaColumnConvertNotSupportedException we see the description:
Exception thrown when the parquet reader find column type mismatches.
A simple way to replicate the problem can be found among the tests for Spark on github:
Seq(("bcd", 2)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet(s"$path/parquet")
Seq((1, "abc")).toDF("a", "b").coalesce(1).write.mode("append").parquet(s"$path/parquet")
spark.read.parquet(s"$path/parquet").collect()
Of course, this will only happen when reading multiple files at once, or as in the test above where more data has been appended. If a single file is read then there will not be a mismatch issue between the datatypes of a column.
The easiest way to fix the problem would be to make sure that the column types of all files are correct while writing the files.
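If it is not obvious which files are the offending ones, a small sketch like the following (the paths are placeholders, item_group_id is the column from the error message) can help locate them before rewriting:
// Print the parquet-declared type of item_group_id for each file so the
// mismatched files (INT32 vs. String) can be identified and rewritten.
val files = Seq(
  "dbfs:/mnt/warehouse/file1.parquet",   // hypothetical paths, replace with your own
  "dbfs:/mnt/warehouse/file2.parquet")
files.foreach { path =>
  val dataType = spark.read.parquet(path).schema("item_group_id").dataType
  println(s"$path -> $dataType")
}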
The alternative is to read all the parquet files separately, change the schemas to match, and then combine them with union. An easy way to do this is to adjust the schemas:
// Specify the files and read as separate dataframes
val files = Seq(...)
val dfs = files.map(file => spark.read.parquet(file))
// Specify the schema (here the schema of the first file is used)
val schema = dfs.head.schema
// Create new columns with the correct names and types
val newCols = schema.map(c => col(c.name).cast(c.dataType))
// Select the new columns and merge the dataframes
val df = dfs.map(_.select(newCols: _*)).reduce(_ union _)
You can find the instructions at this link.
It presents the different ways to write data to a Kafka topic.
I'm using DSE 5.1 (Spark 2.0.2.6 and Cassandra 3.10.0.1652).
My Cassandra table:
CREATE TABLE ks.tbl (
dk int,
date date,
ck int,
val int,
PRIMARY KEY (dk, date, ck)
) WITH CLUSTERING ORDER BY (date DESC, ck ASC);
with the following data:
dk | date | ck | val
----+------------+----+-----
1 | 2017-01-01 | 1 | 100
1 | 2017-01-01 | 2 | 200
My code must read this data and write the same thing but with yesterday's date (it compiles successfully):
package com.datastax.spark.example
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import com.github.nscala_time.time._
import com.github.nscala_time.time.Imports._
object test extends App {
  val conf = new SparkConf().setAppName("DSE calculus app TEST")
  val sc = new SparkContext(conf)

  val yesterday = (DateTime.now - 1.days).toString(StaticDateTimeFormat.forPattern("yyyy-MM-dd"))

  val tbl = sc.cassandraTable("ks","tbl").select("dk","date","ck","val").where("dk=1")

  tbl.map(row => (row.getInt("dk"), yesterday, row.getInt("ck"), row.getInt("val"))).saveToCassandra("ks","tbl")

  sc.stop()
  sys.exit(0)
}
When I run this app:
dse spark-submit --class com.datastax.spark.example.test test-assembly-0.1.jar
It fails to properly write to Cassandra. It seems the date variable is not inserted in the map correctly.
The error I get is:
WARN 2017-05-08 22:23:16,472 org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <IP of one of my nodes>): java.io.IOException: Failed to write statements to ks.tbl.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:207)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
However, when I insert a date (string) directly in the map statement as follows, the code does insert the data correctly:
tbl.map(row => (row.getInt("dk"),"2017-02-02",row.getInt("ck"),row.getInt("val"))).saveToCassandra("ks","tbl")
It also inserts the data correctly if I set yesterday to an integer (days since epoch). This would be optimal, but I can't get 'yesterday' to behave this way.
EDIT: This does not insert data correctly, actually. No matter whether I set 'yesterday' to 1 or 100,000,000, it always inserts the epoch ('1970-01-01').
The code that fails behaves correctly, and as I would expect, in the DSE Spark console.
I just can't figure out what I'm doing wrong. Any help is welcome.
EDIT 2: The executor 0 stderr log does show that it's trying to insert a null value into the date column, which is obviously not possible since it's a clustering column.
When writing code for a Spark job it's important to realize when particular variables are set and when they are serialized. Let's take a look at a note from the App trait docs:
Caveats
It should be noted that this trait is implemented using the DelayedInit functionality, which means that fields of the object will not have been initialized before the main method has been executed.
This means references to the variables used in the body of the App are possibly not initialized on the Executors when the code is actually being run.
My guess is that the lambda you have written contains a reference to a val which is initialized in the DelayedInit portion of the App class. This means the serialized version of the code on the executor, which doesn't run the main method, gets the uninitialized version of the value (null).
Switching the constant to a lazy val (or moving it into a separate object or class) would fix this issue by making sure the value is initialized remotely (lazy val) or serialized already initialized (separate class/object).
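A minimal sketch of that fix, reusing the code and imports from the question but with an explicit main method so yesterday is initialized before the closure is serialized:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import com.github.nscala_time.time._
import com.github.nscala_time.time.Imports._

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DSE calculus app TEST")
    val sc = new SparkContext(conf)

    // Initialized inside main (a lazy val at object level would also work),
    // so the closure below captures the real date string instead of null.
    val yesterday = (DateTime.now - 1.days)
      .toString(StaticDateTimeFormat.forPattern("yyyy-MM-dd"))

    sc.cassandraTable("ks", "tbl")
      .select("dk", "date", "ck", "val")
      .where("dk=1")
      .map(row => (row.getInt("dk"), yesterday, row.getInt("ck"), row.getInt("val")))
      .saveToCassandra("ks", "tbl")

    sc.stop()
    sys.exit(0)
  }
}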
I think I know what your problem is.
You should look at the full log file; you only attached a part of it...
Today I had a similar error, when I created a keyspace with replication_factor: 3 while I had only one Cassandra instance.
So I changed it and the problem was gone.
ALTER KEYSPACE "some_keyspace_name" WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Here is my error.log file. The important part of the log:
Logging.scala[logError]:72) - Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement#4746499f
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (2 required but only 1 alive)
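If changing the replication factor is not an option, an alternative workaround (my own suggestion, not part of the log above) is to lower the Spark Cassandra Connector's write consistency level, roughly like this:
import org.apache.spark.SparkConf

// Sketch: lower the connector's write consistency from the default LOCAL_QUORUM
// so a write can succeed while only one replica is alive. Verify the property
// name against the connector version you are running.
val conf = new SparkConf()
  .setAppName("DSE calculus app TEST")
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")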
I am running a simple Spark SQL query that does a match on 2 datasets; each dataset is around 500 GB, so the whole data is around 1 TB.
val adreqPerDeviceid = sqlContext.sql("select count(Distinct a.DeviceId) as MatchCount from adreqdata1 a inner join adreqdata2 b ON a.DeviceId=b.DeviceId ")
adreqPerDeviceid.cache()
adreqPerDeviceid.show()
The job works fine until data loading (10k tasks assigned).
200 tasks are assigned at the .cache line, where it fails! I know I am not caching huge data, it's just a number; why does it fail here?
Below are error details:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at comScore.DayWiseDeviceIDMatch$.main(DayWiseDeviceIDMatch.scala:62)
at comScore.DayWiseDeviceIDMatch.main(DayWiseDeviceIDMatch.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Whenever you make a join on a huge dataset, i.e. look for an aggregated value from the join of 2 datasets, your cluster needs at least (Dataset1 + Dataset2) worth of hard disk, not RAM; then the job will be successful.
Most likely the number of unique device IDs doesn't fit in the RAM of a single executor. Try spark.conf.set("spark.shuffle.partitions", "500") to get 500 tasks instead of your current 200. If the query still performs badly, double it again.
Something else that may get the query to work better is having the data sorted by the key you're joining on.
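A minimal sketch of that tuning, assuming the two temp tables are registered as in the question (spark.sql here is the newer SparkSession equivalent of the sqlContext used above):
// Raise the number of shuffle partitions before running the join-heavy query
// (200 is the default that produced the 200 failing tasks; 500 is the value suggested above).
spark.conf.set("spark.shuffle.partitions", "500")

val adreqPerDeviceid = spark.sql(
  """SELECT count(DISTINCT a.DeviceId) AS MatchCount
    |FROM adreqdata1 a
    |INNER JOIN adreqdata2 b ON a.DeviceId = b.DeviceId""".stripMargin)

adreqPerDeviceid.show()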
I am trying to get a Hive-Hadoop-MongoDB setup to work. I have imported the data into MongoDB from a JSON file, then created both internal and external tables in Hive that connect to Mongo:
CREATE EXTERNAL TABLE reviews(
user_id STRING,
review_id STRING,
stars INT,
date1 STRING,
text STRING,
type STRING,
business_id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.reviews');
This part works fine, because a select-all query (select * from reviews) outputs everything like it should. But when I do one with a where clause (select * from reviews where stars=4, for example), Hive crashes.
I add the following jars when I start up Hive:
add jar mongo-hadoop.jar;
add jar mongo-java-driver-3.3.0.jar;
add jar mongo-hadoop-hive-2.0.1.jar;
And if it is relevant in any sense, I am using Amazon's EMR cluster for this, and I'm connected through ssh.
Thanks for all the help
Here is the error Hive throws:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression(Ljava/lang/String;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getFilter(HiveMongoInputFormat.java:134)
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getRecordReader(HiveMongoInputFormat.java:103)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:691)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:329)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:455)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Create the table like below and check:
CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];
Instead of using a StorageHandler to read, serialize, deserialize, and output the data from Hive objects to BSON objects, the individual components are listed individually. This is because using a StorageHandler has too many negative effects when dealing with the native HDFS filesystem.
I see
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
and you are querying the column stars, which is not mapped.
I met this problem on our cluster.
The cluster's Hive version is higher than the version mongo-hadoop-hive is built against (which is 1.2.1).
The old method org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression has been renamed to org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpression.
You need to rebuild the jar yourself.