DataFrame map and HiveContext issue - Scala

Env: Spark 1.6 and Scala
Hi,
I have a DataFrame DF and tried to run:
val configTable = hivecontext.table("mydb.myTable")
configTable.rdd.map(row => {
  val abc = hivecontext.sql("select count(*) as num_rows from mydb2.mytable2")
}).collect()
I am getting this exception:
17/03/28 22:47:04 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
Is it not allowed to use Spark SQL inside rdd.map? Is there any workaround for this?
Thanks
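The cause here is the same one described in the UDF question further down: the HiveContext exists only on the driver, so calling hivecontext.sql inside rdd.map runs on an executor where the context is null. A minimal sketch of one workaround, assuming the inner query does not depend on the current row, is to run it once on the driver and capture only the plain result in the closure:
// Sketch (not from the original thread): run the lookup query once on the driver.
val configTable = hivecontext.table("mydb.myTable")
val numRows: Long = hivecontext
  .sql("select count(*) as num_rows from mydb2.mytable2")
  .head()
  .getLong(0)
// Only the Long value is shipped to the executors, not the HiveContext.
val rowsWithCount = configTable.rdd.map(row => (row, numRows)).collect()
If the inner query does depend on the row, restructure the logic as a join between the two tables instead.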

Related

Spark task fails to write rows into ORC table

I run the following code for a spatial join on geometry fields:
val coverage = DimCoverageReader.apply(spark, params)
coverage.createOrReplaceTempView("dim_coverage")
val uniqueGeometries = spark.table(params.UniqueGeometriesTable)
uniqueGeometries.createOrReplaceTempView("unique_geometries")
spark.sql(
  """select a.*, b.lac, b.cell_id
    |from unique_geometries as a, dim_coverage as b
    |where ST_Intersects(ST_GeomFromWKT(a.geo_wkt), ST_GeomFromWKT(b.geo_wkt))
    |""".stripMargin)
The resulting DataFrame is later saved into an ORC table:
Stage(spark,params).write
.format("orc")
.mode(SaveMode.Overwrite)
.saveAsTable(params.IntersectGeometriesTable)
I get this error during execution:
org.apache.spark.SparkException: Task failed while writing rows
0/10/30 17:37:19 ERROR Executor: Exception in task 205.0 in stage 4.0 (TID 1219)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 320 expected: 800
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.writeStripe(WriterImpl.java:803)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1742)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2133)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:352)
at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2413)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:76)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:55)
at org.apache.spark.sql.hive.orc.OrcOutputWriter.write(OrcFileFormat.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:325)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
... 8 more
What is the root cause of this problem?
If this works fine with format("parquet"), my guess is that you have some sort of struct type or formatting issue. Can you add the printSchema output for your DataFrame?
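For reference, a rough sketch of that check, using the names from the question (the "_parquet_check" suffix is just a placeholder table name):
// Print the schema of the DataFrame that is about to be written.
Stage(spark, params).printSchema()

// Write the same data as Parquet; if this succeeds, the problem is likely
// in the ORC writer or in a column type it handles badly.
Stage(spark, params).write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable(params.IntersectGeometriesTable + "_parquet_check")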

Filter on map doesn't prevent ArrayIndexOutofBounds in Spark

I am trying to read pipe-delimited text files, and here is the code I have:
val mySchema = new StructType(Array(
  new StructField("COL1", StringType, true),
  new StructField("COL2", StringType, true),
  new StructField("COL3", StringType, true),
  new StructField("COL4", StringType, true),
  new StructField("COL5", StringType, true)
))
val myRDD = sparkSession
  .sparkContext
  .textFile(myFilePath)
  .map(_.split("[|]"))
  .filter(_.length >= 5)
  .map(r => Row(r(0), r(1), r(2), r(3), r(4)))
val myDF = sparkSession.sqlContext.createDataFrame(myRDD, mySchema)
myDF.createOrReplaceTempView("MY_VIEW")
val totalRecords = sparkSession.sqlContext.sql("SELECT * FROM MY_VIEW")
val countStr = "Total Records = " + totalRecords.count()
And I am getting the following error in the logs:
WARN TaskSetManager: Lost task 0.0 in stage 52.0 (TID 2460, executor 2): java.lang.ArrayIndexOutOfBoundsException: 4
Since we are filtering the records after splitting each row into columns, why is ArrayIndexOutOfBoundsException still being thrown? I am unable to figure this out.
Any help is appreciated.
Thanks
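One detail worth checking in that code (not necessarily the root cause of your exception): java.lang.String.split drops trailing empty strings by default, so the array length can differ per row even when every row contains the same number of delimiters. A small sketch of the difference:
val line = "a|b|||"                 // row whose last three columns are empty
line.split("[|]").length            // 2 -- trailing empty fields are dropped
line.split("[|]", -1).length        // 5 -- all fields are kept

// If empty trailing columns should still count as columns, split with an
// explicit limit before applying the length filter:
// .map(_.split("[|]", -1))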

Spark NegativeArraySizeException

In a Spark job I join two RDDs:
val data: RDD[(Long, (String, String))] = sc.objectFile[(Long, scala.collection.mutable.HashMap[String, Object])](outputFile)
.leftOuterJoin(attributionData)
Here outputFile is the output of another Spark job which processes data from Hive. One of the tables in Hive has 40 million records, and when I limit the read to fetch only 10 million records, the code works fine. However, with the full data (if I remove the limit()), the following error occurs:
10:43:27 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, buysub.com): java.lang.NegativeArraySizeException
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:409)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:227)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:228)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:23)
at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:598)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:566)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
I am using Spark 1.6. The following is the Spark configuration:
conf.set("spark.driver.memory", "4G")
conf.set("spark.executor.memory", "30G")
conf.set("spark.rdd.compress", "true")
conf.set("spark.storage.memoryFraction", "0.3")
conf.set("spark.shuffle.consolidateFiles", "true")
conf.set("spark.shuffle.memoryFraction", "0.5")
conf.set("spark.akka.frameSize", "384")
conf.set("spark.io.compression.codec", "lz4")
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
I found some info pointing to this being a bug in Kryo serialization:
https://github.com/EsotericSoftware/kryo/issues/382
It's fixed in Kryo 4, but Spark is not yet using that version:
https://issues.apache.org/jira/browse/SPARK-20389
As a temporary workaround, it sounds like this might help:
spark.executor.extraJavaOptions -XX:hashCode=0
spark.driver.extraJavaOptions -XX:hashCode=0
(From https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808)
Or you could simply use a different serializer, though that might slow things down.
This happens when Kryo's reference table exceeds the maximum integer value (integer overflow).
To solve this, set spark.kryo.referenceTracking to false.
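In the configuration style used in the question, that looks like:
// Work around the Kryo IdentityObjectIntMap overflow by disabling reference tracking.
conf.set("spark.kryo.referenceTracking", "false")
Note that reference tracking is what lets Kryo handle object graphs with shared or circular references, so only disable it if your serialized data does not contain those.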

Trying to execute a spark sql query from a UDF

I am trying to write an inline function in the Spark framework using Scala which will take a String input, execute a SQL statement, and return a String value:
val testfunc: (String => String) = (arg1: String) => {
  val k = sqlContext.sql("""select c_code from r_c_tbl where x_nm = "something" """)
  k.head().getString(0)
}
I am registering this Scala function as a UDF:
val testFunc_test = udf(testFunc)
I have a DataFrame over a Hive table:
val df = sqlContext.table("some_table")
Then I am calling the UDF in a withColumn and trying to save the result in a new DataFrame:
val new_df = df.withColumn("test", testFunc_test($"col1"))
But every time I try to do this I get an error:
16/08/10 21:17:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.0.1.5): java.lang.NullPointerException
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:41)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.foreach(DataFrame.scala:1434)
I am relatively new to Spark and Scala, and I am not sure why this code should not run. Any insights or a workaround will be highly appreciated.
Please note that I have not pasted the whole error stack; please let me know if it is required.
You can't use sqlContext in your UDF - UDFs must be serializable to be shipped to executors, and the context (which can be thought of as a connection to the cluster) can't be serialized and sent to the node - only the driver application (where the UDF is defined, but not executed) can use the sqlContext.
Looks like your use case (performing a select from table X per record in table Y) would be better accomplished by using a join.
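A rough sketch, using the table and column names from the question. Because the filter in the UDF is a constant ("something"), the lookup yields a single value that can be computed once on the driver and attached as a literal column; a genuine per-row lookup would become a join instead:
import org.apache.spark.sql.functions.{col, lit}

val df = sqlContext.table("some_table")

// Constant lookup: run it once on the driver, then attach the value as a column.
val cCode = sqlContext
  .sql("""select c_code from r_c_tbl where x_nm = "something" """)
  .head()
  .getString(0)
val new_df = df.withColumn("test", lit(cCode))

// If the lookup actually depended on col1, a left join would replace the literal:
// val new_df = df.join(sqlContext.table("r_c_tbl"), df("col1") === col("x_nm"), "left_outer")
//   .withColumnRenamed("c_code", "test")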

Read CSV as dataframe and convert to JSON string

I'm trying to aggregate a CSV file via Spark SQL and then show the result as JSON:
val people = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("/tmp/people.csv")
people.registerTempTable("people")
val result = sqlContext.sql("select country, count(*) as cnt from people group by country")
That's where I'm stuck. I can do a result.schema.prettyJson, which works flawlessly, but I can't find a way to return the result as JSON.
I was assuming that result.toJSON.collect() should do what I desire, but this fails with a
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 101.0 failed 1 times, most recent failure: Lost task 1.0 in stage 101.0 (TID 159, localhost): java.lang.NegativeArraySizeException
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:171)
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:162)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
error. Can somebody guide me?
The error you're getting is odd; it sounds like result is probably empty?
You might want to try this command on the dataframe to get each line printed out instead:
result.toJSON.foreach(println)
See the DataFrame API for a little more information.
Turns out this error was because of a "malformed" CSV file. It contained some rows which had more columns than others (with no header field name)... Strange error message though.
Try
val people = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("delimiter", ",")
  .load("/tmp/people.csv")
people.registerTempTable("people")
val result = sqlContext.sql("select country, count(*) as cnt from people group by country")
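With the malformed rows dropped, result.toJSON.collect() from the question should then return an Array[String] with one JSON document per aggregated row:
val json: Array[String] = result.toJSON.collect()
json.foreach(println)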