HDIV INVALID_PAGE_ID - hdiv

Thanks for reading. I am using Spring MVC 4 with HDIV 3. The application functions well, except that after 10-15 minutes of inactivity it errors out on a link that worked when first logged in. The error message in the log is INVALID_PAGE_ID. Could you suggest some configurations to verify? Could it be timeout related? The session or the state cache, perhaps?
thanks
Thanks for the reply. There are a number of places in the application where we execute history.back(1)
or hit the browser back button. Could the cached 'rst' param (what we call HDIV_STATE) cause issues?
We see these issues when using the browser back button, on previous requests that worked OK:
1.) 2017-11-16 10:33:58,279 : INFO : ajp-bio-10082-exec-1 : Logger : INVALID_HDIV_PARAMETER_VALUE;/smartadmin/caseHistory.action;rst;3-0-57C9472F08B18562F8AE3F5B1EED656D;;50.235.239.50;50.235.239.50;sppsapadmin; followed by
2.) 2017-11-16 10:34:12,551 : INFO : ajp-bio-10082-exec-8 : Logger : INVALID_PAGE_ID;/smartadmin/homeS911.action;rst;14-1-D901769F56363F973FC0AD848C6E9BE2;;50.235.239.50;50.235.239.50;sppsapadmin;
Any suggestions to remedy? Application changes? Workarounds?
Config
Spring/Jedis/Redis
Class:
public class HdivConfiguration extends HdivWebSecurityConfigurerAdapter {

    // other methods omitted

    @Override
    public void configure(SecurityConfigBuilder builder) {
        builder
            .sessionExpired()
                // Landing pages used when HDIV receives a request against an expired session
                .homePage("/")
                .loginPage(SecurityConfig.LOGIN)
                .and()
            .errorPage("/error.action")
            .maxPagesPerSession(30)
            .confidentiality(true)
            .validateUrlsWithoutParams(false)
            .showErrorPageOnEditableValidation(true) // Routes users to /error.action on security violation
            .cookiesConfidentiality(false)
            .cookiesIntegrity(false)
            .randomName(false)
            .stateParameterName("rst")
            .reuseExistingPageInAjaxRequest(true);
    }
}
Jars:
account-profile-1.18.2.jar (508.54 KB)
activation-1.1.jar (61.5 KB)
antlr-2.7.7.jar (434.85 KB)
aopalliance-1.0.jar (4.36 KB)
classmate-1.3.0.jar (62.6 KB)
commons-beanutils-1.8.3.jar (226.58 KB)
commons-codec-1.10.jar (277.52 KB)
commons-collections4-4.1.jar (733.63 KB)
commons-digester-2.0.jar (145.29 KB)
commons-fileupload-1.3.2.jar (68.63 KB)
commons-io-2.4.jar (180.8 KB)
commons-lang-2.6.jar (277.55 KB)
commons-lang3-3.4.jar (424.49 KB)
commons-logging-1.2.jar (60.37 KB)
commons-pool2-2.4.2.jar (109.34 KB)
commons-validator-1.4.0.jar (172.75 KB)
concurrent-4.2.1.GA.jar (247.93 KB)
corporateprofile-core-1.23.3.jar (643.88 KB)
dom4j-1.6.1.jar (306.54 KB)
freemarker-2.3.23.jar (1.28 MB)
gcm-server-1.0.jar (19.25 KB)
geronimo-jta_1.1_spec-1.1.1.jar (15.65 KB)
gson-2.6.1.jar (225.47 KB)
guava-19.0.jar (2.2 MB)
hdiv-config-3.2.0.jar (61.54 KB)
hdiv-core-3.2.0.jar (139.16 KB)
hdiv-jstl-taglibs-1.2-3.2.0.jar (20.3 KB)
hdiv-spring-mvc-3.2.0.jar (19.61 KB)
hibernate-commons-annotations-5.0.1.Final.jar (73.52 KB)
hibernate-core-5.1.0.Final.jar (5.41 MB)
hibernate-jpa-2.1-api-1.0.0.Final.jar (110.71 KB)
hibernate-validator-5.2.4.Final.jar (687.95 KB)
httpclient-4.5.2.jar (719.39 KB)
httpcore-4.4.4.jar (319.06 KB)
httpi-client-3.5.1.jar (13.28 KB)
imgscalr-lib-4.2.jar (27.23 KB)
jackson-annotations-2.7.0.jar (49.7 KB)
jackson-core-2.7.1.jar (246.38 KB)
jackson-databind-2.7.1-1.jar (1.14 MB)
jackson-datatype-hibernate5-2.7.2-r1.jar (20.36 KB)
jai-codec-1.1.3.jar (252.1 KB)
jai-core-1.1.3.jar (1.81 MB)
jandex-2.0.0.Final.jar (183.35 KB)
jasypt-1.9.0.jar (122.69 KB)
javapns-jdk16-2.4.0.jar (149.45 KB)
javassist-3.20.0-GA.jar (732.98 KB)
javax.el-api-2.2.4.jar (37.95 KB)
jboss-client-4.2.1.GA.jar (189.1 KB)
jboss-common-client-1.2.0.GA.MOD.jar (368.92 KB)
jboss-j2ee-4.2.1.GA.jar (413.77 KB)
jboss-logging-3.3.0.Final.jar (65.23 KB)
jboss-remoting-2.2.1.GA.jar (862.2 KB)
jbossha-client-4.2.1.GA.jar (51.93 KB)
jbossmq-client-4.2.1.GA.jar (325.06 KB)
jcl-over-slf4j-1.7.24.jar (16.12 KB)
jedis-2.9.0.jar (540.78 KB)
jnp-client-4.2.1.GA.jar (31.67 KB)
joda-time-2.9.6.jar (617.46 KB)
json-simple-1.1.1-rave.jar (23.37 KB)
jstl-1.2.jar (404.53 KB)
jts-1.13.jar (776.35 KB)
log4j-1.2.17.jar (478.4 KB)
mail-1.4.jar (379.75 KB)
mediautil-1.0.jar (116.99 KB)
mysql-connector-java-5.1.44.jar (976.2 KB)
opencsv-1.8.jar (8.51 KB)
organizations-1.14.1.jar (78.94 KB)
passhash-1.0.4.jar (17.83 KB)
psap-1.15.2.jar (75.44 KB)
quartz-2.2.0.jar (638.43 KB)
rave-case-1.20.1.jar (353.64 KB)
rave-common-dao-1.4.1.jar (11.91 KB)
rave-common-geo-1.4.2.jar (38.66 KB)
rave-common-jobs-1.4.1.jar (48.43 KB)
rave-common-logging-1.4.1.jar (6.99 KB)
rave-common-monitor-1.3.3.jar (6.01 KB)
rave-common-msg-1.5.1.jar (127.75 KB)
rave-common-safelist-1.1.3.jar (12.72 KB)
rave-common-util-1.4.2.jar (93.18 KB)
rave-db-corporateprofile-1.22.1.jar (28.73 KB)
rave-messaging-brokerapi-1.30.0.jar (16.16 KB)
rave-messaging-carrierlookup-1.26.0.jar (11.57 KB)
rave-messaging-common-1.26.0.jar (51.94 KB)
sanselan-0.97-incubator.jar (494.08 KB)
sardine-5.7.jar (127.36 KB)
slf4j-api-1.7.24.jar (40.23 KB)
slf4j-log4j12-1.7.16.jar (9.7 KB)
spring-aop-4.3.7.RELEASE.jar (371.09 KB)
spring-beans-4.3.7.RELEASE.jar (744.87 KB)
spring-context-4.3.7.RELEASE.jar (1.08 MB)
spring-context-support-4.3.7.RELEASE.jar (182.7 KB)
spring-core-4.3.7.RELEASE.jar (1.06 MB)
spring-data-commons-1.13.1.RELEASE.jar (746.67 KB)
spring-data-keyvalue-1.2.1.RELEASE.jar (102.31 KB)
spring-data-redis-1.8.1.RELEASE.jar (1.16 MB)
spring-expression-4.3.7.RELEASE.jar (257.11 KB)
spring-jdbc-4.3.2.RELEASE.jar (416.34 KB)
spring-orm-4.3.2.RELEASE.jar (464.99 KB)
spring-oxm-4.3.7.RELEASE.jar (83.32 KB)
spring-security-acl-4.1.3.RELEASE.jar (82.99 KB)
spring-security-config-4.1.3.RELEASE.jar (532.7 KB)
spring-security-core-4.1.3.RELEASE.jar (367.36 KB)
spring-security-taglibs-4.1.3.RELEASE.jar (19.19 KB)
spring-security-web-4.1.3.RELEASE.jar (351.41 KB)
spring-session-1.3.0.RELEASE.jar (191.41 KB)
spring-session-data-redis-1.3.0.RELEASE.jar (261 B)
spring-tx-4.3.7.RELEASE.jar (260.85 KB)
spring-web-4.3.2.RELEASE.jar (792.65 KB)
spring-webmvc-4.3.2.RELEASE.jar (892.56 KB)
swiftreach1-1.1.jar (196.57 KB)
tika-core-1.14.jar (604.67 KB)
user-1.13.1.jar (45.85 KB)
utility-1.15.1.jar (74.71 KB)
validation-api-1.1.0.Final.jar (62.28 KB)
xercesImpl-2.8.1.jar (1.15 MB)
xml-apis-1.3.03.jar (190.54 KB)

HDIV stores its state data in the user session, so after the session is closed (by timeout or by user action) HDIV can no longer validate the request, and an INVALID_PAGE_ID entry is logged.
This behaviour is configurable by defining a specific landing page for these cases:
https://hdivsecurity.com/docs/installation/library-setup/#session-expiration

Related

Job submitted via Spark job server fails with error

I am using Spark Job Server to submit Spark jobs to a cluster. The application I am trying to test is a
Spark program based on Sansa query and the Sansa stack. Sansa is used for scalable processing of huge amounts of RDF data, and Sansa query is one of the Sansa libraries, used for querying RDF data.
When I run the application as a Spark program with the spark-submit command it works correctly as expected, but when I run it through Spark Job Server the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manger.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
// Imports were not shown in the original post; these are the usual Sansa/Jena/Spark imports for this code.
import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.query.spark.query._
import org.apache.jena.riot.Lang
import org.apache.spark.sql.SparkSession

object SansaQueryExampleWithoutSJS {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("sansa stack example").getOrCreate()
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = spark.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
  }
}
Code integrated with Spark Job Server:
object SansaQueryExample extends SparkSessionJob {
  override type JobData = Seq[String]
  override type JobOutput = collection.Map[String, Long]

  override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }

  override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = sparkSession.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
    sparkSession.sparkContext.parallelize(data).countByValue
  }
}
The steps for executing an application via Spark Job Server are explained here; mainly (a concrete sketch of these three calls follows at the end of this question):
upload the jar into SJS through a REST API
create a Spark context with memory and cores as required, through another API
execute the job via another API, mentioning the jar and the context already created
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently: the program works on a few occasions without any errors. I also observed that the SparkContext is being shut down for some unknown reason. I am using SJS 0.8.0, Sansa 0.7.1 and Spark 2.4.
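For reference, here is a minimal sketch of those three REST calls. The host, jar name, context name and settings are hypothetical placeholders; plain curl from a shell works just as well, the Scala wrapper below only keeps the example in one runnable file:
import scala.sys.process._

object SubmitViaSjs extends App {
  val sjs = "http://localhost:8090" // hypothetical SJS host

  // 1) Upload the application jar under an app name
  Seq("curl", "--data-binary", "@sansa-query-example.jar", s"$sjs/jars/sansa").!

  // 2) Create a context with the required cores and memory
  Seq("curl", "-d", "", s"$sjs/contexts/sansa-ctx?num-cpu-cores=4&memory-per-node=2g").!

  // 3) Run the job on that context (classPath must be the fully qualified object name)
  Seq("curl", "-d", "input.string = a b c",
    s"$sjs/jobs?appName=sansa&classPath=SansaQueryExample&context=sansa-ctx").!
}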

creating random list of names in spark/scala

I'm using Spark/Scala to generate about 10 million people in a CSV file on HDFS by randomly mixing two CSV files with first and last names, adding a random date of birth between 1920 and now, a creation date and a counter.
I ran into a bit of a problem using a for loop: everything works correctly, but the loop only runs on the driver, and while that is fine for 1 million rows, generating 10 million takes about 10 minutes longer. So I decided to create a range with 10 million items so I could use map and utilize the cluster. I have the following code:
package ebicus

import org.apache.spark._
import org.joda.time.{DateTime, Interval, LocalDateTime}
import org.joda.time.format.DateTimeFormat
import java.util.Random

object main_generator_spark {
  val conf = new SparkConf()
    .setAppName("Generator")
  val sc = new SparkContext(conf)
  val rs = Random

  val file = sc.textFile("hdfs://host:8020/user/firstname")
  val fnames = file.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val fnames_ar = fnames.collect()
  val fnames_size = fnames_ar.length
  val firstnames = sc.broadcast(fnames_ar)

  val file2 = sc.textFile("hdfs://host:8020/user/lastname")
  val lnames = file2.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val lnames_ar = lnames.collect()
  val lnames_size = lnames_ar.length
  val lastnames = sc.broadcast(lnames_ar)

  val range_val = sc.range(0, 100000000, 1, 20)

  val rddpersons = range_val.map(x =>
    (x.toString,
      new DateTime().toString("y-M-d::H:m:s"),
      fnames_ar(rs.nextInt(fnames_size)), // <-- Error at line 77
      lnames_ar(rs.nextInt(lnames_size)),
      makeGebDate
    )
  )

  def makeGebDate(): String = {
    lazy val start = new DateTime(1920, 1, 1, 0, 0, 0)
    lazy val end = new DateTime().minusYears(18)
    lazy val hours = (new Interval(start, end).toDurationMillis() / (1000 * 60 * 60)).toInt
    start.plusHours(rs.nextInt(hours)).toString("y-MM-dd")
  }

  def main(args: Array[String]): Unit = {
    rddpersons.saveAsTextFile("hdfs://hdfs://host:8020/user/output")
  }
}
The code works fine when I use the spark-shell, but when I try to run the script with spark-submit (I'm using Maven to build):
spark-submit --class ebicus.main_generator_spark --num-executors 16 --executor-cores 4 --executor-memory 2G --driver-cores 2 --driver-memory 10g /u01/stage/mvn_test-0.0.2.jar
I get the following error:
16/06/16 11:17:29 INFO DAGScheduler: Final stage: ResultStage 2(saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO DAGScheduler: Parents of final stage: List()
16/06/16 11:17:29 INFO DAGScheduler: Missing parents: List()
16/06/16 11:17:29 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93), which has no missing parents
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(140536) called with curMem=1326969, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 137.2 KB, free 5.2 GB)
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(48992) called with curMem=1467505, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 47.8 KB, free 5.2 GB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.29.7.4:51642 (size: 47.8 KB, free: 5.2 GB)
16/06/16 11:17:29 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/06/16 11:17:29 INFO DAGScheduler: Submitting 20 missing tasks from ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO YarnScheduler: Adding task set 2.0 with 20 tasks
16/06/16 11:17:29 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, cloudera-001.fusion.ebicus.com, partition 0,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, cloudera-003.fusion.ebicus.com, partition 1,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, cloudera-001.fusion.ebicus.com, partition 2,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, cloudera-003.fusion.ebicus.com, partition 3,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 12, cloudera-001.fusion.ebicus.com, partition 4,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 13, cloudera-003.fusion.ebicus.com, partition 5,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com, partition 6,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 7.0 in stage 2.0 (TID 15, cloudera-003.fusion.ebicus.com, partition 7,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-003.fusion.ebicus.com:52334 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-001.fusion.ebicus.com:53110 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 8.0 in stage 2.0 (TID 16, cloudera-001.fusion.ebicus.com, partition 8,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 9.0 in stage 2.0 (TID 17, cloudera-001.fusion.ebicus.com, partition 9,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com): java.lang.NoClassDefFoundError: Could not initialize class ebicus.main_generator_spark$
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:77)
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:74)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Am I making some kind of fundamental thinking error? I would be happy if someone could point me in the right direction.
Edit: I'm using Cloudera 5.6.0, Spark 1.5.0, Scala 2.10.6, YARN 2.10, joda-time 2.9.4
Edit2: Added conf & sc

Spark Job Aborts with File Not found error

I have written a Spark job (Spark 1.3, Cloudera 5.4) which loops through an Avro file and, for each record, issues a HiveContext query:
val path = "/user/foo/2016/03/07/ALL"
val fw2 = new FileWriter("/home.nfs/Foo/spark-query-result.txt", false)

val conf = new SparkConf().setAppName("App")
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val sqlSc = new SQLContext(sc)
import sqlSc.implicits._

val df = sqlSc.load(path, "com.databricks.spark.avro").cache()
val hc = new HiveContext(sc)

df.filter("fieldA = 'X'").select($"fieldA", $"fieldB", $"fieldC").rdd.toLocalIterator.filter(x => x(1) != null).foreach { x =>
  val query = s"select * from hive_table where fieldA = ${x(0)} and fieldB='${x(1)}' and fieldC=${x(2)}"
  val df1 = hc.sql(query)
  df1.rdd.toLocalIterator.foreach { r =>
    println(s"For ${x(0)} Found ${r(0)}\n")
    fw1.write(s"For ${x(0)} Found ${r(0)}\n")
  }
}
The job runs for 2 hours, but then aborts with the error
16/03/08 12:35:53 WARN TaskSetManager: Lost task 17.0 in stage 34315.0 (TID 82258, foo-cloudera04.foo.com): java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:794)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:833)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:897)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
....
16/03/08 12:35:53 INFO TaskSetManager: Starting task 0.0 in stage 34315.0 (TID 82260, foo-cloudera09.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 INFO TaskSetManager: Finished task 67.0 in stage 34314.0 (TID 82256) in 1298 ms on foo-cloudera09.foo.com (42/75)
16/03/08 12:35:53 INFO BlockManagerInfo: Added broadcast_12501_piece0 in memory on foo-cloudera09.foo.com:43893 (size: 6.5 KB, free: 522.8 MB)
16/03/08 12:35:53 INFO BlockManagerInfo: Added broadcast_12499_piece0 in memory on foo-cloudera09.foo.com:43893 (size: 44.2 KB, free: 522.7 MB)
16/03/08 12:35:53 INFO TaskSetManager: Starting task 17.1 in stage 34315.0 (TID 82261, foo-cloudera04.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 WARN TaskSetManager: Lost task 19.0 in stage 34315.0 (TID 82259, foo-cloudera04.foo.com): java.io.FileNotFoundException: /data/1/yarn/nm/usercache/Foo.Bar/appcache/application_1456200816465_188203/blockmgr-79a08609-56ae-490e-afc9-0f0143441a76/27/temp_shuffle_feb9ae13-6cb0-4a19-a60f-8c433f30e0e0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
16/03/08 12:35:53 INFO TaskSetManager: Starting task 19.1 in stage 34315.0 (TID 82262, foo-cloudera04.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 WARN TaskSetManager: Lost task 17.1 in stage 34315.0 (TID 82261, foo-cloudera04.foo.com): java.io.FileNotFoundException: /data/1/yarn/nm/usercache/Foo.Bar/appcache/application_1456200816465_188203/blockmgr-79a08609-56ae-490e-afc9-0f0143441a76/13/temp_shuffle_2f89df35-9e35-4558-a0f2-1f7353d3f9b0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)

How to resolve java.io.InvalidClassException (serialVersionUID mismatch) in Scala/Spark

I have been trying to resolve this exception for many days but couldn't fix it.
Details:
Using Scala, trying to connect to and retrieve data from Hive tables in another cluster
Using Scala version 2.10.4 on both my local machine and the Spark node
Using Spark 1.4.1 on both (added to the project as a Maven dependency) and
JVM 1.7 on both
Tried with the code below:
@SerialVersionUID(42L)
object CountWords extends Serializable {
  def main(args: Array[String]) {
    println("hello")
    val objConf = new SparkConf().setAppName("Spark Connection").setMaster("spark://IP:7077")
    var sc = new SparkContext(objConf)
    val objHiveContext = new HiveContext(sc)
    objHiveContext.sql("USE MyDatabaseName")
    val test = objHiveContext.sql("show tables")
    var i = 0
    var testing = test.collect()
    for (i <- 0 until testing.length) {
      println(testing(i))
    } // Able to display all tables
    var tableData = objHiveContext.sql("select * from accounts c")
    println(tableData) // Displaying all columns in table
    var rowData = tableData.collect() // Throwing Error
    println(rowData)
  }
}
Tried with various @SerialVersionUID(43L) values etc., but couldn't figure it out.
Error:
15/12/29 12:38:54 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/12/29 12:38:54 INFO FileInputFormat: Total input paths to process : 1
[accounts]
[TableName1]
[tc]
15/12/29 12:38:54 INFO ListSinkOperator: 0 finished. closing...
15/12/29 12:38:54 INFO ListSinkOperator: 0 Close done
15/12/29 12:38:54 INFO ParseDriver: Parsing command: select * from accounts c
15/12/29 12:38:54 INFO ParseDriver: Parse Completed
15/12/29 12:38:55 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/12/29 12:38:55 INFO MemoryStore: ensureFreeSpace(371192) called with curMem=0, maxMem=503379394
15/12/29 12:38:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 362.5 KB, free 479.7 MB)
15/12/29 12:38:55 INFO MemoryStore: ensureFreeSpace(32445) called with curMem=371192, maxMem=503379394
15/12/29 12:38:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.7 KB, free 479.7 MB)
15/12/29 12:38:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.101.215:64149 (size: 31.7 KB, free: 480.0 MB)
15/12/29 12:38:55 INFO SparkContext: Created broadcast 0 from toString at String.java:2847
SchemaRDD[6] at sql at CountWords.scala:66
== Query Plan ==
== Physical Plan ==
HiveTableScan [id#15L,name#16,number#17,address#18,city#19,state#20,zip#21,createdby#23,createddate#24,updatedby#25,updateddate#26,status#27,, (MetastoreRelation DbName, accounts, Some(c)), None
15/12/29 12:38:56 INFO MemoryStore: ensureFreeSpace(371192) called with curMem=403637, maxMem=503379394
15/12/29 12:38:56 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 362.5 KB, free 479.3 MB)
15/12/29 12:38:56 INFO MemoryStore: ensureFreeSpace(32445) called with curMem=774829, maxMem=503379394
15/12/29 12:38:56 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 31.7 KB, free 479.3 MB)
15/12/29 12:38:56 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.101.215:64149 (size: 31.7 KB, free: 480.0 MB)
15/12/29 12:38:56 INFO SparkContext: Created broadcast 1 from first at CountWords.scala:68
15/12/29 12:39:00 INFO FileInputFormat: Total input paths to process : 1
15/12/29 12:39:00 INFO SparkContext: Starting job: first at CountWords.scala:68
15/12/29 12:39:00 INFO DAGScheduler: Got job 0 (first at CountWords.scala:68) with 1 output partitions (allowLocal=false)
15/12/29 12:39:00 INFO DAGScheduler: Final stage: ResultStage 0(first at CountWords.scala:68)
15/12/29 12:39:00 INFO DAGScheduler: Parents of final stage: List()
15/12/29 12:39:00 INFO DAGScheduler: Missing parents: List()
15/12/29 12:39:00 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[11] at first at CountWords.scala:68), which has no missing parents
15/12/29 12:39:00 INFO MemoryStore: ensureFreeSpace(9104) called with curMem=807274, maxMem=503379394
15/12/29 12:39:00 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.9 KB, free 479.3 MB)
15/12/29 12:39:00 INFO MemoryStore: ensureFreeSpace(4500) called with curMem=816378, maxMem=503379394
15/12/29 12:39:00 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 479.3 MB)
15/12/29 12:39:00 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.16.101.215:64149 (size: 4.4 KB, free: 480.0 MB)
15/12/29 12:39:00 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/12/29 12:39:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[11] at first at CountWords.scala:68)
15/12/29 12:39:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/12/29 12:39:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.40.10.80, ANY, 1462 bytes)
15/12/29 12:39:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.40.10.80:52658 (size: 4.4 KB, free: 265.1 MB)
15/12/29 12:39:03 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.40.10.80): java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.AttributeReference; local class incompatible: stream classdesc serialVersionUID = -1645430348362071262, local class serialVersionUID = -5921417852507808552
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
Are you using two different versions of Spark (specifically of Spark SQL, since it complains about org.apache.spark.sql.catalyst.expressions.AttributeReference)?
See this answer.
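The usual remedy is to build against exactly the Spark version the cluster runs and to mark the Spark modules as provided, so the driver does not ship its own, differently-versioned copies. The original project uses Maven; the sketch below expresses the same idea as a minimal build.sbt, and the version string is an assumption to be replaced with whatever the cluster actually runs:
// Minimal build.sbt sketch: every Spark module pinned to one version and marked
// "provided" so only the cluster's own Spark jars end up on the runtime classpath.
val sparkVersion = "1.4.1" // assumption: must match the Spark version installed on the cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)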

Performance issue relating to joining big text files in local

I am new to Scala and Spark. I am trying to join two RDDs coming from two different text files. In each text file there are two columns separated by a tab, e.g.
text1:
100772C111 ion
100772C111 on
100772C111 n
text2:
200772C222 ion
200772C222 gon
200772C2 n
So I want to join these two files based on their second columns and get a result like the one below, meaning that there are 2 common terms for those two documents:
((100772C111-200772C222,2))
My computer's specs:
4 × Intel(R) Core(TM) i5-2430M CPU @ 2.40 GHz
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}

object hw {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)

    val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
      .map { line => val parts = line.split("\t"); (parts(5), parts(0)) }

    val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
      .map { line2 => val parts = line2.split("\t"); (parts(3), parts(1)) }

    val finalemp = emp_new.distinct().join(emp.distinct())
      .map { case ((nk1), ((parts1), (val1))) => (parts1 + "-" + val1, 1) }
      .reduceByKey((a, b) => a + b)

    finalemp.foreach(println)
  }
}
This code gives me what I want when I try it with smaller text files. However, I want to run this script on big text files: I have one text file with a size of 110 KB (approx. 4M rows) and another one of 9 GB (more than 1B rows).
When I run my script on these two text files, I observe the following on the log screen:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files locally? After waiting more than 3 hours, the program was still spilling data to disk.
To sum up, is there something I need to change in my code to cope with these performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with a very small memory allocation. It won't use as much memory as it can grab like, say, an RDBMS; you need to tell it how much you want it to use.
The default is (I believe) one executor per node and 512 MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
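Building on that: since the posted program hard-codes setMaster("local") (a single worker thread), the same settings can also be sketched in code. This is only an illustration; local[4] and the memory figures below are assumptions to tune, not recommendations:
import org.apache.spark.{SparkConf, SparkContext}

// Use all 4 local cores instead of one thread, and size executor memory
// explicitly (this matters when running against a real cluster rather than local mode).
// Note: spark.driver.memory only takes effect if it is set before the driver JVM starts,
// e.g. via spark-submit --driver-memory or spark-defaults.conf, not from SparkConf
// inside an already-running application.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[4]")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
On the command line, the equivalent for this local job would be roughly spark-submit --master local[4] --driver-memory 4g ..., with the numbers again being examples.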