How to make distinct tuples in Scala

I have an RDD and I want to create a new RDD with unique values, but I get an error.
The code:
val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/archivosOrigen").map( _.split(",", -1) match {
  case Array(caso, canal, lote, estado, estadoo, estadooo, fechacreacioncaso, fechacierrecaso, username, clientid, nombre, apellido, ani, email) =>
    (canal, username, ani, email)
}).distinct

val twtface = rdd.map {
  case (canal, username, ani, email) =>
    val campoAni = "ANI"
    (campoAni, ani, canal, username)
}.distinct()
twtface.take(3).foreach(println)
This is the CSV file
caso2,canal2,lote,estado3,estado4,estado5,fechacreacioncaso2,fechacierrecaso2,username,clientid,nombre,apellido,ani,email
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm#test.com
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm#test.com
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
Error:
scala.MatchError: [Ljava.lang.String;@dea08cc (of class [Ljava.lang.String;)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I assume the error is due to a missing or additional newline in your CSV file.
Your split and match assume that every line of the CSV has exactly 14 fields. Depending on the encoding or the text editor you use, you may have extra newlines at the end of the document, which produce lines that split into fewer than 14 fields.
My suggestion would be to validate each line and add a catch-all case that gives you a more detailed error message; that way you avoid the ambiguous MatchError.
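One way to do that (a sketch only; here malformed lines are logged and skipped, but you could just as well throw an exception with a descriptive message):

val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/archivosOrigen")
  .map(_.split(",", -1))
  .flatMap {
    case Array(caso, canal, lote, estado, estadoo, estadooo, fechacreacioncaso, fechacierrecaso, username, clientid, nombre, apellido, ani, email) =>
      Some((canal, username, ani, email))
    case badLine =>
      // Catch-all case: report how many fields the line actually had instead of failing with a bare MatchError
      println(s"Skipping malformed line with ${badLine.length} fields: ${badLine.mkString(",")}")
      None
  }
  .distinct()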

Related

spark-csv fails parsing with embedded html and quotes

I have this CSV file, which contains descriptions of several cities:
Cities_information_extract.csv
I can parse this file just fine using Python's pandas.read_csv or R's read.csv. Both return 693 rows and 25 columns.
I am trying, unsuccessfully, to load the CSV using Spark 1.6.0 and Scala.
For this I am using spark-csv and commons-csv (which I have included in the Spark jars path).
This is what I have tried:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
Then I tried using the univocity parser:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Inspecting the file, I noticed several HTML tags in the description fields, with embedded quotes, like:
<div id="someID">
I tried, using Python, to remove all HTML tags with a regular expression:
import io
import re

pattern = re.compile("<[^>]*>")  # find all html tags <..>

with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
    text = infile.read()

text = re.sub(pattern, " ", text)

with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)
Next, I tried again with the new file, without the HTML tags:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]
And with the univocity parser:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Both Python and R are able to parse both files correctly, while spark-csv still fails. Any suggestions for correctly parsing this CSV file using Spark and Scala?

How to write to HDFS using the Spark programming API if I have authentication details?

I need to write to an external HDFS cluster whose authentication details are available for both simple and Kerberos authentication. For the sake of simplicity, let's assume we are dealing with simple authentication.
This is what I have:
External HDFS cluster connection details (host, port)
Authentication details (user for simple auth)
HDFS location where files need to be written (hdfs://host:port/loc)
Also, other details like format, etc.
Please note that the Spark user is not the same as the user specified for HDFS auth.
Now, using the spark programming API, this is what I am trying to do:
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", fileSystemPath)
hadoopConf.set("hadoop.job.ugi", userName)

val jConf = new JobConf(hadoopConf)
jConf.setUser(user)
jConf.set("user.name", user)
jConf.setOutputKeyClass(classOf[NullWritable])
jConf.setOutputValueClass(classOf[Text])
jConf.setOutputFormat(classOf[TextOutputFormat[NullWritable, Text]])

outputDStream.foreachRDD(r => {
  val rdd = r.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      println(x.toString)
      (NullWritable.get(), text)
    }
  }
  val rddCount = rdd.count()
  if (rddCount > 0) {
    rdd.saveAsHadoopFile(config.outputPath, classOf[NullWritable], classOf[Text], classOf[TextOutputFormat[NullWritable, Text]], jConf)
  }
})
Here, I was assuming that if we pass a JobConf with the correct details, it would be used for authentication and the write would be done as the user specified in the JobConf.
However, the write still happens as the Spark user ("root"), irrespective of the auth details present in the JobConf ("hdfs" as the user). Below is the exception that I get:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/spark-deploy/out/_temporary/0":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1698)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1682)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1665)
at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:71)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3900)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:978)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy40.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:558)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy41.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3000)
... 45 more
Please let me know if there are any suggestions.
This is probably more a comment than an answer, but as it is too long I'll put it here. I haven't tried this because I have no environment to test it in. Please try it and let me know if it works (and if it doesn't, I'll remove this answer).
Looking a bit into the code, it looks like DFSClient creates a proxy using createProxyWithClientProtocol, which uses UserGroupInformation.getCurrentUser() (I haven't traced the createHAProxy branch down, but I suspect the same logic applies there). This info is then sent to the server for authentication.
That means you need to change what UserGroupInformation.getCurrentUser() returns in the context of your particular call. This is what UserGroupInformation.doAs is supposed to do, so you just need a proper UserGroupInformation instance. In the case of simple authentication, UserGroupInformation.createRemoteUser might actually work.
So I suggest trying something like this:
...
val rddCount = rdd.count()
if (rddCount > 0) {
  val remoteUgi = UserGroupInformation.createRemoteUser("hdfsUserName")
  // doAs takes a java.security.PrivilegedExceptionAction, so wrap the write in one
  remoteUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit =
      rdd.saveAsHadoopFile(config.outputPath, classOf[NullWritable], classOf[Text], classOf[TextOutputFormat[NullWritable, Text]], jConf)
  })
}

How to use Spark BigQuery Connector locally?

For testing purposes, I would like to use the BigQuery connector to write Parquet Avro logs to BigQuery. As of this writing, there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do so.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
  readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")

// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)

events
  .mapPartitions { items =>
    val gson = new Gson()
    items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)
As the BigQueryOutputFormat can't find the Google credentials, it falls back on the metadata host to try to discover them, with the following stack trace:
016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
That is of course expected, but it should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns the appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Since the connector seems to read credentials from the Hadoop configuration, which keys should I set so that it picks up GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?
If I understand your question correctly - you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, the mapred.bq.auth.service.account.keyfile should point to the full file path to the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, you should replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
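For example, these can be set on the job's Hadoop configuration directly from Scala (a minimal sketch; the keyfile path, project id and bucket are placeholders for your own values):

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.auth.service.account.enable", "true")
conf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/service-account.json")
conf.set("mapred.bq.project.id", "myproject")
conf.set("mapred.bq.gcs.bucket", "my-gcs-bucket")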
Also, you might want to take a look at https://github.com/spotify/spark-bigquery - which is a much more civilised way of working with BigQuery and Spark. The setGcpJsonKeyFile method used in that case takes the same JSON file you'd set for mapred.bq.auth.service.account.json.keyfile if using the BQ connector for Hadoop.
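With that library the keyfile is set in code; roughly like this (a sketch based on its README at the time, assuming the implicit helpers it adds to SQLContext - check the project's current docs):

import com.spotify.spark.bigquery._

// Point the connector at the same JSON service-account keyfile
sqlContext.setGcpJsonKeyFile("/path/to/service-account.json")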

OutOfMemoryError: Java heap space and memory variables in Spark

I have been trying to execute a Scala program, and the output always ends up being something like this:
15/08/17 14:13:14 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:339)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:169)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
or like this
15/08/19 11:45:11 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:526)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:505)
at com.fasterxml.jackson.databind.ObjectMapper._serializerProvider(ObjectMapper.java:2846)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
Are these errors on the driver or executor side?
I am a bit confused by the memory variables that Spark uses. My current settings are:
spark-env.sh
export SPARK_WORKER_MEMORY=6G
export SPARK_DRIVER_MEMORY=6G
export SPARK_EXECUTOR_MEMORY=4G
spark-defaults.conf
# spark.driver.memory 6G
# spark.executor.memory 4G
# spark.executor.extraJavaOptions ' -Xms5G -Xmx5G '
# spark.driver.extraJavaOptions ' -Xms5G -Xmx5G '
Do I need to uncomment any of the variables contained in spark-defaults.conf, or are they redundant?
Is, for example, setting SPARK_WORKER_MEMORY equivalent to setting spark.executor.memory?
Part of my Scala code, where it stops after a few iterations:
val filteredNodesGroups = connCompGraph.vertices.map { case (_, array) => array(pagerankIndex) }.distinct.collect
for (id <- filteredNodesGroups) {
  val clusterGraph = connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id)
  val pagerankGraph = clusterGraph.pageRank(0.15)
  val completeClusterPagerankGraph = clusterGraph.outerJoinVertices(pagerankGraph.vertices) {
    case (uid, attrList, Some(pr)) =>
      attrList :+ ("inClusterPagerank:" + pr)
    case (uid, attrList, None) =>
      attrList :+ ""
  }
  val sortedClusterNodes = completeClusterPagerankGraph.vertices.toArray.sortBy(_._2(pagerankIndex + 1))
  println(sortedClusterNodes(0)._2(1) + " with rank: " + sortedClusterNodes(0)._2(pagerankIndex + 1))
}
Many questions disguised as one. Thank you in advance!
I'm not a Spark expert, but there is a line that seems suspicious to me:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
Basically, by using the collect method, you are pulling all the data from your executors back to the driver (before even processing it). Do you have any idea of the size of this data?
In order to fix this, you should proceed in a more functional way. To extract the distinct values, you could for example use a groupBy and a map:
val pairs = connCompGraph.vertices.map { case (_, array) => array(pagerankIndex) }
pairs.groupBy(_./* the property to group on */)
  .map { case (_, arrays) => /* map function */ }
Regarding the collect, there should be a way to sort each partition and then return the (processed) result to the driver. I would like to help you more, but I need more information about what you are trying to do.
UPDATE
After digging a little bit, you could sort your data using shuffling as described here
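As a rough illustration of that idea (reusing the names from the loop in the question; untested against your actual graph types), the sort can be done on the cluster and only the top entry brought back to the driver:

// sortBy shuffles and sorts across the cluster; take(1) brings back a single row
val top = completeClusterPagerankGraph.vertices
  .sortBy { case (_, attrList) => attrList(pagerankIndex + 1) }
  .take(1)
top.foreach { case (_, attrList) =>
  println(attrList(1) + " with rank: " + attrList(pagerankIndex + 1))
}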
UPDATE
So far, I've tried to avoid the collect and to avoid bringing the data back to the driver as much as possible, but I have no idea how to solve this:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct()
val clusterGraphs = filteredNodesGroups.map { id => connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id) }
val pageRankGraphs = clusterGraphs.map(_.pageRank(0.15))
Basically, you need to join two RDD[Graph[Array[String], String]], but I don't know what key to use; and secondly, this would necessarily return an RDD of RDDs (I don't know if you can even do that). I'll try to find something later today.

Scala - Exception in thread "main" java.lang.NumberFormatException

I have seen other threads about this, but none of them answered my question. I think that in my case the problem is not like the others.
Here is a link to the problem I tried to solve: http://projecteuler.net/problem=55
I am sure the algorithm is correct, but I suspect there is something going on with what Scala can convert to a string. I don't know. Here is my code:
package projecteuler55

object PE55 extends App {

  def rev(x: BigInt): BigInt = {
    x.toString.reverse.toInt
  }

  def notPalindrome(x: BigInt): Boolean = {
    if (x != rev(x)) true else false
  }

  def test(x: BigInt, steps: Int): Boolean = {
    if (steps > 50) true
    else if (notPalindrome(x) == false) false
    else test(x + rev(x), steps + 1)
  }

  var lychrel = 0
  for (i <- 10 until 10000) {
    if (test(i + rev(i), 0)) lychrel += 1
  }
  println(lychrel)
}
and the error I get is this:
Exception in thread "main" java.lang.NumberFormatException: For input string: "2284457131"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:495)
at java.lang.Integer.parseInt(Integer.java:527)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at projecteuler55.PE55$.rev(PE55.scala:8)
at projecteuler55.PE55$.notPalindrome(PE55.scala:14)
at projecteuler55.PE55$.test(PE55.scala:20)
at projecteuler55.PE55$$anonfun$1.apply$mcVI$sp(PE55.scala:26)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at projecteuler55.PE55$delayedInit$body.apply(PE55.scala:25)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at projecteuler55.PE55$.main(PE55.scala:3)
at projecteuler55.PE55.main(PE55.scala)
Why does this happen? I also used BigInt in case of an overflow, but this does not seem to help.
Thanks in advance
toInt expects a string that can be parsed into a regular Int, and 2284457131 doesn't fit (it is larger than Int.MaxValue, 2147483647). You want to use BigInt(x.toString.reverse).
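A minimal sketch of rev with that fix applied (the rest of the program stays the same):

def rev(x: BigInt): BigInt = {
  // Build a BigInt from the reversed digits instead of going through Int,
  // so values above Int.MaxValue (such as 2284457131) no longer throw
  BigInt(x.toString.reverse)
}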