Error when extracting features (Spark) - Scala

I encountered some problems when I tried to extract features from raw data.
Here is my data:
25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
and here is my code:
val rawData = sc.textFile("data/myData.data")
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
Here is the error info:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 3, localhost): java.lang.ArrayIndexOutOfBoundsException: 1
I want to extract the second column as a categorical feature, but it seems the column cannot be read, which leads to the ArrayIndexOutOfBoundsException.
I tried many times but still cannot solve the problem.
val categoriesMap1 = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
val labelpointRDD = lines.map { fields =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap1.size)
  val categoryIdx1 = categoriesMap1(fields(1))
  categoryFeaturesArray1(categoryIdx1) = 1
}

Your code works for the example you supplied - which means it's fine for "valid" rows - but your input probably contains some invalid rows - in this case, rows with no commas.
You can either clean your data or improve the code to handle these rows more gracefully, for example using some default value for bad rows:
val rawData = sc.parallelize(Seq(
  "25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0",
  "BAD LINE"
))
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map {
  case Array(_, s, _*) => s // for arrays with 2 or more items - use the 2nd
  case _ => "UNKNOWN"       // default for bad rows
}.distinct().collect().zipWithIndex.toMap

println(categoriesMap) // prints Map(UNKNOWN -> 0, Private -> 1)
UPDATE: per updated question - assuming these rows are indeed invalid, you can just skip them entirely, both when extracting the categories map and when mapping to labeled points:
import org.apache.spark.rdd.RDD

val secondColumn: RDD[String] = lines.collect {
  case Array(_, s, _*) => s // for arrays with 2 or more items - use the 2nd
  // shorter arrays (bad records) simply don't match and are skipped
}

val categoriesMap = secondColumn.distinct().collect().zipWithIndex.toMap

val labelpointRDD = secondColumn.map { field =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap.size)
  val categoryIdx1 = categoriesMap(field)
  categoryFeaturesArray1(categoryIdx1) = 1
  categoryFeaturesArray1
}
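If the end goal is an MLlib input, here is a minimal sketch of how the one-hot feature could be turned into a LabeledPoint; the assumption that the last column holds a numeric label is mine, not part of the original question or answer:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch only: keep rows with at least two fields, one-hot encode the 2nd column,
// and use the last column as the label (assumption: it is numeric).
val labeledPoints = lines.collect {
  case fields @ Array(_, category, _*) if categoriesMap.contains(category) =>
    val features = Array.ofDim[Double](categoriesMap.size)
    features(categoriesMap(category)) = 1.0
    LabeledPoint(fields.last.toDouble, Vectors.dense(features))
}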

Related

UDF to extract String in scala

I'm trying to extract the last number from this kind of value:
urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)
In this example I'm trying to extract 10342800535 as a string.
This is my code in Scala:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}
This is run as a UDF and it throws the following error:
org.apache.spark.SparkException: Failed to execute user defined function
What am I doing wrong?
You can simply use the regexp_extract function. Check this:
import org.apache.spark.sql.functions.{col, regexp_extract}

val df = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("x")
df.show(false)
+-------------------------------------------------------------------+
|x |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+
df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x |NestedUrn |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
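As an aside (not from the original answers), the same extraction can be done in plain Scala with an anchored regex, in case you need it outside a DataFrame; the helper name below is hypothetical and shown only for illustration:
// Sketch: one capturing group for the trailing digits before the closing ')'
val NestedUrn = """.*,(\d+)\)""".r

def lastNumber(urn: String): String = urn match {
  case NestedUrn(id) => id
  case _             => "" // no match
}

lastNumber("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")
// res: String = 10342800535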
One reason the org.apache.spark.SparkException: Failed to execute user defined function exception is raised is that an exception was thrown inside your user-defined function.
Analysis
If I try to run your user defined function with the example input you provided, using the code below:
import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._

val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")

def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}

val extract_urn = udf(extractNestedUrn _)

dataframe.select(extract_urn(col("urn"))).show(false)
I get this complete stack trace:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
at UdfExtractionError$.main(UdfExtractionError.scala:37)
at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 86 more
The important part of this stack trace is actually:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
This is the exception raised while executing your user-defined function code. If we analyse your function, you split the input twice on :. The result of the first split is actually this array:
["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
and not this array:
["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]
So, if we execute the remaining statements of your function, you get:
val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "urn"
val subarr = ["urn"]
Since the next line accesses the fourth element of the array subarr, which contains only one element, an ArrayIndexOutOfBoundsException is raised, and Spark then wraps it in a SparkException.
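You can verify that first split quickly in a REPL (a small sketch):
"urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)".split(":").length
// res: Int = 7  (not 4, because the parenthesized part itself contains more ':' characters)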
Solution
Although the best solution to your problem is obviously the previous answer with regexp_extract, you can correct your user-defined function as follows:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':') // split using a character instead of a string regexp
  val nested = arr.last    // get the last element of the array, here "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last    // get the last element, here "10342800535)"
  val out = res.init       // take all of the string except the last character, to remove ')'
  out                      // no need to use .toString as out is already a String
}
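As a quick sanity check, calling the corrected function on the example input returns the expected id (REPL-style sketch):
extractNestedUrn("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")
// res: String = 10342800535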
However, as said before, the best solution is to use Spark's built-in function regexp_extract, as explained in the first answer. Your code will be easier to understand and more performant.

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see exactly where this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node. Essentially, they are in charge of distributing your tasks.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
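For example, here is a rough sketch of what that rewrite could look like using json4s (which ships with Spark); the field names and merge behaviour are assumptions rather than tested code:
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Sketch: parse the 4th column with a plain Scala JSON library and merge the first
// three columns into each object, so no sqlContext is needed inside the map.
def convert(inString: Array[String]): String = {
  val extra: JValue = JObject(
    "country"   -> JString(inString(0)),
    "seller_id" -> JString(inString(1)),
    "item_id"   -> JString(inString(2)))
  parse(inString(3)) match {
    case JArray(objs) => compact(render(JArray(objs.map(_ merge extra))))
    case other        => compact(render(other merge extra))
  }
}

val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(convert)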

How to group by a select number of fields in an RDD looking for duplicates based on those fields

I am new to Scala and Spark. I am working in the Spark Shell.
I need to group by and sort on the first three fields of this file, looking for duplicates. If I find duplicates within a group, I need to append a counter to the third field, starting at "1" and incrementing by "1" for each record in the duplicate group, and resetting the counter back to "1" when reading a new group. When no duplicates are found, I just append the counter, which would be "1".
The CSV file contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")
val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(","))
If I apply the logic properly on the above example, the resulting RDD of recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")
How about grouping the data, changing it, and putting it back:
val csv = sc.parallelize(List(
  "00111,00111651,4444,PY,MA",
  "00111,00111651,4444,XX,MA",
  "00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line => (line(0), line(1), line(2)))
val numbered = grouped.mapValues(dataList =>
  dataList.zipWithIndex.map { case (data, idx) =>
    data match {
      case Array(fst, scd, thd, rest @ _*) => Array(fst, scd, thd + (idx + 1)) ++ rest
    }
  }
)
numbered.flatMap { case (key, values) => values }
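To inspect the result of this small sample you can collect it back to the driver (a quick sketch; group order in the output may vary):
numbered.flatMap { case (_, values) => values }
  .collect()
  .foreach(arr => println(arr.mkString(",")))
// 00111,00111651,44441,PY,MA
// 00111,00111651,44442,XX,MA
// 00112,00112P11,55551,TA,MA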
Also grouping the data, changing it, putting it back.
val lists = List(
  ("00111", "00111651", "4444", "PY", "MA"),
  ("00111", "00111651", "4444", "XX", "MA"),
  ("00112", "00112P11", "5555", "TA", "MA"))
val grouped = lists.groupBy { case (a, b, c, d, e) => (a, b, c) }
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case ((a, b, c, d, e), idx) => (a, b, c + (idx + 1).toString, d, e) })
val unwrapped = indexed.flatMap(_._2)
// List((00112,00112P11,55551,TA,MA),
//      (00111,00111651,44442,XX,MA),
//      (00111,00111651,44441,PY,MA))
Version working on Arrays (of arbitrary length >= 3):
val lists = List(
  Array("00111", "00111651", "4444", "PY", "MA"),
  Array("00111", "00111651", "4444", "XX", "MA"),
  Array("00112", "00112P11", "5555", "TA", "MA"))
val grouped = lists.groupBy(_.take(3).toList) // toList: Array keys compare by reference, List keys by value
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case (Array(a, b, c, rest @ _*), idx) => Array(a, b, c + (idx + 1).toString) ++ rest })
val unwrapped = indexed.flatMap(_._2)
// List(Array(00112, 00112P11, 55551, TA, MA),
//      Array(00111, 00111651, 44441, PY, MA),
//      Array(00111, 00111651, 44442, XX, MA))

Task not serializable Flink

I am trying to run the PageRank basic example in Flink with a little bit of modification (only in reading the input file; everything else is the same). I am getting the error "Task not serializable", and below is part of the error output:
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {

  def main(args: Array[String]) {

    val env = ExecutionEnvironment.getExecutionEnvironment

    val maxIterations = 10000
    val DAMPENING_FACTOR: Double = 0.85
    val EPSILON: Double = 0.0001

    val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"

    val links = env.readCsvFile[Tuple2[Long, Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1, 4)).as('sourceId, 'targetId).toDataSet[Link] // source and target

    val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id] // pageId

    val noOfPages = pages.count()

    val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))

    val adjacencyLists = links
      // initialize lists: ._1 is the source id and ._2 is the target id
      .map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
      // concatenate lists
      .groupBy("sourceId").reduce {
        (l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
      }

    // start iteration
    val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
      // ** the output shows the error here **
      currentRanks =>
        val newRanks = currentRanks
          // distribute ranks to target pages
          .join(adjacencyLists).where("pageId").equalTo("sourceId") {
            (page, adjacent, out: Collector[Page]) =>
              for (targetId <- adjacent.targetIds) {
                out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
              }
          }
          // collect ranks and sum them up
          .groupBy("pageId").aggregate(SUM, "rank")
          // apply dampening factor
          // ** the output shows the error here **
          .map { p =>
            Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
          }

        // terminate if no rank update was significant
        val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
          (current, next, out: Collector[Int]) =>
            // check for significant update
            if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
        }

        (newRanks, termination)
    }

    val result = finalRanks

    // emit result
    result.writeAsCsv(outpath, "\n", " ")

    env.execute()
  }
}
Any help in the right direction is highly appreciated. Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to compute the count once on the driver, assign the result to a variable (val pagesCount = pages.count), and refer to that variable in your MapFunction.
What pages.count actually does is trigger the execution of the data flow graph so that the number of elements in pages can be counted; the result is then returned to your program.
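A sketch of that change: the question already computes such a value (noOfPages) on the driver before the iteration, so the map inside the iteration can simply reference it instead of calling pages.count() (only the relevant lines are shown):
// before the iteration (already in the question):
val noOfPages = pages.count()

// inside the iteration, refer to the captured Long instead of the DataSet:
.map { p =>
  Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / noOfPages))
}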

Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark.
I keep getting a scala.MatchError (full gist here):
scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:66)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:86)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:79)
at scala.Option.map(Option.scala:145)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:79)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:78)
The code (GitHub source here):
val usersWithCounts =
  ratingsRDD
    .map(r => (r.user, (1, Seq[Rating](Rating(r.user, r.item, r.rating)))))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2.union(v2._2)))
    .filter(_._2._1 >= evalK)

// create evalK folds of ratings
(0 until evalK).map { idx =>
  // start by getting this fold's ratings for each user
  val fold = usersWithCounts
    .map { userKV =>
      val userRatings = userKV._2._2.zipWithIndex
      val trainingRatings = userRatings.filter(_._2 % evalK != idx).map(_._1)
      val testingRatings = userRatings.filter(_._2 % evalK == idx).map(_._1)
      (trainingRatings, testingRatings) // split the user's ratings into a training set and a testing set
    }
    .reduce((l, r) => (l._1.union(r._1), l._2.union(r._2))) // merge all the testing and training sets into a single testing and training set

  val testingSet = fold._2.map {
    r => (new Query(r.user, r.item), new ActualResult(r.rating))
  }

  (
    new TrainingData(sc.parallelize(fold._1)),
    new EmptyEvaluationInfo(),
    sc.parallelize(testingSet)
  )
}
In order to do evaluation I need to split the ratings into a training group and a testing group. To make sure each user is included in the training set, I group all of each user's ratings together, do the split per user, and then join the splits back together.
Maybe there's a better way to do this?
The error means that the userFeatures of the MLlib MatrixFactorizationModel doesn't contain the user id (for example, if the user is not in the training data). MLlib doesn't check for this after the lookup (.head is used):
https://github.com/apache/spark/blob/v1.2.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L66
To check whether this is the case, you can implement a modified version of model.predict() that checks whether the userId/itemId exists in the model, instead of calling the default one:
val itemScore = model.predict(userInt, itemInt)
(https://github.com/nickpoorman/template-scala-parallel-prediction/blob/master/src/main/scala/ALSAlgorithm.scala#L80):
Change to use .headOption:
val itemScore = model.userFeatures.lookup(userInt).headOption.map { userFeature =>
  model.productFeatures.lookup(itemInt).headOption.map { productFeature =>
    val userVector = new DoubleMatrix(userFeature)
    val productVector = new DoubleMatrix(productFeature)
    userVector.dot(productVector)
  }.getOrElse {
    logger.info(s"No itemFeature for item ${query.item}.")
    0.0 // return default score
  }
}.getOrElse {
  logger.info(s"No userFeature for user ${query.user}.")
  0.0 // return default score
}