set findFirstMatchIn to ignore line if match not found- spark - scala

set findFirstMatchIn to ignore line if match not found- spark - scala - scala

I am new to scala. I am trying to write a code to map parsed numbers in a series of xml files. My code works for a small RDD as below:
val myrdd = sc.parallelize(Array("FavoriteCount=\"23\" Score=\"43\"","FavoriteCount=\"12\" Score=\"32\"","FavoriteCount=\"32\" Score=\"2\""))
def successMatches(s: String): (String,Int) = {
val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
val scoreMatcher = """Score=\"(\d+)\"""".r
val fcount = fcountMatcher.findFirstMatchIn(s).get.group(1)
val score = scoreMatcher.findFirstMatchIn(s).get.group(1)
(fcount,score.toInt)
}
val myWords = myrdd.map(x => successMatches(x))
myWords.take(3)
myrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:29
successMatches: (s: String)(String, Int)
myWords: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
res1: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
for the actual xml RDD it returns error message as below:
val myWords = valid_lines.take(1).map(x => successMatches(x))
myWords.take(1)
ava.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at successMatches(<console>:52)
at $anonfun$1.apply(<console>:57)
at $anonfun$1.apply(<console>:57)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
... 42 elided
What am I missing?
This is what the first element of the xml RDD looks like:
valid_lines.take(1)
res51: Array[String] = Array(" <row AnswerCount="0" Body="<p>I'm having trouble with a basic machine learning methodology question. I understand the concept of not using the same data to both train and evaluate a classifier, and furthermore when there are parameters in an algorithm to be optimized, you should use an independent third test set to get the final reportable performance figures (e.g. recall rate). However, using a <em>single</em> test set to measure performance seems to be problematic because the measures of performance will likely differ depending on how the data is partitioned into training (plus validation) and test sets, especially for small datasets. It would be better to average the results of N different partitions.</p>
<p>For t...

Related

What is "WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)"?

I have a following syntax
val data = sc.textFile("log1.txt,log2.txt")
val s = Seq(data)
val par = sc.parallelize(s)
Result that i obtained is as follows:
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[2] at parallelize at :28
Question 1
How does a parallelCollection work?.
Question 2
Can I iterate through them and perform transformation?
Question 3
RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
What does this mean?

(An interesting case indeed)
When in doubt, I always recommend to follow the types in Scala (after all the types are why we, Scala developers, use the language in the first place, don't we?)
So, let's reveal the types:
scala> val data = sc.textFile("log1.txt,log2.txt")
data: org.apache.spark.rdd.RDD[String] = log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val s = Seq(data)
s: Seq[org.apache.spark.rdd.RDD[String]] = List(log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24)
scala> val par = sc.parallelize(s)
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[3] at parallelize at <console>:28
As you were told, org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] is not supported case in Spark (however it was indeed accepted by the Scala compiler since it matches the signature of SparkContext.parallelize method...unfortunately).
You don't really need val s = Seq(data) since the records in the two files log1.txt,log2.txt are already "inside" RDD and Spark will process them in distributed and parallel manner all records in all the two files (which I believe is your use case).
I do think I've answered all the three questions that I think are based on false expectations and hence they all are pretty much alike :)

In Spark-Scala, how to copy Array of Lists into DataFrame?

I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame which has structure desribed by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in array which I had pulled out of a DF:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How to copy data from my array into a DataFrame which has the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>

The first problem here is that you use List to store row data. List is a homogeneous data structure and since the only common type for Any (row(2)) and DenseVector is Any (Object) you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any this operation doesn't return any useful type and result couldn't be stored in a DataFrame without providing an explicit Encoder.
From the more Sparkish perspective it is not the good approach neither. collect-int just to transform data shouldn't require any comment and. mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch you can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or if you really want to handle this manually an UDF.
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement k-means method using scala.
I created a RDD something like that
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards

At least a couple of issues here:
No you really can not cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native spark sql type
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations. But KMeans expects a series of individual data points in the form a Vector (which encapsulates an Array[Double]) . So then - why are you supplying sum's and average's to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getDouble("colname") }
val vector = Vectors.toDense{ doubVals.collect}
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to Kmeans.

Reducing potentially empty RDD's

So I'm running into an issue where a filter I'm using on an RDD can potentially create an empty RDD. I feel that doing a count() in order to test for emptiness would be very expensive, and was wondering if there is a more performant way to handle this situation.
Here is an example of what this issue might look like:
val b:RDD[String] = sc.parallelize(Seq("a","ab","abc"))
println(b.filter(a => !a.contains("a")).reduce(_+_))
would give the result
empty collection
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
Does anyone have any suggestions for how I should go about addressing this edge case?

Consider .fold("")(_ + _) instead of .reduce(_ + _)

how about
scala> val b = sc.parallelize(Seq("a","ab","abc"))
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> b.isEmpty
res1: Boolean = false

Finding the average of a data set using Apache Spark

I am learning how to use Apache Spark and I am trying to get the average temperature from each hour from a data set. The data set that I am trying to use is from weather information stored in a csv. I am having trouble finding how to first read in the csv file and then calculating the average temperature for each hour.
From the spark documentation I am using the example Scala line to read in a file.
val textFile = sc.textFile("README.md")
I have given the link for the data file below. I am using the file called JCMB_2014.csv as it is the latest one with all months covered.
Weather Data
Edit:
The code I have tried so far is:
class SimpleCSVHeader(header:Array[String]) extends Serializable {
val index = header.zipWithIndex.toMap
def apply(array:Array[String], key:String):String = array(index(key))
}
val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header
val header = new SimpleCSVHeader(data.take(1)(0))
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time")
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toInt)

Here is sample code for calculating averages on hourly basis
Step1:Read file, Filter header,extract time and temp columns
scala> val hourlyTemps = lines.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))
(time/60) discards minutes and keeps only hours
Step2:Aggregate temperatures and no of occurrences
scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggreateTemps.take(1)
res26: Array[(Int, (Double, Int))] = Array((34,(8565.25,620)))
Step2:Calculate Averages using total and no of occurrences
Find the final result below.
val avgTemps=aggregateTemps.map(tuple=>(tuple._1,tuple._2._1/tuple._2._2))
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))

You may want to provide Structure definition of your CSV file and convert your RDD to DataFrame, like described in the documentation. Dataframes provide a whole set of useful predefined statistic functions as well as the possibility to write some simple custom functions. You then will be able to compute the average with:
dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

set findFirstMatchIn to ignore line if match not found- spark - scala - scala

Related

What is "WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)"?

In Spark-Scala, how to copy Array of Lists into DataFrame?

How to convert RDD[Row] to RDD[Vector]

Reducing potentially empty RDD's

Finding the average of a data set using Apache Spark

Categories

Resources