Scala is being used with Hadoop
I have performed a map-reduce job on a text file stored in HDFS (Hadoop). The file is large, so I have attempted to extract only the important rows: the five most used words in the file.
To do this, I used the .take(n) method to extract the required elements. However, an error is raised when I try to save the result to a text file. I have tried saving the file in a number of ways:
Method 1
val path = "Books/"+language+"/*"
val textFile = sc.textFile(path)
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect()
val sortedCounts = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
sortedCounts.collect()
val check = sortedCounts.take(5)
check.foreach(d => Files.write(Paths.get(language), (d._1 + " " + d._2 + "\n").getBytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND))
Error
[info] Compiling 1 Scala source to /home/cloudera/Assessment 2/target/scala-2.10/classes ...
[error] /home/cloudera/Assessment 2/src/main/scala/task2.scala:27:22: not found: value Files
[error] check.foreach(d => Files.write(Paths.get(language), (d._1).getBytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND))
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
Method 2
val path = "Books/"+language+"/*"
val textFile = sc.textFile(path)
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect()
val sortedCounts = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
sortedCounts.collect()
val check = sortedCounts.take(5)
check.saveAsTextFile(language)
Error
[info] Compiling 1 Scala source to /home/cloudera/Assessment 2/target/scala-2.10/classes ...
[error] /home/cloudera/Assessment 2/src/main/scala/task2.scala:27:9: value saveAsTextFile is not a member of Array[(Int, String)]
[error] check.saveAsTextFile(language)
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 11 s, completed Nov 26, 2019 6:20:11 AM
Important
The file saves correctly before the .take() method is used. When sortedCounts is used with saveAsTextFile(x), the entire map is saved, which, as mentioned above, is not what I want.
How does one save an Array[(Int, String)] to a text file on Hadoop via Scala?
You can't save to Hadoop after you take or collect an RDD, because at that point the result is a local Scala data structure that is no longer managed by Spark.
You need to call saveAsTextFile on sortedCounts itself, and you should remove the counts.collect() call.
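If you really only want those five rows in HDFS, one option is to parallelize the taken array back into a small RDD and save that (a minimal sketch, assuming sc, sortedCounts and language are in scope as in the question):
// Sketch: write only the top five (count, word) pairs back to HDFS.
val topFive = sortedCounts.take(5)               // Array[(Int, String)] on the driver
sc.parallelize(topFive.toSeq, 1)                 // back into a single-partition RDD
  .map { case (count, word) => s"$count $word" }
  .saveAsTextFile(language)                      // writes a directory named after language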
Related
I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map(_._2)
    .map { events =>
      val p_id = events.map(_.patientID).take(1).mkString
      val f_code = events.map(_.code).take(1).mkString
      val count = events.size.toDouble
      ((p_id, f_code), count)
    }
  //should be in form:
  //diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
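For what it's worth, the Unit comes from the fact that the function body ends with a val assignment (followed only by comments), so nothing of type RDD[FeatureTuple] is ever returned. A minimal sketch of one possible fix, assuming patientID and code are Strings as the FeatureTuple expansion ((String, String), Double) suggests:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map { case ((patientID, code), events) =>
      ((patientID, code), events.size.toDouble)  // count of events per (patient, code) pair
    }                                            // this RDD is the last expression, so it is returned
}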
I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
And I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        values(0).toLong,
        //util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    }
    .filter(x => x._1 > 0)
}
However, this code cannot be compiled:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove the .toSparse or the .filter(x => x._1 > 0), the code compiles successfully.
Does someone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the non-numeric id lines?
The code compiles successfully if you remove .toSparse because then the element type of your pair RDD is (ItemId, Vector), which matches the declared return type.
Vectors.dense returns a value statically typed as org.apache.spark.ml.linalg.Vector, but calling .toSparse on it gives an org.apache.spark.ml.linalg.SparseVector, and because RDD is invariant in its type parameter, an RDD[(Long, SparseVector)] is not an RDD[(Long, Vector)].
As for filtering out non-integer IDs, I would say your method is a good way to do that.
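A minimal sketch of how to keep .toSparse and still compile, combining an explicit Vector annotation with the question's own commented-out Try idea (assuming the imports from the question and that ItemId is an alias for Long):
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      // Annotate as Vector so the RDD's element type matches the declared return type,
      // even though the runtime value is a SparseVector.
      val vec: Vector = Vectors.dense(values.tail.map(_.toDouble)).toSparse
      (util.Try(values(0).toLong).getOrElse(0L), vec)  // non-numeric ids become 0L ...
    }
    .filter(_._1 > 0)                                  // ... and are dropped here
}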
I'm saving each event to a text file as follows:
map{ case (_, record) => getEventFromRecord(record) }.map(m => m.toByteArray).saveAsTextFile(outputPath)
I also want to save the total size of the events I'm writing to the text file.
1) How can I save the total size of the records to a new file?
2) I tried using an accumulator:
val accum = sparkContext.accumulator(0, "My Accumulator")
map{ case (_, record) => getEventFromRecord(record) }.foreach(m => accum += (m.toByteArray.length)).saveAsTextFile(outputPath)
But I get the following error:
value saveAsTextFile is not a member of Unit
[error] sparkContext.sequenceFile(inputDirectory, classOf[IntWritable], classOf[DataOutputValue]).map{ case (_, record) => getEventFromRecord(record) }.foreach(m => accum += (m.toByteArray.length)).saveAsTextFile(outputPath)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
The foreach action returns Unit and is only used for side effects. If you want to compute the sum over your RDD, use the reduce action:
val totalSize = map{ case (_, record) => getEventFromRecord(record).toByteArray.length}.reduce{_ + _}
This will return the result of the summation on the driver. You can then use the Hadoop FileSystem API to create a new file and write to it:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new Configuration())
val outputWriter = new PrintWriter(fs.create(outputPath))  // outputPath: org.apache.hadoop.fs.Path
outputWriter.println(totalSize)
outputWriter.flush()
outputWriter.close()
Note that in production you would probably want to wrap that output stream in a try/finally block (or similar) to make sure your resources are closed properly, as with any file I/O.
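A minimal sketch of that wrapping, under the same assumptions as above:
val outputWriter = new PrintWriter(fs.create(outputPath))
try {
  outputWriter.println(totalSize)
} finally {
  outputWriter.close()  // close() flushes and releases the HDFS stream even on failure
}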
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy
object ScalaApp {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("Program")
val sc = new SparkContext(conf)
val rawData = sc.textFile("/home/sangeen/Kaggle/train.tsv")
val records = rawData.map(line => line.split("\t"))
records.first
println(records.first)
/*
we will have to do a bit of data cleaning during
our initial processing by trimming out the extra quotation characters ("). There are
also missing values in the dataset; they are denoted by the "?" character. In this case,
we will simply assign a zero value to these missing values:
*/
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
/*
In the preceding code, we extracted the label variable from the last column and an
array of features for columns 5 to 25 after cleaning and dealing with missing values.
We converted the label to an Int value and the features to an Array[Double].
Finally, we wrapped the label and features in a LabeledPoint instance, converting
the features into an MLlib Vector.
We will also cache the data and count the number of data points:
You will see that the value of numData is 7395.
*/
data.cache
val numData = data.count
println("value of numData is : " + numData)
/*
We will explore the dataset in more detail a little later, but we will tell you now
that there are some negative feature values in the numeric data. As we saw earlier,
the naïve Bayes model requires non-negative features and will throw an error if it
encounters negative values. So, for now, we will create a version of our input feature
vectors for the naïve Bayes model by setting any negative feature values to zero:
*/
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble).map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
val numIterations = 10
val maxTreeDepth = 5
//Now, train each model in turn. First, we will train logistic regression:
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
}
}
The code gives me errors:
[error] (run-main-1) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.NumberFormatException: For input string: ",urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,framebased,frameTagRatio,hasDomainLink,html_ratio,image_ratio,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label"
[error] at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
[error] at java.lang.Integer.parseInt(Integer.java:481)
[error] at java.lang.Integer.parseInt(Integer.java:527)
[error] at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
[error] at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
[error] at ScalaApp$$anonfun$4.apply(Program.scala:29)
[error] at ScalaApp$$anonfun$4.apply(Program.scala:27)
[error] at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
[error] at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
[error] at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
[error] at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
[error] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:88)
[error] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[error] at java.lang.Thread.run(Thread.java:745)
[error] Driver stacktrace
[error] (compile:run) Nonzero exit code: 1
Your code is trying to convert the header columns into numbers, which of course are not numbers. Just skip the first line and you are good to go:
val lst = List(1,2,3,4)
val records = sc.parallelize(lst).zipWithIndex.filter(_._2 > 0).map(_._1)
records.collect() // Array[Int] = Array(2, 3, 4)
Or don't read the header line at all.
For more: How do I skip a header from CSV files in Spark?
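Applied to the question's own pipeline, the same idea looks roughly like this (a sketch, assuming sc and the original path):
val rawData = sc.textFile("/home/sangeen/Kaggle/train.tsv")
val records = rawData
  .zipWithIndex()
  .filter { case (_, index) => index > 0 }    // index 0 is the header line
  .map { case (line, _) => line.split("\t") }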
Just before running the code, first remove the header with the help of these steps:
1) Open a terminal:
Ctrl + Alt + T
2) Go to the file directory:
cd /home/sangeen/Programs/Classification
3) Run this one-line command:
sed 1d train.tsv > train_noheader.tsv
A header-less TSV file will be generated in that directory.
Use the "train_noheader.tsv" file instead of "train.tsv".
For example:
val rawData = sc.textFile("/home/sangeen/Kaggle/train.tsv")
will become
val rawData = sc.textFile("/home/sangeen/Kaggle/train_noheader.tsv")
Tuxdna is correct that the header is the problem, but the method provided by me to filter out the header will reduce the space and time complexity of the code.
// Drop any row whose fields contain the header text, then parse as before.
val data = records.filter(r => !r.exists(_.contains("urlid,boilerplate,alchemy_category"))).map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
val nbData = records.filter(r => !r.exists(_.contains("urlid,boilerplate,alchemy_category"))).map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble).map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform()?
case class AlertMsg(host: String, count: Int, sum: Double)

val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
  if (rdd.count > 0) {
    val t = sqc.jsonRDD(rdd)
    t.registerTempTable("logstash")
    val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
    sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
  } else {
    rdd
  }
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems the only correct usage would be sqlreport.map(r => r.toString)?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, the if must produce the same type in both branches, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
In this case, remove the if (rdd.count > 0) optimization so that you have a single transformation path.
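A minimal sketch of that single path, reusing the question's sqc and lines (the original guard presumably existed because jsonRDD may not cope well with an empty batch):
lines.transform { rdd =>
  val t = sqc.jsonRDD(rdd)            // infer a schema from this batch's JSON lines
  t.registerTempTable("logstash")
  sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash " +
          "WHERE path = '/var/log/system.log' AND lineno > 70 " +
          "GROUP BY host ORDER BY host_c DESC LIMIT 100")
     .map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()                             // every batch now yields an RDD[AlertMsg]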