class Cleaner {
  def getDocumentData() = {
    val conf = new SparkConf()
      .setAppName("linkin_spark")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
      .set("spark.rdd.compress", "true")
      .set("spark.storage.memoryFraction", "1")

    val CorpusReader = new Corpus()
    val files = CorpusReader.getListOfFiles("/home/DATA/doc_collection/")

    val sc = new SparkContext(conf)
    val temp = sc.textFile(files(0).toString())
    println(files(0).toString())

    var count = 0
    val regex = """<TAG>""".r
    for (line <- temp) {
      line match {
        case regex(_*) => {
          println(line)
          count += 1
          println(count)
        }
        case _ => null // Handle error - scala.MatchError
      }
    }
    println(s"There are " + count + " documents.") // this comes out to be 0
  }
}
I have a list of text files that I have to read. They are XML-like files, so I need to extract the relevant text. Since they are not standard XML files, I thought of using a regex to get the text. Every document starts with a <TAG> tag, so I tried to count the number of documents in a file, which should equal the number of <TAG> matches in the file. The function above does exactly that. The file originally has 264 docs, but when I run the function I get either 127 or 137. It does not seem to be reading the whole file. Also, count at the end comes out to be 0.
I am a Scala/Spark newbie.
UPDATE:
var count = sc.accumulator(0)
val regex = """<TAG>""".r
for (line <- temp) {
  println(line)
  line match {
    case regex(_*) => {
      count += 1
      println(s"$line # $count") // There is no "<TAG> # 264" in the output
    }
    case _ => null
  }
}
println(s"There are " + count.value + " documents.")
This change in the program gives me the correct value of count, i.e. 264, but the file is not printed correctly! It appears to start somewhere in the middle and end somewhere in the middle.
UPDATE II:
This has something to do with threads. SparkConf() has been initialised with local[2], which means 2 threads, if I am not wrong. As soon as I changed it to local[1] I got the correct answer, but I cannot use just one thread.
The file looks like this:
<TAG>
<DOCNO> AP890825-0001 </DOCNO>
<FILEID>AP-NR-08-25-89 0134EDT</FILEID>
<TEXT>
Some large text.
</TEXT>
</TAG>
<TAG> // new doc started
How should I correct this issue?
This is a closure problem. Each node gets its own copy of the count variable. You want to use accumulators, or simply perform a reduce.
Created by:
val tagCounter = sc.accumulator(0, "tagCount")
Updated by: (not readable on the nodes)
tagCounter += 1
Readable on the driver by:
tagCounter.value
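If you go the reduce route instead, a minimal sketch (assuming temp is the RDD of lines from the question, and counting lines that contain the tag rather than pattern-matching them) would be:

// Map each line to 0 or 1 and sum with a reduce; the aggregation happens on the
// executors, so no driver-side mutable state is involved.
val docCount = temp.map(line => if (line.contains("<TAG>")) 1 else 0).reduce(_ + _)
println(s"There are $docCount documents.")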
Following your update:
var count = sc.accumulator(0)
val regex = """<TAG>""".r
val output = for (line <- temp) yield {
  //println(line)
  line match {
    case regex(_*) => {
      count += 1
      //println(s"$line # $count") //There is no "<TAG> # 264" in the output
      line
    }
    case _ => "ERROR"
  }
}
println(s"output size: ${output.count}")
println(s"There are ${count.value} documents.")
UPDATE AFTER SEEING THE INPUT FORMAT:
You may end up having to use wholeTextFiles to guarantee ordering. Otherwise the distributed nature of Spark means that ordering is often not guaranteed, but if you can guarantee ordering (possibly with a custom partitioner or a custom InputFormat), then something like this should work:
sc.parallelize(list)
  .aggregate(Nil: List[String])((accum, value) => {
    value match {
      // a <TAG> line starts a new document
      case regex(_*) => accum :+ value
      case _ => accum match {
        case Nil => List(value)
        // otherwise append the line to the document currently being built
        case _   => accum.init :+ (accum.last + value)
      }
    }
  }, _ ++ _)
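For the wholeTextFiles route, a minimal sketch (assuming each file in /home/DATA/doc_collection/ fits comfortably in memory) that counts <TAG> occurrences per file and sums them would be:

// Each element is (fileName, wholeFileContent), so ordering within a file is preserved.
val docCount = sc.wholeTextFiles("/home/DATA/doc_collection/")
  .map { case (_, content) => """<TAG>""".r.findAllIn(content).length }
  .reduce(_ + _)
println(s"There are $docCount documents.")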
Related
I am running a simple PageRank algorithm on a WARC file. The two RDDs I use are links and ranks, as per the algorithm. links is of the form (URL: String, links: List[String]) and ranks is of the form (URL: String, no. of links: Int). Here is the code I am using:
val warcs = sc.newAPIHadoopFile(
  warcfile,
  classOf[WarcGzInputFormat], // InputFormat
  classOf[NullWritable],      // Key
  classOf[WarcWritable]       // Value
)

val links = warcs.map { wr => wr._2.getRecord() }
  .map { wb =>
    val url = wb.getHeader().getUrl()
    val d = Jsoup.parse(wb.getHttpStringBody())
    val links = d.select("a").asScala
    links.map(l => (url, l.attr("href"))).toIterator
  }
  .flatMap(identity)
  .map(t => (t._1, List(t._2)))
  .reduceByKey(_ ::: _)

var ranks = warcs.map { wr => wr._2.getRecord() }
  .map { wb => (wb.getHeader().getUrl(), Jsoup.parse(wb.getHttpStringBody()).select("a[href]").size().toDouble) }
  .filter { l => l._2 > 0 }

for (i <- 1 to 10) {
  val contribs = links.join(ranks).flatMap { case (url, (links, rank)) =>
    links.map(dest => (dest, rank.toDouble / links.size))
  }
  ranks = contribs.reduceByKey((x, y) => x + y).mapValues(sum => 0.15 + 0.85 * sum)
}
Everything goes fine until after the for loop, where ranks is empty. I cannot understand why this happens. I've tried doing it iteratively (meaning that I took the lines out of the for loop and ran them independently) and noticed that everything goes fine until the second iteration, at which point the join operation returns an empty RDD. How can I circumvent this?
Well, I am new to Spark and Scala and have been trying to implement cleaning of data in Spark. The code below checks for the missing value for one column, stores it in outputrdd, and runs loops to calculate the missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing again to the same location, it fails if there is more than one missing value. Can you please assist in writing finalrdd to a particular location once calculating missing values for all occurrences is done?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    //headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      //Checks for missing values in dataframe and stores missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()

        for (i <- strings) {
          if (Coddition check) {
            //Calculates missing value and stores in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, the function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example like this:
val strings = outputdatetimerdd
  .map(row => row.mkString)
  .collect() // perhaps '.collect()' is redundant

val finalrdd = strings
  .filter(str => Coddition check str) // don't know how this Coddition works
  .map(x => x.mkString("\t"))

// this part is called only once, not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
Upload a file in chunks to a server, including additional fields
def readFile(): Seq[ExcelFile] = {
  logger.info(" readSales method initiated: ")
  val source_test = source("E:/dd.xlsx")
  println(" source_test " + source_test)
  val source_test2 = Source.fromFile(source_test)
  println(" source_test2 " + source_test)
  //logger.info(" source: "+source)
  for {
    line <- source_test2.getLines().drop(1).toVector
    values = line.split(",").map(_.trim)
    // logger.info(" values are the: "+values)
  } yield ExcelFile(Option(values(0)), Option(values(1)), Option(values(2)), Option(values(3)))
}

def source(filePath: String): String = {
  implicit val codec = Codec("UTF-8")
  codec.onMalformedInput(CodingErrorAction.REPLACE)
  codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
  Source.fromFile(filePath).mkString
}
The upload route:
path("upload"){
(post & extractRequestContext) { ctx => {
implicit val materializer = ctx.materializer
implicit val ec = ctx.executionContext
fileUpload("fileUploads") {
case (fileInfo, fileStream) =>
val path = "E:\\"
val sink = FileIO.toPath(Paths.get(path).resolve(fileInfo.fileName))
val wResult = fileStream.runWith(sink)
onSuccess(wResult) { rep => rep.status match {
case Success(_) =>
var ePath = path + File.separator + fileInfo.fileName
readFile(ePath)
_success message_
case Failure(e) => _faillure message_
} }
}
} }
}
I am using the above code. Is it possible in Scala or Akka to read the Excel file in chunks?
After looking at your code, it looks like you are having an issue with the post-processing (after upload) of the file.
If uploading a 3 GB file is working even for one user, then I assume that it is already chunked or multipart.
The first problem is here: source_test2.getLines().drop(1).toVector, which creates a Vector (> 3 GB) with all the lines in the file.
The other problem is that you are keeping the whole Seq[ExcelFile] in memory, which should be bigger than 3 GB (because of Java object overhead).
So whenever you call this readFile function, you are using more than 6 GB of memory.
You should try to avoid creating such large objects in your application and use things like Iterator instead of Seq:
def readFile(): Iterator[ExcelFile] = {
  val lineIterator = Source.fromFile("your_file_path").getLines

  lineIterator.drop(1).map(line => {
    val values = line.split(",").map(_.trim)
    ExcelFile(
      Option(values(0)),
      Option(values(1)),
      Option(values(2)),
      Option(values(3))
    )
  })
}
The advantage of an Iterator is that it does not load everything into memory at once, and you can keep using Iterators for further steps.
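As a usage sketch (the batch size of 1000 and the saveBatch helper are hypothetical, not part of the original code), the rows can then be consumed in fixed-size chunks so that only one batch is in memory at a time:

readFile()
  .grouped(1000)                       // Iterator[Seq[ExcelFile]], one batch at a time
  .foreach(batch => saveBatch(batch))  // saveBatch stands in for your per-chunk processing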
Below is code for getting the list of file names in a zipped file:
def getListOfFilesInRepo(zipFileRDD: RDD[(String, PortableDataStream)]): List[String] = {
  val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
  val filesInZip = new ArrayBuffer[String]()
  var ze: Option[ZipEntry] = None

  zipInputStream.foreach(stream => {
    do {
      ze = Option(stream.getNextEntry)
      ze.foreach { ze =>
        if (ze.getName.endsWith("java") && !ze.isDirectory()) {
          var fileName: String = ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java"))
          filesInZip += fileName
        }
      }
      stream.closeEntry()
    } while (ze.isDefined)
    println(filesInZip.toList.length) // prints 889 (correct)
  })

  println(filesInZip.toList.length) // prints 0 (WHY..?)
  filesInZip.toList
}
I execute the above code in the following manner:
scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25
scala> getListOfFilesInRepo(zipRDD)
889
0
res12: List[String] = List()
Why am I not getting 889 and instead getting 0?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip, and when it finishes this copy is simply discarded and garbage collected. If you want to keep the results, you should use a transformation (most likely a flatMap) and return the collected aggregated values.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???
zipInputStream.flatMap(listFiles)
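As a hedged sketch, listFiles could be filled in by adapting the loop from the question (the ".java" filter and the substring logic are taken from the question's code; this is one possible implementation, not the only one):

def listFiles(stream: PortableDataStream): TraversableOnce[String] = {
  val zis = new java.util.zip.ZipInputStream(stream.open)
  try {
    Iterator
      .continually(zis.getNextEntry)   // read entries until getNextEntry returns null
      .takeWhile(_ != null)
      .filter(ze => ze.getName.endsWith("java") && !ze.isDirectory)
      .map(ze => ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java")))
      .toList                          // materialize before the stream is closed
  } finally {
    zis.close()
  }
}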
You can learn more from Understanding closures
I am trying to do the basic PageRank example in Flink with a little bit of modification (only in reading the input file; everything else is the same). I am getting the error "Task not serializable", and below is part of the error output:
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val maxIterations = 10000
    val DAMPENING_FACTOR: Double = 0.85
    val EPSILON: Double = 0.0001
    val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"

    val links = env.readCsvFile[Tuple2[Long, Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1, 4)).as('sourceId, 'targetId).toDataSet[Link] // source and target

    val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id] // pageId

    val noOfPages = pages.count()

    val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))

    val adjacencyLists = links
      // initialize lists: ._1 is the source id and ._2 is the target id
      .map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
      // concatenate lists
      .groupBy("sourceId").reduce {
        (l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
      }

    // start iteration
    val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
      // ** the output shows the error here **
      currentRanks =>
        val newRanks = currentRanks
          // distribute ranks to target pages
          .join(adjacencyLists).where("pageId").equalTo("sourceId") {
            (page, adjacent, out: Collector[Page]) =>
              for (targetId <- adjacent.targetIds) {
                out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
              }
          }
          // collect ranks and sum them up
          .groupBy("pageId").aggregate(SUM, "rank")
          // apply dampening factor
          // ** the output shows the error here **
          .map { p =>
            Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
          }

        // terminate if no rank update was significant
        val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
          (current, next, out: Collector[Int]) =>
            // check for significant update
            if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
        }

        (newRanks, termination)
    }

    val result = finalRanks

    // emit result
    result.writeAsCsv(outpath, "\n", " ")

    env.execute()
  }
}
Any help in the right direction is highly appreciated. Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to assign the result of pages.count to a variable (val pagesCount = pages.count) and refer to this variable in your MapFunction.
What pages.count actually does, is to trigger the execution of the data flow graph, so that the number of elements in pages can be counted. The result is then returned to your program.
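A minimal sketch of that change, reusing the names from the question (only the pagesCount variable is new), could look like this:

// Trigger execution once on the driver; pagesCount is then a plain Long
// that the MapFunction's closure can safely capture and serialize.
val pagesCount = pages.count()

// ... and inside the iteration, use the captured value instead of pages.count():
.map { p =>
  Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pagesCount))
}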