I am using the Stanford Topic Modeling Toolbox (TMT) http://nlp.stanford.edu/software/tmt/tmt-0.4/, and I want to prepare my text data set.
I have a txt file of stopwords.
However,
TermStopListFilter()
which filters out stop words from my CSV data set, only accepts a list defined within the script, such as:
TermStopListFilter(List("positively","scrumptious"))
How do I import my stopwords.txt file and use it as my stopword list?
A full snippet of the code I use:
val source = CSVFile("filtered.csv");
val text = {
  source ~>
  Column(1) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(100) ~>
  TermStopListFilter(TXTFile("stopwords.txt")) ~>
  TermDynamicStopListFilter(10) ~>
  DocumentMinimumLengthFilter(5)
}
Well, if your stopwords are comma-separated, you can try this (using scala.io.Source to read the file, and flatMap rather than map so the result is a List[String]):
...
TermStopListFilter(Source.fromFile("stopwords.txt").getLines().flatMap(_.split(",")).toList) ~>
...
If the stopwords in stopwords.txt are delimited by some other character, change the split(",") accordingly. You will also most likely want to remove the line TermStopListFilter(List("positively","scrumptious")).
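Putting it together, a minimal sketch of the complete pipeline with the stopwords read from the file (assuming one or more comma-separated stopwords per line; trim handles stray whitespace):

import scala.io.Source

// Read the stopword file once and turn it into a List[String].
val stopwords = Source.fromFile("stopwords.txt").getLines().flatMap(_.split(",")).map(_.trim).toList

val source = CSVFile("filtered.csv")
val text = {
  source ~>
  Column(1) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(100) ~>
  TermStopListFilter(stopwords) ~>
  TermDynamicStopListFilter(10) ~>
  DocumentMinimumLengthFilter(5)
}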
Related
I have to list all files whose names contain a timestamp greater than or equal to that of a particular file, in Scala. Below is an example.
Files available:
log_20200601T123421.log
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
Condition file:
log_20200601T123421.log - I need to list all file names that are greater than or equal to 20200601T123421. The result would be:
Output list:
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
How can I achieve this in Scala? I was trying Apache Commons, but I couldn't find a greater-than-or-equal NameFileFilter for it.
Perhaps the following code snippet could be a starting point:
import java.io.File

// Plain string comparison works here because the timestamps in the names are zero-padded,
// so lexicographic order matches chronological order.
def getListOfFiles(dir: File): List[File] =
  dir.listFiles.filter(x => x.getName > "log_20200601T123421.log").toList

val files = getListOfFiles(new File("/tmp"))
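If the comparison needs to be inclusive (greater than or equal, as the question states), one option is to compare only the extracted timestamp portion of the names; a sketch, assuming the fixed log_<timestamp>.log naming pattern (timestampOf and filesAtOrAfter are hypothetical helper names):

import java.io.File

// Strip the fixed prefix and suffix to get the raw timestamp string.
def timestampOf(name: String): String = name.stripPrefix("log_").stripSuffix(".log")

def filesAtOrAfter(dir: File, reference: String): List[File] =
  dir.listFiles.filter(f => timestampOf(f.getName) >= timestampOf(reference)).toList

val files = filesAtOrAfter(new File("/tmp"), "log_20200601T123421.log")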
For the extended task to collect files from different sub-directories:
import java.io.File

// Recursively collect every file and directory below f.
def recursiveListFiles(f: File): Array[File] = {
  val these = f.listFiles
  these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}

val files = recursiveListFiles(new File("/tmp")).filter(x => x.getName > "log_20200601T123421.log")
I'm trying to list the files within a directory that match a regular expression, e.g. ".csv$". This is very similar to Scala & DataBricks: Getting a list of Files.
I've been running in circles for hours trying to figure out how Scala can list a directory of files and filter them by regex.
import java.io.File
def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
val name : String = ".csv"
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
println(files)
gives the error
/home/con/Scripts/scala/find_files.scala:13: error: value path is not a member of java.io.File
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
I'm trying to figure out the regular Scala equivalent of dbutils.fs.ls which eludes me.
How can I list files in a regular directory in Scala?
The error reports that path is not a member of java.io.File, which indeed it isn't.
If you want to match by name, why not get the file names? Also, your regex is a bit off if you want to match on the file extension.
Fixing these two problems:
val name : String = ".+\\.csv"
val files = getListOfFiles("/path/to/files/location")
  .map(f => f.getName)
  .filter(_.matches(name))
will output .csv files in the /path/to/files/location folder.
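As a small usage sketch that keeps the File objects around (so the full paths can still be printed), reusing the getListOfFiles helper from the question:

// Filter on the file name only, then print the matching paths.
val csvPattern = ".+\\.csv"
val csvFiles = getListOfFiles("/home/con/Scripts").filter(_.getName.matches(csvPattern))
csvFiles.foreach(f => println(f.getPath))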
I have an input .txt file in the following format:
Record
ID||1
Word||ABC
Language||English
Count||2
Record
ID||2
Word||DEF
Language||French
Count||4
and so on.
I'm new to Apache Spark/Scala.
I see that there are options to read a file line by line using the .textFile method, or to read whole files using the .wholeTextFiles method. We can also read files that are in CSV format.
But let's say I want to read such a file and create a case class from it with the members id, word, language, and count. How can I go about this?
Assuming your input format is consistent (no stray whitespace, every record delimited by "Record\n"), the following code works.
The key is the Hadoop configuration property "textinputformat.record.delimiter".
import org.apache.spark.{SparkConf, SparkContext}

case class Foo(ID: Long, Word: String, Language: String, Count: Long)

val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("stackOverflow")
val sc = new SparkContext(conf)

// Split the input into records on "Record\n" instead of on single newlines.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "Record\n")

val rdd = sc.textFile("C:\\TEMP\\stack.txt")
  .flatMap(record => {
    if (record.isEmpty) None // drops the leading empty string produced by the "Record\n" delimiter
    else {
      val lines = record.split("\n").map(_.split("\\|\\|"))
      Some(Foo(
        lines(0)(1).toLong,
        lines(1)(1),
        lines(2)(1),
        lines(3)(1).toLong
      ))
    }
  })
rdd.foreach(println)
The output is
Foo(2,DEF,French,4)
Foo(1,ABC,English,2)
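If you later want to query these records with Spark SQL, an RDD of case classes converts directly into a DataFrame; a minimal sketch, assuming a SparkSession is available (here obtained with getOrCreate):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// The case class fields become the DataFrame columns.
val df = rdd.toDF()
df.show()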
I am using Alpakka and Akka to process a CSV file. Since I have a bunch of CSV files that have to be added to the same stream, I would like to add a field that contains information from the file name or request. Currently I have something like this:
val source = FileIO.fromPath(Paths.get("10002070.csv"))
.via(CsvParsing.lineScanner())
This streams a sequence of Lists (lines) of ByteStrings (fields). The goal would be something like:
val filename = "10002070.csv"
val source = FileIO.fromPath(Paths.get(filename))
.via(CsvParsing.lineScanner())
.via(AddCSVFieldHere(filename))
Creating a structure similar to:
10002070.csv,max,estimated,12,1,0
Where the filename is a field non-existent in the original source.
I think it does not look very pretty to inject values mid-stream; plus, eventually I would like to determine the file names passed to the parser in a stream stage that reads a directory.
What is the correct/canonical way to pass values through stream stages for later re-use?
You could transform the stream with map to add the file name to each List[ByteString]:
import java.nio.file.Paths
import akka.stream.scaladsl.FileIO
import akka.stream.alpakka.csv.scaladsl.CsvParsing
import akka.util.ByteString

val fileName = "10002070.csv"
val source =
  FileIO.fromPath(Paths.get(fileName))
    .via(CsvParsing.lineScanner())
    .map(List(ByteString(fileName)) ++ _)
For example:
Source.single(ByteString("""header1,header2,header3
                           |1,2,3
                           |4,5,6""".stripMargin))
  .via(CsvParsing.lineScanner())
  .map(List(ByteString("myfile.csv")) ++ _)
  .runForeach(row => println(row.map(_.utf8String)))

// The above code prints the following:
// List(myfile.csv, header1, header2, header3)
// List(myfile.csv, 1, 2, 3)
// List(myfile.csv, 4, 5, 6)
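To actually run this example, an actor system needs to be in scope (on Akka 2.6+ an implicit ActorSystem is enough for runForeach; on older versions an ActorMaterializer is also needed); a minimal setup sketch:

import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import akka.stream.alpakka.csv.scaladsl.CsvParsing
import akka.util.ByteString

// Provides the materializer that runForeach uses under the hood.
implicit val system: ActorSystem = ActorSystem("csv-example")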
The same approach is applicable in the more general case in which you don't know the file names upfront. If you want to read all the files in a directory (assuming that all of these files are csv files), concatenate the files into a single stream, and preserve the file name in each stream element, then you could do so with Alpakka's Directory utility in the following manner:
import java.nio.file.Paths
import akka.stream.alpakka.file.scaladsl.Directory

val source =
  Directory.ls(Paths.get("/my/dir")) // Source[Path, NotUsed]
    .flatMapConcat { path =>
      FileIO.fromPath(path)
        .via(CsvParsing.lineScanner())
        .map(List(ByteString(path.getFileName.toString)) ++ _)
    }
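As a usage sketch, this combined source can be drained the same way as the single-file example; each printed row carries its originating file name as the first field:

source.runForeach(row => println(row.map(_.utf8String)))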
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is in the lookup functions: getName and getCourse basically do nothing, because their return type is Unit. Using an if without an else means that for some inputs there is no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array; it makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
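Note that this assumes studentRDD and studentdetailsRDD already hold case-class instances rather than the raw text lines produced by sc.textFile; a minimal sketch of that parsing step, with hypothetical Student and StudentDetails case classes and naive comma splitting:

case class Student(StudentId: String, City: String)
case class StudentDetails(StudentId: String, StudentName: String, Course: String)

// Skip the header line, then split each remaining line on commas.
val studentRDD = sc.textFile("Student.csv")
  .filter(!_.startsWith("StudentId"))
  .map { line => val f = line.split(","); Student(f(0), f(1)) }

val studentdetailsRDD = sc.textFile("StudentDetails.csv")
  .filter(!_.startsWith("StudentId"))
  .map { line => val f = line.split(","); StudentDetails(f(0), f(1), f(2)) }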
Spark has great support for joining and for writing to files. The join only takes one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
// Requires a SparkSession in scope for toDF:
// import spark.implicits._
val df1 = Seq((101, "NDLS"),
              (102, "Mumbai")
             ).toDF("id", "city")

val df2 = Seq((101, "ABC", "C001"),
              (102, "XYZ", "C002")
             ).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created, containing a single part file that holds the final result.
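If the output needs a header row, the standard Spark CSV writer options can be added; a small sketch:

dfResult
  .repartition(1)
  .write
  .option("header", "true") // include the column names as the first row
  .mode("overwrite")        // replace the output directory if it already exists
  .csv("hello.csv")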