how to search multipule string in Scala spark - scala

I am using
val my_data = sc.textFile("abc.txt")
val my_search = my_data.filter(x => x.contains("is","work"))
where I am trying to filter the lines contains "is" and "work" in my RDD "My_Data"

If you know all the strings you want to filter beforehand (as in the example you gave), you can do the following:
my_data.filter(x => Seq("is", "work").forall(x.contains))
Full words
If you want to filter the full words, you will need to tokenize each line first. The easiest way to do it is by using a string.split(" "). Be careful, as this doesn't work for languages like Japanese or Chinese.
my_data.filter { line =>
val tokens = line.split(" ").toSet
Seq("is", "work").forall(tokens.contains)
}

Related

Remove duplicates on the combination of Name and Age using Scala and print results, do not use high level API/Framework like pandas /spark-sql etc

• Please do not use high level API/Framework like pandas /spark-sql etc.
• Solve this problem by using simple data structures given in a Scala language.
Dataset -
Name,Age,Location
Rajesh,21,London
Suresh,28,California
Sam,26,Delhi
Rajesh,21,Gurgaon
Manish,29,Bengaluru
This is what I have tried:
val list =
Source
.fromFile("/home/nikhil/Desktop/Datasets/abc.csv").drop(1)
.getLines()
.filter(line => !line.isEmpty)
.map { line =>
val value = line.split(",")
(value(0), value(1))
}.toList
val test = list.toSet //doesn't preserve the order
//#To preserve the order
// val test = list.distinct
OR use drop(1) while reading the file, that will skip the first line of the file contains (Name, Age, Location).
test.foreach(println)
output:-
(Rajesh,21)
(Suresh,28)
(Sam,26)
(Manish,29)
If you're on Scala 2.13.x then the new distinctBy() is an option.
. //readLines(), filter(), drop(1), whatever
.
.
.distinctBy(s => s.take(s.lastIndexOf(",")))
.foreach(println)
//Rajesh,21,London
//Suresh,28,California
//Sam,26,Delhi
//Manish,29,Bengaluru
Alas, no distinctBy(). Oh well, there's always foldLeft().
.foldLeft(Set.empty[String]){ case (seen, str) =>
val substr = str.take(str.lastIndexOf(","))
if (!seen(substr)) println(str) //send to StdOut
seen + substr
}

how to find a line in a text with specific term in spark scala

I couldnt find a response for this in spark scala,
please look at the detail,
I have an output text that contains list of topic with their weight like this:(this has been achieved using lda on a document)
TOPIC_0;connection;0.030922248292319265
TOPIC_0;pragmatic;0.02690878152282403
TOPIC_0;Originator;0.02443295327258558
TOPIC_0;check;0.022290036662386385
TOPIC_0;input;0.020578378303486064
TOPIC_0;character;0.019718375317755072
TOPIC_0;wide;0.017389396600966833
TOPIC_0;load;0.016898979702795396
TOPIC_0;Pretty;0.014923624938546124
TOPIC_0;soon;0.014731449663492822
I want to go through each topic and find the first sentence related to this topic in a file.
I tried something like this but I can not make my mind about this to filter:
topic.foreach { case (term, weight) =>
val filePath = "data/20_news/sci.BusinessandFinance/14147"
val lines = sc.textFile(filePath)
val words = lines.flatMap(x => x.split(' '))
val sentence = words.filter(w => words.contains(term))
}
the last line for filtering is incorrect,
for example:
my text file is like this:
input for the program should be checked. the connection between two part is pretty simple.
so it should extract the first sentence for topic :"input"
any help or idea is appreciated
I think you are filtering on your list of words and you should be filtering on lines.
This code: words.contains(term) doesn't really make sense since it returns true if the term appears in any of the words.
It would make more sense to write something like this:
w.contains(term)
So that at least your filter would only return the words that match the term.
However what you really want is to see if the line (ie sentence) contains the term.
topic.foreach { case (term, weight) =>
val filePath = "data/20_news/sci.BusinessandFinance/14147"
val lines = sc.textFile(filePath)
val sentence = lines.filter(line => line.contains(term))
}
It may be though that the lines need extra splitting (e.g. on full stops) to get the sentences.
You could add this step in like so:
topic.foreach { case (term, weight) =>
val filePath = "data/20_news/sci.BusinessandFinance/14147"
val lines = sc.textFile(filePath)
val morelines = lines.flatMap(l => l.split(". "))
val sentence = morelines.filter(line => line.contains(term))
}
val rddOnline = sc.textFile("/path/to/file")
val hasLine = rddOnline.map(line => line.contains("whatever it is"))
it will return true or false

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from reputation, in this case it is "11849" and the number from age, in this example it is "35" I would like to have them as floats.
The file is located in a HDFS so it comes in the format RDD
val linesWithAge = lines.filter(line => line.contains("Age=")) //This is filtering data which doesnt have age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) //Here I am trying to split the data where there is a "
so when I split it with quotation marks the reputation is in index 3 and age in index 23 but how do I assign these to a map or a variable so I can use them as floats.
Also I need it to do this for every line on the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So if added an index to the array and now I've successfully managed to assign it to a variable but I can only do it to one item in the RDD, does anyone know how I could do it to all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Henceforth the result of the for-comprehension will be None and the record will be flattened by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
import scala.xml.XML._
// help function to map Strings to Option where empty strings become None
def emptyStrToNone(str:String):Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap{line =>
val record = XML.loadString(line)
for {
rep <- emptyStrToNone((record \ "#Reputation").text)
age <- emptyStrToNone((record \ "#Age").text)
} yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line:String, key:String) : A = {
val res = line.split(key+"=")
if(res.size==1) throw new RuntimeException(s"no value for key $key")
val value = res(1).split("\"")(1)
return value.asInstanceOf[A]
}

Formatting the join rdd - Apache Spark

I have two key value pair RDD, I join the two rdd's and I saveastext file, here is the code:
val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))
val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val output = final_res.saveAsTextFile("C:/out")
my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))
How can i get rid of all the parenthesis?
I want my output as follows:
534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None
When outputing to a text file Spark will just use the toString representation of the element in the RDD. If you want control over the format, then, tou can do one last transform of the data to a String before the call to saveAsTextFile.
Luckily the tuples that arise form using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
val (f1,((f2,f3,f4,f5,f6,f7,f8,f9),f10)) = tuple
Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line will take the tuple that is passed into the map function and assign the components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
In cases with fewer fields or you're just hacking away at a problem on the REPL, a slight alternate to the above can also be used by using pattern matching on the partial function passed to map.
simpleJoinedRdd.map { case (key,(left,right)) => s"$key,$left,$right"}}
While that does allow you do make it a single line expression it can throw Exceptions if the data in the RDD don't match the pattern provided, as opposed to the earlier example where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
You can do something like this:
import scala.collection.JavaConversions._
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
.map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))
I used guava to format string but there is porbably scala way of doing this.
do a flatmap before saving. Or, you can write a simple format function and use it in map.
Adding a bit code, just to show how it can be done. function formatOnDemand can be anything
test = sc.parallelize([(534309,((17999,5161,45005,00000,"XYZ","",29.95,0.00),None))])
print test.collect()
print test.map(formatOnDemand).collect()
def formatOnDemand(t):
out=[]
out.append(t[0])
for tok in t[1][0]:
out.append(tok)
out.append(t[1][1])
return out
>>>
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"].
Now I want to iterate over every of those occurrences to do something with every filename and content.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd{
//do stuff with the array found on loccation n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
a.foreach(e => print(e + " "))
println()
})
}
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id,text) => text.split("\\s+") }
// give each word a count of 1
val wordT = words map (x => (x,1))
//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge about Scala, so everything I have to offer is java code. However, it should not be very difficult to translate it into scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction <Iterator<String> ,String>(){
#Override
public Iterable<String> call(Iterator <String> t) throws Exception {
ArrayList<String[]> tmpRes = new ArrayList <>();
String[] fillData = new String[2];
fillData[0] = "filename";
fillData[1] = "content";
while(t.hasNext()){
tmpRes.add(fillData);
}
return Arrays.asList(tmpRes);
}
}).cache();
what the wholeTextFiles return is a Pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
or converting the Pair RDD to a RDD first
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is more compliant for small files.
for (element <- YourRDD)
{
// do what you want with element in each iteration, and if you want the index of element, simply use a counter variable in this loop beginning from 0
println (element._1) // this will print all filenames
}