I want to filter the lines read from a text file with a set of keywords - Scala

I have written the code below. It works for a single word, but when I pass the Seq variable term I do not get any output. Can anyone tell me how to solve this?
val term = List("Achieving","Making")
val sc = new SparkContext("local[*]","Filter_lines")
val Lines = sc.textFile("../book.txt")
val filter_Lines = Lines.filter(l => l.contains("Making")).collect()
filter_Lines.foreach(println)

Try this -
Lines.filter(l => term.exists(t => l.contains(t))).foreach(println)
The exists function on the collection accepts a function and returns true if l contains any of the terms t.
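Putting it together as a minimal sketch, reusing the term list, context and file path from the question:

import org.apache.spark.SparkContext

val term = List("Achieving", "Making")
val sc = new SparkContext("local[*]", "Filter_lines")
val Lines = sc.textFile("../book.txt")

// keep a line if it contains at least one of the terms
val filter_Lines = Lines.filter(l => term.exists(t => l.contains(t))).collect()
filter_Lines.foreach(println)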

Related

Matching column names from a CSV file in Spark Scala

I want to take the headers (column names) from my CSV file and then match them against my existing headers.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It gives me a value like:
Array([id,name,salary])
and I have created one more static schema, which gives me a value like this:
val ss=Array("id","name","salary")
and then I'm trying to compare the column names using an if condition:
if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}
I guess it always goes to the else part due to the [] and () mismatch. Is there any other way to compare these values without considering the [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
  println("matched")
} else {
  println("not matched")
}
If the order is relevant, then the condition can be done as follows (you can't use Array here but Seq works):
cc.toSeq == ss.toSeq
or you can use a deep array comparison (note that .deep was removed in Scala 2.13):
cc.deep == ss.deep
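Put together, a small standalone sketch of the three comparisons; the literal arrays here just stand in for df.columns and the expected schema:

val cc = Array("name", "id", "salary")   // stand-in for df.columns
val ss = Array("id", "name", "salary")

cc.toSet == ss.toSet   // true  - ignores ordering
cc.toSeq == ss.toSeq   // false - ordering matters
cc.deep == ss.deep     // false - element-wise, order-sensitive (pre-Scala 2.13)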
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.
Then you could compare using cc.deep == ss.deep.
The code below worked for me:
val cc = spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as a String: [id,name,salary].
Then I created one static schema as:
val ss = "[id,name,salary]"
and then wrote the if/else condition.
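A minimal sketch of that approach, assuming a SparkSession named spark and that the first row of the file actually holds the header; the take(1)(0).toString call and the literal schema string are taken from the answer above:

val cc = spark.read.csv("filepath").take(1)(0).toString  // Row.toString, e.g. "[id,name,salary]"
val ss = "[id,name,salary]"

if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}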

How to search for multiple strings in Scala Spark

I am using
val my_data = sc.textFile("abc.txt")
val my_search = my_data.filter(x => x.contains("is","work"))
where I am trying to filter the lines that contain "is" and "work" in my RDD my_data.
If you know all the strings you want to filter beforehand (as in the example you gave), you can do the following:
my_data.filter(x => Seq("is", "work").forall(x.contains))
Full words
If you want to filter on full words, you will need to tokenize each line first. The easiest way to do it is with string.split(" "). Be careful, as this doesn't work for languages like Japanese or Chinese.
my_data.filter { line =>
  val tokens = line.split(" ").toSet
  Seq("is", "work").forall(tokens.contains)
}
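If you want lines that contain any of the strings rather than all of them, swap forall for exists (a small variation on the answer above, not from the original):

// at least one of the terms appears somewhere in the line
my_data.filter(x => Seq("is", "work").exists(x.contains))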

How to find a line in a text with a specific term in Spark Scala

I couldn't find a response for this in Spark Scala; please look at the details.
I have an output text that contains a list of topics with their weights, like this (this has been achieved using LDA on a document):
TOPIC_0;connection;0.030922248292319265
TOPIC_0;pragmatic;0.02690878152282403
TOPIC_0;Originator;0.02443295327258558
TOPIC_0;check;0.022290036662386385
TOPIC_0;input;0.020578378303486064
TOPIC_0;character;0.019718375317755072
TOPIC_0;wide;0.017389396600966833
TOPIC_0;load;0.016898979702795396
TOPIC_0;Pretty;0.014923624938546124
TOPIC_0;soon;0.014731449663492822
I want to go through each topic and find the first sentence related to this topic in a file.
I tried something like this, but I cannot figure out how to do the filtering:
topic.foreach { case (term, weight) =>
  val filePath = "data/20_news/sci.BusinessandFinance/14147"
  val lines = sc.textFile(filePath)
  val words = lines.flatMap(x => x.split(' '))
  val sentence = words.filter(w => words.contains(term))
}
The last line, the filtering, is incorrect.
For example, my text file is like this:
input for the program should be checked. the connection between two part is pretty simple.
so it should extract the first sentence for the topic "input".
Any help or idea is appreciated.
I think you are filtering on your list of words and you should be filtering on lines.
This code: words.contains(term) doesn't really make sense since it returns true if the term appears in any of the words.
It would make more sense to write something like this:
w.contains(term)
So that at least your filter would only return the words that match the term.
However, what you really want is to see if the line (i.e. sentence) contains the term:
topic.foreach { case (term, weight) =>
  val filePath = "data/20_news/sci.BusinessandFinance/14147"
  val lines = sc.textFile(filePath)
  val sentence = lines.filter(line => line.contains(term))
}
It may be, though, that the lines need extra splitting (e.g. on full stops) to get the sentences.
You could add this step in like so:
topic.foreach { case (term, weight) =>
  val filePath = "data/20_news/sci.BusinessandFinance/14147"
  val lines = sc.textFile(filePath)
  val morelines = lines.flatMap(l => l.split("\\. ")) // escape the dot: split treats its argument as a regex
  val sentence = morelines.filter(line => line.contains(term))
}
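If only the first matching sentence per term is needed, as the question asks, one way to finish this off is with take(1) on the filtered RDD (a sketch built on the code above, reusing the same file path):

topic.foreach { case (term, weight) =>
  val lines = sc.textFile("data/20_news/sci.BusinessandFinance/14147")
  val sentences = lines.flatMap(l => l.split("\\. "))
  val firstMatch = sentences.filter(s => s.contains(term)).take(1) // Array with at most one element
  firstMatch.foreach(s => println(s"$term -> $s"))
}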
val rddOnline = sc.textFile("/path/to/file")
val hasLine = rddOnline.map(line => line.contains("whatever it is"))
This returns true or false for each line.
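If you want the matching lines themselves rather than booleans, use filter instead of map (a small sketch, with the same placeholder path and search string):

val rddOnline = sc.textFile("/path/to/file")
// keeps only the lines that contain the search string
val matchingLines = rddOnline.filter(line => line.contains("whatever it is"))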

Trying to pass Map values into parameters but unable to make it work

Hey, I'm new to the language and trying to make a map that contains values used a lot within the code, but I'm currently having trouble passing the map values as parameters.
Here's an example of what it looks like:
val abcd = "abc"
val defg = "123"
val ghij = "456"
val map1 = Map ( "abcd" -> abcd, "defg" -> defg, "ghij" -> ghij)
That part is fine, but when I want to add them all to the parameters it says there are too many arguments.
I had to add .toString or .##() to the end of the strings and ints, for example:
val settings = Array(map1("defg").##())
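It is not clear from the question what the receiving parameter expects, but note that .##() is just hashCode. A hedged sketch: if the target wants Strings, the map lookups can be passed directly; if it wants an Int, parsing is usually what's intended (configure here is a hypothetical method, not from the question):

val map1 = Map("abcd" -> "abc", "defg" -> "123", "ghij" -> "456")

// hypothetical method standing in for whatever takes the parameters
def configure(a: String, b: String, c: String): Unit = println(s"$a / $b / $c")

configure(map1("abcd"), map1("defg"), map1("ghij"))

// if an Int is genuinely required, parse the value instead of calling .##() (which is hashCode)
val defgAsInt: Int = map1("defg").toInt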

Formatting the joined RDD - Apache Spark

I have two key-value pair RDDs. I join the two RDDs and then call saveAsTextFile. Here is the code:
val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))
val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val output = final_res.saveAsTextFile("C:/out")
my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))
How can I get rid of all the parentheses?
I want my output as follows:
534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None
When outputting to a text file Spark will just use the toString representation of the element in the RDD. If you want control over the format, then you can do one last transform of the data to a String before the call to saveAsTextFile.
Luckily the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
  val (f1, ((f2, f3, f4, f5, f6, f7, f8, f9), f10)) = tuple
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line will take the tuple that is passed into the map function and assign the components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
In cases with fewer fields, or when you're just hacking away at a problem in the REPL, a slight alternative to the above is to use pattern matching in the partial function passed to map.
simpleJoinedRdd.map { case (key, (left, right)) => s"$key,$left,$right" }
While that does allow you to make it a single-line expression, it can throw exceptions if the data in the RDD doesn't match the pattern provided, as opposed to the earlier example where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
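Applied to the join in this question, that single-expression form would look roughly like this (a sketch combining the case pattern with the Seq/mkString formatting from above, using the same field count as the earlier example):

final_res.map { case (f1, ((f2, f3, f4, f5, f6, f7, f8, f9), f10)) =>
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}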
You can do something like this:
import scala.collection.JavaConversions._
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
.map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))
I used Guava to format the string, but there is probably a Scala way of doing this.
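A pure-Scala variant of the same idea, using productIterator plus mkString instead of Guava (a sketch reusing the sample tuple from the answer above):

val output = sc.parallelize(List((534309, ((17999, 5161, 45005, 1, "XYZ", "", 29.95, 0.00), None))))

val result = output.map { case (key, (fields, right)) =>
  // prepend the key, append the Option from the outer join, then join with commas
  ((key +: fields.productIterator.toSeq) :+ right).mkString(",")
}
result.collect().foreach(println) // 534309,17999,5161,45005,1,XYZ,,29.95,0.0,None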
Do a flatMap before saving, or you can write a simple format function and use it in map.
Adding a bit of code just to show how it can be done; the function formatOnDemand can be anything.
test = sc.parallelize([(534309,((17999,5161,45005,00000,"XYZ","",29.95,0.00),None))])
print test.collect()
print test.map(formatOnDemand).collect()
def formatOnDemand(t):
    out = []
    out.append(t[0])
    for tok in t[1][0]:
        out.append(tok)
    out.append(t[1][1])
    return out
>>>
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]