Spark/Scala group similar words and count - scala

I am trying to group and count words in an RDD such that if a word ends with s/ly it is counted as the same word.
hi
yes
love
know
hi
knows
loves
lovely
Expected output:
hi 2
yes 1
love 3
know 2
This is what I currently have:
data.map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
Any help is appreciated regarding adding the s/ly condition.

It seems that you want to count the stems of the words in your input list. In Computational Linguistics, the process of finding the stem of a word is called stemming. If your goal is only to handle s and ly at the end of the words in your input list, you can remove them in a map step and then count the remaining parts. Be aware that removing s and ly blindly has side effects: for instance, for a word that ends with s such as "is", you would end up counting "i". A better solution is to use an existing stemmer such as Porter or the one available in Stanford CoreNLP.
listRdd.mapToPair(t -> new Tuple2<>(t.replaceAll("(ly|s)$", ""), 1))
    .reduceByKey((a, b) -> a + b).collect();
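Since the question itself uses a Scala RDD, a minimal Scala sketch of the same regex-removal idea would be the following (assuming data is an RDD[String] as in the question):
val counts = data
  .map(word => (word.replaceAll("(ly|s)$", ""), 1))   // note the caveat above: this also strips "s" from words like "is"
  .reduceByKey(_ + _)
  .collect()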
The second solution, which can also handle other suffixes, is to use a stemmer:
listRdd.mapToPair(t -> {
    Stemmer stemmer = new Stemmer();
    return new Tuple2<>(stemmer.stem(t), 1);
}).reduceByKey((a, b) -> a + b).collect();
Here, Stemmer can be replaced with any stemmer implementation.
For more information about stemmers and lemmatizers, see https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

If you want to group together words that finish with 's' or 'ly' here is how I would do it:
data
  .map(word => (if (word.endsWith("s") || word.endsWith("ly")) "s/ly-words" else word, 1))
  .reduceByKey(_ + _)
  .collect
If you want to separate 'ly' words from 's' words from the rest:
data
  .map(word => (if (word.endsWith("s")) "s-words" else if (word.endsWith("ly")) "ly-words" else word, 1))
  .reduceByKey(_ + _)
  .collect
If you want to count words that end with 'ly' or 's' as if they did not end with it (e.g. 'love', 'lovely' and 'loves' are all counted as 'love'):
data
  .map(word => (if (word.endsWith("s")) word.slice(0, word.length - 1) else if (word.endsWith("ly")) word.slice(0, word.length - 2) else word, 1))
  .reduceByKey(_ + _)
  .collect
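As a side note, when the result is small enough to collect to the driver anyway, RDD.countByValue can replace the map-to-1/reduceByKey pattern; this is a sketch under that assumption, not part of the original answers:
data
  .map(word =>
    if (word.endsWith("s")) word.dropRight(1)
    else if (word.endsWith("ly")) word.dropRight(2)
    else word)
  .countByValue()   // Map(hi -> 2, yes -> 1, love -> 3, know -> 2)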

Related

Scala: How to split words using multiple delimiters

Suppose I have the text file like this:
Apple#mango&banana#grapes
The data needs to be split on multiple delimiters before performing the word count.
How to do that?
Use the split method:
scala> "Apple#mango&banana#grapes".split("[#&#]")
res0: Array[String] = Array(Apple, mango, banana, grapes)
If you just want to count words, you don't need to split. Something like this will do:
val numWords = """\b\w""".r.findAllIn(string).length
This regex matches the start of a word (\b is a zero-length word boundary, \w is any "word" character: letter, digit or underscore), so you get all the matches in your string and then just check how many there are.
If you are looking to count each word separately, and to do it across multiple lines, then split is probably a better option:
source
  .getLines()
  .toList
  .flatMap(_.split("\\W+"))
  .filterNot(_.isEmpty)
  .groupBy(identity)
  .mapValues(_.size)
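For a quick sanity check without a file, the same pipeline can be run directly on the sample string from the question; this is just an illustrative sketch, not part of the original answer:
"Apple#mango&banana#grapes"
  .split("\\W+")
  .filterNot(_.isEmpty)
  .groupBy(identity)
  .map { case (w, occurrences) => (w, occurrences.length) }
// Map(Apple -> 1, mango -> 1, banana -> 1, grapes -> 1)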

How does reduceByKey work [duplicate]

This question already has answers here:
reduceByKey: How does it work internally?
(5 answers)
Closed 5 years ago.
I am doing some work with Scala and Spark (beginner programmer and poster); the goal is to map each request (line) to a pair (userid, 1) and then sum the hits.
Can anyone explain in more detail what is happening on the 1st and 3rd lines, and what the => in line => line.split means?
Please excuse any errors in my post formatting as I am new to this website.
val userreqs = logs.map(line => line.split(' ')).
map(words => (words(2),1)).
reduceByKey((v1,v2) => v1 + v2)
Consider the hypothetical log below:
trans_id amount user_id
1 100 A001
2 200 A002
3 300 A001
4 200 A003
This is how the data is processed in Spark for each operation performed on the logs:
logs                                // RDD("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003")
  .map(line => line.split(' '))     // RDD(Array(1,100,A001), Array(2,200,A002), Array(3,300,A001), Array(4,200,A003))
  .map(words => (words(2), 1))      // RDD((A001,1), (A002,1), (A001,1), (A003,1))
  .reduceByKey((v1, v2) => v1 + v2) // RDD((A001,2), (A002,1), (A003,1))
line.split(' ') splits a string into an Array of Strings: "Hello World" => Array("Hello", "World")
reduceByKey(_+_) runs a reduce operation, grouping the data by key; in this case it adds up all the values for each key. In the above example there were two occurrences of the user key A001, and the value associated with each of those keys was 1. These are reduced to the value 2 using the additive function (_ + _) provided to the reduceByKey method.
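To make the walkthrough above concrete, here is a hedged, self-contained sketch that can be pasted into spark-shell (where sc already exists); the log lines are the made-up sample from the table above:
val logs = sc.parallelize(Seq("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003"))

val hitsPerUser = logs
  .map(line => line.split(' '))   // Array(trans_id, amount, user_id)
  .map(words => (words(2), 1))    // (user_id, 1)
  .reduceByKey(_ + _)             // sum the 1s per user_id

hitsPerUser.collect()             // Array((A001,2), (A002,1), (A003,1))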
The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions that says:
reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] Merge the values for each key using an associative and commutative reduce function.
So it basically takes all the values per key and sums them together to a value that is a sum of all the values per key.
Now, you may be asking yourself:
What is the key?
The key to understanding the key (pun intended) is to see how keys are generated, and that's the role of the line
map(words => (words(2),1)).
This is where you take words and turn it into a pair of a key and 1.
This is a classic map-reduce algorithm where you give 1 to all keys to reduce them in the following step.
In the end, after this map you'll have a series of key-value pairs as follows:
(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)
I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
The series is created using the RDD.map operator, which takes a function that splits a single line and tokenizes it into words.
logs.map(line => line.split(' ')).
It reads:
For every line in logs, split the line into tokens using a space as the separator.
I'd change this line to use a regex like \\s+ so that any whitespace character is treated as part of the separator.
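As a small illustrative sketch of that suggestion (assuming the same logs RDD as above):
val userreqs = logs
  .map(line => line.split("\\s+"))   // any run of whitespace acts as the separator
  .map(words => (words(2), 1))
  .reduceByKey(_ + _)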
line.split(' ') splits each line on the space character, which returns an array of strings.
For example:
"hello spark scala".split(' ') gives [hello, spark, scala]
`reduceByKey((v1,v2) => v1 + v2)` is equivalent to `reduceByKey(_ + _)`
Here is how reduceByKey works https://i.stack.imgur.com/igmG3.gif and http://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
For the same key it keeps adding all the values.
Hope this helped!

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath:String = "myfile"
val myfile = sc.textFile(filePath);
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern; I want something more like a tuple of (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex as eliasah pointed out in a comment (probably the most succinct way to do this, using the direct tuple accessor syntax), or like so, using pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
  case (line, lineNumber) => line.contains("MyPattern")
}).collect
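If you want the result as (matching line, line number) tuples with 1-based line numbers, a hedged variant of the same zipWithIndex approach would be:
val matches = sc.textFile("myfile")
  .zipWithIndex()                                   // (line, 0-based index)
  .filter { case (line, _) => line.contains("MyPattern") }
  .map { case (line, index) => (line, index + 1) }  // 1-based line numbers
  .collect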

Count filtered records in scala

As I am new to Scala, this problem might look very basic to all.
I have a file called data.txt which contains the following:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find the records depending upon the numbers in xxx.lss.yyy23.com.
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the exact value:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason what you have above isn't working is that you need to split data back into an array of strings. Your previous statement actually concatenates the results into a single string, using a newline as the separator via mkString, so you can't run collection operations like count on it.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
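If you need both counts at once, one hypothetical single-pass variant (not from the original answer) is to partition the lines:
val (with33, without33) = data.split("\n").partition(_.contains("33"))
println(s"contain 33: ${with33.length}, do not contain 33: ${without33.length}")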

Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?

Eg if input string is helloworld I want the output to be like:
do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...
all the way up to the longest word that is an anagram of a substring of helloworld, like in Scrabble, for example.
The input string can be any length, but rarely more than 16 chars.
I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.
The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, the root being the singular zero-letter "word", the empty string. Each child of the root is a single first letter of a possible word, their children being the second letters of possible words, and so on, with each node marked as to whether it actually completes a word or not.
Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in the tree: if it is a valid word, make note; if it has children, re-call the tester function appending each of the remaining available letters, and so on.
So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.
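To make that description concrete, here is a rough Scala sketch of the letter-tree search; the tiny dictionary is made up purely for illustration, and a real word list would be loaded instead:
// Each node records whether the path to it spells a word, plus its child letters.
case class Node(isWord: Boolean = false, children: Map[Char, Node] = Map.empty) {
  def insert(word: String): Node =
    if (word.isEmpty) copy(isWord = true)
    else {
      val child = children.getOrElse(word.head, Node())
      copy(children = children.updated(word.head, child.insert(word.tail)))
    }
}

// Recursively extend the current prefix with each remaining input letter that has a child node.
def findWords(node: Node, prefix: String, remaining: List[Char], acc: Set[String]): Set[String] = {
  val withThis = if (node.isWord && prefix.nonEmpty) acc + prefix else acc
  remaining.distinct.foldLeft(withThis) { (found, c) =>
    node.children.get(c) match {
      case Some(child) => findWords(child, prefix + c, remaining.diff(List(c)), found)
      case None        => found
    }
  }
}

val root = List("he", "hell", "hello", "hold", "low", "world").foldLeft(Node())(_.insert(_))
findWords(root, "", "helloworld".toList, Set.empty)
// Set(he, hell, hello, hold, low, world)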
I couldn't resist my own implementation. It creates a dictionary by sorting each word's letters alphabetically and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain further speedups.
The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.
# Using the 38617 word dictionary at
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")
def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)
    if subwordlen > wordlen:
        return False
    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    for key in mydict.iterkeys():
        if containsLetters(key, word):
            output.extend(mydict[key])
    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()

mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)
    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]
An example run:
>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
The data structure you want is called a Directed Acyclic Word Graph (DAWG), and it is described by Andrew Appel and Guy Jacobson in their paper "The World's Fastest Scrabble Program", which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.
I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.
A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:
import itertools
import urllib
def words():
    f = urllib.urlopen(
        'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
    allwords = set(w[:-1] for w in f)
    f.close()
    return allwords

def substrings(s):
    for i in range(2, len(s)+1):
        for p in itertools.permutations(s, i):
            yield ''.join(p)

def main():
    w = words()
    print '%d words' % len(w)
    ss = set(substrings('weep'))
    print '%d substrings' % len(ss)
    good = ss & w
    print '%d good ones' % len(good)
    sgood = sorted(good, key=lambda w: (len(w), w))
    for aword in sgood:
        print aword

main()
will emit:
38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep
Of course, as other responses pointed out, organizing your data purposefully can greatly speed up your runtime, although the best data organization for a fast anagram finder could well be different; that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here, or millions?). Hash maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries, etc.
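As a rough illustration of the "signature" idea (anagrams share the same sorted-letter key), here is a hedged Scala sketch with a made-up word list:
// Group a word list by its sorted-letter "signature"; anagrams end up under the same key.
val words = Seq("hello", "low", "owl", "world", "we")
val bySignature: Map[String, Seq[String]] = words.groupBy(_.sorted)
bySignature("low")   // Seq(low, owl) -- "low" and "owl" share the signature "low"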
What you want is an implementation of a power set.
Also look at Eric Lippert's blog; he blogged about this very thing a little while back.
EDIT:
Here is an implementation I wrote of getting the powerset from a given string...
private IEnumerable<string> GetPowerSet(string letters)
{
    char[] letterArray = letters.ToCharArray();
    for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
    {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < letterArray.Length; j++)
        {
            int pos = Convert.ToInt32(Math.Pow(2.0, j));
            if ((pos & i) == pos)
            {
                sb.Append(letterArray[j]);
            }
        }
        yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
    }
}
This function gives me the power sets of chars that make up the passed-in string, which I can then use as keys into a dictionary of anagrams...
Dictionary<string,IEnumerable<string>>
I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the Scrabble tournament word list)
wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
            let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
            group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Like Tim J, Eric Lippert's blog posts were the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.
A nasality talisman for the sultana analyst
Santalic tailfans, part two
I believe the Ruby code in the answers to this question will also solve your problem.
I've been playing a lot of Wordfeud on my phone recently and was curious whether I could come up with some code to give me a list of possible words. The following code takes your available source letters (* for a wildcard) and an array with a master list of allowable words (TWL, SOWPODS, etc.) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.
I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.
public IList<string> Matches(string sourceLetters, string[] wordList)
{
    sourceLetters = sourceLetters.ToUpper();
    IList<string> matches = new List<string>();
    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }
    return matches;
}

public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";
    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }
        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }
    }
    return string.Equals(builtWord, targetWord);
}
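For readers following along in Scala, here is a hedged sketch of the same check (consume each target letter from the source letters, falling back to '*' as a wildcard); it is an illustration, not the author's code:
// Returns true if every letter of targetWord can be taken from sourceLetters,
// using '*' in sourceLetters as a wildcard for any missing letter.
def canBuild(targetWord: String, sourceLetters: String): Boolean = {
  def dropAt(s: String, i: Int) = s.substring(0, i) + s.substring(i + 1)
  val leftover = targetWord.foldLeft(Option(sourceLetters)) {
    case (Some(remaining), letter) =>
      val idx = remaining.indexOf(letter)
      if (idx >= 0) Some(dropAt(remaining, idx))
      else {
        val wild = remaining.indexOf('*')
        if (wild >= 0) Some(dropAt(remaining, wild)) else None
      }
    case (None, _) => None
  }
  leftover.isDefined
}

canBuild("WORLD", "W*RLD")   // true: the '*' stands in for the missing O
canBuild("WORLD", "WRLD")    // false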