How does reduceByKey work [duplicate] - scala

This question already has answers here:
reduceByKey: How does it work internally?
(5 answers)
Closed 5 years ago.
I am doing some work with Scala and Spark (beginner programmer and poster). The goal is to map each request (line) to a pair (userid, 1) and then sum the hits.
Can anyone explain in more detail what is happening on the 1st and 3rd lines, and what the => in line => line.split means?
Please excuse any errors in my post formatting as I am new to this website.
val userreqs = logs.map(line => line.split(' ')).
  map(words => (words(2),1)).
  reduceByKey((v1,v2) => v1 + v2)

Consider the hypothetical log below:
trans_id amount user_id
1 100 A001
2 200 A002
3 300 A001
4 200 A003
This is how the data is processed in Spark for each operation performed on the logs:
logs // RDD("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003")
.map(line => line.split(' ')) // RDD(Array(1,100,A001),Array(2,200,A002),Array(3,300,A001), Array(4,200,A003))
.map(words => (words(2),1)) // RDD((A001,1), (A002,1), (A001,1), (A003,1))
.reduceByKey((v1,v2) => v1+v2) // RDD((A001,2), (A002,1), (A003,1))
line.split(' ') splits a string into an Array of String: "Hello World" => Array("Hello", "World").
reduceByKey(_+_) runs a reduce operation, grouping the data by key; in this case it adds up all the values for each key. In the above example there were two occurrences of the user key A001, and the value associated with each of them was 1. These are reduced to the value 2 by the additive function (_ + _) passed to reduceByKey.
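A minimal, self-contained sketch of the whole pipeline that can be pasted into spark-shell (sc is the SparkContext the shell provides; the data is the hypothetical log above):
// Rebuild the hypothetical log as an RDD and run the same three operations.
val logs = sc.parallelize(Seq("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003"))
val userreqs = logs
  .map(line => line.split(' '))      // Array(trans_id, amount, user_id)
  .map(words => (words(2), 1))       // key on user_id, emit a 1 per hit
  .reduceByKey((v1, v2) => v1 + v2)  // sum the 1s per user_id
userreqs.collect().foreach(println)  // (A001,2), (A002,1), (A003,1)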

The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions that says:
reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] Merge the values for each key using an associative and commutative reduce function.
So it basically takes all the values per key and merges them into a single value, here the sum of all the values for that key.
Now, you may be asking yourself:
What is the key?
The key to understanding the key (pun intended) is to see how keys are generated, and that is the role of the line
map(words => (words(2),1)).
This is where you take words and turn it into a pair of the key (the third token, words(2)) and the value 1.
This is a classic map-reduce algorithm where you give 1 to all keys to reduce them in the following step.
In the end, after this map you'll have a series of key-value pairs as follows:
(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)
I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
The series is created using the RDD.map operator, which takes a function that splits a single line and tokenizes it into words.
logs.map(line => line.split(' ')).
It reads:
For every line in logs, split the line into tokens using ' ' (space) as the separator.
I'd change this line to use a regex like \\s+ so that any whitespace character is treated as part of the separator.
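A sketch of that change, assuming logs is the RDD from the question:
// Split on any run of whitespace (tabs, multiple spaces, ...) instead of a single ' '.
val userreqs = logs
  .map(line => line.split("\\s+"))
  .map(words => (words(2), 1))
  .reduceByKey(_ + _)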

line.split(' ') splits each line on the space character, which returns an array of strings.
For example:
"hello spark scala".split(' ') gives [hello, spark, scala]
`reduceByKey((v1,v2) => v1 + v2)` is equivalent to `reduceByKey(_ + _)`
Here is how reduceByKey works https://i.stack.imgur.com/igmG3.gif and http://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
For the same key it keeps adding all the values.
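For example, the explicit and shorthand forms give the same result (sc assumed, as in spark-shell):
// Two ways of writing the same reduce function.
val pairs = sc.parallelize(Seq(("A001", 1), ("A001", 1), ("A002", 1)))
pairs.reduceByKey((v1, v2) => v1 + v2).collect()  // Array((A001,2), (A002,1))
pairs.reduceByKey(_ + _).collect()                // same result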
Hope this helped!

Related

How to modify this code in Scala by using Brackets

I have a Spark dataframe in Databricks, with an ID and 200 other columns (like a pivot view of the data). I would like to unpivot this data to make a tall object with half of the columns, where I'll end up with 100 rows per id. I'm using the stack function with specific column names.
Question is this: I'm new to Scala and similar languages, and unfamiliar with best practices on how to use brackets when literals are spread over multiple rows as below. Can I replace the double quotes and + with something else?
%scala
val unPivotDF = hiveDF.select($"id",
  expr("stack(100, " +
    "'cat1', cat1, " +
    "'cat2', cat2, " +
    "'cat3', cat3, " +
    //...
    "'cat99', cat99, " +
    "'cat100', cat100) as (Category,Value)"))
  .where("Value is not null")
You can use """ to define multiline strings like:
"""
some string
over multiple lines
"""
In your case this will only work assuming that the string you're writing tolerates new lines.
Considering how repetitive it is, you could also generate the string with something like:
(1 to 100)
.map(i => s"'cat$i', cat$i")
.mkString(",")
(To be adapted by the reader to exact needs)
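Putting those pieces together, a sketch (assuming hiveDF has columns id and cat1 ... cat100 as in the question, and that spark.implicits._ is in scope, as it is in a Databricks notebook):
import org.apache.spark.sql.functions.expr

// Generate the 100 "'catN', catN" pairs instead of typing them by hand.
val stackArgs = (1 to 100).map(i => s"'cat$i', cat$i").mkString(", ")
val unPivotDF = hiveDF
  .select($"id", expr(s"stack(100, $stackArgs) as (Category, Value)"))
  .where("Value is not null")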
Edit: and to answer your initial question: brackets won't help in any way here.

Spark (Scala) modify the contents of a Dataset Column

I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", input_file_name)
tokens.show()
However this returns something like
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last bit). My idea to fix this was to use the map function
val tokensNoPath = tokens.
map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, just going to every row, grabbing the second entry, and deleting everything before the last slash.
However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.
Docs:
substring_index "substring_index(str, delim, count) Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name, "/", -1))
You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))
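Both snippets need the Spark SQL functions in scope; a minimal sketch applying them to the tokens DataFrame from the question:
import org.apache.spark.sql.functions.{col, element_at, split, substring_index}

// Option 1: keep everything after the last "/" via substring_index.
val byIndex = tokens.withColumn("filename", substring_index(col("filename"), "/", -1))
// Option 2: split on "/" and take the last element.
val byElement = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))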

Spark/Scala group similar words and count

I am trying to group and count words in an RDD such that a word ending with s or ly is counted together with its base word.
hi
yes
love
know
hi
knows
loves
lovely
Expected output:
hi 2
yes 1
love 3
know 2
This is what I currently have:
data.map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
Any help is appreciated regarding adding the s/ly condition.
It seems that you want to count the stems of the words in your input list. The process of finding the stem of a word in computational linguistics is called stemming. If your goal is to handle s and ly at the end of the words in your input list, you can remove them in a map step and then count the remaining parts. Bear in mind that there are side effects to removing s and ly blindly: for instance, for a word that ends with s, like "is", you would end up counting "i". A better solution is to use an existing stemmer such as Porter or the one available in Stanford CoreNLP.
listRdd.mapToPair(t -> new Tuple2<>(t.replaceAll("(ly|s)$", ""), 1))
    .reduceByKey((a, b) -> a + b).collect();
the second solution which can help to overcome other suffixes too is using stemmers:
listRdd.mapToPair(t -> {
    Stemmer stemmer = new Stemmer();
    return new Tuple2<>(stemmer.stem(t), 1);
}).reduceByKey((a, b) -> a + b).collect();
Stemmer here can be replaced with any implementation of a stemmer.
For more information about stemmers and lemmatizers, see https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
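Since the rest of this page uses Scala, here is a rough Scala equivalent of the regex-based version above, assuming data is an RDD[String] of single words as in the question:
// Strip a trailing "s" or "ly" before counting; the same caveats about blind stripping apply.
val counts = data
  .map(word => (word.replaceAll("(ly|s)$", ""), 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)  // (hi,2), (love,3), (know,2), (ye,1) -- note "yes" became "ye"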
If you want to group together words that finish with 's' or 'ly' here is how I would do it:
data
.map(word => (if (word.endsWith("s") || word.endsWith("ly")) "s/ly-words" else word, 1))
.reduceByKey(_+_)
.collect
If you want to separate 'ly' words from 's' words from the rest:
data
.map(word => (if (word.endsWith("s")) "s-words" else if (word.endsWith("ly")) "ly-words" else word, 1))
.reduceByKey(_+_)
.collect
If you want to count words that end with 'ly' or 's' as if they did not end with it (eg. 'love', 'lovely', 'loves' are counted as being 'love'):
data
.map(word => (if (word.endsWith("s")) word.slice(0, word.length-1) else if (word.endsWith("ly")) word.slice(0, word.length-2) else word, 1))
.reduceByKey(_+_)
.collect

How to output field padding in file Scala spark?

I have a text file. Now I want to pad the fields in the file as shown in Exp1 and Exp2.
What should I do?
This is my input:
a
a a
a a a
a a a a
a a a a a
Exp1. Fill the remaining fields with the _ character when a record in the file has fewer than n=4 fields.
a _ _ _
a a _ _
a a a _
a a a a
a a a a a
Exp2. Same as above, but also delete the fields after the n=4th field when the number of fields in a record exceeds n.
a _ _ _
a a _ _
a a a _
a a a a
a a a a
My code:
val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show
This resembles a homework-style problem, so I will help guide you based on your provided code and try to lead you on the right path here.
Your current code is only changing the name of the columns. In this case, the column name "value" is being changed to "v_lue".
You want to change the actual records themselves.
First, you want to read this data into an RDD. It can be done with a dataframe, but being able to map on the row strings
instead of Row objects might make this easier to understand conceptually. I'll get you started.
val data = sc.textFile("data.txt")
Data will be an RDD of strings, where each element is a line in the data file.
We're going to want to map this data to some new data, and transform each row.
data.map(row => {
// transform each row here
})
Inside this map we make some change to row, which is a string. The code inside applies to every string in the RDD.
You will probably want to split the row to get an array of strings, so that you can count how many occurrences
of 'a' there are. Depending on the size of the array, you will want to create a new string and output that from this map.
If there are fewer 'a's than n, you will probably want to create a string with enough '_'s. If there are too many,
you will probably want to return a string with the correct number.
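For completeness, one possible shape of that map as a sketch only (n = 4 and space-separated fields are assumptions):
// Pad short records with "_"; keep .take(n) to also truncate long records (Exp2),
// or drop it to leave long records untouched (Exp1).
val n = 4
val result = data.map { row =>
  val fields = row.split(" ")
  fields.padTo(n, "_").take(n).mkString(" ")
}
result.collect().foreach(println)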
Hope this helps.

Count filtered records in scala

As I am new to Scala, this problem might look very basic to you all.
I have a file called data.txt which contains lines like the below:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find the records depending on the number in the host name (e.g. the 23 in xxx.lss.yyy23.com).
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is what I am trying in order to count the matching records:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason what you have above isn't working is that you need to split data into an array of strings again. Your previous statement concatenated the result into a single string, using newline as the separator, via mkString, so you can't run collection operations like count on it.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
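A variant that avoids the join-then-re-split round trip is to keep the host names in a collection from the start (same data.txt format assumed):
// Keep the host names as a List so count can be applied to it directly.
val hosts = io.Source.fromFile("data.txt").getLines()
  .map(line => line.split("-->")(0))
  .toList
val with33    = hosts.count(_.contains("33"))
val without33 = hosts.count(h => !h.contains("33"))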