How to use saveAsTextFile on AWS? - scala

I want to print some output with println, but when running on AWS that may not work. How can I save the content I would otherwise print to a file on AWS using saveAsTextFile?
The original content of println is as following:
println("\n[ First output is ]")
output1.foreach(a => println("(" + a +"," + titles(a - 1) + ")"));
println("\n[ Second output is ]")
output2.foreach(a => println("(" + a +"," + titles(a - 1) + ")"));
output1 and output2 are both lists of numbers. titles is also a list.
Thanks.

If both are Lists, you can convert them into RDDs using SparkContext's parallelize method.
val rdd1 = sc.parallelize(List("[ First output is ]") ++ output1.map(a => "(" + a + "," + titles(a - 1) + ")"))
val rdd2 = sc.parallelize(List("[ Second output is ]") ++ output2.map(a => "(" + a + "," + titles(a - 1) + ")"))
After this you can use saveAsTextFile with your desired S3 path. Note that saveAsTextFile writes a directory of part files at that path rather than a single file.
rdd1.saveAsTextFile("s3://yourAccessKey:yourSecretKey#/out1.txt")
rdd2.saveAsTextFile("s3://yourAccessKey:yourSecretKey#/out2.txt")
I recommend reading this blog post, which might help you understand some important things about S3 and Apache Spark: Writing s3 data with Apache Spark
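As a minimal sketch of the line-building step, the formatting can be kept in a plain function and checked locally before anything is handed to Spark. The sample values and the names OutputLines and formatLines below are hypothetical and only stand in for the asker's data:

```scala
object OutputLines {
  // Hypothetical sample data standing in for the asker's lists.
  val titles: List[String] = List("Alpha", "Beta", "Gamma")
  val output1: List[Int] = List(1, 3)

  // Builds the header line followed by one "(number,title)" line per entry.
  // The numbers are treated as 1-based indices into titles, as in the question.
  def formatLines(header: String, output: List[Int], titles: List[String]): List[String] =
    header :: output.map(a => s"($a,${titles(a - 1)})")

  val lines1: List[String] = formatLines("[ First output is ]", output1, titles)
  // Assuming a SparkContext named sc is in scope, these lines could then be
  // written with: sc.parallelize(lines1).saveAsTextFile(path)
}
```

Keeping the formatting separate from the Spark calls means the output can be inspected without access to a cluster or an S3 bucket.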

Related

How to optimize this Scala code?

I'm learning Scala, curious how to optimize this code. What I have is an RDD loaded from Spark. It's a tab delimited dataset. I want to combine the first column with the second column, and append it as a new column to the end of the dataset, with a "-" separating the two.
For example:
column1\tcolumn2\tcolumn3
becomes
column1\tcolumn2\tcolumn3\tcolumn1-column2
val f = sc.textFile("path/to/dataset")
f.map(line => if (line.split("\t").length > 1)
line.split("\t") :+ line.split("\t")(0)+"-"+line.split("\t")(1)
else
Array[String]()).map(a => a.mkString("\t")
)
.saveAsTextFile("output/path")
Try:
f.map { line =>
  val cols = line.split("\t")
  if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
  else line
}
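The per-line logic can also be pulled into a named function so it can be checked without Spark. This is only a sketch; the names AppendKey and appendKey are made up for illustration:

```scala
object AppendKey {
  // Appends "col1-col2" as a new tab-separated column when the line has
  // at least two columns; otherwise returns the line unchanged.
  def appendKey(line: String): String = {
    val cols = line.split("\t") // split once and reuse the result
    if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
    else line
  }
}
```

With the RDD f from the question, this could presumably be used as f.map(AppendKey.appendKey).saveAsTextFile("output/path").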

Functional Way of handling corner case in Folds

I have a list of nodes (Strings) that I want to convert into something like the following.
create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"})
Using a fold I get everything with an extra "," at the end. I can remove that using a substring on the final String. I was wondering if there is a better/more functional way of doing this in Scala ?
val nodes = List("A", "B", "B", "C", "D", "F")
val str = nodes.map( x => "({name:\"" + x + "\"}),").foldLeft("create X ")( (acc, curr) => acc + curr )
println(str)
//create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"}),
Solution 1
You could use the mkString function, which won't append the separator at the end.
In this case you first map each element to the corresponding String and then use mkString to put the ',' in between.
Since the "create X" is static in the beginning you could just prepend it to the result.
val str = "create X " + nodes.map("({name:\"" + _ + "\"})").mkString(",")
Solution 2
Another way to see this: since you append exactly one ',' too many, you could just remove it.
val str = nodes.foldLeft("create X ")((acc, x) => acc + "({name:\"" + x + "\"}),").init
init just takes all elements from a collection, except the last.
(A string is seen as a collection of chars here)
So in a case where there are elements in your nodes, you would remove a ','. When there is none you only get "create X " and therefore remove the white-space, which might not be needed anyways.
Solution 1 and 2 are not equivalent when nodes is empty. Solution 1 would keep the white-space.
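To make the difference on an empty list concrete, here is a small sketch that wraps both solutions in functions (the names EmptyCase, viaMkString and viaInit are made up for illustration):

```scala
object EmptyCase {
  // Solution 1: mkString only puts the separator between elements.
  def viaMkString(nodes: List[String]): String =
    "create X " + nodes.map("({name:\"" + _ + "\"})").mkString(",")

  // Solution 2: append a comma after every element, then drop the last character.
  def viaInit(nodes: List[String]): String =
    nodes.foldLeft("create X ")((acc, x) => acc + "({name:\"" + x + "\"}),").init
}
```

For a non-empty list both return the same string. For Nil, viaMkString returns "create X " with the trailing space, while viaInit returns "create X" without it.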
Joining a bunch of things, splicing something "in between" each of the things, isn't a map-shaped problem. So adding the comma in the map call doesn't really "fit".
I generally do this sort of thing by inserting the comma before each item during the fold; the fold can test whether the accumulator is "empty" and not insert a comma.
For this particular case (string joining) it's so common that there's already a library function for it: mkString.
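A rough sketch of the fold described here, where the comma goes in front of each item unless the accumulator is still empty (the object name JoinWithFold is made up for illustration):

```scala
object JoinWithFold {
  val nodes = List("A", "B", "B", "C", "D", "F")

  // Insert the comma before each item, except when the accumulator is still empty.
  val body: String = nodes.foldLeft("") { (acc, x) =>
    val item = "({name:\"" + x + "\"})"
    if (acc.isEmpty) item else acc + "," + item
  }

  // The static prefix is added once, outside the fold.
  val str: String = "create X " + body
}
```

This gives the same result as the mkString version, which remains the shorter choice for plain string joining.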
Move "," from map (which applies to every element) to fold/reduce:
val str = "create X " + nodes.map( x => "({name:\"" + x + "\"})").reduceLeftOption( _ +","+ _ ).getOrElse("")

Scala closures and underscore (_) symbol

Why can I write something like this without compilation errors:
wordCount foreach(x => println("Word: " + x._1 + ", count: " + x._2)) // wordCount - is Map
i.e. I declared the x variable.
But I can't use magic _ symbol in this case:
wordCount foreach(println("Word: " + _._1 + ", count: " + _._2)) // wordCount - is
You should check this answer about placeholder syntax.
Two underscores mean two consecutive variables, so using println(_ + _) is a placeholder equivalent of (x, y) => println(x + y)
In the first example, you just have a regular Tuple, which has accessors for the first (._1) and second (._2) elements.
This means that you can't use placeholder syntax when you want to reference a single variable more than once.
Every underscore is positional. So your code is desugared to
wordCount foreach((x, y) => println("Word: " + x._1 + ", count: " + y._2))
Thanks to this, List(...).reduce(_ + _) is possible.
Moreover, since the expansion is made relative to the closest enclosing parentheses, it will actually look like:
wordCount foreach(println((x, y) => "Word: " + x._1 + ", count: " + y._2))
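A short sketch of what does work, assuming a small sample wordCount map (the data and the name PlaceholderDemo are made up for illustration): use underscores only where each parameter appears exactly once, and name the element or destructure it when it is needed twice.

```scala
object PlaceholderDemo {
  val wordCount: Map[String, Int] = Map("spark" -> 2, "scala" -> 3)

  // Each underscore is a separate parameter, so _ + _ means (x, y) => x + y.
  val total: Int = wordCount.values.reduce(_ + _)

  // To use one element twice, name it or destructure the tuple with a pattern.
  // The entries are sorted by word first so the order is predictable.
  val lines: List[String] = wordCount.toList.sortBy(_._1).map {
    case (word, count) => "Word: " + word + ", count: " + count
  }
}
```

The same pattern works with foreach: wordCount foreach { case (word, count) => println("Word: " + word + ", count: " + count) }.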

More elegant scala code

I am starting to learn Scala. I wonder if anyone has a better way to rewrite the code below in a more functional way. I know there must be one.
val buf = ((addr>>24)&0xff) + "." + ((addr>>16)&0xff) + "." + ((addr>>8)&0xff) + "." + ((addr)&0xff)
This generates the Range(24, 16, 8, 0) with (24 to 0 by -8) and then applies the function addr >> _ & 0xff to each number using map. Lastly, the mapped Range of numbers is "joined" with . to create a string.
The map is more functional than using the + operator but the rest is just syntactic sugar and a library call to mkString.
val addr = 1024
val buf = (24 to 0 by -8).map(addr >> _ & 0xff).mkString(".")
buf: java.lang.String = 0.0.4.0
val buf = List(24,16,8,0).map(addr >> _).map(_ & 0xff).mkString(".")
Here's how I would do it, similar to Brian's answer but with a short list of values and two simple map() methods using Scala's famous '_' operator. Great question!
Some would find the for comprehension a little bit more readable:
(for (pos <- 24 to 0 by -8) yield addr >> pos & 0xff) mkString "."
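The byte extraction used in these answers can be wrapped in a small function and checked against a few known values. This is only a sketch; the names DottedQuad and toDotted are made up for illustration:

```scala
object DottedQuad {
  // Formats the four bytes of a 32-bit value, most significant byte first.
  // The explicit parentheses spell out that the shift happens before the mask.
  def toDotted(addr: Int): String =
    (24 to 0 by -8).map(shift => (addr >> shift) & 0xff).mkString(".")
}
```

Because every shifted value is masked with 0xff, the sign extension from >> on negative Ints does not affect the result, so addresses with the top bit set (such as 0xC0A80001) are formatted correctly too.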
The advantage of this version is that the input can be any number of integers:
// trick
implicit class When[F](fun: F) {
  def when(cond: F => Boolean)(tail: F => F) = if (cond(fun)) tail(fun) else fun
}
// actual one-liner: left-pad the hex string to a multiple of 8 digits,
// then read it two digits (one byte) at a time
12345678.toHexString.when(1 to 8 contains _.length % 8)(s => "0" * (8 - s.length % 8) + s)
  .grouped(2).map(Integer.parseInt(_, 16)).mkString(".")
// 0.188.97.78
// a very big IPv7
BigInt("123456789012345678901").toString(16).when(1 to 8 contains _.length % 8)(s => "0" * (8 - s.length % 8) + s)
  .grouped(2).map(Integer.parseInt(_, 16)).mkString(".")
// 0.0.0.6.177.78.159.129.47.54.108.53
EDIT
An explanation, because of the downvotes: the implicit class When could just as well be a library class. It works in 2.10 and lets you conditionally execute some of the functions in a call chain. I did not measure performance, and don't care, because the example itself is meant to illustrate what is possible, elegant or not.

Why do I have to explicitly state Tuple2(a, b) to be able to use Map add in a foldLeft?

I wish to create a Map keyed by name containing the count of things with that name. I have a list of the things with name, which may contain more than one item with the same name. Coded like this I get an error "type mismatch; found : String required: (String, Int)":
//variation 0, produces error
(Map[String, Int]() /: entries)((r, c) => { r + (c.name, if (r.contains(c.name)) (c.name) + 1 else 1) })
This confuses me, as I thought (a, b) was a Tuple2 and therefore suitable for use with Map add. Either of the following variations works as expected:
//variation 1, works
(Map[String, Int]() /: entries)((r, c) => { r + Tuple2(c.name, if (r.contains(c.name)) (c.name) + 1 else 1) })
//variation 2, works
(Map[String, Int]() /: entries)((r, c) => {
  val e = (c.name, if (r.contains(c.name)) (c.name) + 1 else 1)
  r + e })
I'm unclear on why there is a problem with my first version; can anyone advise? I am using Scala-IDE 2.0.0 beta 2 to edit the source; the error is from the Eclipse Problems window.
When passing a single tuple argument to a method used with operator notation, like your + method, you should use double parentheses:
(Map[String, Int]() /: entries)((r, c) => { r + ((c.name, r.get(c.name).map(_ + 1).getOrElse(1) )) })
I've also changed the computation of the Int, which looks funny in your example…
Because + is also used to concatenate things with strings. In this case, the parentheses are not taken to mean a tuple, but an argument list.
Scala has used + for other stuff, which resulted in all sorts of problems, just like the one you mention.
Replace + with updated, or use -> instead of ,.
r + (c.name, if (r.contains(c.name)) (c.name) + 1 else 1)
is parsed as
r.+(c.name, if (r.contains(c.name)) (c.name) + 1 else 1)
So the compiler looks for a + method with 2 arguments on Map and doesn't find it. The form I prefer over double parentheses (as Jean-Philippe Pellet suggests) is
r + (c.name -> (if (r.contains(c.name)) (c.name) + 1 else 1))
UPDATE:
If Pellet is correct, it's better to write
r + (c.name -> (r.getOrElse(c.name, 0) + 1))
(and of course James Iry's solution expresses the same intent even better).
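Putting the pieces together, here is a minimal self-contained sketch of the counting fold. The Entry case class and the sample entries are hypothetical stand-ins for the asker's data:

```scala
object NameCounts {
  // Hypothetical stand-in for the asker's type; only the name field matters here.
  case class Entry(name: String)

  val entries = List(Entry("a"), Entry("b"), Entry("a"))

  // r + (k -> v) passes a single Tuple2 to Map's + method. The inner
  // parentheses around the sum are needed because -> and + have the same
  // precedence and would otherwise be grouped from left to right.
  val counts: Map[String, Int] =
    entries.foldLeft(Map.empty[String, Int]) { (r, c) =>
      r + (c.name -> (r.getOrElse(c.name, 0) + 1))
    }
}
```

The double-parentheses form r + ((c.name, r.getOrElse(c.name, 0) + 1)) should behave the same way; which one reads better is a matter of taste.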