reduceByKey method not being found in IntelliJ - scala

Here is code I'm trying out for reduceByKey :
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.math.random
import org.apache.spark._
import org.apache.spark.storage.StorageLevel
object MapReduce {
def main(args: Array[String]) {
val sc = new SparkContext("local[4]" , "")
val file = sc.textFile("c:/data-files/myfile.txt")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
}
}
Is giving compiler error : "cannot resolve symbol reduceByKey"
When I hover over implementation of reduceByKey it gives three possible implementations so it appears it is being found ?:

You need to add the following import to your file:
import org.apache.spark.SparkContext._
Spark documentation:
"In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in
tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions."

It seems as if the documented behavior has changed in Spark 1.4.x. To have IntelliJ recognize the implicit conversions you now have to add the following import:
import org.apache.spark.rdd.RDD._

I have noticed that at times IJ is unable to resolve methods that are imported implicitly via PairRDDFunctions https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala .
The methods implicitly imported include the reduceByKey* and reduceByKeyAndWindow* methods. I do not have a general solution at this time -except that yes you can safely ignore the intellisense errors

Related

how to convert java stream to scala stream? [duplicate]

As a part of an effort of converting Java code to Scala code, I need to convert the Java stream Files.walk(Paths.get(ROOT)) to Scala. I can't find a solution by googling. asScala won't do it. Any hints?
Here is the related code:
import static org.springframework.hateoas.mvc.ControllerLinkBuilder.linkTo;
import static org.springframework.hateoas.mvc.ControllerLinkBuilder.methodOn;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
// ...snip...
Files.walk(Paths.get(ROOT))
.filter(path -> !path.equals(Paths.get(ROOT)))
.map(path -> Paths.get(ROOT).relativize(path))
.map(path -> linkTo(methodOn(FileUploadController.class).getFile(path.toString())).withRel(path.toString()))
.collect(Collectors.toList()))
The Files.walk(Paths.get(ROOT)) return type is Stream<Path> in Java.
There is a slightly nicer way without needing the compat layer or experimental 2.11 features mentioned here by #marcospereira
Basically just use an iterator:
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
Files.list(Paths.get(".")).iterator().asScala
Starting Scala 2.13, the standard library includes scala.jdk.StreamConverters which provides Java to Scala implicit stream conversions:
import scala.jdk.StreamConverters._
val javaStream = Files.walk(Paths.get("."))
// javaStream: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$3#51b1d486
javaStream.toScala(LazyList)
// scala.collection.immutable.LazyList[java.nio.file.Path] = LazyList(?)
javaStream.toScala(Iterator)
// Iterator[java.nio.file.Path] = <iterator>
Note the usage of LazyList (as opposed to Stream) as Streams are deprecated in Scala 2.13. LazyList is the supported replacement type.
Java 8 Streams and Scala Streams are conceptually different things; the Java 8 Stream is not a collection, so the usual collection converter won't work. You can use the scala-java8-compat (github) library to add a toScala method to Java Streams:
import scala.compat.java8.StreamConverters._
import java.nio.file.{ Files, Path, Paths }
val scalaStream: Stream[Path] = Files.walk(Paths.get(".")).toScala[Stream]
You can't really use this conversion (Java->Scala) from Java, so if you have to do this from Java, it's easier (but still awkward) to just run the stream and build the Scala Stream yourself (which is what the aforementioned library is doing under the hood):
import scala.collection.immutable.Stream$;
import scala.collection.mutable.Builder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
final Stream<Path> stream = Files.walk(Paths.get("."));
final Builder<Path, scala.collection.immutable.Stream<Path>> builder = Stream$.MODULE$.newBuilder();
stream.forEachOrdered(builder::$plus$eq);
final scala.collection.immutable.Stream<Path> result = builder.result();
However, both ways will fully consume the Java Stream, so you don't get the benefit of the lazy evaluation by converting it to a Scala Stream and might as well just convert it directly to a Vector. If you just want to use the Scala function literal syntax, there different ways to achieve this. You could use the same library to use function converters, similar to collection converters:
import scala.compat.java8.FunctionConverters._
import java.nio.file.{ Files, Path, Paths }
val p: Path => Boolean = p => Files.isExecutable(p)
val stream: java.util.stream.Stream[Path] = Files.walk(Paths.get(".")).filter(p.asJava)
Alternatively since 2.11, Scala has experimental support for SAM types under the -Xexperimental flag. This will be non-experimental without a flag in 2.12.
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92).
Type in expressions for evaluation. Or try :help.
scala> import java.nio.file.{ Files, Path, Paths }
import java.nio.file.{Files, Path, Paths}
scala> Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
<console>:13: error: missing parameter type
Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
^
scala> :set -Xexperimental
scala> Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
res1: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$2#589838eb
scala> Files.walk(Paths.get(".")).filter(Files.isExecutable)
res2: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$2#185d8b6

ParTraverse Not a Value of NonEmptyList

I am following the instructions on the cats IO website to run a sequence of effects in parallel:
My code looks like:
val maybeNonEmptyList: Option[NonEmptyList[Urls]] = NonEmptyList.fromList(urls)
val maybeDownloads: Option[IO[NonEmptyList[Either[Error, Files]]]] = maybeNonEmptyList map { urls =>
urls.parTraverse(url => downloader(url))
}
But I get a compile time error saying:
value parTraverse is not a member of cats.data.NonEmptyList[Urls]
[error] urls.parTraverse(url => downloader(url))
I have imported the following:
import cats.data.{EitherT, NonEmptyList}
import cats.effect.{ContextShift, IO, Timer}
import cats.implicits._
import cats.syntax.parallel._
and also i have the following implicits:
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)
Why do i still get the problem?
This is caused because the implicit enrichment is being imported twice, so it becomes ambiguous
import cats.implicits._
import cats.syntax.parallel._
As of recent versions of cats, implicits imports are never required, only syntax
The recommended pattern is to do only import cats.syntax.all._

Change Java List to Scala Seq?

I have the following list from my configuration:
val markets = Configuration.getStringList("markets");
To create a sequence out of it I write this code:
JavaConverters.asScalaIteratorConverter(markets.iterator()).asScala.toSeq
I wish I could do it in a less verbose way, such as:
markets.toSeq
And then from that list I get the sequence. I will have more configuration in the near future; is there a solution that provides this kind of simplicity?
I want a sequence regardless of the configuration library I am using. I don't want to have the stated verbose solution with the JavaConverters.
JavaConversions is deprecated since Scala 2.12.0. Use JavaConverters; you can import scala.collection.JavaConverters._ to make it less verbose:
import scala.collection.JavaConverters._
val javaList = java.util.Arrays.asList("one", "two")
val scalaSeq = javaList.asScala.toSeq
Yes. Just import implicit conversions:
import java.util
import scala.collection.JavaConversions._
val jlist = new util.ArrayList[String]()
jlist.toSeq

How to convert a Java Stream to a Scala Stream?

As a part of an effort of converting Java code to Scala code, I need to convert the Java stream Files.walk(Paths.get(ROOT)) to Scala. I can't find a solution by googling. asScala won't do it. Any hints?
Here is the related code:
import static org.springframework.hateoas.mvc.ControllerLinkBuilder.linkTo;
import static org.springframework.hateoas.mvc.ControllerLinkBuilder.methodOn;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
// ...snip...
Files.walk(Paths.get(ROOT))
.filter(path -> !path.equals(Paths.get(ROOT)))
.map(path -> Paths.get(ROOT).relativize(path))
.map(path -> linkTo(methodOn(FileUploadController.class).getFile(path.toString())).withRel(path.toString()))
.collect(Collectors.toList()))
The Files.walk(Paths.get(ROOT)) return type is Stream<Path> in Java.
There is a slightly nicer way without needing the compat layer or experimental 2.11 features mentioned here by #marcospereira
Basically just use an iterator:
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
Files.list(Paths.get(".")).iterator().asScala
Starting Scala 2.13, the standard library includes scala.jdk.StreamConverters which provides Java to Scala implicit stream conversions:
import scala.jdk.StreamConverters._
val javaStream = Files.walk(Paths.get("."))
// javaStream: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$3#51b1d486
javaStream.toScala(LazyList)
// scala.collection.immutable.LazyList[java.nio.file.Path] = LazyList(?)
javaStream.toScala(Iterator)
// Iterator[java.nio.file.Path] = <iterator>
Note the usage of LazyList (as opposed to Stream) as Streams are deprecated in Scala 2.13. LazyList is the supported replacement type.
Java 8 Streams and Scala Streams are conceptually different things; the Java 8 Stream is not a collection, so the usual collection converter won't work. You can use the scala-java8-compat (github) library to add a toScala method to Java Streams:
import scala.compat.java8.StreamConverters._
import java.nio.file.{ Files, Path, Paths }
val scalaStream: Stream[Path] = Files.walk(Paths.get(".")).toScala[Stream]
You can't really use this conversion (Java->Scala) from Java, so if you have to do this from Java, it's easier (but still awkward) to just run the stream and build the Scala Stream yourself (which is what the aforementioned library is doing under the hood):
import scala.collection.immutable.Stream$;
import scala.collection.mutable.Builder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
final Stream<Path> stream = Files.walk(Paths.get("."));
final Builder<Path, scala.collection.immutable.Stream<Path>> builder = Stream$.MODULE$.newBuilder();
stream.forEachOrdered(builder::$plus$eq);
final scala.collection.immutable.Stream<Path> result = builder.result();
However, both ways will fully consume the Java Stream, so you don't get the benefit of the lazy evaluation by converting it to a Scala Stream and might as well just convert it directly to a Vector. If you just want to use the Scala function literal syntax, there different ways to achieve this. You could use the same library to use function converters, similar to collection converters:
import scala.compat.java8.FunctionConverters._
import java.nio.file.{ Files, Path, Paths }
val p: Path => Boolean = p => Files.isExecutable(p)
val stream: java.util.stream.Stream[Path] = Files.walk(Paths.get(".")).filter(p.asJava)
Alternatively since 2.11, Scala has experimental support for SAM types under the -Xexperimental flag. This will be non-experimental without a flag in 2.12.
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92).
Type in expressions for evaluation. Or try :help.
scala> import java.nio.file.{ Files, Path, Paths }
import java.nio.file.{Files, Path, Paths}
scala> Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
<console>:13: error: missing parameter type
Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
^
scala> :set -Xexperimental
scala> Files.walk(Paths.get(".")).filter(p => Files.isExecutable(p))
res1: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$2#589838eb
scala> Files.walk(Paths.get(".")).filter(Files.isExecutable)
res2: java.util.stream.Stream[java.nio.file.Path] = java.util.stream.ReferencePipeline$2#185d8b6

getOrElse method not being found in Scala Spark

Attempting to follow example in Sandy Ryza's book Advanced Analytics with Spark, coding using IntelliJ. Below I seem to have imported all the right libraries, but why is it not recognizing getOrElse?
Error:(84, 28) value getOrElse is not a member of org.apache.spark.rdd.RDD[String]
bArtistAlias.value.getOrElse(artistID, artistID)
^
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation._
val trainData = rawUserArtistData.map { line =>
val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
Rating(userID, finalArtistID, count)
}.cache()
I can only make an assumption as the code listed is missing pieces, but my guess is that bArtistAlias is supposed to be a Map that SHOULD be broadcast, but isnt.
I went and found the piece of code in Sandy's book and it corroborates my guess. So, you seem to be missing this piece:
val bArtistAlias = sc.broadcast(artistAlias)
I am not even sure what you did without the code, but it looks like you broadcast an RDD[String], thus the error.....this would not even work anyway as you cannot work with another RDD inside of an RDD