I am following the instructions in the cats-effect IO documentation to run a sequence of effects in parallel.
My code looks like this:
val maybeNonEmptyList: Option[NonEmptyList[Urls]] = NonEmptyList.fromList(urls)
val maybeDownloads: Option[IO[NonEmptyList[Either[Error, Files]]]] = maybeNonEmptyList map { urls =>
urls.parTraverse(url => downloader(url))
}
But I get a compile-time error saying:
value parTraverse is not a member of cats.data.NonEmptyList[Urls]
[error] urls.parTraverse(url => downloader(url))
I have imported the following:
import cats.data.{EitherT, NonEmptyList}
import cats.effect.{ContextShift, IO, Timer}
import cats.implicits._
import cats.syntax.parallel._
I also have the following implicits in scope:
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)
Why do I still get this error?
This happens because the implicit enrichment is imported twice, which makes it ambiguous, so the compiler silently drops the extension method:
import cats.implicits._
import cats.syntax.parallel._
As of recent versions of cats, imports of type class instances are never required; only the syntax imports are.
The recommended pattern is to use a single import cats.syntax.all._ and drop the others.
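For reference, a minimal sketch of the fixed snippet under that pattern; Urls, Error, Files, urls and downloader are the ones from the question:
import cats.data.NonEmptyList
import cats.effect.{ContextShift, IO}
import cats.syntax.all._ // single syntax import: provides parTraverse without ambiguity
import scala.concurrent.ExecutionContext

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

// Urls, Error, Files, urls and downloader come from the question
val maybeDownloads: Option[IO[NonEmptyList[Either[Error, Files]]]] =
  NonEmptyList.fromList(urls).map(_.parTraverse(downloader))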
I wrote the following code, which aims to transform a DataFrame into a Dataset using a case class:
def toDs[T](df: DataFrame): Dataset[T] = {
df.as[T]
}
and the case class: case class DATA(name: String, age: Double, location: String)
I am getting:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] df.as[T]
Any idea how to fix this?
You can read the data into a Dataset[MyCaseClass] in the following two ways:
Say you have the following case class (filled in here with the same fields as DATA in the question):
case class MyCaseClass(name: String, age: Double, location: String)
1) First way: import the SparkSession implicits into scope and use the as operator to convert your DataFrame to a Dataset[MyCaseClass]:
val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
import spark.implicits._
val ds: Dataset[MyCaseClass] = spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
2) Second way: create your own encoder in a separate object and import it into your current code:
package com.funky.encoders
import org.apache.spark.sql.{Encoder, Encoders}
case class MyCaseClass(name: String, age: Double, location: String)
object MyCustomEncoders {
  implicit val mycaseClass: Encoder[MyCaseClass] = Encoders.product[MyCaseClass]
}
In the file containing the main method, import both the case class and the implicit encoder value (note the ._ on the object import, which is what actually brings the implicit into scope):
import com.funky.encoders.MyCaseClass
import com.funky.encoders.MyCustomEncoders._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
val ds: Dataset[MyCaseClass] = spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
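If you want to keep the generic toDs helper from the question, a minimal sketch is to add an Encoder context bound, so the encoder found via spark.implicits._ (or via your custom encoders object) is passed through to as[T]:
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

// T: Encoder asks the caller to supply an implicit Encoder[T],
// which df.as[T] needs at compile time
def toDs[T: Encoder](df: DataFrame): Dataset[T] =
  df.as[T]

// Usage sketch: with import spark.implicits._ in scope,
// val ds: Dataset[DATA] = toDs[DATA](df)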
I am trying to teach myself Scala whilst at the same time trying to write code that is idiomatic of a functional language, i.e. write better, more elegant, functional code.
I have the following code that works OK:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
implicit class DataFrameExtensions(df: DataFrame){
def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
}
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
val dataFrames = Seq(df.featuresGroup1(groupBy, asAt),df.featuresGroup2(groupBy, asAt))
The last line bothers me though. The two functions (featuresGroup1, featuresGroup2) both have the same signature:
scala> :type df.featuresGroup1(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
scala> :type df.featuresGroup2(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
and take the same vals as parameters, so I assume I can write that line in a more functional way (perhaps using .map somehow) so that the parameter list is written just once and passed to both functions. I can't figure out the syntax though. I thought maybe I could construct a list of those functions but that doesn't work:
scala> Seq(featuresGroup1, featuresGroup2)
<console>:23: error: not found: value featuresGroup1
Seq(featuresGroup1, featuresGroup2)
^
<console>:23: error: not found: value featuresGroup2
Seq(featuresGroup1, featuresGroup2)
^
Can anyone help?
I thought maybe I could construct a list of those functions but that doesn't work:
Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
df.featuresGroup1 _ should work as well.
df.featuresGroup1 by itself would work if you had an expected type, e.g.
val dataframes: Seq[(Seq[String], LocalDate) => DataFrame] =
Seq(df.featuresGroup1, df.featuresGroup2)
but in this specific case providing the expected type is more verbose than using lambdas.
I thought maybe I could construct a list of those functions but that doesn't work
You need to explicitly perform eta expansion to turn methods into functions (they are not the same in Scala), by using an underscore operator:
val funcs = Seq(df.featuresGroup1 _, df.featuresGroup2 _)
or by using placeholders:
val funcs = Seq(df.featuresGroup1(_, _), df.featuresGroup2(_, _))
And you are absolutely right about using the map operator:
val dataFrames = funcs.map(f => f(groupBy, asAt))
I strongly recommend against using implicit parameters of common types like String or Seq: if they are used in multiple places, they lead to subtle bugs that are not obvious from the code, and the code becomes prone to breaking when it is moved around.
If you want to use implicits, wrap them in custom types:
case class DfGrouping(groupBy: Seq[String]) extends AnyVal
implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))
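For illustration, a minimal sketch of how such a wrapper type could then be picked up implicitly; the implicit-parameter variant of featuresGroup1 below is hypothetical and not part of the original extension class:
import java.time.LocalDate
import org.apache.spark.sql.DataFrame

case class DfGrouping(groupBy: Seq[String]) extends AnyVal

object DfGroupingExample {
  // Hypothetical variant of featuresGroup1 that takes the grouping implicitly
  implicit class DataFrameGroupingOps(df: DataFrame) {
    def featuresGroup1(asAt: LocalDate)(implicit grouping: DfGrouping): DataFrame = df
  }

  implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))

  // Call sites no longer repeat the groupBy columns
  def run(df: DataFrame): DataFrame = df.featuresGroup1(LocalDate.now())
}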
Why not just create a function in DataFrameExtensions to do so?
def getDataframeGroups(groupBy: Seq[String], asAt: LocalDate) = Seq(featuresGroup1(groupBy, asAt), featuresGroup2(groupBy, asAt))
I think you could create a list of functions as below:
val funcs: List[DataFrame => (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] = List(_.featuresGroup1, _.featuresGroup2)
funcs.map(x => x(df)(groupBy, asAt))
It seems you have a list of functions which convert a DataFrame to another DataFrame. If that is the pattern, you could go a little bit further with Endo in Scalaz.
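For completeness, a rough sketch of that Endo idea, assuming scalaz is on the classpath; it composes the two steps into a single DataFrame => DataFrame pipeline rather than producing a Seq of DataFrames, and df, groupBy and asAt are the values from the question:
import scalaz._, Scalaz._
import org.apache.spark.sql.DataFrame
import DataFrameExtensions_._ // the implicit class from the question

// Each feature step, partially applied, is an endomorphism on DataFrame
val steps: List[Endo[DataFrame]] = List(
  Endo[DataFrame](_.featuresGroup1(groupBy, asAt)),
  Endo[DataFrame](_.featuresGroup2(groupBy, asAt))
)

// Endo has a Monoid under function composition, so the steps fold into one pipeline
val pipeline: Endo[DataFrame] = steps.suml
val transformed: DataFrame = pipeline.run(df)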
I like this answer best, courtesy of Alexey Romanov.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
implicit class DataFrameExtensions(df: DataFrame){
def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
}
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
I have this code implemented:
scala> import org.apache.spark._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.util.IntParam
import org.apache.spark.util.IntParam
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.graphx.util.GraphGenerators
scala> case class Transactions(ID:Long,Chain:Int,Dept:Int,Category:Int,Company:Long,Brand:Long,Date:String,ProductSize:Int,ProductMeasure:String,PurchaseQuantity:Int,PurchaseAmount:Double)
defined class Transactions
When I try to run this:
def parseTransactions(str: String): Transactions = {
  val line = str.split(",")
  Transactions(line(0), line(1), line(2), line(3), line(4), line(5), line(6), line(7), line(8), line(9), line(10))
}
I am obtaining this error:
<console>:38: error: type mismatch;
found : String
required: Long
Does anyone know why I'm getting this error? I am doing a social network analysis over the schema that I put above.
Many thanks!
str.split(",") returns an Array[String], so every field is a String. Convert each element to the appropriate type before passing it to the case class constructor, for example:
val line = str.split(",")
line(0).toLong
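Putting it together, a sketch of the full function with each field converted to the type declared in Transactions (this assumes every input line really has eleven comma-separated fields in this exact order; malformed lines would throw):
def parseTransactions(str: String): Transactions = {
  val line = str.split(",")
  Transactions(
    line(0).toLong,    // ID
    line(1).toInt,     // Chain
    line(2).toInt,     // Dept
    line(3).toInt,     // Category
    line(4).toLong,    // Company
    line(5).toLong,    // Brand
    line(6),           // Date stays a String
    line(7).toInt,     // ProductSize
    line(8),           // ProductMeasure stays a String
    line(9).toInt,     // PurchaseQuantity
    line(10).toDouble  // PurchaseAmount
  )
}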
I was trying to get into the new Scala Pickling library that was presented at ScalaDays 2013: Scala Pickling.
What I am really missing are some simple examples of how the library is used.
I understood that I can pickle some object and unpickle it again like this:
import scala.pickling._
val pckl = List(1, 2, 3, 4).pickle
val lst = pckl.unpickle[List[Int]]
In this example, pckl is of type Pickle. What exactly is the use of this type, and how can I, for example, get an Array[Byte] out of it?
If you want to pickle into bytes, the code will look like this:
import scala.pickling._
import binary._
val pckl = List(1, 2, 3, 4).pickle
val bytes = pckl.value
If you wanted JSON, the code would look almost exactly the same, with a minor change of imports:
import scala.pickling._
import json._
val pckl = List(1, 2, 3, 4).pickle
val json = pckl.value
How the object is pickled depends on the format import you choose under scala.pickling (either binary or json). Import binary and the value property is an Array[Byte]; import json and it is a JSON String.
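For example, a small round trip with the binary format, reusing the unpickle call from the question (assuming it works the same way on a binary pickle):
import scala.pickling._
import binary._

val pckl = List(1, 2, 3, 4).pickle       // pickled with the binary format
val bytes: Array[Byte] = pckl.value      // the raw bytes, e.g. to write to a file or socket
val restored = pckl.unpickle[List[Int]]  // back to List(1, 2, 3, 4)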