I've used Groovy for a while, and one of the things I loved about it was how expressive test assertions can be. For example, this code:
def actual = [name: [firstName: "Elvis", lastName: "Presley"], age: 42]
def expected = [name: [firstName: "Elvis", lastName: "Costello"], age: 42]
assert actual.name.lastName == expected.name.lastName
will produce the following output:
Assertion failed:
assert actual.name.lastName == expected.name.lastName
       |      |    |        |  |        |    |
       |      |    'Presley'|  |        |    'Costello'
       |      |             |  |        ['firstName':'Elvis', 'lastName':'Costello']
       |      |             |  ['name':['firstName':'Elvis', 'lastName':'Costello'], 'age':42]
       |      |             false
       |      ['firstName':'Elvis', 'lastName':'Presley']
       ['name':['firstName':'Elvis', 'lastName':'Presley'], 'age':42]
at Script1.run(Script1.groovy:4)
As you can see, it shows the intermediate result of every step in the navigation.
Is there something like this for Scala?
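For comparison, here is the same data and assertion in plain Scala (a minimal sketch; the case class names are just for illustration). The built-in assert only reports "assertion failed" and shows none of the intermediate values:

case class Name(firstName: String, lastName: String)
case class Person(name: Name, age: Int)

object AssertDemo extends App {
  val actual   = Person(Name("Elvis", "Presley"), 42)
  val expected = Person(Name("Elvis", "Costello"), 42)

  // Throws java.lang.AssertionError: assertion failed -- no navigation diagram
  assert(actual.name.lastName == expected.name.lastName)
}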
In PySpark, I have dataframe_a with:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple] |
| Tom | [mango, orange] |
| Matteo | [apple, banana] |
and dataframe_b with
+-----------+----------------------+
| key | value |
+-----------+----------------------+
| mango | 1 |
| apple | 2 |
| orange | 3 |
and I want to create a new column of type Array joined_result that maps each element in array_of_str (dataframe_a) to its value in dataframe_b, such as:
+-----------+----------------------+----------------------------------+
| str1 | array_of_str | joined_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple] | [1, 2] |
| Tom | [mango, orange] | [1, 3] |
| Matteo | [apple, banana] | [2] |
I'm not sure how to do it. I know I can use a UDF with a lambda function, but I can't manage to make it work. :( Help!
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
    .select(['str1', 'array_of_str'])
    .withColumn('joined_result', F.udf(
        map(lambda x: ??????, ArrayType(StringType))
    )
)
return ret
# END EXTRACT OF CODE
My answer to your question:
from pyspark.sql.types import IntegerType

# collect the (small) lookup table to the driver and build a plain dict
lookup_list = map(lambda row: row.asDict(), dataframe_b.collect())
lookup_dict = {lookup['key']: lookup['value'] for lookup in lookup_list}

def mapper(keys):
    # skip keys that have no match in the lookup (e.g. 'banana')
    return [lookup_dict[key] for key in keys if key in lookup_dict]

dataframe_a = dataframe_a.withColumn(
    'joined_result', F.udf(mapper, ArrayType(IntegerType()))('array_of_str'))
It works as you want :-)
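If you'd rather avoid collecting the lookup table to the driver, the same result can be had with a plain join instead of a UDF. Here is a sketch using Spark's Scala DataFrame API (the equivalent functions exist in PySpark); dfA and dfB stand in for your two dataframes:

import org.apache.spark.sql.functions._

// Explode the array, join each element against the lookup table,
// then collect the matched values back into one array per row.
// Notes: collect_list does not guarantee element order, and elements
// with no match (e.g. "banana") simply drop out of the inner join.
val joined = dfA
  .withColumn("elem", explode(col("array_of_str")))
  .join(dfB, col("elem") === col("key"), "inner")
  .groupBy(col("str1"))                              // assumes str1 identifies a row
  .agg(
    first(col("array_of_str")).as("array_of_str"),
    collect_list(col("value")).as("joined_result")
  )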
I have this data in my Cucumber feature:
| deal | mir | stp1 | stp2 | date | mnt |
| 1255 | 120 | 1 | 1 | 2018-01-01 | 120 |
which I read into this case class:
case class test1(deal: String, mir: String, stp1: String, stp2: String, date: String, mnt: Option[String])
In my step definition I read it like this:
Given("""^I have this data$""") {dt: DataTable =>
val dt_lists = dt.asList(classOf[test1 ])
}
Problem: when I put "mnt", which is Option[String], in my data like this:
| deal | mir | stp1 | stp2 | date | mnt |
| 1255 | 120 | 1 | 1 | 2018-01-01 | 120 |
I have an error : cucumber.runtime.CucumberException: cucumber.deps.com.thoughtworks.xstream.converters.ConversionException: Cannot construct scala.Option : scala.Option : Cannot construct scala.Option
When I remove "mnt" from the data:
| deal | mir | stp1 | stp2 | date |
| 1255 | 120 | 1 | 1 | 2018-01-01 |
in that case, the program works.
Any help is welcome. Thanks!
I'm not sure why you want to convert the DataTable to a case class. If you intend to use each field, you can do something like this:
Then("""^tGiven("""^I have this data$""") { (fieldNames: DataTable) =>
fieldNames.asList(classOf[String]).asScala.foreach { fieldName =>
// you will have all the field names here like deal,mir ,stp1 ,stp2,date,mnt
}
}
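If you do want the case class with the Option[String] field, one workaround (a sketch only; the exact DataTable methods depend on your Cucumber version) is to let Cucumber hand you plain string maps and wrap the optional column yourself:

import scala.collection.JavaConverters._

Given("""^I have this data$""") { dt: DataTable =>
  // asMaps gives one Map[columnName -> cellValue] per data row;
  // building the case class by hand lets us handle Option explicitly
  val rows = dt.asMaps(classOf[String], classOf[String]).asScala.map { row =>
    test1(
      deal = row.get("deal"),
      mir  = row.get("mir"),
      stp1 = row.get("stp1"),
      stp2 = row.get("stp2"),
      date = row.get("date"),
      mnt  = Option(row.get("mnt")).filter(_.nonEmpty) // missing or empty cell => None
    )
  }
}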
I want to join two tables A and B and, for each record in table B, pick the record from table A that has the max start_date.
Consider the following tables:
Table A:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | a | 4/1/2018 |
| 3 | a | 8/1/2018 |
| 4 | c | 1/1/2018 |
| 5 | d | 1/1/2018 |
| 6 | e | 1/1/2018 |
+---+-----+----------+
Table B:
+---+-----+----------+
|Key|Value|sent_date |
+---+-----+----------+
| x | a | 2/1/2018 |
| y | a | 7/1/2018 |
| z | a | 11/1/2018|
| p | c | 5/1/2018 |
| q | d | 5/1/2018 |
| r | e | 5/1/2018 |
+---+-----+----------+
The aim is to bring in column id from Table A to Table B for each value in Table B.
To do this, tables A and B need to be joined on the Value column, and for each record in B, the row in A with max(A.start_date) subject to A.start_date < B.sent_date has to be found.
Let's consider value=a here.
In Table A, we can see 3 records for Value=a with 3 different start_date values.
So when joining with Table B, for value=a with sent_date=2/1/2018, the record with the max start_date among those less than sent_date is taken (in this case 1/1/2018), and the corresponding A.id is pulled into Table B.
Similarly, for the record with value=a and sent_date=11/1/2018 in Table B, id=3 from Table A needs to be pulled into Table B.
The result must be as follows:
+---+-----+----------+---+
|Key|Value|sent_date |id |
+---+-----+----------+
| x | a | 2/1/2018 | 1 |
| y | a | 7/1/2018 | 2 |
| z | a | 11/1/2018| 3 |
| p | c | 5/1/2018 | 4 |
| q | d | 5/1/2018 | 5 |
| r | e | 5/1/2018 | 6 |
+---+-----+----------+---+
I am using Spark 2.3.
I have joined the two tables (using DataFrames) and found the max(start_date) based on the condition, but I am unable to figure out how to pull the corresponding records.
Can anyone help me out here?
Thanks in advance!
I just changed the date "11/1/2018" to "9/1/2018", since sorting the dates as strings gives incorrect results; once they are converted to real dates, the logic still works. See below.
scala> val df_a = Seq((1,"a","1/1/2018"),
| (2,"a","4/1/2018"),
| (3,"a","8/1/2018"),
| (4,"c","1/1/2018"),
| (5,"d","1/1/2018"),
| (6,"e","1/1/2018")).toDF("id","value","start_date")
df_a: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> val df_b = Seq(("x","a","2/1/2018"),
| ("y","a","7/1/2018"),
| ("z","a","9/1/2018"),
| ("p","c","5/1/2018"),
| ("q","d","5/1/2018"),
| ("r","e","5/1/2018")).toDF("key","valueb","sent_date")
df_b: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 1 more field]
scala> val df_join = df_b.join(df_a,'valueb==='valuea,"inner")
df_join: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 4 more fields]
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> df_join.filter('sent_date >= 'start_date).withColumn("rank", rank().over(Window.partitionBy('key,'valueb,'sent_date).orderBy('start_date.desc))).filter('rank===1).drop("valuea","start_date","rank").show()
+---+------+---------+---+
|key|valueb|sent_date| id|
+---+------+---------+---+
| q| d| 5/1/2018| 5|
| p| c| 5/1/2018| 4|
| r| e| 5/1/2018| 6|
| x| a| 2/1/2018| 1|
| y| a| 7/1/2018| 2|
| z| a| 9/1/2018| 3|
+---+------+---------+---+
scala>
UPDATE
Below is a UDF to handle date strings in MM/dd/yyyy format:
scala> def dateConv(x:String):String=
| {
| val y = x.split("/").map(_.toInt).map("%02d".format(_))
| y(2)+"-"+y(0)+"-"+y(1)
| }
dateConv: (x: String)String
scala> val udfdateconv = udf( dateConv(_:String):String )
udfdateconv: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df_a_dt = df_a.withColumn("start_date",date_format(udfdateconv('start_date),"yyyy-MM-dd").cast("date"))
df_a_dt: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> df_a_dt.printSchema
root
|-- id: integer (nullable = false)
|-- valuea: string (nullable = true)
|-- start_date: date (nullable = true)
scala> df_a_dt.show()
+---+------+----------+
| id|valuea|start_date|
+---+------+----------+
| 1| a|2018-01-01|
| 2| a|2018-04-01|
| 3| a|2018-08-01|
| 4| c|2018-01-01|
| 5| d|2018-01-01|
| 6| e|2018-01-01|
+---+------+----------+
scala>
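For completeness, here is a sketch (continuing the same session, same column names) of applying the conversion to df_b as well, so the original 11/1/2018 value also compares correctly once both sides are real dates:

// Convert df_b.sent_date with the same UDF, then redo the join on proper dates
val df_b_dt = df_b.withColumn("sent_date",
  date_format(udfdateconv('sent_date), "yyyy-MM-dd").cast("date"))

val result = df_b_dt.join(df_a_dt, 'valueb === 'valuea, "inner")
  .filter('sent_date >= 'start_date)
  .withColumn("rank", rank().over(
    Window.partitionBy('key, 'valueb, 'sent_date).orderBy('start_date.desc)))
  .filter('rank === 1)
  .drop("valuea", "start_date", "rank")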
Here's the code to produce this error:
build.sbt
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"ai.x" %% "safe" % "0.1.0"
)
scalacOptions := Seq("-Ytyper-debug") // Only add this if you want to see a bunch of stuff
test.scala
import ai.x.safe._

package object foo {
  final implicit val (enc, dec) = {
    ("x" === "y") -> 0
  }
}
Attempting to compile this will cause this error:
[info] Compiling 1 Scala source to /tmp/test/target/scala-2.11/classes...
[error] /tmp/test/test.scala:4: recursive value x$1 needs type
[error] final implicit val (enc, dec) = {
[error] ^
[error] one error found
With the full debug mode on, I can see that it's trying to resolve === and the compiler is looking at the current implicits to determine whether any of them match. Since (enc, dec) are implicit, it appears to consider them as well, so it tries to type them, causing the implicit recursion the compiler is complaining about.
| | | | | |-- "x".$eq$eq$eq("y") EXPRmode-POLYmode-QUALmode (silent: value x$1 in package)
| | | | | | |-- "x".$eq$eq$eq BYVALmode-EXPRmode-FUNmode-POLYmode (silent: value x$1 in package)
| | | | | | | |-- "x" EXPRmode-POLYmode-QUALmode (silent: value x$1 in package)
| | | | | | | | \-> String("x")
| | | | | | | |-- x$1._1 EXPRmode (site: value enc in package)
| | | | | | | | |-- x$1 EXPRmode-POLYmode-QUALmode (site: value enc in package)
| | | | | | | | | caught scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving value x$1: while typing x$1
[error] /tmp/test/test.scala:4: recursive value x$1 needs type
[error] final implicit val (enc, dec) = {
[error] ^
| | | | | | | | | \-> <error>
| | | | | | | | \-> <error>
| | | | | | | |-- x$1._2 EXPRmode (site: value dec in package)
| | | | | | | | |-- x$1 EXPRmode-POLYmode-QUALmode (site: value dec in package)
| | | | | | | | | \-> <error>
| | | | | | | | \-> <error>
| | | | | | | |-- SafeEquals BYVALmode-EXPRmode-FUNmode-POLYmode (silent: value x$1 in package) implicits disabled
| | | | | | | | |-- ai.x.safe.`package` EXPRmode-POLYmode-QUALmode (silent: value x$1 in package) implicits disabled
| | | | | | | | | \-> ai.x.safe.type
| | | | | | | | \-> ai.x.safe.SafeEquals.type <and> [T](l: T)ai.x.safe.SafeEquals[T]
| | | | | | | solving for (T: ?T)
| | | | | | | solving for (T: ?T)
| | | | | | | solving for (T: ?T)
| | | | | | | [adapt] SafeEquals adapted to [T](l: T)ai.x.safe.package.SafeEquals[T] based on pt String("x") => ?{def ===: ?}
| | | | | | | |-- [T](l: T)ai.x.safe.package.SafeEquals[T] EXPRmode-POLYmode-QUALmode (silent: value x$1 in package)
| | | | | | | | \-> ai.x.safe.package.SafeEquals[String]
| | | | | | | |-- ai.x.safe.`package`.SafeEquals[String]("x").$eq$eq$eq BYVALmode-EXPRmode-FUNmode-POLYmode (silent: value x$1 in package)
| | | | | | | | \-> (r: String)Boolean
| | | | | | | \-> (r: String)Boolean
| | | | | | |-- "y" : pt=String BYVALmode-EXPRmode (silent: value x$1 in package)
| | | | | | | \-> String("y")
| | | | | | \-> Boolean
I can of course make it compile by doing something like this:
final val (x, y) = {
  ("x" === "y") -> 0
}
implicit val (f, b) = (x, y)
Since the implicits don't exist while the body of {} is being typed, they don't interfere with the compiler's implicit search when it works out that === from SafeEquals applies to String, and the code compiles. I don't really have a problem with this, because it does make sense: one can define lazy recursive serializers and other implicit things that use themselves without problems. So of course the compiler should consider the implicit that's currently being defined as a possible candidate to make something work.
But the weird thing to me is that this works if you don't extract the tuple directly during the assignment:
final implicit val tuple = {
  ("x" === "y") -> 0
}
Obviously that's not what I want to do, since I want both things in the tuple to be implicit (in my real case it's an encoder/decoder pair from circe). But it's just strange to me that using (what I believe to be) an extractor for the Tuple2 causes this compiler error during the implicit search. Can anyone tell me why this happens, or what's causing the behavior? I'd be interested to know a bit more about what I'm seeing in the debug output. Why does resolving the type of each individual element of the tuple cause the compiler error, while resolving the type of the overall tuple causes no problems?
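For reference, a variant of the two-step workaround above that keeps both members implicit by giving them explicit types (a sketch with the toy Boolean/Int types from this example; in the real code they would be the circe encoder/decoder types):

import ai.x.safe._

package object foo {
  // The pair itself is not implicit and the members carry explicit types,
  // so typing the right-hand side never has to consult a value whose type
  // is still being inferred.
  private val pair: (Boolean, Int) = ("x" === "y") -> 0

  final implicit val enc: Boolean = pair._1
  final implicit val dec: Int     = pair._2
}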
I need to "extract" some data contained in an Iterable[MyObject] (it was a RDD[MyObject] before a groupBy).
My initial RDD[MyObject] :
|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 1 | 1 |
| | |----|-----|
| | | 2 | 1 |
| | |----|-----|
| | | 3 | 50 |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 5 | 40 |
| | |----|-----|
| | | 6 | 41 |
| | |----|-----|
| | | 7 | 2 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 9 | 15 |
| | |----|-----|
| | | 10| 16 |
| | |----|-----|
| | | 11| 46 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 13| 7 |
| | |----|-----|
| | | 14| 9 |
| | |----|-----|
| | | 15| 60 |
|-----------|---------|----|-----|
| Barcelona | London | ID | Age |
| | |----|-----|
| | | 17| 66 |
| | |----|-----|
| | | 18| 53 |
| | |----|-----|
| | | 19| 11 |
|-----------|---------|----|-----|
I need to count them by age range, grouped by (startCity, endCity).
The final result should be:
|-----------|---------|-------------|
| startCity | endCity | Customer |
|-----------|---------|-------------|
| Paris | London | Range| Count|
| | |------|------|
| | |0-2 | 3 |
| | |------|------|
| | |3-18 | 0 |
| | |------|------|
| | |19-99 | 3 |
|-----------|---------|-------------|
| New-York | Paris | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 3 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
| Barcelona | London | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 1 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
At the moment I'm doing this by counting over the same data 3 times (first with the 0-2 range, then 10-20, then 21-99), like this:
val ite: Iterable[MyObject] = ??? // the grouped customers

ite.count(x => x.age match {
  case Some(age) => age >= 0 && age < 2
  case None      => false
})
It works and gives me an Integer, but I think it's not efficient at all since I have to count several times. What's the best way to do this, please?
Thanks
EDIT: The Customer object is a case class.
def computeRange(age: Int) =
  if (age <= 2)
    "0-2"
  else if (age <= 10)
    "2-10"
  // etc, you get the idea
Then, with an RDD of case class MyObject(id: String, age: Int):
rdd
  .map(x => computeRange(x.age) -> 1)
  .reduceByKey(_ + _)
Edit:
If you need to group by some columns, you can do it this way, provided that you have an RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" with its number of occurrences.
def computeMapOfOccurances(list: Iterable[MyObject]): Map[String, Int] =
  list
    .map(_.age)
    .map(computeRange)
    .groupBy(x => x)
    .mapValues(_.size)

val result1 = rdd
  .mapValues(computeMapOfOccurances(_))
And if you need to flatten your data, you can write:
val result2 = result1
  .flatMapValues(_.toSeq)
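Note that groupBy only produces entries for ranges that actually occur, whereas the expected output also lists empty ranges (e.g. 3-18 | 0). A small sketch of one way to fill those in, assuming computeRange is completed to return the three labels from the question (0-2, 3-18, 19-99):

// All ranges we want to report, even when their count is zero
val allRanges = Seq("0-2", "3-18", "19-99")

def withZeroCounts(counts: Map[String, Int]): Map[String, Int] =
  allRanges.map(r => r -> counts.getOrElse(r, 0)).toMap

val result3 = result1.mapValues(withZeroCounts)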
Assuming that you have Customer as a case class, as below:
case class Customer(ID: Int, Age: Int)
And your RDD[MyObject] is an RDD of the following case class:
case class MyObject(startCity: String, endCity: String, customer: List[Customer])
So, using the above case classes, you should have input (the one you showed in table format) as below:
MyObject(Paris,London,List(Customer(1,1), Customer(2,1), Customer(3,50)))
MyObject(Paris,London,List(Customer(5,40), Customer(6,41), Customer(7,2)))
MyObject(New-York,Paris,List(Customer(9,15), Customer(10,16), Customer(11,46)))
MyObject(New-York,Paris,List(Customer(13,7), Customer(14,9), Customer(15,60)))
MyObject(Barcelona,London,List(Customer(17,66), Customer(18,53), Customer(19,11)))
And you've also mentioned that after grouping you have an Iterable[MyObject], which is equivalent to the step below:
val groupedRDD = rdd.groupBy(myobject => (myobject.startCity, myobject.endCity)) //groupedRDD: org.apache.spark.rdd.RDD[((String, String), Iterable[MyObject])] = ShuffledRDD[2] at groupBy at worksheetTest.sc:23
So the next step is to use mapValues to iterate through the Iterable[MyObject], count the ages belonging to each range, and finally convert to the output you require, as below:
val finalResult = groupedRDD.mapValues(x => {
  val rangeAge = Map("0-2" -> 0, "3-18" -> 0, "19-99" -> 0)
  val list = x.flatMap(y => y.customer.map(z => z.Age)).toList
  updateCounts(list, rangeAge).map(x => CustomerOut(x._1, x._2)).toList
})
where updateCounts is a recursive function
def updateCounts(ageList: List[Int], map: Map[String, Int]): Map[String, Int] = ageList match {
  case head :: tail =>
    if (head >= 0 && head < 3) {
      updateCounts(tail, map ++ Map("0-2" -> (map("0-2") + 1)))
    } else if (head >= 3 && head < 19) {
      updateCounts(tail, map ++ Map("3-18" -> (map("3-18") + 1)))
    } else {
      updateCounts(tail, map ++ Map("19-99" -> (map("19-99") + 1)))
    }
  case Nil => map
}
and CustomerOut is another case class
case class CustomerOut(Range: String, Count: Int)
So finalResult is as below:
((Barcelona,London),List(CustomerOut(0-2,0), CustomerOut(3-18,1), CustomerOut(19-99,2)))
((New-York,Paris),List(CustomerOut(0-2,0), CustomerOut(3-18,4), CustomerOut(19-99,2)))
((Paris,London),List(CustomerOut(0-2,3), CustomerOut(3-18,0), CustomerOut(19-99,3)))
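If you just want to look at it on the driver (assuming the grouped result is small enough to collect), the output above can be printed with:

finalResult.collect().foreach(println)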