I need to create an if / multiple else-if construct on a PySpark DataFrame.
I have two columns to test logically.
The logic is below:
If Column A OR Column B contains "something", then write "X"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 100 , then write "X"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 50 , then write "Y"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 0 , then write "Z"
Else, then write "T"
The result should be written to a new column "RESULT".
I thought the quickest approach would be when, otherwise, otherwise, otherwise, otherwise, but the query below failed.
I'd appreciate it if you could suggest a quicker method.
Note: when(clause).when(clause).when(clause).when(clause).otherwise(clause) searches the whole table again and again. I want to proceed with unmatched data only.
df = df.withColumn('RESULT', F.when(\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T"))))))
I got the solution anyway.
df = df.withColumn('RESULT',\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T")))))
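For comparison, the same logic can also be written as one flat when() chain rather than nested otherwise() calls; Spark evaluates such a chain as a single CASE expression per row, so earlier matches short-circuit the remaining branches rather than rescanning the table. Here is a sketch, assuming A and B each contain a single run of digits, and using the > 50 threshold from the description (the posted code uses > 29):

from pyspark.sql import functions as F

# extract the digits from each column and cast them to integers
digits_a = F.regexp_extract(F.col("A"), r"(\d+)", 1).cast("int")
digits_b = F.regexp_extract(F.col("B"), r"(\d+)", 1).cast("int")
total = digits_a + digits_b

df = df.withColumn(
    "RESULT",
    F.when(F.col("A").like("%something%") | F.col("B").like("%something%"), "X")
     .when(total > 100, "X")
     .when(total > 50, "Y")
     .when(total > 0, "Z")
     .otherwise("T"))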
How to print the output after reduceByKey
I tried things like
totalsByAge.foreach { i => println("Value = " + i) }
I have a couple of lines of code
val totalsByAgeEntry = rdd.mapValues(x => (x, 1))
val totalsByAge = totalsByAgeEntry.reduceByKey( (x,y) => (x._1 + y._1, x._2 + y._2))
I want to print the tuples that result when reduceByKey is called, but I don't know how to print the output after (x._1 + y._1, x._2 + y._2) is computed.
I know that the data created after reduceByKey is something like:
(x, ((x1, y1), (x2, y2)))
But how can I print that?
That is because reduceByKey is performed by the executors, and println writes to each executor's standard output. The executors' stdout is usually available through the Spark web UI at master.application.ip.address:8080.
If you want to print/view your data you can do that in several ways. For instance: 1) by applying totalsByAge.take(numberOfLines).foreach(println); 2) by collecting (.collect()) the RDD to the driver; or 3) by converting the RDD to a DataFrame and then applying .show().
import org.apache.spark.rdd.RDD

val rdd: RDD[(Int, Int)] =
  sparkContext
    .parallelize(Vector(1, 2, 3))
    .map(i => (i, 1))
    .reduceByKey(_ + _)

rdd.take(10).foreach(println)  // take the first 10 elements and print them
rdd.collect().foreach(println) // collect the entire RDD to the driver and print it

import spark.implicits._
rdd.toDF().show(10)            // convert to a DataFrame and show the first 10 rows
I have an RDD like:
[(1, "Western"),
(1, "Western")
(1, "Drama")
(2, "Western")
(2, "Romance")
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in
(1, [("Western", 2), ("Drama", 1)]), ...
After that I intend to pick the one with the largest count, thus obtaining the most popular genre per user.
I have tried userGenre.sortByKey().countByValue()
but to no avail. I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
EDIT:
I have tried the following and it seems to have worked, could someone confirm?
userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0,                      # initial value for an accumulator
    lambda r, v: r + v,     # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2  # function that merges/combines two accumulators
)
Here is one way of doing it.
rdd = sc.parallelize([('u1', "Western"),('u2', "Western"),('u1', "Drama"),('u1', "Western"),('u2', "Romance"),('u2', "Romance")])
The occurrence of each movie genre could be
>>> genre_counts = sc.parallelize(rdd.countByValue().items())
>>> genre_counts.map(lambda ((x,y),z): (x,(y,z))).groupByKey().map(lambda (x,y): (x, list(y))).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre
>>> rdd.map(lambda (x,y): ((x,y),1)).reduceByKey(lambda x,y: x+y).map(lambda ((x,y),z): (x,(y,z))).groupByKey().mapValues(lambda counts: max(counts, key=lambda p: p[1])).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now one could ask: what should the most popular genre be if more than one genre has the same popularity count?
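For readers on Python 3, where the tuple-unpacking lambdas above are no longer valid syntax, here is a sketch of the same idea that also keeps every genre tied for the maximum count (the names counts_by_user and top_genres are illustrative, not from the original answer):

pairs = sc.parallelize([('u1', "Western"), ('u2', "Western"), ('u1', "Drama"),
                        ('u1', "Western"), ('u2', "Romance"), ('u2', "Romance")])

# count each (user, genre) pair, then regroup the counts per user
counts_by_user = (pairs.map(lambda ug: (ug, 1))
                       .reduceByKey(lambda a, b: a + b)                # ((user, genre), count)
                       .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # (user, (genre, count))
                       .groupByKey()
                       .mapValues(list))

def top_genres(counts):
    # keep every genre that ties for this user's maximum count
    best = max(c for _, c in counts)
    return [(g, c) for g, c in counts if c == best]

most_popular = counts_by_user.mapValues(top_genres)
# most_popular.collect() -> e.g. [('u1', [('Western', 2)]), ('u2', [('Romance', 2)])]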
I have the scenario below, where I need to get the lines from a list and split them.
scala> var nonErroniousBidsMap = rawBids.filter(line => !(line(2).contains("ERROR_") || line(5) == null || line(5) == ""))
nonErroniousBidsMap: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[108] at filter at <console>:33
scala> nonErroniousBidsMap.take(2).foreach(println)
List(0000002, 15-04-08-2016, 0.89, 0.92, 1.32, 2.07, , 1.35)
List(0000002, 11-05-08-2016, 0.92, 1.68, 0.81, 0.68, 1.59, , 1.63, 1.77, 2.06, 0.66, 1.53, , 0.32, 0.88, 0.83, 1.01)
scala> val transposeMap = nonErroniousBidsMap.map( rec => ( rec.split(",")(0) + "," + rec.split(",")(1) + ",US" + "," + rec.split(",")(5) ) )
<console>:35: error: value split is not a member of List[String]
val transposeMap = nonErroniousBidsMap.map( rec => ( rec.split(",")(0) + "," + rec.split(",")(1) + ",US" + "," + rec.split(",")(5) ) )
^
I am getting the error shown above.
Can you please help me solve this?
Thank you.
The type of rec is List[String] - which does not have a split(String) method (as the compiler correctly warns). It looks like you're assuming your records are comma-separated Strings, but in fact they're not (when you call println on each one of them, they are printed with comma separators simply because that's how List.toString behaves).
You can simply remove all the calls to split(",") and get what you want:
nonErroniousBidsMap.map(rec => rec.head + "," + rec(1) + ",US" + "," + rec(5))
Or even more elegantly, using Scala's String Interpolation:
nonErroniousBidsMap.map(rec => s"${rec.head},${rec(1)},US,${rec(5)}")
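If you later need more fields, building a small List of the selected values and joining it with mkString is another option; a sketch equivalent to the lines above:

nonErroniousBidsMap.map(rec => List(rec.head, rec(1), "US", rec(5)).mkString(","))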
I'm learning Scala and curious how to optimize this code. What I have is an RDD loaded from Spark. It's a tab-delimited dataset. I want to combine the first column with the second column, and append it as a new column to the end of the dataset, with a "-" separating the two.
For example:
column1\tcolumn2\tcolumn3
becomes
column1\tcolumn2\tcolumn3\tcolumn1-column2
val f = sc.textFile("path/to/dataset")
f.map(line =>
    if (line.split("\t").length > 1)
      line.split("\t") :+ line.split("\t")(0) + "-" + line.split("\t")(1)
    else
      Array[String]())
  .map(a => a.mkString("\t"))
  .saveAsTextFile("output/path")
Try:
f.map{ line =>
val cols = line.split("\t")
if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
else line
}
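Put together with the read and write steps from the question (the paths are placeholders, and withCombined is just an illustrative name), the whole job could look like this sketch:

val f = sc.textFile("path/to/dataset")

val withCombined = f.map { line =>
  val cols = line.split("\t")
  // append the combined column only when there are at least two columns
  if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
  else line
}

withCombined.saveAsTextFile("output/path")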
I am trying to learn how to use foldLeft and foldRight. This is my first time learning functional programming. I am having trouble understanding when to use foldLeft and when to use foldRight. It seems to me that a lot of the time the two functions are interchangeable. For example (in Scala), the two functions:
val nums = List(1, 2, 3, 4, 5)
val sum1 = nums.foldLeft(0) { (total, n) =>
total + n
}
val sum2 = nums.foldRight(0) {(total, n) =>
total + n
}
both yield the same result. Why and when would I choose one or the other?
foldLeft and foldRight differ in the way the function applications are nested (z is the initial value):
foldLeft:  ((z + a1) + a2) + a3
foldRight: a1 + (a2 + (a3 + z))
Since the function you are using is addition, both of them give the same result. Try using subtraction.
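For instance, with subtraction on the same nums list the two folds bracket the elements differently and give different answers:

val nums = List(1, 2, 3, 4, 5)
nums.foldLeft(0)(_ - _)   // ((((0 - 1) - 2) - 3) - 4) - 5 = -15
nums.foldRight(0)(_ - _)  // 1 - (2 - (3 - (4 - (5 - 0)))) = 3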
Moreover, the motivation to choose foldLeft or foldRight is often not the result; in many cases both yield the same value. It depends on the direction in which you want your function to be applied.
Because the operator you are using is associative and commutative (a + b = b + a), foldLeft and foldRight give the same result here, but they are not equivalent in general. You can see this with an operator that is not commutative: string concatenation is associative, but "a" + "b" != "b" + "a", so the two folds produce different strings, as the examples below show.
val listString = List("a", "b", "c")  // : List[String] = List(a, b, c)
val leftFoldValue = listString.foldLeft("z")((acc, el) => acc + el)   // : String = zabc
val rightFoldValue = listString.foldRight("z")((el, acc) => el + acc) // : String = abcz
Or, in shorthand:
val leftFoldValue = listString.foldLeft("z")(_ + _) // : String = zabc
val rightFoldValue = listString.foldRight("z")(_ + _) // : String = abcz
Explanation:
foldLeft works as ((("z" + "a") + "b") + "c") = (("za" + "b") + "c") = ("zab" + "c") = "zabc"
and foldRight as ("a" + ("b" + ("c" + "z"))) = ("a" + ("b" + "cz")) = ("a" + "bcz") = "abcz".
So, in short: for operators that are associative and commutative, foldLeft and foldRight are equivalent (even though there may be a difference in efficiency).
But sometimes, only one of the two operators is appropriate.
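A typical case is rebuilding a list, where the element type and the accumulator type only line up one way round; a small sketch:

val xs = List(1, 2, 3)
xs.foldRight(List.empty[Int])(_ :: _)               // List(1, 2, 3): cons matches foldRight's (element, accumulator) order
xs.foldLeft(List.empty[Int])((acc, x) => x :: acc)  // List(3, 2, 1): foldLeft builds the list reversed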