How to do a Luhn check on a df column in Spark Scala

df has one string column like "100256437". I want to add one more column indicating whether the value passes the Luhn check: lit(true) if it passes, lit(false) otherwise.
def Mod10(c: Column): Column = {
var (odd, sum) = (true, 0)
for (int <- c.reverse.map { _.toString.toShort }) {
println(int)
if (odd) sum += int
else sum += (int * 2 % 10) + (int / 5)
odd = !odd
}
lit(sum % 10 === 0)
}
Error:
error: value reverse is not a member of org.apache.spark.sql.Column
for (int <- c.reverse.map { _.toString.toShort }) {
^
error: value === is not a member of Int
lit(sum % 10 === 0)
^

Looks like you are dealing with Spark DataFrames.
Let's say you have this DataFrame:
val df = List("100256437", "79927398713").toDF()
df.show()
+-----------+
| value|
+-----------+
| 100256437|
|79927398713|
+-----------+
Now, you can implement this Luhn test as a UDF:
val isValidLuhn = udf { (s: String) =>
  val array = s.toCharArray.map(_.toString.toInt)
  val len = array.length
  // Walk the digits from the right (i = 1 is the check digit) and double every second one.
  var i = 1
  while (i <= len) {
    if (i % 2 == 0) {
      var updated = array(len - i) * 2
      // If doubling gave a two-digit number, replace it with the sum of its digits.
      while (updated > 9) {
        updated = updated.toString.toCharArray.map(_.toString.toInt).sum
      }
      array(len - i) = updated
    }
    i = i + 1
  }
  (array.sum % 10) == 0
}
Which can be used as,
val dfWithLuhnCheck = df.withColumn("isValidLuhn", isValidLuhn(col("value")))
dfWithLuhnCheck.show()
+-----------+-----------+
| value|isValidLuhn|
+-----------+-----------+
| 100256437| true|
|79927398713| true|
+-----------+-----------+
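For reference, here is a minimal sketch (not part of the original answer) of the same Luhn logic as a plain function, so it can be unit-tested on its own before being wrapped in udf(); it assumes the column contains only digit characters:
import org.apache.spark.sql.functions.udf
def passesLuhn(s: String): Boolean = {
  val digits = s.reverse.map(_.asDigit)          // rightmost digit first
  val sum = digits.zipWithIndex.map { case (d, idx) =>
    if (idx % 2 == 1) {                          // double every second digit
      val doubled = d * 2
      if (doubled > 9) doubled - 9 else doubled  // same as summing its digits
    } else d
  }.sum
  sum % 10 == 0
}
val isValidLuhnAlt = udf(passesLuhn _)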

Scala unit type, Fibonacci recusive depth function

So I want to write a Fibonacci function in scala that outputs a tree like so:
fib(3)
| fib(2)
| | fib(1)
| | = 1
| | fib(0)
| | = 0
| = 1
| fib(1)
| = 1
= 2
and my current code is as follows:
var depth: Int = 0
def depthFibonacci(n:Int, depth: Int): Int={
def fibonnaciTailRec(t: Int,i: Int, j: Int): Int = {
println(("| " * depth) + "fib(" + t + ")")
if (t==0) {
println(("| " * depth) + "=" + j)
return j
}
else if (t==1) {
println (("| " * depth) + "=" + i)
return i
}
else {
depthFibonacci(t-1,depth+1) + depthFibonacci(t-2,depth+1)
}
}
fibonnaciTailRec(n,1,0)
}
println(depthFibonacci(3,depth))
which, when run, looks like:
fib(3)
| fib(2)
| | fib(1)
| | =1
| | fib(0)
| | =0
| fib(1)
| =1
2
As you can see, there is no "= " line for any Fibonacci call greater than 1, and I can't add one to my depthFibonacci function without its return type becoming Unit. How can I fix this?
Is this close to what you're after?
def depthFib(n:Int, prefix:String = ""):Int = {
println(s"${prefix}fib($n)")
val res = n match {
case x if x < 1 => 0
case 1 => 1
case _ => depthFib(n-1, prefix+"| ") +
depthFib(n-2, prefix+"| ")
}
println(s"$prefix= $res")
res
}
usage:
depthFib(3)
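For reference, tracing depthFib(3) by hand yields exactly the tree the question asks for:
fib(3)
| fib(2)
| | fib(1)
| | = 1
| | fib(0)
| | = 0
| = 1
| fib(1)
| = 1
= 2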
Stack Safe
As it turns out, we can achieve tail call elimination, even without proper tail call recursion, by using TailCalls from the standard library.
We start with the Fibonacci implementation as found on the ScalaDocs page and add 3 strategically placed println() statements.
import scala.util.control.TailCalls._
def fib(n: Int, deep:Int=0): TailRec[Int] = {
println(s"${"| "*deep}fib($n)")
if (n < 2) {
println(s"${"| "*deep}= $n")
done(n)
} else for {
x <- tailcall(fib(n-1, deep+1))
y <- tailcall(fib(n-2, deep+1))
} yield {
println(s"${"| "*deep}= ${x+y}")
x + y
}
}
usage:
fib(3).result
But things aren't quite what they seem.
val f3 = fib(3) // fib(3)
println("Wha?") // Wha?
f3.result // | fib(2)
// | | fib(1)
// | | = 1
// | | fib(0)
// | | = 0
// | = 1
// | fib(1)
// | = 1
// = 2
Thus are the dangers of relying on side effects for your results.
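If you would rather not rely on side effects at all, one option (a sketch, not part of the original answer) is to accumulate the trace lines as data and print them afterwards:
// Return the Fibonacci value together with its trace lines.
def fibTrace(n: Int, depth: Int = 0): (Int, List[String]) = {
  val head = ("| " * depth) + s"fib($n)"
  if (n < 2) (n, List(head, ("| " * depth) + s"= $n"))
  else {
    val (a, la) = fibTrace(n - 1, depth + 1)
    val (b, lb) = fibTrace(n - 2, depth + 1)
    (a + b, head :: la ::: lb ::: List(("| " * depth) + s"= ${a + b}"))
  }
}
fibTrace(3)._2.foreach(println)  // prints the same tree; fibTrace(3)._1 is the value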

How to add a new column to a dataframe and populate it?

Add a new column called Download_Type to a Dataframe with conditions:
If Size < 100,000, Download_Type = “Small”
If Size > 100,000 and Size < 1,000,000, Download_Type = “Medium”
Else Download_Type = “Large”
Input Data: log_file.txt
Sample Data
"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2012-10-01","00:30:13",35165,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1
I have created a dataframe using these steps:
val file1 = sc.textFile("log_file.txt")
val header = file1.first
val logdata = file1.filter(x=>x!=header)
case class Log(date:String, time:String, size: Double, r_version:String, r_arch:String, r_os:String, packagee:String, version:String, country:String, ipr:Int)
val logfiledata = logdata.map(_.split(",")).map(p => Log(p(0),p(1),p(2).toDouble,p(3),p(4),p(5),p(6),p(7),p(8),p(9).toInt))
val logfiledf = logfiledata.toDF()
I isolated the size column and converted it into an array:
val size = logfiledf.select($"size")
val sizearr = size.collect.map(row=>row.getDouble(0))
I made a function so I can populate the newly added column:
def exp1(size:Array[Double])={
var result = ""
for(i <- 0 to (size.length-1)){
if(size(i)<100000) result += "small"
else(if(size(i) >=100000 && size(i) <1000000) "medium"
else "large"
}
return result
}
I tried this to populate the column Download_Type:
val logfiledf2 = logfiledf.withColumn("Download_Type", expr(exp1(sizearr))
How can I populate the new column called Download_type with the conditions:
If Size < 100,000, Download_Type = “Small”
If Size > 100,000 and Size < 1,000,000, Download_Type = “Medium”
Else Download_Type = “Large” ?
You can simply apply withColumn to the loaded DataFrame logfiledf using when/otherwise, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val logfiledf = Seq(
("2012-10-01","00:30:13",35165.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1),
("2012-10-02","00:40:14",150000.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","US",2)
).toDF("date","time","size","r_version","r_arch","r_os","package","version","country","ip_id")
logfiledf.withColumn("download_type", when($"size" < 100000, "Small").otherwise(
when($"size" < 1000000, "Medium").otherwise("Large")
)
).show
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// | date| time| size|r_version|r_arch| r_os| package|version|country|ip_id|download_type|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// |2012-10-01|00:30:13| 35165.0| 2.15.1| i686|linux-gnu|quadprog| 1.5-4| AU| 1| Small|
// |2012-10-02|00:40:14|150000.0| 2.15.1| i686|linux-gnu|quadprog| 1.5-4| US| 2| Medium|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
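If you prefer expressing the rule as SQL, the same logic can be written with expr and a CASE expression; this is an equivalent sketch of the when/otherwise version above:
import org.apache.spark.sql.functions.expr
logfiledf.withColumn(
  "download_type",
  expr("CASE WHEN size < 100000 THEN 'Small' WHEN size < 1000000 THEN 'Medium' ELSE 'Large' END")
).show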

How to get multiple adjacent values in an RDD with Scala Spark

I have an RDD whose values are 0 or 1, and a limit of 4. When I map over the RDD, if a value is 1, then that position and the next limit - 1 positions should all become 1; positions not covered this way keep their value of 0.
For example:
input : 1,0,0,0,0,0,1,0,0
expected output : 1,1,1,1,0,0,1,1,1
This is what I have tried so far :
import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(Array(1, 0, 0, 0, 0, 0, 1, 0, 0))
val limit = 4
val resultlimit = rdd.mapPartitions(parIter => {
var result = new ArrayBuffer[Int]()
var resultIter = new ArrayBuffer[Int]()
while (parIter.hasNext) {
val iter = parIter.next()
resultIter.append(iter)
}
var i = 0
while (i < resultIter.length) {
result.append(resultIter(i))
if (resultIter(i) == 1) {
var j = 1
while (j + i < resultIter.length && j < limit) {
result.append(1)
j += 1
}
i += j
} else {
i += 1
}
}
result.toIterator
})
resultlimit.foreach(println)
The result of resultlimit is the RDD [1,1,1,1,0,0,1,1,1].
My quick and dirty approach is to first create an Array but that is so ugly and inefficient.
Is there any cleaner solution?
Plain and simple. Import RDDFunctions
import org.apache.spark.mllib.rdd.RDDFunctions._
Define a limit:
val limit: Int = 4
Prepend limit - 1 zeros to the first partition:
val extended = rdd.mapPartitionsWithIndex {
case (0, iter) => Seq.fill(limit - 1)(0).toIterator ++ iter
case (_, iter) => iter
}
Slide over the RDD:
val result = extended.sliding(limit).map {
slice => if (slice.exists(_ != 0)) 1 else 0
}
Check the result:
val expected = Seq(1,1,1,1,0,0,1,1,1)
require(expected == result.collect.toSeq)
On a side note, your current approach doesn't account for partition boundaries, so the result will vary depending on how the data is partitioned.
Following is an improved approach to your requirement: the three while loops are reduced to one for loop and the two ArrayBuffers to one, so both processing time and memory usage are reduced.
val resultlimit= rdd.mapPartitions(parIter => {
var result = new ArrayBuffer[Int]()
var limit = 0
for (value <- parIter) {
if (value == 1) limit = 4
if (limit > 0) {
result.append(1)
limit -= 1
}
else {
result.append(value)
}
}
result.toIterator
})
Edited
The above solution is for when you don't have partitioning defined on the original rdd. But when partitioning is defined, as in
val rdd = sc.parallelize(Array(1,1,0,0,0,0,1,0,0), 4)
we need to collect the rdd, because the above solution would run independently on each partition.
So the following solution should work:
var result = new ArrayBuffer[Int]()
var limit = 0
for (value <- rdd.collect()) {
if (value == 1) limit = 4
if (limit > 0) {
result.append(1)
limit -= 1
}
else {
result.append(value)
}
}
result.foreach(println)
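Collecting to the driver won't scale to large data; a hedged alternative sketch (only sensible for modest sizes, since everything ends up in one task) is to coalesce the rdd to a single partition, which keeps the original element order because no shuffle is involved, and reuse the same per-partition logic:
val singlePartition = rdd.coalesce(1).mapPartitions { iter =>
  var limit = 0
  iter.map { value =>
    if (value == 1) limit = 4
    if (limit > 0) { limit -= 1; 1 } else value
  }
}
singlePartition.collect().foreach(println)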

Add a column to a DataFrame with a value of 1 where the prediction probability is greater than a custom threshold

I am trying to add a column to a DataFrame that should have the value 1 when the output class probability is high. Something like this:
val output = predictions
.withColumn(
"easy",
when( $"label" === $"prediction" &&
$"probability" > 0.95, 1).otherwise(0)
)
The problem is, probability is a Vector, and 0.95 is a Double, so the above doesn't work. What I really need is more like max($"probability") > 0.95 but of course that doesn't work either.
What is the right way of accomplishing this?
Here is a simple example of how to implement what you're asking.
Create a udf, pass the probability column to it, and return 0 or 1 for the newly added column. Note that inside a Row the column arrives as a WrappedArray rather than an Array or Vector.
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (Vector(0.78, 0.98, 0.97), 1), (Vector(0.78, 0.96), 2), (Vector(0.78, 0.50), 3)
)).toDF("probability", "id")

// Define the udf before using it; the array column arrives as a WrappedArray.
def label = udf((prob: mutable.WrappedArray[Double]) => {
  if (prob.max >= 0.95) 1 else 0
})

data.withColumn("label", label($"probability")).show()
Output:
+------------------+---+-----+
| probability| id|label|
+------------------+---+-----+
|[0.78, 0.98, 0.97]| 1| 1|
| [0.78, 0.96]| 2| 1|
| [0.78, 0.5]| 3| 0|
+------------------+---+-----+
Define a UDF (the parameter types below assume a typical ML pipeline, where label and prediction are Double and probability is an org.apache.spark.ml.linalg.Vector):
import org.apache.spark.ml.linalg.Vector
val findP = udf((label: Double, prediction: Double, probability: Vector) => {
  if (label == prediction && probability.toArray.max > 0.95) 1 else 0
})
Use the UDF in withColumn():
val output = predictions.withColumn("easy", findP($"label", $"prediction", $"probability"))
Use a udf, something like:
import org.apache.spark.ml.linalg.Vector
val func = udf((label: String, prediction: String, vector: Vector) => {
  if (label == prediction && vector.toArray.max > 0.95) 1 else 0
})
val output = predictions
  .select($"label", func($"label", $"prediction", $"probability").as("easy"))

Scala: Draw table to console

I need to display a table in a console.
My simple solution, if you would call it a "solution", is as follows:
override def toString() = {
var res = "\n"
var counter = 1;
res += stateDb._1 + "\n"
res += " +----------------------------+\n"
res += " + State Table +\n"
res += " +----------------------------+\n"
for (entry <- stateDb._2) {
res += " | " + counter + "\t | " + entry._1 + " | " + entry._2 + " |\n"
counter += 1;
}
res += " +----------------------------+\n"
res += "\n"
res
}
We don't have to argue that (a) it looks bad when displayed, and (b) the code itself looks kind of messy.
Actually, such a question was asked for C# but I would like to know a nice solution for Scala as well.
So what is a (nice/good/simple/whatever) way to draw such a table in Scala to the console?
-------------------------------------------------------------------------
| Column 1 | Column 2 | Column 3 | Column 4 |
-------------------------------------------------------------------------
| | | | |
| | | | |
| | | | |
-------------------------------------------------------------------------
I've pulled the following from my current project:
object Tabulator {
def format(table: Seq[Seq[Any]]) = table match {
case Seq() => ""
case _ =>
val sizes = for (row <- table) yield (for (cell <- row) yield if (cell == null) 0 else cell.toString.length)
val colSizes = for (col <- sizes.transpose) yield col.max
val rows = for (row <- table) yield formatRow(row, colSizes)
formatRows(rowSeparator(colSizes), rows)
}
def formatRows(rowSeparator: String, rows: Seq[String]): String = (
rowSeparator ::
rows.head ::
rowSeparator ::
rows.tail.toList :::
rowSeparator ::
List()).mkString("\n")
def formatRow(row: Seq[Any], colSizes: Seq[Int]) = {
val cells = (for ((item, size) <- row.zip(colSizes)) yield if (size == 0) "" else ("%" + size + "s").format(item))
cells.mkString("|", "|", "|")
}
def rowSeparator(colSizes: Seq[Int]) = colSizes map { "-" * _ } mkString("+", "+", "+")
}
scala> Tabulator.format(List(List("head1", "head2", "head3"), List("one", "two", "three"), List("four", "five", "six")))
res1: java.lang.String =
+-----+-----+-----+
|head1|head2|head3|
+-----+-----+-----+
| one| two|three|
| four| five| six|
+-----+-----+-----+
Here is a somewhat more compact version, with a bonus: cells are left-aligned and padded with one character on both sides. Based on the answer by Duncan McGregor (https://stackoverflow.com/a/7542476/8547501):
def formatTable(table: Seq[Seq[Any]]): String = {
if (table.isEmpty) ""
else {
// Get column widths based on the maximum cell width in each column (+2 for a one character padding on each side)
val colWidths = table.transpose.map(_.map(cell => if (cell == null) 0 else cell.toString.length).max + 2)
// Format each row
val rows = table.map(_.zip(colWidths).map { case (item, size) => (" %-" + (size - 1) + "s").format(item) }
.mkString("|", "|", "|"))
// Formatted separator row, used to separate the header and draw table borders
val separator = colWidths.map("-" * _).mkString("+", "+", "+")
// Put the table together and return
(separator +: rows.head +: separator +: rows.tail :+ separator).mkString("\n")
}
}
scala> formatTable(Seq(Seq("head1", "head2", "head3"), Seq("one", "two", "three"), Seq("four", "five", "six")))
res0: String =
+-------+-------+-------+
| head1 | head2 | head3 |
+-------+-------+-------+
| one | two | three |
| four | five | six |
+-------+-------+-------+
Many thanks for the Tabulator code! Here is a modification of it for tabular printing of Spark datasets. You can print DataFrame content or a pulled result set, like
Tabulator(hiveContext.sql("SELECT * FROM stat"))
Tabulator(hiveContext.sql("SELECT * FROM stat").take(20))
The second call will print without a header, of course. For the DataFrame version you can set how many rows to pull from the Spark DataFrame for printing and whether you need a header.
/**
* Tabular representation of Spark dataset.
* Usage:
* 1. Import source to spark-shell:
* spark-shell.cmd --master local[2] --packages com.databricks:spark-csv_2.10:1.3.0 -i /path/to/Tabulator.scala
* 2. Tabulator usage:
* import org.apache.spark.sql.hive.HiveContext
* val hiveContext = new HiveContext(sc)
* val stat = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("D:\\data\\stats-belablotski.tsv")
* stat.registerTempTable("stat")
* Tabulator(hiveContext.sql("SELECT * FROM stat").take(20))
* Tabulator(hiveContext.sql("SELECT * FROM stat"))
*/
object Tabulator {
def format(table: Seq[Seq[Any]], isHeaderNeeded: Boolean) : String = table match {
case Seq() => ""
case _ =>
val sizes = for (row <- table) yield (for (cell <- row) yield if (cell == null) 0 else cell.toString.length)
val colSizes = for (col <- sizes.transpose) yield col.max
val rows = for (row <- table) yield formatRow(row, colSizes)
formatRows(rowSeparator(colSizes), rows, isHeaderNeeded)
}
def formatRes(table: Array[org.apache.spark.sql.Row]): String = {
val res: Seq[Seq[Any]] = (for { r <- table } yield r.toSeq).toSeq
format(res, false)
}
def formatDf(df: org.apache.spark.sql.DataFrame, n: Int = 20, isHeaderNeeded: Boolean = true): String = {
val res: Seq[Seq[Any]] = (for { r <- df.take(n) } yield r.toSeq).toSeq
format(List(df.schema.map(_.name).toSeq) ++ res, isHeaderNeeded)
}
def apply(table: Array[org.apache.spark.sql.Row]): Unit =
println(formatRes(table))
/**
* Print DataFrame in a formatted manner.
* @param df Data frame
* @param n How many rows to take for tabular printing
*/
def apply(df: org.apache.spark.sql.DataFrame, n: Int = 20, isHeaderNeeded: Boolean = true): Unit =
println(formatDf(df, n, isHeaderNeeded))
def formatRows(rowSeparator: String, rows: Seq[String], isHeaderNeeded: Boolean): String = (
rowSeparator ::
(rows.head + { if (isHeaderNeeded) "\n" + rowSeparator else "" }) ::
rows.tail.toList :::
rowSeparator ::
List()).mkString("\n")
def formatRow(row: Seq[Any], colSizes: Seq[Int]) = {
val cells = (for ((item, size) <- row.zip(colSizes)) yield if (size == 0) "" else ("%" + size + "s").format(item))
cells.mkString("|", "|", "|")
}
def rowSeparator(colSizes: Seq[Int]) = colSizes map { "-" * _ } mkString("+", "+", "+")
}
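For newer Spark versions without a HiveContext, the same object can be driven from a plain SparkSession; a small usage sketch (the sample data here is made up):
val spark = org.apache.spark.sql.SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._
val stat = Seq((1, "alpha"), (2, "beta")).toDF("id", "name")
Tabulator(stat)          // first 20 rows, with a header row
Tabulator(stat.take(2))  // collected Rows, without a header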
Tokenize it. I'd start by making a few case objects and classes so that you produce a tokenized list which can be operated on for display purposes:
sealed trait TableTokens{
val width: Int
}
case class Entry(value: String) extends TableTokens{
val width = value.length
}
case object LineBreak extends TableTokens{
val width = 0
}
case object Div extends TableTokens{
val width = 1
}
So then you can form certain constraints with some sort of row object:
case class Row(contents: List[TableTokens]) extends TableTokens{
  val width = contents.foldLeft(0)((x, y) => x + y.width)
}
Then you can check for constraints and things like that in an immutable fashion, perhaps creating methods for appending tables and handling alignment...
case class Table(contents: List[TableTokens])
That means you could have several different variants of tables where your style is different from your structure, a la HTML and CSS.
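To make the idea concrete, here is a minimal rendering sketch (not part of the original answer) over these tokens:
def render(row: Row): String =
  row.contents.map {
    case Entry(value) => value
    case Div          => "|"
    case LineBreak    => "\n"
    case nested: Row  => render(nested)  // allow nested rows
  }.mkString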
Here are some modifications of @Duncan McGregor's answer to accept Unicode box-drawing or custom characters, using Scala 3.
First we define a class to host the custom separators:
type ColumnSep = (Char, Char, Char)
case class TableSeparator(horizontal: Char, vertical: Char, upLeft: Char, upMiddle: Char, upRight: Char, middleLeft: Char, middleMiddle: Char, middleRight: Char, downLeft: Char, downMiddle: Char, downRight: Char):
def separate(sep: TableSeparator => ColumnSep)(seq: Seq[Any]): String =
val (a, b, c) = sep(this)
seq.mkString(a.toString, b.toString, c.toString)
def separateRows(posicao: TableSeparator => ColumnSep)(colSizes: Seq[Int]): String =
separate(posicao)(colSizes.map(horizontal.toString * _))
def up: ColumnSep = (upLeft, upMiddle, upRight)
def middle: ColumnSep = (middleLeft, middleMiddle, middleRight)
def down: ColumnSep = (downLeft, downMiddle, downRight)
def verticals: ColumnSep = (vertical, vertical, vertical)
Then we define the separators in the companion object:
object TableSeparator:
lazy val simple = TableSeparator(
'-', '|',
'+', '+', '+',
'+', '+', '+',
'+', '+', '+'
)
lazy val light = TableSeparator(
'─', '│',
'┌', '┬', '┐',
'├', '┼', '┤',
'└', '┴', '┘'
)
lazy val heavy = TableSeparator(
'━', '┃',
'┏', '┳', '┓',
'┣', '╋', '┫',
'┗', '┻', '┛'
)
lazy val dottedLight = TableSeparator(
'┄', '┆',
'┌', '┬', '┐',
'├', '┼', '┤',
'└', '┴', '┘'
)
lazy val dottedHeavy = TableSeparator(
'┅', '┇',
'┏', '┳', '┓',
'┣', '╋', '┫',
'┗', '┻', '┛'
)
lazy val double = TableSeparator(
'═', '║',
'╔', '╦', '╗',
'╠', '╬', '╣',
'╚', '╩', '╝'
)
And finally the Tabulator:
class Tabulator(val separators: TableSeparator):
def format(table: Seq[Seq[Any]]): String = table match
case Seq() => ""
case _ =>
val sizes = for (row <- table) yield for (cell <- row) yield if cell == null then 0 else cell.toString.length
val colSizes = for (col <- sizes.transpose) yield col.max
val rows = for (row <- table) yield formatRow(row, colSizes)
formatRows(colSizes, rows)
private def centralize(text: String, width: Int): String =
val space: Int = width - text.length
val prefix: Int = space / 2
val suffix: Int = (space + 1) / 2
if width > text.length then " ".repeat(prefix) + text + " ".repeat(suffix) else text
def formatRows(colSizes: Seq[Int], rows: Seq[String]): String =
(separators.separateRows(_.up)(colSizes) ::
rows.head ::
separators.separateRows(_.middle)(colSizes) ::
rows.tail.toList :::
separators.separateRows(_.down)(colSizes) ::
List()).mkString("\n")
def formatRow(row: Seq[Any], colSizes: Seq[Int]): String =
val cells = for (item, size) <- row zip colSizes yield if size == 0 then "" else centralize(item.toString, size)
separators.separate(_.verticals)(cells)
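A hypothetical usage that would produce tables like the ones below (the sample rows are taken from the outputs shown):
val data = Seq(
  Seq("a", "b", "c"),
  Seq("abc", true, 242),
  Seq("xyz", false, 1231),
  Seq("ijk", true, 312)
)
println(Tabulator(TableSeparator.simple).format(data))
println(Tabulator(TableSeparator.light).format(data))
println(Tabulator(TableSeparator.heavy).format(data))
println(Tabulator(TableSeparator.double).format(data))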
Some output examples:
+---+-----+----+
| a | b | c |
+---+-----+----+
|abc|true |242 |
|xyz|false|1231|
|ijk|true |312 |
+---+-----+----+
┌───┬─────┬────┐
│ a │ b │ c │
├───┼─────┼────┤
│abc│true │242 │
│xyz│false│1231│
│ijk│true │312 │
└───┴─────┴────┘
┏━━━┳━━━━━┳━━━━┓
┃ a ┃ b ┃ c ┃
┣━━━╋━━━━━╋━━━━┫
┃abc┃true ┃242 ┃
┃xyz┃false┃1231┃
┃ijk┃true ┃312 ┃
┗━━━┻━━━━━┻━━━━┛
╔═══╦═════╦════╗
║ a ║ b ║ c ║
╠═══╬═════╬════╣
║abc║true ║242 ║
║xyz║false║1231║
║ijk║true ║312 ║
╚═══╩═════╩════╝