Spark iterate over dataframe rows, cells

(Spark beginner) I wrote the code below to iterate over the rows and columns of a data frame (Spark 2.4.0 + Scala 2.12).
I have computed the row and cell counts as a sanity check.
I was surprised to find that the method returns 0, even though the counters are incremented during the iteration.
To be precise: while the code is running, it prints messages showing that it has found
rows 10, 20, ..., 610 - as expected;
cells 100, 200, ..., 15800 - as expected.
After the iteration is done, it prints "Found 0 cells", and returns 0.
I understand that Spark is a distributed processing engine, and that code is not executed exactly as written - but how should I think about this code?
The row/cell counts were just a sanity check; in reality I need to loop over the data and accumulate some results, but how do I prevent Spark from zeroing out my results as soon as the iteration is done?
def processDataFrame(df: sql.DataFrame): Int = {
  var numRows = 0
  var numCells = 0
  df.foreach { row =>
    numRows += 1
    if (numRows % 10 == 0) println(s"Found row $numRows") // prints 10,20,...,610
    row.toSeq.foreach { c =>
      if (numCells % 100 == 0) println(s"Found cell $numCells") // prints 100,200,...,15800
      numCells += 1
    }
  }
  println(s"Found $numCells cells") // prints 0
  numCells
}

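Spark has accumulator variables, which provide functionality such as counting in a distributed environment. Your counters read 0 because the closure passed to foreach is serialized and shipped to the executors, so each task increments its own copy of numRows and numCells; the copies on the driver are never updated. Accumulators are designed for this case: tasks add to them, and Spark merges the partial results back on the driver. You can use the built-in long and double accumulators, and a custom accumulator type is also fairly easy to implement.
Changing your counters to accumulators, as below, gives the correct result on the driver.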
// sc is the SparkContext, e.g. df.sparkSession.sparkContext
val numRows = sc.longAccumulator("numRows Accumulator") // the name is only for debugging purposes
val numCells = sc.longAccumulator("numCells Accumulator")
df.foreach { row =>
  numRows.add(1)
  // Reading value inside a task is not reliable: at best it reflects that task's
  // partial count. Only the driver sees the merged total.
  if (numRows.value % 10 == 0) println(s"Found row ${numRows.value}")
  row.toSeq.foreach { c =>
    if (numCells.value % 100 == 0) println(s"Found cell ${numCells.value}")
    numCells.add(1)
  }
}
println(s"Found ${numCells.value} cells") // on the driver: prints the total cell count, no longer 0
numCells.value
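If the real goal is to loop over the data and accumulate a result, an alternative to side effects is to express the accumulation as an aggregation that returns its value to the driver. The following is only a minimal sketch under some assumptions: the object name CountCells, the helper countRowsAndCells and the tiny sample DataFrame are made up for illustration, and it runs Spark in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CountCells {
  // Hypothetical helper: maps every row to (1 row, its number of cells)
  // and sums the pairs. The result comes back to the driver as a value,
  // so nothing depends on mutating captured variables.
  def countRowsAndCells(df: DataFrame): (Long, Long) =
    df.rdd
      .map(row => (1L, row.length.toLong))
      .fold((0L, 0L)) { case ((r1, c1), (r2, c2)) => (r1 + r2, c1 + c2) }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session just for the demonstration
    val spark = SparkSession.builder().master("local[2]").appName("count-cells").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a", 1.0), (2, "b", 2.0)).toDF("id", "name", "score")
    println(countRowsAndCells(df)) // (2,6)
    spark.stop()
  }
}
```

For a plain sanity check, df.count() and df.columns.length give the same numbers; the map-then-fold shape is the part that carries over to other per-row computations.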

Related

Prime numbers print from range 2...100

I have been assigned with a task to print prime numbers from a range 2...100. I've managed to get most of the prime numbers but can't figure out how to get rid of 9 and 15, basically multiples of 3 and 5. Please give me your suggestion on how can I fix this.
for n in 2...20 {
    if n % 2 == 0 && n < 3 {
        print(n)
    } else if n % 2 == 1 {
        print(n)
    } else if n % 3 == 0 && n > 6 {
    }
}
This is what it prints so far:
2
3
5
7
9
11
13
15
17
19

One effective algorithm for finding prime numbers is the Sieve of Eratosthenes. The idea is to start with a sorted array of all numbers in the given range and walk it from the beginning: the current number is prime, so you remove every later number that is divisible by it. You repeat this until you reach the last element of the array.
Here is my implementation of the idea described above:
func primes(upTo rangeEndNumber: Int) -> [Int] {
    let firstPrime = 2
    guard rangeEndNumber >= firstPrime else {
        fatalError("End of range has to be greater than or equal to \(firstPrime)!")
    }
    var numbers = Array(firstPrime...rangeEndNumber)
    // Index of current prime in numbers array, at the beginning it is 0 so number is 2
    var currentPrimeIndex = 0
    // Check if there is any number left which could be prime
    while currentPrimeIndex < numbers.count {
        // Number at currentPrimeIndex is next prime
        let currentPrime = numbers[currentPrimeIndex]
        // Create array with numbers after current prime and remove all that are divisible by this prime
        var numbersAfterPrime = numbers.suffix(from: currentPrimeIndex + 1)
        numbersAfterPrime.removeAll(where: { $0 % currentPrime == 0 })
        // Set numbers as current numbers up to current prime + numbers after prime without numbers divisible by current prime
        numbers = numbers.prefix(currentPrimeIndex + 1) + Array(numbersAfterPrime)
        // Increase index for current prime
        currentPrimeIndex += 1
    }
    return numbers
}
print(primes(upTo: 100)) // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
print(primes(upTo: 2)) // [2]
print(primes(upTo: 1)) // Fatal error: End of range has to be greater than or equal to 2!

What is a prime number? A prime number is a positive integer with exactly two factors: 1 and the integer itself.
// Function call
findPrimeNumberlist(fromNumber: 1, toNumber: 100)
// You can print the primes in any range using this function.
func findPrimeNumberlist(fromNumber: Int, toNumber: Int)
{
    for i in fromNumber...toNumber
    {
        var isPrime = true
        if i <= 1 { // primes are greater than 1
            isPrime = false
        }
        else if i <= 3 {
            isPrime = true
        }
        else {
            for j in 2...i/2 // looping only from 2 to i/2 reduces the number of iterations
            {
                if i % j == 0 { // found a divisor other than 1 and i, so i is not prime; no need to check further
                    isPrime = false
                    break
                }
            }
        }
        if isPrime {
            print(i)
        }
    }
}

func getPrimeNumbers(rangeOfNum: Int) -> [Int] {
    var numArr = [Int]()
    var primeNumArr = [Int]()
    var currentNum = 0
    for i in 0...rangeOfNum {
        currentNum = i
        var counter = 0
        if currentNum > 1 {
            numArr.append(currentNum)
            for j in numArr {
                if currentNum % j == 0 {
                    counter += 1
                }
            }
            if counter == 1 {
                primeNumArr.append(currentNum)
            }
        }
    }
    print(primeNumArr)
    print(primeNumArr.count)
    return primeNumArr
}
Then just call the function with the upper limit:
getPrimeNumbers(rangeOfNum: 100)
What is happening in the code above:
numArr keeps track of the numbers that have been visited so far.
Any number that is prime is appended to primeNumArr.
currentNum is the number being checked at the moment.
We loop from 0 up to the end of the range in which we want primes (with a small modification the range could start from a number other than 0).
Remember that a prime has exactly two divisors, 1 and itself. (A divisor is a number that divides it with a remainder of 0.)
The counter variable counts how many numbers in numArr divide the number currently being checked.
Since 1 has only one divisor (itself), it is not a prime number, so we only consider numbers greater than 1.
As soon as we get in, we append the current number to numArr to keep track of the numbers used.
We then loop over numArr and check whether the current number (which is always new and greater than the previous ones) leaves a remainder of 0 when divided by each number in numArr.
If the remainder is 0, we add 1 to the counter.
Since 1 is never in numArr, the counter for a prime number is exactly 1: the number is divisible only by itself (divisibility by 1 is ignored).
Hence if the counter equals 1, the number is prime and we append it to primeNumArr.
And that's it. This gives you all prime numbers within your range.
PS: This code was written for the version of Swift that was current at the time of writing.

Optimised to use fewer loop iterations
The following observations are used:
No even number except 2 is prime, so the outer loop starts from 3 and steps by 2
An odd number has no even divisors, so the inner loop also starts from 3 and steps by 2
The largest proper divisor of a number is at most half of that number, so the inner loop stops at index/2
var primeNumbers:[Int] = [2]
for index in stride(from: 3, to: 100, by: 2) {
    var count = 0
    for indexJ in stride(from: 3, to: index/2, by: 2) {
        if index % indexJ == 0 {
            count += 1
        }
        if count == 1 {
            break
        }
    }
    if count == 0 {
        primeNumbers.append(index)
    }
}
print("primeNumbers ===", primeNumbers)

I finally figured it out lol, It might be not pretty but it works haha, Thanks for everyone's answer. I'll post what I came up with if maybe it will help anyone else.
for n in 2...100 {
    if n % 2 == 0 && n < 3 {
        print(n)
    } else if n % 3 == 0 && n > 6 {
    } else if n % 5 == 0 && n > 5 {
    } else if n % 7 == 0 && n > 7 {
    } else if n % 2 == 1 {
        print(n)
    }
}

What's wrong with this solution for neighbor cell counting of a matrix? (Swift)

I was given this problem to solve inside a Swift playground. I was told my answer is correct but not good enough (that's right, pretty vague).
/*
Write a Swift playground that takes an n x n grid of integers. Each integer can be either 1 or 0.
The playground then outputs an n x n grid where each block indicates the number of 1's around that block (excluding the block itself). For block (0,0), the surrounding blocks are (0,1), (1,0) and (1,1). Similarly, for block (1,1) all the blocks around it are to be counted as surrounding blocks.
Requirements:
Make sure your solution works for any size grid.
Spend an hour and a half coding the logic and another hour cleaning your code (adding comments, cleaning variable and function names).
Optimize your functions to be O(n^2).
Your output lines should not have any trailing or leading whitespaces.
Please use hard coded example input constants below.
Examples:
Input Grid:
let sampleGrid = [[0,1,0], [0,0,0], [1,0,0]]
Console Output:
1 0 1
2 2 1
0 1 0
/////////////////
Input Grid:
let sampleGrid = [[0,1,0,0], [0,0,0,1], [1,0,0,1],[0,1,0,1]]
Console Output:
1 0 2 1
2 2 3 1
1 2 4 2
2 1 3 1
*/
/// An *Error* type
struct ComputationError : Error {
    /// The message of the error
    let message : String
}
/// This function computes a matrix result from the input *matrix*
/// where an integer value represents the number of adjacent cells
/// in the input *matrix* having a 1.
///
/// The algorithm is O(n^2), if n is the side length of the matrix:
///
///     for each row in rows of matrix
///         for each column in columns of matrix
///             for each matrix[row][column] cell if equal to 1
///                 add a 1 to adjacent (valid) cell in result matrix
///
/// - Parameter matrix: input square matrix with values of 0 or 1 only
/// - Returns: a matrix of equal size to input matrix with computed adjacency values
/// - Throws: throws a ComputationError if the matrix is of invalid size or has invalid values (something other than 1 or 0)
func compute(matrix:[[Int]]) throws -> [[Int]] {
    // The number of rows in matrix, which should equal the number of columns, ie side length, or n
    let side = matrix.count
    // The resulting matrix to return
    var result:[[Int]] = []
    // Initialize the result matrix
    for _ in 0..<side {
        result.append(Array<Int>(repeating: 0, count: side))
    }
    // A convenience constant to refer to the last element in a row or column
    let last = side-1
    // Iterate over rows in matrix
    for row in 0..<side {
        if matrix[row].count < side {
            throw ComputationError(message:"Invalid number of columns (\(matrix[row].count)), should match number of rows (\(side))")
        }
        // Iterate over columns in matrix
        for column in 0..<side {
            // Consider this cell if it is 1, otherwise skip
            // If it is 1, then add a 1 to all valid adjacent cells
            // in result matrix.
            if matrix[row][column] == 1 {
                if 0 < row {
                    if 0 < column {
                        result[row-1][column-1] += 1
                    }
                    result[row-1][column] += 1
                    if column < last {
                        result[row-1][column+1] += 1
                    }
                }
                if 0 < column {
                    result[row][column-1] += 1
                }
                if column < last {
                    result[row][column+1] += 1
                }
                if row < last {
                    if 0 < column {
                        result[row+1][column-1] += 1
                    }
                    result[row+1][column] += 1
                    if column < last {
                        result[row+1][column+1] += 1
                    }
                }
            }
            else if matrix[row][column] == 0 {
                // ok
            }
            // If value is neither 0 nor 1 throw an error
            else {
                throw ComputationError(message:"Invalid value (\(matrix[row][column])) encountered at row \(row) and column \(column)")
            }
        }
    }
    return result
}
/// Print *matrix* to console by iterating over each row in the
/// matrix and each column in the row, regardless of the count
/// of columns in the row. The output is row-wise and a single
/// space delimits columns, with no leading or trailing whitespace.
///
/// - Parameter matrix: an array of array of Int
func print(matrix:[[Int]]) {
    let rows = matrix.count
    for row in 0..<rows {
        let columns = matrix[row].count
        var line = ""
        for column in 0..<columns {
            if 0 < column {
                line += " "
            }
            line += "\(matrix[row][column])"
        }
        print(line)
    }
}
do {
    print(matrix:try compute(matrix: [[0,1,0], [0,0,0], [1,0,0]]))
}
catch let error {
    print(error)
}
do {
    print()
    print(matrix:try compute(matrix: [[0,1,0,0], [0,0,0,1], [1,0,0,1],[0,1,0,1]]))
}
catch let error {
    print(error)
}
// test for exception
do {
    print()
    print(matrix:try compute(matrix: [[0,1], [0,0,0], [1,0,0]]))
}
catch let error {
    print(error)
}
// test for exception
do {
    print()
    print(matrix:try compute(matrix: [[0,1,0], [3,0,0], [1,0,0]]))
}
catch let error {
    print(error)
}

They were likely looking for something in a functional programming style, and you probably needed to demonstrate that your solution is O(n^2).
For example:
// let sampleGrid = [[0,1,0], [0,0,0], [1,0,0]]
let sampleGrid = [[0,1,0,0], [0,0,0,1], [1,0,0,1],[0,1,0,1]]
func printMatrix(_ m:[[Any]])
{
    print( m.map{ $0.map{"\($0)"}.joined(separator:" ")}.joined(separator:"\n") )
}
let emptyLine = [Array(repeating:0, count:sampleGrid.first!.count)]
let left = sampleGrid.map{ [0] + $0.dropLast() } // O(n)
let right = sampleGrid.map{ $0.dropFirst() + [0] } // O(n)
let up = emptyLine + sampleGrid.dropLast() // O(n)
let down = sampleGrid.dropFirst() + emptyLine // O(n)
let leftRight = zip(left,right).map{zip($0,$1).map{$0+$1}} // O(n^2)
let upDown = zip(up,down).map{zip($0,$1).map{$0+$1}} // O(n^2)
let cornersUp = emptyLine + leftRight.dropLast() // O(n)
let cornersDown = leftRight.dropFirst() + emptyLine // O(n)
let sides = zip(leftRight,upDown).map{zip($0,$1).map{$0+$1}} // O(n^2)
let corners = zip(cornersUp,cornersDown).map{zip($0,$1).map{$0+$1}} // O(n^2)
// 6 x O(n) + 5 x O(n^2) ==> O(n^2)
let neighbourCounts = zip(sides,corners).map{zip($0,$1).map{$0+$1}} // O(n^2)
print("SampleGrid:")
printMatrix(sampleGrid)
print("\nNeighbour counts:")
printMatrix(neighbourCounts)
...
SampleGrid:
0 1 0 0
0 0 0 1
1 0 0 1
0 1 0 1
Neighbour counts:
1 0 2 1
2 2 3 1
1 2 4 2
2 1 3 1

Number of Cycles from list of values, which are mix of positives and negatives in Spark and Scala

I have an RDD containing a list of values, which are a mix of positives and negatives.
I need to compute the number of cycles from this data.
For example,
val range = List(sampleRange(2020,2030,2040,2050,-1000,-1010,-1020,Starting point,-1030,2040,-1020,2050,2040,2020,end point,-1060,-1030,-1010)
The interval between consecutive values in the list above is 1 second, i.e. 2020 and 2030 are recorded 1 second apart, and so on.
The question is how many times the signal turns from negative to positive and stays positive for >= 2 seconds.
If it stays for >= 2 seconds, it is a cycle.
Number of cycles: Logic
Example 1: List(1,2,3,4,5,6,-15,-66)
No. of cycles is 1.
Reason: As we move from 1st element of list to 6th element, we had 5 intervals which means 5 seconds. So one cycle.
As we move to 6th element of list, it is a negative value. So we start counting from 6th element and move to 7th element. The negative values are only 2 and interval is only 1. So not counted as cycle.
Example 2:
List(11,22,33,-25,-36,-43,20,25,28)
No. of cycles is 3.
Reason: As we move from the 1st element of the list to the 3rd element, we have 2 intervals, which means 2 seconds. So one cycle.
The 4th element of the list is a negative value, so we start counting from the 4th element and move to the 5th and 6th elements. We have 2 intervals, which means 2 seconds. So one cycle.
The 7th element of the list is a positive value, so we start counting from the 7th element and move to the 8th and 9th elements. We have 2 intervals, which means 2 seconds. So one cycle.
range is an RDD in the use case. It looks like
scala> range
range: Seq[com.Range] = List(XtreamRange(858,890,899,920,StartEngage,-758,-790,-890,-720,920,940,950))

You can encode this "how many times it turns from negative to positive and stays positive for >= 2 seconds. If >= 2 seconds it is a cycle." pretty much directly into a pattern match with a guard. The expression if(h < 0 && ht > 0 && hht > 0) checks for a cycle and adds one to the result then continues with the rest of the list.
def countCycles(xs: List[Int]): Int = xs match {
  case Nil => 0
  case h::ht::hht::t if(h < 0 && ht > 0 && hht > 0) => 1 + countCycles(t)
  case h::t => countCycles(t)
}
scala> countCycles(range)
res7: Int = 1

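A one-liner
range.sliding(3).count { case f :: s :: t :: Nil => f < 0 && s > 0 && t > 0; case _ => false }
This generates all sub-sequences of length 3 and counts how many are -ve, +ve, +ve. The case _ branch covers a list with fewer than 3 elements, for which sliding yields a single shorter window that would otherwise cause a MatchError.
Generalising the cycle length
def countCycles(n: Int, xs: List[Int]) = xs.sliding(n + 1)
  .count(ys => ys.size == n + 1 && ys.head < 0 && ys.tail.forall(_ > 0))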
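To see the sliding-window idea on concrete data, here is a small self-contained sketch. It assumes the markers such as StartEngage have already been filtered out, so only the integers from the sample in the question remain; the object name CycleCount is made up for illustration. The size check matters because sliding on a list shorter than the window returns one shorter window instead of none.

```scala
object CycleCount {
  // Counts windows made of one negative value followed by n positive values.
  // The size check skips the single short window that sliding produces
  // when the list has fewer than n + 1 elements.
  def countCycles(n: Int, xs: List[Int]): Int =
    xs.sliding(n + 1).count { w =>
      w.size == n + 1 && w.head < 0 && w.tail.forall(_ > 0)
    }

  def main(args: Array[String]): Unit = {
    // Integers from the question's sample, markers removed (assumption)
    val sample = List(858, 890, 899, 920, -758, -790, -890, -720, 920, 940, 950)
    println(countCycles(2, sample))       // 1: the window (-720, 920, 940)
    println(countCycles(2, List(-1, 5)))  // 0: too short to form a window
  }
}
```

Note that this counts only negative-to-positive transitions, which matches the one-sentence definition in the question rather than its Example 2.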

The code below should help resolve your query.
object CycleCheck {
  def main(args: Array[String]) {
    var data3 = List(1, 4, 82, -2, -12, "startingpoint", -9, 32, 76,45, -98, 76, "Endpoint", -24)
    var data2 = data3.map(x => getInteger(x)).filter(_ != "unknown").map(_.toString.toInt)
    println(data2)
    var nCycle = findNCycle(data2)
    println(nCycle)
  }
  def getInteger(obj: Any) = obj match {
    case n: Int => obj
    case _ => "unknown"
  }
  def findNCycle(obj: List[Int]) : Int = {
    var cycleCount =0
    var sign = ""
    var signCheck="+"
    var size = obj.size - 1
    var numberOfCycles=0
    var i=0
    for( x <- obj){
      if (x < 0){
        sign="-"
      }
      else if (x > 0){
        sign="+"
      }
      if(signCheck.equals(sign))
        cycleCount=cycleCount+1
      if(!signCheck.equals(sign) && cycleCount>1){
        cycleCount = 1
        numberOfCycles=numberOfCycles+1
      }
      if(size==i && cycleCount>1)
        numberOfCycles= numberOfCycles+1
      if(cycleCount==1)
        signCheck = sign;
      i=i+1
    }
    return numberOfCycles
  }
}

speed up prime number generating

I have written a program that generates prime numbers. It works well, but I want to speed it up, as it takes quite a while to generate all the prime numbers up to 10000.
var list = [2,3]
var limitation = 10000
var flag = true
var tmp = 0
for (var count = 4 ; count <= limitation ; count += 1 ){
    while(flag && tmp <= list.count - 1){
        if (count % list[tmp] == 0){
            flag = false
        }else if ( count % list[tmp] != 0 && tmp != list.count - 1 ){
            tmp += 1
        }else if ( count % list[tmp] != 0 && tmp == list.count - 1 ){
            list.append(count)
        }
    }
    flag = true
    tmp = 0
}
print(list)

Two simple improvements that will make it fast up through 100,000 and maybe 1,000,000.
All primes except 2 are odd
Start the loop at 5 and increment by 2 each time. This isn't going to speed it up a lot, because for even numbers you already find the counterexample on the first try, but it's still a very typical improvement.
Only search through the square root of the value you are testing
The square root is the point at which you halve the factor space, i.e. any factor less than the square root is paired with a factor above the square root, so you only have to check above or below it. There are far fewer numbers below the square root, so you should check only the values less than or equal to the square root.
Take 10,000 for example. The square root is 100. For this you only have to look at the primes up to the square root, which is 25 values instead of over 1,000 checks for all primes less than 10,000.
Doing it even faster
Try another method altogether, like a sieve. These methods are much faster but have a higher memory overhead.

In addition to what Nick already explained, you can also easily take advantage of the following property: all primes greater than 3 are congruent to 1 or -1 mod 6.
Because you've already included 2 and 3 in your initial list, you can therefore start with count = 6, test count - 1 and count + 1 and increment by 6 each time.
Below is my first attempt ever at Swift, so pardon the syntax which is probably far from optimal.
import Foundation // for floor and sqrt

var list = [2,3]
var limitation = 10000
var flag = true
var tmp = 0
var max = 0
for(var count = 6 ; count <= limitation ; count += 6) {
    for(var d = -1; d <= 1; d += 2) {
        max = Int(floor(sqrt(Double(count + d))))
        for(flag = true, tmp = 0; flag && list[tmp] <= max; tmp++) {
            if((count + d) % list[tmp] == 0) {
                flag = false
            }
        }
        if(flag) {
            list.append(count + d)
        }
    }
}
print(list)
I've tested the above code on iswift.org/playground with limitation = 10,000, 100,000 and 1,000,000.

Scala for loop value example [duplicate]

How do I solve this problem in Scala using a for-loop?
Sum of all the multiples of 3 and 5 below 1000;
Example: 1*3+2*5+3*3+4*5+5*3+6*5 ... and so on up to 999*3+1000*5 = how much?

I don't think that 1000*5 is a multiple of 5 below 1000. 1000*5 is 5000, which is not below 1000.
It seems like what you want is:
(1 until 1000).filter(x => x % 3 == 0 || x % 5 == 0).sum
which doesn't use a "for-loop" (1 until 1000 stops at 999, so the range really is "below 1000"). A lot of people would cringe at that term, since Scala doesn't really have for-loops. If you MUST use the for construct, perhaps you would write
(for (x <- 1 until 1000 if x % 3 == 0 || x % 5 == 0) yield x).sum
which is exactly the same thing as above.
You could also (though I would not recommend it) use mutation:
var s = 0
for { x <- 1 until 1000 } { if (x % 3 == 0 || x % 5 == 0) s += x }
s
which could also be
var s = 0
for { x <- 1 until 1000 if (x % 3 == 0 || x % 5 == 0) } { s += x }
s
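As a cross-check for the "multiples of 3 or 5 strictly below 1000" reading, the sum can also be computed without any loop, using the arithmetic series k * (1 + 2 + ... + m) and inclusion-exclusion (multiples of 15 are counted twice and subtracted once). This is just a sketch; the names MultiplesSum, sumOfMultiplesBelow and sum3or5Below are invented for the example.

```scala
object MultiplesSum {
  // Sum of all positive multiples of k strictly below limit:
  // k * (1 + 2 + ... + m) with m = (limit - 1) / k
  def sumOfMultiplesBelow(k: Int, limit: Int): Long = {
    val m = (limit - 1) / k
    k.toLong * m * (m + 1) / 2
  }

  // Multiples of 3 or 5: add both series, subtract the multiples of 15
  // because they appear in both
  def sum3or5Below(limit: Int): Long =
    sumOfMultiplesBelow(3, limit) + sumOfMultiplesBelow(5, limit) - sumOfMultiplesBelow(15, limit)

  def main(args: Array[String]): Unit = {
    val closedForm = sum3or5Below(1000)
    val bruteForce = (1 until 1000).filter(x => x % 3 == 0 || x % 5 == 0).sum
    println(closedForm)               // 233168
    println(closedForm == bruteForce) // true
  }
}
```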

If you want to follow the principles of functional programming, you would do it recursively; better still, you can use tail recursion (sorry that the example is not that good, but it's pretty late).
def calc(factorB: Int): Int = {
  if (factorB + 1 >= 1000)
    3 * factorB + 5 * (factorB + 1)
  else
    3 * factorB + 5 * (factorB + 1) + calc(factorB + 2)
}
In a for-loop you can do it like
var result = 0
for (i <- 1 to 1000) {
  result += i * (if (i % 2 == 0) 5 else 3)
}
After the for-loop, result holds the calculated value. The downside is that you're using a var instead of a val. Note that Scala has no ?: ternary operator; the if/else expression is used instead.
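The recursive calc above still has work to do after the recursive call returns, so it is not tail-recursive. A tail-recursive variant carries the running total in an accumulator parameter. The following is a sketch of that idea for the literal reading of the question (odd i weighted by 3, even i weighted by 5); the names WeightedSum, weightedSum and loop are made up.

```scala
import scala.annotation.tailrec

object WeightedSum {
  // Sums i * 3 for odd i and i * 5 for even i, for i from 1 to limit
  def weightedSum(limit: Int): Int = {
    // @tailrec makes the compiler verify that the call is in tail position
    @tailrec
    def loop(i: Int, acc: Int): Int =
      if (i > limit) acc
      else loop(i + 1, acc + i * (if (i % 2 == 0) 5 else 3))
    loop(1, 0)
  }

  def main(args: Array[String]): Unit = {
    println(weightedSum(1000)) // 2002500
  }
}
```

Because the recursion is in tail position, the compiler turns it into a loop, so there is no var and no risk of a stack overflow for large limits.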
