Using map in spark to make dictionary format - pyspark

I executed the following code:
temp = rdd.map( lambda p: ( p[0], (p[1],p[2],p[3],p[4],p[5]) ) ).groupByKey().mapValues(list).collect()
print(temp)
and I could get data:
[ ("A", [("a", 1, 2, 3, 4), ("b", 2, 3, 4, 5), ("c", 4, 5, 6, 7)]) ]
I'm trying to turn the second element (the list of tuples) into a dictionary.
For example, I want to restructure temp into this format:
("A", {"a": [1, 2, 3, 4], "b":[2, 3, 4, 5], "c":[4, 5, 6, 7]})
Is there any clear way to do this?

If I understood you correctly you need something like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
    ["A", "a", 1, 2, 5, 6],
    ["A", "b", 3, 4, 6, 9],
    ["A", "c", 7, 5, 6, 0],
]
rdd = spark.sparkContext.parallelize(data)
temp = (
    # key each row by its first column; wrap the rest in a one-entry dict
    rdd.map(lambda x: (x[0], {x[1]: [x[2], x[3], x[4], x[5]]}))
    .groupByKey()
    .mapValues(list)
    # merge the per-row dicts of each key into a single dict
    .mapValues(lambda x: {k: v for y in x for k, v in y.items()})
)
print(temp.collect())
# [('A', {'a': [1, 2, 5, 6], 'b': [3, 4, 6, 9], 'c': [7, 5, 6, 0]})]

This is easily doable with a custom Python function once you obtain the temp object. You just need to use tuple, list and dict manipulation.
def my_format(l):
    # get tuple inside list
    tup = l[0]
    # create dictionary with key equal to first value of each sub-tuple
    dct = {}
    for e in tup[1]:
        dct2 = {e[0]: list(e[1:])}
        dct.update(dct2)
    # combine first element of list with dictionary
    return (tup[0], dct)
my_format(temp)
# ('A', {'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5], 'c': [4, 5, 6, 7]})
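The function above only looks at the first entry of temp. If the collected list holds several keys, the same tuple and dict manipulation can be applied to every entry. A minimal sketch, assuming temp has the shape shown in the question; the name to_dicts and the extra "B" entry are made up for illustration:

```python
def to_dicts(collected):
    """Turn [(key, [(name, v1, v2, ...), ...]), ...] into [(key, {name: [v1, v2, ...]}), ...]."""
    return [
        (key, {row[0]: list(row[1:]) for row in rows})
        for key, rows in collected
    ]


# Same shape as the collected result in the question, plus a hypothetical "B" entry.
temp = [
    ("A", [("a", 1, 2, 3, 4), ("b", 2, 3, 4, 5), ("c", 4, 5, 6, 7)]),
    ("B", [("d", 0, 0, 0, 1)]),
]
result = to_dicts(temp)
print(result)
```

This runs on the driver after collect(), so it is only suitable when the grouped data fits in memory.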

Related

Find the greatest increase value in a Map

I am at the beginning of my Scala journey. I am trying to find the row with the highest increase in a given dataset of type Map[String, List[Int]]. The program should calculate the increase (or decrease) between the 7th-last value and the last value of each row's List, and then print the row with the highest increase in the entire Map. For example, given the following dataset:
DATASET
SK1, 9, 7, 2, 0, 7, 3, 7, 9, 1, 2, 8, 1, 9, 6, 5, 3, 2, 2, 7, 2, 8, 5, 4, 5, 1, 6, 5, 2, 4, 1
SK2, 0, 7, 6, 3, 3, 3, 1, 6, 9, 2, 9, 7, 8, 7, 3, 6, 3, 5, 5, 2, 9, 7, 3, 4, 6, 3, 4, 3, 4, 1
SK3, 8, 7, 1, 8, 0, 5, 8, 3, 5, 9, 7, 5, 4, 7, 9, 8, 1, 4, 6, 5, 6, 6, 3, 6, 8, 8, 7, 4, 0, 7
The program should calculate the increase of each row:
SK1 = 1 "last value" - 5 "7th last value" = - 4
SK2 = 1 "last value" - 4 "7th last value" = - 3
SK3 = 7 "last value" - 6 "7th last value" = 1
The program should then print SK3 - 1 because it is the highest increase.
The program can calculate the increase of each row, but it currently needs an SK input, using the following two methods:
def rise(stock: String): (Int) = {
  mapdata.get(stock).map(findLast(_)).getOrElse(0) -
    (mapdata.get(stock).map(_.takeRight(7).head.toInt).getOrElse(0))
}
def stockRise(stock: String): (String, Int) = {
  (stock, rise(stock))
}
The two methods are then called within the program menu using:
def handleFive(): Boolean = {
  menuShowSingleDataStock(stockRise)
  true
}
//Pull two rows from the dataset
def menuShowDoubleDataStock(resultCalculator: (String, String) => (String, Int)) = {
  print("Please insert the Stock > ")
  val stockName1 = readLine
  print("Please insert the Stock > ")
  val stockName2 = readLine
  val result = resultCalculator(stockName1, stockName2)
  println(s"${result._1}: ${result._2}")
}
I have tried to call the following method, which is meant to calculate the rise of every row, but it doesn't seem to be working:
def menuShowStocks(f: () => Map[String, List[Int]]) = {
  val highestIncrese = 0
  f() foreach { case (x, y) => println(s"$x: $y") }
}
A common approach is:
1. First map each row and calculate its score.
2. Then use an aggregation function to select the desired row.
Here we go:
scala> val dataSet = Map(
| "SK1" -> List(9, 7, 2, 0, 7, 3, 7, 9, 1, 2, 8, 1, 9, 6, 5, 3, 2, 2, 7, 2, 8, 5, 4, 5, 1, 6, 5, 2, 4, 1),
| "SK2" -> List(0, 7, 6, 3, 3, 3, 1, 6, 9, 2, 9, 7, 8, 7, 3, 6, 3, 5, 5, 2, 9, 7, 3, 4, 6, 3, 4, 3, 4, 1),
| "SK3" -> List(8, 7, 1, 8, 0, 5, 8, 3, 5, 9, 7, 5, 4, 7, 9, 8, 1, 4, 6, 5, 6, 6, 3, 6, 8, 8, 7, 4, 0, 7)
| )
val dataSet: Map[String, List[Int]] = Map(SK1 -> List(9, 7, 2, 0, 7, 3, 7, 9, 1, 2, 8, 1, 9, 6, 5, 3, 2, 2, 7, 2, 8, 5, 4, 5, 1, 6, 5, 2, 4, 1), SK2 -> List(0, 7, 6, 3, 3, 3, 1, 6, 9, 2, 9, 7, 8, 7, 3, 6, 3, 5, 5, 2, 9, 7, 3, 4, 6, 3, 4, 3, 4, 1), SK3 -> List(8, 7, 1, 8, 0, 5, 8, 3, 5, 9, 7, 5, 4, 7, 9, 8, 1, 4, 6, 5, 6, 6, 3, 6, 8, 8, 7, 4, 0, 7))
scala> val highestIncrease = dataSet
| .toSeq
| .map { case (name, ints) =>
| name -> (ints.last - ints(ints.length - 7))
| }
| .maxBy(_._2)
val highestIncrease: (String, Int) = (SK3,1)
Some notes:
The map is converted to a Seq first with toSeq. Mapping over Maps is entirely possible but a bit more complicated, so it is better left for a later learning moment. This produces a Seq[(String, List[Int])].
Using map we iterate over the tuples in the Seq. This uses pattern matching to extract the variables name and ints.
The score is calculated. Also, we use the -> operator to construct a new tuple of 2 items so we hang on to the name of the row.
Method maxBy accepts a function that extracts the value to compare. The expression _._2, equivalent to x => x._2, is a function that gives the second value in each tuple.
The following could print the name of what we found:
println(s"The highest increase is in dataset ${highestIncrease._1} and is ${highestIncrease._2}.")

I am trying to remove duplicate elements from a list using the code below, but I am getting the output below. Why am I not getting the expected output?

my_list = [1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6]
for item in my_list:
    if my_list.count(item) > 1:
        my_list.pop(item)
print(my_list) # => actual result - [1, 2, 3, 4, 5, 6, 6, 6, 6]
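Two things are worth noting about the loop above: list.pop takes an index rather than a value, and the list shrinks while the for loop is still walking it, so some positions are never visited. A minimal sketch of one order-preserving alternative, assuming the expected result is [1, 2, 3, 4, 5, 6]; the name deduped is only illustrative:

```python
my_list = [1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6]

# dict keys are unique and keep insertion order (Python 3.7+),
# so building a dict from the items drops repeats without reordering.
deduped = list(dict.fromkeys(my_list))
print(deduped)  # [1, 2, 3, 4, 5, 6]
```

This builds a new list instead of mutating my_list during iteration, which avoids the skipped positions.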

Use partition in Swift to move some predefined elements to the end of the Array

I would like to use Array.partition(by:) to move some predefined elements of an array to the end of it.
Example:
var my_array = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
let elementsToMove = [1, 3, 4, 5, 8]
// desired result: [0, 2, 6, 7, 9, ...remaining items in any order...]
Is there an elegant way to do that? Observe that elementsToMove does not follow a pattern.
partition(by:) does not preserve the order of the elements:
var my_array = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
let elementsToMove = [1, 3, 4, 5, 8]
_ = my_array.partition(by: { elementsToMove.contains($0) } )
print(my_array) // [0, 9, 2, 7, 6, 5, 4, 3, 8, 1]
A simple solution would be to filter out the elements of the second array and then append them:
let my_array = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
let elementsToMove = [1, 3, 4, 5, 8]
let newArray = my_array.filter({ !elementsToMove.contains($0) }) + elementsToMove
print(newArray) // [0, 2, 6, 7, 9, 1, 3, 4, 5, 8]
For larger arrays it can be advantageous to create a set of the to-be-moved elements first:
let my_array = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
let elementsToMove = [1, 3, 4, 5, 8]
let setToMove = Set(elementsToMove)
let newArray = my_array.filter({ !setToMove.contains($0) }) + elementsToMove
print(newArray) // [0, 2, 6, 7, 9, 1, 3, 4, 5, 8]
If the objects in my_array are unique, you can try something like this:
var my_array = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
var tempArray = my_array //Preserve original array
let elementsToMove = [1, 3, 4, 5, 8]
let p = tempArray.partition(by: elementsToMove.contains)
//Now sort first part of tempArray on basis of your my_array to get order you want
let newArray = tempArray[0..<p].sorted(by: { my_array.index(of: $0)! < my_array.index(of: $1)! }) + tempArray[p...]
print(newArray)
Output
[0, 2, 6, 7, 9, 5, 4, 3, 8, 1]

Spark - Why does ArrayBuffer seem to get elements that haven't been traversed yet

Why does the ArrayBuffer in mapPartitions seem to contain elements that have not been traversed yet?
For instance, the way I read this code, the first item should have 1 element, the second 2, the third 3, and so on. How is it possible that the first ArrayBuffer in the output has 9 items? That would seem to imply that there were 9 iterations before the first output, but the yields count makes it clear that this was the first iteration.
val a = ArrayBuffer[Int]()
for(i <- 1 to 9) a += i
for(i <- 1 to 9) a += 9-i
val rdd1 = sc.parallelize(a.toArray())
def timePivotWithLoss(iter: Iterator[Int]) : Iterator[Row] = {
  val currentArray = ArrayBuffer[Int]()
  var loss = 0
  var yields = 0
  for (item <- iter) yield {
    currentArray += item
    //var left : Int = -1
    yields += 1
    Row(yields, item.toString(), currentArray)
  }
}
rdd1.mapPartitions(it => timePivotWithLoss(it)).collect()
Output -
[1,1,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[2,2,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[3,3,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[4,4,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[5,5,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[6,6,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[7,7,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[8,8,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[9,9,ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)]
[1,8,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[2,7,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[3,6,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[4,5,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[5,4,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[6,3,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[7,2,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[8,1,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
[9,0,ArrayBuffer(8, 7, 6, 5, 4, 3, 2, 1, 0)]
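This happens because all rows in the partition hold a reference to the same mutable object. By the time collect() returns, that one ArrayBuffer has received every element of the partition, so every row shows the full contents. Spilling to disk could further make the result non-deterministic, with some objects being serialized before later changes and therefore not reflecting them.
You can use a mutable reference to an immutable object instead: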
def timePivotWithLoss(iter: Iterator[Int]) : Iterator[Row] = {
  var currentArray = Vector[Int]()
  var loss = 0
  var yields = 0
  for (item <- iter) yield {
    currentArray = currentArray :+ item
    yields += 1
    Row(yields, item.toString(), currentArray)
  }
}
but in general, mutable state and Spark are not a good match.
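The same aliasing effect can be reproduced outside Spark. The following is a plain-Python sketch, not Spark code: a lazy generator stands in for the partition iterator, and the function names shared_buffer and fresh_snapshot are invented for this illustration. Yielding the same list object every time makes all collected rows show the final contents, while building a new immutable value per row keeps each snapshot.

```python
def shared_buffer(items):
    """Yield (count, item, buffer) where every row references the same list."""
    buffer = []
    for count, item in enumerate(items, start=1):
        buffer.append(item)
        yield (count, item, buffer)  # same object in every row


def fresh_snapshot(items):
    """Yield (count, item, snapshot) where every row gets its own immutable tuple."""
    buffer = ()
    for count, item in enumerate(items, start=1):
        buffer = buffer + (item,)  # new tuple; earlier rows keep the old one
        yield (count, item, buffer)


# Materialize both generators first, as collect() would, then inspect the rows.
shared_rows = list(shared_buffer([1, 2, 3]))
snapshot_rows = list(fresh_snapshot([1, 2, 3]))

print(shared_rows)    # every row shows [1, 2, 3]
print(snapshot_rows)  # rows show (1,), (1, 2), (1, 2, 3)
```

The second function mirrors the Vector-based fix above: the variable is reassigned, but each yielded value is never modified afterwards.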

How to split an `NSMutableArray` into chunks in Swift 3?

NSMutableArray *sample;
I have an NSMutableArray, and I want to split it into chunks. I searched the internet but didn't find a solution for it; I only found a link about splitting an integer array.
How about this, which is more Swifty?
let integerArray = [1,2,3,4,5,6,7,8,9,10]
let stringArray = ["a", "b", "c", "d", "e", "f"]
let anyObjectArray: [Any] = ["a", 1, "b", 2, "c", 3]
extension Array {
    func chunks(_ chunkSize: Int) -> [[Element]] {
        return stride(from: 0, to: self.count, by: chunkSize).map {
            Array(self[$0..<Swift.min($0 + chunkSize, self.count)])
        }
    }
}
integerArray.chunks(2) //[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
stringArray.chunks(3) //[["a", "b", "c"], ["d", "e", "f"]]
anyObjectArray.chunks(2) //[["a", 1], ["b", 2], ["c", 3]]
To convert an NSMutableArray to a Swift Array:
let nsarray = NSMutableArray(array: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
if let swiftArray = nsarray as NSArray as? [Int] {
    swiftArray.chunks(2) //[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
}
If you insist on using NSArray, then:
let nsarray = NSMutableArray(array: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
extension NSArray {
    // subarray(with:) returns [Any], so the chunks are typed as [[Any]]
    func chunks(_ chunkSize: Int) -> [[Any]] {
        return stride(from: 0, to: self.count, by: chunkSize).map {
            self.subarray(with: NSRange(location: $0, length: Swift.min(chunkSize, self.count - $0)))
        }
    }
}
nsarray.chunks(3) //[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
You can use the subarray method.
let array = NSArray(array: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
let left = array.subarray(with: NSMakeRange(0, 5))
let right = array.subarray(with: NSMakeRange(5, 5))