PySpark join Key from first RDD to Values from second RDD

I have one RDD, userPreferences, which looks like this:
userPreferences = [(userID = 123, userPreferences = [1, 5]),
(userID = 213, userPreferences = [2, 3])]
and a second RDD, words:
words = [(bookID = 1, words = ["hi", "no", "yes"]),
(bookID = 5, words = ["no", "yes"]),
(bookID = 3, words = ["absolutely"])]
I only included the labels userID, userPreferences, bookID and words to make the explanation clearer. My goal is to find, for each userID, all words from the books in that user's userPreferences. In other words, I want to join the key of the words RDD against the values of the userPreferences RDD. The result could look like this:
thirdRDD = [(userID = 123, words = ["hi", "no", "yes"]),
(userID = 213, words = ["absolutely"])]
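One possible approach (a minimal sketch, not from the original thread; it assumes a SparkContext sc and the two pair RDDs above) is to flip userPreferences into (bookID, userID) pairs with flatMap, join on bookID against words, and then regroup by userID:
userPreferences = sc.parallelize([(123, [1, 5]), (213, [2, 3])])
words = sc.parallelize([(1, ["hi", "no", "yes"]),
                        (5, ["no", "yes"]),
                        (3, ["absolutely"])])

# Invert userPreferences so that bookID becomes the join key.
bookToUser = userPreferences.flatMap(lambda kv: [(bookID, kv[0]) for bookID in kv[1]])

# Join on bookID, reshape to (userID, [words...]), and merge the word lists per user.
thirdRDD = (bookToUser.join(words)                 # (bookID, (userID, [words...]))
            .map(lambda kv: (kv[1][0], kv[1][1]))  # (userID, [words...])
            .reduceByKey(lambda a, b: list(set(a) | set(b))))

print(thirdRDD.collect())
# e.g. [(123, ['hi', 'no', 'yes']), (213, ['absolutely'])]
Note that books listed in a user's preferences but missing from words (bookID 2 here) are simply dropped by the inner join.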

Related

Remove the duplicates in-place such that each unique element appears only once

// Given an integer array nums sorted in non-decreasing order, remove the duplicates in-place such that each unique element appears only once. The relative order of the elements should be kept the same.
// Input: nums = [1,1,2]
// Output: 2, nums = [1,2,_]
// Explanation: Your function should return k = 2, with the first two elements of nums being 1 and 2 respectively.
// It does not matter what you leave beyond the returned k (hence they are underscores).
// Input: nums = [0,0,1,1,1,2,2,3,3,4]
// Output: 5, nums = [0,1,2,3,4,_,_,_,_,_]
// Explanation: Your function should return k = 5, with the first five elements of nums being 0, 1, 2, 3, and 4 respectively.
// It does not matter what you leave beyond the returned k (hence they are underscores).
int removeDuplicates(List<int> nums) {
  if (nums.length < 2) {
    return nums.length;
  }
  // p1 points at the last unique value written so far; p2 scans ahead.
  int p1 = 0, p2 = 1;
  while (p2 < nums.length) {
    if (nums[p1] < nums[p2]) {
      // Found a new unique value: move it right after the previous unique one.
      nums[++p1] = nums[p2];
    }
    p2++;
  }
  return p1 + 1;
}
This is LC26. The solution uses the two-pointer method: the first pointer marks the latest unique number, while the second pointer advances through the array and, whenever the value it points at is greater than the value at the first pointer, that value is copied forward as the next unique element.
void main() {
  // LC26: remove duplicates from a sorted array (same problem statement as above).
  var array = [1, 1, 2];
  var array2 = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4];
  findExpectedArray(array);
  findExpectedArray(array2);
}
void findExpectedArray(List<int> array) {
  var expectedArray = [];
  // A Set keeps only the unique values, in insertion order.
  Set<int> setOfArray = array.toSet();
  for (var i = 0; i < array.length; i++) {
    if (i < setOfArray.length) {
      expectedArray.add('${setOfArray.elementAt(i)}');
    } else {
      expectedArray.add('_');
    }
  }
  print(expectedArray);
}

How to create a multidimensional array with two types

I have to create an array whose elements each hold two values, a String and an Integer.
How do I create the empty array and append new values?
Thanks for your help.
This is what I need:
var myData = [ ["1": 15], ["2" : 30], ["3": 15], ["4" : 30] ]
Why not an array of tuples?
var data : [(String, Int)] = [("1", 15),
("2", 30),
("3", 15),
("4", 30)]
data.append(("5", 50))
let value = data[0]
let yourString = value.0 // "1"
let yourInteger = value.1 // 15
The keyword for this topic is an array of dictionaries.
You should create the array of dictionaries with this syntax:
var testData: [[String: Int]] = []
let data_1 = ["1" : 15]
let data_2 = ["2" : 30]
let data_3 = ["3" : 15]
let data_4 = ["4" : 30]
You can append a whole array of elements at once:
testData.append(contentsOf: [data_1, data_2, data_3, data_4])
Or you can append each element individually:
testData.append(data_1)
testData.append(data_2)
testData.append(data_3)
testData.append(data_4)
You can read more about Collection Types in Swift:
https://docs.swift.org/swift-book/LanguageGuide/CollectionTypes.html
I think this should work:
var myData = [Dictionary<String, Int>]()

Spark Scala: generating a random RDD with 1's and 0's?

How does one create an RDD filled with values from an array, say (0, 1), with 1000 randomly placed values set to 1 and the rest to 0?
I know I can filter to achieve this, but then it won't be random. I want it to be as random as possible.
var populationMatrix = new IndexedRowMatrix(RandomRDDs.uniformVectorRDD(sc, populationSize, chromosomeLength))
I was exploring random RDDs in Spark but couldn't find anything that meets my needs.
Not really sure if this is what you are looking for, but with this code you can create an RDD backed by a randomly shuffled array of 1's and 0's:
import scala.util.Random
val arraySize = 15 // Total number of elements that you want
val numberOfOnes = 10 // From that total, how many do you want to be ones
val listOfOnes = List.fill(numberOfOnes)(1) // List of 1s
val listOfZeros = List.fill(arraySize - numberOfOnes)(0) // Rest list of 0s
val listOfOnesAndZeros = listOfOnes ::: listOfZeros // Merge lists
val randomList = Random.shuffle(listOfOnesAndZeros) // Random shuffle
val randomRDD = sc.parallelize(randomList) // RDD creation
randomRDD.collect() // Array[Int] = Array(1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1)
Or, if you want to use only RDDs:
val arraySize = 15
val numberOfOnes = 10
val rddOfOnes = spark.range(numberOfOnes).map(_ => 1).rdd
val rddOfZeros = spark.range(arraySize - numberOfOnes).map(_ => 0).rdd
val rddOfOnesAndZeros = rddOfOnes ++ rddOfZeros
// Pair each element with a random key, repartition by that key, then drop the keys to shuffle the RDD.
val shuffleResult = rddOfOnesAndZeros.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}).partitionBy(new org.apache.spark.HashPartitioner(rddOfOnesAndZeros.partitions.size)).values
shuffleResult.collect() // Array[Int] = Array(0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
Let me know if this is what you needed.

Pyspark dataframe operator "IS NOT IN"

I would like to rewrite this from R to PySpark, any nice-looking suggestions?
array <- c(1,2,3)
dataset <- filter(!(column %in% array))
In pyspark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the binary NOT operator:
dataframe.filter(~dataframe.column.isin(array))
Use the ~ operator, which means negation:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
df_result = df[df.column_name.isin([1, 2, 3]) == False]
Slightly different syntax, and a "date" data set:
toGetDates={'2017-11-09', '2017-11-11', '2017-11-12'}
df= df.filter(df['DATE'].isin(toGetDates) == False)
The * (argument unpacking) is not needed. So:
values = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(values))
You can also use .subtract().
Example:
df1 = df.filter(df["column_name"].isin([1, 2, 3]))
df2 = df.subtract(df1)
This way, df2 contains every row of df that is not in df1.
You can also loop the array and filter:
array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

Elegant way of turning Dictionary or Generator or Sequence in Swift into an Array?

Is there an elegant way to convert a Dictionary (or Sequence or Generator) into an Array? I know I can convert it by looping through the sequence as follows:
var d = ["foo" : 1, "bar" : 2]
var g: DictionaryGenerator<String, Int> = d.generate()
var a = Array<(String, Int)>()
while let item = g.next() {
    a.append(item)
}
I am hoping there is something similar to Python's easy conversion:
>>> q = range(10)
>>> i = iter(q)
>>> i
<listiterator object at 0x1082b2090>
>>> z = list(i)
>>> z
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>>
The + operator for an array will accept a sequence, so you can write
var d = ["foo" : 1, "bar" : 2]
var a = [] + d
I don't think anything similar is possible for generators, though.
Just pass it to the Array’s init:
var dict = ["foo" : 1, "bar" : 2]
var arr = Array(dict)