PySpark DataFrame operator "IS NOT IN"

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?
array <- c(1,2,3)
dataset <- filter(!(column %in% array))

In PySpark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the ~ (NOT) operator:
dataframe.filter(~dataframe.column.isin(array))

Use the ~ operator, which negates the condition:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

df_result = df[df.column_name.isin([1, 2, 3]) == False]

Slightly different syntax, and with a "date" data set:
toGetDates={'2017-11-09', '2017-11-11', '2017-11-12'}
df= df.filter(df['DATE'].isin(toGetDates) == False)

The * (argument unpacking) is not needed when passing a list to isin. So:
list = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(list))
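As a minimal end-to-end sketch (assuming an existing SparkSession named spark and a hypothetical column named value):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical example data: one integer column named "value"
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["value"])

array = [1, 2, 3]
df.filter(~df["value"].isin(array)).show()  # leaves only the rows where value is 4 or 5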

You can use the .subtract() buddy.
Example:
df1 = df.filter(df["column"].isin([1, 2, 3]))  # the rows you want to exclude
df2 = df.subtract(df1)
This way, df2 is defined as every row of df that is not in df1.
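A small usage sketch with hypothetical data (note that subtract compares whole rows and, like EXCEPT DISTINCT in SQL, also de-duplicates the result):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical single-column data; "column" is just a placeholder name
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (4,)], ["column"])

df1 = df.filter(df["column"].isin([1, 2, 3]))  # rows to exclude
df2 = df.subtract(df1)
df2.show()  # only the value 4 remains, and only once, because of the de-duplication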

You can also loop the array and filter:
array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)
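Each pass through the loop adds one more filter, so the result is the same chain of conditions; if you prefer, the loop can be written as a fold (a sketch, with "column" again as a placeholder column name):
from functools import reduce
from pyspark.sql.functions import col

array = [1, 2, 3]
df = reduce(lambda acc, i: acc.filter(col("column") != i), array, df)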

Related

Scala - how to copy from list_a to list_b in a loop using counter

I am getting a list of strings using this:
val v = session(InfraExecConstants_v2.getSuAndSave).as[Vector[String]].toList
Let's say my list looks like this: list_a("1", "2", "3")
I would like to copy this list to a new list in a loop, using a counter, like this:
list_a("1", "2", "3")
counter = 10
The code should use "list_a" and the "counter" to add values to "list_b".
Now list_b will look like this:
list_b("1", "2", "3", "1", "2", "3", "1", "2", "3", "1")
Thanks.
In Scala, when working with a List, you probably do not want to use loops and counters. For one thing, you want to avoid overly low-level and imperative code in general, and List is not well suited for it either.
For your specific use case I would create a lazy Iterator that infinitely repeats your list and then take the first 10 elements.
val listA = List(1, 2, 3)
val listB = Iterator.continually(listA).flatten.take(10).toList
// List(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)
You still have to check that listA is not empty though, or you will loop forever or possibly run out of memory.
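A minimal sketch of such a guard, assuming an empty result is what you want for an empty input:
val listB =
  if (listA.isEmpty) List.empty[Int]
  else Iterator.continually(listA).flatten.take(10).toList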
Succinct but not terribly memory-efficient.
val list_b = List.fill(10)(list_a).flatten.take(10)
It works even if list_a is empty.
I think the simplest solution based on Iterator is:
val listA: List[Int] = List(1, 2, 3)
val counter: Int = 10
val listB =
  Iterator
    .tabulate(counter)(n => listA(n % listA.size)) // access the source list circularly
    .toList
// List(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)
Note: I noticed how popular the answers based on continually are, but in my opinion, given your question, tabulate better resembles a direct circular loop.
There are quite a few answers, but I decided to share my solution anyway. Its advantage is that unnecessary elements are not created.
val list = List(1, 2, 3)
val counter = 10
val size = list.size
val howManyTimes = counter / size
val diff = counter % size
val listB =
  if (diff > 0) List.fill(howManyTimes + 1)(list).flatten.dropRight(size - diff)
  else List.fill(howManyTimes)(list).flatten

Checking multiple array elements and returning true if all match

I want to find whether the corresponding elements of the arrays below all match:
val a = Array(1,1,1)
val b = Array(1,0,1)
val c = Array(0,1,1)
Here the output should be
Array(0, 0, 1)
since a(2), b(2), and c(2) are all 1, whereas for the other positions it is 0. Is there a functional way of solving this in Scala?
If the arrays are all the same size, then one approach is to transpose the arrays, then map-and-reduce the result with the bitwise AND operator &:
val a = Array(1, 1, 1)
val b = Array(1, 0, 1)
val c = Array(0, 1, 1)
val result = Array(a, b, c).transpose.map(_.reduce(_ & _)) // Array(0, 0, 1)
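For clarity, the intermediate steps look like this:
Array(a, b, c).transpose
// Array(Array(1, 1, 0), Array(1, 0, 1), Array(1, 1, 1)): one inner array per column

Array(a, b, c).transpose.map(_.reduce(_ & _))
// Array(0, 0, 1): each column ANDed together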

Dropping multiple columns from Spark dataframe by Iterating through the columns from a Scala List of Column names

I have a dataframe with around 400 columns, and I want to drop 100 of them as per my requirement.
So I have created a Scala List of 100 column names.
Then I want to iterate through a for loop to actually drop a column in each iteration.
Below is the code.
final val dropList: List[String] = List("Col1", "Col2", ...., "Col100")
def drpColsfunc(inputDF: DataFrame): DataFrame = {
  for (i <- 0 to dropList.length - 1) {
    val returnDF = inputDF.drop(dropList(i))
  }
  return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
If you want to do nothing more complex than dropping several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
Answer:
val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(
  df.columns
    .filter(colName => !colsToRemove.contains(colName))
    .map(colName => new Column(colName)): _*
)
This should work fine:
val dropList: List[String] = ???   // your list of column names to drop
val df: DataFrame = ???            // your dataframe
val test_df = df.drop(dropList: _*)
You can just do:
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
  dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It will return the DataFrame without the columns passed in dropList.
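For example, with the identifiers from the question it would be called like this:
val test_df = dropColumns(input_dataframe, dropList)
test_df.show(5)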
As an example (of what's happening behind the scene), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in our case, map it to your DataFrame) is the result of the latest filter. After each fold step, the latest result is passed as the accumulator to the next invocation of the function.
You can use the drop operation to drop multiple columns. If you have the column names you need to drop in a list, you can pass that list using :_* after the list variable, and it will drop all the columns in the list.
Scala:
val df = Seq(("One","Two","Three"),("One","Two","Three"),("One","Two","Three")).toDF("Name","Name1","Name2")
val columnstoDrop = List("Name","Name1")
val df1 = df.drop(columnstoDrop:_*)
Python:
In Python you can use the * operator to do the same thing.
data = [("One", "Two","Three"), ("One", "Two","Three"), ("One", "Two","Three")]
columns = ["Name","Name1","Name2"]
df = spark.sparkContext.parallelize(data).toDF(columns)
columnstoDrop = ["Name","Name1"]
df1 = df.drop(*columnstoDrop)
Now in df1 you would get the dataframe with only one column, i.e. Name2.

How to zip after distinct in PySpark

The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print(z.zip(y).collect())
The error that is produced depends on whether multiple partitions have been specified or not.
I understand that
the two RDDs [must] have the same number of partitions and the same number of elements in each partition.
What is the best way to work around this restriction?
I have been performing the operation with the following code, but I am hoping to find something more efficient.
def safe_zip(left, right):
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
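For example, applied to the RDDs above (the ordering after distinct() is not guaranteed, so treat the output as illustrative):
z = x.distinct()
safe_zip(z, y).collect()
# e.g. [(1, 1), (2, 2), (3, 3)]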
I think this would be accomplished by using cartesian() on your RDD
import pyspark
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
x.distinct().cartesian(y.distinct()).collect()

Filter matching and non-matching elements into different halves of a tuple

Is there a simple and efficient way to perform the following in Scala?
val elements = List(1, 2, 3, 4, 5, 6)
val (odd, even) = elements.filter(_ % 2 == 0)
I am aware of groupBy, but I would like something that works with a constant number of groups that can be extracted into separate values.
List.partition does what you want:
val (even, odd) = elements.partition(_ % 2 == 0)
Note that it works only with two final groups.
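With the list from the question this gives:
val elements = List(1, 2, 3, 4, 5, 6)
val (even, odd) = elements.partition(_ % 2 == 0)
// even: List(2, 4, 6)
// odd:  List(1, 3, 5)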