From many pyspark columns (with certain condition) to one column with all the conditions combined. PYSPARK - pyspark

I have one Python list with some PySpark columns which contains certain condition. I want to have just one column that summarizes all the conditions I have in the list of columns.
I've tried to use the sum() operation to combine all the columns but it didn't work (obviusly). Also, I've been checking the documentation https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
But nothing seemed to work for me.
I'm doing something like this:
my_condition_list = [col(c).isNotNull() for c in some_of_my_sdf_columns]
That returns a list of different Pyspark columns, I want just one with all the conditiones included combined with the | operator so I can use it in a .filter() or .when() clause.
THANK YOU

PySpark wouldn't accept a list as to where/filter condition. It accepts either a string or condition.
The way you tried wouldn't work, you need to tweak certain things to work on it. Below are 2 approaches for this -
data = [(("ID1", 3, None)), (("ID2", 4, 12)), (("ID3", None, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()
from pyspark.sql import functions as F
way - 1
#below change df_name if you have any other name
df_name = "df"
my_condition_list = ["%s['%s'].isNotNull()"%(df_name, c) for c in df.columns]
print (my_condition_list[0])
"df['ID'].isNotNull()"
print (" & ".join(my_condition_list))
"df['ID'].isNotNull() & df['colA'].isNotNull() & df['colB'].isNotNull()"
print (eval(" & ".join(my_condition_list)))
Column<b'(((ID IS NOT NULL) AND (colA IS NOT NULL)) AND (colB IS NOT NULL))'>
df.filter(eval(" & ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(eval(" | ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
way - 2
my_condition_list = ["%s is not null"%c for c in df.columns]
print (my_condition_list[0])
'ID is not null'
print (" and ".join(my_condition_list))
'ID is not null and colA is not null and colB is not null'
df.filter(" and ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(" or ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
Preferred way is way-2

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
columnName
maxLength
filename
col2
4
file3
col3
3
file2
The real dataframes have 300+ columns and millions of rows so I cannot hard-type column names.
#wBob does the following achieve your goal?
group by file name and get the maximum per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()`
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushed a little further for better result.
df.select(
col("*"),
array( // make array of columns name/value/length
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] it will sort on the first field first and second field next. This works as we put the size of the sting up front. If you re-order the fields you'll get different results. You can easily accept more than 1 result if you so desired but I think dsicovering a row is challenging is likely enough.
df.select(
col("*"),
reverse( //sort ascending
sort_array( //sort descending
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

Spark Scala - Bitwise operation in filter

I have a source dataset aggregated with columns col1, and col2. Col2 values are aggregated by bitwise OR operation. I need to apply filter on the Col2 values to select records whose bits are on for 8,4,2
initial source raw data
val SourceRawData = Seq(("Flag1", 2),
("Flag1", 4),("Flag1", 8), ("Flag2", 8), ("Flag2", 16),("Flag2", 32)
,("Flag3", 2),("Flag4", 32),
("Flag5", 2), ("Flag5", 8)).toDF("col1", "col2")
SourceRawData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 2|
|Flag1| 4|
|Flag1| 8|
|Flag2| 8|
|Flag2| 16|
|Flag2| 32|
|Flag3| 2|
|Flag4| 32|
|Flag5| 2|
|Flag5| 8|
+-----+----+
Aggregated source data based on 'SourceRawData above' after collapsing Col1 values to single row per Col1 value and this is provided other team and Col2 values are aggregated by Bitwise OR operation. Note I here i am providing the output not the actual aggregation logic
val AggregatedSourceData = Seq(("Flag1", 14L),
("Flag2", 56L),("Flag3", 2L), ("Flag4", 32L), ("Flag5", 10L)).toDF("col1", "col2")
AggregatedSourceData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag4| 32|
|Flag5| 10|
+-----+----+
Now I need to apply filter on the aggregated dataset above to get the rows whose col2 values meeting any of the (8,4,2) col2 bits are on as per the original source raw data values
expected output
+-----+----+
|Col1 |Col2|
+-----+----+
|Flag1|14 |
|Flag2|56 |
|Flag3|2 |
|Flag5|10 |
+-----+----+
I tried something like below and seems to be getting hte expected output but unable to understand how its working. Is this the correct approach?? if so ,how its working ( I am not that knowledgeable in bitwise operations so looking for expert help to understand please)
`
``
var myfilter:Long = 2 | 4| 8
AggregatedSourceData.filter($"col2".bitwiseAND(myfilter) =!= 0).show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag5| 10|
+-----+----+
I think you do not need to use bitWiseAnd to filter, instead, just use contains/in “A set of decimal representation of the bit numbers you want” or == to “a decimal representation of a bit number you want”
Also if you try your existing calculations without Scala or spark, you will see where you understood things wrong, eg use here :
https://www.rapidtables.com/calc/math/binary-calculator.html
you will find you defined you filter “wrong”.
18&18 is 18
18|2 is 18
Your dataset the flag column each row will only be one value, so just filter the flag column , whose values are in the set of numbers you want .
$"flag == 18 or
(18,2,20) contains $"flag for example

Spark select column based on row values

I have a all string spark dataframe and I need to return columns in which all rows meet a certain criteria.
scala> val df = spark.read.format("csv").option("delimiter",",").option("header", "true").option("inferSchema", "true").load("file:///home/animals.csv")
df.show()
+--------+---------+--------+
|Column 1| Column 2|Column 3|
+--------+---------+--------+
|(ani)mal| donkey| wolf|
| mammal|(mam)-mal| animal|
| chi-mps| chimps| goat|
+--------+---------+--------+
Over here the criteria is return columns where all row values have length==6, irrespective of special characters. The response should be below dataframe since all rows in column 1 and column 2 have length==6
+--------+---------+
|Column 1| Column 2|
+--------+---------+
|(ani)mal| donkey|
| mammal|(mam)-mal|
| chi-mps| chimps|
+--------+---------+
You can use regexp_replace to delete the special characters if you know what there are and then get the length, filter to field what you want.
val cols = df.columns
val df2 = cols.foldLeft(df) {
(df, c) => df.withColumn(c + "_len", length(regexp_replace(col(c), "[()-]", "")))
}
df2.show()
+--------+---------+-------+-----------+-----------+-----------+
| Column1| Column2|Column3|Column1_len|Column2_len|Column3_len|
+--------+---------+-------+-----------+-----------+-----------+
|(ani)mal| donkey| wolf| 6| 6| 4|
| mammal|(mam)-mal| animal| 6| 6| 6|
| chi-mps| chimps| goat| 6| 6| 4|
+--------+---------+-------+-----------+-----------+-----------+

Remove rows from Spark DataFrame that ONLY satisfy two conditions

I am using Scala and Spark. I want to filter out certain rows from a DataFrame that do NOT satisfy ALL the conditions that I am specifying, while keeping other rows that might only one of the conditions be satisfied.
For example: let's say I have this DataFrame
+-------+----+
|country|date|
+-------+----+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
and I want to filter out country A and dates 1 and 2, so that the expected output should be:
+-------+----+
|country|date|
+-------+----+
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
As you can see, I am still keeping country B with dates 1 and 2.
I tried to use filter in the following way
df.filter("country != 'A' and date not in (1,2)")
But the output filters out all dates 1, and 2, which is not what I want.
Thanks.
Your current condition is
df.filter("country != 'A' and date not in (1,2)")
which can be translated as "accept any country other than A, then accept any date except 1 or 2". Your conditions are applied independently
What you want is:
df.filter("not (country = 'A' and date in (1,2))")
i.e. "Find the rows with country A and date of 1 or 2, and reject them"
or equivalently:
df.filter("country != 'A' or date not in (1,2)")
i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"
See De Morgan's laws:
not(A or B) = not A and not B
not (A and B) = not A or not B

How to combine where and groupBy in Spark's DataFrame?

How can I use aggregate functions in a where clause in Apache Spark 1.6?
Consider the following DataFrame
+---+------+
| id|letter|
+---+------+
| 1| a|
| 2| b|
| 3| b|
+---+------+
How can I select all rows where letter occurs more than once, i.e. the expected output would be
+---+------+
| id|letter|
+---+------+
| 2| b|
| 3| b|
+---+------+
This does obviously not work:
df.where(
df.groupBy($"letter").count()>1
)
My example its about count, but I'd like to be able to use other aggregate functions (the results thereof) as well.
EDIT:
Just for counting,I just came up with the following solution:
df.groupBy($"letter").agg(
collect_list($"id").as("ids")
)
.where(size($"ids") > 1)
.withColumn("id", explode($"ids"))
.drop($"ids")
You can use left semi join:
df.join(
broadcast(df.groupBy($"letter").count.where($"count" > 1)),
Seq("letter"),
"leftsemi"
)
or window functions:
import org.apache.spark.sql.expressions.Window
df
.withColumn("count", count($"*").over(Window.partitionBy("letter")))
.where($"count" > 1)
In Spark 2.0 or later you can Bloom filter but it is not available in 1.x