I am not able to remove the first row.
In addition to your solution with the where clause, you could use either of these. Yours only filters out VendorID values containing \r; these filter out all rows that have nulls in every other column, irrespective of what is in VendorID.
Filter:
df.filter(' or '.join([x + ' is not null' for x in df.columns if x != 'VendorID']))
Dropna:
df.dropna(how='all', subset=[x for x in df.columns if x!='VendorID'])
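For instance, here is a minimal, self-contained sketch of both variants on some made-up data (the pickup/amount columns and their values are just an illustration, not the asker's actual schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data: only the first row is all-null outside VendorID
df = spark.createDataFrame(
    [("V1\r", None, None), ("V2", "2019-01-01", 3.5)],
    ["VendorID", "pickup", "amount"],
)

other_cols = [c for c in df.columns if c != "VendorID"]

# Variant 1: SQL-style filter keeping rows where at least one other column is not null
df.filter(' or '.join([c + ' is not null' for c in other_cols])).show()

# Variant 2: dropna removes rows where *all* of the listed columns are null
df.dropna(how='all', subset=other_cols).show()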
I am having some trouble removing rows with more than na_threshold nulls in my dataframe:
na_threshold=2
df3=df3.dropna(thresh=len(df3.columns) - na_threshold)
When I run
df_null = df3.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df3.columns)))
df_null comes back as a dataframe with only 1 row, and that row has a null in just one column.
I have tried increasing the value of na_threshold but it hasn't made a difference.
I have realised that the dropna function does work.
What happened is that the file was initially read in with Pandas, and I had placed the dropna call before converting other null-like values to None; Pandas uses nan, NaT and sometimes "-":
from pyspark.sql import functions as F

for column in columns:
    # Convert Pandas-style null placeholders to real nulls first
    df3 = df3.withColumn(column, F.when(F.col(column) == 'nan', None).otherwise(F.col(column)))
    df3 = df3.withColumn(column, F.when(F.col(column) == 'NaT', None).otherwise(F.col(column)))
    df3 = df3.withColumn(column, F.when(F.col(column) == '-', None).otherwise(F.col(column)))

na_threshold = 2
df3 = df3.dropna(thresh=len(df3.columns) - na_threshold)
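For what it's worth, thresh in dropna is the minimum number of non-null values a row needs in order to be kept, so thresh=len(df3.columns) - na_threshold drops exactly the rows with more than na_threshold nulls. A small sketch with made-up data (the column names here are hypothetical) showing the effect:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row "a" has 0 nulls, row "b" has 3 nulls, row "c" has exactly 2 nulls
df3 = spark.createDataFrame(
    [("a", 1, 2.0, "x"), ("b", None, None, None), ("c", 3, None, None)],
    ["id", "c1", "c2", "c3"],
)

na_threshold = 2
# Keep rows with at least len(columns) - na_threshold non-null values,
# i.e. drop rows with more than na_threshold nulls -- only row "b" is dropped
df3.dropna(thresh=len(df3.columns) - na_threshold).show()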
My dataframe has fields (X, Y, Z), but a few rows of the dataframe do not have the field 'Z'. I presently have a check like this:
if 'f' not in df.columns:
    df = df.withColumn('f', <>)
But what about cases when certain rows have the field and other rows do not? How do I check for that?
Note:
"Z" can either be "null" or have some_value. And there can be cases when there is no Z field at all.
I have two dataframes,
df1
id slt sln elt eln start end
df2
id evt slt sln speed detector
Hashmap
Map(351608084643945 -> List(1544497916,1544497916), 351608084643944 -> List(1544498103,1544498093))
I want to compare the values in the list, and if the two values in the list match, then I want the full row from dataframe df1 for that id;
else, the full row from df2 for that id.
Both the dataframes and the map have distinct and unique ids.
If I understand correctly, you want to traverse your hash map and, for each entry, check whether the value (which is a list) has all of its elements equal. If the list's elements are all the same you want the data from df1, otherwise from df2, for that key. If that is what you want, then below is the code for it.
hashMap.foreach { case (key, valueElements) =>
  // If every element in the list is the same, take the row from df1, otherwise from df2
  val matchingRows =
    if (valueElements.forall(_ == valueElements.head))
      df1.filter($"id" === key.toString)
    else
      df2.filter($"id" === key.toString)
  matchingRows.show() // or collect/store the rows here instead of just printing them
}
Two steps:
Step One: Split the hashmap into two hashmaps: one with the matched entries, the other with the unmatched entries.
Step Two: Use the matched hashmap to join with df1 on id to get the matched rows from df1, and use the unmatched hashmap to join with df2 on id to get the remaining rows from df2.
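This is untested against the original data, but a rough sketch of that two-step idea (shown here in PySpark rather than Scala) could look like the following, assuming the map is available as a plain Python dict and that df1 and df2 are the dataframes from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Map from the question, expressed as a Python dict (id -> list of timestamps)
hash_map = {
    351608084643945: [1544497916, 1544497916],
    351608084643944: [1544498103, 1544498093],
}

# Step one: split the ids into "matched" (all values in the list equal) and "unmatched"
matched_ids = [(k,) for k, v in hash_map.items() if len(set(v)) == 1]
unmatched_ids = [(k,) for k, v in hash_map.items() if len(set(v)) > 1]

matched_keys = spark.createDataFrame(matched_ids, ["id"])
unmatched_keys = spark.createDataFrame(unmatched_ids, ["id"])

# Step two: join each key set with the corresponding dataframe to pull the full rows
rows_from_df1 = df1.join(matched_keys, on="id", how="inner")
rows_from_df2 = df2.join(unmatched_keys, on="id", how="inner")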
I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by the columns id and number, and then add a new column with the sum of value per id and number. I want to keep column x without doing anything to it.
df= df.select("id","number","value","x")
.groupBy( 'id', 'number').withColumn("sum_of_value",df.value.sum())
In the end I want a dataframe with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df has 3 columns initially.
df1 = df.groupBy("id","number").count()
Now df1 will contain three columns: id, number and count.
Now you can join df1 and df based on columns "id" and "number" and select whatever columns you would like to select.
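For example, a rough sketch of that join (using sum("value") in place of count so the new column matches the sum_of_value asked for, and assuming df is the 4-column dataframe from the question):

from pyspark.sql import functions as F

# Aggregate value per (id, number); this plays the role of df1 above
df1 = df.groupBy("id", "number").agg(F.sum("value").alias("sum_of_value"))

# Join the aggregate back onto the original rows so value and x are kept
result = (
    df.join(df1, on=["id", "number"], how="inner")
      .select("id", "number", "value", "x", "sum_of_value")
)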
Hope it helps.
Regards,
Neeraj
I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can do like this to select specific columns:
df.select("colA", "colB", "colE")
but how do I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?
The process can be broken down into the following steps:
First grab the column names with df.columns,
then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings.
But that select takes select(String, String*); luckily there is also a select(Column*) overload, so convert the Strings into Columns with .map(df(_)),
and finally turn the Array of Columns into varargs with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
This filter could be made more complex (just as in Pandas). It is, however, a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
In PySpark, use colRegex to select the columns starting with colF.
With the sample:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
Apply:
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()
The result is:
colA, colB, colF-0, colF-1, colF-2
I wrote a function that does that. Read the comments to see how it works.
/**
 * Given a sequence of prefixes, select the matching columns from a [[DataFrame]]
 * @param columnPrefixes Sequence of prefixes
 * @param dF             Incoming [[DataFrame]]
 * @return [[DataFrame]] with only the prefixed columns selected
 */
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
    // Does a given column name start with any of the provided prefixes?
    def colNameStartsWith(colName: String): Boolean =
        columnPrefixes.exists(prefix => colName.startsWith(prefix))
    // Filter the columns list against the given prefixes
    val columns = dF.columns.filter(colNameStartsWith)
    // Select the filtered columns
    dF.select(columns.head, columns.tail: _*)
}