How to get the numeric value of missing values in a PySpark column? - pyspark

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but the results are displayed in a table format instead of giving me the actual numeric total of null values.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended, as it doesn't drop any columns (as expected):
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one I am currently trying, but it takes ages to execute:
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks

If you need the number instead of showing it in table format, you need to use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contains all the information from the table.
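If you want just the plain integer, index into that Row. Below is a minimal sketch (assuming your DataFrame is named data; the isnan check from the question is left out because isnan is only defined for float/double columns, so add it back per column where it applies) that counts the nulls for every column in a single aggregation pass and drops the columns that are entirely null:
from pyspark.sql.functions import col, count, when

# One aggregation job instead of one per column: count the nulls in every
# column at once, aliasing each count with the column's own name.
null_counts = data.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in data.columns]
).collect()[0]  # collect() returns a list of Rows; [0] is the single Row

total_rows = data.count()

# null_counts[c] is now a plain Python int, so a normal comparison works.
all_null_cols = [c for c in data.columns if null_counts[c] == total_rows]
data = data.drop(*all_null_cols)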

Related

Pyspark - filter to select max value

I have a date column with column type "string". It has multiple dates and several rows of data for each date.
I'd like to filter the data to select the most recent (max) date; however, when I run it, the code runs but ends up producing an empty table.
Currently I am typing in my desired date manually because I am under the impression that no form of max function will work since the column is of string type.
This is the code I am using:
extract = raw_table.drop_duplicates() \
    .filter(raw.as_of_date == '2022-11-25')
and what I would like to do is make this automated, something along the lines of:
.filter(raw.as_of_date == max(as_of_date))
Please advise on how to convert the column type from string to date, how to select the max date, and why my hardcoded filter results in an empty table.
You'll need to calculate the max of the date in a column and then use that column for filtering, as Spark does not allow aggregations in filters.
You might need something like the following:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# A window over the whole DataFrame gives every row the global max date,
# which can then be compared against in the filter.
raw_table.drop_duplicates(). \
    withColumn('max_date', func.max('as_of_date').over(wd.partitionBy())). \
    filter(func.col('as_of_date') == func.col('max_date')). \
    drop('max_date')
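Since the question also asks about converting the string column to a proper date, here is a hedged sketch of an alternative (assuming the column is formatted as yyyy-MM-dd; adjust the format string to match your data) that casts the column with to_date, collects the max as a scalar, and then filters on it:
import pyspark.sql.functions as func

# Convert the string column to DateType so max() compares dates, not strings.
# The 'yyyy-MM-dd' pattern is an assumption; change it to match your data.
typed = raw_table.withColumn('as_of_date', func.to_date('as_of_date', 'yyyy-MM-dd'))

# Pull the max date back to the driver as a plain Python value...
max_date = typed.agg(func.max('as_of_date')).collect()[0][0]

# ...and use it in the filter, since aggregates cannot appear there directly.
extract = typed.drop_duplicates().filter(func.col('as_of_date') == max_date)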

how to multiply variable to each element of a column in database

I am trying to add a column to a collection by multiplying 0.9 with the existing database column recycling, but I get a runtime error.
I tried to multiply by 0.9 directly in the function but it showed an error, so I created a class and multiplied it there, yet to no avail. What could be the problem?
Your error message is telling you what the problem is: your database query is using GROUP BY in an invalid way.
It doesn't make sense to group by one column and then select other columns (you've selected all columns in your case): what values would they contain, given that you haven't grouped by them as well and only get one row back per group? You either have to group by all the columns you're selecting, and/or use aggregates such as SUM for the non-grouped columns.
Perhaps you meant to ORDER BY that column (orderBy(dt.recycling.asc()) if ascending order in QueryDSL format), or to select all rows with a particular value of that column (where(dt.recycling.eq(55)) for example)?

datenum and matrix column string conversion

I want to convert the second column of table T using datenum.
The elements of this column are '09:30:31.848', '15:35:31.325', etc. When I use datenum('09:30:31.848','HH:MM:SS.FFF') everything works, but when I want to apply datenum to the whole column it doesn't work. I tried this command datenum(T(:,2),'HH:MM:SS.FFF') and I receive this error message:
"The input to DATENUM was not an array of character vectors"
Here is a snapshot of T.
Thank you
You are not calling the data from the table, but rather a slice of the table (so it stays a table). Refer to the data in the table using T.colName:
times_string = ['09:30:31.848'; '15:35:31.325'];
T = table(times_string)
times_num = datenum(T.times_string, 'HH:MM:SS.FFF')
Alternatively, you can slice the table using curly braces to extract the data (if you want to use the column number instead of name):
times_num = datenum(T{:,2}, 'HH:MM:SS.FFF')

Transform CSV Column Values into Single Row

My data in CSV looks like this (image: Actual Data), and I want to convert it into this (image: Expected Data):
(hivetablename.hivecolumnname = dbtablename.dbtablecolumn)
by joining the multiple row values into a single row value like the above.
Please note that 'AND' is a literal between the conditions being built, and it should appear up to the second-to-last record.
Once the last record is reached, only the condition itself should appear (xx=yy).
I would like the result to be in Scala Spark.
Many thanks in advance!

Find all non-integers in column

I have some corrupted rows in my large CSV file where some data values get shifted due to missing line breaks. This results in values appearing under the wrong column headers. For example, if three columns exist in my table, after corruption I start to see values shifted into the wrong columns.
Is there a way for me to drop all rows where, for example, I see a non-integer in a column that I know should, in fact, be an Int?
What you can do is loop through the lines and filter out every line whose split(",") does not produce the number of fields you expect. Something like this:
import scala.io.Source

val n = 5 // or however many columns you require
// Keep only the lines that split into exactly n fields.
Source.fromFile(input_file).getLines().toSeq.map(_.split(",")).filter(_.length == n)
This should do what you want :)