Remove rows with a certain value in any column in pyspark

I am working in pyspark to clean a data set. The data set has "?" in various rows across various columns, and I want to remove any row that has that value anywhere in it. I tried the following:
df = df.replace("?", "np.Nan")
df=df.dropna()
However, it did not work to remove those values.
I keep looking online but can't find any understandable answers (I am a newbie).
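For what it's worth, replace("?", "np.Nan") swaps "?" for the literal string "np.Nan", not a real null, so dropna() has nothing to drop. A minimal sketch of one approach that should work, assuming the affected columns are string-typed:

```python
from pyspark.sql import functions as F

# Turn the literal "?" into a real null in every column, then drop any
# row that contains a null anywhere.
df_clean = df.select(
    [F.when(F.col(c) == "?", None).otherwise(F.col(c)).alias(c) for c in df.columns]
).dropna()
```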

Related

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course set pl.col("id").set_sorted(), but I want it to be aware that it's actually sorted by both the id and date columns. In pandas I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted, but is there a way to do something similar with polars?
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance, if I do:
df = pl.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
Then I get True twice, even though I never told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df = df.sort(['a', 'b']) and check the sortedness of a and b again, you'll see that it knows they're sorted, as sketched below.
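A minimal sketch of that last point, reusing the made-up data from above:

```python
import polars as pl

df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
print(df.get_column('a').is_sorted())  # False
print(df.get_column('b').is_sorted())  # False

# Sorting on both columns lets polars track the sortedness itself.
df = df.sort(['a', 'b'])
print(df.get_column('a').is_sorted())  # True
print(df.get_column('b').is_sorted())  # True
```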

How to extract specific value in a dictionary column with multiple lists

I'm trying to extract a specific value inside a column in a dataframe, as you can see in the image below, without any success; referring back to a similar question still didn't work for my code.
Is there any way to extract the values as [Culture, Climate change, technology, ...]?
Data
First Try
I have tried the split() function, but I reached a dead end since I still need the exact value after the word "name", and this new dataframe contains 75 columns. If I could only get a for loop to extract the value after the word "name", that's my latest idea for solving my problem; a sketch along those lines is below.
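Since the actual data only appears in the screenshots, here is a hedged pandas sketch: it assumes each cell holds a list of dicts (possibly stored as a string) with a "name" key, which is what the desired output suggests. The topics column and the sample row are made up for illustration:

```python
import ast
import pandas as pd

# Made-up sample mimicking the screenshot: a stringified list of dicts,
# each carrying a "name" key.
df = pd.DataFrame({
    "topics": ['[{"id": 1, "name": "Culture"}, {"id": 2, "name": "Climate change"}]'],
})

def extract_names(cell):
    # Parse the string into Python objects if needed, then collect every "name".
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return [d["name"] for d in items]

df["topic_names"] = df["topics"].apply(extract_names)
print(df["topic_names"].iloc[0])  # ['Culture', 'Climate change']
```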

Removing duplicated columns in a dataframe

I know this seems like a really simple question, and I have scoured Google and Stack Overflow for it, but I could not find exactly what I need.
I have aggregated some data from one dataframe, config, into another, config1, with the following code. The basis of the code was provided by another Stack Overflow member. Thank you, @Sunny Shukla.
exprs = [f.max(c).alias(c) for c in config.columns]  # f.max, not Python's builtin max
config1 = (config.groupBy(["seq_id", "tool_id"])
                 .agg(f.count(f.lit(1)).alias('count'), *exprs)
                 .where('count = 1')
                 .drop('count'))
The config dataframe has 20 columns and the config1 df has 22 columns, because I grouped by 2 columns, seq_id and tool_id, but mapped all of the original columns to retain the original column names (I'm sure there is a more elegant way to do this).
The resulting dataframe config1 therefore has duplicated seq_id and tool_id columns. If I do
config1.drop('seq_id', 'tool_id'), it drops all 4 of those columns and I end up with 18 columns instead of 20.
Is there a more elegant way to do this without writing UDFs?
Thank You
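One way to avoid the duplicates altogether, rather than dropping them afterwards, is to build the aggregation expressions only over the non-grouping columns. A sketch reusing the names from the question:

```python
from pyspark.sql import functions as f

group_cols = ["seq_id", "tool_id"]

# Aggregate every column except the grouping keys, so each key appears only once.
exprs = [f.max(c).alias(c) for c in config.columns if c not in group_cols]

config1 = (config.groupBy(group_cols)
                 .agg(f.count(f.lit(1)).alias('count'), *exprs)
                 .where('count = 1')
                 .drop('count'))
```

config1 should then keep the original 20 columns: the 2 grouping keys plus the 18 aggregated columns.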

How to properly remove NaN values from table

After reading an Excel spreadsheet in MATLAB, I unfortunately have NaNs included in my resulting table: an additional column of NaNs appears that is not present in the original Excel sheet. I tried to remove the NaNs with the following code snippet:
measurementCells = readtable('MWE.xlsx','ReadVariableNames',false,'ReadRowNames',true);
measurementCells = measurementCells(any(isstruct(measurementCells('TIME',1)),1),:);
However, this results in a 0x6 table, without any values present anymore. How can I properly remove the NaNs without removing any data from the table?
Either this:
tab = tab(~any(ismissing(tab),2),:);
or:
tab = rmmissing(tab);
if you want to remove rows that contain one or more missing values.
If you want instead to replace missing values with other values, read about how the fillmissing (https://mathworks.com/help/matlab/ref/fillmissing.html) and standardizeMissing (https://mathworks.com/help/matlab/ref/standardizemissing.html) functions work. The examples are exhaustive and should help you find the solution that best fits your needs.
One last option is to spot (and handle however you prefer) NaN values within the call to the readtable function itself, using the EmptyValue parameter. But this works only for numeric data.

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will provide an output of 10 elements. But the output doesn't look good, so the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all of these cases you won't get a fair sample of the data, as the first 10 rows will be picked. To truly pick randomly from the dataframe, you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on the DataFrame.
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, do name.show() to show the contents of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.