I need to replace only the null values in selected columns of a DataFrame. I know about the df.na.fill option. How can I apply it to selected columns only, or is there a better option than df.na.fill?
According to the Spark documentation, fill is well suited to your need. You can do something like:
df.na.fill(0, Seq("colA", "colB"))
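If the selected columns need different replacement values, the Map overload of na.fill may be a better fit. The following is a minimal sketch, assuming a local SparkSession and hypothetical columns colA, colB and colC; none of these names come from the question.

```scala
import org.apache.spark.sql.SparkSession

object FillSelectedColumns {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its existing session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-selected-columns")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: every column contains a null somewhere.
    val df = Seq[(Option[Int], Option[String], Option[Int])](
      (Some(1), None, None),
      (None, Some("x"), Some(3))
    ).toDF("colA", "colB", "colC")

    // Same replacement value, applied only to the listed columns.
    val filledSame = df.na.fill(0, Seq("colA"))

    // One replacement value per column; colC is not listed, so its nulls stay.
    val filledPerColumn = df.na.fill(Map("colA" -> 0, "colB" -> "unknown"))

    filledSame.show()
    filledPerColumn.show()
    spark.stop()
  }
}
```

In both calls, columns that are not named keep their nulls, and a replacement value is only applied to columns of a compatible type.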
I am creating a process in Spark Scala, within an ETL, that checks for events that occurred during the ETL process. I start with an empty DataFrame, and if events occur, this DataFrame is filled with information (a DataFrame can't really be filled; it can only be combined with other DataFrames that have the same structure). At the end of the process, the resulting DataFrame is loaded into a table, but it can end up empty because no event has occurred, and I don't want to load an empty DataFrame because that makes no sense. So I'm wondering if there is an elegant way to load the DataFrame into the table only if it is not empty, without using an if condition. Thanks!!
I recommend creating the DataFrame anyway. If you don't create a table with the same schema, even an empty one, your operations/transformations on the DataFrame could fail because they could refer to columns that are not present.
To handle this, you should always create a DataFrame with the same schema, which means the same column names and data types, regardless of whether the data exists or not. You might want to populate it with data later.
If you still want to do it your way, here are a few options for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0 because it has to count every row, so its cost grows linearly with the size of the data, and you would still have to do a check like df != null first.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, a check has to happen somewhere, so you can't really get rid of the if condition. The requirement itself implies one: "if you want to avoid creating an empty dataframe".
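Since a check is needed somewhere, one option is to hide it in a small helper. This is only a sketch: the helper name loadIfNotEmpty, the tableName parameter, and the choice of saveAsTable with append mode are assumptions, not something stated in the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object ConditionalLoad {
  // Writes df to the table only when it has at least one row.
  // Returns true if a write was performed, false if df was empty.
  def loadIfNotEmpty(df: DataFrame, tableName: String): Boolean = {
    // head(1) fetches at most one row, so no full count is needed.
    // On Spark 2.4.0+ this could also be written as !df.isEmpty.
    val hasRows = !df.head(1).isEmpty
    if (hasRows) {
      // Assumption: appending to an existing table is the desired behaviour.
      df.write.mode(SaveMode.Append).saveAsTable(tableName)
    }
    hasRows
  }
}
```

A call such as ConditionalLoad.loadIfNotEmpty(eventsDf, "etl_events") keeps the calling code free of an explicit if; both names in that call are hypothetical. Note that the emptiness check runs its own Spark job before the write, so caching an expensive DataFrame beforehand may be worth considering.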
I have a Spark DataFrame.
Here it is
I would like to fetch the values of a column one by one and assign each of them to a variable. How can this be done in PySpark? Sorry, I am new to Spark as well as to Stack Overflow. Please forgive the lack of clarity in the question.
col1=df.select(df.column_of_df).collect()
list1=[str(i[0]) for i in col1]
#after this we can iterate through list (list1 in this case)
I don't understand exactly what you are asking, but if you want to store the values in a variable outside of the DataFrames that Spark offers, the best option is to select the column you want and store it as a pandas Series (provided there are not too many values, because the data must fit in the driver's memory).
from pyspark.sql import functions as F
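# toPandas() returns a pandas DataFrame; indexing by column name yields the Series
var = df.select(F.col('column_you_want')).toPandas()['column_you_want']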
Then you can iterate on it like a normal pandas series.
I have many lines of JSON strings spread over many files. Their schemas are very similar, but a few differ in some cases.
I made a DataFrame from them and want to see only the rows that have a specific column, like this:
DF.filter("myColumn" is present).show
How can I do this?
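You can use isNotNull in filter(). When the files are read into one DataFrame, a record that lacks the column gets null in it, so a null check selects the rows where it is present:
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark
df.filter($"myColumn".isNotNull).show()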
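The null check only works if the column exists in the DataFrame's schema; if none of the files contains it, the filter fails at analysis time. A defensive variant might look like the sketch below, where the helper name rowsWithColumn is hypothetical and top-level column names are assumed.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ColumnPresence {
  // Returns the rows in which the column `name` is present (non-null).
  // If the column is missing from the schema entirely, returns an empty
  // DataFrame with the original schema instead of failing.
  def rowsWithColumn(df: DataFrame, name: String): DataFrame =
    if (df.columns.contains(name)) df.filter(col(name).isNotNull)
    else df.limit(0)
}
```

ColumnPresence.rowsWithColumn(df, "myColumn").show() would then behave like the filter on isNotNull whenever the column is part of the schema.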
Is anyone using the DataFrame package (https://github.com/rothnic/DataFrame)?
I use it because it also works with older MATLAB versions. However, I have a very basic question:
How do I change a value in the DataFrame?
With MATLAB's table function, it is straightforward. For example, t{1,{'prior'}} = {'a = 2000'}, where t is a table and I assign a cell to it. I cannot figure out how to do the same in the DataFrame package.
The DataFrame author no longer seems to maintain it(?). I hope someone can give more examples of the methods in DataFrame.
Thanks!
I have a groupBy for a DataFrame which is based on 3 columns. I am doing something like this:
myDf.groupBy($"col1", $"col2", $"col3")
However, I am not sure how this works.
Does it ignore case? For each column, I need "FOO" and "foo" to be considered the same, and likewise "" and null.
If this is not how it is supposed to work, how can I add that behaviour? From the API docs I can see something about apply on a column, but I could not find any example.
Any ideas?
You can use functions inside your groupBy statement. In this case, it sounds like you will want to convert the strings to lower case when you group them. Check out the lower function:
https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.functions$
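As a rough sketch of that idea, the grouping keys can be normalized before grouping. The helper name normalize and the use of coalesce to fold null into "" are assumptions added here to cover the ""/null part of the question; the columns are assumed to be strings.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, col, lit, lower}

object NormalizedGroupBy {
  // Lower-cases a string column and replaces null with "", so that
  // "FOO" and "foo" share a key, and so do "" and null.
  def normalize(name: String): Column =
    coalesce(lower(col(name)), lit("")).as(name)

  // Counts rows per normalized (col1, col2, col3) key.
  def countByNormalizedKeys(myDf: DataFrame): DataFrame =
    myDf
      .groupBy(normalize("col1"), normalize("col2"), normalize("col3"))
      .count()
}
```

The count() here is only a placeholder aggregation; any agg(...) call can follow the same groupBy.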