Randomly replace RDD values with null in Scala Spark - scala

I have a CSV file containing almost 15000 records. Each line contains 3 fields separated by a tab (\t). I want to randomly replace values in the second column with null. For example, I might keep 8000 values as they are and replace 7000 with null.
Any help with Scala (Spark), please?
This is what it looks like:

Read the data as a DataFrame.
Generate a new column, say rnd, holding a random number between 0 and 1.
Set col2 = col2 when rnd < 0.5 (to make roughly 50% of the values null), else null.
import org.apache.spark.sql.functions.{lit, rand, when}
import spark.implicits._
spark.read.option("header", "true").option("sep", "\t").csv(<your_path>)
  .withColumn("rnd", rand())
  .withColumn("col2", when($"rnd" < 0.5, $"col2").otherwise(lit(null).cast(<col2_datatype_here>)))
  .drop("rnd") // remove the helper column once col2 has been rewritten
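Note that the rand() < 0.5 test nulls roughly half of the rows, not an exact number. If exactly 7000 of the 15000 values must become null, as in the question, one option is to choose the row positions by sampling without replacement. The following is a plain-Python sketch of that idea and needs no Spark session; null_exactly and its parameters are hypothetical names, and on a real DataFrame the same effect would need a row index or a ranking over a random column.

```python
import random

def null_exactly(rows, col_index, n_null, seed=42):
    """Return a copy of rows with exactly n_null values in col_index set to None."""
    rng = random.Random(seed)
    # Sample row positions without replacement so the count is exact.
    to_null = set(rng.sample(range(len(rows)), n_null))
    return [
        tuple(None if (i in to_null and j == col_index) else v
              for j, v in enumerate(row))
        for i, row in enumerate(rows)
    ]

# Made-up data with the same shape as in the question: 15000 rows, 3 columns.
rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(15000)]
result = null_exactly(rows, col_index=1, n_null=7000)
null_count = sum(1 for r in result if r[1] is None)
```

With the threshold approach the number of nulls varies from run to run; with sampling it is fixed by n_null.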

@amelie, notice the $ in front of "rnd" in the when condition in my answer.
You need a column comparison, not a value comparison.
PS: I can't post a comment yet as I am new to Stack Overflow, hence a separate answer.

Related

How to debug Spark dropDuplicates and join function calls?

I have a table with duplicated rows. I am trying to remove the duplicates and keep the row with the latest my_date (if several
rows share the same my_date, it does not matter which one is kept).
val dataFrame = readCsv()
  .dropDuplicates("my_id", "my_date")
  .withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
  .groupBy(dataFrame("my_id").alias("g_my_id"))
  .agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
  .drop("g_my_id", "g_my_date_int")
But when I count the distinct my_id values after this code, I get about 3000 fewer than in the source table. What could be the reason?
How can I debug this situation?
After dropping duplicates, run an except between the original DataFrame and the resulting one; this should give some insight into which rows are additionally being dropped. Most probably some of those columns contain null or empty values that are being treated as duplicates.
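One way ids can vanish in code like this: in Spark SQL an equality comparison involving null is never true, so a row whose my_date_int is null cannot match its own group maximum in the join. The plain-Python sketch below models that behaviour with None standing in for null. The sample rows and the helper sql_eq are made up for illustration, and this is only one possible cause, not a diagnosis of the asker's data.

```python
def sql_eq(a, b):
    """SQL-style equality: a comparison involving null (None) is never true."""
    return a is not None and b is not None and a == b

rows = [
    {"my_id": 1, "my_date_int": 10},
    {"my_id": 1, "my_date_int": 20},
    {"my_id": 2, "my_date_int": None},  # the only row for id 2 has a null date
]

# Model of groupBy(my_id).agg(max(my_date_int)):
# max ignores nulls, and a group with only nulls yields null.
latest = {}
for r in rows:
    key, d = r["my_id"], r["my_date_int"]
    latest.setdefault(key, None)
    if d is not None and (latest[key] is None or d > latest[key]):
        latest[key] = d

# Model of the inner join on both columns with SQL equality semantics.
joined = [r for r in rows
          for k, d in latest.items()
          if sql_eq(r["my_id"], k) and sql_eq(r["my_date_int"], d)]

# Ids present in the input but missing from the join output.
lost_ids = {r["my_id"] for r in rows} - {r["my_id"] for r in joined}
```

If this turns out to be the cause, filtering for null my_id or my_date_int values in the source is a quick check; Spark also offers a null-safe equality operator (<=> in Scala) for cases where null should match null.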

Pyspark remove columns with 10 null values

I am new to PySpark.
I have read a Parquet file. I only want to keep columns that have at least 10 values.
I have used describe to get the count of non-null records for each column.
How do I now extract the names of the columns that have fewer than 10 values and then drop those columns before writing to a new file?
df = spark.read.parquet(file)
col_count = df.describe().filter("summary = 'count'")
You can convert it into a dictionary and then filter the keys (column names) by their values (count < 10; the count is a StringType(), so it must be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
Use @pault's approach if the information cannot be retrieved directly from df.describe(), df.summary(), etc.
Use drop() rather than select(), since describe()/summary() only include numeric and string columns; selecting columns from a list derived from df.describe() would lose columns of TimestampType(), ArrayType(), etc.
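To see why the int() conversion matters, here is a small worked example on a made-up dictionary of the kind asDict() would return (the column names and counts are hypothetical). Comparing the raw strings is lexicographic and silently gives the wrong set of columns.

```python
# Hypothetical counts as strings, as describe() reports them.
col_count_dict = {"name": "1500", "age": "9", "city": "10", "notes": "0"}

# Lexicographic string comparison: "9" < "10" is False because "9" > "1".
wrong = [k for k, v in col_count_dict.items() if v < "10"]

# Converting to int first compares the actual counts.
bad_cols = [k for k, v in col_count_dict.items() if int(v) < 10]
```

Here the string comparison misses the age column, while the integer comparison flags both age and notes.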

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark DataFrame that may contain the same data or have missing values.
The code that I wrote is very slow and does not run in a distributed way.
Does anyone know how to retain a single set of unique values from duplicate rows in a PySpark DataFrame, in a way that runs distributed and with a fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow and it's not possible to use it on a Spark cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname']))
                              .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating DF using the previous RDD
    # This DF stores value of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
            if len(col_value) >= 1:
                # non null or empty value of a column store here
                # This value is a no duplicate distinct value
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress'])
                                                        , col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills the empty cells of duplicate rows in a PySpark DataFrame with the unique value found in the matching rows. Filling them with the mode of the values would also be acceptable.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNotNull().limit(1).take(1) to improve the speed of the code and extract values of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                & (sf.col("stdaddress") == row_value['stdaddress'])
                                                , col_value)
                                .otherwise(df_dedup[column])))
"""

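The loop above handles one duplicate group at a time from the driver, which is why it does not scale. The underlying operation is a grouped one: for each (stdname, stdaddress) key, keep the first non-empty value of every column. Below is a minimal plain-Python sketch of that logic on made-up rows; collapse_duplicates is a hypothetical helper, not part of the original code. On a real DataFrame this would correspond roughly to a groupBy on the two key columns with first(column, ignorenulls=True) for each remaining column, after converting empty strings to null.

```python
def collapse_duplicates(rows, key_cols):
    """Merge rows sharing the same key, keeping the first non-empty value per column."""
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # The first row seen for a key becomes the base record.
        target = merged.setdefault(key, dict(row))
        for col, value in row.items():
            # Fill a cell only if it is still empty and this row has a value.
            if target[col] in (None, "") and value not in (None, ""):
                target[col] = value
    return list(merged.values())

rows = [
    {"stdname": "acme", "stdaddress": "1 main st", "phone": "", "city": "Oslo"},
    {"stdname": "acme", "stdaddress": "1 main st", "phone": "555", "city": ""},
    {"stdname": "beta", "stdaddress": "2 side rd", "phone": None, "city": "Rome"},
]
deduped = collapse_duplicates(rows, ["stdname", "stdaddress"])
```

Expressed as a single aggregation, the work is done once per group inside the engine rather than once per duplicate row on the driver.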
pyspark group by sum

I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by the columns id and number, and then add a new column with the sum of value per id and number. I want to keep column x without doing anything to it.
df= df.select("id","number","value","x")
    .groupBy( 'id', 'number').withColumn("sum_of_value",df.value.sum())
At the end I want a DataFrame with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df has 3 columns initially.
df1 = df.groupBy("id","number").count()
Now df1 will contain 3 columns: id, number and count. For the sum asked about in the question, use sum("value") in place of count().
Now you can join df1 and df on the columns "id" and "number" and select whichever columns you would like.
Hope it helps.
Regards,
Neeraj
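The aggregate-then-join idea can be sketched in plain Python as follows: compute one sum per (id, number) key, then attach it to every original row so that value and x are kept. The sample rows are made up. In PySpark the same shape comes from joining the aggregate back, as described above, or from a window sum partitioned by id and number.

```python
from collections import defaultdict

rows = [
    {"id": 1, "number": "a", "value": 10, "x": "p"},
    {"id": 1, "number": "a", "value": 5, "x": "q"},
    {"id": 2, "number": "b", "value": 7, "x": "r"},
]

# Step 1: aggregate, one sum per (id, number) group.
sums = defaultdict(int)
for r in rows:
    sums[(r["id"], r["number"])] += r["value"]

# Step 2: join back, so each original row keeps value and x and gains the group sum.
result = [{**r, "sum_of_value": sums[(r["id"], r["number"])]} for r in rows]
```

The output has the same number of rows as the input and five columns, which is the layout the question asks for.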

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
    var=cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approx_count_distinct (named approxCountDistinct in older releases) or countDistinct:
from pyspark.sql.functions import approx_count_distinct, countDistinct
counts = df.agg(approx_count_distinct("col1"), approx_count_distinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * unpacks the generator so that each expression is passed as a separate argument, which makes the return value 1 row × N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approx_count_distinct
distvals = df.agg(*(approx_count_distinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency (default = 1%)])
This returns a DataFrame with the frequent values of each column, but if you want a DataFrame with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from: check number of unique values in each column of a matrix in spark
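The reason the aggregated versions are faster than the loop in the question is that they visit the data once for all columns instead of once per column. A plain-Python sketch of that single-pass idea, on made-up rows and column names:

```python
rows = [("a", 1, "x"), ("b", 1, "x"), ("a", 2, "x")]
columns = ["c1", "c2", "c3"]

# One set of seen values per column, all filled during a single pass over the rows.
seen = [set() for _ in columns]
for row in rows:
    for values, value in zip(seen, row):
        values.add(value)

distinct_counts = {name: len(values) for name, values in zip(columns, seen)}
```

Exact counting has to remember every distinct value per column, which is what makes it expensive at 560 columns; the approximate variants trade a small error for a much smaller memory footprint.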