Filter columns whose count equals the input file RDD count - Spark / Scala

I'm filtering integer columns from the input parquet file with the logic below, and I've been trying to modify it to add a validation that checks whether any of the input columns has a count equal to the count of the input parquet file RDD. I want to filter out any such column.
Update
The number of columns and their names in the input file are not static; they change every time we receive the file.
The objective is to also filter out any column whose count equals the input file RDD count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of StructFields for integer columns
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select only those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the column names as an array of strings
val m = z.columns
New logic would be something like
val cnt = spark.read.parquet("inputfile").count()
val d = // columns of z whose count is not equal to cnt (pseudocode)
I do not want to pass a column name explicitly to the new condition, since the column whose count equals the input file count will change from file to file (val d = .. above).
How do we write the logic for this?

As I understand your question, you are trying to filter in columns with an integer dataType whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
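Putting it together, a minimal end-to-end sketch (reusing df and the "inputfile" path from the question; note that the distinct counts are computed one column at a time, which can be slow for wide tables):
import org.apache.spark.sql.functions.col
val cnt = spark.read.parquet("inputfile").count()
// Keep only integer columns whose distinct count differs from the input file row count
val keptFields = df.schema.fields
  .filter(x => x.dataType.typeName.contains("integer"))
  .filter(x => df.select(x.name).distinct().count() != cnt)
val z = df.select(keptFields.map(x => col(x.name)): _*)
val m = z.columns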
I hope the answer is helpful.

Jeanr and Ramesh suggested the right approach, and here is what I did to get the desired output; it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

Related

PySpark: remove columns with fewer than 10 non-null values

I am new to PySpark.
I have read a parquet file, and I only want to keep columns that have at least 10 non-null values.
I have used describe to get the count of non-null records for each column.
How do I now extract the column names that have fewer than 10 values and then drop those columns before writing to a new file?
df = spark.read.parquet(file)
col_count = df.describe().filter("summary = 'count'")
You can convert it into a dictionary and then filter out the keys (column names) based on their values (count < 10; the count is a StringType() which needs to be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
Use #pault's approach if the information cannot be retrieved directly from df.describe() or df.summary().
You need to drop() the bad columns rather than select() the good ones: describe()/summary() only cover numeric and string columns, so selecting columns from a list derived from df.describe() would lose columns of TimestampType(), ArrayType(), etc.

How to rename a duplicate column using column index?

I have a dataframe with two columns that share the same name. Since the first column (agreementID) holds the values, I want to rename the second one, which holds nulls, to a different name. I want to use agreementID as a key in the future.
Please help on how to rename the column using its position or index.
val columnIndex = 1
val newColumnName = "new_name"
val cols = df.columns
cols(columnIndex) = newColumnName
df.toDF(cols: _*)
This should work:
val distinctColumns = Seq("name", "agreementId", "dupAgreementId")
val renamedDf = df.toDF(distinctColumns: _*)
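If the position of the duplicate is not fixed, a small sketch along the same lines (the names "agreementID" and "dupAgreementId" are taken from the question; untested):
// Rename only the second occurrence of the duplicated column, wherever it appears
val cols = df.columns
val dupIndices = cols.zipWithIndex.collect { case ("agreementID", i) => i }
val renamedCols = if (dupIndices.length > 1) cols.updated(dupIndices(1), "dupAgreementId") else cols
val deduped = df.toDF(renamedCols: _*)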

Extract and Replace values from duplicate rows in PySpark Data Frame

I have duplicate rows that may contain the same data or have missing values in a PySpark data frame.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark Dataframe which can run as a distributed system and with fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow, and it's not possible to use it on a Spark cluster.
'''
# Columns of duplicate rows of the DF
dup_columns = df.columns

for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)

    # Match duplicates using std name and create an RDD
    fill_duplicated_rdd = ((df.where(sf.col("stdname") == row_value['stdname'])
                              .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))

    # Create feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())

    # Create a DF using the previous RDD
    # This DF stores the values of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)

    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator()
                          if value[column] != ""])
            if len(col_value) >= 1:
                # The non-null / non-empty distinct value of the column is stored here
                col_value = col_value[0]
                # print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column,
                                        sf.when((sf.col("stdname") == row_value['stdname'])
                                                & (sf.col("stdaddress") == row_value['stdaddress']),
                                                col_value)
                                          .otherwise(df_dedup[column])))
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills the empty rows in the PySpark DF with the unique values; it could even fill the rows with the mode of the values.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNotNull().limit(1).take(1) to improve the speed of the code
        # and extract a value of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column,
                                sf.when((sf.col("stdname") == row_value['stdname'])
                                        & (sf.col("stdaddress") == row_value['stdaddress']),
                                        col_value)
                                  .otherwise(df_dedup[column])))
    except:
        print("None")
"""

Iterate across columns in spark dataframe and calculate min max value

I want to iterate across the columns of a dataframe in my Spark program and calculate the min and max value for each.
I'm new to Spark and Scala, and I am not able to iterate over the columns once I fetch them in a dataframe.
I have tried running the code below, but it needs the column number to be passed to it; the question is how do I fetch it from the dataframe, pass it dynamically, and store the result in a collection?
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating over rows or records; you should be using the aggregation functions:
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First, when you do spark.read.parquet you are reading a dataframe (not an RDD).
Next we define the column we want to work on using the col function. The col function translates a column name into a Column; you could instead use df("name"), where name is the name of the column.
The agg function takes aggregation columns, and min and max are aggregation functions which take a column and return a column with the aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
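If the results also need to land in a driver-side collection, as the question mentions, one possible sketch (the aggregated row follows the order of allMinMax above, mins first and then maxes; untested):
// Collect the single aggregated row and pair each column name with its (min, max)
val n = df.columns.length
val minMaxRow = df.agg(allMinMax.head, allMinMax.tail: _*).first()
val minMaxByColumn = df.columns.zipWithIndex.map { case (name, i) =>
  name -> (minMaxRow.get(i), minMaxRow.get(i + n))
}.toMap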
You can also simply do:
df.describe().show()
which gives you statistics on all columns, including min, max, avg, count, and stddev.

Tagging a HBase Table using Spark RDD in Scala

I am trying to add an extra "tag" column to an HBase table. Tagging is done on the basis of words present in the rows of the table. For example, if "Dark" appears in a certain row, then its tag will be set to "Horror". I have read all the rows from the table into a Spark RDD and matched them against the words on which we tag. A snippet of the code looks like this:
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
  (Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieName"))),
   Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieSummary"))),
   Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieActor"))))
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words which we are looking for in the HBase table rows are in a CSV file. One of the columns of the CSV, let's say the "synonyms" column, has the words we look for. Another column in the CSV is a "target_tag" column, which has the word to be tagged onto the row for which there is a match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
reading the synonyms column: (synonyms column is the second column, therefore "p(1)" in the below snippet)
val synonyms = csv.map(_.split(",")).map( p=>p(1))
reading the target_tag column: (target_tag is the 3rd column)
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targettag have more than one string, separated by "###". The snippet to separate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and a row might contain many strings; hence, to collect every string into a set, I did this (an empty set was created beforehand):
splitsyno.map(x => x.foreach(y => set += y))
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem I am facing is that I don't know exactly which rows in the CSV match which rows in the HBase table. I need this because I have to find the corresponding target string and the HBase row to add it to. How should I get this done? Any help would be highly appreciated.
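One way to keep track of which row a match came from is to carry the row key (here MovieName) through the transformation and broadcast the synonym-to-tag pairs, instead of flattening everything into a bare set of words. A rough sketch, assuming the CSV is small enough to collect to the driver and treating each target_tag cell as a single tag (untested):
// Build a synonym -> tag lookup on the driver and broadcast it
val synToTag: Map[String, String] = csv.map(_.split(","))
  .flatMap(p => p(1).split("###").map(syn => syn -> p(2)))
  .collect()
  .toMap
val synToTagB = sc.broadcast(synToTag)

// Keep MovieName as the row key so each match can be traced back to its HBase row
val rowTags = transformedRDD.map { case (name, summary, actor) =>
  val rowWords = (name + " " + summary + " " + actor).split(" ")
  val tags = rowWords.flatMap(w => synToTagB.value.get(w)).distinct
  (name, tags)
}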