How to iterate over rows and append column values to a list using pandas - python-3.7

import pandas as pd
data = pd.read_csv("marksheet.csv")
print(data.head())
print(data.iloc[1])
index  st_id  student_name  subject  Mid_term1  Mid_term2  Quaterly Exam  Mid_term3
0      1781   John          English  Pass       Pass       Pass           Pass
1      1781   John          Maths    Pass       Fail       Pass           Pass
Required solution:
list1 = ['Pass','Pass','Pass','Pass']
list2 = ['Pass','Fail','Pass','Pass']
Is there a way to iterate over the rows, get the values from each row, and append them to a list?
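A minimal sketch of one way to do this with pandas, assuming the exam-result columns are named exactly as in the head() output above (the column list below is taken from that output; adjust it if the CSV differs):
import pandas as pd

data = pd.read_csv("marksheet.csv")

# exam-result columns to collect from each row (assumed from the output above)
result_cols = ["Mid_term1", "Mid_term2", "Quaterly Exam", "Mid_term3"]

# one list of values per row, e.g. ['Pass', 'Pass', 'Pass', 'Pass']
row_lists = data[result_cols].values.tolist()
list1 = row_lists[0]
list2 = row_lists[1]

# or iterate row by row and append explicitly
all_rows = []
for _, row in data.iterrows():
    all_rows.append([row[c] for c in result_cols])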

Related

How to get column values from a list which contains column names in a Spark Scala dataframe

I have a config defined which contains a list of columns for each table to be used as a dedup key
for example:
config 1:
val lst = List(section_xid, learner_xid)
These are the columns that need to be used as dedup keys. This list is dynamic; some tables will have 1 value, some will have 2 or 3 values in it.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"lst(0)", $"lst(1)"))))
How do I make this dynamic so that it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert the list to a DataFrame, and each value in the list needs to be in a separate column; I was not able to figure that out.
I tried doing this but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()

How would I use an ANY condition to filter if any rows of a group have a 0 value?

Say I have this dataframe...
var df = Seq(("Steve",1),("Steve",0),("Michael",3),("Michael",2),("Katherine",4),("Katherine",0),("Devin",0),("Devin",0)).toDF("name","score")
I want to return the unique names where NONE of their scores are equal to zero. So in this case, the only name that would be returned would be Michael, since both of his scores are above zero.
Thanks so much!
When you want a condition to apply across several rows, you need to use either groupBy or Window functions.
In your case, you can group by the column "name", aggregate the list of scores for each name and then filter out all the records where the list of scores contains 0. Your code would be:
import org.apache.spark.sql.functions.{col, collect_set, array_contains, not}
df.groupBy("name")
  .agg(collect_set(col("score")).as("all_scores"))
  .filter(not(array_contains(col("all_scores"), 0)))
  .select("name")

Pyspark remove columns with 10 null values

I am new to PySpark.
I have read a parquet file. I only want to keep columns that have at least 10 values.
I have used describe to get the count of non-null records for each column.
How do I now extract the column names that have fewer than 10 values, and then drop those columns before writing to a new file?
df = spark.read.parquet(file)
col_count = df.describe().filter('summary == "count"')
You can convert it into a dictionary and then filter out the keys (column names) based on their values (count < 10; the count is a StringType() which needs to be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
use #pault's approach if the information cannot be retrieved directly from df.describe() or df.summary(), etc. (a rough alternative sketch is shown below)
you need to drop() instead of select() columns, since describe()/summary() only include numeric and string columns; selecting columns from a list processed by df.describe() would lose columns of TimestampType(), ArrayType(), etc.
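As a rough sketch of such an alternative (an assumption about what a direct aggregation could look like, not necessarily #pault's exact approach), the non-null counts can also be computed without describe():
from pyspark.sql import functions as F

# count non-null values per column directly; works for any column type
non_null_counts = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
# column names with fewer than 10 non-null values
bad_cols = [c for c, v in non_null_counts.items() if v < 10]
# drop them before writing out
df_new = df.drop(*bad_cols)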

Filter columns having count equal to the input file rdd Spark

I'm filtering Integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add an additional validation: if any one of the input columns has a count equal to the input parquet file rdd count, I want to filter out that column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column for which the count is equal to the input file rdd count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get array of StructFields for the integer columns
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
//Select only those columns into a new dataframe
val z = df.select(columns.map(x => col(x.name)): _*)
//Get the column names as an array of strings
val m = z.columns
The new logic would be something like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt
I do not want to pass the column name explicitly to the new condition, since the columns having a count equal to the input file count will change (val d = .. above).
How do we write the logic for this?
According to my understanding of your question, you are trying to filter in columns with integer as the dataType and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add the column count filter to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
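For example (a small usage sketch following the .toPandas() suggestion above; distvals is the result computed in the snippet just above):
# the result is a single row; transpose it in pandas for easier inspection
distinct_counts = distvals.toPandas().T
print(distinct_counts.sort_values(by=0, ascending=False).head(20))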
You can get every distinct element of each column with
df.stat.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns a dataframe with the different values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark