I am trying to fetch the columns from a dataframe where the column values is 0 but unable to do so. did anyone tried the same ever?
## Data Frame with One Row
row = [[1,0,0,1,2,3,4,0,0,0]]
df = sc.parallelize(row).toDF(['Col1','Col2','Col3','Col4','Col5','Col6','Col7','Col8','Col9','Col10'])
df.show()
#Say you have only one row hene we wrote that zero
list_of_dict = map(lambda row: row.asDict(), df.collect())[0]
zeroCol = []
for key in list_of_dict.keys():
if list_of_dict[key] > 0:
zeroCol.append(key)
print zeroCol
+----+----+----+----+----+----+----+----+----+-----+
|Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|
+----+----+----+----+----+----+----+----+----+-----+
| 1| 0| 0| 1| 2| 3| 4| 0| 0| 0|
+----+----+----+----+----+----+----+----+----+-----+
['Col6', 'Col7', 'Col4', 'Col5', 'Col1']
Related
I have a all string spark dataframe and I need to return columns in which all rows meet a certain criteria.
scala> val df = spark.read.format("csv").option("delimiter",",").option("header", "true").option("inferSchema", "true").load("file:///home/animals.csv")
df.show()
+--------+---------+--------+
|Column 1| Column 2|Column 3|
+--------+---------+--------+
|(ani)mal| donkey| wolf|
| mammal|(mam)-mal| animal|
| chi-mps| chimps| goat|
+--------+---------+--------+
Over here the criteria is return columns where all row values have length==6, irrespective of special characters. The response should be below dataframe since all rows in column 1 and column 2 have length==6
+--------+---------+
|Column 1| Column 2|
+--------+---------+
|(ani)mal| donkey|
| mammal|(mam)-mal|
| chi-mps| chimps|
+--------+---------+
You can use regexp_replace to delete the special characters if you know what there are and then get the length, filter to field what you want.
val cols = df.columns
val df2 = cols.foldLeft(df) {
(df, c) => df.withColumn(c + "_len", length(regexp_replace(col(c), "[()-]", "")))
}
df2.show()
+--------+---------+-------+-----------+-----------+-----------+
| Column1| Column2|Column3|Column1_len|Column2_len|Column3_len|
+--------+---------+-------+-----------+-----------+-----------+
|(ani)mal| donkey| wolf| 6| 6| 4|
| mammal|(mam)-mal| animal| 6| 6| 6|
| chi-mps| chimps| goat| 6| 6| 4|
+--------+---------+-------+-----------+-----------+-----------+
Hello I am new to Spark and scala, and I have three similar dataframes as the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
| Chad| 1| 0| 5|
+--------+-------+-------+-------+
|Paraguay| 4| 6| 3|
+--------+-------+-------+-------+
| Russia| 0| 0| 1|
+--------+-------+-------+-------+
df2 and d3 are exactly similar just with different values
I would like to apply a function to each row of df1 but I also need to select the same row (using the Country as key) from the other two dataframes because I need the selected rows as input arguments for the function I want to apply.
I thought of using
df1.map{ r =>
val selectedRowDf2 = selectRow using r at column "Country" ...
val selectedRowDf3 = selectRow using r at column "Country" ...
r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3)
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach could be to append each dataframe columns with a key to uniquely identify the columns and finally merge all the dataframe to a single dataframe using country column. The desired operation could be performed on each row of the merged datafarme.
def appendColWithKey(df: DataFrame, key: String) = {
var newdf = df
df.schema.foreach(s => {
newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
})
newdf
}
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
.drop("key2_country")
.drop("key3_country")
.withColumnRenamed("key1_country", "country")
finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
This question already has answers here:
How do I add a new column to a Spark DataFrame (using PySpark)?
(10 answers)
Closed 5 years ago.
I am bit new on pyspark. I have a spark dataframe with about 5 columns and 5 records. I have list of 5 records.
Now I want to add these 5 static records from the list to the existing dataframe using withColumn. I did that, but its not working.
Any suggestions are greatly appreciated.
Below is my sample:
dq_results=[]
for a in range(0,len(dq_results)):
dataFile_df=dataFile_df.withColumn("dq_results",lit(dq_results[a]))
print lit(dq_results[a])
thanks,
Sreeram
dq_results=[]
Create one data frame from list dq_results:
df_list=spark.createDataFrame(dq_results_list,schema=dq_results_col)
Add one column for df_list id (it will be row id)
df_list_id = df_list.withColumn("id", monotonically_increasing_id())
Add one column for dataFile_df id (it will be row id)
dataFile_df= df_list.withColumn("id", monotonically_increasing_id())
Now we can join the both dataframe df_list and dataFile_df.
dataFile_df.join(df_list,"id").show()
So dataFile_df is final data frame
withColumn will add a new Column, but I guess you might want to append Rows instead. Try this:
df1 = spark.createDataFrame([(a, a*2, a+3, a+4, a+5) for a in range(5)], "A B C D E".split(' '))
new_data = [[100 + i*j for i in range(5)] for j in range(5)]
df1.unionAll(spark.createDataFrame(new_data)).show()
+---+---+---+---+---+
| A| B| C| D| E|
+---+---+---+---+---+
| 0| 0| 3| 4| 5|
| 1| 2| 4| 5| 6|
| 2| 4| 5| 6| 7|
| 3| 6| 6| 7| 8|
| 4| 8| 7| 8| 9|
|100|100|100|100|100|
|100|101|102|103|104|
|100|102|104|106|108|
|100|103|106|109|112|
|100|104|108|112|116|
+---+---+---+---+---+
I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 2|
|2017-05-20| 1|
|2017-06-23| 2|
|2017-06-16| 3|
|2017-06-30| 1|
I want to replace the count values by 0, where it is greater than 1, i.e., the resultant dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 0|
|2017-05-20| 1|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| 1|
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon| count|
+----------+--------+
|2017-06-28| null|
|2017-06-17| 0|
|2017-05-20| null|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| null|
I am not able to understand, why for the value 1, null is getting displayed and how to overcome that. Can anyone help me?
You need to chain otherwise after when to specify the values where the conditions don't hold; In your case, it would be count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))
This can be done using udf function too
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note : udf functions should always be the choice only when there is no inbuilt functions as it requires serialization and deserialization of column data.
I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired endresult would be:
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For true sliding window (like sliding average) you can use rowsBetween or rangeBetween clauses of the window definition but it is not really required here. Nevertheless example usage could be something like this:
val w2 = Window.partitionBy("userId")
.orderBy(asc("timestamp"))
.rowsBetween(-1, 0)
avg($"foo").over(w2)