from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0, 1.2, -1.3), (0.0, 0.0, 0.0),
(-17.2, 20.3, 15.2), (23.4, 1.4, 0.0),],
['col1', 'col2', 'col3'])
df1 = df.agg(F.avg('col1'))
df2 = df.agg(F.avg('col2'))
df3 = df.agg(F.avg('col3'))
If I have a dataframe,
ID COL1 COL2 COL3
1 0.0 1.2 -1.3
2 0.0 0.0 0.0
3 -17.2 20.3 15.2
4 23.4 1.4 0.0
I want to calculate mean for each column.
avg1 avg2 avg3
1 3.1 7.6 6.9
The result of the above code is 1.54, 5.725, 3.47, which includes the zero elements in the averages.
How can I compute the averages while excluding the zeros?
Null values do not affect the average, so if you convert the zero values to null you get the average of the non-zero values:
(
    df
    .agg(
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()
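If there are many columns, the same zero-to-null trick can be generalized by building the aggregation expressions in a comprehension (a small sketch, assuming all the listed columns are numeric):
cols = ['col1', 'col2', 'col3']
# One avg expression per column, with zeros turned into nulls first so they are ignored
exprs = [F.avg(F.when(F.col(c) == 0, None).otherwise(F.col(c))).alias('avg({})'.format(c))
         for c in cols]
df.agg(*exprs).show()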
How can I change the value of a column based on a validation between rows? I need to compare the kilometraje values of each customer's (id) records to check whether each following record has a higher kilometraje than the one before.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is there because, for customer (id) 2, the kilometraje value is less than the same customer's record for 2/1/2019. (As time passes the car is used, so the kilometraje increases; for there to be no error, the mileage has to be higher than or equal to the previous one.)
I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and overwrite error_code with ERROR where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))
df = df.withColumn('error_code',
                   F.when(((F.col('estado') == 'O') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'D') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'A') & (F.col('id_cliente') == '')),
                          F.concat(F.col("error_code"), F.lit(":[{}]".format(error))))
                    .otherwise(F.col('error_code')))
You can achieve that with the lag window function. lag returns the value from the row before the current row, and with that you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 30 ),
('7/1/2019', 3 , 30 ),
('4/1/2019', 2 , 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha, 'dd/MM/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' because its previous row has a smaller kilometraje value (10 < 30). When you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:
(df.drop('error_km')
   .join(df.filter(df.error_km == 'ERROR')
           .groupby('id')
           .agg(F.first(df.error_km).alias('error_km')),
         'id', 'left')
   .show())
I use .rangeBetween(Window.unboundedPreceding, 0).
This window frame reaches from the first row of the partition up to the current row, so for each row you can look back over all the preceding values (here, to take the running maximum of the mileage):
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 22 ),
('7/1/2019', 1 , 23 ),
('22/1/2019', 2 , 5),
('11/1/2019', 2 , 24),
('13/2/2019', 1 , 16),
('14/2/2019', 2 , 18),
('5/2/2019', 1 , 19),
('6/2/2019', 2 , 23),
('7/2/2019', 1 , 14),
('8/3/2019', 1 , 50),
('8/3/2019', 2 , 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to drop the helper column that holds the running maximum:
df = df.drop('max')
df.show()
I am currently learning PySpark and working on adding columns to PySpark dataframes using multiple conditions.
I have tried working with UDFs, but I am getting errors like:
TypeError: 'object' object has no attribute '__getitem__'
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType, StringType, FloatType
from pyspark.sql.functions import pandas_udf, PandasUDFType
#first dataframe (superset)
df1 = simple_example1
#second dataframe
df = diff_cols.dropna()
def func(x,y):
z = df1[(df1['a'] == x) & (df1['b'] <= (y+10000000000)) & (df1['b'] >= (y-10000000000))]
z = z[(z["c"] ==1) | (z["d"] ==1)]
z = z[(z["e"]!=0) | (z["f"]!=0) | (z["g"]!=0) | (z["h"]!=0)]
return 1 if z.count() > 3 else 0
udf_func = udf(func, IntegerType())
df = df.withColumn('status', udf_func(df['a'],df['b']))
What I am trying to do is as follows (see the sketch after this list):
1. For each row of df, filter the rows of df1 where parameter a is equal to the a in df and parameter b is between b-10 and b+10.
2. Then filter that data further to rows where either c or d equals 1.
3. Then filter further to rows where any of the parameters e, f, g, h is non-zero.
4. Then count the number of rows in that subset and assign 0/1 based on the count.
5. Return this 0/1 in the status column of df.
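A minimal sketch of how these steps could be expressed with a join plus a conditional aggregation instead of a row-wise UDF (assumptions: df has columns a and b, df1 has columns a through h, the pair (a, b) identifies a row of df, and the ±10 window and count threshold of 3 mirror the description and code above; adjust them to your actual data):
from pyspark.sql import functions as F

d0 = df.alias('d0')    # the dataframe that should receive the status column
d1 = df1.alias('d1')   # the superset dataframe that is filtered per row

# Steps 1-3: keep df1 rows with the same 'a', a 'b' within +/-10 of df.b,
# c or d equal to 1, and at least one of e, f, g, h non-zero.
matches = d0.join(
    d1,
    (F.col('d0.a') == F.col('d1.a'))
    & F.col('d1.b').between(F.col('d0.b') - 10, F.col('d0.b') + 10)
    & ((F.col('d1.c') == 1) | (F.col('d1.d') == 1))
    & ((F.col('d1.e') != 0) | (F.col('d1.f') != 0)
       | (F.col('d1.g') != 0) | (F.col('d1.h') != 0)),
    'inner')

# Steps 4-5: count the qualifying matches per row of df and map the count to 0/1.
counts = (matches
          .groupBy(F.col('d0.a').alias('a'), F.col('d0.b').alias('b'))
          .agg(F.count('*').alias('n')))
result = (df.join(counts, ['a', 'b'], 'left')
          .withColumn('status', F.when(F.col('n') > 3, 1).otherwise(0))
          .drop('n'))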
Initially I have a matrix
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
The matrix is converted into a normal_array by
`val normal_array = matrix.toArray`
and I have an array of strings:
inputCols : Array[String] = Array(p1, p2, p3, p4)
I need to convert this matrix into the following data frame. (Note: the number of rows and columns in the matrix will be the same as the length of inputCols.)
index p1 p2 p3 p4
p1 0.0 0.4 0.4 0.0
p2 0.1 0.0 0.0 0.7
p3 0.0 0.2 0.0 0.3
p4 0.3 0.0 0.0 0.0
In Python, this can easily be achieved with the pandas library.
arrayToDataframe = pandas.DataFrame(normal_array,columns = inputCols, index = inputCols)
But how can I do this in Scala?
You can do something like the following:
//Imports for the udf, col and monotonicallyIncreasingId functions used below
import org.apache.spark.sql.functions._
//convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
//Define your Array of desired columns
val inputCols : Array[String] = Array("p1", "p2", "p3", "p4")
//Create a DataFrame from the given data; it will create a dataframe with its own column names like _1, _2 etc
val df = sparkSession.createDataFrame(list)
//Getting the list of column names from dataframe
val dfColumns=df.columns
//Creating query to rename columns
val query=inputCols.zipWithIndex.map(index=>dfColumns(index._2)+" as "+inputCols(index._2))
//Firing above query
val newDf=df.selectExpr(query:_*)
//Creating a udf which gets an index (0,1,2,3) as input and returns the corresponding column name from your given array of columns
val getIndexUDF=udf((row_no:Int)=>inputCols(row_no))
//Adding a temporary column row_no that holds the row index, using it to build the index column and then dropping row_no
val dfWithRow=newDf.withColumn("row_no",monotonicallyIncreasingId).withColumn("index",getIndexUDF(col("row_no"))).drop("row_no")
dfWithRow.show
Sample Output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0| p1|
|0.1|0.0|0.0|0.7| p2|
|0.0|0.2|0.0|0.3| p3|
|0.3|0.0|0.0|0.0| p4|
+---+---+---+---+-----+
Here is another way:
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4","index")
Zip the collection and convert it into a DataFrame:
data.zip(cols).map {
case (col,index) => (col._1,col._2,col._3,col._4,index)
}.toDF(cols: _*)
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1 |
|0.1|0.0|0.0|0.7|p2 |
|0.0|0.2|0.0|0.3|p3 |
|0.3|0.0|0.0|0.0|p4 |
+---+---+---+---+-----+
A newer and shorter version, for Spark versions above 2.4.5, should look like the following.
Please find the inline description of the statements.
import org.apache.spark.sql.{SparkSession, functions}

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
val cols = (1 to 4).map( i => s"p$i")
val listDf = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
.toDF(cols: _*) // Map the data to new column names
.withColumn("index", // Create a column with auto increasing id
functions.concat(functions.lit("p"),functions.monotonically_increasing_id()))
listDf.show()
I have an RDD, and I want to get the running average of the values up to and including the current position in the RDD.
For example:
inputRDD:
1, 2, 3, 4, 5, 6, 7, 8
output:
1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5
This is my attempt:
import scala.collection.mutable.ArrayBuffer
val rdd=sc.parallelize(List(1,2,3,4,5,6,7,8),4)
var sum=0.0
var index=0.0
val partition=rdd.getNumPartitions
rdd.zipWithIndex().collect().foreach(println)
rdd.zipWithIndex().sortBy(x=>{x._2},true,1).mapPartitions(ite=>{
var result=new ArrayBuffer[Tuple2[Double,Long]]()
while (ite.hasNext){
val iteNext=ite.next()
sum+=iteNext._1
index+=1
var avg:Double=sum/index
result.append((avg,iteNext._2))
}
result.toIterator
}).sortBy(x=>{x._2},true,partition).map(x=>{x._1}).collect().foreach(println)
I have to repartition to 1 and then calculate it with an array, which is very inefficient.
Is there any cleaner solution that keeps the 4 partitions and avoids the array?
Sorry, I don't use Scala, but I hope you can read this:
from pyspark.sql import functions as f
from pyspark.sql import Window

df = spark.createDataFrame(map(lambda x: (x,), range(1, 9)), ['val'])
df = df.withColumn('spec_avg',
                   f.avg('val').over(Window.orderBy('val')
                                           .rowsBetween(Window.unboundedPreceding, 0)))
A simpler solution would be to use Spark SQL.
Here I am computing the running average for each row:
import spark.implicits._
val df = sc.parallelize(List(1,2,3,4,5,6,7,8)).toDF("col1")
df.createOrReplaceTempView("table1")
val result = spark.sql("""SELECT col1, sum(col1) over(order by col1 asc)/row_number() over(order by col1 asc) as avg FROM table1""")
Or alternatively, if you want to use the DataFrame API:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val result = df
.withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
.withColumn("rownum", row_number().over(Window.orderBy($"col1")))
.withColumn("avg", $"csum"/$"rownum")
.select("col1","avg")
Output:
result.show()
+----+---+
|col1|avg|
+----+---+
| 1|1.0|
| 2|1.5|
| 3|2.0|
| 4|2.5|
| 5|3.0|
| 6|3.5|
| 7|4.0|
| 8|4.5|
+----+---+
I have a DataFrame:
name column1 column2 column3 column4
first 2 1 2.1 5.4
test 1.5 0.5 0.9 3.7
choose 7 2.9 9.1 2.5
I want a new dataframe with a column that contains, for each row, the name of the column with the maximum value:
| name | max_column |
|--------|------------|
| first | column4 |
| test | column4 |
| choose | column3 |
Thank you very much for your support.
There might be a better way of writing the UDF, but this could be a working solution:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
//implicits for magic functions like .toDF
import spark.implicits._
import org.apache.spark.sql.functions.udf
//We have to hard-code the number of params as a UDF doesn't support a variable number of args
val maxval = udf((c1: Double, c2: Double, c3: Double, c4: Double) =>
if(c1 >= c2 && c1 >= c3 && c1 >= c4)
"column1"
else if(c2 >= c1 && c2 >= c3 && c2 >= c4)
"column2"
else if(c3 >= c1 && c3 >= c2 && c3 >= c4)
"column3"
else
"column4"
)
//create schema class
case class Record(name: String,
column1: Double,
column2: Double,
column3: Double,
column4: Double)
val df = Seq(
Record("first", 2.0, 1, 2.1, 5.4),
Record("test", 1.5, 0.5, 0.9, 3.7),
Record("choose", 7, 2.9, 9.1, 2.5)
).toDF();
df.withColumn("max_column", maxval($"column1", $"column2", $"column3", $"column4"))
.select("name", "max_column").show
Output
+------+----------+
| name|max_column|
+------+----------+
| first| column4|
| test| column4|
|choose| column3|
+------+----------+
You can get the job done by making a detour to an RDD and using getValuesMap:
val dfIn = Seq(
("first", 2.0, 1., 2.1, 5.4),
("test", 1.5, 0.5, 0.9, 3.7),
("choose", 7., 2.9, 9.1, 2.5)
).toDF("name","column1","column2","column3","column4")
The simple solution is
val dfOut = dfIn.rdd
.map(r => (
r.getString(0),
r.getValuesMap[Double](r.schema.fieldNames.filter(_!="name"))
))
.map{case (n,m) => (n,m.maxBy(_._2)._1)}
.toDF("name","max_column")
But if you want to keep all the columns from the original dataframe (as in Scala/Spark dataframes: find the column name corresponding to the max), you have to play a bit with merging rows and extending the schema:
import org.apache.spark.sql.types.{StructType,StructField,StringType}
import org.apache.spark.sql.Row
val dfOut = sqlContext.createDataFrame(
dfIn.rdd
.map(r => (r, r.getValuesMap[Double](r.schema.fieldNames.drop(1))))
.map{case (r,m) => Row.merge(r,(Row(m.maxBy(_._2)._1)))},
dfIn.schema.add(StructField("max_column",StringType))
)
I want to post my final solution:
val maxValAsMap = udf((keys: Seq[String], values: Seq[Any]) => {
  val valueMap: Map[String, Double] = (keys zip values).filter(_._2.isInstanceOf[Double]).map {
    case (x, y) => (x, y.asInstanceOf[Double])
  }.toMap
  if (valueMap.isEmpty) "not computed" else valueMap.maxBy(_._2)._1
})
val finalDf = originalDf.withColumn("max_column", maxValAsMap(keys, values)).select("cookie_id", "max_column")
It works very fast.
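For reference, here is one assumed way the keys and values columns used above could be built from the numeric columns of the original dataframe (a sketch only; the column names are hypothetical):
import org.apache.spark.sql.functions.{array, col, lit}

// Hypothetical list of the numeric columns to compare; keys/values are then passed to maxValAsMap above
val numericCols = Seq("column1", "column2", "column3", "column4")
val keys = array(numericCols.map(lit(_)): _*)     // array of the column names
val values = array(numericCols.map(col(_)): _*)   // array of the corresponding column values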