I have a self written function and it gets a dataframe and returns the whole dataframe plus a new column. That new column must not have a fixed name but instead the current month as part of the new column name. E.g. "forecast_august2022".
I tried it like
.withColumnRenamed(
old_columnname,
new_columnname
)
But I do not know, how to create the new column name with a fixed value (forecast_) concatenating it with the current month. Ideas?
You can define a variable at start with current month and year and use it in f string while adding it in with column
from pyspark.sql import functions as F
import datetime
mydate = datetime.datetime.now()
month_nm=mydate.strftime("%B%Y") #gives you July2022 for today
dql1=spark.range(3).toDF("ID")
dql1.withColumn(f"forecast_{month_nm}",F.lit(0)).show()
#output
+---+-----------------+
| ID|forecast_July2022|
+---+-----------------+
| 0| 0|
| 1| 0|
| 2| 0|
+---+-----------------+
you can do like this :
>>> import datetime
>>> #provide month number
>>> month_num = "3"
>>> datetime_object = datetime.datetime.strptime(month_num, "%m")
>>> full_month_name = datetime_object.strftime("%B")
>>> df.withColumn(("newcol"+"_"+full_month_name+"2022"),col('period')).show()
+------+-------+------+----------------+
|period|product|amount|newcol_March2022|
+------+-------+------+----------------+
| 20191| prod1| 30| 20191|
| 20192| prod1| 30| 20192|
| 20191| prod2| 20| 20191|
| 20191| prod3| 60| 20191|
| 20193| prod1| 30| 20193|
| 20193| prod2| 30| 20193|
+------+-------+------+----------------+
Related
I'm trying to remove numbers and full stops that lead the names of horses in a betting dataframe.
The format is like this:
Horse Name
Horse Name
I would like the resulting df column to just have the horses name.
I've tried splitting the column at the full stop but am not getting the required result.
import pyspark.sql.functions as F
runners_returns = runners_returns.withColumn('runner_name', F.split(F.col('runner_name'), '.'))
Any help is greatly appreciated
With a Dataframe like the following.
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can do remove the leading numbers and periods like this.
import pyspark.sql.functions as F
df = (df.withColumn("runner_name",
F.regexp_replace('runner_name', r'(^[\d\.]+)', '')))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
Pyspark newbie here. I have a dataframe, say,
+------------+-------+----+
| id| mode|count|
+------------+------+-----+
| 146360 | DOS| 30|
| 423541 | UNO| 3|
+------------+------+-----+
I want a dataframe with a new column aggregate with count * 2 , when mode is 'DOS' and count * 1 when mode is 'UNO'
+------------+-------+----+---------+
| id| mode|count|aggregate|
+------------+------+-----+---------+
| 146360 | DOS| 30| 60|
| 423541 | UNO| 3| 3|
+------------+------+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when :
from pyspark.sql.functions import when,col
df = df.withColumn('aggregate', when(col('mode')=='DOS', col('count')*2).when(col('mode')=='UNO', col('count')*1).otherwise('count'))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
| id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS| 30| 60|
|423541| UNO| 3| 3|
+------+----+-----+---------+
I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a spark dataframe, with column A has values of 1,1,2,2,1
So I want to count how many times each distinct value (in this case, 1 and 2) appears in the column A, and print something like
distinct_values | number_of_apperance
1 | 3
2 | 2
I just post this as I think the other answer with the alias could be confusing. What you need are the groupby and the count methods:
from pyspark.sql.types import *
l = [
1
,1
,2
,2
,1
]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
I am not sure if you are looking for below solution:
Here are my thoughts on this. Suppose you have a dataframe like this.
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show();
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know the distinct country code appears in this particular dataframe and should be printed as alias name.
import pyspark.sql.functions as func
df.groupBy('country').count().select(func.col("country").alias("distinct_country"),func.col("count").alias("country_count")).show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
were you looking something similar to this?
I am having problem figuring this. Here is the problem statement
lets say I have a dataframe, I want to select value for column c where column b value is foo and create a new column D and repeat the vale "3" for all rows
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value i.e. 3 with foo in column B
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value using withColumn to create a new column D using lit function
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
I am trying to create an RDD of LabeledPoint from a data frame, so I can later use it for MlLib.
The code below works fine if my_target column is the first column in sparkDF. However, if my_target column is not the first column, how do I modify the code below to exclude my_target to create a correct LabeledPoint?
import pyspark.mllib.classification as clf
labeledData = sparkDF.rdd.map(lambda row: clf.LabeledPoint(row['my_target'],row[1:]))
logRegr = clf.LogisticRegressionWithSGD.train(labeledData)
That is, row[1:] now exclude the value in the first column; if I want to exclude value in column N of row, how do I do this? Thanks!
>>> a = [(1,21,31,41),(2,22,32,42),(3,23,33,43),(4,24,34,44),(5,25,35,45)]
>>> df = spark.createDataFrame(a,["foo","bar","baz","bat"])
>>> df.show()
+---+---+---+---+
|foo|bar|baz|bat|
+---+---+---+---+
| 1| 21| 31| 41|
| 2| 22| 32| 42|
| 3| 23| 33| 43|
| 4| 24| 34| 44|
| 5| 25| 35| 45|
+---+---+---+---+
>>> N = 2
# N is the column that you want to exclude (in this example the third, indexing starts at 0)
>>> labeledData = df.rdd.map(lambda row: LabeledPoint(row['foo'],row[:N]+row[N+1:]))
# it is just a concatenation with N that is excluded both in row[:N] and row[N+1:]
>>> labeledData.collect()
[LabeledPoint(1.0, [1.0,21.0,41.0]), LabeledPoint(2.0, [2.0,22.0,42.0]), LabeledPoint(3.0, [3.0,23.0,43.0]), LabeledPoint(4.0, [4.0,24.0,44.0]), LabeledPoint(5.0, [5.0,25.0,45.0])]