I have a DataFrame in which one of the columns holds a list of words. How can I convert them to lower case efficiently?
The DataFrame has many columns, but the one I am trying to lower-case looks like this:
B
['Summer','Air Bus','Got']
['Parmin','Home']
Note:
In pandas I do df['B'].str.lower()
If I understood you correctly, you have a column that is an array of strings.
To lower-case the strings, you can apply lower to every element of the array with transform, like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
{"B": ["Summer", "Air Bus", "Got"]},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = df.withColumn("result", F.expr("transform(B, x -> lower(x))"))
Result:
+----------------------+----------------------+
|B |result |
+----------------------+----------------------+
|[Summer, Air Bus, Got]|[summer, air bus, got]|
+----------------------+----------------------+
A slight variation on @vladsiv's answer, which tries to answer a question in the comments about passing a dynamic column name.
# set column name
m = "B"
# use F.transform directly (available in PySpark 3.1+), rather than inside F.expr
df = df.withColumn("result", F.transform(F.col(m), lambda x: F.lower(x)))
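On Spark versions older than 3.1, where F.transform is not available in Python, the dynamic column name can instead be substituted into the SQL text that is handed to F.expr. The sketch below only builds that text; the helper name lower_array_expr is made up for illustration, and it assumes the column name contains no backtick.

```python
def lower_array_expr(column_name: str) -> str:
    """Build the SQL text that lower-cases every element of an array column."""
    # Backticks quote the column name so names with spaces or dots still parse.
    # Assumption: the name itself contains no backtick.
    return f"transform(`{column_name}`, x -> lower(x))"


m = "B"
expr_text = lower_array_expr(m)
print(expr_text)  # transform(`B`, x -> lower(x))
```

The resulting string would then be used as df.withColumn("result", F.expr(lower_array_expr(m))).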
I have a dataframe like this
df.show(5)
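+-------+-----------+---------------------+----+
|kv     |list1      |list2                |p   |
+-------+-----------+---------------------+----+
|[k1,v2]|[1,2,5,9]  |[5,1,7,9,6,3,1,4,9]  |0.5 |
|[k1,v3]|[1,2,5,8,9]|[5,1,7,9,6,3,1,4,15] |0.9 |
|[k2,v2]|[77,2,5,9] |[0,1,8,9,7,3,1,4,100]|0.01|
|[k5,v5]|[1,0,5,9]  |[5,1,7,9,6,3,1,4,3]  |0.3 |
|[k9,v2]|[1,2,5,9]  |[5,1,7,9,6,3,1,4,200]|2.5 |
+-------+-----------+---------------------+----+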
df.count()
5200158
I want to get the row that has the maximum p. The code below works for me, but I don't know if there is a cleaner way:
val f = df.select(max(struct(
  col("p") +: df.columns.collect { case x if x != "p" => col(x) }: _*
))).first()
Just order by and then take:
import org.apache.spark.sql.functions.desc
df.orderBy(desc("p")).take(1)
or
df.orderBy(desc("p")).limit(1).first
You can also use window functions, which is especially useful if the logic for selecting the row gets more complex than a global min/max:
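import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark
df
  .withColumn("max_p", max($"p").over(Window.partitionBy()))
  .where($"p" === $"max_p")
  .drop($"max_p")
  .first()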
Below, I drop column A from a DataFrame because I want to apply a transformation to it (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I join the two resulting DataFrames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd\
        .map(lambda x: (x.ID, json.loads(x.A))
             if x.A is not None else (x.ID, None))\
        .toDF()\
        .withColumnRenamed('_1', 'ID')\
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
What I dislike about this is, of course, the overhead of the extra withColumnRenamed operations.
With pandas, all I'd do is something like this:
pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So is there an easier way than what I'm doing in my first code snippet?
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
    .map(
        lambda row: tuple(
            [row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
             for c in cols]
        )
    )\
    .toDF(cols)
*I haven't actually tested this code, but I think this should work.
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
Where my_transformation_function() can be a udf or a pyspark sql function.
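As a rough sketch of what such a transformation function could look like for the JSON case in the question, here is the plain Python part on its own. The name parse_json_or_none is hypothetical, and returning None for malformed input is an assumption rather than something the question asks for.

```python
import json
from typing import Any, Optional


def parse_json_or_none(value: Optional[str]) -> Any:
    """Parse a JSON string, passing nulls through unchanged."""
    if value is None:
        return None
    try:
        return json.loads(value)
    except (TypeError, ValueError):
        # Assumption: malformed JSON becomes null instead of failing the job.
        # json.JSONDecodeError is a subclass of ValueError.
        return None


print(parse_json_or_none("[0, 0, 0]"))  # [0, 0, 0]
print(parse_json_or_none(None))         # None
```

To use it with withColumn it would be wrapped as a udf with an explicit return type, for example F.udf(parse_json_or_none, ArrayType(IntegerType())) if every value is a JSON list of integers as in the pandas example. Without a return type, udf defaults to StringType.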
From what I could understand, is this what you are trying to achieve?
import pyspark.sql.functions as F
import json
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)
cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] + \
[col for col in cols if col != 'A'])
I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
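Note that ith_ above only catches ValueError. A more defensive variant might also handle null rows and out-of-range indices. The sketch below uses plain Python lists as stand-ins for vectors; the name ith_or_none is hypothetical, and it assumes that an out-of-range index raises IndexError, as it does for lists.

```python
from typing import Optional, Sequence


def ith_or_none(v: Optional[Sequence[float]], i: int) -> Optional[float]:
    """Return element i as a plain Python float, or None if it is unavailable."""
    if v is None:
        # A null row reaches a Python UDF as None.
        return None
    try:
        return float(v[i])
    except (IndexError, TypeError, ValueError):
        # Out-of-range index, non-integer index, or a value that is not numeric.
        return None


print(ith_or_none([1.0, 2.0, 3.0], 1))  # 2.0
print(ith_or_none([1.0, 2.0, 3.0], 5))  # None
print(ith_or_none(None, 0))             # None
```

It would be registered the same way as ith_, that is udf(ith_or_none, DoubleType()).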
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors) you should use the item method:
v.values.item(0)
which returns a standard Python scalar. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
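The difference is easy to see with NumPy alone, with no Spark involved. In this small illustration a plain NumPy array stands in for v.values or v.toArray():

```python
import numpy as np

# Stand-in for v.values / v.toArray(), which are NumPy arrays.
values = np.array([1.0, 2.0, 3.0])

raw = values[0]          # NumPy scalar (numpy.float64)
plain = values.item(0)   # standard Python float
as_list = values.tolist()  # list of standard Python floats

print(type(raw).__name__)    # float64
print(type(plain).__name__)  # float
print(as_list)               # [1.0, 2.0, 3.0]
```

Indexing returns a numpy.float64, which is what triggers the pickle error in the question, whereas item() and tolist() produce built-in floats that can be serialized.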
If you prefer using spark.sql, you can use the following custom function 'to_array' to convert the vector to an array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
    return v.toArray().tolist()
# Register the function for use in SQL; assumes a SparkSession named spark
spark.udf.register("to_array", to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlicer from the pyspark.ml.feature module, like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
This is for anyone trying to split the probability column generated after training a PySpark ML model into usable columns. It does not use a UDF or numpy, and it only works for binary classification. Here lr_pred is the DataFrame holding the predictions from the logistic regression model.
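from pyspark.sql.functions import regexp_replace, split
from pyspark.sql.types import DoubleType
prob_df1 = lr_pred.withColumn("probability", lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn('probabilityre', split(regexp_replace("probability", r"^\[|\]", ""), ",")[1].cast(DoubleType()))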
Since Spark 3.0.0 this can be done without a UDF, using vector_to_array:
from pyspark.ml.functions import vector_to_array
df.select(vector_to_array(df.features)[0].alias("first_element")).show()
Why is Vector[Double] used in the results? That's not a very nice data type.
I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal, but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
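The reason this works is that when without an otherwise branch yields null for NaN entries, and mean skips nulls. A plain-Python model of that per-column computation might look like the sketch below. It is not Spark code, and the function name is made up for illustration.

```python
import math
from typing import Iterable, Optional


def mean_ignoring_null_and_nan(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average the values that are neither None (null) nor NaN."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        # Mirrors an aggregate over no usable rows, which yields null.
        return None
    return sum(kept) / len(kept)


print(mean_ignoring_null_and_nan([1.0, None, float("nan"), 3.0]))  # 2.0
```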
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }  # use .iteritems() on Python 2
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing ) so that only the original column name remains.
Fill the columns of the dataframe with the averages using col_avgs
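The renaming step relies on fixed slice positions (k[4:-1]). A slightly more explicit version of that step, shown here on a hand-written dictionary rather than a real aggregation result, could look like this. The helper strip_avg and the sample values are made up for illustration.

```python
def strip_avg(name: str) -> str:
    """Turn 'avg(col1)' into 'col1'; leave any other name untouched."""
    prefix, suffix = "avg(", ")"
    if name.startswith(prefix) and name.endswith(suffix):
        return name[len(prefix):-len(suffix)]
    return name


# Hypothetical aggregation result, shaped like df.agg(mean_dict).collect()[0].asDict()
col_avgs = {"avg(col1)": 1.5, "avg(col2)": 4.0}
cleaned = {strip_avg(k): v for k, v in col_avgs.items()}
print(cleaned)  # {'col1': 1.5, 'col2': 4.0}
```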
I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable: keep only the top few levels that have more than n observations (say, 1000) and club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
// Extract all levels having more than n observations (df is the dataframe name; the filter below uses n = 10000)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))
// Extract the level names
val level_names = levels_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
// Define UDF
val var_transform = udf((x: String) => {
  if (level_names contains x) x
  else "others"
})
// Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that is, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust Catalyst and Tungsten to optimise it (e.g. by turning the join into a broadcast join):
import org.apache.spark.sql.functions.{coalesce, lit}
val levels_count = df
  .groupBy($"Col_name".as("new_col_name"))
  .count
  .filter("count >10000")
val df_new = df
  .join(levels_count, $"Col_name" === $"new_col_name", joinType = "leftOuter")
  .drop("Col_name")
  .withColumn("new_col_name", coalesce($"new_col_name", lit("other")))
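To sanity-check the result on a small sample, the intended transformation can be modelled in plain Python. This is only a reference sketch for comparison, not a Spark implementation; the function name club_rare_levels and the tiny threshold are made up for illustration.

```python
from collections import Counter
from typing import List


def club_rare_levels(values: List[str], min_count: int, other: str = "other") -> List[str]:
    """Keep levels seen more than min_count times; replace the rest with `other`."""
    counts = Counter(values)
    keep = {level for level, n in counts.items() if n > min_count}
    return [v if v in keep else other for v in values]


sample = ["a", "a", "a", "b", "b", "c"]
print(club_rare_levels(sample, min_count=1))  # ['a', 'a', 'a', 'b', 'b', 'other']
```

Comparing this against df_new collected from a small DataFrame is a quick way to confirm that the join and coalesce behave as expected.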