PySpark: remove purely alphabetic elements from an array column

I have an array column like [abc, 123, ab12] in my df and would like to remove the elements that are purely alphabetic, so the output for this example would be [123, ab12]. Are there any built-ins that avoid using a UDF?
Thank you guys!

You can filter with an appropriate regex:
import pyspark.sql.functions as F
df2 = df.withColumn('arr', F.expr("filter(arr, x -> x not rlike '^[a-zA-Z]+$')"))
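For reference, here is a minimal end-to-end sketch of the same approach (the DataFrame below is constructed just for illustration; note that the higher-order filter function requires Spark 2.4+):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# toy data matching the example in the question
df = spark.createDataFrame([(["abc", "123", "ab12"],)], ["arr"])

# keep only the elements that are not purely alphabetic
df2 = df.withColumn("arr", F.expr("filter(arr, x -> x not rlike '^[a-zA-Z]+$')"))
df2.show(truncate=False)
# +-----------+
# |arr        |
# +-----------+
# |[123, ab12]|
# +-----------+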

Related

How to make elements of a list lower case?

I have a df where one of the columns is a list of words. How can I make them lower case efficiently?
The df has many columns, but the column I am trying to make lower case looks like this:
B
['Summer','Air Bus','Got']
['Parmin','Home']
Note:
In pandas I do df['B'].str.lower()
If I understood you correctly, you have a column that is an array of strings.
To lowercase the strings, you can use the lower function inside transform like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
    {"B": ["Summer", "Air Bus", "Got"]},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
# apply lower() to every element of the array column
df = df.withColumn("result", F.expr("transform(B, x -> lower(x))"))
df.show(truncate=False)
Result:
+----------------------+----------------------+
|B |result |
+----------------------+----------------------+
|[Summer, Air Bus, Got]|[summer, air bus, got]|
+----------------------+----------------------+
A slight variation on @vladsiv's answer, which addresses a question in the comments about passing a dynamic column name.
# set column name
m = "B"
# use F.transform directly, rather than inside F.expr
df = df.withColumn("result", F.transform(F.col(m), lambda x: F.lower(x)))

Pyspark: I want to manually map the values of one of the columns in my dataframe

I have a dataframe in spark and I want to manually map the values of one of the columns:
Col1
Y
N
N
Y
N
Y
I want "Y" to be equal to 1 and "N" to be equal to 0, like this:
Col1
1
0
0
1
0
1
I have tried StringIndexer, but I think it encodes the categorical data arbitrarily (I am not sure).
The pandas equivalent for this is:
df["Col1"] = df["Col1"].map({"Y": 1, "N": 0})
Can you please help me with how to achieve this in PySpark?
Since you want to map the values to 1 and 0, an easy way is to specify a boolean condition and cast the result to int
from pyspark.sql.functions import col
df.withColumn("Col1", (col("Col1")=="Y").cast("int"))
For a more general case, you can use pyspark.sql.functions.when to implement if-then-else logic:
from pyspark.sql.functions import when
df.withColumn("Col1", when(col("Col1").isin(["Y"]), 1).otherwise(0))

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of the dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala:
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
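By default the Python Imputer also uses the mean; the strategy can be chosen in the constructor, mirroring the Scala setStrategy call above. A minimal sketch (still assuming, as in the answer, that all columns of df are numeric):
from pyspark.ml.feature import Imputer

imputer = Imputer(
    strategy="median",  # or "mean", which is the default
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
df_imputed = imputer.fit(df).transform(df)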
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes the average of each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects the aggregated values and converts the row to a Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] mapping each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}

df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2:
# filter the numeric columns
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
# compute a dict of <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = {col: 'mean' for col in df.columns}
col_avgs = df.agg(mean_dict).collect()[0].asDict()
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
df.fillna(col_avgs).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs

Pyspark data frame aggregation with user defined function

How can I use 'groupby(key).agg()' with a user defined function? Specifically, I need a list of all unique values per key [not the count].
The collect_set and collect_list functions (which collect unique values and all values, respectively) can be used to post-process groupby results. Starting out with a simple Spark DataFrame:
df = sqlContext.createDataFrame(
    [('first-neuron', 1, [0.0, 1.0, 2.0]),
     ('first-neuron', 2, [1.0, 2.0, 3.0, 4.0])],
    ("neuron_id", "time", "V"))
Let's say the goal is to return the longest length of the V list for each neuron (grouped by neuron_id):
from pyspark.sql import functions as F
grouped_df = df.groupby('neuron_id').agg(F.collect_list('V'))
We have now grouped the V lists into a list of lists. Since we wanted the longest length we can run
import numpy as np
import pyspark.sql.types as sq_types
len_udf = F.udf(lambda v_list: int(np.max([len(v) for v in v_list])),
                returnType=sq_types.IntegerType())
max_len_df = grouped_df.withColumn('max_len', len_udf('collect_list(V)'))
This adds a max_len column containing the maximum length of the V lists.
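As an aside, for this particular goal (the longest V per neuron) the UDF can be avoided entirely by aggregating on the array size directly. A sketch, reusing the df defined above:
from pyspark.sql import functions as F

# size() gives the length of each V array; max() then aggregates it per group
max_len_df = df.groupby("neuron_id").agg(F.max(F.size("V")).alias("max_len"))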
I found pyspark.sql.functions.collect_set(col) which does the job I wanted.

Parse and concatenate two columns

I am trying to parse and concatenate two columns at the same time using the following expression:
val part : RDD[(String)] = sc.textFile("hdfs://xxx:8020/user/sample_head.csv")
  .map{ line =>
    val row = line split ','
    (row(1), row(2)).toString
  }
which returns something like:
Array((AAA,111), (BBB,222),(CCC,333))
But how could I directly get:
Array(AAA, 111 , BBB, 222, CCC, 333)
Your toString() on a tuple really doesn't make much sense to me. Can you explain why you want to create strings from tuples and then split them again later?
If you are willing to map each row into a list of elements instead of a stringified tuple of elements, you could rewrite
(row(1), row(2)).toString
to
List(row(1), row(2))
and simply flatten the resulting list:
val list = List("0,aaa,111", "1,bbb,222", "2,ccc,333")
val tuples = list.map{ line =>
val row = line split ','
List(row(1), row(2))}
val flattenedTuples = tuples.flatten
println(flattenedTuples) // prints List(aaa, 111, bbb, 222, ccc, 333)
Note that what you are trying to achieve involves flattening, and can be done with flatMap but not with map alone. You need to either flatMap directly, or do map followed by flatten as I showed you (Spark RDDs do support flatMap). Also, as you can see, I used a List as a more idiomatic Scala data structure, but it's easily convertible to Array and vice versa.