How to make a new column by pairing elements of another column? - pyspark

I have a big data dataframe and I want to make pairs from the elements of another column.
col
['summer','book','hot']
['g','o','p']
Output (the pairs made from each of the above rows):
new_col
['summer','book'],['summer','hot'],['hot','book']
['g','o'],['g','p'],['p','o']
Note that tuples will also work instead of lists, like ('summer','book').
I know in pandas I can do this:
df['col'].apply(lambda x: list(itertools.combinations(x, 2)))
but I am not sure how to do it in pyspark.

You can use a UDF to do the same as you would in plain Python, then cast the output to an array of arrays of strings:
import itertools
from pyspark.sql import functions as F
combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)), "array<array<string>>"
)
df = spark.createDataFrame([(['hot', 'summer', 'book'],),
                            (['g', 'o', 'p'],),
                            ], ['col1'])
df1 = df.withColumn("new_col", combinations_udf(F.col("col1")))
display(df1)
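If you would rather avoid a Python UDF, Spark 2.4+ can build the pairs natively with higher-order functions. A minimal, untested sketch, assuming the column is named col1 as above:

from pyspark.sql import functions as F

# For each element x at index i, pair it with every later element, then flatten
# the nested result into a single array of pairs.
pairs_expr = F.expr("""
    flatten(
      transform(col1, (x, i) ->
        transform(slice(col1, i + 2, size(col1)), y -> array(x, y))))
""")
df2 = df.withColumn("new_col", pairs_expr)
df2.show(truncate=False)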

Related

Convert each row of pyspark DataFrame column to a Json string

How to create a column with a JSON structure based on other columns of a pyspark dataframe?
For example, I want to achieve the below in a pyspark dataframe. I am able to do this on a pandas dataframe as below, but how do I do the same on a pyspark dataframe?
df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state':['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns = ['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis = 1)
Expected output: a new_col column holding, for each row, the JSON string of its Address and zip values.
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
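For completeness, a small sketch that builds the question's sample data as a Spark DataFrame (assuming an active spark session) and applies the snippet above:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('abc', 34567, 'VA'), ('dvf', 12345, 'TN'), ('bgh', 78905, 'MA')],
    ['Address', 'zip', 'state'])

lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
# new_col holds JSON strings such as {"Address":"abc","zip":34567}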

pyspark create an array with literal values and then explode

Let's say I have a dataframe like the one below:
df = spark.createDataFrame([(100, 'AB', 304), (200, 'BC', 305), (300, 'CD', 306)],['number', 'letter', 'id'])
df.show()
I want to create an array column with these values, ["source1","source2","source3"], which I later want to explode:
df_arr=df.withColumn("source",array(lit("source1"),lit("source2"),lit("source3")))
This did not work. I created a numpy array to use for the explode, which did not work either. How can I explode the dataframe?
This worked, as per werner's suggestion:
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
df2 = df.withColumn("source", F.array(lit("source1"), lit("source2"), lit("source3")))
df2.withColumn("source", F.explode("source")).show()
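The two steps can also be collapsed into a single withColumn call; a minimal sketch of the same approach:

from pyspark.sql import functions as F

# Build the literal array and explode it in one step.
df_exploded = df.withColumn(
    "source",
    F.explode(F.array(F.lit("source1"), F.lit("source2"), F.lit("source3"))))
df_exploded.show()
# Each original row appears three times, once per source value.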

How to cast String column to List

My dataframe looks like this:
df.schema results in:
StructType(
StructField(a,StringType,true),
StructField(b,StringType,true),
StructField(c,IntegerType,true),
StructField(d,StringType,true)
)
I want to cast column b to a List of Ints and column d to List of Strings. How do I do this?
Strip [] and split on ,:
import org.apache.spark.sql.functions._
val p = "^\\[(.*)\\]$"
df
.withColumn("b", split(regexp_extract(col("b"), p, 1), "\\s*,\\s*").cast("array<int>"))
.withColumn("d", split(regexp_extract(col("d"), p, 1), "\\s*,\\s*"))

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
What is the best way to do this in Scala?
You can use collect_set to find the distinct values of the corresponding column, after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
withColumn("feat2", explode(col("feat2"))).
agg(collect_set("feat1").alias("distinct_feat1"),
collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution, for Spark 2.4+:
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
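A hedged, untested sketch of guarding against that null case in PySpark, assuming the question's feat1/feat2 column names: replace a null array with an empty one before concatenating.

from pyspark.sql import functions as F

df2 = df.withColumn("distinct", F.expr("""
    array_distinct(concat(
      coalesce(feat1, cast(array() as array<string>)),
      coalesce(feat2, cast(array() as array<string>))))
"""))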
The method provided by Psidom works great; here is a function that does the same, given a DataFrame and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)
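On Spark 2.4+ the same per-column distinct sets can also be obtained without explode, by collecting the arrays and flattening them. A rough PySpark sketch, assuming the question's column names:

from pyspark.sql import functions as F

distinct_df = df.agg(
    F.array_distinct(F.flatten(F.collect_list("feat1"))).alias("distinct_feat1"),
    F.array_distinct(F.flatten(F.collect_list("feat2"))).alias("distinct_feat2"))
distinct_df.show(truncate=False)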

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use these kinds of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply an aggregation function, sum, on your column:
df.groupBy().sum('steps').show()
Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
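Equivalently, the total can be taken without any grouping, using agg; a minimal PySpark sketch, assuming the column is named steps:

from pyspark.sql import functions as F

# agg over the whole DataFrame returns a single row; take the first (only) value.
total_steps = df.agg(F.sum("steps")).first()[0]
print(total_steps)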
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives mean, count and stddev stats for a column. It returns stats for all columns if you call describe() with no arguments.
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF and the Long encoder below

val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum"))
  .collect()(0)
println("steps sum = " + sum) // prints 28