PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two methods from pyspark.sql.functions, as the documentation on the PySpark official website is not very informative. For example, the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one and not the other?

The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value
Say we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
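A quick check of that equivalence, reusing the same df; each of the three calls below prints the same table as above:
>>> df.withColumn("new", df.A).show()
>>> df.withColumn("new", df["A"]).show()
>>> df.withColumn("new", F.col("A")).show()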
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, on the other hand, creates a constant column with the given string as the value in every row.
Both of them return a Column object, but the content and meaning are different.
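The distinction also shows up when a column is combined with a plain Python value, for example in a comparison; lit is what wraps that value into a Column. A small sketch with the same df, where is_a is just an illustrative column name:
>>> df.withColumn("is_a", F.col("A") == F.lit("a")).show()
The new column is_a is true for the row where A is 'a' and false for the others.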

To explain in a very succinct manner, col is typically used to refer to an existing column in a DataFrame, as opposed to lit, which is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val holding the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))
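The two are often combined in one expression. A small sketch, still assuming the hypothetical df with integer columns col_a and col_b (note that both snippets above also need these imports):
from pyspark.sql.functions import col, lit

# col() refers to the existing columns, lit(10) wraps the constant 10 in a Column
df = df.withColumn('total_plus_10', col('col_a') + col('col_b') + lit(10))
A bare + 10 would also work here, since Spark wraps a plain Python value automatically when it appears on one side of a Column expression, but lit makes the intent explicit.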

Related

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to PySpark and trying to transform data.
Given this dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way, but one approach is to split the column on spaces to create an array of entries, and then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create two columns, one with the key and one with the value. Finally, assign a row number within each partition, group by that row number, and pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))

from pyspark.sql import Window

w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w))
            .groupBy("Rnum")
            .pivot("cols")
            .agg(F.first("vals"))
            .orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots, and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce

test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b', 1),
                                   ('D=id1d A=id3a B=id3b C=id2c', 2),
                                   ('A=id4a C=id3c', 3)], schema=['col1', 'id'])
tst_spl = test.withColumn("item", F.split('col1', " "))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = (tst_xpl.withColumn("key", F.split('col', '=')[0])
                  .withColumn("value", F.split('col', '=')[1])
                  .drop('col'))
#%%
tst_pivot = (tst_map.groupby(F.lit(1)).pivot('key')
                    .agg(F.collect_list('value')).drop('1'))
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col', coln)
           for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1, df2: df1.join(df2, on='pos', how='full'), tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+

show all the matched string in pyspark dataframe

I want to show all of the rows whose string matches the filter.
codes:
# Since most of the stackoverflow questionaire-s and also answerer-s are all super SMART and leave out all the necessary import libraries and required codes before using pyspark so that the readers can crack their minds in researching more instead of getting direct answer, I share the codes from beginning as below in order for future reviewers.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np

# Initiate the session
spark = SparkSession\
    .builder\
    .appName('Operations')\
    .getOrCreate()

# sc = SparkContext()
sc = SparkContext.getOrCreate()

# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
    (1, 2, '3'),
    (2, 2, 'yes')],
    ['a', 'b', 'c']
)

# Create Dataframe 2
sdataframe_temp2 = spark.createDataFrame([
    (4, 6, 'yes'),
    (5, 7, 'yes')],
    ['a', 'b', 'c']
)

# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

# Filter out the columns based on respective rules
# I wish to stick with dataframe methods if possible.
sdataframe_temp\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
Output:
+---+---+
| a| b|
+---+---+
| 2| 2|
+---+---+
Expected output:
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
Can anyone please give some suggestions or improvement?
Here's a way using unionByName:
df = (sdataframe_temp
      .unionByName(sdataframe_temp2)
      .where("c == 'yes'")
      .drop('c'))
df.show()
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
You should change the last part of your code: to use the col function, you need to import it from pyspark.sql.functions.
from pyspark.sql.functions import *
# I wish to stick with dataframe methods if possible.
sdataframe_union_1_2\
    .filter(col('c') == 'yes')\
    .select(['a', 'b'])\
    .show()
or
# I wish to stick with dataframe methods if possible.
sdataframe_union_1_2\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
You have to select data from sdataframe_union_1_2, but you are selecting from sdataframe_temp; that's why you are getting only one record.

How to copy the "first" row of a spark data frame to another data frame? Why does my minimal example fail?

Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
I do not understand what goes wrong in the following code.
Hence I am looking forward for a solution and an explanation what fails in my minimal example.
A minimal example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
  (1, "a"),
  (12, "b"),
  (234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
|  A|  B|
+---+---+
|  1|  a|
| 12|  b|
|234|  b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
As row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
the expected result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
But I get :
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
| 2| b|
+---+---+
Question:
What confuses me is the following: using val row = sdf.limit(1), I thought I had created a permanent/unchangeable/well-defined object, such that when I print it once and then add it to something, I get the same result.
Remark (thanks a lot to Daniel for his remarks):
I know that in the distributed world of Spark there is no well-defined notion of a "first row". I put it there for simplicity, and I hope that people struggling with something similar will "accidentally" use the term "first".
What I try to achieve is the following: (in a simplified example)
I have a data frame with two columns, A and B. Column A is partially ordered and column B is totally ordered.
I want to filter the data w.r.t. the columns. So the idea is some kind of divide and conquer: split the data frame into pieces on which both columns are totally ordered, and then filter as usual (and do the obvious iterations).
To achieve this I need to pick a well-defined row and split the data w.r.t. row.A. But as the minimal example shows, my commands do not produce a well-defined object.
Thanks a lot
Spark is distributed, so the notion of 'first' is not something we can rely on. Depending on partitioning, we can get a different result when calling limit or first.
To have consistent results, your data has to have an underlying order which we can use - which makes a lot of sense, since unless there is a logical ordering to your data, we can't really say what it means to take the first row.
Assuming you want to take the first row with respect to column A, you can just run orderBy("A").first() (*). Although if column A has more than one row with the same smallest value, there is no guarantee which row you will get.
(* I assume the Scala API has the same naming as Python, so please correct me if they are named differently.)
@Christian, you can achieve this result by using the take function.
take(num): take the first num elements of the RDD. It works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Here is the code snippet:
scala> import org.apache.spark.sql.types._
scala> val sdf = Seq(
         (1, "a"),
         (12, "b"),
         (234, "b")
       ).toDF("A", "B")
scala> import org.apache.spark.sql._
scala> var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
scala> var first1 = sdf.rdd.take(1)
scala> val first_row = spark.createDataFrame(sc.parallelize(first1), sdf.schema)
scala> sdfEmpty.union(first_row).show
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
For more about the take() and first() functions, just read the Spark documentation. Let me know if you have any query related to this.
I am posting this answer as it contains the solution suggested by Daniel. Once I am through the literature provided by mahesh gupta, or after some more testing, I'll update this answer and give remarks on the runtimes of the different approaches in "real life".
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
In the distributed world of Spark there is no well-defined notion of "first", but something similar can be achieved, e.g. via orderBy.
A minimal working example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
  (1, "a"),
  (12, "b"),
  (234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
|  A|  B|
+---+---+
|  1|  a|
| 12|  b|
|234|  b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf: collect it to the driver and rebuild a one-row data frame from it
val firstRow = sdf.limit(1).collect()
val row = spark.createDataFrame(sc.parallelize(firstRow), sdf.schema)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
The row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
and the result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
Remark (explanation given by Daniel, see the comments above): .limit(n) is a transformation - it does not get evaluated until an action like show or collect runs. Hence, depending on the context, it can return a different value. To use the result of .limit consistently, one can .collect it to the driver and use it as a local variable.

How to convert numerical values to a categorical variable using pyspark

I have a PySpark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100.
1-10 - group1 <== rows with values 1 to 10 should get group1 as the value
11-20 - group2
.
.
.
91-100 - group10
How can I achieve this using a PySpark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number. For example, floor(15.5) will be 15. We need to find the integral part of Var/10 and add 1 to it, because the indexing starts from 1, as opposed to 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to put it inside lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
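One boundary case to be aware of: with floor(Var/10) + 1, exact multiples of 10 land in the next group (10 becomes group2, and 100 would become group11), whereas the question asks for 1-10 -> group1 through 91-100 -> group10. A possible variant using ceil, sketched against the same df (the output for 54, 7, 72 and 99 is unchanged):
from pyspark.sql.functions import ceil, col, concat, lit

# ceil(Var/10) maps 1-10 -> 1, 11-20 -> 2, ..., 91-100 -> 10
df = df.withColumn('Var', concat(lit('group'), ceil(col('Var') / 10)))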

How to compose column name using another column's value for withColumn in Scala Spark

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C, whose value came from either column A_1 or B_2. The name of the source column comes from concatenating the values of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting a udf because your requirement is to process the dataframe row by row, in contrast to the built-in functions, which work column by column.
But before that you will need an array of the column names:
val columns = df.columns
Then write a udf function as:
import scala.collection.mutable
import org.apache.spark.sql.functions._

def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))
where
A is the first column value
B is the second column value
array is the array of all the column values
Now just call the udf function using the withColumn API:
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define a map which translates the column name to the column value:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
  df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
  val a = row.getAs[String]("A")
  val b = row.getAs[String]("B")
  val key = s"${a}_${b}"
  Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If the data comes from CSV, you might consider skipping the default CSV reader and using custom logic to push the column selection directly into the parsing process. In pseudocode:
spark.read.text(...).map { line => {
  val a = ??? // parse A
  val b = ??? // parse B
  val c = ??? // find c, based on a and b
  (a, b, c)
}}