Padding in a PySpark DataFrame - pyspark

I have a PySpark DataFrame (Original DataFrame) with the data below (all columns have string datatype):
id Value
1 103
2 1504
3 1
I need to create a new, modified DataFrame with padding applied to the Value column so that the column is always 4 characters long. If a value is shorter than 4 characters, it should be left-padded with 0's, as shown below:
id Value
1 0103
2 1504
3 0001
Can someone help me out? How can I achieve this with a PySpark DataFrame? Any help will be appreciated.

You can use lpad from the functions module:
from pyspark.sql.functions import lpad
>>> df.select('id', lpad(df['value'], 4, '0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| 0103|
| 2| 1504|
| 3| 0001|
+---+-----+

Using the PySpark lpad function in conjunction with withColumn:
import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0'))
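Note that lpad also truncates values longer than the target length down to that length. If the column were numeric rather than string, an alternative sketch would be printf-style formatting with format_string (this is an assumption about the data, not part of the answers above):
import pyspark.sql.functions as F

# Zero-pad to 4 digits via printf-style formatting; the cast is only needed
# because the source column has string datatype in this example.
dfNew = dfOrigin.withColumn('Value', F.format_string('%04d', F.col('Value').cast('int')))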

Related

Add new column with values as one two three etc [duplicate]

How can I write PySpark code to add a column Project_id whose values are One, Two, Three, and so on?
eid ename
1 abc
2 def
3 ghi
4 jkl
Expected Result:
eid ename newCol
1 abc one
2 def two
3 ghi three
4 jkl four
As Spark has no native function to convert a number to words, we have to use an external Python library to achieve this. I have used num2words (details here).
First install it using
pip install num2words
Then it can be used as below:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
from num2words import num2words

data = [(1, "abc"), (2, "def"), (3, "efg"), (4, "jkl")]
tdf = spark.createDataFrame(data, ["eid", "ename"])

# register the UDF; keeping the returned callable lets us use it in the DataFrame API
num2name = spark.udf.register("num2name", lambda s: num2words(s), StringType())

tdf.withColumn("new_col", num2name(F.col("eid"))).show()
#output
+---+-----+-------+
|eid|ename|new_col|
+---+-----+-------+
| 1| abc| one|
| 2| def| two|
| 3| efg| three|
| 4| jkl| four|
+---+-----+-------+
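If you only need the function from the DataFrame API and not from SQL, a minimal alternative sketch (reusing the same tdf and imports as above) is to wrap num2words with F.udf instead of registering it:
# Plain DataFrame-API UDF; gives the same result as the registered version above.
num2name_udf = F.udf(lambda s: num2words(s), StringType())
tdf.withColumn("new_col", num2name_udf(F.col("eid"))).show()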

PySpark: Rename column based on column position

How do I rename the 3rd column of a DataFrame in PySpark? I want to refer to the column by its index rather than its actual name.
Here is my attempt:
df
Col1 Col2 jfdklajfklfj
A B 2
df.withColumnRenamed([3], 'Row_Count')
Since Python indexing starts at 0, you can index the df.columns list after subtracting 1:
index_of_col = 3
df.withColumnRenamed(df.columns[index_of_col-1],'Row_Count').show()
+----+----+---------+
|Col1|Col2|Row_Count|
+----+----+---------+
| A| B| 2|
+----+----+---------+
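If several columns need renaming by position, one sketch (using only the df shown above; the new name is just for illustration) is to rebuild the full list of names and pass it to toDF:
# Copy the current names, overwrite by 0-based position, and rebuild the DataFrame.
new_names = list(df.columns)
new_names[2] = 'Row_Count'  # 3rd column
df.toDF(*new_names).show()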

How to convert numerical values to a categorical variable using pyspark

I have a PySpark DataFrame with a range of numerical variables.
For example, my DataFrame has a column with values from 1 to 100, and I want to bucket it like this:
1-10 - group1 <== rows with a value from 1 to 10 should get group1
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a PySpark DataFrame?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number. For example, floor(15.5) is 15. We need to find the integral part of Var/10 and add 1 to it, because the group numbering starts from 1 as opposed to 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to put it inside lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
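One caveat: with this formula an exact multiple of 10 lands in the next group (floor(10/10) + 1 = 2, so 10 becomes group2 rather than group1). If the ranges are meant to include their upper bound, a sketch using ceil instead (starting again from the original numeric df) produces the same result for this sample but keeps the multiples of 10 in the lower group:
from pyspark.sql.functions import ceil, col, concat, lit

# ceil(Var/10) maps 1-10 -> 1, 11-20 -> 2, ..., 91-100 -> 10
df = df.withColumn('Var', concat(lit('group'), ceil(col('Var') / 10)))
df.show()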

Read fixed length file with implicit decimal point?

Suppose I have a data file like this:
foo12345
bar45612
I want to parse this into:
+----+-------+
| id| amt|
+----+-------+
| foo| 123.45|
| bar| 456.12|
+----+-------+
Which is to say, I need to select df.value.substr(4,5).alias('amt'), but I want the value to be interpreted as a five digit number where the last two digits are after the decimal point.
Surely there's a better way to do this than "divide by 100"?
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DoubleType
#sample data
df = sc.parallelize([['foo12345'], ['bar45612']]).toDF(["value"])

df = df.withColumn('id', substring('value', 1, 3)) \
       .withColumn('amt', concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2)).cast(DoubleType()))
df.show()
Output is:
+--------+---+------+
| value| id| amt|
+--------+---+------+
|foo12345|foo|123.45|
|bar45612|bar|456.12|
+--------+---+------+
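For completeness, a minimal sketch of how the raw fixed-width file could be loaded before the parsing step above (the path fixed_width.txt is hypothetical):
# spark.read.text returns a DataFrame with a single string column named `value`,
# which matches the `value` column used in the substring parsing above.
df = spark.read.text("fixed_width.txt")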

PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two methods from pyspark.sql.functions, as the documentation on the official PySpark website is not very informative. For example, the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one and not the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value.
Say if we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, on the other hand, creates a constant column with the given string as the value in every row.
Both of them return a Column object but the content and meaning are different.
To explain in a very succinct manner: col is typically used to refer to an existing column in a DataFrame, as opposed to lit, which is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two IntegerType columns, col_a and col_b.
If I wanted a column total which is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val holding the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))
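A minimal runnable sketch putting both together (the DataFrame below is made up purely for illustration):
import pyspark.sql.functions as F

# Toy DataFrame with two integer columns.
df = spark.createDataFrame([(1, 2), (3, 4)], ['col_a', 'col_b'])

# col refers to existing columns; lit wraps a constant value.
df = df.withColumn('total', F.col('col_a') + F.col('col_b')) \
       .withColumn('fixed_val', F.lit('Hello'))
df.show()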