Suppose I have a data file like this:
foo12345
bar45612
I want to parse this into:
+----+-------+
| id| amt|
+----+-------+
| foo| 123.45|
| bar| 456.12|
+----+-------+
Which is to say, I need to select df.value.substr(4,5).alias('amt'), but I want the value to be interpreted as a five-digit number whose last two digits come after the decimal point.
Surely there's a better way to do this than "divide by 100"?
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DoubleType
# sample data (sc is the active SparkContext)
df = sc.parallelize([
    ['foo12345'],
    ['bar45612']]).toDF(["value"])
# id: characters 1-3; amt: characters 4-6, a decimal point, then characters 7-8
df = df.withColumn('id', substring('value', 1, 3)).\
    withColumn('amt', concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2)).cast(DoubleType()))
df.show()
Output is:
+--------+---+------+
| value| id| amt|
+--------+---+------+
|foo12345|foo|123.45|
|bar45612|bar|456.12|
+--------+---+------+
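The same per-record logic can be checked without a Spark session. The following is only a plain-Python sketch, assuming the fixed layout from the question (three characters of id followed by five digits); the helper name parse_record is hypothetical. It uses Decimal.scaleb to shift the decimal point, which avoids both string concatenation and dividing by 100, and keeps the amount exact rather than a binary float.

```python
from decimal import Decimal

def parse_record(line):
    """Split a fixed-width record into (id, amount).

    Assumed layout (from the question): characters 1-3 are the id,
    characters 4-8 are a five-digit amount with two implied decimals.
    """
    record_id = line[:3]
    # scaleb(-2) shifts the decimal point two places to the left
    amount = Decimal(line[3:8]).scaleb(-2)
    return record_id, amount

for raw in ["foo12345", "bar45612"]:
    print(parse_record(raw))
```

In Spark itself, casting to a decimal type instead of DoubleType may be worth considering if the amounts must stay exact.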
I'm new to PySpark and am trying to transform data.
Given dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way, but one approach is to split the column on whitespace to get an array of entries, then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode the result and create two columns, one with the key and one with the value. Finally, assign a row number within each key, group by that row number, and pivot. Note that the window below orders by the same column it partitions on, so all rows in a partition are tied and the order of values within each key is not guaranteed.
import pyspark.sql.functions as F
from pyspark.sql import Window
# split each row on whitespace, split every entry on '=', then explode to one row per entry
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
       .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
       .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
# number the occurrences of each key, then pivot the keys into columns
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w)).groupBy("Rnum")
         .pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is what df1 looks like after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
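To make the reshaping logic easier to follow, here is a plain-Python sketch of the same steps (split on whitespace, split on '=', number the occurrences of each key, pivot). It is an illustration only, not the Spark implementation; the function name pivot_key_values is hypothetical. Unlike the Spark window, it processes the input strictly in order, so the n-th occurrence of a key always lands in row n.

```python
from collections import defaultdict

def pivot_key_values(lines):
    """Pivot 'KEY=value' tokens so the n-th occurrence of each key lands in row n."""
    per_key = defaultdict(list)
    for line in lines:
        for token in line.split():            # split on runs of whitespace
            key, _, value = token.partition("=")
            per_key[key].append(value)        # list position acts as the row number
    keys = sorted(per_key)
    n_rows = max((len(values) for values in per_key.values()), default=0)
    # pad shorter columns with None, like the nulls in the pivoted DataFrame
    return [
        {k: (per_key[k][i] if i < len(per_key[k]) else None) for k in keys}
        for i in range(n_rows)
    ]

rows = pivot_key_values([
    "A=id1a A=id2a B=id1b C=id1c B=id2b",
    "D=id1d A=id3a B=id3b C=id2c",
    "A=id4a C=id3c",
])
for row in rows:
    print(row)
```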
Maybe I don't see the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots, and joins will be needed. Try this:
from functools import reduce  # reduce is not a builtin in Python 3
import pyspark.sql.functions as F
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
# split each row into 'key=value' items and explode to one item per row
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
# collect all values of each key into one list per key
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
# explode each list with its position, then join the columns back together on that position
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+
I'm trying in vain to use a PySpark substring function inside a UDF. Below is my code snippet:
from pyspark.sql.functions import substring
def my_udf(my_str):
    try:
        my_sub_str = substring(my_str,1, 2)
    except Exception:
        pass
    else:
        return (my_sub_str)
apply_my_udf = udf(my_udf)
df = input_data.withColumn("sub_str", apply_my_udf(input_data.col0))
The sample data is:
ABC1234
DEF2345
GHI3456
But when I print the df, I don't get any value in the new column "sub_str", as shown below:
[Row(col0='ABC1234', sub_str=None), Row(col0='DEF2345', sub_str=None), Row(col0='GHI3456', sub_str=None)]
Can anyone please let me know what I'm doing wrong?
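You don't need a UDF to use substring. It builds a Column expression and cannot operate on the plain Python string that a UDF receives, so the call fails inside your UDF, the except clause swallows the error, and the function returns None. Here's a cleaner and faster way: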
>>> from pyspark.sql import functions as f
>>> df.show()
+-------+
| data|
+-------+
|ABC1234|
|DEF2345|
|GHI3456|
+-------+
>>> df.withColumn("sub_str", f.substring("data", 1, 2)).show()
+-------+-------+
| data|sub_str|
+-------+-------+
|ABC1234| AB|
|DEF2345| DE|
|GHI3456| GH|
+-------+-------+
If you need to use udf for that, you could also try something like:
input_data = spark.createDataFrame([
(1,"ABC1234"),
(2,"DEF2345"),
(3,"GHI3456")
], ("id","col0"))
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udf1 = udf(lambda x:x[0:2],StringType())
input_data.withColumn('sub_str',udf1('col0')).show()
+---+-------+-------+
| id| col0|sub_str|
+---+-------+-------+
| 1|ABC1234| AB|
| 2|DEF2345| DE|
| 3|GHI3456| GH|
+---+-------+-------+
However, as Mohamed Ali JAMAOUI wrote, you can easily do this without a UDF.
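If a UDF is used anyway, note that a body like lambda x: x[0:2] would raise a TypeError for a null col0 value, because None cannot be sliced. A null-safe body might look like the sketch below; the function name first_two is hypothetical, and it is written as an ordinary Python function so it can be tested without Spark.

```python
def first_two(value):
    """Return the first two characters of value, or None for a null input."""
    if value is None:
        # Spark passes SQL NULL to a Python UDF as None
        return None
    return value[:2]

# Assumed wrapping step, not executed here:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   udf1 = udf(first_two, StringType())

for sample in ["ABC1234", "DEF2345", None]:
    print(first_two(sample))
```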
I have a PySpark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100.
1-10 - group1 <== rows whose column value is 1 to 10 should get group1 as the value
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a PySpark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. Since the groups are 1-10, 11-20, and so on, we subtract 1 from Var before dividing by 10, so that the upper bound of each range (10, 20, ..., 100) stays in its own group. We then take the integral part and add 1, because the group numbering starts from 1 rather than 0. Finally, we need to prepend the word group to the value. Concatenation can be achieved with the concat() function, but because the prepended word group is not a column, we need to wrap it in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
# (Var - 1) / 10 keeps 10, 20, ..., 100 in groups 1, 2, ..., 10
df = df.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
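The boundaries are the easy place to get this wrong: with floor(Var/10) + 1, the value 10 would land in group2 and 100 in group11, which does not match the 1-10, 11-20, ..., 91-100 ranges in the question. A quick plain-Python check of the arithmetic, assuming integer values from 1 to 100 (the helper name bucket_label is hypothetical):

```python
def bucket_label(value):
    """Map 1-10 to group1, 11-20 to group2, ..., 91-100 to group10."""
    # subtracting 1 first keeps each upper bound (10, 20, ...) in its own group
    return "group" + str((value - 1) // 10 + 1)

for v in [1, 10, 11, 54, 99, 100]:
    print(v, bucket_label(v))
```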
I have a PySpark dataframe with a column that contains Python lists:
id value
1 [1,2,3]
2 [1,2]
I want to remove all rows where the length of the list in the value column is less than 3.
So I tried:
df.filter(len(df.value) >= 3)
and indeed it does not work.
How can I filter the dataframe by the length of the inside data?
Use size(). It returns the length of the array or map stored in the column.
from pyspark.sql.functions import size
myValues = [(1,[1,2,3]),(2,[1,2])]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.show()
+----+---------+
| id| value|
+----+---------+
| 1| [1,2,3]|
| 2| [1,2]|
+----+---------+
df.filter(size(df.value) >= 3).show()
+----+---------+
| id| value|
+----+---------+
| 1| [1,2,3]|
+----+---------+
I find it hard to understand the difference between the col and lit functions from pyspark.sql.functions, as the documentation on the official PySpark website is not very informative. For example, take the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one and not the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value
Say we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, by contrast, creates a constant column with the given string as the value in every row.
Both of them return a Column object but the content and meaning are different.
Put succinctly, col is typically used to refer to an existing column in a DataFrame, whereas lit is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that was the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val with the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))