Convert String to Pyspark Dataframe - pyspark

I have one string in a list, something like:
ListofString = ['Column1,Column2,Column3,\nCol1Value1,Col2Value1,Col3Value1,\nCol1Value2,Col2Value2,Col3Value2']
How do I convert this string to a PySpark DataFrame like below, with '\n' being a new row?
Column1 Column2 Column3
-----------------------------------------
Col1Value1 Col2Value1 Col3Value1
Col1Value2 Col2Value2 Col3Value2

You simply need to convert the list of strings into the correct format, like this:
# convert the list of strings into the proper format
>>> l = ' '.join(ListofString)
>>> l = l.replace(',',' ')
>>> l = [x.strip().split(' ') for x in l.split('\n')]
>>> print(l)
[['Column1', 'Column2', 'Column3'], ['Col1Value1', 'Col2Value1', 'Col3Value1'], ['Col1Value2', 'Col2Value2', 'Col3Value2']]
>>> df = spark.createDataFrame(l[1:],l[0])
>>> df.show()
+----------+----------+----------+
| Column1| Column2| Column3|
+----------+----------+----------+
|Col1Value1|Col2Value1|Col3Value1|
|Col1Value2|Col2Value2|Col3Value2|
+----------+----------+----------+
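Alternatively, since the fields are already comma separated, you can skip the replace step and split on commas directly. A minimal sketch, assuming the same ListofString and an active spark session:
# split the single string into rows on '\n', then into fields on ','
rows = [line.strip(',').split(',') for line in ListofString[0].split('\n')]
df = spark.createDataFrame(rows[1:], rows[0])
df.show()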

Related

Convert date from "yyyy/mm/dd" format to "M/d/yyyy" format in pyspark dataframe

I am reading a table into a dataframe which has a column "day_dt" in the date format "2022/01/08". I want the format to be "1/8/2022" (M/d/yyyy). Is it possible in pyspark? I have tried using date_format() but it results in null.
Did you cast the day_dt column to timestamp before using date_format? The code below adds a null-valued column, as you described in your question, because the column is StringType. You can see this using df.printSchema().
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(df.value, "MM/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+-------+
| value|newDate|
+----------+-------+
|2022/01/08| null|
+----------+-------+
After casting the string type to timestamp, the date column is formatted properly:
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(unix_timestamp(df.value, "yyyy/MM/dd").cast("timestamp"), "MM/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+----------+
| value| newDate|
+----------+----------+
|2022/01/08|01/08/2022|
+----------+----------+
Hope it helps.
If you mean you have the date as a string in format "yyyy/mm/dd" and you want to convert it to a string in format "M/d/yyyy", then:
First, parse the string to Date type using to_date().
Then, convert the Date type to a string using date_format().
from pyspark.sql import functions as F

df = spark.createDataFrame(data=[["2022/01/01",],["2022/12/31",]], schema=["date_str_in"])
df = df.withColumn("date_dt", F.to_date("date_str_in", format="yyyy/MM/dd"))
df = df.withColumn("date_str_out", F.date_format("date_dt", format="M/d/yyyy"))
df.show()
+-----------+----------+------------+
|date_str_in| date_dt|date_str_out|
+-----------+----------+------------+
| 2022/01/01|2022-01-01| 1/1/2022|
| 2022/12/31|2022-12-31| 12/31/2022|
+-----------+----------+------------+

Spark fails to convert String to TIMESTAMP

I have a hive table that contains a String column: this is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
When I try to import the table into a Spark-Scala DF, transforming the String to a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (a column with the String values). So Spark is importing the column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMP column.
What is the problem?
Trim your extra 0s. Then,
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
the result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
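For a PySpark equivalent of the same trim-and-parse idea, a minimal sketch (the column name date and the sample row are taken from the example above; a spark session is assumed):
from pyspark.sql import functions as F
df = spark.createDataFrame([("2019-05-07 00:03:53.837000000",)], ["date"])
# drop the six trailing zeros, then parse the remaining millisecond precision
df = df.withColumn("new", F.to_timestamp(F.expr("substring(date, 1, length(date) - 6)"), "yyyy-MM-dd HH:mm:ss.SSS"))
df.show(truncate=False)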
I think you should use the following format, yyyy-MM-dd HH:mm:ss.SSSSSSSSS, for this type of data: 2019-05-07 00:03:53.837000000.
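A minimal PySpark sketch of that suggestion, assuming Spark 3's datetime parser (which accepts up to nine contiguous 'S' characters for the fractional part; older versions may behave differently):
from pyspark.sql import functions as F
df = spark.createDataFrame([("2019-05-07 00:03:53.837000000",)], ["dt"])
# the nanosecond fraction is parsed and truncated to Spark's microsecond precision
df.withColumn("ts", F.to_timestamp("dt", "yyyy-MM-dd HH:mm:ss.SSSSSSSSS")).show(truncate=False)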

How to replace leading 0 with 91 using regex in pyspark dataframe

In Python I am doing this to replace a leading 0 in the column phone with 91.
But how do I do it in pyspark?
The con dataframe is:
id phone1
1 088976854667
2 089706790002
The output I want is:
1 9188976854667
2 9189706790002
# Replace leading Zeros in a phone number with 91
con.filter(regex='[_]').replace('^0','385',regex=True)
You are looking for the regexp_replace function. This function takes 3 parameters:
column name
pattern
replacement
from pyspark.sql import functions as F
columns = ['id', 'phone1']
vals = [(1, '088976854667'),(2, '089706790002' )]
df = spark.createDataFrame(vals, columns)
df = df.withColumn('phone1', F.regexp_replace('phone1',"^0", "91"))
df.show()
Output:
+---+-------------+
| id| phone1|
+---+-------------+
| 1|9188976854667|
| 2|9189706790002|
+---+-------------+

regular expression pyspark dataframe column

I have a pyspark dataframe and I want to split column A into A1 and A2, like this, using a regex, but it didn't work. My dataframe looks like this:
A                 | A1         | A2
20-13-2012-monday | 20-13-2012 | monday
20-14-2012-tues   | 20-14-2012 | tues
20-13-2012-wed    | 20-13-2012 | wed
My code looks like this
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([\d]+-[\d]+-[\d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
You can use the regex in a udf and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
...     x, y = re.split(r'^([\d]+-[\d]+-[\d]+)', a)[1:]
...     return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
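If you want to avoid a udf altogether, regexp_extract (already imported in the question) can do the same split. A minimal sketch, assuming the same df with column A as above:
from pyspark.sql.functions import regexp_extract
reg = r'^(\d+-\d+-\d+)-(.*)$'
# group 1 is the date part, group 2 is the day name
df = df.withColumn("A1", regexp_extract('A', reg, 1)).withColumn("A2", regexp_extract('A', reg, 2))
df.show(truncate=False)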

Pyspark Dataframe TypeError: expected string or buffer

I am creating a new column in an existing dataframe in Pyspark by searching one of the fields, 'script', and returning the match as the entry for the new column.
import re as re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def sw_fix(data_str):
    if re.compile(r'gaussian').search(data_str):
        cleaned_str = 'gaussian'
    elif re.compile(r'gromacs').search(data_str):
        cleaned_str = 'gromacs'
    else:
        cleaned_str = 'ns'
    return cleaned_str
sw_fix_udf = udf(sw_fix, StringType())
k=df.withColumn("software_new", sw_fix_udf(df.script))
The code runs fine and generates dataframe k with the new column with the correct match; however, I am unable to do any operation on the newly added column.
k.filter(k.software_new=='gaussian').show()
throws an error, TypeError: expected string or buffer.
I checked the datatype of the newly added column:
[f.dataType for f in k.schema.fields]
which shows StringType.
However, this one works, where sw_app is an existing column in the original dataframe:
k.filter(k.sw_app=='gaussian').select('sw_app','software_new').show(5)
+--------+------------+
| sw_app|software_new|
+--------+------------+
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
+--------+------------+
Any hints on why I can't process the software_new field?
It is working fine for me without any issues. See the demo below in the pyspark REPL.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> import re as re
>>> def sw_fix(data_str):
...     if re.compile(r'gaussian').search(data_str):
...         cleaned_str = 'gaussian'
...     elif re.compile(r'gromacs').search(data_str):
...         cleaned_str = 'gromacs'
...     else:
...         cleaned_str = 'ns'
...     return cleaned_str
...
>>>
>>> sw_fix_udf = udf(sw_fix, StringType())
>>> df = spark.createDataFrame(['gaussian text', 'gromacs text', 'someother text'], StringType())
>>>
>>> k=df.withColumn("software_new", sw_fix_udf(df.value))
>>> k.show()
+--------------+------------+
| value|software_new|
+--------------+------------+
| gaussian text| gaussian|
| gromacs text| gromacs|
|someother text| ns|
+--------------+------------+
>>> k.filter(k.software_new == 'ns').show()
+--------------+------------+
| value|software_new|
+--------------+------------+
|someother text| ns|
+--------------+------------+