I am trying to round off the "perc_of_count_total" column in PySpark, but I could not get it to work. Below is my script:
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
.count() \
.withColumnRenamed('count', 'cnt_per_group') \
.withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
.show(10)
Auto_data1.select(round(col('cnt_per_group'),2)).show(5)
Output:
+-----------+----+-------------+--------------------+
|       Make|Fuel|cnt_per_group|       perc_of_count|
+-----------+----+-------------+--------------------+
|          C|   I|        34748|0.027960585487965286|
|          P|   D|          489| 3.93482396213164E-4|
+-----------+----+-------------+--------------------+
Error message
An error was encountered:
'NoneType' object has no attribute 'select'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'select'
Remove the trailing .show(10) from the assignment: .show() only prints rows and returns None, so Auto_data1 ends up as None and the later .select() fails.
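A minimal corrected sketch (assuming tot already holds the total row count, as in your script):
from pyspark.sql import functions as F

# Keep the DataFrame by not ending the chain with .show()
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed('count', 'cnt_per_group') \
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100)

Auto_data1.show(10)

# Round the percentage column to 2 decimal places
Auto_data1.select(F.round(F.col('perc_of_count_total'), 2).alias('perc_of_count_total')).show(5)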
I have column names with special characters. I renamed the columns and am trying to save, but the save fails, saying the columns have special characters. I ran printSchema on the dataframe and I see the column names without any special characters. Here is the code I tried:
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace("(", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(")", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(".", ""))
df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(stg_location)
and I get the following error:
Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
One more thing I noticed: df_source.show() and display(df_source) both show the same error, yet printSchema shows the column names without any special characters.
Can someone help me find a solution for this?
Try it as below.
Input dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
|                           xyz|  1|
+------------------------------+---+
Method 1
Use regular expressions to strip the special characters from the column names, then pass the cleaned names to toDF():
import re
cols = [re.sub(r"\.|\)|\(", "", i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
Method 2
Using .withColumnRenamed()
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
Method 3
Using .withColumn to create a new column and drop the existing column
df = df.withColumn("Number_of_data_samples_input", lit(col("Number_of_data_samples_(input)"))).drop(col("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
|  1|                         xyz|
+---+----------------------------+
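For what it's worth, the loop in the question likely misbehaves because the second and third withColumnRenamed calls still reference the original name c, which no longer exists after the first rename, so they are silently skipped. A minimal sketch that applies all the replacements to the name before renaming once (same replacements as in your loop):
for c in df_source.columns:
    new_name = c.replace("(", "").replace(")", "").replace(".", "")
    df_source = df_source.withColumnRenamed(c, new_name)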
I want to extract the code starting from the 25th position to the end.
I tried:
df_1.withColumn("code", f.col('index_key').substr(25, f.length(df_1.index_key))).show()
But I got the below error message,
TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
Any suggestion would be greatly appreciated.
Using .substr:
Instead of a plain integer, wrap the start position in lit(<int>) (which makes it a Column) so that both arguments passed to .substr are of the same type.
Example:
df.show()
#+---------+
#|index_key|
#+---------+
#|   abcdef|
#+---------+
from pyspark.sql.functions import *
df.withColumn("code",col("index_key").substr(lit(1),length(col("index_key")))).\
show()
#+---------+------+
#|index_key|  code|
#+---------+------+
#|   abcdef|abcdef|
#+---------+------+
Another option is using expr with the substring function.
Example:
df.withColumn("code",expr('substring(index_key, 1,length(index_key))')).show()
#+---------+------+
#|index_key| code|
#+---------+------+
#|   abcdef|abcdef|
#+---------+------+
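If you only need everything from a fixed position to the end (the 25th character in your case), Spark SQL's substring also accepts just a start position with no length, so something like the sketch below should work too (assuming df_1 and index_key from your question):
from pyspark.sql.functions import expr

# substring(str, pos) with no length returns everything from pos to the end
df_1.withColumn("code", expr("substring(index_key, 25)")).show()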
I am trying to generate a date sequence between the start_date and end_date columns:
from pyspark.sql import functions as F
df1 = df.withColumn("start_dt", F.to_date(F.col("start_date"), "yyyy-mm-dd")) \
.withColumn("end_dt", F.to_date(F.col("end_date"), "yyyy-mm-dd"))
df1.select("start_dt", "end_dt").show()
print("type(start_dt)", type("start_dt"))
print("type(end_dt)", type("end_dt"))
df2 = df1.withColumn("lineoffdate", F.expr("""sequence(start_dt,end_dt,1)"""))
Below is the output:
+----------+----------+
|start_date|  end_date|
+----------+----------+
|2020-02-01|2020-03-21|
+----------+----------+
type(start_dt) <class 'str'>
type(end_dt) <class 'str'>
cannot resolve 'sequence(start_dt, end_dt, 1)' due to data type mismatch: sequence only supports integral, timestamp or date types; line 1 pos 0;
Even after converting start_dt and end_dt to date or timestamp, I see the type of the column as str and get the above error while generating the date sequence.
You are correct in saying it should work with date or timestamp (calendar) types. The only mistake you were making was passing the "step" in sequence as an integer, when it should be a calendar interval (like interval 1 day). As an aside, type("start_dt") prints str because it checks the type of the Python string literal, not of the column, so it is not evidence that the conversion failed.
df.withColumn("start_date",F.to_date("start_date")) \
.withColumn("end_date", F.to_date("end_date")) \
.withColumn(
"lineofdate",
F.expr("""sequence(start_date,end_date,interval 1 day)""") \
) \
.show()
# output:
# +----------+----------+--------------------+
# |start_date| end_date| lineofdate|
# +----------+----------+--------------------+
# |2020-02-01|2020-03-21|[2020-02-01, 2020...|
# +----------+----------+--------------------+
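If you then need one row per date rather than an array column, a small follow-on sketch (same column names as above) is to explode the generated sequence:
from pyspark.sql import functions as F

# one row per date in the generated sequence
df.withColumn("start_date", F.to_date("start_date")) \
  .withColumn("end_date", F.to_date("end_date")) \
  .withColumn("lineofdate", F.expr("sequence(start_date, end_date, interval 1 day)")) \
  .withColumn("date", F.explode("lineofdate")) \
  .select("start_date", "end_date", "date") \
  .show(3)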
Below is the PySpark code that I tried to run. I am not able to substitute the value into the filter. Please advise.
>>> coreWordFilter = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter
"crawlResult.url.like('%furniture%')"
>>> preFilter = crawlResult.filter(coreWordFilter)
20/02/11 09:19:54 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line 1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name 'crawlResult.url.like'(line 1, pos 0)\n\n== SQL ==\ncrawlResult.url.like('%furniture%')\n^^^\n"
>>> preFilter = crawlResult.filter(crawlResult.url.like('%furniture%'))
>>>
I need some help with how to add more crawlResult.url.like logic:
Code from today 2/12/2020:
>>> coreWordFilter = crawlResult.url.like('%{}%'.format(IncoreWords[0]))
>>> coreWordFilter
Column<url LIKE %furniture%>
>>> InmoreWords
['couch', 'couches']
>>> for a in InmoreWords:
... coreWordFilter=coreWordFilter+" | crawlResult.url.like('%"+a+"%')"
>>> coreWordFilter
Column<((((((url LIKE %furniture% + | crawlResult.url.like('%) + couch) + %')) + | crawlResult.url.like('%) + couches) + %'))>
preFilter = crawlResult.filter(coreWordFilter) does not work with the above coreWordFilter.
I was hoping I could do the below, but it did not work; I got an error:
>>> coreWordFilter2 = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter2
"crawlResult.url.like('%furniture%')"
>>> for a in InmoreWords:
... coreWordFilter2=coreWordFilter2+" | crawlResult.url.like('%"+a+"%')"
...
>>> coreWordFilter2
"crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')"
>>> preFilter = crawlResult.filter(coreWordFilter2)
20/02/12 08:55:26 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line
1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-
src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in
deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name
'crawlResult.url.like'(line 1, pos 0)\n\n== SQL
==\ncrawlResult.url.like('%furniture%') |
crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')\n^^^\n"
I think the correct syntax is:
preFilter = crawlResult.filter(crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%'))
Since you want a dynamic OR condition, I think filtering with a SQL string and its logical operators (AND, OR, NOT, etc.) is easier than combining Column-based logical operators (&, |, ~, etc.).
Dummy dataframe and lists:
crawlResult.show()
+---+--------------+
| id|           url|
+---+--------------+
|  1|test-furniture|
|  1|         table|
|  1|     test-test|
|  1|         couch|
+---+--------------+
# IncoreWords
# ['furniture', 'office-table', 'counch', 'blah']
# InmoreWords
# ['couch', 'couches']
Now, I am just following the sequence from your post for building the dynamic filter clause, but it should give you the broad idea.
coreWordFilter2 = "url like ('%"+IncoreWords[0]+"%')"
# coreWordFilter2
#"url like ('%furniture%')"
for a in InmoreWords:
    coreWordFilter2 = coreWordFilter2 + " or url like('%"+a+"%')"
# coreWordFilter2
# "url like ('%furniture%') or url like('%couch%') or url like('%couches%')"
crawlResult.filter(coreWordFilter2).show()
+---+--------------+
| id|           url|
+---+--------------+
|  1|test-furniture|
|  1|         couch|
+---+--------------+
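If you would rather stay with Column objects instead of a SQL string, a minimal sketch (assuming the same crawlResult dataframe and word lists) is to OR the like conditions together with functools.reduce:
from functools import reduce

# build one Column predicate: url LIKE %furniture% OR url LIKE %couch% OR ...
words = [IncoreWords[0]] + InmoreWords
cond = reduce(lambda acc, w: acc | crawlResult.url.like('%{}%'.format(w)),
              words[1:],
              crawlResult.url.like('%{}%'.format(words[0])))
crawlResult.filter(cond).show()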
I'm running the following code on Databricks:
dataToShow = jDataJoined.\
withColumn('id', monotonically_increasing_id()).\
filter(
(jDataJoined.containerNumber == 'SUDU8108536')).\
select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This gives me tabular data with the columns id, returnTemperature and supplyTemperature.
Now I would like to display a line graph with returnTemperature and supplyTemperature as categories.
As far as I understand, the display method in Databricks wants the category as the second argument, so basically what I should have is something like:
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the sql functions create_map and explode:
#creates a example df
from pyspark.sql import functions as F
l1 = [(1,25.0,27.0),(2,24.0,28.0)]
df = spark.createDataFrame(l1,['id','returnTemperature','supplyTemperature'])
#creates a map column which contains the values of the returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
    F.lit('returnTemperature'), df.returnTemperature,
    F.lit('supplyTemperature'), df.supplyTemperature
))
#The explode function creates a new row for each element of the map
df = df.select('id',F.explode(df.mapCol).alias('temperatureCategory','value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
|  1|  returnTemperature| 25.0|
|  1|  supplyTemperature| 27.0|
|  2|  returnTemperature| 24.0|
|  2|  supplyTemperature| 28.0|
+---+-------------------+-----+
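An alternative unpivot that skips the map column is Spark SQL's stack expression; a minimal sketch on the same example dataframe (before the create_map step), assuming the column names above:
# stack(2, ...) emits two (label, value) rows per input row
df_long = df.selectExpr(
    "id",
    "stack(2, 'returnTemperature', returnTemperature, "
    "'supplyTemperature', supplyTemperature) as (temperatureCategory, value)"
)
df_long.show()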