I am very new to PySpark and am getting the error below, even if I drop all date-related columns or select only one column. The date format stored in my data frame looks like "". Can anyone please suggest changes I could make in the dataframe to resolve this, or the date formats supported by the new parser?
It works if I set "spark.sql.legacy.timeParserPolicy" to "LEGACY".
[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Caused by: DateTimeParseException: Text '1/1/2023 3:57:22 AM' could not be parsed at index 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 15.0 failed 4 times, most recent failure: Lost task 4.3 in stage 15.0 (TID 355) (10.139.64.5 executor 0): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '1/1/2023 3:57:22 AM' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
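A minimal sketch of setting that option at runtime (assuming an existing SparkSession named spark; the SQL form appears commented out in the example below):
# assumption: spark is an existing SparkSession
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")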
Example:
#spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.functions import *
from pyspark.sql import functions as F
emp = [(1, "AAA", "dept1", 1000, "12/22/2022 3:11:44 AM"),
(2, "BBB", "dept1", 1100, "12/22/2022 3:11:44 AM"),
(3, "CCC", "dept1", 3000, "12/22/2022 3:11:44 AM"),
(4, "DDD", "dept1", 1500, "12/22/2022 3:11:44 AM"),
(5, "EEE", "dept2", 8000, "12/22/2022 3:11:44 AM"),
(6, "FFF", "dept2", 7200, "12/22/2022 3:11:44 AM"),
(7, "GGG", "dept3", 7100, "12/22/2022 3:11:44 AM"),
(8, "HHH", "dept3", 3700, "12/22/2022 3:11:44 PM"),
(9, "III", "dept3", 4500, "12/22/2022 3:11:44 PM"),
(10, "JJJ", "dept5", 3400,"12/22/2022 3:11:44 PM")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary",
"date"])
#empdf.printSchema()
df = empdf.withColumn("date", F.to_timestamp(col("date"),
"MM/dd/yyyy hh:mm:ss a"))
df.show(12,False)
Thanks a lot in advance.
Just wanted to share an update on this: after changing df = empdf.withColumn("date", F.to_timestamp(col("date"), "MM/dd/yyyy hh:mm:ss a")) to df = empdf.withColumn("date", F.to_timestamp(col("date"), "M/d/yyyy h:m:s a")), it is working for me.
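For reference, a minimal sketch of the working call against the empdf dataframe from the example above; the single-letter pattern fields (M, d, h, m, s) also accept non-zero-padded values such as '1/1/2023 3:57:22 AM':
from pyspark.sql import functions as F
# Non-padded pattern letters match one- or two-digit fields under the Spark 3.x parser.
df = empdf.withColumn("date", F.to_timestamp(F.col("date"), "M/d/yyyy h:m:s a"))
df.show(12, False)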
I am trying to filter the rows that have a specific date in a dataframe. The dates are in the form of month and day, but I keep getting different errors, and I am not sure what is happening or how to solve it.
This is what my table looks like:
And this is how I am trying to filter the Date_Created rows for Jan 21:
df4 = df3.select("*").filter(Date_Created = 'Jan 21')
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-a4124a5c0058> in <module>()
----> 1 df4 = df3.select("*").filter(Date_Created = 'Jan 21')
TypeError: filter() got an unexpected keyword argument 'Date_Created'
I also tried changing to double quotes and putting quotes around the column name, but nothing works... I am kind of guessing at this point...
You could use df.filter(df["Date_Created"] == "Jan 21")
Here's an example:
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    df = spark.createDataFrame(
        [
            (1, "Jan 21", 566),
            (2, "Nov 22", 234),
            (3, "Dec 1", 123),
            (4, "Jan 21", 5466),
            (5, "Jan 21", 4566),
            (3, "Dec 4", 123),
            (3, "Dec 2", 123),
        ],
        ["id", "Date_Created", "Number"],
    )
    df = df.filter(df["Date_Created"] == "Jan 21")
    df.show()
Result:
+---+------------+------+
| id|Date_Created|Number|
+---+------------+------+
| 1| Jan 21| 566|
| 4| Jan 21| 5466|
| 5| Jan 21| 4566|
+---+------------+------+
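For completeness, the same filter can also be written with col() or a SQL expression string; both are standard PySpark forms:
from pyspark.sql import functions as F
# equivalent to df.filter(df["Date_Created"] == "Jan 21")
df.filter(F.col("Date_Created") == "Jan 21").show()
# or with a SQL expression string
df.filter("Date_Created = 'Jan 21'").show()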
I am trying to run some test Spark/Scala code to find employees who have a salary higher than the average salary, using the test data in the Spark dataframe below. But it fails while executing:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What might be the correct syntax to achieve this?
val dataDF20 = spark.createDataFrame(Seq(
  (11, "emp1", 2, 45, 1000.0),
  (12, "emp2", 1, 34, 2000.0),
  (13, "emp3", 1, 33, 3245.0),
  (14, "emp4", 1, 54, 4356.0),
  (15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+
I am building a simple Network Graph with PySpark and GraphFrames (running on Google Dataproc)
vertices = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)],
    ["id", "name", "age"])
edges = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
Then, I try to run label propagation:
result = g.labelPropagation(maxIter=5)
But I get the following error:
Py4JJavaError: An error occurred while calling o164.run.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 829, cluster-network-graph-w-12.c.myproject-bi.internal, executor 2): java.lang.ClassNotFoundException: org.graphframes.GraphFrame$$anonfun$5
It looks like the GraphFrames package isn't available, but only when I run label propagation. How can I fix it?
I solved it using the following parameters:
import pyspark
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAll([('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
                                   ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11')])
spark = SparkSession.builder \
    .appName('testing bq') \
    .config(conf=conf) \
    .getOrCreate()
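A roughly equivalent sketch without a separate SparkConf object, passing the same settings directly to the builder (the graphframes coordinate is the one from the snippet above and may need to match your Spark/Scala version):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('testing bq')
         .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar')
         .config('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11')
         .getOrCreate())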
It seems to be a known issue with graphframes on Google Dataproc.
Create a Python file, add the following lines, and then run it:
from setuptools import setup
setup(name='graphframes',
      version='0.5.10',
      packages=['graphframes', 'graphframes.lib']
      )
You can visit these for details:
https://github.com/graphframes/graphframes/issues/238, https://github.com/graphframes/graphframes/issues/172
This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 4 years ago.
I would like to calculate avg and count in a single groupBy statement in PySpark. How can I do that?
df = spark.createDataFrame([(1, 'John', 1.79, 28, 'M', 'Doctor'),
                            (2, 'Steve', 1.78, 45, 'M', None),
                            (3, 'Emma', 1.75, None, None, None),
                            (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
                            (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
                            (6, 'Hannah', 1.82, None, 'F', None),
                            (7, 'William', 1.7, 42, 'M', 'Engineer'),
                            (None, None, None, None, None, None),
                            (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
                            (9, 'Hannah', 1.65, None, 'F', 'Doctor')],
                           ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])
# This only shows avg, but I also need count right next to it. How can I do that?
df.groupBy("Profession").agg({"Age":"avg"}).show()
df.show()
Thank you.
For the same column:
from pyspark.sql import functions as F
df.groupBy("Profession").agg(F.mean('Age'), F.count('Age')).show()
If you're able to use different columns:
df.groupBy("Profession").agg({'Age':'avg', 'Gender':'count'}).show()
I use postgresql 8.4 to route a river network, and I want to use psycopg2 to loop through all data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break
rs = cur.fetchall()
conn.close()
print rs
The code above takes a very long time to run even though I have set the maximum iterator i to 2, and the output is an error message containing garbage.
I am wondering whether postgresql can accept only one result at a time, so I tried to put this line in my loop:
rs(i) = cur.fetchall()
and the error message said that this line has a bug.
I know that I can't write code like rs(i), but I don't know what to use instead to test my assumption.
So should I save one result to a file first, then use the next iterator to run the loop, and so on?
I am working with postgresql 8.4, python 2.7.6 under Windows 8.1 x64.
Update #1
I can do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4,
6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
But if I want the output to look like this:
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
What small change should I make in the code?
rs = []
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break
conn.close()
print rs
If it is just a counter that breaks the loop, move it outside the while (as written, i is reset to 1 on every pass, so the else branch is never reached):
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()
print rs
print rs