Running into TypeError: 'int' object is not subscriptable when calling update() on a ruamel.yaml.comments.CommentedMap - ruamel.yaml

While updating my code to the new version of ruamel.yaml, I am running into issues.
code:
import sys
import ruamel.yaml
print('Python', tuple(sys.version_info), ', ruamel.yaml', ruamel.yaml.version_info)
yaml_str = """\
number_to_name:
  1: name1
  2: name2
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print('before:', data)
data.update({4: 'name4'})
print('after: ', data)
print('==========')
yaml.dump(data, sys.stdout)
output with ruamel.yaml (0, 17, 4):
Python (3, 6, 13, 'final', 0) , ruamel.yaml (0, 17, 4)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
Traceback (most recent call last):
  File "/home/lib/python3.6/site-packages/ruamel/yaml/comments.py", line 779, in update
    self._ok.update(vals.keys())  # type: ignore
AttributeError: 'tuple' object has no attribute 'keys'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "bin/runamel.py", line 15, in <module>
    data.update({4: 'name4'})
  File "/home/lib/python3.6/site-packages/ruamel/yaml/comments.py", line 783, in update
    self._ok.add(x[0])
TypeError: 'int' object is not subscriptable
The same code works fine with the old version.
output with ruamel.yaml (0, 16, 10):
Python (3, 6, 13, 'final', 0) , ruamel.yaml (0, 16, 10)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
after: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')])), (4, 'name4')])
==========
number_to_name:
  1: name1
  2: name2
  4: name4
What am I doing wrong? (I also suspect that vals.keys() at line 779 will always raise an AttributeError, since vals is a tuple.)

This is an issue introduced between ruamel.yaml versions 0.16.12 and 0.16.13. It has been fixed in version 0.17.9:
import sys
import ruamel.yaml
print('Python', tuple(sys.version_info), ', ruamel.yaml', ruamel.yaml.version_info)
yaml_str = """\
number_to_name:
  1: name1
  2: name2
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print('before:', data)
data.update({4: 'name4'})
print('after: ', data)
print('==========')
yaml.dump(data, sys.stdout)
which gives:
Python (3, 9, 4, 'final', 0) , ruamel.yaml (0, 17, 9)
before: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')]))])
after: ordereddict([('number_to_name', ordereddict([(1, 'name1'), (2, 'name2')])), (4, 'name4')])
==========
number_to_name:
  1: name1
  2: name2
  4: name4
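If upgrading is not an option right away, a minimal workaround (a sketch of my own, not part of the original fix) is to avoid CommentedMap.update() and assign the new keys directly, which sidesteps the broken code path while keeping comments and key order:
# workaround sketch for affected 0.17.x versions: set keys one by one
# instead of calling data.update({...}), which hits the bug shown above
new_items = {4: 'name4'}  # the hypothetical data you wanted to merge
for k, v in new_items.items():
    data[k] = v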

Related

TypeError: filter() got an unexpected keyword argument

I am trying to filter the rows of a dataframe that have a specific date. The dates are in month-and-day form, but I keep getting different errors, and I am not sure what is happening or how to solve it.
This is what my table looks like.
And this is how I am trying to filter the Date_Created rows for Jan 21:
df4 = df3.select("*").filter(Date_Created = 'Jan 21')
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-a4124a5c0058> in <module>()
----> 1 df4 = df3.select("*").filter(Date_Created = 'Jan 21')
TypeError: filter() got an unexpected keyword argument 'Date_Created'
I also tried switching to double quotes and quoting the column name, but nothing works... I am kind of guessing right now...
You could use df.filter(df["Date_Created"] == "Jan 21")
Here's an example:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    df = spark.createDataFrame(
        [
            (1, "Jan 21", 566),
            (2, "Nov 22", 234),
            (3, "Dec 1", 123),
            (4, "Jan 21", 5466),
            (5, "Jan 21", 4566),
            (3, "Dec 4", 123),
            (3, "Dec 2", 123),
        ],
        ["id", "Date_Created", "Number"],
    )
    df = df.filter(df["Date_Created"] == "Jan 21")
    df.show()
Result:
+---+------------+------+
| id|Date_Created|Number|
+---+------------+------+
| 1| Jan 21| 566|
| 4| Jan 21| 5466|
| 5| Jan 21| 4566|
+---+------------+------+
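As a side note (my addition, not part of the original answer), DataFrame.filter does not accept keyword arguments, which is why the original call fails; the same condition can also be written with pyspark.sql.functions.col or as a SQL expression string:
from pyspark.sql import functions as F

# equivalent ways to express the same filter
df.filter(F.col("Date_Created") == "Jan 21")
df.filter("Date_Created = 'Jan 21'")  # SQL expression string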

How to store bigint array in file in Pyspark?

I have a UDF that returns a bigint array. I want to store that in a file on my PySpark cluster.
Sample Data -
[
Row(Id='ABCD505936', array=[0, 2, 5, 6, 8, 10, 12, 13, 14, 15, 18]),
Row(Id='EFGHI155784', array=[1, 2, 4, 10, 16, 22, 27, 32, 36, 38, 39, 40])
]
I tried saving it like this -
df.write.save("/home/data", format="text", header="true", mode="overwrite")
But it throws an error saying -
py4j.protocol.Py4JJavaError: An error occurred while calling
o101.save. : org.apache.spark.sql.AnalysisException: Text data source
does not support array data type.;
Can anyone please help me?
Try this:
from pyspark.sql import functions as F

df.withColumn(
    "array",
    F.col("array").cast("string")
).write.save(
    "/home/data", format="text", header="true", mode="overwrite"
)
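Note (my addition, not from the original answer): the text data source only supports a single string column, so if the DataFrame still contains other columns such as Id, the write above will fail again. A sketch that keeps both columns is to cast the array to a string and write CSV instead:
from pyspark.sql import functions as F

# cast the array to its string representation, then write CSV,
# which supports multiple columns (unlike the text source)
df.withColumn("array", F.col("array").cast("string")) \
    .write.save("/home/data", format="csv", header="true", mode="overwrite")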

group data in pyspark and get the topn data in each group

I have some data that can be shown simply as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf).getOrCreate()
spark = SparkSession(sparkContext=sc).builder.getOrCreate()
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
data = spark.createDataFrame(rdd, ['x', 'y'])
data.show()

def f(x):
    y = sorted(x, reverse=True)[:2]
    return y

h_f = udf(f, IntegerType())
h_f = spark.udf.register("h_f", h_f)
data.groupBy('x').agg({"y": h_f}).show()
But it fails with AttributeError: 'function' object has no attribute '_get_object_id'. How can I get the top-n items in each group?
Assuming you are looking for the top n 'y' elements within each group of 'x':
from pyspark.sql import Window
from pyspark.sql import functions as F
import sys
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
df = spark.createDataFrame(rdd, ['x', 'y'])
df.show()
df_g = df.groupBy('x').agg(F.collect_list('y').alias('y'))
df_g = df_g.withColumn('y_sorted', F.sort_array('y', asc = False))
df_g.withColumn('y_slice', F.slice(df_g.y_sorted, 1, 2)).show()
Output
+---+-----------+-----------+--------+
| x| y| y_sorted| y_slice|
+---+-----------+-----------+--------+
| 1|[10, 8, 12]|[12, 10, 8]|[12, 10]|
| 3| [11, 7, 9]| [11, 9, 7]| [11, 9]|
+---+-----------+-----------+--------+
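An alternative sketch (my addition, not part of the original answer) uses the Window import shown above together with row_number to keep the top two rows per group as ordinary rows instead of collecting them into an array:
from pyspark.sql import Window
from pyspark.sql import functions as F

# rank rows within each 'x' group by descending 'y', then keep the top 2
w = Window.partitionBy('x').orderBy(F.col('y').desc())
top2 = (df.withColumn('rn', F.row_number().over(w))
          .filter(F.col('rn') <= 2)
          .drop('rn'))
top2.show()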

Spark DataFrame: finding employees whose salary is more than the average salary of the organization

I am trying to run a test Spark/Scala snippet that finds employees whose salary is higher than the average salary, using the test data in the DataFrame below. But it fails while executing:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What might be the correct syntax to achieve this ?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+

Use psycopg2 to run a loop in PostgreSQL

I use PostgreSQL 8.4 to route a river network, and I want to use psycopg2 to loop through all the data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break
rs = cur.fetchall()
conn.close()
print rs
The code above takes a long time to run even though I have set the maximum of the iterator i to 2, and the output is an error message containing garbage.
I am thinking that perhaps PostgreSQL can only return one result set at a time, so I tried to put this line in my loop,
rs(i) = cur.fetchall()
but the error message says this line is invalid.
I know that I can't write code like rs(i), but I don't know what to use instead to test my assumption.
So should I save one result to a file first, then run the loop with the next iterator value, and so on?
I am working with postgresql 8.4, python 2.7.6 under Windows 8.1 x64.
Update#1
I can now do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4,
6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
But if I want the output to look like this instead,
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
what small change should I make to the code?
rs = []
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break
conn.close()
print rs
If it is just a counter that breaks the loop, then:
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()
print rs
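If you want each run's rows kept separately (what the rs(i) attempt was aiming for), a minimal sketch (my addition, assuming the same query and parameters) is to append each fetchall() result to a list of result sets and print the rows of one of them line by line:
results = []  # results[0] holds the rows from the run with i = 1, results[1] from i = 2, ...
for i in range(1, 3):
    cur.execute(query, (i, 1000000, False, False))
    results.append(cur.fetchall())
conn.close()

# print the rows of the second run, one tuple per line
for row in results[1]:
    print row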