Use psycopg2 to do a loop in postgresql

I use postgresql 8.4 to route a river network, and I want to use psycopg2 to loop through all data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break
rs = cur.fetchall()
conn.close()
print rs
The code above takes a very long time to run even though I set the maximum value of the counter i to 2, and the output is an error message containing garbage.
I suspect that PostgreSQL can only return one result set at a time, so I tried to put this line in my loop,
rs(i) = cur.fetchall()
and the error message said this line is invalid.
I know that I can't write code like rs(i), but I don't know what to write instead to test my assumption.
Should I save each result to a file first, then run the next iteration of the loop, and so on?
I am working with postgresql 8.4, python 2.7.6 under Windows 8.1 x64.
Update#1
I can do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4,
6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
but if I want to get output that looks like this instead,
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
what small change should I make to the code?

rs = []
i = 1
while True:
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break
conn.close()
print rs
If it is just a counter that breaks that loop then
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()
print rs
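For the row-per-line output asked about in Update#1, a minimal sketch (not from the original answer) is to keep the counter loop and print each fetched tuple on its own line instead of printing the whole list at once:

rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()

# print one tuple per line instead of one big list (Python 2 print statement)
for row in rs:
    print row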

Related

Grouping by an alternating sequence in Spark

I have a set of data where I can identify an alternating sequence. I want to group that data into one chunk and leave all the other data unchanged. That is, wherever flickering occurs in id, I want to overwrite that group of id values with the single id that makes sense for the order. As a small example, consider
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4), // flickering; id=0
("a", "silom", 1, 5), // flickering; id=0
("a", "silom", 0, 6), // flickering; id=0
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11), // flickering and so on
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val resultDataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 15), // grouped by flickering summing on time_sec
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 60),
("a", "silom", 5, 15). // grouped by flickering summing on time_sec
).toDF("user", "cat", "id", "time_sec")
Now a more realistic MWE. In this case there can be multiple users and cats; unfortunately, the approach below doesn't use the DataFrame API and needs to collect the data to the driver. It isn't scalable and has to call getGrps recursively, dropping the length of the returned array of indices each time.
How can I implement this using the DataFrame API, so that I don't need to collect the data to the driver (which would be impossible due to its size)? Also, if there is a better way to do this, what would it be?
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.DoubleType
import scala.collection.mutable.WrappedArray
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4),
("a", "silom", 1, 5),
("a", "silom", 0, 6),
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11),
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15),
("a", "suk", 18, 1),
("a", "suk", 19, 2),
("a", "suk", 20, 3),
("a", "suk", 21, 4),
("a", "suk", 20, 5),
("a", "suk", 21, 6),
("a", "suk", 0, 7),
("a", "suk", 1, 8),
("a", "suk", 2, 9),
("a", "suk", 3, 10),
("a", "suk", 4, 11),
("a", "suk", 3, 12),
("a", "suk", 4, 13),
("a", "suk", 3, 14),
("a", "suk", 5, 15),
("b", "silom", 4, 1),
("b", "silom", 3, 2),
("b", "silom", 2, 3),
("b", "silom", 1, 4),
("b", "silom", 0, 5),
("b", "silom", 1, 6),
("b", "silom", 2, 7),
("b", "silom", 3, 8),
("b", "silom", 4, 9),
("b", "silom", 5, 10),
("b", "silom", 6, 11),
("b", "silom", 7, 12),
("b", "silom", 8, 13),
("b", "silom", 9, 14),
("b", "silom", 10, 15),
("b", "suk", 11, 1),
("b", "suk", 12, 2),
("b", "suk", 13, 3),
("b", "suk", 14, 4),
("b", "suk", 13, 5),
("b", "suk", 14, 6),
("b", "suk", 13, 7),
("b", "suk", 12, 8),
("b", "suk", 11, 9),
("b", "suk", 10, 10),
("b", "suk", 9, 11),
("b", "suk", 8, 12),
("b", "suk", 7, 13),
("b", "suk", 6, 14),
("b", "suk", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val recastDataDF = dataDF.withColumn("id", $"id".cast(DoubleType))
val category = recastDataDF.select("cat").distinct.collect.map(x => x(0).toString)
val data = recastDataDF
  .select($"*" +: category.map(
    name =>
      lag("id", 1).over(
        Window.partitionBy("user", "cat").orderBy("time_sec")
      )
      .alias(s"lag_${name}_id")): _*)
  .withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
    .otherwise($"lag_suk_id" - $"id"))
  .drop("lag_silom_id", "lag_suk_id")
  .withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
  .withColumn("zipped", array("user", "cat", "sequencing_diff", "rn", "id"))
// non dataframe API approach (not scalable)
// needs to collect data to driver to process
val iterTuples = data.select("zipped").collect.map(x => x(0).asInstanceOf[WrappedArray[Any]]).map(x => x.toArray)
val shifted: Array[Array[Any]] = iterTuples.drop(1)
val combined = iterTuples
.zipAll(shifted, Array("", "", Double.NaN, Double.NaN, Double.NaN), Array("", "", Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map {
  case (data0, data1) =>
    if (data1(3).toString.toDouble > 2 && data0(3).toString.toDouble > 2 && data1(0) == data0(0) && data1(1) == data0(1)) {
      if (data0(2) != data1(2) && data0(2).toString.toDouble + data1(2).toString.toDouble == 0) {
        (data1(0), data1(1), data1(3), data0(4))
      }
      else ("", "", Double.NaN, Double.NaN)
    }
    else ("", "", Double.NaN, Double.NaN)
}
.filter(t => t._1 != "" && t._2 != "" && t._3 == t._3 && t._4 == t._4) // fast NaN removal
val typeMappedArray = testArr.map(x => (x._1.toString, x._2.toString, x._3.toString.toDouble, x._4.toString.toDouble))
def getGrps(arr: Array[(String, String, Double, Double)]): (Array[Double], Double, String, String) = {
  if (arr.nonEmpty) {
    val user = arr.take(1)(0)._1
    val cat = arr.take(1)(0)._2
    val rowNum = arr.take(1)(0)._3
    val keepID = arr.take(1)(0)._4
    val newArr = arr.drop(1)
    val rowNums = Array(rowNum) ++ newArr.zipWithIndex.map {
      case (tups, idx) =>
        if (rowNum + idx + 1 == tups._3) {
          rowNum + 1 + idx
        }
        else Double.NaN
    }
    .filter(v => v == v)
    (rowNums, keepID, user, cat)
  }
  else (Array(Double.NaN), Double.NaN, "", "")
}
// after overwriting, this would allow me to group by user, cat, id to sum the time
getGrps(typeMappedArray) // returns rows number to overwrite, value to overwrite id with, user, cat
res0: (Array(5.0, 6.0, 7.0),0.0,a,silom)
getGrps(typeMappedArray.drop(3))
res1: (Array(11.0, 12.0, 13.0, 14.0),4.0,a,silom)
A second approach uses collect_list, but it relies on getGrps working recursively, which I cannot get working properly. Here is the code I have so far, with getGrps modified for the collect_list approach, minus the recursion.
val data = recastDataDF
  .select($"*" +: category.map(
    name =>
      lag("id", 1).over(
        Window.partitionBy("user", "cat").orderBy("time_sec")
      )
      .alias(s"lag_${name}_id")): _*)
  .withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
    .otherwise($"lag_suk_id" - $"id"))
  .drop("lag_silom_id", "lag_suk_id")
  .withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
  .withColumn("id_rn", array($"id", $"rn", $"sequencing_diff"))
  .groupBy($"user", $"cat").agg(collect_list($"id_rn").alias("array_data"))
// collect one row to develop how the UDF would work
val testList = data.where($"user" === "a" && $"cat" === "silom").select("array_data").collect
.map(x => x(0).asInstanceOf[WrappedArray[WrappedArray[Any]]])
.map(x => x.toArray)
.head
.map(x => (x(0).toString.toDouble, x(1).toString.toDouble, x(2).asInstanceOf[Double]))
// this code would be in the UDF; that is, we would pass array_data to the UDF
scala.util.Sorting.stableSort(testList, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = testList.drop(1)
val combined = testList
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map {
  case (data0, data1) =>
    if (data0._3 != data1._3 && data0._2 > 1) {
      (data0._2, data0._1)
    }
    else (Double.NaN, Double.NaN)
}
.filter(t => t._1 == t._1 && t._2 == t._2) // fast NaN removal
// called inside the UDF
def getGrps2(arr: Array[(Double, Double)]): (Array[Double], Double) = {
  // no need for user or cat
  if (arr.nonEmpty) {
    val rowNum = arr.take(1)(0)._1
    val keepID = arr.take(1)(0)._2
    val newArr = arr.drop(1)
    val rowNums = Array(rowNum) ++ newArr.zipWithIndex.map {
      case (tups, idx) =>
        if (rowNum + idx + 1 == tups._1) {
          rowNum + 1 + idx
        }
        else Double.NaN
    }
    .filter(v => v == v)
    (rowNums, keepID)
  }
  else (Array(Double.NaN), Double.NaN)
}
We would then call .withColumn("data_to_update", udf) and the data_to_update column would be a WrappedArray[Tuple2[Array[Double], Double]] with the row numbers of the ids to overwrite. The result for user a, cat silom would be
WrappedArray((Array(4.0, 5.0, 6.0),0.0), (Array(10.0, 11.0, 12.0, 13.0),4.0))
The Array pieces are row numbers and the Double is the id to update those rows with.
The following recursive method, applied in a UDF operating on the array_data column, will create the desired results:
def getGrps(arr: Array[(Double, Double)]): Array[(Array[Double], Double)] = {
  def returnAlternatingIDs(arr: Array[(Double, Double)],
                           altIDs: Array[(Array[Double], Double)]): Array[(Array[Double], Double)] = arr match {
    case arr if arr.nonEmpty =>
      val rowNum = arr.take(1)(0)._1
      val keepID = arr.take(1)(0)._2
      val newArr = arr.drop(1)
      val rowNums = Array(rowNum) ++ newArr.zipWithIndex.map {
        case (tups, idx) =>
          if (rowNum + idx + 1 == tups._1) {
            rowNum + 1 + idx
          }
          else {
            Double.NaN
          }
      }
      .filter(v => v == v)
      val updateArray = altIDs ++ Array((rowNums, keepID))
      returnAlternatingIDs(arr.drop(rowNums.length), updateArray)
    case _ => altIDs
  }
  returnAlternatingIDs(arr, Array((Array(Double.NaN), Double.NaN))).drop(1)
}
The return value for the first collect_list is Array((Array(5.0, 6.0, 7.0),0.0), (Array(11.0, 12.0, 13.0, 14.0),4.0)) as desired.
Full UDF
val identifyFlickeringIDs: UserDefinedFunction = udf {
  (colArrayData: WrappedArray[WrappedArray[Double]]) =>
    val newArray: Array[(Double, Double, Double)] = colArrayData.toArray
      .map(x => (x(0).toDouble, x(1).toDouble, x(2).toDouble))
    // sort array by rn via less-than relation
    stableSort(newArray, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
    val shifted: Array[(Double, Double, Double)] = newArray.toArray.drop(1)
    val combined = newArray
      .zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
    val parsedArray = combined.map {
      case (data0, data1) =>
        if (data0._3 != data1._3 && data0._2 > 1 && data0._3 + data1._3 == 0) {
          (data0._2, data0._1)
        }
        else (Double.NaN, Double.NaN)
    }
    .filter(t => t._1 == t._1 && t._2 == t._2) // fast NaN removal
    getGrps(parsedArray).filter(data => data._1.length > 1)
}

group data in pyspark and get the topn data in each group

I have some data that can be shown simply as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf).getOrCreate()
spark = SparkSession(sparkContext=sc).builder.getOrCreate()
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
data = spark.createDataFrame(rdd, ['x', 'y'])
data.show()

def f(x):
    y = sorted(x, reverse=True)[:2]
    return y

h_f = udf(f, IntegerType())
h_f = spark.udf.register("h_f", h_f)
data.groupBy('x').agg({"y": h_f}).show()
But it goes wrong: AttributeError: 'function' object has no attribute '_get_object_id'. How can I get the top-n items in each group?
Considering you are looking for the top n 'y' elements that belong to each group of 'x':
from pyspark.sql import Window
from pyspark.sql import functions as F
import sys
rdd = sc.parallelize([(1, 10), (3, 11), (1, 8), (1, 12), (3, 7), (3, 9)])
df = spark.createDataFrame(rdd, ['x', 'y'])
df.show()
df_g = df.groupBy('x').agg(F.collect_list('y').alias('y'))
df_g = df_g.withColumn('y_sorted', F.sort_array('y', asc = False))
df_g.withColumn('y_slice', F.slice(df_g.y_sorted, 1, 2)).show()
Output
+---+-----------+-----------+--------+
| x| y| y_sorted| y_slice|
+---+-----------+-----------+--------+
| 1|[10, 8, 12]|[12, 10, 8]|[12, 10]|
| 3| [11, 7, 9]| [11, 9, 7]| [11, 9]|
+---+-----------+-----------+--------+
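As an aside, and not part of the original answer, the same top-n-per-group result can be sketched with a window function, which keeps each top value as its own row instead of collecting the values into an array:

from pyspark.sql import Window
from pyspark.sql import functions as F

# rank rows within each 'x' group by descending 'y', then keep the two largest
w = Window.partitionBy('x').orderBy(F.col('y').desc())
df.withColumn('rank', F.row_number().over(w)) \
  .where(F.col('rank') <= 2) \
  .drop('rank') \
  .show()

Here df is the DataFrame built in the answer above; ties are broken arbitrarily by row_number.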

How to further extract list to make a triple nested list from a double nested list

I have a list of lists that I want to separate using one of the internal values.
I was thinking a hash map would work, but I am not that familiar with it. The list would look like this:
val data: List[(Int, Int, Int, Int)] = List((0, 1, 2, 3), (1, 1, 2, 7), (2, 1, 5, 5), (3, 1, 3, 7), (4, 1, 2, 8), (5, 1, 5, 4), (6, 1, 3, 5))
and I want to get something like:
List(((0, 1, 2, 3), (1, 1, 2, 7), (4, 1, 2, 8)), ((3, 1, 3, 7),(6, 1, 3, 5)),((5, 1, 5, 4), (2, 1, 5, 5)))
I want to separate it by the 3rd element of each tuple.
This is a solution to what you are looking for, but you will get a List of Lists, not a List of Tuples:
val list : List[(Int, Int, Int, Int)] = List((0, 1, 2, 3), (1, 1, 2, 7), (2, 1, 5, 5), (3, 1, 3, 7), (4, 1, 2, 8), (5, 1, 5, 4), (6, 1, 3, 5))
list.groupBy(_._3).values.toList
> res = List(List((0,1,2,3), (1,1,2,7), (4,1,2,8)), List((2,1,5,5), (5,1,5,4)), List((3,1,3,7), (6,1,3,5)))
You can use the groupBy function on a list:
list.groupBy( i => i._3 )
will create a Map. You will want to massage the values of the Map afterwards.
Good luck!

pyspark sortBy didn't work on multiple values?

Suppose I have an RDD containing 4-tuples (a, b, c, d), where a, b, c, and d are all integer variables.
I'm trying to sort the data in ascending order based only on the d variable (it's not finalized, so I'm trying something else).
This is the current code I have:
sortedRDD = RDD.sortBy(lambda (a, b, c, d): d)
However, when I check the final data, the result still seems incorrect.
# I check with this code
sortedRDD.takeOrdered(15)
You should specify the sorting order again in takeOrdered:
RDD.takeOrdered(15, lambda (a, b, c, d): d)
As you do not collect the data after the sort, the order is not guaranteed in subsequent operations, see the example below:
rdd = sc.parallelize([(1,2,3,4),(5,6,7,3),(9,10,11,2),(12,13,14,1)])
result = rdd.sortBy(lambda (a,b,c,d): d)
#when using collect output is sorted
#[(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
result.collect()
#results are not necessarily sorted in subsequent operations
#[(1, 2, 3, 4), (5, 6, 7, 3), (9, 10, 11, 2), (12, 13, 14, 1)]
result.takeOrdered(5)
#result are sorted when specifying the sort order in takeOrdered
#[(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
result.takeOrdered(5,lambda (a,b,c,d): d)
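One caveat, added here rather than taken from the answer: lambda (a, b, c, d): d uses tuple-parameter unpacking, which only exists in Python 2. Indexing into the tuple gives the same ordering and also works on Python 3:

# sort ascending on the 4th element without tuple unpacking
result = rdd.sortBy(lambda t: t[3])

# and specify the same key in takeOrdered
result.takeOrdered(5, key=lambda t: t[3])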

scala array filtering based on information of another array

I have two arrays like this:
array one,
Array(productId, categoryId)
(2, 423)
(6, 859)
(3, 423)
(5, 859)
and another array Array((productId1, productId2), count)
((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1)
I would like to filter the second array based on the first array:
first I want to check array 2 to see whether productId1 and productId2 have the same category; if yes, keep the element, otherwise filter it out.
So the list above will be filtered down to:
( ((2, 3), 1), ((6, 5), 1) )
Can anybody help me with this? Thank you very much.
If you don't mind working with the first array as a map, ie:
scala> val categ_info = Array((2, 423), (6, 859), (3, 423), (5, 859)).toMap
categ_info: Map[Int, Int] = Map(2 -> 423, 6 -> 859, 3 -> 423, 5 -> 859)
then we have (setting up example data as simple Ints for convenience):
val data = Array(((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1))
data.filter { case ((prod1_id, prod2_id), _) =>
categ_info(prod1_id) == categ_info(prod2_id)
}
producing:
res2: Array[((Int, Int), Int)] = Array(((2, 3), 1), ((6, 5), 1))
as requested.