PySpark sortBy didn't work on multiple values?

Suppose I have an RDD containing 4-tuples (a, b, c, d), where a, b, c, and d are all integers.
I'm trying to sort the data in ascending order based only on the d variable (but that didn't come out right, so I'm trying something else).
This is the code I currently have:
sortedRDD = RDD.sortBy(lambda (a, b, c, d): d)
However, when I check the final data, the result still doesn't seem to be sorted correctly.
# I check with this code
sortedRDD.takeOrdered(15)

You should specify the sorting order again in takeOrdered:
RDD.takeOrdered(15, lambda (a, b, c, d): d)
As you do not collect the data after the sort, the order is not guaranteed in subsequent operations; see the example below:
rdd = sc.parallelize([(1, 2, 3, 4), (5, 6, 7, 3), (9, 10, 11, 2), (12, 13, 14, 1)])
result = rdd.sortBy(lambda (a, b, c, d): d)

# when using collect, the output is sorted
result.collect()
# [(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]

# results are not necessarily sorted in subsequent operations
result.takeOrdered(5)
# [(1, 2, 3, 4), (5, 6, 7, 3), (9, 10, 11, 2), (12, 13, 14, 1)]

# results are sorted when specifying the sort order in takeOrdered
result.takeOrdered(5, lambda (a, b, c, d): d)
# [(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
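Note that tuple-unpacking lambdas like lambda (a, b, c, d): d only work in Python 2; in Python 3 you would index into the tuple instead, e.g. lambda t: t[3] or operator.itemgetter(3). The key logic itself can be sketched locally with plain Python, no Spark needed:

```python
from operator import itemgetter

data = [(1, 2, 3, 4), (5, 6, 7, 3), (9, 10, 11, 2), (12, 13, 14, 1)]

# sort by the 4th field, the same key passed to sortBy/takeOrdered above
print(sorted(data, key=itemgetter(3)))
# [(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
```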

Related

OR-Tools: Is there any way to get raw combinations and permutations?

For example, you call something like permutations([1, 2, 3, 4], 2) or combinations([1, 2, 3, 4], 2) and get a list of permutations/combinations like [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), ...]. Is it possible to do this using OR-Tools specifically?
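Whether OR-Tools itself exposes such an enumeration I can't say, but for reference, Python's standard itertools module produces exactly the lists described:

```python
from itertools import combinations, permutations

# all 2-element combinations of [1, 2, 3, 4], in the order shown in the question
print(list(combinations([1, 2, 3, 4], 2)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

# permutations also include reversed pairs such as (2, 1), so there are 4 * 3 = 12
print(len(list(permutations([1, 2, 3, 4], 2))))
# 12
```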

How to further extract list to make a triple nested list from a double nested list

I have a list of tuples that I want to split up using one of the internal values.
I was thinking a hash map would work, but I am not that familiar with it. The list looks like this:
val data: List[(Int, Int, Int, Int)] = List((0, 1, 2, 3), (1, 1, 2, 7), (2, 1, 5, 5), (3, 1, 3, 7), (4, 1, 2, 8), (5, 1, 5, 4), (6, 1, 3, 5))
and I want to get something like:
List(((0, 1, 2, 3), (1, 1, 2, 7), (4, 1, 2, 8)), ((3, 1, 3, 7), (6, 1, 3, 5)), ((5, 1, 5, 4), (2, 1, 5, 5)))
I separate it by the 3rd element of each tuple.
This is a solution to what you are looking for, but you will end up with a List of Lists, not a List of Tuples:
val list : List[(Int, Int, Int, Int)] = List((0, 1, 2, 3), (1, 1, 2, 7), (2, 1, 5, 5), (3, 1, 3, 7), (4, 1, 2, 8), (5, 1, 5, 4), (6, 1, 3, 5))
list.groupBy(_._3).values.toList
> res = List(List((0,1,2,3), (1,1,2,7), (4,1,2,8)), List((2,1,5,5), (5,1,5,4)), List((3,1,3,7), (6,1,3,5)))
You can use the groupBy function on a list:
list.groupBy(i => i._3)
This will create a Map. You will then want to massage the values of the Map afterwards.
Good luck!
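For comparison, the same grouping can be sketched in plain Python with a dict keyed on the 3rd element:

```python
from collections import defaultdict

data = [(0, 1, 2, 3), (1, 1, 2, 7), (2, 1, 5, 5), (3, 1, 3, 7),
        (4, 1, 2, 8), (5, 1, 5, 4), (6, 1, 3, 5)]

groups = defaultdict(list)
for t in data:
    groups[t[2]].append(t)  # group by the 3rd element of each tuple

print(list(groups.values()))
# [[(0, 1, 2, 3), (1, 1, 2, 7), (4, 1, 2, 8)],
#  [(2, 1, 5, 5), (5, 1, 5, 4)],
#  [(3, 1, 3, 7), (6, 1, 3, 5)]]
```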

sortByKey() by composite key in PySpark

In an RDD with a composite key, is it possible to sort in ascending order by the first element and in descending order by the second element when both of them are strings? I have provided some dummy data below.
z = [(('a','b'), 3), (('a','c'), -2), (('d','b'), 4), (('e','b'), 6), (('a','g'), 8)]
rdd = sc.parallelize(z)
rdd.sortByKey(False).collect()
Maybe there's a more efficient way, but here is one:
str_to_ints = lambda s, i: [ord(c) * i for c in s]
rdd.sortByKey(keyfunc=lambda x: (str_to_ints(x[0], 1), str_to_ints(x[1], -1))).collect()
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]
Basically, convert each string in the key to a list of integers, with the first string's characters multiplied by 1 and the second's multiplied by -1.
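The same key trick can be checked locally with the built-in sorted, without Spark:

```python
z = [(('a', 'b'), 3), (('a', 'c'), -2), (('d', 'b'), 4), (('e', 'b'), 6), (('a', 'g'), 8)]

# first string ascending, second string descending
key = lambda kv: ([ord(c) for c in kv[0][0]], [-ord(c) for c in kv[0][1]])
print(sorted(z, key=key))
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]
```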

scala array filtering based on information of another array

I have two arrays like this:
array one, Array(productId, categoryId)
(2, 423)
(6, 859)
(3, 423)
(5, 859)
and another array, Array((productId1, productId2), count)
((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1)
I would like to filter the second array based on the first array:
first I check array two to see whether productId1 and productId2 have the same category; if yes, I keep the element, otherwise I filter it out.
So the list above will be filtered down to:
(((2, 3), 1), ((6, 5), 1))
Can anybody help me with this? Thank you very much.
If you don't mind working with the first array as a map, i.e.:
scala> val categ_info = Array((2, 423), (6, 859), (3, 423), (5, 859)).toMap
categ_info: Map[Int,Int] = Map(2 -> 423, 6 -> 859, 3 -> 423, 5 -> 859)
then we have (setting up example data as simple Ints for convenience):
val data = Array(((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1))
data.filter { case ((prod1_id, prod2_id), _) =>
  categ_info(prod1_id) == categ_info(prod2_id)
}
producing:
res2: Array[((Int, Int), Int)] = Array(((2, 3), 1), ((6, 5), 1))
as requested.
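For what it's worth, the same filter reads almost identically in Python, with a dict as the category lookup:

```python
# category lookup built from the first array
categ_info = {2: 423, 6: 859, 3: 423, 5: 859}
data = [((2, 6), 1), ((2, 3), 1), ((6, 5), 1), ((6, 3), 1)]

# keep only the pairs whose two products share a category
kept = [row for row in data if categ_info[row[0][0]] == categ_info[row[0][1]]]
print(kept)
# [((2, 3), 1), ((6, 5), 1)]
```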

Use psycopg2 to do loop in postgresql

I use PostgreSQL 8.4 to route a river network, and I want to use psycopg2 to loop through all the data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break

rs = cur.fetchall()
conn.close()
print rs
The code above takes a long time to run even though I set the maximum value of the counter i to 2, and the output is an error message containing garbage.
I suspect that PostgreSQL can return only one result set at a time, so I tried to put this line in my loop:
rs(i) = cur.fetchall()
but the error message said this line is wrong.
I know I can't write code like rs(i) in Python, but I don't know the correct replacement to test my assumption.
Should I save one result to a file first, then run the loop again with the next value of the iterator, and so on?
I am working with PostgreSQL 8.4 and Python 2.7.6 under Windows 8.1 x64.
Update #1
I can do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4, 6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
but if I want to get output that looks like this:
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
What small change should I make to the code?
rs = []
i = 1
while True:
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break

conn.close()
print rs
If it is just a counter that breaks the loop, it can be simplified to:
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1

conn.close()
print rs
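To get the one-tuple-per-line layout shown in the desired output, print each row individually instead of the whole list. A minimal sketch, with dummy rows standing in for cur.fetchall():

```python
# dummy rows standing in for one iteration's cur.fetchall() result
rs = [(1, 2, 7397.215962901), (2, 2, 2667.88513439865), (3, 19, 2522.94024571198)]

for row in rs:
    print(row)  # one tuple per line; print(row) works in both Python 2 and 3 here
```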