Consider the following DF:
df = spark.createDataFrame(
[
Row(
x='a',
y=[
{'f1': 1, 'f2': 2},
{'f1': 3, 'f2': 4}
],
z=[
{'f3': 1, 'f4': '2'},
{'f3': 1, 'f4': '4', 'f5': [1,2,3]}
]
)
]
)
I wish to combine y and z index-wise, so I may get:
[
Row(x='a', y={'f1': 1, 'f2': 2}, z={'f3': 1, 'f4': 2}),
Row(x='a', y={'f1': 3, 'f2': 4}, z={'f3': 1, 'f4': 4, 'f5': [1,2,3]})
]
How can it be done without converting to rdd?
This is the output. It differs slightly from what you expected: every value in the z column is converted to a string, whether it was originally an int, a string, or a list.
[Row(x='a', y={'f2': 2, 'f1': 1}, z={'f3': '1', 'f4': '2'}), Row(x='a', y={'f2': 4, 'f1': 3}, z={'f3': '1', 'f4': '4', 'f5': '[1, 2, 3]'})]
It is produced by the following code:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import explode,monotonically_increasing_id
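# Explicit schema: y holds int-valued maps; z is declared with string values,
# which is why its ints and lists come back as strings
df = spark.createDataFrame(
    [Row(x='a', y=[{'f1': 1, 'f2': 2}, {'f1': 3, 'f2': 4}], z=[{'f3': 1, 'f4': '2'}, {'f3': 1, 'f4': '4', 'f5': [1, 2, 3]}])],
    StructType([StructField('x', StringType(), True),
                StructField('y', ArrayType(MapType(StringType(), IntegerType(), True), True), True),
                StructField('z', ArrayType(MapType(StringType(), StringType(), True), True), True)]))
# Explode each array into one row per element and tag every row with an id
df1 = df.select('x', explode(df.y).alias("y")).withColumn("id", monotonically_increasing_id())
df2 = df.select(explode(df.z).alias("z")).withColumn("id", monotonically_increasing_id())
# Caveat: monotonically_increasing_id() only guarantees unique, increasing ids, so this join
# relies on both DataFrames receiving the same ids (for example, a single input partition)
df3 = df1.join(df2, "id", "outer").drop("id")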
df3.collect()
I have two lists: one is a list of Map items, and the other defines an order.
I would like to sort the items by their description attribute so that the items whose description appears in the order list come first, in the order given by that list.
import 'package:collection/collection.dart';
void main() {
List<String> order = [
'top european',
'top usa',
'top rest of the world'
];
List<Map> items = [
{'id': 0, 'id2': 5, 'description': 'Top USA'},
{'id': 2, 'id2': 2, 'description': 'Top A'},
{'id': 3, 'id2': 0, 'description': 'Top Z'},
{'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
{'id': 4, 'id2': 4, 'description': 'Top C'},
{'id': 5, 'id2': 1, 'description': 'Top D'},
{'id': 1, 'id2': 3, 'description': 'Top European'},
];
//this works but adds the items at the end
items.sort((a,b) {
return order.indexOf(a['description'].toLowerCase()) -
order.indexOf(b['description'].toLowerCase());
});
///Results: print(items);
// List<Map> items = [
// {'id': 2, 'id2': 2, 'description': 'Top A'},
// {'id': 3, 'id2': 0, 'description': 'Top Z'},
// {'id': 4, 'id2': 4, 'description': 'Top C'},
// {'id': 5, 'id2': 1, 'description': 'Top D'},
// {'id': 1, 'id2': 3, 'description': 'Top European'},
// {'id': 0, 'id2': 5, 'description': 'Top USA'},
// {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
// ];
}
SOLUTION: I also tried this approach which is not ideal, but it works.
List <Map> itemsOrder = items
.where(
(ele) => order.contains(ele['description'].toString().toLowerCase()))
.toList();
itemsOrder.sort((a, b) {
return order.indexOf(a['description'].toLowerCase()) -
order.indexOf(b['description'].toLowerCase());
});
items.removeWhere(
(ele) => order.contains(ele['description'].toString().toLowerCase()));
itemsOrder = itemsOrder.reversed.toList();
for (int i = 0; i < itemsOrder.length; i++) {
items.insert(0, itemsOrder[i]);
}
///Results: print(items);
// List<Map> items = [
// {'id': 1, 'id2': 3, 'description': 'Top European'},
// {'id': 0, 'id2': 5, 'description': 'Top USA'},
// {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
// {'id': 2, 'id2': 2, 'description': 'Top A'},
// {'id': 3, 'id2': 0, 'description': 'Top Z'},
// {'id': 4, 'id2': 4, 'description': 'Top C'},
// {'id': 5, 'id2': 1, 'description': 'Top D'},
// ];
Ideally, I would like to use sortBy or sortByCompare but unfortunately, I cannot find a proper example or get a grasp of how to use it.
The way I would fix this is to look up the index of each description in the order list. If the description cannot be found, I would use a number beyond the last valid index of the order list, which indicates that the item belongs at the bottom of the list.
This would be my solution:
void testIt() {
final outOfBounds = order.length + 1;
const description = 'description';
items.sort(
(lhs, rhs) {
final lhsDesc = (lhs[description] as String).toLowerCase();
final rhsDesc = (rhs[description] as String).toLowerCase();
final lhsIndex =
order.contains(lhsDesc) ? order.indexOf(lhsDesc) : outOfBounds;
final rhsIndex =
order.contains(rhsDesc) ? order.indexOf(rhsDesc) : outOfBounds;
return lhsIndex.compareTo(rhsIndex);
},
);
}
And the result is:
[{id: 1, id2: 3, description: Top European}, {id: 0, id2: 5, description: Top USA}, {id: 6, id2: 6, description: Top Rest of the world}, {id: 2, id2: 2, description: Top A}, {id: 3, id2: 0, description: Top Z}, {id: 4, id2: 4, description: Top C}, {id: 5, id2: 1, description: Top D}]
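For readers who want to try the idea outside Dart, here is a minimal Python sketch of the same technique: rank each item by its position in the order list and fall back to an out-of-bounds rank when the description is not listed. The names rank, out_of_bounds, and priority are made up for this illustration; the data is copied from the question. Python's sorted() is stable, so the unlisted items keep their original relative order.

```python
order = ["top european", "top usa", "top rest of the world"]
items = [
    {"id": 0, "id2": 5, "description": "Top USA"},
    {"id": 2, "id2": 2, "description": "Top A"},
    {"id": 3, "id2": 0, "description": "Top Z"},
    {"id": 6, "id2": 6, "description": "Top Rest of the world"},
    {"id": 4, "id2": 4, "description": "Top C"},
    {"id": 5, "id2": 1, "description": "Top D"},
    {"id": 1, "id2": 3, "description": "Top European"},
]

# Map each listed description to its position once, instead of searching the list per comparison.
rank = {name: position for position, name in enumerate(order)}
# Any value >= len(order) sorts after every listed description.
out_of_bounds = len(order)


def priority(item):
    """Return the item's position in `order`, or out_of_bounds if it is not listed."""
    return rank.get(item["description"].lower(), out_of_bounds)


sorted_items = sorted(items, key=priority)
print([item["id"] for item in sorted_items])
```

Using a key function rather than a two-argument comparator means each description is lowercased and looked up once per item.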
I executed the following code:
temp = rdd.map( lambda p: ( p[0], (p[1],p[2],p[3],p[4],p[5]) ) ).groupByKey().mapValues(list).collect()
print(temp)
and I could get data:
[ ("A", [("a", 1, 2, 3, 4), ("b", 2, 3, 4, 5), ("c", 4, 5, 6, 7)]) ]
I'm trying to build a dictionary from the second element of the tuple (the list).
For example, I want to restructure temp into this format:
("A", {"a": [1, 2, 3, 4], "b":[2, 3, 4, 5], "c":[4, 5, 6, 7]})
Is there any clear way to do this?
If I understood you correctly you need something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
["A", "a", 1, 2, 5, 6],
["A", "b", 3, 4, 6, 9],
["A", "c", 7, 5, 6, 0],
]
rdd = spark.sparkContext.parallelize(data)
temp = (
rdd.map(lambda x: (x[0], ({x[1]: [x[2], x[3], x[4], x[5]]})))
.groupByKey()
.mapValues(list)
.mapValues(lambda x: {k: v for y in x for k, v in y.items()})
)
print(temp.collect())
# [('A', {'a': [1, 2, 5, 6], 'b': [3, 4, 6, 9], 'c': [7, 5, 6, 0]})]
This is easily doable with a custom Python function once you obtain the temp object. You just need to use tuple, list and dict manipulation.
def my_format(l):
    # get tuple inside list
    tup = l[0]
    # create dictionary with key equal to first value of each sub-tuple
    dct = {}
    for e in tup[1]:
        dct2 = {e[0]: list(e[1:])}
        dct.update(dct2)
    # combine first element of list with dictionary
    return (tup[0], dct)

my_format(temp)
# ('A', {'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5], 'c': [4, 5, 6, 7]})
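If the collected list can contain more than one key, the same reshaping can be written as a comprehension that handles every entry. This is only a sketch that works on the plain Python list returned by collect(); the function name to_dict is made up for the example, and temp is copied from the question.

```python
# Shape of the data after collect(), copied from the question.
temp = [("A", [("a", 1, 2, 3, 4), ("b", 2, 3, 4, 5), ("c", 4, 5, 6, 7)])]


def to_dict(grouped):
    """Turn [(key, [(name, v1, v2, ...), ...]), ...] into [(key, {name: [v1, v2, ...]}), ...]."""
    return [
        # Star-unpacking splits each tuple into its first field and a list of the rest.
        (key, {name: values for name, *values in rows})
        for key, rows in grouped
    ]


result = to_dict(temp)
print(result)
```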
As you can see, I have 30 golf scorecards. 3 scorecards from course 1 and 27 scorecards from course 2.
I need a function/method that can calculate the average score for s1,s2,s3…s18, separated by courseID and teeID.
scorecards
[
{courseID: 1, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
{courseID: 1, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
{courseID: 1, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
…3 scorecards
],
[
{courseID: 2, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
{courseID: 2, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
{courseID: 2, teeID: 1 , s1: 4, s2: 3, s3: 5, …s18: 4},
…27 scorecards
];
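One possible way to compute these averages is to accumulate per-hole totals for each (courseID, teeID) pair and divide by the number of cards in that group. The sketch below is in Python and makes several assumptions: the scorecards are available as one flat list of dictionaries, every card in a group carries the same hole fields, and the sample data (only three holes, invented scores) is purely illustrative. The name average_scores is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample with three holes per card; real cards would carry all their hole fields.
scorecards = [
    {"courseID": 1, "teeID": 1, "s1": 4, "s2": 3, "s3": 5},
    {"courseID": 1, "teeID": 1, "s1": 6, "s2": 3, "s3": 4},
    {"courseID": 2, "teeID": 1, "s1": 5, "s2": 4, "s3": 3},
]


def average_scores(cards):
    """Return {(courseID, teeID): {hole: average score}} for a flat list of cards."""
    totals = defaultdict(lambda: defaultdict(int))  # group -> hole -> summed score
    counts = defaultdict(int)                       # group -> number of cards
    for card in cards:
        group = (card["courseID"], card["teeID"])
        counts[group] += 1
        for field, score in card.items():
            # Every field other than the two identifiers is treated as a hole score.
            if field not in ("courseID", "teeID"):
                totals[group][field] += score
    return {
        group: {hole: total / counts[group] for hole, total in holes.items()}
        for group, holes in totals.items()
    }


averages = average_scores(scorecards)
print(averages)
```

If the scorecards arrive as nested lists (one list per course), they would need to be flattened into a single list first.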
I have the below dictionary:
{'Closed': {'High': 33, 'Medium': 474, 'Low': 47, 'Critical': 6}, 'Impact Statement Pending': {'Low': 3, 'Medium': 1, 'Critical': 0, 'High': 0}, 'New': {'Low': 1, 'High': 2, 'Critical': 2, 'Medium': 2}, 'Remediation Plan Pending': {'Medium': 10, 'Low': 1, 'Critical': 1, 'High': 0}, 'Remedy in Progress': {'Medium': 36, 'Low': 18, 'High': 4, 'Critical': 1}}
How might I accomplish creating a list comprised of all values for a specified key? A list for all high values, or another list for all medium values?
The way I am currently accomplishing this doesn't seem like the best way. I've got a list of all severity levels, which I iterate over and compare such as shown below:
trace_labels = ['High', 'Medium', 'Critical', 'Low']
total_status_dict = {'Closed': {'High': 33, 'Medium': 474, 'Low': 47, 'Critical': 6}, 'Impact Statement Pending': {'Low': 3, 'Medium': 1, 'Critical': 0, 'High': 0}, 'New': {'Low': 1, 'High': 2, 'Critical': 2, 'Medium': 2}, 'Remediation Plan Pending': {'Medium': 10, 'Low': 1, 'Critical': 1, 'High': 0}, 'Remedy in Progress': {'Medium': 36, 'Low': 18, 'High': 4, 'Critical': 1}}
for item in trace_labels:
    y_values = []
    for key, val in total_status_dict.items():
        for ke in total_status_dict[key]:
            if item is ke:
                y_values.append(total_status_dict[key][ke])
Note: you are iterating over the keys of total_status_dict and appending the results to a list. Remember that even though dictionaries have been officially ordered since Python 3.7 (see https://docs.python.org/3/whatsnew/3.7.html), you do not always control the user's Python version. I would rather build a dict key -> item -> value, where key is Closed, Impact Statement Pending, ... and item is one of the trace_labels, than a dict key -> [values], where values is supposed to be ordered as in trace_labels.
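Your code is not efficient because you effectively iterate over the labels twice, once in the outer loop and once more in the inner loop over total_status_dict[key]:
for item in trace_labels:
    ...
        for ke in total_status_dict[key]:
            if item is ke:
Also note that is tests object identity; strings should be compared with ==.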
How to iterate only once? Instead of building y_values lists one by one (with a whole iteration over total_status_dict each time), you can build several lists at once:
>>> trace_labels = ['High', 'Medium', 'Critical', 'Low']
>>> total_status_dict = {'Closed': {'High': 33, 'Medium': 474, 'Low': 47, 'Critical': 6}, 'Impact Statement Pending': {'Low': 3, 'Medium': 1, 'Critical': 0, 'High': 0}, 'New': {'Low': 1, 'High': 2, 'Critical': 2, 'Medium': 2}, 'Remediation Plan Pending': {'Medium': 10, 'Low': 1, 'Critical': 1, 'High': 0}, 'Remedy in Progress': {'Medium': 36, 'Low': 18, 'High': 4, 'Critical': 1}}
>>> y_values_by_label = {}
>>> for key, value_by_label in total_status_dict.items():
...     for label, value in value_by_label.items(): # total_status_dict[key] is value_by_label
...         y_values_by_label.setdefault(label, {})[key] = value
...
>>> y_values_by_label
{'High': {'Closed': 33, 'Impact Statement Pending': 0, 'New': 2, 'Remediation Plan Pending': 0, 'Remedy in Progress': 4}, 'Medium': {'Closed': 474, 'Impact Statement Pending': 1, 'New': 2, 'Remediation Plan Pending': 10, 'Remedy in Progress': 36}, 'Low': {'Closed': 47, 'Impact Statement Pending': 3, 'New': 1, 'Remediation Plan Pending': 1, 'Remedy in Progress': 18}, 'Critical': {'Closed': 6, 'Impact Statement Pending': 0, 'New': 2, 'Remediation Plan Pending': 1, 'Remedy in Progress': 1}}
setdefault(label, {}) sets y_values_by_label[label] = {} if y_values_by_label does not yet have the key label, and returns the dict stored under that key in either case.
If you want to turn this into a dict comprehension, you have to use your inefficient method:
>>> {label:{k:v for k, value_by_label in total_status_dict.items() for l, v in value_by_label.items() if l==label} for label in trace_labels}
{'High': {'Closed': 33, 'Impact Statement Pending': 0, 'New': 2, 'Remediation Plan Pending': 0, 'Remedy in Progress': 4}, 'Medium': {'Closed': 474, 'Impact Statement Pending': 1, 'New': 2, 'Remediation Plan Pending': 10, 'Remedy in Progress': 36}, 'Critical': {'Closed': 6, 'Impact Statement Pending': 0, 'New': 2, 'Remediation Plan Pending': 1, 'Remedy in Progress': 1}, 'Low': {'Closed': 47, 'Impact Statement Pending': 3, 'New': 1, 'Remediation Plan Pending': 1, 'Remedy in Progress': 18}}
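If all you need is a plain list of the values for one label, as the question asks, a single list comprehension over the inner dicts may be enough. This is a small sketch rather than a replacement for the answer above; the helper name values_for is made up, the data is copied from the question, and the order of each list follows the insertion order of total_status_dict (guaranteed from Python 3.7).

```python
total_status_dict = {
    'Closed': {'High': 33, 'Medium': 474, 'Low': 47, 'Critical': 6},
    'Impact Statement Pending': {'Low': 3, 'Medium': 1, 'Critical': 0, 'High': 0},
    'New': {'Low': 1, 'High': 2, 'Critical': 2, 'Medium': 2},
    'Remediation Plan Pending': {'Medium': 10, 'Low': 1, 'Critical': 1, 'High': 0},
    'Remedy in Progress': {'Medium': 36, 'Low': 18, 'High': 4, 'Critical': 1},
}


def values_for(label, status_dict):
    """Collect the value stored under `label` in every inner dict, in status order."""
    return [counts[label] for counts in status_dict.values()]


high_values = values_for('High', total_status_dict)
medium_values = values_for('Medium', total_status_dict)
print(high_values)
print(medium_values)
```

Each call scans the outer dict once, so building one list per label costs one pass per label; the setdefault loop above remains the better choice when every label is needed at the same time.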
Take an example collection with these documents:
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
{'name': 'a', 'sales': 0, 'leads': 2},
{'name': 'b', 'sales': 5, 'leads': 1},
{'name': 'c', 'sales': 3.3, 'leads': 1}]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
{'name': 'a', 'sales': 6, 'leads': 1},
{'name': 'b', 'sales': 6, 'leads': 3},
{'name': 'c', 'sales': 1.3, 'leads': 4}]})
How does $max determine which item in the locations array is maximal?
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b995d72eabb0f0d86dceda5'),
'maxItem': {'leads': 1, 'name': 'b', 'sales': 5}},
{'_id': ObjectId('5b995d72eabb0f0d86dceda6'),
'maxItem': {'leads': 3, 'name': 'b', 'sales': 6}}]
It looks like $max is choosing the maximum based on sales, but I am not sure why.
I discovered this
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#objects
which states:
MongoDB’s comparison of BSON objects uses the following order:
1. Recursively compare key-value pairs in the order that they appear within the BSON object.
2. Compare the key field names.
3. If the key field names are equal, compare the field values.
4. If the field values are equal, compare the next key/value pair (return to step 1). An object without further pairs is less than an object with further pairs.
This means that if sales is the first key in the BSON object, then I have my answer. I'm using pymongo, and Python dictionaries aren't guaranteed to be ordered (before Python 3.7), so I switched to bson.son.SON and redid the example:
import bson.son

client.test.foo.delete_many({})
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 0), ('leads', 2)]),
bson.son.SON([('name', 'b'), ('sales', 5), ('leads', 1)]),
bson.son.SON([('name', 'c'), ('sales', 3.3), ('leads', 1)])]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 6), ('leads', 1)]),
bson.son.SON([('name', 'b'), ('sales', 6), ('leads', 3)]),
bson.son.SON([('name', 'c'), ('sales', 1.3), ('leads', 4)])]})
And now it's sorting by name:
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b99619beabb0f0d86dcedaf'),
'maxItem': {'leads': 1, 'name': 'c', 'sales': 3.3}},
{'_id': ObjectId('5b99619beabb0f0d86dcedb0'),
'maxItem': {'leads': 4, 'name': 'c', 'sales': 1.3}}]
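The effect of key order can be imitated in plain Python without a database. The sketch below is a simplification of the quoted rules, not MongoDB's actual implementation: it only handles flat documents whose values are mutually comparable, and it ignores BSON type ordering. It relies on dicts preserving insertion order (Python 3.7+). The helper bson_like_key is a made-up name, and the sales-first key order is an assumption about how the original dictionaries were serialized.

```python
def bson_like_key(doc):
    """Sort key that compares documents pair by pair: field name first, then value.

    Python compares lists of tuples element by element, and a list that is a
    prefix of another sorts first, mirroring "an object without further pairs
    is less than an object with further pairs".
    """
    return list(doc.items())


# Assumed serialization order: sales first.
locations_sales_first = [
    {"sales": 0, "leads": 2, "name": "a"},
    {"sales": 5, "leads": 1, "name": "b"},
    {"sales": 3.3, "leads": 1, "name": "c"},
]

# Same data with name first, as in the SON version of the example.
locations_name_first = [
    {"name": "a", "sales": 0, "leads": 2},
    {"name": "b", "sales": 5, "leads": 1},
    {"name": "c", "sales": 3.3, "leads": 1},
]

max_sales_first = max(locations_sales_first, key=bson_like_key)
max_name_first = max(locations_name_first, key=bson_like_key)
print(max_sales_first["name"], max_name_first["name"])
```

With sales first the maximum is the location with the highest sales; with name first it is the location whose name sorts last, which matches the two aggregation results shown for clientA.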