Combine multiple rows as JSON object in PySpark

I am very new to PySpark and want to perform the following operation on a DataFrame. For rows that share the same id, I need to combine the associated columns into a JSON block. As shown in the example below, the output should be one JSON block with the columns secId, names and path.
id | secId | names | path | bin
1 | 12 | [{"area": "en", "value": "name1"}, {"area": "sp", "value": "name2"}] | [abc, xyz] | bin1
1 | 13 | [{"area": "en", "value": "name3"}, {"area": "sp", "value": "name4"}] | [klm, nop] | bin1
Need output as:
id | bin | json
1 | bin1 | [{"secId": 12, "names": [{"area": "en", "value": "name1"}, {"area": "sp", "value": "name2"}], "path": [abc, xyz]}, {"secId": 13, "names": [{"area": "en", "value": "name3"}, {"area": "sp", "value": "name4"}], "path": [klm, nop]}]
It would be helpful if anyone can provide some guidelines on doing this.
Thank you

Use Spark's struct function to combine the columns into a single struct column (a set of named fields).
Use Spark's to_json function to serialize that struct as a JSON string.
Then group by the id and bin columns and use collect_list to gather the JSON strings for each group.
import pyspark.sql.functions as F
# Build one JSON string per row, then collect them per (id, bin) group
(df.withColumn('json', F.to_json(F.struct('secId', 'names', 'path')))
   .groupby('id', 'bin')
   .agg(F.collect_list('json').alias('json'))
   .show(5, False))
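To see what shape the result takes without starting a Spark session, here is a minimal plain-Python sketch of the same group-and-collect step. The sample rows mirror the question's table, with the path entries assumed to be strings. The function name combine_rows is hypothetical, and this only illustrates the logic; it is not how Spark executes the query.

```python
import json
from collections import defaultdict

# Sample rows mirroring the question's table (path values assumed to be strings)
rows = [
    {"id": 1, "secId": 12, "bin": "bin1", "path": ["abc", "xyz"],
     "names": [{"area": "en", "value": "name1"}, {"area": "sp", "value": "name2"}]},
    {"id": 1, "secId": 13, "bin": "bin1", "path": ["klm", "nop"],
     "names": [{"area": "en", "value": "name3"}, {"area": "sp", "value": "name4"}]},
]

def combine_rows(rows):
    """Group rows by (id, bin) and serialize the remaining columns as one JSON array."""
    grouped = defaultdict(list)
    for row in rows:
        # Plays the role of struct("secId", "names", "path")
        block = {key: row[key] for key in ("secId", "names", "path")}
        grouped[(row["id"], row["bin"])].append(block)
    # Plays the role of collect_list, followed by JSON serialization
    return [
        {"id": id_, "bin": bin_, "json": json.dumps(blocks)}
        for (id_, bin_), blocks in grouped.items()
    ]

result = combine_rows(rows)
print(result[0]["json"])
```

Each (id, bin) pair yields one output row whose json value is an array with one object per input row.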

Related

Confusion with the order of logical operators in Mongo

I am new to MongoDB and, while practising, I came across a weird problem. The documents look like this:
{
"_id" : ObjectId("5c8eccc1caa187d17ca6ed29"),
"city" : "CLEVELAND",
"zip" : "35049",
"loc" : {
"y" : 33.992106,
"x" : 86.559355
},
"pop" : 2369,
"state" : "AL"
} ...
I want to find the number of cities that have a population of more than 5000 but less than 1000000.
Both these queries, this:
db.zips.find({"$nor":[{"pop":{"$lt":5000}},{"pop":{"$gt":"1000000"}}]}).count()
and this:
db.zips.find({"$nor":[{"pop":{"$gt":1000000}},{"pop":{"$lt":"5000"}}]}).count()
give different results.
The first one gives 11193 and the second one gives 29470. Coming from a MySQL background, I see no difference between the two queries. To me they are the same, and both should return the number of zip codes with a population of less than 1000000 and more than 5000. Please help me understand.
Thanks in advance.
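Comparison operators such as $lt and $gt only give the result you expect when the operand has the same data type as the field.
Your first query quotes "1000000" and your second query quotes "5000", so the two queries are not the same: each one compares pop against a number in one clause and against a string in the other. Since pop is stored as a number, the clause with the quoted value matches no documents, and each query effectively applies only one of the two bounds.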
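The effect of the quoted operand can be illustrated with a small plain-Python sketch. It assumes that a comparison clause matches only when the field and the operand have the same kind of type (numbers with numbers, strings with strings). The helper names and the sample populations are made up for illustration, and documents with a missing field are not modelled.

```python
def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def compare(field_value, op, operand):
    """Mimic $lt / $gt: a clause matches only when both sides have the same kind of type."""
    same_type = (is_number(field_value) and is_number(operand)) or (
        isinstance(field_value, str) and isinstance(operand, str)
    )
    if not same_type:
        return False
    return field_value < operand if op == "$lt" else field_value > operand

def nor(doc, clauses):
    # $nor keeps documents that fail every clause
    return not any(compare(doc[field], op, operand) for field, op, operand in clauses)

docs = [{"pop": p} for p in (2369, 5000, 40000, 2000000)]  # made-up sample values

first = [d["pop"] for d in docs
         if nor(d, [("pop", "$lt", 5000), ("pop", "$gt", "1000000")])]
second = [d["pop"] for d in docs
          if nor(d, [("pop", "$gt", 1000000), ("pop", "$lt", "5000")])]
fixed = [d["pop"] for d in docs
         if nor(d, [("pop", "$lt", 5000), ("pop", "$gt", 1000000)])]

print(first)   # only the lower bound is applied
print(second)  # only the upper bound is applied
print(fixed)   # both bounds are applied
```

With both operands written as numbers, the two clause orders give the same result.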

Auto bucket separation in aggregate query in Mongo 3.2

I have data in a collection 'Test' with entries like the following:
{
"_id" : ObjectId("588b65f1e9a1e01dfc55a5ff"),
"SALES_AMOUNT" : 4500
},
{
"_id" : ObjectId("588b65f1e9a1e01dfct5a5ff"),
"SALES_AMOUNT" : 500
}
and so on.
I want to separate the entries equally into 10 buckets.
E.g.:
If the collection has 50 entries in total, then it should be like:
first_bucket : First 5 entries from test collection
second_bucket : Next 5 entries from test collection
....
tenth_bucket : Last 5 entries from test collection.
If the total entry count is 101, then it should be like:
first_bucket : First 10 entries from test collection
second_bucket : Next 10 entries from test collection
....
tenth_bucket : Last 11 entries from test collection (because there is 1 additional entry).
$bucket and $bucketAuto are available in MongoDB 3.4, but I use MongoDB 3.2.
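The bucket arithmetic described above can be sketched client-side in plain Python. This assumes the documents have already been fetched from the collection in the desired order; it is not a server-side aggregation, and the function name split_into_buckets is hypothetical.

```python
def split_into_buckets(entries, bucket_count=10):
    """Split entries into bucket_count consecutive buckets of equal size.

    Any remainder goes into the last bucket, as in the 101-entry example.
    """
    base_size = len(entries) // bucket_count
    buckets = []
    for index in range(bucket_count):
        start = index * base_size
        # The last bucket runs to the end so that it absorbs the remainder
        end = (index + 1) * base_size if index < bucket_count - 1 else len(entries)
        buckets.append(entries[start:end])
    return buckets

sizes_50 = [len(b) for b in split_into_buckets(list(range(50)))]
sizes_101 = [len(b) for b in split_into_buckets(list(range(101)))]
print(sizes_50)
print(sizes_101)
```

One design choice to note: with fewer entries than buckets, the base size is 0 and every entry lands in the last bucket.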

mongo search sub string

Hello all, I am working on an application using MongoDB and I need to find a substring in a collection.
I have a collection Query, as shown below.
> db.Query.find();
>{ "_id" : ObjectId("54c9ec8ead38d420d87743b0"), "QueryID" : 1, "QueryString" : "List my games", "QueryFrequency" : 9, "QueryResultset" : 3 }
>...
Now I want to search for a substring in QueryString,
e.g. "games" in "QueryString" : "List my games".
For this I enabled indexing on QueryString, and after running the following command I do get some results.
> db.Query.runCommand("text" , {search : "games"});
"text" is the name given to my index.
The problem is that I only get results when the word I am searching for has more than 3 characters.
In my example I get results when I search for the word "List" or "games",
but searching for "my", or any other word with fewer than 4 characters, gives no result.
Is there any way to solve this, or am I missing some setting?
You can use a regular expression for this. Your query would look like:
db.Query.find({ "QueryString": { $regex: your_regex } })
For a case-insensitive search you could use:
db.Query.find({ "QueryString": /my/i })
where the i flag stands for case-insensitive.
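When the search text comes from user input, it is usually worth escaping it before using it as a pattern. Here is a minimal Python sketch that builds such a filter document as a plain dict and checks the same pattern locally with the standard re module. The helper name substring_filter is hypothetical, and passing the dict to a driver's find call is left as an assumption rather than shown.

```python
import re

def substring_filter(field, text):
    """Build a case-insensitive $regex filter that treats text literally."""
    # re.escape keeps characters such as "." or "*" in the search text literal
    return {field: {"$regex": re.escape(text), "$options": "i"}}

query = substring_filter("QueryString", "my")
print(query)

# Local check of the same pattern against the sample value from the question
pattern = re.compile(query["QueryString"]["$regex"], re.IGNORECASE)
matched = bool(pattern.search("List my games"))
print(matched)
```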

Sort collection and assign ranked values to documents over multiple columns (MongoDB)

I have a collection of documents that looks like:
[{ id : 1, a : 123, b : 32, name : 'test'}, { id : 2, a : 23, b : 32, name : 'another'}]
I am trying to sort over column a and then add another column to each document containing the rank of each value (where ties are allowed and the average is taken). It seems like I should be using the MongoDB aggregation framework, but I cannot figure out how to sort, and then assign the rank to another column. In the end I should end up with :
[
{ id : 1, a : 123, b : 32, name : 'test', aRank : 1, bRank : 1.5},
{ id : 2, a : 23, b : 32, name : 'another', aRank : 2, bRank : 1.5}
]
Any help? Thank you
Since no answer was ever given, and my searches turned up nothing, I built an npm module that provides standard and fractional ranking for any JS array over a numeric column.
I never did come up with a way of doing it with aggregation in mongodb, but this solution works well enough for me.
npm module: https://www.npmjs.org/package/rank.js
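For readers who want to see the idea behind fractional ranking, here is a minimal client-side sketch in Python. It is not the code of the npm module above. The function name fractional_ranks is hypothetical; the sample documents use b = 32 for both entries, as in the expected output, and the largest value gets rank 1 because that is what the expected output shows.

```python
def fractional_ranks(values, descending=True):
    """Return one rank per value; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend end to cover the run of values equal to the one at pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = ((pos + 1) + (end + 1)) / 2  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks

docs = [
    {"id": 1, "a": 123, "b": 32, "name": "test"},
    {"id": 2, "a": 23, "b": 32, "name": "another"},
]

# Add an aRank and a bRank field to every document
for column in ("a", "b"):
    column_ranks = fractional_ranks([doc[column] for doc in docs])
    for doc, rank in zip(docs, column_ranks):
        doc[column + "Rank"] = rank

print(docs)
```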

MongoDB Type Order

I have this collection in MongoDB. It contains
values of different types under the val key.
Also, note that I am sorting it by val ascending.
[test] 2014-02-20 08:53:11.857 >>> db.account.find().sort({val:1});
{
"_id" : ObjectId("5304d25786dd4b348bcc2b2e"),
"username" : "usr10",
"password" : "123",
"val" : [ ]
}
{
"_id" : ObjectId("5304d29986dd4b348bcc2b2f"),
"username" : "usr20",
"password" : "456",
"val" : null
}
{
"_id" : ObjectId("5304e31686dd4b348bcc2b37"),
"username" : "usr80",
"password" : "555",
"val" : 1
}
{
"_id" : ObjectId("5304d50a86dd4b348bcc2b32"),
"username" : "usr50",
"password" : "555",
"val" : [
40
]
}
{
"_id" : ObjectId("5304d4c886dd4b348bcc2b31"),
"username" : "usr40",
"password" : "777",
"val" : 200
}
{
"_id" : ObjectId("5304d2a186dd4b348bcc2b30"),
"username" : "usr30",
"password" : "888",
"val" : {
}
}
{
"_id" : ObjectId("5304d97786dd4b348bcc2b33"),
"username" : "usr50",
"password" : "555",
"val" : {
"ok" : 1
}
}
{
"_id" : ObjectId("5304e2dc86dd4b348bcc2b36"),
"username" : "usr80",
"password" : "555",
"val" : true
}
{
"_id" : ObjectId("5304e22f86dd4b348bcc2b34"),
"username" : "usr60",
"password" : "555",
"val" : ISODate("2014-02-19T16:56:15.787Z")
}
{
"_id" : ObjectId("5304e2c786dd4b348bcc2b35"),
"username" : "usr70",
"password" : "555",
"val" : /abc/
}
[test] 2014-02-20 08:53:19.357 >>>
I am reading a book which says the following.
MongoDB has a hierarchy as to how types compare. Sometimes you will have
a single key with multiple types: for instance, integers and booleans, or strings
and nulls. If you do a sort on a key with a mix of types, there is a predefined
order that they will be sorted in. From least to greatest value, this ordering
is as follows:
1. Minimum value
2. null
3. Numbers (integers, longs, doubles)
4. Strings
5. Object/document
6. Array
7. Binary data
8. Object ID
9. Boolean
10. Date
11. Timestamp
12. Regular expression
13. Maximum value
So why is my sorting order different? For example,
when I sort (see above) I see these strange things:
1) I have no idea what 'minimum value' and 'maximum value' mean.
2) An array comes before a number. And an empty
array comes even before null.
3) The number 1 comes before an array
4) The array [40] comes between numbers 1 and 200.
Could someone explain this result in some detail?
Many thanks in advance.
Your book says the same as the official documentation, but that does not explain the obscure sort order of the two arrays. At least the two types Minimum value and Maximum value are explained there: they are internal types.
The type order is only used when there isn't another supported way of ordering elements. Array fields have their own sorting behavior: the minimum value of their elements is used on an ascending sort, and the maximum value on a descending sort. The type of that minimum or maximum value is then used to order the docs among the other fields of that type.
So [40] comes after 1, but before 200 because the minimum value of that array is 40.
The empty array has no value at all, which is why it ends up with the doc where the value is null. If I reverse the sort, they stay in the same order which implies that MongoDB considers them equal.
Where is the sort clause in your query? Your sort order appears to be the default order: notice the ascending ObjectIds. You mentioned you're sorting by val, so I would expect your query to be
db.account.find().sort({val: 1})
MongoDB type comparison
Why
MongoDB is a schemaless database, which allows you to store pieces of information (documents) without defining a structure (a schema of fields), unlike SQL, where we have to define a schema in the form of columns and their data types.
In the case of sorting or data retrieval this may be problematic. In order to be predictable, MongoDB has a fixed type order to sort documents of different types, as you already mentioned:
Minimum value (MinKey)
Null
Numbers (ints, longs, doubles, decimals)
Strings
Object/document
Array
Binary data
ObjectId
Boolean
Date
Timestamp
Regex
Maximum value (MaxKey)
In the process of sorting, values get compared in order to decide how they should be positioned.
This list defines, from lowest to highest, what should happen when these data types get compared.
{ "a": 0 }
{ "a": "a" }
{ "a": 1 }
When sorted by a ascending, numbers precede strings, because Numbers (3) comes before Strings (4) in the list.
Conversions
For certain data types MongoDB tries to convert them when compared (If you know JavaScript this should feel familiar).
[] -> null
[[]] -> Array
// but
"" -> String
0 -> Number
...
For one dimensional Arrays, their conversions depend on the order of sorting.
// ascending
[1, 2, 3, 4, 5] (Array) -> 1 (Int)
[5, 8, 10] (Array) -> 5 (Int)
// descending
[1, 2, 3, 4, 5] (Array) -> 5 (Int)
[5, 8, 10] (Array) -> 10 (Int)
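The rules so far can be condensed into a small Python sketch of a sort key. It only covers null, numbers, strings, booleans and flat arrays of those; the type positions are taken from the list above, and the empty array is treated like null as described above. The function names are hypothetical and this is an illustration of the described behavior, not MongoDB's implementation.

```python
def scalar_key(value):
    """Return (type position, value) so that values of different types compare by type first."""
    if value is None:
        return (2, 0)
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return (9, int(value))
    if isinstance(value, (int, float)):
        return (3, value)
    if isinstance(value, str):
        return (4, value)
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")

def sort_key(value, descending=False):
    """Arrays are represented by their smallest (ascending) or largest (descending) element."""
    if isinstance(value, list):
        if not value:
            return scalar_key(None)  # empty array treated like null
        keys = [scalar_key(element) for element in value]
        return max(keys) if descending else min(keys)
    return scalar_key(value)

values = [200, [], True, [40], None, "a", 1]

# Python's sort is stable, so [] and None keep their original relative order
ascending = sorted(values, key=sort_key)
descending = sorted(values, key=lambda v: sort_key(v, descending=True), reverse=True)

print(ascending)
print(descending)
```

Because the sort is stable, the empty array and null stay in their input order in both directions, which matches the observation that MongoDB appears to treat them as equal.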
MinKey / MaxKey
MinKey is the lowest possible value in every comparison, MaxKey the highest. Both are used internally, as they are always at the beginning or end of a collection when sorted.
Questions
I have no idea what 'minimum value' and 'maximum value' mean.
These are MinKey and MaxKey. See above.
An array comes before a number. And an empty array comes even before null.
It does not. As already explained, an empty array [] is always converted to null. When comparing equal values, like null and null or 40 and [40], the documents are sorted by natural order, which in your case is the timestamp in the ObjectId.
Try creating first the null entry and then the empty array.
The number 1 comes before an array
[40] is not an Array since it's converted to a Number. [[40]] would be an Array.
The array [40] comes between numbers 1 and 200.
I think you've got it. See 3.
Sources and further reading
https://www.w3resource.com/mongodb/mongodb-data-types.php
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/
https://docs.mongodb.com/manual/reference/bson-types/