Using .join to remove certain lines in a db in spark scala - scala

If I had these two datasets:
db1:
{item: "keyboard", cus_id: 1234}
{item: "mouse", cus_id: 2345}
{item: "laptop", cus_id: 3456}
{item: "charger", cus_id: 4567}
{item: "headset", cus_id: 5678}
db2:
{item: "keyboard", cus_id: 1234}
{item: "mouse", cus_id: 2345}
{item: "laptop", cus_id: 3456}
{item: "charger", cus_id: 1234}
I want to return the dataset
{item: "charger", cus_id: 4567}
{item: "headset", cus_id: 5678}
because they are entries that are contained in db1 but not db2 based on the item and cus_id. Here's what I have so far:
db1.join(db2, db1.col("item") =!= db2.col("item") && db1.col("cus_id") === db2.col("cus_id"), "inner")
but I think the logic is flawed somewhere in the join function. How can I do this correctly?

I think you should instead perform a left join with db2 on the condition that the cus_ids are equal and the items are equal too. That way, a row that exists in db1 but not in db2 will still appear in the result, but with the right-hand columns set to NULL, and you can use that to filter out the desired data:
scala> df1.join(df2, df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"), "left").toDF("item1", "id1", "item2", "id2").select("item1", "id1").where("id2 is null").toDF("item", "id").show
+-------+----+
|   item|  id|
+-------+----+
|charger|4567|
|headset|5678|
+-------+----+
This is a cleaner version of the same code:
df1
  .join(
    df2,
    df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"),
    "left"
  )
  .toDF("item1", "id1", "item2", "id2")
  .select("item1", "id1").where("id2 is null")
  .toDF("item", "id")
  .show

Are you looking for a left anti join?
db1.join(db2, Seq("item", "cus_id"), "left_anti")

Related

how to change the df column name in struct with column value

df.withColumn("storeInfo", struct($"store", struct($"inhand", $"storeQuantity")))
.groupBy("sku").agg(collect_list("storeInfo").as("info"))
.show(false)
+---+---------------------------------------------------+
|sku|info |
+---+---------------------------------------------------+
|1 |[{2222, {3, 34}}, {3333, {5, 45}}] |
|2 |[{4444, {5, 56}}, {5555, {6, 67}}, {6666, {7, 67}}]|
+---+---------------------------------------------------+
When I am sending it to Couchbase, I get:
{
  "SKU": "1",
  "info": [
    {
      "col2": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "col2": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Can we rename col2 to the value of store? I want it to look something like the output below, so that the key of every struct is the value of the store field.
{
  "SKU": "1",
  "info": [
    {
      "2222": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "3333": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Simply put, we can't construct a column exactly the way you want. There are two limitations:
The field name of a struct type must be fixed. We can change 'col2' to another name (e.g. 'fixedFieldName' in demo 1), but it can't be dynamic (similar to a Java class field name).
The key of a map type can be dynamic, but the values of a map must all be the same type; see the exception in demo 2.
Maybe you should change the schema instead; see the outputs of demos 1 and 3.
demo 1
df.withColumn(
"storeInfo", struct($"store", struct($"inhand", $"storeQuantity").as("fixedFieldName"))).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
// output:
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"store":2222,"fixedFieldName":{"inhand":3,"storeQuantity":34}},{"store":3333,"fixedFieldName":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"store":4444,"fixedFieldName":{"inhand":5,"storeQuantity":56}},{"store":5555,"fixedFieldName":{"inhand":6,"storeQuantity":67}},{"store":6666,"fixedFieldName":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
demo 2
df.withColumn(
"storeInfo",
map($"store", struct($"inhand", $"storeQuantity"), lit("Store"), $"store")).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
// output exception:
// The given values of function map should all be the same type, but they are [struct<inhand:int,storeQuantity:int>, int]
demo 3
df.withColumn(
"storeInfo",
map($"store", struct($"inhand", $"storeQuantity"))).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"2222":{"inhand":3,"storeQuantity":34}},{"3333":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"4444":{"inhand":5,"storeQuantity":56}},{"5555":{"inhand":6,"storeQuantity":67}},{"6666":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------+

how to aggregate multiple columns and generate a formatted string in python

I have a table/dataframe with the following structure:
userid item createdon
user1 item1 2020-10-01
user1 item2 2020-10-02
user1 item3 2020-10-03
user2 item4 2020-01-01
user2 item1 2020-03-03
...
for each userid, I need to generate a string in a format like this:
for user1:
{ "date": "2020-10-01",
"item": {"display": "item1"}
},
{ "date": "2020-10-02",
"item": {"display": "item2"}
},
{ "date": "2020-10-03",
"item": {"display": "item3"}
}
for user2:
{ "date": "2020-01-01",
"item": {"display": "item4"}
},
{ "date": "2020-03-03",
"item": {"display": "item1"}
}
I am using pyspark. I wonder if I could achieve this by utilizing a map transformation? Any help would be appreciated.
I could get the example working. It writes out one file per user.
Ref : Spark: write JSON several files from DataFrame based on separation by column value
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
from pyspark.sql.functions import *
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("user1", "item1", "2020-10-01"),
("user1", "item2", "2020-10-02"),
("user1", "item3", "2020-10-03"),
("user2", "item4", "2020-01-01"),
("user2", "item1", "2020-03-03")
]
df1Columns = ["userid", "item", "createdon"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1.printSchema()
df1.show(truncate=False)
partialSchema = StructType([StructField("date", StringType(), nullable=True)
, StructField("item", StructType([StructField("display", StringType(), nullable=True)]), nullable=True)
])
actualSchema = StructType([StructField("userid", StringType(), nullable=True)
, StructField("dict", partialSchema, nullable=True)
])
res_df = df1.rdd.map(lambda row: Row(row[0], {"date": row[2], "item" : {'display':row[1]}}))\
.toDF(actualSchema)
res_df.show(20, False)
res_df.repartition(col("userid")).select(col("userid"), col("dict.*")).write.partitionBy("userid").json("./helloworld/data/")
The last line writes out two files, one per user.
Content of first user file:
{"date":"2020-10-01","item":{"display":"item1"}}
{"date":"2020-10-02","item":{"display":"item2"}}
{"date":"2020-10-03","item":{"display":"item3"}}
Content of second user file:
{"date":"2020-01-01","item":{"display":"item4"}}
{"date":"2020-03-03","item":{"display":"item1"}}

Is it possible to do a subquery to return an array for the $nin operator in MongoDB?

I have a data set that looks something like:
{"key": "abc", "val": 1, "status": "np"}
{"key": "abc", "val": 2, "status": "p"}
{"key": "def", "val": 3, "status": "np"}
{"key": "ghi", "val": 4, "status": "np"}
{"key": "ghi", "val": 5, "status": "p"}
I want a query that returns document(s) that have status="np", but only where there is no other document with the same key that has a status value of "p". So the only document returned from the data set above would be the one with key="def", since "abc" has a document with status "np" but also one with status "p", and the same is true for key="ghi". I came up with something close, but I don't think the $nin operator supports a distinct query.
db.test2.find({$and: [{"status": "np"}, {"key": {$nin: [<distinct value query>]}}]})
If I were to hardcode the values in the $nin array, it would work:
db.test2.find({$and: [{"status":"np"}, {"key": {$nin:['abc', 'ghi']}}]})
I just need to be able to write a find inside the square brackets. I could do something like:
var res=[];
res = db.test2.distinct("key", {"status": "p"});
db.test2.find({$and: [{"status":"np"}, {"key": {$nin:res}}]});
But the problem with this is that in the time between the two queries, another process may update the "status" of a document and then I'd have inconsistent data.
Try this aggregation: group by key, collect every status for that key into an array, then keep only the groups whose array contains "np" but not "p":
db.test2.aggregate([
    {$group: {_id: '$key', st: {$push: '$status'}}},
    {$project: {st: 1, hasNp: {$in: ['np', '$st']}, hasP: {$in: ['p', '$st']}}},
    {$match: {hasNp: true, hasP: false}}
]);
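For completeness, if you are issuing this from Python, the same single-pass pipeline can be run through PyMongo; this is only a sketch, and the client/database names are placeholders:
from pymongo import MongoClient

coll = MongoClient()["mydb"]["test2"]  # placeholder connection and database names

# Group all status values per key, then keep only keys that have "np" but no "p".
pipeline = [
    {"$group": {"_id": "$key", "st": {"$push": "$status"}}},
    {"$project": {"st": 1,
                  "hasNp": {"$in": ["np", "$st"]},
                  "hasP": {"$in": ["p", "$st"]}}},
    {"$match": {"hasNp": True, "hasP": False}},
]
for doc in coll.aggregate(pipeline):
    print(doc)  # e.g. {'_id': 'def', 'st': ['np'], 'hasNp': True, 'hasP': False}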

Add a new field with large number of rows to existing collection in Mongodb

I have an existing collection with close to 1 million docs, and now I'd like to append new field data to this collection. (I'm using PyMongo.)
For example, my existing collection db.actions looks like:
...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...
Now I want to append a new field of data to this existing collection:
...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...
such that the resulting collection should look like this:
...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...
I know we can do this with update_one in a for loop, e.g.
for doc in values:
    collection.update_one({'_id': doc['_id']},
                          {'$set': {k: doc[k] for k in fields}},
                          upsert=True
                          )
where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field's key-value pair. fields contains all the new fields I'd like to add.
However, the issue is that I have a million docs to update, and anything with a for loop is way too slow. Is there a way to append this new field faster? Something similar to insert_many, except appending to an existing collection?
===============================================
Update1:
So this is what I have for now,
bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
    bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields} })
bulk.execute()
I first wrote a sample dataframe into the db with insert_many; the performance:
Time spent in insert_many: total: 0.0457min
Then I used update_one with a bulk operation to add two extra fields to the collection; I got:
Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
Update2:
I added an extra column to both the existing collection and the new column data, so that a left join can be used to solve this. If you use a left join you can ignore the _id field.
For example, my existing collection db.actions looks like:
...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...
Now I want to append a new column field data to this existing collection:
...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...
such that the resulting collection should look like this:
...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and then batch your bulk changes, which is pretty much described here:
Bulk update in Pymongo using multiple ObjectId
The "unordered" bulk writes can be slightly faster (although in my tests they weren't) but I'd still vote for the ordered approach for error handling reasons mainly).
If, however, you can group your changes into specific recurring patterns, then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issuing those, each targeting a number of documents. My Python is too poor at this point to write that entire code for you, but here's a pseudocode example of what I mean:
Let's say you've got the following update dictionary:
{
    key: "doc1",
    value:
    [
        { "field1", "value1" },
        { "field2", "value2" },
    ]
}, {
    key: "doc2",
    value:
    [
        // same fields again as for "doc1"
        { "field1", "value1" },
        { "field2", "value2" },
    ]
}, {
    key: "doc3",
    value:
    [
        { "someotherfield", "someothervalue" },
    ]
}
Then, instead of updating the three documents separately, you would send one update covering the first two documents (since they require identical changes) and then one update for "doc3". The more knowledge you have upfront about the structure of your update patterns, the more you can optimize this, even by grouping updates of subsets of fields, but that's probably getting a little complicated at some point...
UPDATE:
As per your request below, let's give it a shot.
fields = ['C']
values = [
    {'_id': 'doc1a', 'C': 'v1'},
    {'_id': 'doc1b', 'C': 'v1'},
    {'_id': 'doc2a', 'C': 'v2'},
    {'_id': 'doc2b', 'C': 'v2'}
]

print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])

transposed_values = {}
for doc in values:
    transposed_values[doc['C']] = transposed_values.get(doc['C'], [])
    transposed_values[doc['C']].append(doc['_id'])

print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)

for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
Since your join collection has fewer documents, you can convert the dateTime to a date:
db.new.find().forEach(function(d){
    d.date = d.dateTime.substring(0,10);
    db.new.update({_id : d._id}, d);
})
and then do a multi-field lookup based on date (the substring of dateTime) and _id, writing the output to a new collection (enhanced):
db.old.aggregate(
    [
        {$lookup: {
            from : "new",
            let : {id : "$_id", date : {$substr : ["$dateTime", 0, 10]}},
            pipeline : [
                {$match : {
                    $expr : {
                        $and : [
                            {$eq : ["$$id", "$_id"]},
                            {$eq : ["$$date", "$date"]}
                        ]
                    }
                }},
                {$project : {_id : 0, C : "$C"}}
            ],
            as : "newFields"
        }},
        {$project : {
            _id : 1,
            A : 1,
            B : 1,
            C : {$arrayElemAt : ["$newFields.C", 0]},
            date : {$substr : ["$dateTime", 0, 10]}
        }},
        {$out : "enhanced"}
    ]
).pretty()
result
> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
>

How to filter in mongodb dynamically?

I am creating an app that can filter data dynamically.
If I select "John", "US", and leave the sex blank, it will return no results because the query will search for a sex that is blank. I can't figure out how to filter that dynamically in MongoDB.
ex:
var fName="John",
fCountry="US",
fSex="";
db.users.find({ $and: [ {sex: fSex}, {first_name: fName}, {country: fCountry} ]})
That query will return none.
I want the code to return an answer like this if I select "John", "US":
{"_id": <object>, "first_name": "John", "sex": "Male", "country": "US"}
users:
{"_id": <object>, "first_name": "John", "sex": "Male", "country": "US"},
{"_id": <object>, "first_name": "Rex", "sex": "Male", "country": "Mexico"},
{"_id": <object>, "first_name": "Jane", "sex": "Female", "country": "Canada"}
Thank you in advance! By the way, I am new to Mongo.
You can build the query object instead of assuming it's going to look like a specific structure.
You can add whatever checks you'd like.
var fName="John",
fCountry="US",
fSex="";
var query = { $and: [] };
if (fName !== "") { query.$and.push({name: fName}); }
if (fCountry !== "") { query.$and.push({country: fCountry}); }
if (fSex !== "") { query.$and.push({sex: fSex}); }
db.users.find(query);
Update:
As per #semicolon's comment, the $and here is unnecessary as mongo, by default, will "and" different fields together. A simpler solution reads:
var query = {};
...
query.first_name = fName; // et cetera.
I'll add that it may become necessary to use the key $and and other operators when building more elaborate queries.
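If you happen to be doing this from Python with PyMongo rather than the shell, the same "only add the fields the user filled in" idea translates directly; here is a minimal sketch, with placeholder client and database names:
from pymongo import MongoClient

users = MongoClient()["mydb"]["users"]  # placeholder connection and database names

f_name, f_country, f_sex = "John", "US", ""

# Build the filter only from the inputs that were actually provided.
query = {}
if f_name:
    query["first_name"] = f_name
if f_country:
    query["country"] = f_country
if f_sex:
    query["sex"] = f_sex

for doc in users.find(query):
    print(doc)  # e.g. {'_id': ..., 'first_name': 'John', 'sex': 'Male', 'country': 'US'}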