Iterate over rows and create batches: DataFrame - Scala

I have a DataFrame with millions of rows and I am iterating over them using the following code:
df.foreachPartition { dataSetPartition =>
  dataSetPartition.foreach(row => {
    // DO SOMETHING like DB write / s3 publish
  })
}
Now I want to batch the rows, so I changed the code to:
df.foreachPartition { dataSetPartition =>
  val rowBuffer = scala.collection.mutable.ListBuffer[Row]()
  dataSetPartition.foreach(row => {
    rowBuffer += row
    if (rowBuffer.size == 1000) {
      // DO ACTION like DB write/s3 publish <- DO_ACTION
      rowBuffer.clear()
    }
  })
  if (rowBuffer.nonEmpty) {
    // DO ACTION like DB write/s3 publish <- DO_ACTION
    rowBuffer.clear()
  }
}
The problem with this approach is that DO_ACTION is repeated twice. I do not want to call dataSetPartition.size to get the row count beforehand, because the partition is lazily evaluated and counting it could be a costly operation (it would also consume the iterator).
Version:
Scala: 2.11
Spark: 2.2.1

I would suggest using Scala's grouped method to create the batches:
df.foreachPartition { dataSetPartition =>
  dataSetPartition.grouped(1000).foreach(batch => {
    // DO ACTION like DB write/s3 publish <- DO_ACTION
  })
}
Note that grouped on an Iterator is lazy: it materializes at most 1000 rows at a time, so this stays memory-safe for large partitions, and the trailing partial batch is handled for you.

Related

Cloud Functions: Transaction with FieldValue.increment() not running atomically

I have a Cloud Functions transaction that uses FieldValue.increment() to update a nested map, but it isn't running atomically, so the value updates aren't accurate (running the transaction in quick succession results in an incorrect count).
The function is fired via:
export const updateCategoryAndSendMessage = functions.firestore.document('events/{any}').onUpdate((event, context) => {
which includes the following transaction:
db.runTransaction(async tx => {
  const categoryCounterRef = db.collection("data").doc("categoryCount")
  const intToIncrement = event.after.data().published == true ? 1 : -1;
  const location = event.after.data().location;
  await tx.get(categoryCounterRef).then(doc => {
    for (const key in event.after.data().category) {
      event.after.data().category[key].forEach((subCategory) => {
        const map = { [key]: { [subCategory]: FieldValue.increment(intToIncrement) } };
        tx.set(categoryCounterRef, { [location]: map }, { merge: true })
      })
    }
  }).then(result => {
    console.info('Transaction success!')
  }).catch(err => {
    console.error('Transaction failure:', err)
  })
}).catch((error) => console.log(error));
Example:
Value of field to increment: 0
Tap a button that triggers the function multiple times in quick succession (to toggle "published" between true and false)
Expected value: 0 or 1 (depending on whether reference document value is true or false)
Actual value: -3, 5, -2 etc.
As far as I'm aware, transactions should be performed "first come, first served" to avoid inaccurate data. It seems like the function isn't "queuing up" correctly, for lack of a better word.
I'm a bit stumped, would greatly appreciate any guidance with this.
Oh goodness, I was missing return...
return db.runTransaction(tx => {
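For completeness, a sketch of the corrected handler (assuming the same db, FieldValue, and document layout as in the question); the key point is that the transaction promise is returned from onUpdate, so Cloud Functions keeps the instance alive until the transaction commits:
export const updateCategoryAndSendMessage = functions.firestore.document('events/{any}').onUpdate((event, context) => {
  const categoryCounterRef = db.collection("data").doc("categoryCount")
  const intToIncrement = event.after.data().published == true ? 1 : -1;
  const location = event.after.data().location;
  // Returning the promise is the actual fix: without it the function
  // instance may be torn down before the transaction has committed.
  return db.runTransaction(async tx => {
    // Read inside the transaction so concurrent runs are serialized
    await tx.get(categoryCounterRef);
    for (const key in event.after.data().category) {
      for (const subCategory of event.after.data().category[key]) {
        const map = { [key]: { [subCategory]: FieldValue.increment(intToIncrement) } };
        tx.set(categoryCounterRef, { [location]: map }, { merge: true });
      }
    }
  });
});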

Massive Insert from a JSON file using Pg-Promise

Load huge JSON file using Pg-Promise helpers and fs stream.
I'm using pg-promise and I want to make massive inserts into a table using pgp.helpers. I've seen solutions like "Multi-row insert with pg-promise" and have also followed "Data import for streams" (spex), but it still fails with the same error as in this post: https://github.com/vitaly-t/spex/issues/8
I tried using an example from the other post on CSV streams (rs.csv()), but when I replaced it with a JSONStream parser I still got the same error.
Can you please share a working example?
db.tx(t => {
  return streamRead.call(t, stream.pipe(parser), receiver)
})
There might be a better way to do it, but the below code sure works!
I have the chunk size (row.length) at 20,000 per insert statement; you can adjust it based on your needs.
let row = []  // buffer of parsed rows, flushed every 20,000

stream.pipe(parser)
parser.on('data', data => {
  row.push(data)
  if (row.length === 20000) {
    parser.pause()
    db.tx('inserting-products', t => {
      const insert = pgp.helpers.insert(row, cs)
      // Return the query promise so the transaction waits for it
      return t.none(insert).then(() => {
        row = []
        parser.resume()
      })
    })
  }
})
parser.on('end', () => {
  // Flush whatever is left after the last full chunk
  if (row.length !== 0) {
    db.tx('inserting-products', t => {
      const insert = pgp.helpers.insert(row, cs)
      return t.none(insert).then(() => {
        console.log('success')
        db.$pool.end()
      })
    })
  } else {
    console.log('success')
    db.$pool.end()
  }
})
Please let me know in comments if this helps or any other ways to improve the process.
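If you want to avoid pausing and resuming the parser by hand, pg-promise's documented data-import pattern wraps the whole import in one transaction and pulls chunks on demand via t.sequence. A sketch, where getNextChunk(index) is a hypothetical function that resolves with the next array of parsed rows, or undefined once the stream is exhausted:
db.tx('massive-insert', t => {
  // t.sequence keeps invoking the factory until it resolves with undefined
  return t.sequence(index => {
    return getNextChunk(index).then(rows => {
      if (rows && rows.length) {
        const insert = pgp.helpers.insert(rows, cs)
        return t.none(insert)
      }
      // resolving with undefined ends the sequence
    })
  })
})
Note this also means a failure anywhere rolls back the entire import, which may or may not be what you want for a huge file.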

mongodb: collection.update inside cursor.each resolves before update is completed

First of all, I'm using mongodb-promise as a wrapper around MongoClient.
I need to fetch all records from a collection "people" that match specific criteria and then update each of them.
For that I have this code to find all people:
return db.collection('people')
  .then((collection) => {
    // Store reference to collection for future use
    peopleCollection = collection;
    return collection.find({a: 1})
  })
And then invoke this to update each record:
.then((people) => {
  // Process each person
  return people.each((person) => {
    person.b = 2;
    // Where peopleCollection is a reference to my collection
    return peopleCollection.update({_id: person._id}, person)
  })
})
I then add another promise to the chain that fetches all people where b != 2, and it finds many records. When I execute this script repeatedly, the count decreases, which means Mongo is still updating records after the promise has already resolved. What am I missing here?
Maybe:
.then((people) => {
  // Process each person
  return people.each((person) => {
    // Where peopleCollection is a reference to my collection
    return peopleCollection.update({_id: person._id}, {$set: {b: 2}})
  })
})
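The likely root cause is that each fires its callback once per document, but nothing collects the update promises, so the outer chain resolves before the writes finish. One way around this, sketched below under the assumption that the mongodb-promise cursor exposes a promise-returning toArray like the underlying driver: materialize the matches first, then wait for all updates with Promise.all.
.then((people) => {
  // Materialize the cursor, then wait for every update to complete
  return people.toArray().then((docs) => {
    return Promise.all(docs.map((person) =>
      peopleCollection.update({_id: person._id}, {$set: {b: 2}})
    ));
  });
})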

Query key with value anywhere in object hierarchy in Mongo

In Mongo, how can I find all documents that have a given key and value, regardless of where that key appears in the document's key/value hierarchy?
For example, the input key roID and value '5' would match both:
{
  roID: '5'
}
and
{
  other: {
    roID: '5'
  }
}
There is no built-in way to do this. You would have to scan each matched document recursively to try to locate the attribute, which is not recommended. You might want to think about restructuring your data, or manipulating it into a more unified format, so that it is easier (and faster) to query.
If your desired key appears in a fixed number of known locations, you could use the $or operator to scan all the possibilities.
Taking your sample documents as an example, the query would look something like this:
db.data.find({ "$or": [
  { "roID": '5' },
  { "other.roID": '5' },
  { "foo.bar.roID": '5' }
  // ...any other possible locations of roID
]})
If the number of documents in the collection is not too large, it can also be done with a stored JavaScript function and $where (note that $where runs JavaScript against every document and cannot use indexes, so it is slow on big collections):
db.system.js.save({
  _id: "keyValueExisted",
  value: function (key, value) {
    function findme(obj) {
      for (var x in obj) {
        var v = obj[x];
        if (x == key && v == value) {
          return true;
        } else if (v instanceof Object) {
          if (findme(v)) return true;
        }
      }
      return false;
    }
    return findme(this);
  }
});

var param = ['roID', '5'];
db.c.find({$where: "keyValueExisted.apply(this, " + tojsononeline(param) + ");"});
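If you would rather not store the helper in db.system.js, the shell also accepts a plain function for $where, so the same recursive check can be inlined; a sketch with the key and value hard-coded (the same full-collection-scan caveat applies):
db.c.find({ $where: function () {
  // Recursively walk the document looking for key 'roID' with value '5'
  function findme(obj) {
    for (var x in obj) {
      var v = obj[x];
      if (x == 'roID' && v == '5') return true;
      if (v instanceof Object && findme(v)) return true;
    }
    return false;
  }
  return findme(this);
}});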

How should I structure my nested reactivemongo calls in my play2 application?

I'm in the process of trying to combine some nested calls with reactivemongo in my play2 application.
I get a list of objects back from createObjects. I then loop over them, check whether each object exists in the collection, and insert it if not:
def dostuff() = Action { implicit request =>
  form.bindFromRequest.fold(
    errors => BadRequest(views.html.invite(errors)),
    form => {
      val objectsReadyForSave = createObjects(form.companyId, form.companyName, sms_pattern.findAllIn(form.phoneNumbers).toSet)
      Async {
        for (obj <- objectsReadyForSave) {
          collection.find(BSONDocument("cId" -> obj.cId.get, "userId" -> obj.userId.get))
            .cursor.headOption.map { maybeFound =>
              maybeFound.map { found =>
                Logger.info("Found record, do not insert")
              } getOrElse {
                collection.insert(obj)
              }
            }
        }
        Future(Ok(views.html.invite(form)))
      }
    }
  )
}
I feel this is not as good as it could be; it doesn't feel very "play2" or "reactivemongo".
So my question is: how should I structure my nested calls to get the result I want,
and get the information about which objects have been inserted?
I am not an expert in MongoDB nor in ReactiveMongo, but it seems that you are trying to use a NoSQL database in the same way you would use a standard SQL database. Note that MongoDB is asynchronous, which means operations may be executed at some point in the future; this is why insert/update operations do not return the affected documents. Regarding your questions:
1. To insert the objects if they do not exist, and get the information of which objects have been inserted?
You should probably look at the MongoDB db.collection.update() method and call it with the upsert parameter set to true. If you can afford it, this will either update documents if they already exist in the database or insert them otherwise. Again, this operation does not return the affected documents, but you can check how many documents were affected by accessing the last error. See reactivemongo.api.collections.GenericCollection#update, which returns a Future[LastError].
2. For all the objects that are inserted, add them to a list and then return it with the Ok() call?
Once again, inserted/updated documents will not be returned. If you really need the complete affected documents back, you will need to run another query to retrieve the matching documents.
I would probably rewrite your code this way (without error/failure handling):
def dostuff() = Action { implicit request =>
  form.bindFromRequest.fold(
    errors => BadRequest(views.html.invite(errors)),
    form => {
      val objectsReadyForSave = createObjects(form.companyId, form.companyName, sms_pattern.findAllIn(form.phoneNumbers).toSet)
      Async {
        val operations = for {
          data <- objectsReadyForSave
        } yield collection.update(BSONDocument("cId" -> data.cId.get, "userId" -> data.userId.get), data, upsert = true)
        Future.sequence(operations).map { lastErrors =>
          Ok("Documents probably inserted/updated!")
        }
      }
    }
  )
}
See also Scala Futures: http://docs.scala-lang.org/overviews/core/futures.html
This is really useful! ;)
Here's how I'd rewrite it.
def dostuff() = Action { implicit request =>
  form.bindFromRequest.fold(
    errors => BadRequest(views.html.invite(errors)),
    form => {
      createObjects(form.companyId,
                    form.companyName,
                    sms_pattern.findAllIn(form.phoneNumbers).toSet).map(obj => ƒ(obj, Logger))
      Ok(views.html.invite(form))
    }
  )
}
// ...
// In the model
// ...
// ObjectToSave stands in for whatever type createObjects returns
def ƒ(obj: ObjectToSave, logger: Logger) = {
  // You need to handle the case where obj.cId or obj.userId is None
  collection.find(BSONDocument("cId" -> obj.cId.get, "userId" -> obj.userId.get))
    .cursor
    .headOption
    .map { maybeFound =>
      maybeFound map { _ =>
        logger.info("Record found, do not insert")
      } getOrElse {
        collection.insert(obj)
      }
    }
}
There may be some syntax errors, but the idea is there.