Spark mapPartitionsToPair execution time - MongoDB

In my current project we are using Spark as the computation engine for one of our workflows.
The workflow is as follows:
We have a product catalogue served from several pincodes. A user logged in from any particular pincode should be able to see the least available cost across all serving pincodes.
The least cost is calculated as follows:
product price + dist(pincode1, pincode2)
where pincode2 is the user's pincode and pincode1 is the source pincode. The formula is applied for every source pincode, and the minimum is taken.
My core Spark logic looks like this:
pincodes.javaRDD().cartesian(pincodePrices.javaRDD())
    .mapPartitionsToPair(new PairFlatMapFunction<Iterator<Tuple2<Row, Row>>, Row, Row>() {
        @Override
        public Iterator<Tuple2<Row, Row>> call(Iterator<Tuple2<Row, Row>> t) throws Exception {
            // One Mongo client per partition, closed before returning
            MongoClient mongoclient = MongoClients.create("mongodb://localhost");
            MongoDatabase database = mongoclient.getDatabase("catalogue");
            MongoCollection<Document> pincodeCollection = database.getCollection("pincodedistances");
            List<Tuple2<Row, Row>> list = new LinkedList<>();
            while (t.hasNext()) {
                Tuple2<Row, Row> tuple2 = t.next();
                Row pinRow = tuple2._1;
                Integer srcPincode = pinRow.getAs("pincode");
                Row pricesRow = tuple2._2;
                Row pricesRow1 = (Row) pricesRow.getAs("leastPrice");
                Integer buyingPrice = pricesRow1.getAs("buyingPrice");
                Integer quantity = pricesRow1.getAs("quantity");
                Integer destPincode = pricesRow1.getAs("pincodeNum");
                if (buyingPrice != null && quantity > 0) {
                    BasicDBObject dbObject = new BasicDBObject();
                    dbObject.append("sourcePincode", srcPincode);
                    dbObject.append("destPincode", destPincode);
                    Number distance;
                    if (srcPincode.intValue() == destPincode.intValue()) {
                        distance = 0;
                    } else {
                        // One round trip to Mongo per (source, dest) pair
                        Document document = pincodeCollection.find(dbObject).first();
                        distance = document.get("distance", Number.class);
                    }
                    double margin = 0.02;
                    Long finalPrice = Math.round(buyingPrice + (margin * buyingPrice) + distance.doubleValue());
                    StructType structType = new StructType();
                    structType = structType.add("finalPrice", DataTypes.LongType, false);
                    // quantity is read as an Integer above, so declare it as IntegerType
                    structType = structType.add("quantity", DataTypes.IntegerType, false);
                    Object[] values = {finalPrice, quantity};
                    Row finalPriceRow = new GenericRowWithSchema(values, structType);
                    list.add(new Tuple2<Row, Row>(pinRow, finalPriceRow));
                }
            }
            mongoclient.close();
            return list.iterator();
        }
    }).reduceByKey((priceRow1, priceRow2) -> {
        // Keep the cheaper of the two candidate rows for each pincode key
        Long finalPrice1 = priceRow1.getAs("finalPrice");
        Long finalPrice2 = priceRow2.getAs("finalPrice");
        if (finalPrice1.longValue() < finalPrice2.longValue()) return priceRow1;
        return priceRow2;
    }).collect().forEach(tuple2 -> {
        // Business logic to push computed price to MongoDB
    });
I am able to get the correct answer; however, mapPartitionsToPair is taking quite a while (~22 secs for just 12k records).
After browsing the internet I found that mapPartitions performs better than mapPartitionsToPair, but I am not sure how to emit (key, value) pairs from mapPartitions and then sort them.
Is there any alternative to the above transformations? Any better approach is highly appreciated.
Spark Cluster: Standalone (1 executor, 6 cores)
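
For reference, the usual suspect at this scale is the per-pair pincodeCollection.find(...).first() round trip: the cartesian product issues one Mongo query per (source, dest) pair. Below is a minimal sketch of batching that lookup inside call(), assuming the pincodedistances field names from the code above; the buffering list and composite map key are illustrative, not from the original.

// Sketch only: buffer the partition once to learn which source pincodes
// it needs, then fetch all their distances in a single $in query.
List<Tuple2<Row, Row>> buffered = new ArrayList<>();
Set<Integer> sources = new HashSet<>();
while (t.hasNext()) {
    Tuple2<Row, Row> pair = t.next();
    buffered.add(pair);
    sources.add(pair._1.<Integer>getAs("pincode"));
}

// One round trip per partition instead of one per pair
// (Filters is com.mongodb.client.model.Filters).
Map<String, Double> distances = new HashMap<>();
for (Document d : pincodeCollection.find(Filters.in("sourcePincode", sources))) {
    distances.put(d.getInteger("sourcePincode") + ":" + d.getInteger("destPincode"),
            d.get("distance", Number.class).doubleValue());
}

// Then loop over 'buffered' exactly as before, replacing the find() with
// distances.get(srcPincode + ":" + destPincode).

Two related checks, also hedged: make sure pincodedistances has a compound index on (sourcePincode, destPincode), and if the distance table is small enough to fit in memory, broadcasting it once from the driver removes Mongo from the executors entirely.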

Related

mybatis cursor cannot solve memory issue

I am using Spring 2 and a MyBatis cursor to reduce the memory impact when selecting and processing more than 100k records at once, but I am not sure I am doing the right thing.
Mapper:
@SelectProvider(type = provider.class, method = "retrieveTx")
Cursor<TxModel> retrieveTx();
DAO:
@Transactional(readOnly = true)
public List<TxModel> retrieveTx() {
    Iterator<TxModel> iterator = mapper.retrieveTx().iterator();
    List<TxModel> actualList = new ArrayList<>();
    iterator.forEachRemaining(actualList::add);
    return actualList;
}
Using it:
List<TxModel> txs = dao.retrieveTx();
for (TxModel tx : txs) {
    ....
}
Can anyone confirm whether what I am doing is right? I feel this method doesn't solve my problem when the row count reaches six digits.
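
For reference: draining the cursor into an ArrayList rebuilds the entire result set in memory, so the cursor buys nothing here. Below is a minimal sketch of consuming it as a stream instead, assuming the same mapper from the question; the Consumer-based method signature is illustrative.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

import org.apache.ibatis.cursor.Cursor;
import org.springframework.transaction.annotation.Transactional;

public class TxDao {
    private final TxMapper mapper; // assumed mapper interface from the question

    public TxDao(TxMapper mapper) {
        this.mapper = mapper;
    }

    // The cursor must be consumed inside the transaction, one row at a time,
    // and closed when done; never collected into a List.
    @Transactional(readOnly = true)
    public void processTx(Consumer<TxModel> handler) {
        try (Cursor<TxModel> cursor = mapper.retrieveTx()) {
            cursor.forEach(handler);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Depending on the JDBC driver you may also need a fetch-size hint on the statement (e.g. via @Options(fetchSize = ...)) for the database to actually stream rows rather than buffer them.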

Can someone help me with the Apex code below? I want to update this code to respect governor limits

public static void updatecasefields(List<Case> lstcase) {
    ID devRecordTypeId = Schema.SObjectType.Case.getRecordTypeInfosByDeveloperName()
        .get('CRM_CSR_Case').getRecordTypeId();
    for (Case cs : lstcase) {
        if (cs.ID != null && cs.RecordTypeId == devRecordTypeId) {
            // Query inside a loop: runs once per case in the list
            List<CRM_CasePick__c> casp = [SELECT Id, CRM_Carrier_Name__c, CRM_LOB__c,
                CRM_SLA_Turnaround_Time__c, CRM_Category__c, CRM_Issue_Sub_Type__c,
                CRM_Issue_Type__c, CRM_Turnaround_Time_Days__c
                FROM CRM_CasePick__c
                WHERE CRM_Carrier_Name__c = :cs.GiDP_CarrierName__c
                AND CRM_Category__c = :cs.CRM_Category__c
                AND CRM_Issue_Type__c = :cs.CRM_Issue_Type__c
                AND CRM_Issue_Sub_Type__c = :cs.CRM_Issue_Sub_Type__c
                AND CRM_LOB__c = :cs.CRM_Line_of_Business__c];
            for (CRM_CasePick__c cp : casp) {
                cs.CRM_Turnaround_Time_Days__c = cp.CRM_Turnaround_Time_Days__c;
                cs.CRM_SLA_Turnaround_time__c = cp.CRM_SLA_Turnaround_Time__c;
            }
        }
    }
}
Remove the SOQL query from the for loop - best practice is never to run a query inside a loop.
Right now that query runs once for every element of the initial list. If the list is over 100 records, it will exceed the governor limit of 100 SOQL queries per synchronous transaction.
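For illustration, a hedged sketch of the bulkified pattern: collect the filter values in one pass, run a single SOQL query with IN clauses, and match rows back to cases in memory by a composite key. The field names come from the question; the key format and the per-case lookup are illustrative, since the IN-based query can over-fetch combinations that no case actually needs.

public static void updateCaseFieldsBulk(List<Case> lstCase) {
    Id devRecordTypeId = Schema.SObjectType.Case
        .getRecordTypeInfosByDeveloperName().get('CRM_CSR_Case').getRecordTypeId();

    // Pass 1: collect filter values for the single query below.
    Set<String> carriers = new Set<String>();
    Set<String> categories = new Set<String>();
    Set<String> issueTypes = new Set<String>();
    Set<String> issueSubTypes = new Set<String>();
    Set<String> lobs = new Set<String>();
    for (Case cs : lstCase) {
        if (cs.Id != null && cs.RecordTypeId == devRecordTypeId) {
            carriers.add(cs.GiDP_CarrierName__c);
            categories.add(cs.CRM_Category__c);
            issueTypes.add(cs.CRM_Issue_Type__c);
            issueSubTypes.add(cs.CRM_Issue_Sub_Type__c);
            lobs.add(cs.CRM_Line_of_Business__c);
        }
    }

    // One SOQL query for the whole batch, keyed by the full combination,
    // so over-fetched rows are simply never looked up.
    Map<String, CRM_CasePick__c> picksByKey = new Map<String, CRM_CasePick__c>();
    for (CRM_CasePick__c cp : [
            SELECT CRM_Carrier_Name__c, CRM_Category__c, CRM_Issue_Type__c,
                   CRM_Issue_Sub_Type__c, CRM_LOB__c,
                   CRM_SLA_Turnaround_Time__c, CRM_Turnaround_Time_Days__c
            FROM CRM_CasePick__c
            WHERE CRM_Carrier_Name__c IN :carriers
              AND CRM_Category__c IN :categories
              AND CRM_Issue_Type__c IN :issueTypes
              AND CRM_Issue_Sub_Type__c IN :issueSubTypes
              AND CRM_LOB__c IN :lobs]) {
        picksByKey.put(cp.CRM_Carrier_Name__c + '|' + cp.CRM_Category__c + '|'
            + cp.CRM_Issue_Type__c + '|' + cp.CRM_Issue_Sub_Type__c + '|' + cp.CRM_LOB__c, cp);
    }

    // Pass 2: assign from the map, with no queries inside the loop.
    for (Case cs : lstCase) {
        if (cs.Id == null || cs.RecordTypeId != devRecordTypeId) continue;
        CRM_CasePick__c cp = picksByKey.get(cs.GiDP_CarrierName__c + '|' + cs.CRM_Category__c + '|'
            + cs.CRM_Issue_Type__c + '|' + cs.CRM_Issue_Sub_Type__c + '|' + cs.CRM_Line_of_Business__c);
        if (cp != null) {
            cs.CRM_Turnaround_Time_Days__c = cp.CRM_Turnaround_Time_Days__c;
            cs.CRM_SLA_Turnaround_time__c = cp.CRM_SLA_Turnaround_Time__c;
        }
    }
}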

How to calculate mean of distributed data?

How can I calculate the arithmetic mean of a large vector (series) in distributed computing, where the data is partitioned across multiple nodes? I do not want to use the map-reduce paradigm. Is there any distributed algorithm to compute the mean efficiently, besides the trivial approach of computing an individual sum on each node, bringing the results to the master node, and dividing by the size of the vector (series)?
Distributed average consensus is an alternative.
The problem with the trivial map-reduce-with-a-master approach is that with a vast set of data, making everything depend on one coordinator can take so long that by the time the answer arrives the information is very out of date, and therefore wrong, unless you lock the entire dataset - impractical for a massive set of distributed data. Using distributed average consensus (the same methods work for statistics other than the mean), you get a more up-to-date, better guess at the current value of the mean without locking the data, and in real time.
Here is a link to a paper on it, though it is math-heavy:
http://web.stanford.edu/~boyd/papers/pdf/lms_consensus.pdf
You can google for many more papers on it.
The general concept is this: say each node runs a socket listener. You evaluate your local sum and average, then publish them to the other nodes. Each node listens for the other nodes and stores their sums and averages, refreshed on a timescale that makes sense. You can then evaluate a good guess at the total average as sumForAllNodes(storedAverage[node] * storedCount[node]) / sumForAllNodes(storedCount[node]). If you have a truly large dataset, you can instead listen for new values as they are stored in the node, amend the local count and average, and publish those.
If even this is taking too long, you could average over a random subset of the data in each node.
Here is some C# code that gives you an idea (it uses Fleck so it runs on more versions of Windows than the Windows-10-only Microsoft WebSockets implementation). Run it on two nodes, one with
<appSettings>
    <add key="thisNodeName" value="UK" />
</appSettings>
in its app.config, and with "EU-North" in the other. The two instances exchange means over WebSockets; you just need to add your back-end enumeration of the database.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using Fleck;

namespace WebSocketServer
{
    class Program
    {
        static List<IWebSocketConnection> _allSockets;
        static Dictionary<string, decimal> _allMeans;
        static Dictionary<string, decimal> _allCounts;
        private static decimal _localMean;
        private static decimal _localCount;
        private static decimal _localAggregate_count;
        private static decimal _localAggregate_average;

        static void Main(string[] args)
        {
            _allSockets = new List<IWebSocketConnection>();
            _allMeans = new Dictionary<string, decimal>();
            _allCounts = new Dictionary<string, decimal>();
            var serverAddresses = new Dictionary<string, string>();
            //serverAddresses.Add("USA-WestCoast", "ws://127.0.0.1:58951");
            //serverAddresses.Add("USA-EastCoast", "ws://127.0.0.1:58952");
            serverAddresses.Add("UK", "ws://127.0.0.1:58953");
            serverAddresses.Add("EU-North", "ws://127.0.0.1:58954");
            //serverAddresses.Add("EU-South", "ws://127.0.0.1:58955");
            foreach (var serverAddress in serverAddresses)
            {
                _allMeans.Add(serverAddress.Key, 0m);
                _allCounts.Add(serverAddress.Key, 0m);
            }
            var thisNodeName = ConfigurationManager.AppSettings["thisNodeName"]; //for example "UK"
            var serverSocketAddress = serverAddresses.First(x => x.Key == thisNodeName);
            serverAddresses.Remove(thisNodeName);
            var websocketServer = new Fleck.WebSocketServer(serverSocketAddress.Value);
            websocketServer.Start(socket =>
            {
                socket.OnOpen = () =>
                {
                    Console.WriteLine("Open!");
                    _allSockets.Add(socket);
                };
                socket.OnClose = () =>
                {
                    Console.WriteLine("Close!");
                    _allSockets.Remove(socket);
                };
                socket.OnMessage = message =>
                {
                    Console.WriteLine(message + " received");
                    var parameters = message.Split('~');
                    var remoteHost = parameters[0];
                    var remoteMean = decimal.Parse(parameters[1]);
                    var remoteCount = decimal.Parse(parameters[2]);
                    _allMeans[remoteHost] = remoteMean;
                    _allCounts[remoteHost] = remoteCount;
                };
            });
            while (true)
            {
                //evaluate my local average and count
                Random rand = new Random(DateTime.Now.Millisecond);
                _localMean = 234.00m + (rand.Next(0, 100) - 50) / 10.0m;
                _localCount = 222m + rand.Next(0, 100);
                //evaluate my local aggregate average using means and counts sent from all other nodes
                //could publish aggregate averages to other nodes, if you wanted to monitor disagreement between nodes
                var total_mean_times_count = 0m;
                var total_count = 0m;
                foreach (var server in serverAddresses)
                {
                    total_mean_times_count += _allCounts[server.Key] * _allMeans[server.Key];
                    total_count += _allCounts[server.Key];
                }
                //add on local mean and count which were removed from the server list earlier, so won't be processed
                total_mean_times_count += (_localMean * _localCount);
                total_count = total_count + _localCount;
                _localAggregate_average = (total_mean_times_count / total_count);
                _localAggregate_count = total_count;
                Console.WriteLine("local aggregate average = {0}", _localAggregate_average);
                System.Threading.Thread.Sleep(10000);
                foreach (var serverAddress in serverAddresses)
                {
                    using (var wscli = new ClientWebSocket())
                    {
                        var tokSrc = new CancellationTokenSource();
                        using (var task = wscli.ConnectAsync(new Uri(serverAddress.Value), tokSrc.Token))
                        {
                            task.Wait();
                        }
                        using (var task = wscli.SendAsync(
                            new ArraySegment<byte>(Encoding.UTF8.GetBytes(thisNodeName + "~" + _localMean + "~" + _localCount)),
                            WebSocketMessageType.Text,
                            true, // endOfMessage: true so the receiver sees a complete frame (the original passed false)
                            tokSrc.Token))
                        {
                            task.Wait();
                        }
                    }
                }
            }
        }
    }
}
Don't forget to add a static lock or otherwise synchronise access to the shared dictionaries at appropriate times (not shown for simplicity).
There are two simple approaches you can use.
One is, as you correctly noted, to calculate the sum on every node and then combine the sums and divide by the total amount of data:
avg = (sum1+sum2+sum3)/(cnt1+cnt2+cnt3)
Another possibility is to calculate the average on every node and then use weighted average:
avg = (avg1*cnt1 + avg2*cnt2 + avg3*cnt3) / (cnt1+cnt2+cnt3)
= avg1*cnt1/(cnt1+cnt2+cnt3) + avg2*cnt2/(cnt1+cnt2+cnt3) + avg3*cnt3/(cnt1+cnt2+cnt3)
I don't see anything wrong with these trivial approaches and am wondering why you would want to use anything different.
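For concreteness, here is a minimal sketch of the first combine in plain Java; the Partial class and method names are illustrative, not from the question:

import java.util.Arrays;
import java.util.List;

final class Partial {
    final double sum;  // local sum on one node
    final long count;  // local element count on one node
    Partial(double sum, long count) { this.sum = sum; this.count = count; }
}

final class MeanCombiner {
    // avg = (sum1 + sum2 + ...) / (cnt1 + cnt2 + ...)
    static double combine(List<Partial> perNode) {
        double sum = 0;
        long count = 0;
        for (Partial p : perNode) {
            sum += p.sum;
            count += p.count;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Three nodes report (sum, count); the mean is 6.0 / 4 = 1.5
        System.out.println(combine(Arrays.asList(
                new Partial(3.0, 2), new Partial(2.0, 1), new Partial(1.0, 1))));
    }
}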

Graph processing increasingly gets slower on Titan + DynamoDB (local) as more vertices/edges are added

I am working with Titan 1.0 using the AWS DynamoDB Local implementation as the storage backend on a 16GB machine. My use case involves periodically generating graphs containing vertices and edges on the order of 120K. Every time I generate a new graph in memory, I check the graph stored in the DB and either (i) add vertices/edges that do not exist, or (ii) update properties if they already exist (existence is determined by the 'Label' and a 'Value' attribute). Note that the 'Value' property is indexed. Transactions are committed in batches of 500 vertices.
Problem: I find that this process gets slower with each new graph (the 1st graph finished in 45 mins with an initially empty DB, the 2nd took 2.5 hours, the 3rd 3.5 hours, the 4th 6 hours, the 5th 10 hours, and so on). In fact, when processing a given graph, it is fairly quick at the start but progressively gets slower (initial batches take 2-4 secs, later rising to hundreds of seconds for the same batch size of 500 nodes; sometimes a batch takes 1000-2000 secs). This is the processing time alone (see the approach below); commits always take 8-10 secs. I configured the JVM heap size to 10G, and I notice that while the app is running it eventually uses all of it.
Question: Is this behavior to be expected? It seems to me something is wrong here (either in my config / approach?). Any help or suggestions would be greatly appreciated.
Approach:
Starting from the root node of the in-memory graph, I retrieve all child nodes and maintain a queue.
For each child node, I check whether it exists in the DB; if not, I create a new node, then update some properties:
Vertex dbVertex = dbgraph.traversal().V()
        .has(currentVertexInMem.label(), "Value",
                (String) currentVertexInMem.value("Value"))
        .tryNext()
        .orElseGet(() -> createVertex(dbgraph, currentVertexInMem));
if (dbVertex != null) {
    // Update Properties
    updateVertexProperties(dbgraph, currentVertexInMem, dbVertex);
}
// Add edge if necessary
if (parentDBVertex != null) {
    GraphTraversal<Vertex, Edge> edgeIt = graph.traversal().V(parentDBVertex).outE()
            .has("EdgeProperty1", eProperty1) // eProperty1 is a String input parameter
            .has("EdgeProperty2", eProperty2); // eProperty2 is a Long input parameter
    Boolean doCreateEdge = true;
    Edge e = null;
    while (edgeIt.hasNext()) {
        e = edgeIt.next();
        if (e.inVertex().equals(dbVertex)) {
            doCreateEdge = false;
            break;
        }
    }
    // Create the edge only once the scan has found no match
    if (doCreateEdge) {
        e = parentDBVertex.addEdge("EdgeLabel", dbVertex,
                "EdgeProperty1", eProperty1, "EdgeProperty2", eProperty2);
    }
    e = null;
}
...
if ((processedVertexCount.get() % 500 == 0)
        || processedVertexCount.get() == verticesToProcess.get()) {
    graph.tx().commit();
}
Create function:
public static Vertex createVertex(Graph graph, Vertex clientVertex) {
    Vertex newVertex = null;
    switch (clientVertex.label()) {
    case "Label 1":
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
                clientVertex.value("Value"),
                "Property1-1", clientVertex.value("Property1-1"),
                "Property1-2", clientVertex.value("Property1-2"));
        break;
    case "Label 2":
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
                clientVertex.value("Value"),
                "Property2-1", clientVertex.value("Property2-1"),
                "Property2-2", clientVertex.value("Property2-2"));
        break;
    default:
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
                clientVertex.value("Value"));
        break;
    }
    return newVertex;
}
Schema Def: (Showing some of the indexes)
Note:
"EdgeLabel" = Constants.EdgeLabels.Uses
"EdgeProperty1" = Constants.EdgePropertyKeys.EndpointId
"EdgeProperty2" = Constants.EdgePropertyKeys.Timestamp
public void createSchema() {
    // Create Schema
    TitanManagement mgmt = dbgraph.openManagement();
    mgmt.set("cache.db-cache", true);
    // Vertex Properties
    PropertyKey value = mgmt.getPropertyKey(Constants.VertexPropertyKeys.Value);
    if (value == null) {
        value = mgmt.makePropertyKey(Constants.VertexPropertyKeys.Value).dataType(String.class).make();
        mgmt.buildIndex(Constants.GraphIndexes.ByValue, Vertex.class).addKey(value).buildCompositeIndex(); // INDEX
    }
    PropertyKey shapeSet = mgmt.getPropertyKey(Constants.VertexPropertyKeys.ShapeSet);
    if (shapeSet == null) {
        shapeSet = mgmt.makePropertyKey(Constants.VertexPropertyKeys.ShapeSet).dataType(String.class).cardinality(Cardinality.SET).make();
        mgmt.buildIndex(Constants.GraphIndexes.ByShape, Vertex.class).addKey(shapeSet).buildCompositeIndex();
    }
    ...
    // Edge Labels and Properties
    EdgeLabel uses = mgmt.getEdgeLabel(Constants.EdgeLabels.Uses);
    if (uses == null) {
        uses = mgmt.makeEdgeLabel(Constants.EdgeLabels.Uses).multiplicity(Multiplicity.MULTI).make();
        PropertyKey timestampE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.Timestamp);
        if (timestampE == null) {
            timestampE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.Timestamp).dataType(Long.class).make();
        }
        PropertyKey endpointIDE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.EndpointId);
        if (endpointIDE == null) {
            endpointIDE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.EndpointId).dataType(String.class).make();
        }
        // Indexes
        mgmt.buildEdgeIndex(uses, Constants.EdgeIndexes.ByEndpointIDAndTimestamp, Direction.BOTH,
                endpointIDE, timestampE);
    }
    mgmt.commit();
}
The behavior you are experiencing is expected. Today, DynamoDB Local is a testing tool built on SQLite. If you need to support high TPS for large and periodic data loads, I recommend you use the DynamoDB service.
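For reference, a hedged sketch of pointing Titan at the DynamoDB service instead of DynamoDB Local; the storage.backend class and the endpoint property name follow the dynamodb-titan-storage-backend documentation and should be verified against your version:

import org.apache.commons.configuration.BaseConfiguration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class GraphConfig {
    public static TitanGraph openServiceBackedGraph() {
        BaseConfiguration conf = new BaseConfiguration();
        // Property names assumed from the dynamodb-titan-storage-backend docs
        conf.setProperty("storage.backend",
                "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager");
        // Point at the real service endpoint instead of the local one
        conf.setProperty("storage.dynamodb.client.endpoint",
                "https://dynamodb.us-east-1.amazonaws.com");
        return TitanFactory.open(conf);
    }
}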

How to do time-series simple forecast?

I have univariate time-series data: just a timestamp and a value. Now I want to extrapolate (forecast) this value for the next day/month/year. I know there are methods such as Box-Jenkins (ARIMA) etc.
Spark has linear regression and I tried it, but I did not get satisfactory results. Has anybody tried simple time-series forecasting in Spark? Can you share your implementation approach?
PS: I checked the user mailing list for this issue; almost all the questions regarding it are unanswered there.
Yes, I have already applied ARIMA in Spark to a univariate time series.
public static void main(String args[])
{
    System.setProperty("hadoop.home.dir", "C:/winutils");
    SparkSession spark = SparkSession
            .builder().master("local")
            .appName("Spark-TS Example")
            .config("spark.sql.warehouse.dir", "file:///C:/Users/abc/Downloads/Spark/sparkdemo/spark-warehouse/")
            .getOrCreate();

    Dataset<String> lines = spark.read().textFile("C:/Users/abc/Downloads/thunderbird/Time series/trainingvector_arima.csv");
    Dataset<Double> doubleDataset = lines.map(
            (MapFunction<String, Double>) line -> Double.parseDouble(line),
            Encoders.DOUBLE());
    List<Double> doubleList = doubleDataset.collectAsList();

    // Unbox into the primitive array the vector API expects
    double[] values = new double[doubleList.size()];
    for (int i = 0; i < doubleList.size(); i++) {
        values[i] = doubleList.get(i);
    }

    Vector tsvector = Vectors.dense(values);
    System.out.println("Ts vector:" + tsvector.toString());

    //ARIMAModel arimamodel = ARIMA.fitModel(1, 0, 1, tsvector, true, "css-bobyqa", null);
    ARIMAModel arimamodel = ARIMA.autoFit(tsvector, 1, 1, 1);
    Vector forcst = arimamodel.forecast(tsvector, 10);
    System.out.println("forecast of next 10 observations: " + forcst);
}
This code works for me; pass whatever series you want to forecast as the input data.
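One note, hedged: ARIMA and ARIMAModel here are not part of Spark itself but of the sparkts (spark-timeseries) library, so the snippet assumes roughly these imports plus the sparkts jar on the classpath:

import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import com.cloudera.sparkts.models.ARIMA;
import com.cloudera.sparkts.models.ARIMAModel;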