OrientDB performance issue on a multi-threaded system

When you go through the OrientDB website, it gives some fancy statistics about the number of documents that can be created per second.
I am not in need of any of that fancy speed; a moderate rate will work for my use case.
My use case:
My system is multi-threaded.
Per request I receive:
Db-Name
Current_Vertex_Name
Previous_Vertex_Name
I tried my use case with the pseudo code below, but I found the speed very slow:
DB_Name = getFromSource()
createGraphDb(DB_Name) using OServerAdmin : if db does not exist
gFactory = OrientGraphFactory(DB_Name) : if db exists
graph = gFactory.getTx()
currentVertexName = getFromSource()
previousVertexName = getFromSource()
if (previousVertexName != null and currentVertexName != null)
{
    - if the vertices do not exist
        - create vertices named 'currentVertexName' and 'previousVertexName'
    - else
        - update the existing vertices (e.g. "update counter")
    - create an edge between them (from the previous to the current event)
}
graph.shutdown()
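Here is a rough Java sketch of the pseudo code above, assuming the OrientDB 2.x Blueprints API (OServerAdmin, OrientGraphFactory, OrientGraph); the credentials, the "followedBy" edge label and the "counter" property are placeholders:

import com.orientechnologies.orient.client.remote.OServerAdmin;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

public class EventGraphWriter {

    public void handleRequest(String dbName, String previousVertexName,
                              String currentVertexName) throws Exception {
        String url = "remote:localhost/" + dbName;

        // Create the graph database if it does not exist yet
        OServerAdmin admin = new OServerAdmin(url).connect("root", "rootpwd");
        if (!admin.existsDatabase("plocal")) {
            admin.createDatabase("graph", "plocal");
        }
        admin.close();

        // Ideally the pooled factory is created once per database and cached,
        // not rebuilt on every request as shown here
        OrientGraphFactory factory =
                new OrientGraphFactory(url, "root", "rootpwd").setupPool(1, 10);
        OrientGraph graph = factory.getTx();
        try {
            if (previousVertexName != null && currentVertexName != null) {
                Vertex prev = findOrCreate(graph, previousVertexName);
                Vertex curr = findOrCreate(graph, currentVertexName);
                // Edge from the previous event to the current one
                graph.addEdge(null, prev, curr, "followedBy");
                graph.commit();
            }
        } finally {
            graph.shutdown();
        }
    }

    // Look the vertex up by name; create it if absent, otherwise bump a counter
    private Vertex findOrCreate(OrientGraph graph, String name) {
        for (Vertex v : graph.getVertices("name", name)) {
            v.setProperty("counter", (Integer) v.getProperty("counter") + 1);
            return v;
        }
        Vertex v = graph.addVertex(null);
        v.setProperty("name", name);
        v.setProperty("counter", 1);
        return v;
    }
}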
Can anyone please share Java code that creates at least 1k vertices + edges per second?
Thanks!

Get all outgoing connectors from a shape programmatically

I want to rename a connector after a shape has been dropped.
Let's say I have a shape1 and I drop a shape2 connected with shape1.
I want the connector shape between shape1 and shape2 so that I can rename it.
I guess it depends on what stage you intercept the drop at. If it's immediately, you might make some assumptions about how many connectors are involved, but if it's some time after the drop, then you might want to determine how many connections are involved.
As an example, with the following shapes:
...you could approach this in a number of ways:
Use the GluedShapes method working back from ShapeTwo
Use the GluedShapes method including the 'from' shape
Iterate through the Connects collection of the Page
Iterate over the Connect objects on your target shape (ShapeOne)
I would definitely try and use the GluedShapes method (which was introduced in Visio 2010) over the Connect objects, but I'm adding them here as they can be useful depending on what you're trying to achieve.
Here's an example using LINQPad:
void Main()
{
    var vApp = MyExtensions.GetRunningVisio();
    var vPag = vApp.ActivePage;

    // For demo purposes I'm assuming the following shape IDs,
    // but in reality you'd get a reference by other methods
    // such as Window.Selection, Page index or ID
    var shpOne = vPag.Shapes.ItemFromID[1];
    var shpTwo = vPag.Shapes.ItemFromID[2];

    Array gluedIds;

    Console.WriteLine("1) using GluedShapes with the 'to' shape only");
    gluedIds = shpTwo.GluedShapes(Visio.VisGluedShapesFlags.visGluedShapesIncoming1D, "");
    IterateByIds(vPag, gluedIds);

    Console.WriteLine("\n2) using GluedShapes with the 'to' and 'from' shapes");
    gluedIds = shpTwo.GluedShapes(Visio.VisGluedShapesFlags.visGluedShapesIncoming1D, "", shpOne);
    IterateByIds(vPag, gluedIds);

    Console.WriteLine("\n3) using the Connects collection on Page");
    var pageConns = from c in vPag.Connects.Cast<Visio.Connect>()
                    where c.FromSheet.OneD != 0
                    group c by c.FromSheet into connectPair
                    where connectPair.Any(p => p.ToSheet.ID == shpOne.ID)
                       && connectPair.Any(p => p.ToSheet.ID == shpTwo.ID)
                    select connectPair.Key.Text;
    pageConns.Dump();

    Console.WriteLine("\n4) using FromConnects and Linq to navigate from shpOne to shpTwo, finding the connector in the middle");
    var shpConns = from c in shpOne.FromConnects.Cast<Visio.Connect>()
                   where c.FromSheet.OneD != 0
                   let targetConnector = c.FromSheet
                   from c2 in targetConnector.Connects.Cast<Visio.Connect>()
                   where c2.ToSheet.Equals(shpTwo)
                   select targetConnector.Text;
    shpConns.Dump();
}

private void IterateByIds(Visio.Page hostPage, Array shpIds)
{
    if (shpIds.Length > 0)
    {
        for (int i = 0; i < shpIds.Length; i++)
        {
            // Report on the shape text (or change it as required)
            Console.WriteLine(hostPage.Shapes.ItemFromID[(int)shpIds.GetValue(i)].Text);
        }
    }
}
Running the above will result in this output:
It's worth bearing in mind that the Connects code (3 and 4) makes the assumption that connector shapes (1D) are being connected to the flowchart shapes (2D) and not the other way round (which is possible).
You can think of the Connect objects as being analogous to connection points, so in the diagram, the three connector shapes generate six Connect objects:
Anyway, hope that gets you unstuck.
UPDATE - Just to be clear (and to answer the original question properly), the code to get all outgoing connectors from ShapeOne would be:
Console.WriteLine("using GluedShapes to report outgoing connectors");
gluedIds = shpOne.GluedShapes(Visio.VisGluedShapesFlags.visGluedShapesOutgoing1D, "");
IterateByIds(vPag, gluedIds);

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below:
1. The Spark application reads a text file from an input HDFS directory.
2. It creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact in-memory replica of the input file and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
3. It creates a filtered DataFrame from the DataFrame constructed in step 2, containing only unique account numbers (with the help of the distinct keyword).
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
4. Using the two DataFrames constructed in steps 2 and 3, we get all the records that belong to one account number and run some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
5. Finally, the parsed JSON data is put into an HBase table.
Here we are facing performance issues when calling the collect method on the DataFrames. Collect fetches all the data onto a single node and then does the processing, losing the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records of data. Collecting all those records onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case, since it fetches a limited number of records at a time. We have to get all the unique account numbers from the whole data set, so I am not sure that take, which returns a limited number of records at a time, will suit our requirements.
I would appreciate any help in avoiding the collect calls, and any other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "accountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)

var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  // JSON parsing logic
  {
    .....
    .....
  }
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism in Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means that every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath).       // Using DataFrameReader but your way works too
  foreachPartition { rows =>
    val conn = ???               // Create HBase connection
    for (row <- rows) {          // Loop over the iterator
      val data = parseJson(row)  // Your parsing logic
      ???                        // Use 'conn' to save 'data'
    }
  }
You can avoid collect in your code if you have a large data set.
Collect: Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Collect can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF.first()
  // JSON parsing logic
  {
    .....
    .....
  }
}

How to make Appstats show both small and read operations?

I'm profiling my application locally (using the dev server) to get more information about how GAE works. My tests compare the common full-entity query and the projection query. Both tests run the same query, but the projection query is limited to 2 properties. The test kind has 100 properties, all with the same value for each entity, with a total of 10 entities. An image with the Datastore viewer and the Appstats-generated data is shown below. In the Appstats image, Request 4 is a memcache flush, Request 3 is the test database creation (it was already created, so no costs here), Request 2 is the full-entity query and Request 1 is the projection query.
I'm surprised that both queries resulted in the same amount of reads. My guess is that small and read operations are being reported the same by Appstats. If this is the case, I want to separate them in the reports. These are the query-related functions:
// Full entity query
public ReturnCodes doQuery() {
    DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
    for (int i = 0; i < numIters; ++i) {
        Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
                FilterOperator.NOT_EQUAL, i);
        Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
        PreparedQuery prepQuery = dataStore.prepare(query);
        Iterable<Entity> results = prepQuery.asIterable();
        for (Entity result : results) {
            log.info(result.toString());
        }
    }
    return ReturnCodes.SUCCESS;
}

// Projection query
public ReturnCodes doQuery() {
    DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
    for (int i = 0; i < numIters; ++i) {
        String projectionPropName = DBCreation.PROPERTY_NAME_PREFIX + i;
        Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
                FilterOperator.NOT_EQUAL, i);
        Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
        query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 0, Integer.class));
        query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 1, Integer.class));
        PreparedQuery prepQuery = dataStore.prepare(query);
        Iterable<Entity> results = prepQuery.asIterable();
        for (Entity result : results) {
            log.info(result.toString());
        }
    }
    return ReturnCodes.SUCCESS;
}
Any ideas?
EDIT: To get a better overview of the problem I have created another test, which does the same query but uses a keys-only query instead. For this case, Appstats correctly shows DATASTORE_SMALL operations in the report. I'm still pretty confused about the behavior of the projection query, which should also be reporting DATASTORE_SMALL operations. Please help!
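For reference, the keys-only variant differs only in the query construction; a minimal sketch, reusing the same kind and filter as in the code above:

Query query = new Query(DBCreation.ENTITY_NAME)
        .setFilter(filter)
        .setKeysOnly();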
[I wrote the Go port of Appstats, so this is based on my experience and recollection.]
My guess is this is a bug in Appstats, which is a relatively unmaintained program. Projection queries are new, so Appstats may not be aware of them and treats them as normal read queries.
For some background, calculating costs is difficult. For write ops, the costs are returned with the results, as they must be, since the app has no way of knowing what changed (which is where the write costs happen). For reads and small ops, however, there is a formula to calculate the cost. Each Appstats implementation (Python, Java, Go) must implement this calculation, including reflection or whatever is needed over the request object to determine what's going on. The APIs for doing this are not entirely obvious, and there are lots of little things, so it's easy to get it wrong and annoying to get it right.

How to implement this simple logic in Spring Batch?

I tried to make this as simple as possible. I'm new to Spring Batch, and I have a small issue understanding how to relate Spring Batch items to each other, especially when it comes to multi-step jobs. Below is my logic, not code (simplified); I don't know how to implement it in Spring Batch, but I thought this might be the right structure:
reader_money
reader_details
tasklet
reader_profit
tasklet_calculation
writer
However, please correct me if I'm wrong, and provide some code if possible.
Thank you very much.
LOGIC:
sql = "select * from MONEY where id= user input"; //the user will input the condition
while (records are available) {
int currency= resultset(currency column);
sql= "select * from DETAILS where D_currency = currency";
while (records are available) {
int amount= resultset(amount column);
string money_flag= resultset(money_type column);
sql= "select * from PROFIT where Mtypes = money_type";
while (records are available) {
int revenue= resultset(revenue);
if (money_type== 1) {
int net_profit= revenue * 3.75;
sql = "update PROFIT set Nprofit = net_profit";
}
else (money_type== 2) {
int net_profit = (revenue - 5 ) * 3.7 ;
sql = "update PROFIT set Nprofit = net_profit";
}
}
sql="update DETAILS set detail_falg = 001 ";
}
sql = "update MONEY set currency_flag = 009";
}
To fit this into a 'conventional' Spring Batch configuration, you would need to flatten the three loops into one if possible.
Perhaps a SQL statement that would return it in one loop, similar to:
select p.revenue, d.amount from PROFIT p, DETAILS d, MONEY m where p.Mtypes = d.money_type and d.D_currency = m.currency and m.id = :?
Once you've "flattened" it, you then fall into the more 'conventional' read/process/write chunk pattern, where the reader retrieves a record from the result set, the processor performs the money_type logic, and the writer then executes the 'update' statement, as in the sketch below.
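A rough sketch of that chunk setup, assuming Spring Batch Java configuration; the ProfitRow holder, the userInputId field and the column names are hypothetical, and the SQL is the flattened join from above:

import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProfitJobConfig {

    // The user-supplied id (hypothetical; in practice this would arrive
    // through job parameters)
    private int userInputId;

    public static class ProfitRow {
        public int revenue;
        public int moneyType;
        public double netProfit;
    }

    @Bean
    public JdbcCursorItemReader<ProfitRow> reader(DataSource dataSource) {
        JdbcCursorItemReader<ProfitRow> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        // The flattened join suggested above
        reader.setSql("select p.revenue, d.money_type from PROFIT p, DETAILS d, MONEY m"
                + " where p.Mtypes = d.money_type and d.D_currency = m.currency and m.id = ?");
        reader.setPreparedStatementSetter(ps -> ps.setInt(1, userInputId));
        reader.setRowMapper((rs, rowNum) -> {
            ProfitRow row = new ProfitRow();
            row.revenue = rs.getInt("revenue");
            row.moneyType = rs.getInt("money_type");
            return row;
        });
        return reader;
    }

    @Bean
    public ItemProcessor<ProfitRow, ProfitRow> processor() {
        return row -> {
            // The money_type branch from the question's pseudo code
            row.netProfit = (row.moneyType == 1)
                    ? row.revenue * 3.75
                    : (row.revenue - 5) * 3.7;
            return row;
        };
    }

    @Bean
    public JdbcBatchItemWriter<ProfitRow> writer(DataSource dataSource) {
        JdbcBatchItemWriter<ProfitRow> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("update PROFIT set Nprofit = ? where Mtypes = ?");
        writer.setItemPreparedStatementSetter((row, ps) -> {
            ps.setDouble(1, row.netProfit);
            ps.setInt(2, row.moneyType);
        });
        return writer;
    }

    @Bean
    public Step profitStep(StepBuilderFactory steps, DataSource dataSource) {
        return steps.get("profitStep")
                .<ProfitRow, ProfitRow>chunk(100)
                .reader(reader(dataSource))
                .processor(processor())
                .writer(writer(dataSource))
                .build();
    }
}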
Check out the ItemReaderAdapter, with which you could place all your SQL in some kind of DAO that returns a list of aggregated objects containing all the info you need for your calculation.
Or
You could use the CompositeItemReader pattern. You basically define multiple ItemReaders inside one master ItemReader; the read() method invokes all the inner ItemReaders before going to the processor/writer phase.
I could post you some example.. but I have to leave :-(..
Leave a comment if you need some example
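In the meantime, here is a very rough sketch of the composite idea; Money, Details, Profit and Aggregate are hypothetical placeholder types, and this CompositeItemReader is a hand-rolled class, not a stock Spring Batch component:

import org.springframework.batch.item.ItemReader;

// One master reader that pulls from each inner reader on every read()
// and hands an aggregate to the processor/writer phase.
public class CompositeItemReader implements ItemReader<Aggregate> {

    private final ItemReader<Money> moneyReader;
    private final ItemReader<Details> detailsReader;
    private final ItemReader<Profit> profitReader;

    public CompositeItemReader(ItemReader<Money> moneyReader,
                               ItemReader<Details> detailsReader,
                               ItemReader<Profit> profitReader) {
        this.moneyReader = moneyReader;
        this.detailsReader = detailsReader;
        this.profitReader = profitReader;
    }

    @Override
    public Aggregate read() throws Exception {
        Money money = moneyReader.read();
        if (money == null) {
            return null; // signals end of input to Spring Batch
        }
        // Invoke the remaining inner readers before the processor/writer phase
        return new Aggregate(money, detailsReader.read(), profitReader.read());
    }
}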

Moodle Database API error: Get quiz marks for all sections of one course for one user

I am trying to get the total marks obtained by a particular user, for all the sections of a particular course.
The following query works and gives correct results with MySQL, but not with Database API calls:
$sql = "SELECT d.section as section_id,d.name as section_name, sum(a.sumgrades) AS marks FROM mdl_quiz_attempts a, mdl_quiz b, mdl_course_modules c, mdl_course_sections d WHERE a.userid=6 AND b.course=4 AND a.quiz=b.id AND c.instance=a.quiz AND c.module=14 AND a.sumgrades>0 AND d.id=c.section GROUP BY d.section"
I tried different API calls; mainly I would want:
$DB->get_records_sql($sql);
The results from the API calls are meaningless. Any suggestions?
PS: This is Moodle 2.2.
I just tried to do something similar, only without getting the sections. You only need the course and user id. I hope this helps you.
global $DB;

// Get all attempts & grades for one user from every quiz of one course
$sql = "SELECT qa.id, qa.attempt, qa.quiz, qa.sumgrades AS grade, qa.timefinish, qa.timemodified, q.sumgrades, q.grade AS maxgrade
        FROM {quiz} q, {quiz_attempts} qa
        WHERE q.course = ".$courseid."
          AND qa.quiz = q.id
          AND qa.userid = ".$userid."
          AND state = 'finished'
        ORDER BY qa.timefinish ASC";
$exams = $DB->get_records_sql($sql);

// Calculate final grades from sum grades
$grades = array();
foreach ($exams as $exam) {
    $grade = new stdClass;
    $grade->quiz = $exam->quiz;
    $grade->attempt = $exam->attempt;
    // Scale the attempt's sum grade up to the quiz's maximum grade
    $grade->finalgrade = $exam->grade * ($exam->maxgrade / $exam->sumgrades);
    $grade->grademax = $exam->maxgrade;
    $grade->timemodified = $exam->timemodified;
    array_push($grades, $grade);
}
This works in the latest Moodle version, Moodle 2.9. Although I am still open to a better solution, as this is a really hacky way of getting deeper analytics about a user's performance.