How to prevent titan generate duplicate records with same property? - pyspark

i have spark app which insert data into titan with goblin. but it insert duplicate vertexes with same name. the test condition 'if not result:' not match, and i am in the same session.
def savePartition(p):
print ('savePartition', p)
from goblin import element, properties
class Brand(element.Vertex):
name = properties.Property(properties.String)
import asyncio
loop = asyncio.get_event_loop()
from goblin.app import Goblin
app = loop.run_until_complete(Goblin.open(loop))
app.register(Brand)
async def go(app):
session = await app.session()
for i in p:
if i['brand']:
traversal = session.traversal(Brand)
result = await traversal.has(Brand.name, i['brand']).oneOrNone()
if not result: # TODO: Remove Duplicates
print(i)
brand = Brand()
brand.name = i['brand']
session.add(brand)
session.flush()
await app.close()
loop.run_until_complete(go(app))
rdd = rdd.foreachPartition(savePartition)
how to fix it? thanks a lot.

I am not sure how this would work with Goblin but if you want Titan to prevent duplicates based on a vertex property you can just use Titan composite indices and specify that they must be unique. For example, you could do the following:
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
The above will specify that the name property on vertices must be unique.

Related

Scala : How to use variable in for loop outside loop block

How can I create Dataframe with all my json files, when after reading each file I need to add fileName as field in dataframe? it seems Variable in for loop is not recognized outside loop. How to overcome this issue?
for (jsonfilenames <- fileArray) {
var df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolved tblLanding
Thank in advance
Hossain
I think you are new to programming itself.
Anyways here you go.
Basically you specify the type and initialise it before loop.
var df:DataFrame = null
for (jsonfilename <- fileArray) {
df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // Getting ERROR here
Update
Ok you are completely new to programming, even loops.
Suppose fileArray is having values as [1.json, 2.json, 3.json, 4.json]
So, the loop actually created 4 dataframe, by reading 4 json files.
Which one you want to register as temp table.
If all of them,
var df:DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
df.registerTempTable(s"LandingTable_$count")
count++;
}
And reason for df being empty before this update is, your fileArray is empty or Spark failed to read that file. Print it and check.
To query any of those registered LandingTable
val df2 = hiveContext.sql("SELECT * FROM LandingTable_0")
Update
Question has changed to making a single dataFrame from all the json files.
var dataFrame:DataFrame = null
for (jsonfilename <- fileArray) {
val eachDataFrame = hivecontext.read.json(jsonfilename)
if(dataFrame == null)
dataFrame = eachDataFrame
else
dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Insure, that fileArray is not empty and all json files in fileArray are having same schema.
// Create list of dataframes with source-file-names
val dfList = fileArray.map{ filename =>
hivecontext.read.json(filename)
.withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")

Titan index issues with Cassandra storage backend

I am populating a Titan 1.0.0 single instance with a moderate graph in order to test its query performance. I am using Cassandra 2.0.17 as storage backend.
The thing is I am not able to create node indexes, and hence query results optimally. I have read the docs and I am trying to follow them carefully without much success. I am using the following groovy script for the schema definition, data population and index creation:
import com.thinkaurelius.titan.core.*;
import com.thinkaurelius.titan.core.schema.*;
import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem;
import java.time.temporal.ChronoUnit;
graph = TitanFactory.open('conf/my-titan.properties');
mgmt = graph.openManagement();
// Build graph schema
// Node properties
idProp = mgmt.containsPropertyKey('userId') ?
mgmt.getPropertyKey('userId') : mgmt.makePropertyKey('id').dataType(String.class).cardinality(Cardinality.SINGLE);
isPublicProp = mgmt.containsPropertyKey('isPublic') ?
mgmt.getPropertyKey('isPublic') : mgmt.makePropertyKey('isPublic').dataType(Boolean.class).cardinality(Cardinality.SINGLE);
completionPercentageProp = mgmt.containsPropertyKey('completionPercentage') ?
mgmt.getPropertyKey('completionPercentage') : mgmt.makePropertyKey('completionPercentage').dataType(Integer.class).cardinality(Cardinality.SINGLE);
genderProp = mgmt.containsPropertyKey('gender') ?
mgmt.getPropertyKey('gender') : mgmt.makePropertyKey('gender').dataType(String.class).cardinality(Cardinality.SINGLE);
regionProp = mgmt.containsPropertyKey('region') ?
mgmt.getPropertyKey('region') : mgmt.makePropertyKey('region').dataType(String.class).cardinality(Cardinality.SINGLE);
lastLoginProp = mgmt.containsPropertyKey('lastLogin') ?
mgmt.getPropertyKey('lastLogin') : mgmt.makePropertyKey('lastLogin').dataType(String.class).cardinality(Cardinality.SINGLE);
registrationProp = mgmt.containsPropertyKey('registration') ?
mgmt.getPropertyKey('registration') : mgmt.makePropertyKey('registration').dataType(String.class).cardinality(Cardinality.SINGLE);
ageProp = mgmt.containsPropertyKey('age') ? mgmt.getPropertyKey('age') : mgmt.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.SINGLE);
mgmt.commit();
nUsers = 0
println 'Starting nodes population...';
// Load users
new File('/home/jarandaf/soc-pokec-profiles.txt').eachLine {
try {
fields = it.split('\t').take(8);
userId = fields[0];
isPublic = fields[1] == '1' ? true : false;
completionPercentage = fields[2]
gender = fields[3] == '1' ? 'male' : 'female';
region = fields[4];
lastLogin = fields[5];
registration = fields[6];
age = fields[7] as int;
graph.addVertex('userId', userId, 'isPublic', isPublic, 'completionPercentage', completionPercentage, 'gender', gender, 'region', region, 'lastLogin', lastLogin, 'registration', registration, 'age', age);
} catch (Exception e) {
// Silently skip...
}
nUsers += 1
if (nUsers % 100000 == 0) println String.valueOf(nUsers) + ' loaded...';
};
graph.tx().commit();
println 'Nodes population finished';
// Index users by userId, gender and age
println 'Getting node properties...';
mgmt = graph.openManagement();
userId = mgmt.getPropertyKey('userId');
gender = mgmt.getPropertyKey('gender');
age = mgmt.getPropertyKey('age');
println 'Building byUserId index...';
if (mgmt.getGraphIndex('byUserId') == null) mgmt.buildIndex('byUserId', Vertex.class).addKey(userId).buildCompositeIndex();
println 'Building byGender index...';
if (mgmt.getGraphIndex('byGender') == null) mgmt.buildIndex('byGender', Vertex.class).addKey(gender).buildCompositeIndex();
println 'Building byAge index...';
if (mgmt.getGraphIndex('byAge') == null) mgmt.buildIndex('byAge', Vertex.class).addKey(age).buildCompositeIndex();
mgmt.commit();
// Wait for the indexes to become available
println 'Awaiting byUserId graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byUserId')
.status(SchemaStatus.REGISTERED)
.timeout(10, ChronoUnit.MINUTES)
.call();
println 'Awaiting byGender graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byGender')
.status(SchemaStatus.REGISTERED)
.timeout(10, ChronoUnit.MINUTES)
.call();
println 'Awaiting byAge graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byAge')
.status(SchemaStatus.REGISTERED)
.timeout(10, ChronoUnit.MINUTES)
.call();
// Reindex the existing data
mgmt = graph.openManagement();
println 'Reindexing data by byUserId index...';
mgmt.updateIndex(mgmt.getGraphIndex('byUserId'), SchemaAction.REINDEX).get();
println 'Reindexing data by byGender index...';
mgmt.updateIndex(mgmt.getGraphIndex('byGender'), SchemaAction.REINDEX).get();
println 'Reindexing data by byAge index...';
mgmt.updateIndex(mgmt.getGraphIndex('byAge'), SchemaAction.REINDEX).get();
mgmt.commit();
// Enable indexes
println 'Enabling byUserId index...'
mgmt.awaitGraphIndexStatus(graph, 'byUserId').status(SchemaStatus.ENABLED).call();
println 'Enabling byGender index...'
mgmt.awaitGraphIndexStatus(graph, 'byGender').status(SchemaStatus.ENABLED).call();
println 'Enabling byAge index...'
mgmt.awaitGraphIndexStatus(graph, 'byAge').status(SchemaStatus.ENABLED).call();
graph.close();
The error I am getting is the following and is related with the reindex phase:
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [2#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [3#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [4#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
Any hints on this would be much appreciated.
The errors you get indicate that you have open transactions when you try to modify the schema. Titan needs to wait for all transactions to complete before it can modify the schema. See the answer from Matthias Broecheler on the mailing list for more information.
In general, you should avoid reindexing if possible as it requires Titan to walk over all vertices to see whether they need to be added to the index that should be updated. The documentation contains more information about this process.
For your use case, you can simply create all indexes before you load any data. When you then add the data after all indexes are ready, they will be simply added to the indexes. That way, you should be able to use the indexes immediately.
A minimal example for the schema creation in Groovy (but it should be basically the same in Java):
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.Multiplicity;
import com.thinkaurelius.titan.core.Cardinality;
graph = TitanFactory.open('conf/my-titan.properties')
mgmt = graph.openManagement()
id = mgmt.makePropertyKey('id').dataType(String.class).cardinality(Cardinality.SINGLE)
// some other properties that will not be indexed
mgmt.makePropertyKey('isPublic').dataType(Boolean.class).cardinality(Cardinality.SINGLE)
mgmt.makePropertyKey('completionPercentage').dataType(Integer.class).cardinality(Cardinality.SINGLE)
// I prefer to use vertex labels to differentiate between different 'types' of vertices but this isn't necessary
User = mgmt.makeVertexLabel('User').make()
mgmt.buildIndex('UserById',Vertex.class).addKey(id).indexOnly(user).buildCompositeIndex()
mgmt.commit()
I removed all the checks for already existing schema elements for simplicity, but you can of course add them again.
After the schema creation, you can add your data just like before.
A final node about index management: Try to always define the property keys that you want to index in the same transaction in which you create the index. Otherwise, Titan cannot know whether there is already data that needs to be added to the new index which requires again a complete scan of all data. This might require to choose a different name for a property. When you add for example a new vertex label post, then you might want to use a new name like postId instead of using the property id again to avoid the scan of all existing data.

How to count new element from stream by using spark-streaming

I have done implementation of daily compute. Here is some pseudo-code.
"newUser" may called first activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(path + todayDate)
Computation of "activeUser" can be converted to spark-streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
val time = System.currentTimeMillis()
val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
var firstLine = x.head
x.foreach(line => {
if (line._2 < firstLine._2) firstLine = line
})
firstLine
})
But the approach of "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new element from stream". As my pseudo-code above, "newUser" is part of "activeUser". And I must maintain a set of "historyUser" to know which part is "newUser".
I consider an approach, but I think it may not work right way:
Load the history user as a RDD. Foreach DStream of "activeUser" and find the elements doesn't exist in the "historyUser". A problem here is when should I update this RDD of "historyUser" to make sure I can get the right "newUser" of a window.
Update the "historyUser" RDD means add "newUser" to it. Just like what I did in the pseudo-code above. The "historyUser" is updated once a day in that code. Another problem is how to do this update RDD operation from a DStream. I think update "historyUser" when window slides is proper. But I haven't find a proper API to do this.
So which is the best practice to solve this problem.
updateStateByKey would help here as it allows you to set initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
// Group all values for specific user as needed
val groupedUserData: UserData = newValues.reduce(...)
// prevState is defined only for users previously seen in the stream
// or loaded as initial state from historyUsers RDD
// For new users it is None
val isNewUser = !prevState.isDefined
// as you return state here for the user - prevState won't be None on next iterations
Some(UserStatusState(isNewUser, groupedUserData))
}

How to obtain a subset of records within a context using EntityFramework?

A newbie question. I am using EntityFramework 4.0. The backend database has a function that will return a subset of records based on time.
Example of working code is:
var query = from rx in context.GetRxByDate(tencounter,groupid)
select rx;
var result = context.CreateDetachedCopy(query.ToList());
return result;
I need to verify that a record does not exist in the database before inserting a new record. Before performing the "Any" filter, I would like to populate the context.Rxes with a subset of the larger backend database using the above "GetRxByDate()" function.
I do not know how to populate "Rxes" before performing any further filtering since Rxes is defined as
IQueryable<Rx> Rxes
and does not allow "Rxes =.. ". Here is what I have so far:
using (var context = new EnityFramework())
{
if (!context.Rxes.Any(c => c.Cform == rx.Cform ))
{
// Insert new record
Rx r = new Rx();
r.Trx = realtime;
context.Add(r);
context.SaveChanges();
}
}
I am fully prepared to kick myself since I am sure the answer is simple.
All help is appreciated. Thanks.
Edit:
If I do it this way, "Any" seems to return the opposite results of what is expected:
var g = context.GetRxByDate(tencounter, groupid).ToList();
if( g.Any(c => c.Cform == rx.Cform ) {....}

Entity Framework check against a local list

I have a local list of values that I need to have entity framework check against the database and return them.
If the list was already in the database, the following would work:
var list = /* some ef query */;
var myList = context.Logs.Where(l => list.Any(li => l.LogNumber == li.LogNumber));
But if the list is local, it would throw an error:
var list = new List<Log>();
var myList = context.Logs.Where(l => list.Any(li => l.LogNumber == li.LogNumber));
Exception: Unable to process the type 'Data.Log[]', because it has no known mapping to the value layer.
So how can I match a local list against a database list using EF?
I got a different error than you with the code sample, but I believe it's the same idea. EF doesn't know how to translate List<Log> into a SQL store expression. It works when you're still in a query because it hasn't been serialized yet.
I realize this is less than ideal, but I was able to make this query work by extracting the scalar values of LogNumber and then using that in the query.
var list = new List<Log>();
list.Add(new Log()
{
LogNumber = 1
});
var numbers = list.Select(l => l.LogNumber);
var myList = m.Logs.Where(l => numbers.Contains(l.LogNumber));