neo4j embedded scala example inserting nodes

I've been able to run the neo4j Scala example using the batch inserter with no problems. However, when I try to create nodes without the unsafe batch inserter, I get no errors but no inserts either.
Here's the sample code:
private def insertNodes(label: String, data: Iterator[Map[String, String]]) = {
  val dynLabel: Label = DynamicLabel.label(label)
  val graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(DB_PATH)
  registerShutdownHook(graphDb)
  val tx = graphDb.beginTx()
  for (item <- data) {
    val node: Node = graphDb.createNode(dynLabel)
    node.setProperty("item_id", item("item_id"))
    node.setProperty("title", item("title"))
  }
  tx.success()
  graphDb.shutdown()
}

You have to commit the transaction. I'm not sure of the proper Scala syntax, but you want something like:
val tx = graphDb.beginTx()
try {
  for (item <- data) {
    val node: Node = graphDb.createNode(dynLabel)
    node.setProperty("item_id", item("item_id"))
    node.setProperty("title", item("title"))
  }
  tx.success()
} catch {
  case e: Exception => tx.failure()
} finally {
  tx.close()
}
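The commit actually happens when the transaction is closed, so a small loan-pattern helper keeps the success/failure/close bookkeeping in one place. This is only a sketch against the embedded Neo4j 2.x API used above; the withTransaction name is an illustration, not part of the Neo4j API:

import org.neo4j.graphdb.GraphDatabaseService

// Hypothetical helper: runs a block inside a transaction, marking it
// successful only if the block completes without throwing. The commit (or
// rollback) happens when close() runs in the finally.
def withTransaction[A](graphDb: GraphDatabaseService)(block: => A): A = {
  val tx = graphDb.beginTx()
  try {
    val result = block
    tx.success()
    result
  } catch {
    case e: Exception =>
      tx.failure()
      throw e
  } finally {
    tx.close()
  }
}

// Usage with the question's loop:
withTransaction(graphDb) {
  for (item <- data) {
    val node = graphDb.createNode(dynLabel)
    node.setProperty("item_id", item("item_id"))
    node.setProperty("title", item("title"))
  }
}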

Parsing stops with Akka Streams mapAsync

I am parsing 50,000 records, each containing a title and a URL, from a web page. While parsing, I write them to a PostgreSQL database. I deployed my application using docker-compose, but it keeps stopping on some page for no apparent reason. I tried writing some logs to figure out what's happening, but there is no connection error or anything like that.
Here is my code for parsing and writing to the database:
object App {
  val db = Database.forURL("jdbc:postgresql://db:5432/toloka?user=user&password=password")
  val browser = JsoupBrowser()
  val catRepo = new CategoryRepo(db)
  val torrentRepo = new TorrentRepo(db)
  val torrentForParseRepo = new TorrentForParseRepo(db)
  val parallelismFactor = 10
  val groupFactor = 10

  implicit val system = ActorSystem("TolokaParser")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher

  def parseAndWriteTorrentsForParseToDb(doc: App.browser.DocumentType) = {
    Source(getRecordsLists(doc))
      .grouped(groupFactor)
      .mapAsync(parallelismFactor) { torrentForParse: Seq[TorrentForParse] =>
        torrentForParseRepo.createInBatch(torrentForParse)
      }
      .runWith(Sink.ignore)
  }

  def getRecordsLists(doc: App.browser.DocumentType) = {
    val pages = generatePagesFromHomePage(doc)
    println("torrent links generated")
    println(pages.size)
    val result = for {
      page <- pages
    } yield {
      println(s"Parsing torrent list...$page")
      val tmp = getTitlesAndLinksTuple(getTitlesList(browser.get(page)), getLinksList(browser.get(page)))
      println(tmp.size)
      tmp
    }
    println("torrent links and names tupled")
    result.flatten
  }
}
What may be the cause of such problems?
Add a supervision strategy so the stream is not finalized when an element fails, such as:
val decider: Supervision.Decider = {
  case _ => Supervision.Resume
}

def parseAndWriteTorrentsForParseToDb = {
  Source.fromIterator(() => List(1, 2, 3).toIterator)
    .grouped(1)
    .mapAsync(1) { torrentForParse: Seq[Int] =>
      Future { 0 }
    }
    .withAttributes(ActorAttributes.supervisionStrategy(decider))
    .runWith(Sink.ignore)
}
The stream should not stop with this async stage configuration.
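Applied to the stream in the question, a decider that also prints the exception can show which page or batch is failing. This is a sketch reusing the names from the question (getRecordsLists, groupFactor, parallelismFactor, torrentForParseRepo); adjust to taste:

// Resume the stream, but report what was dropped so the failing element
// can be identified in the logs.
val decider: Supervision.Decider = { e =>
  println(s"Element dropped due to: ${e.getMessage}")
  Supervision.Resume
}

def parseAndWriteTorrentsForParseToDb(doc: App.browser.DocumentType) =
  Source(getRecordsLists(doc))
    .grouped(groupFactor)
    .mapAsync(parallelismFactor)(torrentForParseRepo.createInBatch)
    .withAttributes(ActorAttributes.supervisionStrategy(decider))
    .runWith(Sink.ignore)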

Insert into postgres using slick in a non blocking way

class Employee(tag: Tag) extends Table[table_types.user](tag, "EMPLOYEE") {
  def employeeID = column[Int]("EMPLOYEE_ID")
  def empName = column[String]("NAME")
  def startDate = column[String]("START_DATE")
  def * = (employeeID, empName, startDate)
}

object employeeHandle {
  def insert(emp: Employee): Future[Any] = {
    val dao = new SlickPostgresDAO
    val db = dao.db
    val insertdb = DBIO.seq(employee += (emp))
    db.run(insertdb)
  }
}
Insert a million employee records into the database:
object Hello extends App {
  val employees = List[*1 million employee list*]
  for (employee <- employees) {
    employeeHandle.insert(employee)
    *Code to connect to rest api to confirm entry*
  }
}
However, when I run the above code I soon run out of connections to Postgres. How can I do this in parallel (in a non-blocking way) while ensuring I don't run out of connections to Postgres?
I don't think you need to do it in parallel, and I don't see how parallelism alone would solve it. Instead you could simply create the connection before you start that loop and pass it to employeeHandle.insert(db, employee).
Something like this (I don't know Scala):
object Hello extends App {
  val dao = new SlickPostgresDAO
  val db = dao.db
  val employees = List[*1 million employee list*]
  for (employee <- employees) {
    employeeHandle.insert(db, employee)
    *Code to connect to rest api to confirm entry*
  }
}
Almost all of the Slick insert examples I have come across use blocking to fulfil the results. It would be nice to have one that doesn't.
My take on it:
object Hello extends App {
  val employees = List[*1 million employee list*]
  val groupedList = employees.grouped(10).toList

  insertTests()

  def insertTests(l: List[List[Employee]] = groupedList): Unit = {
    val ins = l.head
    val futures = ins.map { emp => employeeHandle.insert(emp) }
    val seq = Future.sequence(futures)
    Await.result(seq, Duration.Inf)
    if (l.tail.nonEmpty) insertTests(l.tail)
  }
}
Also, the DAO and connection in employeeHandle should be created once, outside the insert method:
object employeeHandle {
  val dao = new SlickPostgresDAO
  val db = dao.db

  def insert(emp: Employee): Future[Any] = {
    val insertdb = DBIO.seq(employee += (emp))
    db.run(insertdb)
  }
}
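If the Await in the grouped version is undesirable, a fully non-blocking variant is to chain the groups with flatMap and send each group as a single batch insert via Slick's ++= operator. This is only a sketch under a few assumptions: there is a TableQuery[Employee] (called employeesTable here), the row type is table_types.user as in the table definition above, and an implicit ExecutionContext is in scope:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val employeesTable = TableQuery[Employee]

// Chain the groups so only one batch statement is in flight at a time;
// each group is sent to Postgres as one multi-row insert (++=), which also
// keeps the number of open connections small.
def insertAll(groups: List[Seq[table_types.user]]): Future[Unit] =
  groups.foldLeft(Future.successful(())) { (prev, group) =>
    prev.flatMap(_ => db.run(employeesTable ++= group).map(_ => ()))
  }

// usage, e.g. batches of 1000 rows:
// insertAll(employees.grouped(1000).toList)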

Scala mapPartition collect on partition do nothing

I'm trying to move data from an RDD into a Postgres table, using:
def copyIn(reader: java.io.Reader, columnStmt: String = "") = {
  // connect to the postgres database on localhost
  val driver = "org.postgresql.Driver"
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection()
  try {
    connection.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY my_table ($columnStmt) FROM STDIN WITH CSV", reader)
  } catch {
    case se: SQLException => println(se.getMessage)
    case t: Throwable => println(t.getMessage)
  } finally {
    connection.close()
  }
}
myRdd.mapPartitions(iter => {
  val sb = new StringBuilder()
  var n_iter = iter.map(row => {
    val mapRequest = Utils.getMyRowMap(myMap, row)
    sb.append(mapRequest.values.mkString(", ")).append("\n")
  })
  copyIn(new StringReader(sb.toString), geoSelectMap.keySet.mkString(", "))
  sb.clear
  n_iter
}).collect
The script keeps getting into the copyIn function with no data to insert. I think it may be because iter.map only maps the partition lazily and never actually iterates it? I tried collecting the whole myRdd object and still didn't get any data into copyIn.
How can I iterate over an RDD and get the StringBuilder appended, and why doesn't the snippet above work?
Does anybody have a clue?
iter is an Iterator. So iter.map creates a new Iterator, but you never actually iterate it, so it does nothing. You probably want foreach instead, except then iter will be empty by the time you return it, and the result of collect will be an empty RDD.
The method you actually want is foreachPartition:
myRdd.foreachPartition(iter => {
  val sb = new StringBuilder()
  iter.foreach(row => {
    val mapRequest = Utils.getMyRowMap(myMap, row)
    sb.append(mapRequest.values.mkString(", ")).append("\n")
  })
  copyIn(new StringReader(sb.toString), geoSelectMap.keySet.mkString(", "))
  sb.clear
})
and then myRdd.collect if you want to collect it as well. (Persist myRdd if you want to use it twice without recalculating it.)
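If you do want both the COPY side effect and the formatted rows back from collect, one option (just a sketch, reusing the question's copyIn, Utils.getMyRowMap, myMap and geoSelectMap) is to materialize each partition before writing it and then return an iterator over the materialized rows:

val copied = myRdd.mapPartitions { iter =>
  // toList forces the lazy iterator, so the rows are actually built
  val rows = iter.map { row =>
    Utils.getMyRowMap(myMap, row).values.mkString(", ")
  }.toList

  if (rows.nonEmpty) {
    val csv = rows.mkString("", "\n", "\n")
    copyIn(new StringReader(csv), geoSelectMap.keySet.mkString(", "))
  }
  rows.iterator
}
copied.collect()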

Database Exception in Slick 3.0 while batch insert

While inserting thousands of records every five seconds through batch inserts in Slick 3, I am getting
org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
My data access layer looks like:
val db: CustomPostgresDriver.backend.DatabaseDef = Database.forURL(url, user = user, password = password, driver = jdbcDriver)

override def insertBatch(rowList: List[T#TableElementType]): Future[Long] = {
  val res = db.run(insertBatchQuery(rowList)).map(_.head.toLong).recover {
    case ex: Throwable => RelationalRepositoryUtility.handleBatchOperationErrors(ex)
  }
  //db.close()
  res
}

override def insertBatchQuery(rowList: List[T#TableElementType]): FixedSqlAction[Option[Int], NoStream, Write] = {
  query ++= (rowList)
}
Closing the connection in insertBatch has no effect; it still gives the same error.
I am calling insertBatch from my code like this:
val temp1 = list1.flatMap { li =>
  Future.sequence(li.map { trip =>
    val data = for {
      tripData <- TripDataRepository.insertQuery(trip.tripData)
      subTripData <- SubTripDataRepository.insertBatchQuery(getUpdatedSubTripDataList(trip.subTripData, tripData.id))
    } yield ((tripData, subTripData))
    val res = db.run(data.transactionally)
    res
    //db.close()
  })
}
If I close the connection after my work here, as you can see in the commented code, I get this error:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2#6c3ae2b6 rejected from java.util.concurrent.ThreadPoolExecutor#79d2d4eb[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
After calling the method without Future.sequence, like this:
val temp1 = list.map { trip =>
  val data = for {
    tripData <- TripDataRepository.insertQuery(trip.tripData)
    subTripData <- SubTripDataRepository.insertBatchQuery(getUpdatedSubTripDataList(trip.subTripData, tripData.id))
  } yield ((tripData, subTripData))
  val res = db.run(data.transactionally)
  res
}
I still got the "too many clients" error.
The root of this problem is that you are spinning up an unbounded number of Futures simultaneously, one per entry in list, each of which needs its own database connection.
This can be solved by running your inserts serially, forcing each insert batch to depend on the previous one:
// Empty Future to seed the fold. Replace Unit with the correct element
// type (whatever "res" contains below).
val emptyFuture = Future.successful(Seq.empty[Unit])

// This will only insert one at a time. You could use list.grouped to batch
// the inserts if that was important.
val temp1 = list.foldLeft(emptyFuture) { (previousFuture, trip) =>
  previousFuture.flatMap { previous =>
    // Inner code copied from your example.
    val data = for {
      tripData <- TripDataRepository.insertQuery(trip.tripData)
      subTripData <- SubTripDataRepository.insertBatchQuery(getUpdatedSubTripDataList(trip.subTripData, tripData.id))
    } yield ((tripData, subTripData))
    val res = db.run(data.transactionally)
    res.map(r => previous :+ r)
  }
}
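If strictly serial execution turns out to be too slow, a middle ground is to bound the parallelism: run a fixed-size group of inserts concurrently and chain the groups. This is a sketch only; Trip stands in for whatever type list holds, the group size of 10 is arbitrary, and an implicit ExecutionContext is assumed to be in scope:

// One trip insert, exactly as in the question.
def insertTrip(trip: Trip): Future[Any] = {
  val data = for {
    tripData <- TripDataRepository.insertQuery(trip.tripData)
    subTripData <- SubTripDataRepository.insertBatchQuery(getUpdatedSubTripDataList(trip.subTripData, tripData.id))
  } yield (tripData, subTripData)
  db.run(data.transactionally)
}

// Groups of 10 run concurrently; each group waits for the previous one,
// so at most 10 connections are requested at a time.
val done: Future[Unit] =
  list.grouped(10).foldLeft(Future.successful(())) { (prev, group) =>
    prev.flatMap(_ => Future.traverse(group)(insertTrip).map(_ => ()))
  }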

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with Avro objects not being "Java serialisable"; found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
edit 1: Removed the local variable reference in the map function.
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file holds a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
  class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
    override def write(t: Topic) {
      synchronized {
        super.write(t)
      }
    }
  }

  def makeTopic(x: ForumTopic): Topic = {
    // Omitted to save space
  }

  val writer = new MyWriter
  val q =
    DBCrawler.db.withSession {
      Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
    }
  val sz = q.size
  val c = new AtomicInteger(0)
  q.par.foreach { x =>
    writer.write(makeTopic(x))
    val count = c.incrementAndGet()
    print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
  }
  writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
  val sc = new SparkContext("local[8]", "forumAddNlp")

  // io configuration
  val job = new Job()
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)

  // configure annotator
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  val an = DAnnotator(props)

  // annotator function
  def annotatePosts(ann: DAnnotator, top: Topic): Topic = {
    val new_p = top.getPosts.map { x =>
      val at = new Annotation(x.getPostText.toString)
      ann.annotator.annotate(at)
      val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
      val r = SpecificData.get().deepCopy[Post](x.getSchema, x)
      if (t.nonEmpty) r.setTrees(t)
      r
    }
    val new_t = SpecificData.get().deepCopy[Topic](top.getSchema, top)
    new_t.setPosts(new_p)
    new_t
  }

  // transformation
  val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  val new_ds = ds.map(x => (null, annotatePosts(x._2)))
  new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
    classOf[Void],
    classOf[Topic],
    classOf[ParquetOutputFormat[Topic]],
    job.getConfiguration
  )
}
Can you confirm that the data is indeed in multiple blocks in HDFS? Check the total block count on the forum_dataset.parq file.
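Since the read here is actually from the local filesystem with local[8], a quick check on the Spark side (a sketch, reusing the ds RDD and annotator from the question) is to print how many input partitions Spark created and, as in edit 2, repartition explicitly if it is only one:

// Number of input splits Spark created for the Parquet file; with a single
// small local file this is often 1, which means only one core does the work.
println(s"Input partitions: ${ds.partitions.length}")

// Spread the annotation work across 8 tasks to match local[8].
val new_ds = ds.repartition(8).map(x => (null, annotatePosts(an, x._2)))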