I am trying to build a simple relationship in Neo4j using the Spark-Neo4j connector. My DataFrame looks like this:
df_new = spark.createDataFrame(
    [("CompanyA", 'A', 'CompanyA', 'B'), ("CompanyB", 'B', 'CompanyB', 'C')],
    ["name", 'gid', 'description', 'parent_gid']
)
The desired tree should look like this: (C)-[:PARENT_OF]->(B)-[:PARENT_OF]->(A), i.e. each company's parent_gid points to the gid of its parent.
The query I wrote looks like this:
query = """
MERGE (c:Company {gid:event.gid})
ON CREATE SET c.name=event.name, c.description=event.description
ON MATCH SET c.name=event.name, c.description=event.description
MERGE (p:Company {gid:event.parent_gid})
MERGE (p)-[:PARENT_OF]->(c)
"""
df_new.write\
.mode("Overwrite")\
.format("org.neo4j.spark.DataSource")\
.option("url", "bolt://localhost:7687")\
.option("authentication.type", "basic")\
.option("authentication.basic.username", username)\
.option("authentication.basic.password", password)\
.option("query", query)\
.save()
However, my code ends up creating a new node instead of merging, and I end up with two nodes for company B.
You have the exact right logic, there's just some nuance at play that is hard to pin down. This article has your answer; read the section near the end about unique constraints: https://neo4j.com/developer/kb/understanding-how-merge-works/
One solution is to change your query to this:
query = '''
MERGE (c:Company {gid: event.gid})
SET c.name = event.name, c.description = event.description
MERGE (p:Company {gid: event.parent_gid})
SET p.name = event.name, p.description = event.description
MERGE (p)-[:PARENT_OF]->(c)
'''
Now, when performing concurrent operations, Cypher has enough unique constraints to avoid duplicating gid = "B".
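If you want the database itself to enforce the uniqueness of gid (which is what the linked article's section on unique constraints is getting at), you can also create a uniqueness constraint before running the Spark write. A minimal sketch; the constraint name is made up and the syntax assumes Neo4j 4.4 or later:
// ensure :Company(gid) is unique so concurrent MERGEs cannot create duplicates
// (on servers older than 4.4: CREATE CONSTRAINT ON (c:Company) ASSERT c.gid IS UNIQUE)
CREATE CONSTRAINT company_gid_unique IF NOT EXISTS
FOR (c:Company) REQUIRE c.gid IS UNIQUE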
I am running into various issues while studying JPA, but I am still quite unfamiliar with it, so I would like some advice.
The parts I got stuck on fall into three main categories. Could you please take a look at the code below?
@Repository
public interface TestRepository extends JpaRepository<TestEntity, Long> {

    @Query("SELECT A.test1, A.test2, B.test1, B.test2 "
         + "FROM TEST_TABLE1 A "
         + "LEFT JOIN TEST_TABLE2 B ON A.test_no = B.test_no "
         + "WHERE A.test3 = ?1 "                              // Here's the first question
         + "if(VO.test4 is not null) AND B.test4 = ?2")       // Here's the second question
    List<Object[]> getTestList(VO vo);                        // Here's the third question: List<Object[]>
}
First, is it possible to extract test3 from the VO received as a parameter when using native SQL? Usually I pass each value as a separate parameter, like String test1, but I wonder whether there is another way besides that.
Second, if extracting from the VO is possible, can I add a condition to the @Query depending on whether test4 has a value or not?
Third, if I use List<Object[]>, can the result include columns that are not in the existing entity (e.g. test1 in TEST_TABLE2, which is not part of the TEST_TABLE1 entity)?
First, is it possible to extract test3 from the VO received as a parameter when using native SQL? Usually I pass each value as a separate parameter, like String test1, but I wonder whether there is another way besides that.
Yes, it is possible.
You can use a SpEL expression, e.g. WHERE a.test3 = :#{[0].test3}, which is equal to vo.test3.
[0] is the position of the first parameter passed to the method annotated with @Query.
@Query(value = "SELECT a.test1, a.test2, b.test1, b.test2 "
             + "FROM test_table1 a "
             + "LEFT JOIN test_table2 b ON a.test_no = b.test_no "
             + "WHERE a.test3 = :#{[0].test3}", nativeQuery = true)
List<Object[]> getList(VO vo);
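For illustration, the SpEL expression :#{[0].test3} resolves through the getter of the first method argument, so a hypothetical VO could look like this (names assumed, not taken from your code):
// Hypothetical VO for illustration only; :#{[0].test3} ends up calling getTest3()
public class VO {
    private String test3;
    private String test4;

    public String getTest3() { return test3; }
    public String getTest4() { return test4; }
}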
Second, if extracting from the VO is possible, can I add a condition to the @Query depending on whether test4 has a value or not?
You can use a trick, e.g.:
SELECT ... FROM table a
LEFT JOIN table b ON a.id = b.id
WHERE a.test3 = :#{[0].test3}
AND (:#{[0].test4} IS NULL OR b.test4 = :#{[0].test4})
so the b.test4 condition only applies when test4 actually has a value.
Third, if I use List<Object[]>, can the result include columns that are not in the existing entity (e.g. test1 in TEST_TABLE2, which is not part of the TEST_TABLE1 entity)?
Sorry, but I don't understand the third question.
Maybe this tutorial will help you: https://www.baeldung.com/jpa-queries-custom-result-with-aggregation-functions
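Regarding the third question, in case this is what you mean: with nativeQuery = true and a List<Object[]> return type, each Object[] simply holds the selected columns in the order of the SELECT list, whether or not they belong to an entity. A rough usage sketch, where testRepository is a hypothetical injected instance of the repository above:
// Indexes follow the SELECT order of the query above: a.test1, a.test2, b.test1, b.test2
List<Object[]> rows = testRepository.getList(vo);
for (Object[] row : rows) {
    String aTest1 = (String) row[0]; // a.test1
    String bTest1 = (String) row[2]; // b.test1 comes from TEST_TABLE2, not from the TEST_TABLE1 entity
}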
Insert into a Table from Raw SQL Select
val rawSql: DBIO[Vector[(String, String)]] = sql"SELECT id, name FROM SomeTable".as[(String, String)]
val myTable: TableQuery[MyClass] // with columns id (String), name (String) and some other columns
Is there a way to use the forceInsert functions to insert the data from that select into the table?
If not, is there a way to generate a SQL string by using forceInsertStatements?
Something like:
db.run {
  myTable.map { t => (t.id, t.name) }.forceInsert????(rawSql)
}
P.S. I don't want to make two I/O calls, because my raw SQL might return thousands of records.
Thanks for the help.
If you can represent your rawSql query as a Slick query instead...
val query = someTable.map(row => (row.id, row.name))
...for example, then forceInsertQuery will do what you need. An example might be:
val action =
  myTable.map(row => (row.someId, row.someName))
    .forceInsertQuery(query)
However, I presume you're using raw SQL for a good reason. In that case, I don't believe you can use forceInsert (without a round-trip to the database) because the raw SQL is already an action (not a query).
But, as you're using raw SQL, why not do the whole thing in raw SQL? Something like:
val rawEverything =
sqlu" insert into mytable (someId, someName) select id, name from sometable "
...or similar.
Let's say I have two tables, one for students (tbl_students) and another for exams (tbl_exams). In vanilla SQL with a relational database, I can use an outer join to find the list of students who have missed a particular exam, since the student_id won't match any row in the exam table for that particular exam_id. I could also insert the result of this outer-join query into another table using the SELECT INTO syntax.
With that background, can I achieve a similar result using Spark SQL and Scala, where I populate a DataFrame with the result of an outer join? Example code is below (the code is not tested and may not run as is):
//Create schema for single column (StructType takes a sequence of fields)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("student_id", StringType, true)
))
//Create empty RDD of Rows
val dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val joinDF = sqlContext.createDataFrame(dataRDD, schema);
joinDF.createOrReplaceTempView("tbl_students_missed_exam");
//Populate tbl_students_missed_exam dataframe using result of outer join
sparkSession.sql(s"""
SELECT tbl_students.student_id
INTO tbl_students_missed_exam
FROM tbl_students
LEFT OUTER JOIN tbl_exams ON tbl_students.student_id = tbl_exams.exam_id;""")
Thanks in advance for your input
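For what it's worth, here is a minimal sketch of the idea, assuming tbl_students and tbl_exams are already registered as temp views, tbl_exams has student_id and exam_id columns, and 'EXAM_123' is a made-up exam id. Spark SQL has no SELECT INTO, so the join result is captured as a DataFrame and then registered as a view:
// Sketch only: capture the outer-join result as a DataFrame instead of SELECT INTO,
// then register it as a temp view for later queries
val missedExamDF = sparkSession.sql("""
  SELECT s.student_id
  FROM tbl_students s
  LEFT OUTER JOIN tbl_exams e
    ON s.student_id = e.student_id AND e.exam_id = 'EXAM_123'
  WHERE e.student_id IS NULL
""")

missedExamDF.createOrReplaceTempView("tbl_students_missed_exam")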
I have a requirement to validate an ingest operation. Basically, I have two big files within HDFS: one is Avro formatted (the ingested files), the other is Parquet formatted (the consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If I try to load both files into DataFrames and then use a naive join-where, the job on my local machine takes more than 24 hours, which is unacceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
What is the best way to achieve this? Dropping columns from the DataFrames before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
P.S.
I was reading about the map-side join technique, but it seems it would only work for me if one of the files were small enough to fit in RAM, and I can't guarantee that, so I would like to know which approach the community prefers for this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping the data down to the only field I'm interested in (filename) and building a distinct set of filenames tagged with the dataset they come from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// imports needed for lit and the $ column syntax below
import org.apache.spark.sql.functions.lit
import sparkSession.implicits._

// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count
I create the following Column Family in Cassandra:
CREATE COLUMN FAMILY test with comparator = 'CompositeType(UTF8Type,UTF8Type)' and key_validation_class=UTF8Type;
Now I want to add some data:
set test['a']['b:c'] = 'abc';
set test['a']['b:d'] = 'abd';
set test['a']['e:f'] = 'aef';
set test['a']['e:g'] = 'aeg';
Now I would like to retrieve all columns whose composite name starts with e,
something like:
get test['a']['e:*'];
and the result should be 'aef' and 'aeg'.
What should the CLI query look like?
I am not sure about CQL, but with playOrm, if you partition by a, you can just do an S-SQL (Scalable SQL) query like:
PARTITIONS alias('a') SELECT alias FROM Table as alias WHERE a.column = 'e';
A partition can have millions of rows.
Anyway, I just thought it might help you a bit.