OrientDB: Is there a way to query metadata about all indexes via the API? - orientdb

OrientDB's console.sh has an INDEXES command, which gives a list of all existing indexes, like so:
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|# |NAME |TYPE |RECORDS|CLASS |COLLATE|FIELDS |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|0 |dictionary |DICTIONARY |0 | |default| |
|1 |OFunction.name |UNIQUE_HASH_INDEX|11 |OFunction |default|name(STRING) |
|2 |ORole.name |UNIQUE |3 |ORole |ci |name(STRING) |
|3 |OUser.name |UNIQUE |1 |OUser |ci |name(STRING) |
|4 |UserRole.Desc |UNIQUE |3 |UserRole |default|Desc(STRING) |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
| |TOTAL | |18 | | | |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
Is there a way to get this information via the API (or a SQL query)?

I contacted OrientDB directly, and #lvca told me about the metadata:indexmanager class, which contains the index information I was looking for:
select expand(indexes) from metadata:indexmanager
Here's an up to date link to the documentation:
https://orientdb.com/docs/last/SQL.html#query-the-available-indexes

With this query you can get the schema metadata for all classes:
SELECT expand(classes) FROM metadata:schema

Related

Select max common Date from differents DataFrames (Scala Spark)

I have different DataFrames and I want to select the latest Date common to all of them. For example, I have the following DataFrames:
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2017-11-19    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2012-12-21    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+
The selected date would be 2016-09-02, because it is the latest date that exists in all three DFs (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but that way I only get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can use left-semi joins to keep only the common dates, then aggregate the maximum date. There is no need to group by Date, since you only want its maximum:
val result = df1
  .join(df2, Seq("Date"), "left_semi")
  .join(df3, Seq("Date"), "left_semi")
  .agg(max("Date"))
You can also use intersect:
val result = df1.select("Date")
  .intersect(df2.select("Date"))
  .intersect(df3.select("Date"))
  .agg(max("Date"))
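Outside of Spark, the logic of both answers is simply "intersect the sets of dates, then take the maximum". A minimal plain-Python sketch of that logic, using the sample data from the question (ISO dates compare correctly as strings):

```python
# Each set stands in for the Date column of one DataFrame.
df1_dates = {"2015-12-14", "2017-11-19", "2016-09-02"}
df2_dates = {"2015-12-14", "2017-11-19", "2016-09-02"}
df3_dates = {"2015-12-14", "2012-12-21", "2016-09-02"}

# Dates present in all three frames (what the semi joins / intersect produce),
# then the maximum of those common dates.
common = df1_dates & df2_dates & df3_dates
max_common = max(common)
print(max_common)  # 2016-09-02
```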

SQL Select Unique Values Each Column

I'm looking to select unique values from each column of a table and output the results into a single table. Take the following example table:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |A |"red" |
|3 |"apples" |B |"blue" |
+------+---------------+------+---------------+
the ideal output would be:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |B |"blue" |
|3 | | | |
+------+---------------+------+---------------+
Thank you!
Edit: My actual table has many more columns, so ideally the SQL query works with a SELECT * rather than one individual SELECT per column inside the FROM clause.
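The desired transformation itself is easy to describe: take the distinct values of each column independently, then pad the shorter columns with blanks. A plain-Python sketch of that reshaping, using the example table (in SQL this is typically done by numbering each column's distinct values with ROW_NUMBER() and joining on that number):

```python
# Rows of the example table; column names would be col1, col2, col_3, col_4.
rows = [
    (1, "apples",  "A", "red"),
    (2, "bananas", "A", "red"),
    (3, "apples",  "B", "blue"),
]

columns = list(zip(*rows))  # transpose rows into columns
# Distinct values per column, keeping first-seen order.
distinct = [sorted(set(col), key=col.index) for col in columns]
# Pad every column with blanks up to the tallest column.
height = max(len(col) for col in distinct)
result = [tuple(col[i] if i < len(col) else "" for col in distinct)
          for i in range(height)]
for row in result:
    print(row)
```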

Query over 3 tables and different columns

I would like to do a fairly simple query, but I can't figure out how to join the tables together. I am new to this world of SQL, and after reading the documentation on the JOIN and SELECT clauses, I still can't figure this one out.
Here are my 3 tables:
Seller
|SELLER_ID|NUMBER|FIRST_NAME|LAST_NAME|TEAM_NR|
| 1|105 |John |Smith |1 |
| 2|106 |James |Brown |3 |
| 3|107 |Jane |Doe |3 |
| 4|108 |Nicole |Sanchez |2 |
Service
|SERVICE_ID|CODE|NAME |PRICE |SELLER_ID|CLIENT_ID|
| 1| 502|BLAHBLAH |200 |2 |2 |
| 2| 503|BLAHBLAH2|175 |1 |3 |
| 3| 504|BLAHBLAH3|250 |3 |2 |
| 4| 505|BLAHBLAH4|130 |2 |4 |
Client
|CLIENT_ID|NUMBER |FIRST_NAME | LAST_NAME |
| 1|51 |JOHN | ADAMS |
| 2|52 |MARY | BRYANT |
| 3|53 |FRANCIS | JOHNSON |
| 4|55 |BEN | CASTLE |
The goal of this query is to figure out which team (TEAM_NR from Seller) sold the most services in a month, based on the total amount sold (the sum of PRICE from Service).
The result should display the FIRST_NAME, LAST_NAME, and TEAM_NR of everyone on the "winning" team.
I already looked for help from StackOverFlow and Google and tried editing these according to my tables, but they didn't pan out.
Thank You!
SELECT S.FIRST_NAME, S.LAST_NAME, S.TEAM_NR, SUM(R.PRICE) AS Winning
FROM Seller S
LEFT JOIN Service R ON S.SELLER_ID = R.SELLER_ID
GROUP BY S.TEAM_NR, S.FIRST_NAME, S.LAST_NAME
EDIT: You don't even need a join on the Client table.
EDIT 2: Every non-aggregated field in the SELECT has to appear in the GROUP BY.
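As a sanity check of what the query needs to compute, here is a plain-Python model of the intended result on the sample data above: sum PRICE per TEAM_NR through the Seller/Service join, pick the top team, then list its members:

```python
# SELLER_ID -> (FIRST_NAME, LAST_NAME, TEAM_NR), from the Seller table.
sellers = {
    1: ("John", "Smith", 1),
    2: ("James", "Brown", 3),
    3: ("Jane", "Doe", 3),
    4: ("Nicole", "Sanchez", 2),
}
# (PRICE, SELLER_ID), from the Service table.
services = [(200, 2), (175, 1), (250, 3), (130, 2)]

# Join Service to Seller and aggregate PRICE per team.
team_totals = {}
for price, seller_id in services:
    team = sellers[seller_id][2]
    team_totals[team] = team_totals.get(team, 0) + price

# The winning team, and everyone on it.
winning_team = max(team_totals, key=team_totals.get)
members = [(f, l, t) for f, l, t in sellers.values() if t == winning_team]
print(winning_team, team_totals[winning_team], members)
```

With this data, team 3 (James Brown and Jane Doe) wins with a total of 580.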

Data loss after writing in spark

I obtain a resulting DataFrame after performing some computations on it. Say the DataFrame is result. When I write it to Amazon S3, specific cells show up blank. The top 5 rows of my result DataFrame are:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 | |1 | | |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 | |2 | | |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 | |5 | | |
---------------------------------------------------------
But when I run result.show(), I can see the values:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 |2 |1 |1 |6 |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 |2 |2 |2 |12 |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 |20 |5 |5 |34 |
---------------------------------------------------------
Also, the blanks appear in the same cells every time I run it.
Use this to save the data to S3:
DataFrame.repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3n://Yourpath")
For anyone who comes across this issue, here is what worked for me.
I was joining one DataFrame (say inputDF) with another (deltaDF) based on some logic and storing the result in an output DataFrame (outDF). I hit the same problem: I could see a record in outDF.show(), but when writing the DataFrame to a Hive table, or when persisting outDF (using outDF.persist(StorageLevel.MEMORY_AND_DISK)), that particular record was missing.
SOLUTION: I persisted inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the output of outDF.show() was consistent with the Hive table outDF was written to.
P.S.: I am not sure why this solved the issue. It would be awesome if someone could explain it, but the above worked for me.

OrientDB spatial query to find all pairs within X km of each other

I'm testing out the OrientDB spatial module. I've put together a simple dataset with coordinates for a few of the geoglyphs in the Nazca Lines (nazca_lines.csv):
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.706940,-75.138532
Condor,-14.697444,-75.126208
Spider,-14.694145,-75.122381
Spiral,-14.688277,-75.122746
Hands,-14.694459,-75.113881
Tree,-14.693898,-75.114520
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
Wing,-14.680309,-75.100385
Parrot,-14.689463,-75.107498
I create a spatial index using:
CREATE INDEX GeoGlyph.index.Location
ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
I can generate a list of nodes that are within, say, 2 km of a specific geoglyph using a query like the one from this Stack Overflow question:
SELECT $temp.Name AS SourceName, Name AS TargetName, $distance.format("%.4f") AS Distance
FROM GeoGlyph
LET $temp = first((SELECT * FROM GeoGlyph WHERE Name = "Tree"))
WHERE [Latitude,Longitude,$spatial]
NEAR [$temp.Latitude, $temp.Longitude,{"maxDistance":2}]
ORDER BY Distance
which gives me this result:
+----+----------+----------+--------+
|# |SourceName|TargetName|Distance|
+----+----------+----------+--------+
|0 |Tree |Tree |0.0000 |
|1 |Tree |Hands |0.0884 |
|2 |Tree |Spider |0.9831 |
|3 |Tree |Spiral |1.0883 |
|4 |Tree |Condor |1.5735 |
+----+----------+----------+--------+
This is nice, but it only finds nodes relative to one specific node. I would like to expand this to ask for all pairs of nodes that are within 2 km of each other.
A result I'm interested in would look something like this:
+----+-----------+-----------+--------+
|# |SourceName |TargetName |Distance|
+----+-----------+-----------+--------+
|1 |Hummingbird|Monkey |1.6314 |
|2 |Monkey |Dog |1.8035 |
|3 |Dog |Condor |0.9349 |
|4 |Dog |Spider |1.5487 |
|5 |Condor |Spider |0.6772 |
|6 |Condor |Spiral |1.2685 |
|7 |Condor |Tree |1.5735 |
|8 |Condor |Hands |1.6150 |
|9 |Spider |Spiral |0.6797 |
...
Any ideas?
You should use the new spatial module with the OPoint type and the ST_Distance_Sphere function:
http://orientdb.com/docs/2.2/Spatial-Index.html#stdistancesphere-from-orientdb-224
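For intuition, here is a plain-Python sketch of the pairwise computation the question asks for: a haversine (great-circle) distance between every pair of glyphs, keeping pairs under 2 km. A sphere-based function like ST_Distance_Sphere computes essentially this, up to the Earth radius used; only a subset of the coordinates from nazca_lines.csv is included, and the distances will not exactly match the Lucene-reported ones above:

```python
from math import radians, sin, cos, asin, sqrt

# A few (Latitude, Longitude) rows from nazca_lines.csv.
glyphs = {
    "Tree":   (-14.693898, -75.114520),
    "Hands":  (-14.694459, -75.113881),
    "Condor": (-14.697444, -75.126208),
    "Spider": (-14.694145, -75.122381),
}

def haversine_km(a, b):
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

# All unordered pairs of glyphs within 2 km of each other.
names = sorted(glyphs)
pairs = [(x, y, haversine_km(glyphs[x], glyphs[y]))
         for i, x in enumerate(names) for y in names[i + 1:]
         if haversine_km(glyphs[x], glyphs[y]) < 2.0]
for x, y, d in pairs:
    print(f"{x} - {y}: {d:.4f} km")
```

With these four glyphs all six pairs fall under 2 km; with the full dataset, distant glyphs such as Astronaut would drop out.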