SQL Select Unique Values Each Column - hiveql

I'm looking to select unique values from each column of a table and output the results into a single table. Take the following example table:
+------+---------------+------+---------------+
|col1  |col2           |col_3 |col_4          |
+------+---------------+------+---------------+
|1     |"apples"       |A     |"red"          |
|2     |"bananas"      |A     |"red"          |
|3     |"apples"       |B     |"blue"         |
+------+---------------+------+---------------+
The ideal output would be:
+------+---------------+------+---------------+
|col1  |col2           |col_3 |col_4          |
+------+---------------+------+---------------+
|1     |"apples"       |A     |"red"          |
|2     |"bananas"      |B     |"blue"         |
|3     |               |      |               |
+------+---------------+------+---------------+
Thank you!
Edit: My actual table has many more columns, so ideally the query can be written with a SELECT * rather than four individual SELECT queries inside the FROM clause.
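One possible approach (a sketch, not from the post): HiveQL has no built-in "distinct values per column" operator, so you can loop over the columns programmatically, for example in Spark/Scala, collect the distinct values of each column, pad the shorter lists, and rebuild a table. The names df and spark and the cast to string are illustrative assumptions, and collecting to the driver is only reasonable when the distinct sets are small.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// df is assumed to hold the example table (col1, col2, col_3, col_4).
// Collect the distinct values of every column, cast to string so mixed types line up.
val distinctPerCol = df.columns.map { c =>
  df.select(col(c).cast("string")).distinct().collect().map(_.getString(0))
}

// Pad the shorter lists with nulls, transpose, and rebuild a DataFrame.
val maxLen = distinctPerCol.map(_.length).max
val padded = distinctPerCol.map(_.padTo(maxLen, null: String))
val rows   = (0 until maxLen).map(i => Row.fromSeq(padded.map(_(i)).toSeq))
val schema = StructType(df.columns.map(StructField(_, StringType, nullable = true)))
val result = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)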

Related

create unique runid and append in output dataframe for each time we run spark scala code

Below is my output DataFrame. I want to add one more column for runid; can anyone help?
+--------+-------------------------------+---+
|order_id|Diff                           |id |
+--------+-------------------------------+---+
|12      |order_status                   |1  |
|1       |order_customer_id order_status |1  |
|68885   |New row in DataFrame 2         |1  |
|68886   |New row in DataFrame 2         |1  |
|2       |order_customer_id              |1  |
|12      |order_status                   |2  |
|1       |order_customer_id order_status |2  |
|68885   |New row in DataFrame 2         |2  |
|68886   |New row in DataFrame 2         |2  |
|2       |order_customer_id              |2  |
+--------+-------------------------------+---+
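A hedged sketch of one way to do this (not from the post): generate a single id at the start of the run, for example a UUID or a timestamp, and stamp it onto every row with lit. The DataFrame name outputDF and the choice of a UUID are illustrative assumptions.

import java.util.UUID
import org.apache.spark.sql.functions.lit

// One id per execution of the job; a timestamp or batch counter would also work.
val runId = UUID.randomUUID().toString
val withRunId = outputDF.withColumn("runid", lit(runId))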

Select max common Date from different DataFrames (Scala Spark)

I have different DataFrames and I want to select the max common Date across them. For example, I have the following DataFrames:
+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2017-11-19    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+

+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2017-11-19    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+

+--------------+-------+
|Date          | value |
+--------------+-------+
|2015-12-14    |5      |
|2012-12-21    |1      |
|2016-09-02    |1      |
|2015-12-14    |3      |
|2015-12-14    |1      |
+--------------+-------+
The selected date would be 2016-09-02, because it is the latest date that exists in all three DataFrames (2017-11-19 is not in the third one).
I am trying to do it with agg(max), but that way I only get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can do semi joins to keep only the common dates, and then aggregate to get the maximum date. There is no need to group by date, because you only want its maximum.
val result = df1.join(df2, Seq("Date"), "left_semi").join(df3, Seq("Date"), "left_semi").agg(max("Date"))
You can also use intersect:
val result = df1.select("Date").intersect(df2.select("Date")).intersect(df3.select("Date")).agg(max("Date"))

Replace the values based on condition spark

I have a dataset, and I want to replace the result column with the result of the row that has the least quantity within each (id, date) group.
id,date,quantity,result
1,2016-01-01,245,1
1,2016-01-01,345,3
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Here is the expected output: within each (id, date) group, every row takes the result of the row with the least quantity. The ordering of rows does not matter; any order is fine.
id,date,quantity,result
1,2016-01-01,245,2
1,2016-01-01,345,2
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Use a Window partitioned by (id, date): keep result only on the row with the minimum quantity, then propagate that value to the whole group with max.
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('id', 'date')

# result is kept only where quantity is the group minimum (null elsewhere),
# then max over the window copies it to every row of the group.
df.withColumn('result', f.when(f.col('quantity') == f.min('quantity').over(w), f.col('result'))) \
  .withColumn('result', f.max('result').over(w)).show(10, False)
+---+----------+--------+------+
|id |date      |quantity|result|
+---+----------+--------+------+
|1  |2016-01-02|120     |5     |
|1  |2016-01-01|245     |2     |
|1  |2016-01-01|345     |2     |
|1  |2016-01-01|123     |2     |
|2  |2016-01-02|453     |1     |
|2  |2016-01-01|567     |1     |
|2  |2016-01-01|568     |1     |
+---+----------+--------+------+

OrientDB: Is there a way to query metadata about all indexes via the API?

OrientDB's console.sh has an INDEXES command, which gives a list of all existing indexes, like so:
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|#   |NAME               |TYPE             |RECORDS|CLASS       |COLLATE|FIELDS           |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|0   |dictionary         |DICTIONARY       |0      |            |default|                 |
|1   |OFunction.name     |UNIQUE_HASH_INDEX|11     |OFunction   |default|name(STRING)     |
|2   |ORole.name         |UNIQUE           |3      |ORole       |ci     |name(STRING)     |
|3   |OUser.name         |UNIQUE           |1      |OUser       |ci     |name(STRING)     |
|4   |UserRole.Desc      |UNIQUE           |3      |UserRole    |default|Desc(STRING)     |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|    |TOTAL              |                 |18     |            |       |                 |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
Is there a way to get this information via the API (or a SQL query)?
I contacted OrientDB directly, and #lvca told me about the "metadata:indexmanager" class that contains the index information I was looking for:
select expand(indexes) from metadata:indexmanager
Here's an up to date link to the documentation:
https://orientdb.com/docs/last/SQL.html#query-the-available-indexes
You can also query metadata:schema; for example, this returns all classes:
SELECT expand(classes) from metadata:schema

Data loss after writing in spark

I obtain a resultant DataFrame after performing some computations; let's say that DataFrame is result. When I write it to Amazon S3, specific cells come out blank. The top 5 rows of my result DataFrame are:
+--------+--------+--------+------+------+------+-----+
|var30   |var31   |var32   |var33 |var34 |var35 |var36|
+--------+--------+--------+------+------+------+-----+
|-0.00586|0.13821 |0       |      |1     |      |     |
|3.87635 |2.86702 |2.51963 |8     |11    |2     |14   |
|3.78279 |2.54833 |2.45881 |      |2     |      |     |
|-0.10092|0       |0       |1     |1     |3     |1    |
|8.08797 |6.14486 |5.25718 |      |5     |      |     |
+--------+--------+--------+------+------+------+-----+
But when I run result.show(), I can see the values:
+--------+--------+--------+------+------+------+-----+
|var30   |var31   |var32   |var33 |var34 |var35 |var36|
+--------+--------+--------+------+------+------+-----+
|-0.00586|0.13821 |0       |2     |1     |1     |6    |
|3.87635 |2.86702 |2.51963 |8     |11    |2     |14   |
|3.78279 |2.54833 |2.45881 |2     |2     |2     |12   |
|-0.10092|0       |0       |1     |1     |3     |1    |
|8.08797 |6.14486 |5.25718 |20    |5     |5     |34   |
+--------+--------+--------+------+------+------+-----+
Also, the blanks appear in the same cells every time I run it.
Use this to save the data to your S3 path:
DataFrame.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("s3n://Yourpath")
For anyone who might have come across this issue, here is what worked for me.
I was joining one DataFrame (let's say inputDF) with another (deltaDF) based on some logic and storing the result in an output DataFrame (outDF). I had the same problem: I could see a record in outDF.show(), but when writing the DataFrame to a Hive table, or when persisting it (using outDF.persist(StorageLevel.MEMORY_AND_DISK)), that particular record was missing.
SOLUTION: I persisted inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the output of outDF.show() was consistent with the Hive table where outDF was written.
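A minimal sketch of that workaround (assuming Spark/Scala; the join key order_id and the table name output_table are illustrative, not from the answer):

import org.apache.spark.storage.StorageLevel

// Persist the input before the join that produces outDF.
inputDF.persist(StorageLevel.MEMORY_AND_DISK)
val outDF = inputDF.join(deltaDF, Seq("order_id"), "left")
// After persisting, outDF.show() and the written table were consistent.
outDF.write.mode("overwrite").saveAsTable("output_table")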
P.S.: I am not sure why this solved the issue. It would be great if someone could explain it, but the above worked for me.