Scala: best way to update a Delta table after filling missing values

I have the following delta table
+-+----+
|A|B |
+-+----+
|1|10 |
|1|null|
|2|20 |
|2|null|
+-+----+
I want to fill the null values in column B based on the A column.
I came up with this to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")
df = df.withColumn("B", last("B", ignoreNulls = true).over(w))
Which gives me the desired output:
+-+----+
|A|B |
+-+----+
|1|10 |
|1|10 |
|2|20 |
|2|20 |
+-+----+
Now, my question is:
What is the best way to write the result back to my Delta table correctly?
Should I MERGE? Rewrite with the overwrite option?
My Delta table is huge and it will keep growing; I am looking for the best possible method to achieve this.
Thank you

It depends on the distribution of the rows that contain the null values you'd like to fill (i.e. are they all in one file or spread across many?).
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
Check the Important! section in the docs for more.
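If you do go the MERGE route, a minimal sketch with the Delta Lake Scala API could look like the following. It assumes the table has a unique key column, called id here, which the example table doesn't show; merging on A alone would fail because A is not unique.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val w = Window.partitionBy("A")

// Keep only the rows that actually change: B was null and now has a value.
val changes = spark.table("MyDeltaTable")
  .withColumn("B_filled", last("B", ignoreNulls = true).over(w))
  .filter("B IS NULL AND B_filled IS NOT NULL")

DeltaTable.forName(spark, "MyDeltaTable")
  .as("t")
  .merge(changes.as("s"), "t.id = s.id")
  .whenMatched()
  .updateExpr(Map("B" -> "s.B_filled"))
  .execute()

This way only the files that contain previously-null rows get rewritten, which is the point of preferring MERGE over a full overwrite.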

Related

How to assign values from a CSV to individual variables using Scala

I'd first like to apologize if this is not the proper way of asking a question, but it's my first one.
I have a CSV with 2 columns, one with names and another one with values.
I have imported the CSV as a Scala DataFrame into Spark and I want to assign all the values to individual variables (getting the variable names from the column name).
The DataFrame looks like this (there are more variables, and the total number of variables can change):
| name|value|
+--------------------+-----+
| period_id| 188|
| homeWork| 0.75|
| minDays| 7|
...
I am able to assign the value of one individual row to a variable using the code below, but I'd like to do it for all the records automatically.
val period_id = vbles.filter(vbles("name").equalTo("period_id")).select("value").first().getString(0).toDouble
The idea I had was to iterate through all the rows in the DataFrame and run the code above each time, something like
for (valName <- names) {
  val value = vbles.filter(vbles("name").equalTo(valName)).select("value").first().getString(0).toDouble
}
I have tried some ways of iterating through the rows of the DataFrame but I haven't been successful.
What I'd like to get is as if I'd do this manually
val period_id = 188
val homeWork = 0.75
val minDays = 7
...
I suppose there are smarter ways of doing this, but if I could just iterate through the rows it would be fine; any solution that works is welcome.
Thanks a lot
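One way to avoid filtering the DataFrame once per variable is to collect the two columns into a Map and look the values up from there (Scala cannot create new val names at runtime anyway). A sketch, assuming the DataFrame is called vbles as above and is small enough to collect:

// Collect the name/value pairs to the driver as a Map[String, Double].
val params: Map[String, Double] = vbles
  .select("name", "value")
  .collect()
  .map(row => row.getString(0) -> row.getString(1).toDouble)
  .toMap

val period_id = params("period_id")   // 188.0
val homeWork  = params("homeWork")    // 0.75
val minDays   = params("minDays")     // 7.0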

How to save the predictions of YOLO (You Only Look Once) Object detection in a jsonb field in a database

I want to run Darknet (YOLO) on a number of images and store its predictions in a PostgreSQL database.
This is the structure of my table:
sample=> \d+ prediction2;
Table "public.prediction2"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+-------+-----------+----------+--------------+-------------
path | text | not null | extended | |
pred_result | jsonb | | extended | |
Indexes:
"prediction2_pkey" PRIMARY KEY, btree (path)
Darknet's (YOLO's) source files are written in C.
I have already stored Caffe's predictions in the database as follows. I have listed one of the rows of my database here as an example.
path | pred_result
-------------------------------------------------+------------------------------------------------------------------------------------------------------------------
/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg | {"bow tie": 0.00631, "lab coat": 0.59257, "neck brace": 0.00428, "Windsor tie": 0.01155, "stethoscope": 0.36260}
I want to add YOLO's predictions to the jsonb data of pred_result, i.e. for each image path and Caffe prediction result already stored in the database, I would like to append Darknet's (YOLO's) predictions.
The reason I want to do this is to add search tags to each image. By running both Caffe and Darknet on the images, I want to get enough labels to make my image search better.
Kindly help me with how I should do this in Darknet.
This is an issue I also encountered. YOLO does not actually provide a JSON output interface, so there is no way to get the same output as from Caffe.
However, there is a pull request you can merge to get workable output here: https://github.com/pjreddie/darknet/pull/34/files. It outputs CSV data, which you can convert to JSON to store in the database.
You could of course also alter the source code of YOLO to make your own implementation that outputs JSON directly.
If you are able to use a TensorFlow implementation of YOLO, try this: https://github.com/thtrieu/darkflow
You can interact with darkflow directly from another Python application and then do with the output data as you please (or get JSON data saved to a file, whichever is easier).
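Whichever route you take to get the YOLO labels as a JSON string, the database side can stay the same: PostgreSQL's jsonb concatenation operator (||, available since 9.5) appends the new labels to the existing pred_result for a given path. A rough sketch over JDBC; the connection details and label values are made up, and it assumes the PostgreSQL JDBC driver is available:

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/sample", "user", "password")

// pred_result || ?::jsonb keeps the existing Caffe labels and appends the YOLO ones.
val stmt = conn.prepareStatement(
  "UPDATE prediction2 SET pred_result = pred_result || ?::jsonb WHERE path = ?")

val yoloJson = """{"person": 0.87, "dog": 0.64}"""   // converted from YOLO's CSV output
stmt.setString(1, yoloJson)
stmt.setString(2, "/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg")
stmt.executeUpdate()

stmt.close()
conn.close()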

Is there any way to know the last commit value in a table?

I am using Postgres 9.5. If I update certain values of a row and commit, is there any way to fetch the old value afterwards? I am wondering whether there is something like a flashback, but it would have to be a selective flashback: I don't want to roll back the entire database, I just need to revert one row.
Short answer - it is not possible.
But for future readers, you can create an array field with historical data that will look something like this:
    Column     |   Type
---------------+-----------
 value         | integer
 value_history | integer[]
For more info, read the docs about arrays.
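For example, a single UPDATE can set the new value and append the old one to the history, because column references on the right-hand side of SET still see the pre-update row. A sketch over JDBC; the table name my_table and the id column are hypothetical:

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/mydb", "user", "password")

// On the right-hand side of SET, value still refers to the old value,
// so the previous value is appended to value_history.
val stmt = conn.prepareStatement(
  "UPDATE my_table SET value = ?, " +
  "value_history = coalesce(value_history, '{}') || value WHERE id = ?")
stmt.setInt(1, 42)
stmt.setInt(2, 1)
stmt.executeUpdate()

stmt.close()
conn.close()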

Can I use SELECT from dataframe instead of creating this temp table?

I am currently using:
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |sen |attributes |
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |Stanford is good college.|[[Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.], [Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.]]|
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
I want to get the above df from:
+----------+--------+--------------------+
|article_id| sen| attribute|
+----------+--------+--------------------+
| 1|example1|[Standford,Organi...|
| 1|example1| [is,O,VP]|
| 1|example1| [good,LOCATION,ADP]|
+----------+--------+--------------------+
using:
df3.registerTempTable("d1")
val df4 = sqlContext.sql("select article_id,sen,collect(attribute) as attributes from d1 group by article_id,sen")
Is there any way that I don't have to register a temp table? While saving the dataframe, it is giving a lot of garbage! Something like df3.select(...)?
The only way Spark currently has to run SQL against a dataframe is via a temporary table. However, you can add implicit methods to DataFrame to automate this, as we have done at Swoop. I can't share all the code as it uses a number of our internal utilities & implicits, but the core is in the following gist. The importance of using unique temporary table names is that (at least until Spark 2.0) temporary tables are cluster global.
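For illustration only, a minimal sketch of the idea (not the actual gist; the {this} placeholder and the names are made up here):

import java.util.UUID
import org.apache.spark.sql.DataFrame

implicit class DataFrameSqlOps(df: DataFrame) {
  // Registers the DataFrame under a unique temp table name and runs the query,
  // substituting that name for the {this} placeholder. Unique names matter
  // because temporary tables are global to the cluster.
  def sql(query: String): DataFrame = {
    val name = "tmp_" + UUID.randomUUID().toString.replace("-", "")
    df.registerTempTable(name)
    df.sqlContext.sql(query.replace("{this}", name))
  }
}

With that in scope, the query from the question becomes df3.sql("select article_id, sen, collect(attribute) as attributes from {this} group by article_id, sen"), and no table name ever has to be chosen by hand.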
We use this approach regularly in our work, especially since there are many situations in which SQL is much simpler/easier to write and understand than the Scala DSL.
Hope this helps!

Limit results on OR condition in Sphinx

I am trying to limit results by grouping them somehow.
This query attempt should make things clear:
#namee ("Cameras") limit 5| #namee ("Mobiles") limit 5| #namee ("Washing Machine") limit 5| #namee ("Graphic Cards") limit 5
where namee is the column
Basically I am trying to limit results based upon specific criteria.
Is this possible? Is there an alternative way of doing what I want to do?
I am on Sphinx 2.2.9.
There is no Sphinx syntax to do this directly.
The easiest would be just to run 4 separate queries and 'UNION' them in the application itself. Performance isn't going to be terrible.
... If you REALLY want to do it in Sphinx, you can exploit a couple of tricks to get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!), each with the same data, but with the field called something different (they duplicate each other!). You would also need an attribute on each one (more on why later):
source str1 {
    sql_query = SELECT id, namee AS field1, 1 AS idx FROM ...
    sql_attr_uint = idx
}
source str2 {
    sql_query = SELECT id, namee AS field2, 2 AS idx FROM ...
    sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
Then you can run a single query to get all the results kinda magically unioned...
MATCH('##relaxed #field1 ("Cameras") | #field2 ("Mobiles") | #field3 ("Washing Machine") | #field4 ("Graphic Cards")')
(The ##relaxed is important, as the fields are different. The matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute, the attribute identifies which term matched....
In Sphinx, there is a nice GROUP N BY where you only get a certain number of results for each attribute value, so putting it all together you could do:
SELECT *,WEIGHT() AS weight
FROM dist_index
WHERE MATCH('##relaxed #field1 ("Cameras") | #field2 ("Mobiles") | #field3 ("Washing Machine") | #field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
Simples, eh?
(Note this only works if you want 4 from each index; if you want different limits it gets much more complicated!)