Finding duplicates in PySpark

I have JSON objects that I load into a PySpark DataFrame:
{"id":"1", "make":"Audi", "model": "A3", "title": "Audi a3 engine 3.0", "description": "Car for sale, audi a3", "location": "New Jersey"}
{"id":"2", "make":"Audi", "model": "A3", "title": "Audi a3 3.0", "description": "Car for sale, audi a3, engine 3.0", "location": "New Jersey"}
{"id":"3", "make":"BMW", "model": "X3", "title": "Bmw x3 3.0", "description": "Car for sale, bmw x3, engine 3.0", "location": "New Jersey"}
How can I perform a similarity check on the fields make, model, title, and description, compute a score for each pair of rows, and add a new field to each row containing the id(s) of the highest-scoring matches?
It would be great if the result looked like this:
+----+-----+------------------+---------------------------------+----------------------------------------------+
|make|model|title |description |match |
+----+-----+------------------+---------------------------------+----------------------------------------------+
|Audi|A3 |Audi a3 engine 3.0|Car for sale, audi a3 |[{'id': 2, 'score':98}] |
|Audi|A3 |Audi a3 3.0 |Car for sale, audi a3, engine 3.0|[{'id': 1, 'score':98}] |
|BMW |X3 |Bmw x3 3.0 |Car for sale, bmw x3, engine 3.0 |[{'id':1, 'score':60}, {'id':2, 'score':80}] |
+----+-----+------------------+---------------------------------+----------------------------------------------+
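There is no built-in fuzzy matching in PySpark, so one approach is to cross-join the DataFrame with itself and score each pair of rows with a string-similarity function. The scoring core can be sketched in plain Python with difflib; the helper names (`text`, `similarity`) and the 0-100 scaling are illustrative choices, not from any particular library:

```python
from difflib import SequenceMatcher

rows = [
    {"id": "1", "make": "Audi", "model": "A3", "title": "Audi a3 engine 3.0",
     "description": "Car for sale, audi a3"},
    {"id": "2", "make": "Audi", "model": "A3", "title": "Audi a3 3.0",
     "description": "Car for sale, audi a3, engine 3.0"},
    {"id": "3", "make": "BMW", "model": "X3", "title": "Bmw x3 3.0",
     "description": "Car for sale, bmw x3, engine 3.0"},
]

def text(row):
    # Concatenate the fields that take part in the comparison.
    return " ".join(row[f] for f in ("make", "model", "title", "description")).lower()

def similarity(a, b):
    # 0-100 score, matching the scale in the desired output.
    return round(SequenceMatcher(None, text(a), text(b)).ratio() * 100)

for row in rows:
    scores = [{"id": other["id"], "score": similarity(row, other)}
              for other in rows if other["id"] != row["id"]]
    best = max(s["score"] for s in scores)
    # Keep only the highest-scoring match(es), as in the example result.
    row["match"] = [s for s in scores if s["score"] == best]
```

In PySpark the same logic would become a crossJoin of df with an aliased copy of itself, a UDF wrapping `similarity`, and a groupBy (or window) per id to keep the top score. A token-based ratio such as the one in the thefuzz package usually handles word reordering ("engine 3.0" moving between title and description) better than raw sequence matching.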

Related

How to query using an auto-generated id and know whether it is an edge id or a vertex id

How to query using the auto-generated id in the Apache AGE graph database.
I want to know whether it is an edge id or a vertex id.
It would sort of depend on what you're trying to MATCH on to determine if it is a vertex or an edge. I don't think it is possible to match on an abstract object in the graph.
For a vertex:
postgresDB=# SELECT * FROM cypher('airroutes', $$
MATCH (n)
WHERE id(n) = 844424930131969
RETURN n $$)
AS (n agtype);
n
------------------------------------------------------------------------
{"id": 844424930131969, "label": "airport", "properties": {"id": "1", "lat": "33.63669968", "lon": "-84.42810059", "city": "Atlanta", "code": "ATL", "desc": "Hartsfield - Jackson Atlanta International Airport", "elev": "1026", "icao": "KATL", "__id__": 1, "region": "US-GA", "country": "US", "longest": "12390", "runways": "5"}}::vertex
For an edge:
postgresDB=# SELECT * FROM cypher('airroutes', $$
MATCH ()-[e]-()
WHERE id(e) = 1688849860263937
RETURN e $$)
AS (n agtype);
n
------------------------------------------------------------------------
{"id": 1688849860263937, "label": "route", "end_id": 844424930131971, "start_id": 844424930131969, "properties": {"dist": "809", "route_id": "3749", "end_vertex_type": "airport"}}::edge
{"id": 1688849860263937, "label": "route", "end_id": 844424930131971, "start_id": 844424930131969, "properties": {"dist": "809", "route_id": "3749", "end_vertex_type": "airport"}}::edge
(2 rows)
The same edge appears twice because the undirected pattern ()-[e]-() matches it once in each direction; a directed pattern ()-[e]->() returns it once.

Redshift: can't parse multiple items from a JSON list into separate rows

I have JSON data in Redshift.
The "value" entries are stored in the inputs column as JSON text, and I need to create a unique record for each "value". Here is an example:
{"inputs": [{"name": "ambient", "desc": "Select from below the ambient setting that best decribe your environment right now", "values": ["Indoor - Loud", "Indoor - Normal", "Indoor - Whisper", "Outdoor - Loud", "Outdoor - Normal", "Outdoor - Whisper", "Semi-Outdoor - Loud", "Semi-Outdoor - Normal", "Semi-Outdoor - Whisper"]}]}
As a result it must be like this:
ProjectId: 10. Input value = Indoor - Loud
ProjectId: 10. Input value = Indoor - Normal
ProjectId: 10. Input value = Indoor - Whisper
Each value needs to be stored as one row in the dim_collect_user_inp_configs table. For example, Indoor - Loud becomes one row with its own unique identifier as prompt_input_value_id, Indoor - Normal becomes another row with its own prompt_input_value_id, and so on through Semi-Outdoor - Whisper.
There can be multiple input "name" entries in one inputs column, and each name with its values needs to be stored separately. Example:
[{"desc": "How many people does the video contain?", "name": "Number of People", "type": "dropdown", "values": ["", "Only 1", "2-3", "3+"]}, {"desc": "What is the camera position?", "name": "Movement", "type": "dropdown", "values": ["", "Fixed position", "Moving"]}, {"desc": "From which angle did you shoot the video?", "name": "Shoot Angle", "type": "dropdown", "values": ["", "Frontal recording", "Tight angle: 10-40 degree", "Wide angle: 40-70 degree"]}, {"desc": "From which distance did you shoot the video?", "name": "Distance", "type": "dropdown", "values": ["", "Near/Selfie", "Mid (3-6 ft)", "Far (>6 ft)"]}, {"desc": "What is the video lighting direction?", "name": "Lighting Direction", "type": "dropdown", "values": ["", "Front lit", "Side lit", "Back lit"]}, {"desc": "What is the video background?", "name": "Background", "type": "dropdown", "values": ["", "Outdoors", "In office", "At home", "Plain background"]}, {"desc": "What is the topic in your speech?", "name": "Topic", "type": "dropdown", "values": ["", "Arts and Media", "Business", "Education", "Entertainment", "Food/Eating", "Nutrition", "Healthcare ", "High School Life", "Mental Health", "News", "Technology", "Morals and Ethics", "Phones and Apps", "Sports", "Science"]}]
I tried this query:
WITH all_values AS (
    SELECT projectid,
           prompttype,
           json_extract_path_text(json_extract_array_element_text(inputs, 0, True), 'name') AS name,
           json_extract_path_text(json_extract_array_element_text(inputs, 0, True), 'desc') AS description,
           json_extract_path_text(json_extract_array_element_text(inputs, 0, True), 'values') AS value,
           scriptid,
           corpuscode
    FROM source.table
    WHERE prompttype = 'input'
    GROUP BY projectid, prompttype, name, description, scriptid, corpuscode, value
    LIMIT 10
)
SELECT * FROM all_values;
But this doesn't give me one row per "value" as I need. :(
Can you help me?
Thanks.
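For reference, the expansion being asked for is an ordinary flatten of the nested list: one output row per (name, value) pair. Outside Redshift it can be sketched in a few lines of Python (the projectid 10 and the shortened input are illustrative, taken from the example above):

```python
import json

# A shortened version of the "inputs" JSON text from the question.
inputs_text = '''[
  {"name": "ambient",
   "desc": "Select from below the ambient setting that best describes your environment",
   "values": ["Indoor - Loud", "Indoor - Normal", "Indoor - Whisper"]}
]'''

rows = []
for inp in json.loads(inputs_text):
    for value in inp["values"]:
        # One row per value, each keeping its input's name.
        rows.append({"projectid": 10, "name": inp["name"], "value": value})
```

This is what the Redshift query has to reproduce. The attempt above only ever extracts array element 0 and returns the whole "values" array as a single string, which is why it never yields one row per value.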

How do you combine data in a Scala dataframe and output it as JSON objects?

This is slightly hard to explain for me so I'll do my best. Here is the given data set:
Name | Car Brand | Car Model | Car Color | Year Bought
-----|-----------|-----------|-----------|------------
Tom  | Toyota    | Corolla   | Black     | 2009
Tom  | Hyundai   | Kona      | Blue      | 2010
Tom  | Kia       | Soul      | Red       | 2011
Bob  | Mazda     | CX-30     | Red       | 2008
Bob  | BMW       | X1        | Blue      | 2014
With the given data set, I want to condense it by name, putting all of a person's cars into a list, and output the result as JSON objects on separate lines in a file. For the above data set, the output should look like this:
{
"name": "Tom",
"Cars": [{
"CarSpecifications": {
"Brand": "Toyota",
"Model": "Corolla",
"Color": "Black"
},
"YearBought":2009
},
{
"CarSpecifications": {
"Brand": "Hyundai",
"Model": "Kona",
"Color": "Blue"
},
"YearBought":2010
},
{
"CarSpecifications": {
"Brand": "Kia",
"Model": "Soul",
"Color": "Red"
},
"YearBought":2011
}]
}
{
"name": "Bob",
"Cars": [{
"CarSpecifications": {
"Brand": "Mazda",
"Model": "CX-30",
"Color": "Red"
},
"YearBought":2008
},
{
"CarSpecifications": {
"Brand": "BMW",
"Model": "X1",
"Color": "Blue"
},
"YearBought":2014
}]
}
How could I accomplish these transformations using Scala and Scala Dataframes?
You can aggregate the dataset using groupBy & collect_list and generate JSON strings with toJSON (to write one JSON object per line to a file, replace show(false) with .write.text(path)):
import org.apache.spark.sql.functions.{collect_list, struct}

df.groupBy("Name")
  .agg(collect_list(
    struct(
      struct(
        $"Car Brand".as("Brand"),
        $"Car Model".as("Model"),
        $"Car Color".as("Color")
      ).as("CarSpecifications"),
      $"Year Bought".as("YearBought")
    )
  ).as("Cars"))
  .toJSON
  .show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Name":"Tom","Cars":[{"CarSpecifications":{"Brand":"Toyota","Model":"Corolla","Color":"Black"},"YearBought":"2009"},{"CarSpecifications":{"Brand":"Hyundai","Model":"Kona","Color":"Blue"},"YearBought":"2010"},{"CarSpecifications":{"Brand":"Kia","Model":"Soul","Color":"Red"},"YearBought":"2011"}]}|
|{"Name":"Bob","Cars":[{"CarSpecifications":{"Brand":"Mazda","Model":"CX-30","Color":"Red"},"YearBought":"2008"},{"CarSpecifications":{"Brand":"BMW","Model":"X1","Color":"Blue"},"YearBought":"2014"}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

RedShift: Unnest subquery's result on leader is not supported

I am trying to parse JSON in Redshift.
My string in column "inputs" is:
[{"desc": "How many people does the video contain?", "name": "Number of People", "type": "dropdown", "values": ["", "Only 1", "2-3", "3+"]}, {"desc": "What is the camera position?", "name": "Movement", "type": "dropdown", "values": ["", "Fixed position", "Moving"]}, {"desc": "From which angle did you shoot the video?", "name": "Shoot Angle", "type": "dropdown", "values": ["", "Frontal recording", "Tight angle: 10-40 degree", "Wide angle: 40-70 degree"]}, {"desc": "From which distance did you shoot the video?", "name": "Distance", "type": "dropdown", "values": ["", "Near/Selfie", "Mid (3-6 ft)", "Far (>6 ft)"]}, {"desc": "What is the video lighting direction?", "name": "Lighting Direction", "type": "dropdown", "values": ["", "Front lit", "Side lit", "Back lit"]}, {"desc": "What is the video background?", "name": "Background", "type": "dropdown", "values": ["", "Outdoors", "In office", "At home", "Plain background"]}, {"desc": "What is the topic in your speech?", "name": "Topic", "type": "dropdown", "values": ["", "Arts and Media", "Business", "Education", "Entertainment", "Food/Eating", "Nutrition", "Healthcare ", "High School Life", "Mental Health", "News", "Technology", "Morals and Ethics", "Phones and Apps", "Sports", "Science"]}]
My task is: "Each value, name, desc from JSON need to be stored as one row in the table".
Example:
id: 1, desc: "How many people does the video contain?"
id: 2, desc: "What is the camera position?"
etc.
I use query:
SELECT c.*, d.desc, d.name, d.values FROM source.table AS c, c.inputs AS d;
And got an ERROR: navigation on column "inputs" is not allowed as it is not SUPER type
And query:
SELECT c.*, d.desc, d.name, d.values FROM source.table AS c, JSON_PARSE(c.inputs) AS d;
gives me another error: "function expression in FROM may not refer to other relations of same query level"
But when I create test JSON like this:
CREATE TABLE test_parse_json_super
(
id smallint,
details super
);
INSERT INTO test_parse_json_super VALUES(1, JSON_PARSE('[{"desc": "How many people does the video contain?", "name": "Number of People", "type": "dropdown", "values": ["", "Only 1", "2-3", "3+"]}, {"desc": "What is the camera position?", "name": "Movement", "type": "dropdown", "values": ["", "Fixed position", "Moving"]}, {"desc": "From which angle did you shoot the video?", "name": "Shoot Angle", "type": "dropdown", "values": ["", "Frontal recording", "Tight angle: 10-40 degree", "Wide angle: 40-70 degree"]}, {"desc": "From which distance did you shoot the video?", "name": "Distance", "type": "dropdown", "values": ["", "Near/Selfie", "Mid (3-6 ft)", "Far (>6 ft)"]}, {"desc": "What is the video lighting direction?", "name": "Lighting Direction", "type": "dropdown", "values": ["", "Front lit", "Side lit", "Back lit"]}, {"desc": "What is the video background?", "name": "Background", "type": "dropdown", "values": ["", "Outdoors", "In office", "At home", "Plain background"]}, {"desc": "What is the topic in your speech?", "name": "Topic", "type": "dropdown", "values": ["", "Arts and Media", "Business", "Education", "Entertainment", "Food/Eating", "Nutrition", "Healthcare ", "High School Life", "Mental Health", "News", "Technology", "Morals and Ethics", "Phones and Apps", "Sports", "Science"]}]'));
and use the query "SELECT c.*, d.desc, d.name, d.values FROM test_parse_json_super AS c, c.details AS d;" from the official Redshift docs, it works fine and all the data from the JSON is parsed into separate rows, so the JSON is correct.
How can I fix the query to work with my real data?
Thanks.
It looks like you are confusing the SUPER data type with a sub-select. You cannot cast a string to SUPER (using JSON_PARSE) and then navigate it in the FROM clause at the same level of the query. I don't have a complete understanding of your situation, but I think something like this should get you closer:
SELECT c.*, d.desc, d.name, d.values
FROM (
    SELECT *, JSON_PARSE(inputs) AS inputs_super
    FROM source.table
) AS c,
c.inputs_super AS d;
(This is an off the cuff response to show structure so please forgive any syntax issues)
A more correct version:
SELECT c.*, d.desc, d.name, d.values
FROM (
    SELECT id, created, JSON_PARSE(inputs) AS inputs_super
    FROM source.table
    WHERE prompttype = 'input'
) AS c,
c.inputs_super AS d;
I ran into a similar issue. The JSON has to be parsed into a created table before use; it cannot be referenced via a CTE or subquery and then navigated. Therefore I believe the answer is:
CREATE TABLE my_table_with_super
AS
(
SELECT c.*,
JSON_PARSE(c.inputs) AS inputs_json_super
FROM source.table c
);
SELECT c.*,
json_row.desc,
json_row.name,
json_row.values
FROM my_table_with_super as c,
c.inputs_json_super as json_row;

Search and update a JSON array element in Postgres

I have a jsonb column that stores an array of elements like the following:
[
{"id": "11", "name": "John", "age":"25", ..........},
{"id": "22", "name": "Mike", "age":"35", ..........},
{"id": "33", "name": "Tom", "age":"45", ..........},
.....
]
I want to replace the 2nd object (id=22) with a completely new object. I don't want to update each property one by one because there are many properties and all of their values could have changed; I just want to identify the 2nd element and replace the whole object.
I know there is jsonb_set(). However, to update the 2nd element I need to know that its array index is 1, so I can do the following:
jsonb_set(data, '{1}', '{"id": "22", "name": "Don", "age":"55"}',true)
But I couldn't find any way to search and get that index. Can someone help me out?
One way I can think of is to combine row_number and jsonb_array_elements:
-- test data
create table test (id integer, data jsonb);
insert into test values (1, '[{"id": "22", "name": "Don", "age":"55"}, {"id": "23", "name": "Don2", "age":"55"},{"id": "24", "name": "Don3", "age":"55"}]');
insert into test values (2, '[{"id": "32", "name": "Don", "age":"55"}, {"id": "33", "name": "Don2", "age":"55"},{"id": "34", "name": "Don3", "age":"55"}]');
select subrow, id, row_number() over (partition by id)
from (
    select jsonb_array_elements(data) as subrow, id
    from test
) as t;
subrow | id | row_number
------------------------------------------+----+------------
{"id": "22", "name": "Don", "age":"55"} | 1 | 1
{"id": "23", "name": "Don2", "age":"55"} | 1 | 2
{"id": "24", "name": "Don3", "age":"55"} | 1 | 3
{"id": "32", "name": "Don", "age":"55"} | 2 | 1
{"id": "33", "name": "Don2", "age":"55"} | 2 | 2
{"id": "34", "name": "Don3", "age":"55"} | 2 | 3
-- apparently you can filter what you want from here
select subrow, id, row_number() over (partition by id)
from (
    select jsonb_array_elements(data) as subrow, id
    from test
) as t
where subrow->>'id' = '23';
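To turn the located element into an actual update, one option is jsonb_array_elements(...) WITH ORDINALITY, which yields a stable 1-based position without a window function; subtract 1 to get the 0-based path that jsonb_set expects. A sketch against the test table above (untested, and the replacement object is illustrative):

```sql
UPDATE test
SET data = jsonb_set(
        data,
        ARRAY[(found.idx - 1)::text],  -- jsonb_set paths are 0-based
        '{"id": "23", "name": "Don2-updated", "age": "56"}'::jsonb,
        true)
FROM (
    SELECT id, ord AS idx
    FROM test,
         jsonb_array_elements(data) WITH ORDINALITY AS elem(value, ord)
    WHERE elem.value->>'id' = '23'
) AS found
WHERE test.id = found.id;
```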
In addition, think about your schema design. It may not be the best idea to store your data this way.