How do I list all the movies for each year using spark.sql, so that I get output like this?
Output:
(1988,{(Rain Man),(Die Hard)})
(1990,{(The Godfather: Part III),(Die Hard 2),(The Silence of the Lambs),(King of New York)})
(1992,{(Unforgiven),(Bad Lieutenant),(Reservoir Dogs)})
(1994,{(Pulp Fiction)})
This is the JSON data:
{ "id": "movie:1", "title": "Vertigo", "year": 1958, "genre": "Drama", "summary": "A retired San Francisco detective suffering from acrophobia investigates the strange activities of an old friend's wife, all the while becoming dangerously obsessed with her.", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }
Here is the code I have tried:
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val movies = hiveCtx.read.json("movies.json")
movies.createOrReplaceTempView("movies")
val ty = hiveCtx.sql("SELECT year, title FROM movies")
Please help me find the correct query.
Thanks for your help.
You can get something similar without using spark.sql. You can simply perform the operation on the dataframe itself:
movies.groupBy($"year").agg(concat_ws("; ", collect_list($"title"))).show
Dataset used:
{ "id": "movie:1", "title": "Vertigo", "year": 1958, "genre": "Drama", "summary": "A retired San Francisco detective suffering from acrophobia investigates the strange activities of an old friend's wife, all the while becoming dangerously obsessed with her.", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }
{ "id": "movie:2", "title": "The Blob", "year": 1958, "genre": "Drama", "summary": "The Blob", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }
Output:
+----+----------------------------------+
|year|concat_ws(; , collect_list(title))|
+----+----------------------------------+
|1958| Vertigo; The Blob|
+----+----------------------------------+
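The same grouping can probably also be written for spark.sql, since collect_list is available there as a SQL aggregate; a query along the lines of SELECT year, collect_list(title) FROM movies GROUP BY year should do it, though it has not been run against the asker's setup. To show what the aggregation computes, here is a minimal plain-Python sketch (not Spark); the function name titles_by_year and the sample rows are illustrative assumptions.

```python
from collections import defaultdict


def titles_by_year(rows):
    """Group movie titles by year, keeping input order within each year."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["year"]].append(row["title"])
    return dict(grouped)


# Hypothetical sample rows, reduced to the two fields the query needs.
rows = [
    {"title": "Vertigo", "year": 1958},
    {"title": "The Blob", "year": 1958},
]

for year, titles in sorted(titles_by_year(rows).items()):
    print(year, "; ".join(titles))
```

Each year maps to the list of its titles, which is the same shape that groupBy followed by collect_list produces.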
Is it possible to filter array items in Cosmos DB? For example, I need only the customer info and the first pet (from the pets array).
Current result:
[
{
"CustomerId": "100",
"name": "John",
"lastName": "Doe",
"pets": [
{
"id": "pet01",
"CustomerId": "100",
"name": "1st pet"
},
{
"id": "pet02",
"CustomerId": "100",
"name": "2nd pet"
}
]
}
]
Expected:
[
{
"CustomerId": "100",
"name": "John",
"lastName": "Doe",
"pets": [
{
"id": "pet01",
"CustomerId": "100",
"name": "1st pet"
}
]
}
]
You can use the ARRAY_SLICE function.
SQL:
SELECT c.CustomerId,c.name,c.lastName,ARRAY_SLICE(c.pets,0,1) as pets
FROM c
Result:
[
{
"CustomerId": "100",
"name": "John",
"lastName": "Doe",
"pets": [
{
"id": "pet01",
"CustomerId": "100",
"name": "1st pet"
}
]
}
]
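ARRAY_SLICE takes a start index and an optional length, so ARRAY_SLICE(c.pets, 0, 1) keeps at most one element starting at index 0. As a rough illustration of that behaviour outside Cosmos DB, here is a plain-Python sketch; the helper name first_pet_only is an assumption made for this example.

```python
def first_pet_only(customer):
    """Return a copy of the customer with pets trimmed to the first element."""
    trimmed = dict(customer)
    # Slicing (rather than indexing) keeps the result a list and
    # yields an empty list when the customer has no pets.
    trimmed["pets"] = customer.get("pets", [])[0:1]
    return trimmed


customer = {
    "CustomerId": "100",
    "name": "John",
    "lastName": "Doe",
    "pets": [
        {"id": "pet01", "CustomerId": "100", "name": "1st pet"},
        {"id": "pet02", "CustomerId": "100", "name": "2nd pet"},
    ],
}

print(first_pet_only(customer))
```

Like the query, the sketch returns an array with a single pet rather than the pet object itself.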
I am trying to filter the response by city, but I do not understand how to pass the filter parameters in the query.
I have tried different ways with no success. This is the response without any filter applied; I want to filter it for a particular city.
{
"index_name": "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69",
"title": "Real time Air Quality Index from various location",
"desc": "Real time Air Quality Index from various location",
"org_type": "Central",
"org": [
"Ministry of Environment and Forests",
"Central Pollution Control Board"
],
"sector": [
"Industrial Air Pollution"
],
"source": "data.gov.in",
"catalog_uuid": "a3e7afc6-b799-4ede-b143-8e074b27e0621",
"visualizable": "1",
"active": "1",
"created": 1543320551,
"updated": 1559683085,
"created_date": "2018-11-27T17:39:11Z",
"updated_date": "2019-06-05T02:48:05Z",
"target_bucket": {
"index": "air_quality",
"type": "a3e7afc6-b799-4ede-b143-8e074b27e0621",
"field": "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69"
},
"field": [
{
"id": "id",
"name": "id",
"type": "double"
},
{
"id": "country",
"name": "country",
"type": "keyword"
},
{
"id": "state",
"name": "state",
"type": "keyword"
},
{
"id": "city",
"name": "city",
"type": "keyword"
},
{
"id": "station",
"name": "station",
"type": "keyword"
},
{
"id": "last_update",
"name": "last_update",
"type": "date"
},
{
"id": "pollutant_id",
"name": "pollutant_id",
"type": "keyword"
},
{
"id": "pollutant_min",
"name": "pollutant_min",
"type": "double"
},
{
"id": "pollutant_max",
"name": "pollutant_max",
"type": "double"
},
{
"id": "pollutant_avg",
"name": "pollutant_avg",
"type": "double"
},
{
"id": "pollutant_unit",
"name": "pollutant_unit",
"type": "keyword"
}
],
"status": "ok",
"message": "Resource detail",
"total": 1000,
"count": 10,
"limit": "10",
"offset": "8",
"records": [
{
"id": "13",
"country": "India",
"state": "Andhra_Pradesh",
"city": "Rajamahendravaram",
"station": "Anand Kala Kshetram, Rajamahendravaram - APPCB",
"last_update": "05-06-2019 02:00:00",
"pollutant_id": "CO",
"pollutant_min": "2",
"pollutant_max": "50",
"pollutant_avg": "28",
"pollutant_unit": "NA"
},
{
"id": "14",
"country": "India",
"state": "Andhra_Pradesh",
"city": "Rajamahendravaram",
"station": "Anand Kala Kshetram, Rajamahendravaram - APPCB",
"last_update": "05-06-2019 02:00:00",
"pollutant_id": "OZONE",
"pollutant_min": "37",
"pollutant_max": "132",
"pollutant_avg": "71",
"pollutant_unit": "NA"
},
{
"id": "16",
"country": "India",
"state": "Andhra_Pradesh",
"city": "Tirupati",
"station": "Tirumala, Tirupati - APPCB",
"last_update": "05-06-2019 02:00:00",
"pollutant_id": "PM10",
"pollutant_min": "33",
"pollutant_max": "72",
"pollutant_avg": "55",
"pollutant_unit": "NA"
}
],
"version": "2.1.0"
}
This is the only documentation on how to do filtering:
properties: OrderedMap { "id": OrderedMap { "type": "integer" }, "date": OrderedMap { "type": "integer" } }
How do I form the request URL to filter the response?
Yes, the documentation is very poor, but after many attempts I got it to work like this:
https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=<your key>&format=json&offset=0&limit=10&filters[pollutant_id]=NO2
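Since city is listed among the fields in the response above, filtering by city presumably follows the same filters[<field>]=<value> pattern, although only pollutant_id was actually tried here. A small Python sketch that builds such a URL without sending any request; build_url, the placeholder key, and the city filter are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE = "https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69"


def build_url(api_key, filters, offset=0, limit=10):
    """Build a request URL with one filters[<field>] parameter per filter."""
    params = {
        "api-key": api_key,
        "format": "json",
        "offset": offset,
        "limit": limit,
    }
    for field, value in filters.items():
        params[f"filters[{field}]"] = value
    # urlencode percent-encodes the square brackets (%5B and %5D),
    # which is the escaped form of the same parameter name.
    return BASE + "?" + urlencode(params)


url = build_url("YOUR_KEY", {"city": "Tirupati"})
print(url)
```

Several filters can be combined by passing more than one entry in the dictionary, assuming the API accepts multiple filters[...] parameters.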
I have tried, to no avail, to get the country_code (CA) or phone to be filled in after successfully calling create payment.
The country always shows up as "United States" and the phone as "+1".
The result is the same with or without shipping_address and shipping_preference: NO_SHIPPING. I have followed the so-called examples (which might be outdated or incorrectly documented) and the API documentation, which would benefit from including examples.
This is the create-payment query in JSON. I get the same structure back, with the following additions:
id:PAY-xxx
state:created
phone in payer_info (but no country_code as expected from the API)
created_time
...and...
PayPal's links in links.
These indicate that the call was successful.
Either I should ditch "/v1/payments/payment" for something else I'm not aware of, or the PayPal API is not up to date.
----- json create-payment query -----
{
"intent": "sale",
"payer": {
"payment_method": "paypal",
"payer_info": {
"email": "<snip>",
"first_name": "Bob",
"last_name": "Smith",
"billing_address": {
"line1": "1 notre dame",
"line2": "",
"city": "Montreal",
"country_code": "CA",
"postal_code": "H1H 1H1",
"phone": "011862212345678",
"state": "QC"
}
}
},
"application_context": {
"brand_name": "Server-side Test",
"locale": "fr_CA",
"landing_page": "Billing"
},
"transactions": [
{
"description": "The payment transaction description.",
"invoice_number": "5b5a38cb35bb7",
"custom": "merchant custom data",
"payment_options": {
"allowed_payment_method": "INSTANT_FUNDING_SOURCE"
},
"amount": {
"total": "5.75",
"currency": "CAD",
"details": {
"subtotal": "5",
"tax": "0.75"
}
},
"item_list": {
"items": [
{
"name": "item 1",
"description": "item 1 description",
"quantity": "1",
"price": "1",
"tax": "0.15",
"currency": "CAD"
},
{
"name": "item 2",
"description": "item 2 description",
"quantity": "2",
"price": "2",
"tax": "0.6",
"currency": "CAD"
}
],
"shipping_address": {
"recipient_name": "Bob Smith",
"line1": "1 notre dame",
"line2": "",
"city": "Montreal",
"country_code": "CA",
"postal_code": "H1H 1H1",
"phone": "011862212345678",
"state": "QC"
}
}
}
],
"redirect_urls": {
"return_url": "http:\/\/<snip>\/return.php",
"cancel_url": "http:\/\/<snip>\/cancel.php"
}
}
There are three master collections: category, subcategory, and criteria. I will be building frameworks from any possible combination of category, subcategory, and criteria, stored as shown below.
The framework document below holds a list of criteriaConfigs as embedded objects, each of which contains a single category, subcategory, and criteria object. You can think of criteriaConfig as the equivalent of a link table in MySQL.
[
{
"id": "592bc3059f3ad715002b2331",
"name": "Framework1",
"description": "framework 1 for testing",
"criteriaConfigs": [
{
"id": "592bc3059f3ad715002b232f",
"category": {
"id": "591c2f5faa187956b2d0fb39",
"name": "category1",
"description": "category1",
"deleted": false,
"createdDate": 1495019359558
},
"subCategory": {
"id": "591c2f5faa187956b2d0fb83",
"name": "subCat1",
"description": "subCat1"
},
"criteria": {
"id": "591c2f5faa187956b2d0fbad",
"name": "criteria1",
"measure": "Action"
}
},
{
"id": "592bc3059f3ad715002b232e",
"category": {
"id": "591c2f5faa187956b2d0fb37",
"name": "Process",
"description": "Enagagement"
},
"subCategory": {
"id": "591c2f5faa187956b2d0fb81",
"name": "COMM / BRANDING",
"description": "COMM / BRANDING"
},
"criteria": {
"id": "591c2f5faa187956b2d0fba9",
"name": "Company representative forgets about customer on hold",
"measure": ""
}
} ]
},
{
"id": "592bc3059f3ad715002b2332",
"name": "Framework2",
"description": "framework 2 for testing",
"criteriaConfigs": [
{
"id": "592bc3059f3ad715002b232f",
"category": {
"id": "591c2f5faa187956b2d0fb39",
"name": "category1",
"description": "category1"
},
"subCategory": {
"id": "591c2f5faa187956b2d0fb83",
"name": "subCat1",
"description": "subCat1"
},
"criteria": {
"id": "591c2f5faa187956b2d0fbad",
"name": "criteria1",
"measure": "Action"
}
}
]
}
]
I need a view in which each framework contains its list of categories, each category contains the list of subcategories added to it, and each subcategory contains its list of criteria, all scoped to that single framework.
Expected result:
[
{
"id": "f1",
"name": "Framework1",
"description": "framework 1 for testing",
"categories": [
{
"id": "c2",
"name": "category2",
"description": "category2",
"subCategories": [
{
"id": "sb1",
"name": "subCat1",
"description": "subCat1",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr2",
"name": "criteria2",
"measure": "Action"
},
{
"id": "cr3",
"name": "criteria3",
"measure": "Action"
}]
},
{
"id": "sb2",
"name": "subCat2",
"description": "subCat2",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr4",
"name": "criteria4",
"measure": "Action"
}]
}]
},
{
"id": "c1",
"name": "category1",
"description": "category1",
"subCategories": [
{
"id": "sb3",
"name": "subCat3",
"description": "subCat3",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr2",
"name": "criteria2",
"measure": "Action"
}
]},
{
"id": "sb2",
"name": "subCat2",
"description": "subCat2",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr4",
"name": "criteria4",
"measure": "Action"
}]
}
]
}]
},
{
"id": "f2",
"name": "Framework2",
"description": "framework 2 for testing",
"categories": [
{
"id": "c2",
"name": "category2",
"description": "category2",
"subCategories": [
{
"id": "sb4",
"name": "subCat5",
"description": "subCat5",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr3",
"name": "criteria3",
"measure": "Action"
}]
},
{
"id": "sb2",
"name": "subCat2",
"description": "subCat2",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr4",
"name": "criteria4",
"measure": "Action"
}]
}]
},
{
"id": "c1",
"name": "category1",
"description": "category1",
"subCategories": [
{
"id": "sb3",
"name": "subCat3",
"description": "subCat3",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr2",
"name": "criteria2",
"measure": "Action"
}
]},
{
"id": "sb2",
"name": "subCat2",
"description": "subCat2",
"criterias": [
{
"id": "cr1",
"name": "criteria1",
"measure": "Action"
},
{
"id": "cr4",
"name": "criteria4",
"measure": "Action"
}]
}
]
}]
}
]
Note: the category document has no reference to subcategory, and likewise the subcategory has no reference to the criteria object, because they are generic master data. A framework is created dynamically from their combination.
If you want to try to do all the work in the aggregation, you could group first by subcategory, then by category like:
db.collection.aggregate([
{$unwind:"$criteriaConfigs"},
{$project:{
_id:0,
category:"$criteriaConfigs.category",
subCategory:"$criteriaConfigs.subCategory",
criteria:"$criteriaConfigs.criteria"
}},
{$group:{
_id:{"category":"$category","subCategory":"$subCategory"},
criteria:{$addToSet:"$criteria"}
}},
{$group:{
_id:{"category":"$_id.category"},
subCategories:{$addToSet:{subCategory:"$_id.subCategory",
criteria:"$criteria"}}
}},
{$project:{
_id:0,category:"$_id.category",
subCategories:"$subCategories"
}}
])
Depending on how you plan to use the returned data, it may be more efficient to return each unique combination:
db.collection.aggregate([
{$unwind:"$criteriaConfigs"},
{$group:{
_id:{
category:"$criteriaConfigs.category.name",
subCategory:"$criteriaConfigs.subCategory.name",
criteria:"$criteriaConfigs.criteria.name"
}
}},
{$project:{
_id:0,
category:"$_id.category",
subCategory:"$_id.subCategory",
criteria:"$_id.criteria"
}}
])
I'm not sure from your question what shape you are expecting the return data to have, so you may need to adjust for that.
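If the nested shape from the question is what is needed, one option is to do the regrouping on the client after fetching the framework documents. The following is a plain-Python sketch of that idea, not a MongoDB aggregation; nest_framework and the reduced sample document are assumptions made for illustration.

```python
def nest_framework(framework):
    """Regroup flat criteriaConfigs into categories -> subCategories -> criterias."""
    # category id -> (category fields, {subcategory id -> subcategory entry})
    cats = {}
    for config in framework.get("criteriaConfigs", []):
        cat = config["category"]
        sub = config["subCategory"]
        crit = config["criteria"]
        _, subs = cats.setdefault(cat["id"], (cat, {}))
        sub_entry = subs.setdefault(sub["id"], {**sub, "criterias": []})
        if crit not in sub_entry["criterias"]:  # skip duplicate criteria
            sub_entry["criterias"].append(crit)
    return {
        "id": framework["id"],
        "name": framework["name"],
        "description": framework["description"],
        "categories": [
            {**cat, "subCategories": list(subs.values())}
            for cat, subs in cats.values()
        ],
    }


# Hypothetical framework with two criteria under the same category and subcategory.
framework = {
    "id": "f1",
    "name": "Framework1",
    "description": "framework 1 for testing",
    "criteriaConfigs": [
        {"category": {"id": "c1", "name": "category1"},
         "subCategory": {"id": "sb1", "name": "subCat1"},
         "criteria": {"id": "cr1", "name": "criteria1", "measure": "Action"}},
        {"category": {"id": "c1", "name": "category1"},
         "subCategory": {"id": "sb1", "name": "subCat1"},
         "criteria": {"id": "cr2", "name": "criteria2", "measure": "Action"}},
    ],
}

nested = nest_framework(framework)
print(nested["categories"])
```

Because the grouping runs per framework document, the same subcategory can appear under different categories or frameworks without the results being merged.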
{
"_id": ObjectId("4ed8d496c605da94400001e4"),
"status": 1,
"user": {
"uid": 1
},
"nid": 10582,
"form": {
"your-name": "Bob Smith",
"description": "",
"photo": "",
"address": "123 Turk Hill Rd",
"city": "",
"zip": "14450"
},
"location": {
"address": "123 Turk Hill Rd",
"city": "",
"zip": "14450",
"geo_lat": 43.0329181,
"geo_lng": -77.4391148,
"address_confirmed": "123 Turk Hill Rd, Victor, NY 14564, USA",
"address_status": 200,
"accuracy": 8
},
"keywords": {
"0": "bob",
"1": "smith",
"2": "",
"4": "123",
"5": "turk",
"6": "hill",
"7": "rd",
"9": "14450"
},
"time": ISODate("2011-12-02T13:37:26.0Z")
}
Search:
{
nid: 10582,
keywords: {"$in": ['turk']}
}
Results: none!
What am I doing wrong?
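The answer is simple: keywords is not an array. It is stored as an embedded document with numeric keys, so $in has no array elements to match against. To search on keywords, you need to change the document structure as follows: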
{
...
"keywords": [
"bob",
"smith",
"123",
"turk",
"hill",
"rd",
"14450"
],
...
}
This usually happens when the driver serializes a dictionary. At the moment there is no way to search by value in such a structure. Simply use arrays instead of dictionaries, or convert the dictionary to an array before serializing the document and vice versa when deserializing it.
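The conversion before serializing could look roughly like the following Python sketch. It assumes the keys are numeric strings, as in the document above, and that empty entries can be dropped; keywords_to_list is a hypothetical helper name.

```python
def keywords_to_list(keywords):
    """Turn {"0": "bob", "1": "smith", ...} into an ordered list of non-empty words."""
    # Sort by the numeric value of the key so "10" does not sort before "2".
    ordered = sorted(keywords.items(), key=lambda item: int(item[0]))
    return [word for _, word in ordered if word]


keywords = {
    "0": "bob", "1": "smith", "2": "", "4": "123",
    "5": "turk", "6": "hill", "7": "rd", "9": "14450",
}

print(keywords_to_list(keywords))
```

Storing the resulting list in the keywords field gives the array structure shown above, which the $in query can match against.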