I'm using pyspark to create a dataframe from a JSON file.
The structure of the JSON file is as follows:
[
{
"Volcano Name": "Abu",
"Country": "Japan",
"Region": "Honshu-Japan",
"Location": {
"type": "Point",
"coordinates": [
131.6,
34.5
]
},
"Elevation": 571,
"Type": "Shield volcano",
"Status": "Holocene",
"Last Known Eruption": "Unknown",
"id": "4cb67ab0-ba1a-0e8a-8dfc-d48472fd5766"
},
{
"Volcano Name": "Acamarachi",
"Country": "Chile",
"Region": "Chile-N",
"Location": {
"type": "Point",
"coordinates": [
-67.62,
-23.3
}]
I will read in the file using the following line of code:
myjson = spark.read.json("/FileStore/tables/sample.json")
However, I keep on getting the following error message:
Spark Jobs
myjson:pyspark.sql.dataframe.DataFrame
_corrupt_record:string
Can someone let me know what I might be doing wrong?
Is the problem with the structure of the json file?
It seems your JSON spans multiple lines, which is why Spark puts everything into _corrupt_record: by default, spark.read.json expects one JSON object per line. To fix this, enable the multiline option:
myjson = (spark.read
    .option("multiline", "true")
    .option("mode", "PERMISSIVE")
    .json("/FileStore/tables/sample.json"))
Hope this solves the issue.
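A quick way to confirm the fix, assuming the read succeeds, is to inspect the inferred schema and a few rows; the _corrupt_record column should be gone:
# The real columns should now appear instead of _corrupt_record
myjson.printSchema()
myjson.select("Volcano Name", "Country", "Elevation").show()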
I'm trying to use LiteDB Studio 1.0.2 to import a JSON file. I've created a database and selected the menu option "Import from JSON". I selected a sample file that I found in a post related to NoSQL. It contains:
{
"Contact": [
{
"_id": 1,
"FullName": "Lucy",
"Email": "lucy#gmail.com",
"Phone": "4584656970",
"Address": "223 Str"
},
{
"_id": 2,
"FullName": "Tom",
"Email": "tom#gmail.com",
"Phone": "4375588889",
"Address": "123 Str"
}
]
}
The menu option produces a query template that I updated to:
SELECT $
INTO new_col
FROM $file('Sample.json');
Upon running the query, I immediately receive an error:
Unexpected token { in position 0.
SELECT $ INTO ^
Can anyone shed light on this error?
It seems that LiteDB expects an array of documents, so your file should start with [ and end with ], with the documents inside { } in between.
LiteDB expects an array of JSON elements, so your file should look like this example:
[
{
"_id": 1,
"FullName": "fullname_1",
"Email": "a#example1.com",
"Phone": "0123456789",
"Address": "123 Street"
},
{
"_id": 2,
"FullName": "fullname_2",
"Email": "fullname_2#example1.com",
"Phone": "9876543210",
"Address": "321 Street"
}
]
and your insert statement will then look like this:
SELECT $
INTO contacts
FROM $file('Contacts.json');
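If you don't want to edit the file by hand, a minimal Python sketch can unwrap the original object into the array LiteDB expects (the "Contact" key and the file names match the question; adjust as needed):
import json

# Load the original file, which wraps the documents in a "Contact" object
with open("Sample.json") as f:
    data = json.load(f)

# Write just the inner array, the top-level structure LiteDB expects
with open("Contacts.json", "w") as f:
    json.dump(data["Contact"], f, indent=2)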
I have made many GeoJSON Point objects like the one below in MongoDB Compass, as per the docs:
{
"_id": {
"$oid": "5e86d275a3d7fd05e4f022a8"
},
"location": {
"type": "point",
"coordinates": ["-110.85458435", "39.68476146"]
},
"website": "carbonmedicalservice.com",
"address": "125 S Main St",
"state": "UT",
"zip": "84526",
"name": "Carbon Medical"
}
I want to be able to search the collection to return all records within a rectangle. I believe I need to add a geospatial index to the location, but I get the error shown below...
This is how I entered the data:
In a GeoJSON object, coordinates should be float/double, not string.
Also, the type should be "Point", not "point" as in your case. So the GeoJSON in your case should be:
{
"type": "Point",
"coordinates": [-110.85458435, 39.68476146]
}
instead of
{
"type": "point",
"coordinates": ["-110.85458435", "39.68476146"]
}
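Once the documents store a proper GeoJSON Point, the rectangle search from the question works with a 2dsphere index and a $geoWithin query. A minimal sketch with pymongo, assuming a database and collection named "mydb" and "places":
from pymongo import MongoClient, GEOSPHERE

coll = MongoClient()["mydb"]["places"]

# GeoJSON queries on "location" need a 2dsphere index
coll.create_index([("location", GEOSPHERE)])

# Find all documents whose point lies inside a rectangle, expressed as a
# closed GeoJSON polygon (coordinates are [lng, lat]; the ring must end
# where it starts)
rectangle = {
    "type": "Polygon",
    "coordinates": [[
        [-111.0, 39.0],
        [-110.0, 39.0],
        [-110.0, 40.0],
        [-111.0, 40.0],
        [-111.0, 39.0],
    ]],
}
for doc in coll.find({"location": {"$geoWithin": {"$geometry": rectangle}}}):
    print(doc["name"])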
I'm new to MongoDB and have 4 different problems importing a big (16 GB) JSONL file into my MongoDB (simple PSA cluster).
Attached below you will find a sample entry from the mentioned JSON dump.
With this file, which I get from an external provider, I actually have 4 problems:
1. "hotel_id" is the key and should normally be (re)named "_id"
2. "hotel_id" should be treated as a Number rather than as a string
3. "location" is not properly formatted as GeoJSON (if I understood the MongoDB manual correctly); it should be like
"location": {
"type": "Point",
"coordinates": [-93.26838,37.15845]
}
instead of
"location": {
"coordinates": {
"latitude": 37.15845,
"longitude": -93.26838
}
}
"dates" can this be used to efficiently update just the records which needs to be updated?
So my challenge now is to transform the data according to my needs before importing it, or at import time, and in both cases of course as quickly as possible.
I therefore searched a lot for hints and best practices, but was not able to find a solution yet, maybe because I'm a beginner with MongoDB.
I played around with jq to adjust the data, for example to add the type that seems to be necessary for the location (point 3), but wasn't really successful:
cat dump.jsonl | ./bin/jq --arg typeOfField Point '.location + {type: $typeOfField}'
Besides that, I imported a sample dump of roughly 500 MB, which took 1.5 minutes the first time (empty database). If I run it in "upsert" mode, it takes roughly 12 hours. So I was also wondering: what is the best practice for importing such a big JSON dump?
Any help is appreciated!! :-)
Kind regards,
Lumpy
{
"hotel_id": "12345",
"name": "Test Hotel",
"address": {
"line_1": "123 Test St",
"line_2": "Apt A",
"city": "Test City",
},
"ratings": {
"property": {
"rating": "3.5",
"type": "Star"
},
"guest": {
"count": 48382,
"average": "3.1"
}
},
"location": {
"coordinates": {
"latitude": 22.54845,
"longitude": -90.11838
}
},
"phone": "555-0153",
"fax": "555-7249",
"category": {
"id": 1,
"name": "Hotel"
},
"rank": 42,
"dates": {
"added": "1998-07-19T05:00:00.000Z",
"updated": "2018-03-22T07:23:14.000Z"
},
"statistics": {
"11": {
"id": 11,
"name": "Total number of rooms - 220",
"value": "220"
},
"12": {
"id": 12,
"name": "Number of floors - 7",
"value": "7"
}
},
"chain": {
"id": -2,
"name": "Test Hotels"
},
"brand": {
"id": 2,
"name": "Test Brand"
}
}
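Points 1-3 (renaming "hotel_id" to "_id", storing it as a number, and rewriting "location" as a GeoJSON Point) can be handled in one streaming pass over the JSONL file before import. A minimal Python sketch, assuming one JSON object per line and the hypothetical file names used here:
import json

# Stream the dump line by line so memory use stays flat even at 16 GB
with open("dump.jsonl") as src, open("dump_fixed.jsonl", "w") as dst:
    for line in src:
        doc = json.loads(line)

        # 1 + 2: rename hotel_id to _id and store it as a number
        doc["_id"] = int(doc.pop("hotel_id"))

        # 3: rewrite location as a GeoJSON Point ([longitude, latitude] order)
        coords = doc["location"]["coordinates"]
        doc["location"] = {
            "type": "Point",
            "coordinates": [coords["longitude"], coords["latitude"]],
        }

        dst.write(json.dumps(doc) + "\n")
The fixed file can then be loaded with a plain mongoimport run for the initial load, which avoids the much slower upsert path.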
I've configured kafka-connect-spooldir to consume files containing JSON objects according to the instructions at https://github.com/jcustenborder/kafka-connect-spooldir. This consumes files containing one or more JSON objects. Now how can I configure this to consume a file containing a JSON array instead?
Here is my current key and value schemas:
key.schema={"name": "com.example.users.UserKey", "type": "STRUCT", "isOptional": false, "fieldSchemas": {"id": {"type": "INT64", "isOptional": false }}}
value.schema={"name": "com.example.users.User", "type": "STRUCT", "isOptional": false, "fieldSchemas": {"id": {"type": "INT64", "isOptional": false}, "test": {"type": "STRING", "isOptional": true}}}
Here is a sample of my data:
{
"id": 10,
"test": "Carla Howe"
}
{
"id": 1,
"test": "Gayle Becker"
}
Here is what I would like the data to look like:
[
{
"id": 10,
"test": "Carla Howe"
},
{
"id": 1,
"test": "Gayle Becker"
}
]
I've tried simply changing the first type from STRUCT to ARRAY, but this throws an NPE: "valueSchema cannot be null".
Can someone please point me in the right direction, or provide an example?
According to the documentation, there is a SchemaGenerator tool that can be run to generate the schema from sample data.
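If the schema route doesn't pan out, another option is to flatten incoming array files back into the one-object-per-line format the connector already consumes, before they land in the spool directory. A minimal Python sketch with hypothetical file names:
import json

# Read a file containing one JSON array of records
with open("users_array.json") as f:
    records = json.load(f)

# Write them back out as one JSON object per line, the format that the
# existing spooldir configuration already handles
with open("users.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")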
I am trying to use Mapbox to calculate the duration between two locations, however the examples here are incomplete (at least for someone with my limited experience). I would like to connect to this API using server-side Java; however, I can't even get a basic example working in JavaScript, Python, or simply in the address bar of my browser.
I can get an example working in my browser using this url and substituting in my API key:
https://api.mapbox.com/geocoding/v5/mapbox.places/Chester.json?country=us&access_token=pk.my-token-value
However I can't get a similar example working with the distance API. The best I can manage is something like this:
https://api.mapbox.com/distances/v1/driving/[13.41894,52.50055],[14.10293,52.50055]&access_token=pk.my-token-value.
But I have no idea how to format my coordinates, as I can't find a single example.
Has anyone been able to get this working? Ideally in Java, but client-side JavaScript or a valid URL would be a great start.
I should also add that I can't get the JavaScript or Python examples working, as they rely on external libraries that aren't referenced anywhere in the documentation!
Thanks in advance.
It looks like you can provide a list of 2 or more semicolon-separated coordinates:
https://api.mapbox.com/optimized-trips/v1/mapbox/driving/13.41894,52.50055;14.10293,52.50055?access_token=pk.your_token
returns:
{
"code":"Ok",
"waypoints":[
{"distance":9.0352511932471,"name":"","location":[13.418991,52.500625],"waypoint_index":0,"trips_index":0},
{"distance":91.0575241567836,"name":"Berliner Chaussee","location":[14.103096,52.499738],"waypoint_index":1,"trips_index":0}
],
"trips":
[
{"geometry":"}_m_Iu{{pA}ZuQ}Iad#}cAk^jr#etOdE_iGtTqoAxBkoGnOkiCjR_s#wJ_v#b#}aN|aBogRyVucBiEw_C~r#_eB`Fc`NtP_bAshBorHa#}dCkOe~AmPmrGlPlrGjOd~A`#|dCrhBnrHuP~aAaFb`N_s#~dBhEv_CxVtcBkbBbeRo#l`NzJ`z#mRpr#qOpjCwBpnGoT~lAeEdkGsr#jtOtp#dQ~UjTtDfZf]jS",
"legs":[
{"summary":"","weight":4198.3,"duration":3426.4,"steps":[],"distance":49487},
{"summary":"","weight":7577.8,"duration":3501.3,"steps":[],"distance":49479.7}
],
"weight_name":"routability",
"weight":11776.1,
"duration":6927.700000000001,
"distance":98966.7}
]
}
I don't know if this has improved since, but I had to work through it just now and can give you this example:
curl "https://api.mapbox.com/directions-matrix/v1/mapbox/driving/9.51416,54.47004;13.5096410,50.0716190;6.8614070,48.5206360;14.1304840,51.0856060?sources=0&access_token={token}"
This will return the following JSON with driving durations. I made use of the sources attribute, which is my search lng/lat; all other points are places in my database.
{
"code": "Ok",
"durations": [
[0.0, 29407.9, 34504.7, 24163.5]
],
"destinations": [{
"distance": 131.946157371,
"name": "",
"location": [9.514914, 54.468939]
}, {
"distance": 34.636975593,
"name": "",
"location": [13.509868, 50.071344]
}, {
"distance": 295.206928933,
"name": "",
"location": [6.863566, 48.52287]
}, {
"distance": 1186.975749670,
"name": "",
"location": [14.115694, 51.080408]
}],
"sources": [{
"distance": 131.946157371,
"name": "",
"location": [9.514914, 54.468939]
}]
}
Adding the annotations=distance parameter to the URL will return the distances instead of the durations, if you need that:
{
"code": "Ok",
"distances": [
[0.0, 738127.3, 902547.6, 616060.8] // distances in meters
],
"destinations": [{ // destinations including the source
"distance": 131.946157371, // result 0
"name": "",
"location": [9.514914, 54.468939]
}, {
"distance": 34.636975593, // result 1
"name": "",
"location": [13.509868, 50.071344]
}, {
"distance": 295.206928933, // result 2
"name": "",
"location": [6.863566, 48.52287]
}, {
"distance": 1186.975749670, // result 3
"name": "",
"location": [14.115694, 51.080408]
}],
"sources": [{ // source where we start from
"distance": 131.946157371,
"name": "",
"location": [9.514914, 54.468939]
}]
}
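For the server-side use the question asks about, the same Matrix API request works from code. A minimal Python sketch with the requests library, mirroring the curl call above (the token and coordinates are placeholders, and the same URL works from Java with any HTTP client):
import requests

# First coordinate is the source, the rest are destinations;
# each coordinate is a lng,lat pair, pairs separated by ";"
coords = "9.51416,54.47004;13.5096410,50.0716190;6.8614070,48.5206360"
url = "https://api.mapbox.com/directions-matrix/v1/mapbox/driving/" + coords

resp = requests.get(url, params={
    "sources": 0,
    "annotations": "duration,distance",  # request both matrices at once
    "access_token": "pk.your-token-value",  # placeholder token
})
resp.raise_for_status()

matrix = resp.json()
print(matrix["durations"])  # seconds, one row per source
print(matrix["distances"])  # meters, one row per source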