Save DF with JSON string as JSON without escape characters with Apache Spark - scala

I have a dataframe that contains an id column and a JSON string column:
val df = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp": 1475600496}"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp": 1475600498}""")
).toDF("id", "json")
I want to save this as JSON, with the json column written as a raw nested JSON object rather than as an escaped string.
When I run
df.write.json("path")
it saves my json column as a string:
{"id":0,"json":"{​\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }​"}
And what I need is:
{"id": 0,"json": {"device_id": 0,"device_type": "sensor-ipad","ip": "68.161.225.1","cca3": "USA","cn": "United States","temp": 25,"signal": 23,"battery_level": 8,"c02_level": 917,"timestamp": 1475600496}}
How can I achieve this? Please note that the structure of the JSON can differ from row to row; it may contain additional fields.

You can use the from_json function to parse the JSON string column into a struct column:
import org.apache.spark.sql.functions._

// Get the schema of the JSON data
// (you can also define your own schema instead of inferring it)
val json_schema = spark.read.json(df.select("json").as[String]).schema
val resultDf = df.withColumn("json", from_json($"json", json_schema))
Output:
{"id":0,"json":{"battery_level":8,"c02_level":917,"cca3":"USA","cn":"United States","device_id":0,"device_type":"sensor-ipad","ip":"68.161.225.1","signal":23,"temp":25,"timestamp":1475600496}}
{"id":1,"json":{"battery_level":6,"c02_level":1413,"cca3":"NOR","cn":"Norway","device_id":1,"device_type":"sensor-igauge","ip":"213.161.254.1","signal":18,"temp":30,"timestamp":1475600498}}

Related

'TypeError: StructType can not accept object

I'm trying to convert this JSON string data to a DataFrame in Databricks:
a = """{ "id": "a",
"message_type": "b",
"data": [ {"c":"abcd","timestamp":"2022-03-
01T13:10:00+00:00","e":0.18,"f":0.52} ]}"""
The schema I defined for the data is this:
schema = StructType(
    [
        StructField("id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("c", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("e", DoubleType(), False),
            StructField("f", DoubleType(), False),
        ]))),
    ]
)
and when I run this command
df = sqlContext.createDataFrame(sc.parallelize([a]), schema)
I get this error
PythonException: 'TypeError: StructType can not accept object '{ "id": "a",\n"message_type": "JobMetric",\n"data": [ {"c":"abcd","timestamp":"2022-03- \n01T13:10:00+00:00","e":0.18,"f":0.52=} ]' in type <class 'str'>'. Full traceback below:
If anyone could help me with this, I would much appreciate it!
Your a variable is wrong: the object inside the data array is wrapped in quotes, so it is a string rather than a JSON object.
"data": [ "{"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439}" ]
should be
"data": [ {"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439} ]
Also check that the field names line up with the schema, for example JobId versus job_id and Timestamp versus timestamp.
The issue is that when you pass a plain string together with a struct schema, createDataFrame expects row-like data (an RDD of parsed records), but in your current scenario it receives just a string object. To fix it, first parse the string into a JSON object and then create an RDD from it. See the logic below for details -
Input Data -
a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
"message_type": "JobMetric",
"data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
"score":0.9098145961761475,
"severity":0.5294908881187439
}
]
}"""
Converting to an RDD using the parsed JSON object -
from pyspark.sql.types import *
import json

schema = StructType(
    [
        StructField("run_id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("JobId", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("score", DoubleType(), False),
            StructField("severity", DoubleType(), False),
        ]))),
    ]
)

df = spark.createDataFrame(data=sc.parallelize([json.loads(a)]), schema=schema)
df.show(truncate=False)
Output -
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|run_id |message_type|data |
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|1640c68e-5f02-4f49-943d-37a102f90146|JobMetric |[{ATLUPS10m2101V1, 2022-03-01T13:10:00+00:00, 0.9098145961761475, 0.5294908881187439}]|
+------------------------------------+------------+--------------------------------------------------------------------------------------+
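Alternatively, under the same setup, spark.read.json can parse an RDD of JSON strings directly, so the explicit json.loads step can be skipped (a sketch reusing the schema and the string a defined above):
# Sketch: let Spark parse the JSON string itself against the explicit schema
df2 = spark.read.schema(schema).json(sc.parallelize([a]))
df2.show(truncate=False)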

Why is apache-hudi creating a COPY_ON_WRITE table even though I have given MERGE_ON_READ?

I am trying to create a simple Hudi table with the MERGE_ON_READ table type.
After executing the code, I still see hoodie.table.type=COPY_ON_WRITE in the hoodie.properties file.
Am I missing something here?
Jupyter Notebook for this code: https://github.com/sannidhiteredesai/spark/blob/master/hudi_acct.ipynb
hudi_options = {
    "hoodie.table.name": "hudi_acct",
    "hoodie.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "acctid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 8,
    "hoodie.insert.shuffle.parallelism": 8,
}
input_df = spark.createDataFrame(
    [
        (100, "2015-01-01", "2015-01-01T13:51:39.340396Z", 10),
        (101, "2015-01-01", "2015-01-01T12:14:58.597216Z", 10),
        (102, "2015-01-01", "2015-01-01T13:51:40.417052Z", 10),
        (103, "2015-01-01", "2015-01-01T13:51:40.519832Z", 10),
        (104, "2015-01-02", "2015-01-01T12:15:00.512679Z", 10),
        (104, "2015-01-02", "2015-01-01T12:15:00.512679Z", 10),
        (104, "2015-01-02", "2015-01-02T12:15:00.512679Z", 20),
        (105, "2015-01-02", "2015-01-01T13:51:42.248818Z", 10),
    ],
    ("acctid", "date", "ts", "deposit"),
)
# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_dataset)
)

update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T13:51:39.340396Z", 20)],
    ("acctid", "date", "ts", "deposit"),
)

# UPDATE
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_dataset)
)
Edit: After executing the above code I see 2 parquet files created in the date=2015-01-01 partition. When reading the 2nd parquet file I expected to get only the 1 updated record, but I can see all the other records in that partition as well.
The issue is with the "hoodie.table.type": "MERGE_ON_READ" configuration. You have to use hoodie.datasource.write.table.type instead. If you update the configuration as follows it will work; I have tested it.
hudi_options = {
    "hoodie.table.name": "hudi_acct",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "acctid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 8,
    "hoodie.insert.shuffle.parallelism": 8,
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": 10,
}
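To confirm the setting took effect, one option (a sketch, assuming hudi_dataset is a local filesystem path as in the notebook) is to re-check the hoodie.properties file after the first write:
# Sketch: read .hoodie/hoodie.properties and print the recorded table type
with open(f"{hudi_dataset}/.hoodie/hoodie.properties") as f:
    props = dict(
        line.strip().split("=", 1)
        for line in f
        if "=" in line and not line.startswith("#")
    )
print(props.get("hoodie.table.type"))  # expected: MERGE_ON_READ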
Would you please try mode("overwrite") when using insert to load data into Hudi for the first time, and see if it works?
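A minimal sketch of that suggestion, reusing the DataFrames and options from the question:
# Sketch: overwrite on the initial load, then append for subsequent upserts
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save(hudi_dataset)
)

(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_dataset)
)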

plotting graph in sapui5 with pandas dataframe

Pandas supports DataFrame to JSON conversion, and a DataFrame can be converted to JSON data as shown below. (Parts 1) and 2) are just for reference and have nothing to do with SAPUI5.)
1) For example:
import pandas as pd
df = pd.DataFrame([['madrid', 10], ['venice', 20],['milan',40],['las vegas',35]],columns=['city', 'temp'])
df.to_json(orient="records")
gives:
[{"city":"madrid","temp":10},{"city":"venice","temp":20},{"city":"milan","temp":40},{"city":"las vegas","temp":35}]
and
df.to_json(orient="split")
gives:
{"columns":["city","temp"],"index":[0,1,2,3],"data":[["madrid",10],["venice",20],["milan",40],["las vegas",35]]}
Since we have JSON data, it can be used as input for the plot properties.
2) For the same JSON data I have created an API (running on localhost):
http://127.0.0.1:****/graph
The API written in Flask (just for reference):
from flask import Flask
import pandas as pd

app = Flask(__name__)

@app.route('/graph')
def plot():
    df = pd.DataFrame([['madrid', 10], ['venice', 20], ['milan', 40], ['las vegas', 35]],
                      columns=['city', 'temp'])
    jsondata = df.to_json(orient='records')
    return jsondata

if __name__ == '__main__':
    app.run()
postman result:
[
{
"city": "madrid",
"temp": 10
},
{
"city": "venice",
"temp": 20
},
{
"city": "milan",
"temp": 40
},
{
"city": "las vegas",
"temp": 35
}
]
3) How can I make use of this sample API to fetch the data and then plot a sample graph of city vs temp using SAPUI5?
I am looking for an example of how to do so, or any help on how to consume APIs in SAPUI5.

Extract value from cloudant IBM Bluemix NoSQL Database

How do I extract values from a Cloudant IBM Bluemix NoSQL database where the data is stored in JSON format?
I tried this code
def readDataFrameFromCloudant(host, user, pw, database):
    cloudantdata = spark.read.format("com.cloudant.spark"). \
        option("cloudant.host", host). \
        option("cloudant.username", user). \
        option("cloudant.password", pw). \
        load(database)
    cloudantdata.createOrReplaceTempView("washing")
    spark.sql("SELECT * from washing").show()
    return cloudantdata

hostname = ""
user = ""
pw = ""
database = "database"
cloudantdata = readDataFrameFromCloudant(hostname, user, pw, database)
It is stored in this format
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
I want this result (the expected and the actual outcome were shown as screenshots in the original question).
Create a dummy dataset for reproducing the issue:
cloudantdata = spark.read.json(sc.parallelize(["""
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
"""]))
cloudantdata.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', d=Row(count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234))]
Now flatten:
flat_df = cloudantdata.select("_id", "_rev", "d.*")
flat_df.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234)]
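Equivalently, the nested fields can be selected with SQL after registering a temp view (a sketch using the dummy cloudantdata built above and the view name from the question):
# Sketch: register a view over the dummy data and pull nested fields out with SQL
cloudantdata.createOrReplaceTempView("washing")
flat_df = spark.sql("SELECT _id, d.count, d.temperature, d.fluidlevel, d.ts FROM washing")
flat_df.show()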
I tested this code with an IBM Data Science Experience notebook using Python 3.5 (Experimental) with Spark 2.0
This answer is based on: https://stackoverflow.com/a/45694796/1033422

How can I load data from mongodb collection into pandas' DataFrame?

I am new to pandas (well, to all things "programming"...), but have been encouraged to give it a try.
I have a mongodb database - "test" - with a collection called "tweets".
I access the database in ipython:
import sys
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.test
tweets = db.tweets
The structure of the documents in tweets is as follows:
{u'entities': {u'hashtags': [],
u'symbols': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
u'id': 349223842700472320L,
u'id_str': u'349223842700472320',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'place': {u'attributes': {},
u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate]]],
u'type': u'Polygon'},
u'country': u'placeholder country',
u'country_code': u'example',
u'full_name': u'name, xx',
u'id': u'user id',
u'name': u'name',
u'place_type': u'city',
u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
u'retweet_count': 0,
u'retweeted': False,
u'source': u'Twitter for iPhone',
u'text': u'example text',
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
u'default_profile': False,
u'default_profile_image': False,
u'description': u'example description',
u'favourites_count': 100,
u'follow_request_sent': None,
u'followers_count': 100,
u'following': None,
u'friends_count': 100,
u'geo_enabled': True,
u'id': placeholder_id,
u'id_str': u'placeholder_id',
u'is_translator': False,
u'lang': u'en',
u'listed_count': 0,
u'location': u'example place',
u'name': u'example name',
u'notifications': None,
u'profile_background_color': u'000000',
u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_tile': False,
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_link_color': u'000000',
u'profile_sidebar_border_color': u'FFFFFF',
u'profile_sidebar_fill_color': u'000000',
u'profile_text_color': u'000000',
u'profile_use_background_image': False,
u'protected': False,
u'screen_name': u'placeholder screen_name',
u'statuses_count': xxxx,
u'time_zone': u'placeholder time_zone',
u'url': None,
u'utc_offset': -21600,
u'verified': False}}
Now, as far as I understand, pandas' main data structure - a spreadsheet-like table - is called DataFrame. How can I load the data from my "tweets" collection into pandas' DataFrame? And how can I query for a subdocument within the database?
Materialize the cursor you get from MongoDB (for example with list()) before passing it to DataFrame:
import pandas as pd
df = pd.DataFrame(list(tweets.find()))
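If the documents are large, like the tweets above, a projection keeps the DataFrame down to just the fields you need (a sketch; the field names are taken from the document structure shown in the question):
# Sketch: fetch only selected fields from MongoDB before building the DataFrame
cursor = tweets.find({}, {"_id": 0, "text": 1, "lang": 1, "retweet_count": 1})
df = pd.DataFrame(list(cursor))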
If you have data in MongoDb like this:
[
{
"name": "Adam",
"age": 27,
"address":{
"number": 4,
"street": "Main Road",
"city": "Oxford"
}
},
{
"name": "Steve",
"age": 32,
"address":{
"number": 78,
"street": "High Street",
"city": "Cambridge"
}
}
]
You can put the data straight into a dataframe like this:
from pandas import DataFrame
df = DataFrame(list(db.collection_name.find({})))
And you will get this output:
df.head()
| | name | age | address |
|----|---------|------|---------------------------------------------------------------|
| 0 | "Adam" | 27 | {"number": 4, "street": "Main Road", "city": "Oxford"} |
| 1 | "Steve" | 32 | {"number": 78, "street": "High Street", "city": "Cambridge"} |
However, the subdocuments will just appear as JSON inside the address cells. If you want to flatten the objects so that subdocument properties are shown as individual columns, you can use json_normalize without any extra parameters.
from pandas.io.json import json_normalize
datapoints = list(db.collection_name.find({}))
df = json_normalize(datapoints)
df.head()
This will give the dataframe in this format:
| | name | age | address.number | address.street | address.city |
|----|-------|------|----------------|----------------|--------------|
| 0 | Adam | 27 | 4 | "Main Road" | "Oxford" |
| 1 | Steve | 32 | 78 | "High Street" | "Cambridge" |
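Note that in recent pandas versions json_normalize is available at the top level, so the pandas.io.json import (which is deprecated there) can be replaced:
import pandas as pd

df = pd.json_normalize(datapoints)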
You can load your MongoDB data into a pandas DataFrame using this code. It works for me.
import pymongo
import pandas as pd
from pymongo import MongoClient

# Connection was removed from newer PyMongo releases; MongoClient is the replacement
connection = MongoClient()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
Use:
df = pd.DataFrame(list(collection.find()))
This is the simplest technique to achieve your aim.
import pymongo
import pandas as pd
from pymongo import MongoClient

conn = MongoClient()  # Connection was removed from newer PyMongo; use MongoClient
db = conn.your_database_name
input_data = db.your_collection_name
pandas_data_frame = pd.DataFrame(list(input_data.find()))
print(pandas_data_frame)