TypeError: StructType can not accept object - pyspark

I'm trying to convert this JSON string data to a DataFrame in Databricks:
a = """{ "id": "a",
"message_type": "b",
"data": [ {"c":"abcd","timestamp":"2022-03-01T13:10:00+00:00","e":0.18,"f":0.52} ]}"""
The schema I defined for the data is this:
schema = StructType(
    [
        StructField("id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("c", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("e", DoubleType(), False),
            StructField("f", DoubleType(), False),
        ]))),
    ]
)
and when I run this command
df = sqlContext.createDataFrame(sc.parallelize([a]), schema)
I get this error
PythonException: 'TypeError: StructType can not accept object '{ "id": "a",\n"message_type": "JobMetric",\n"data": [ {"c":"abcd","timestamp":"2022-03-01T13:10:00+00:00","e":0.18,"f":0.52} ]' in type <class 'str'>'. Full traceback below:
If anyone could help me with this, I would much appreciate it!

Your a variable is wrong.
"data": [ "{"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439}" ]
should be
"data": [ {"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439} ]
Also check that the field names in the data actually match the schema, e.g. JobId vs. job_id and Timestamp vs. timestamp.
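A quick way to verify the corrected string parses as intended is a minimal check with the standard json module:
import json

parsed = json.loads(a)            # raises json.JSONDecodeError if the string is malformed
print(type(parsed["data"][0]))    # should print <class 'dict'>, not <class 'str'>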

The issue is that when you pass a plain string to a struct schema, Spark expects something like RDD([StringType, StringType, ...]), but in your current scenario it is getting just a string object. To fix it, first convert the string to a JSON object, and from there create an RDD. See the below logic for details -
Input Data -
a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
"message_type": "JobMetric",
"data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
           "score":0.9098145961761475,
           "severity":0.5294908881187439} ]
}"""
Converting to an RDD using the parsed JSON object -
from pyspark.sql.types import *
import json

schema = StructType(
    [
        StructField("run_id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("JobId", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("score", DoubleType(), False),
            StructField("severity", DoubleType(), False),
        ]))),
    ]
)
df = spark.createDataFrame(data=sc.parallelize([json.loads(a)]), schema=schema)
df.show(truncate=False)
Output -
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|run_id |message_type|data |
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|1640c68e-5f02-4f49-943d-37a102f90146|JobMetric |[{ATLUPS10m2101V1, 2022-03-01T13:10:00+00:00, 0.9098145961761475, 0.5294908881187439}]|
+------------------------------------+------------+--------------------------------------------------------------------------------------+
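Alternatively, if you don't need to enforce the schema yourself, spark.read.json can infer it directly from the string (a minimal sketch; the inferred types may differ from a hand-written schema):
df = spark.read.json(sc.parallelize([a]))
df.printSchema()
df.show(truncate=False)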

How to get Quantile/median values in pydruid

My goal is to query the median value of column height in my druid datasource. I was able to use other aggregations like count and count distinct values. Here's my query so far:
group = query.groupby(
    datasource=datasource,
    granularity='all',
    intervals='2020-01-01T00:00:00+00:00/2101-01-01T00:00:00+00:00',
    dimensions=[
        "category_a"
    ],
    filter=(Dimension("country") == country_id),
    aggregations={
        'count': longsum('count'),
        'count_distinct_city': aggregators.thetasketch('city'),
    }
)
There's a Quantile class under postaggregator.py, so I tried using that.
class Quantile(Postaggregator):
    def __init__(self, name, probability):
        Postaggregator.__init__(self, None, None, name)
        self.post_aggregator = {
            "type": "quantile",
            "fieldName": name,
            "probability": probability,
        }
Here's my attempt at getting the median:
post_aggregations={
    'median_value': postaggregator.Quantile(
        'height', 50
    )
}
The error I'm getting here is "Could not resolve type id 'quantile' as a subtype of [simple type, class io.druid.query.aggregation.PostAggregator]":
Druid Error: {'error': 'Unknown exception', 'errorMessage': 'Could not resolve type id \'quantile\' as a subtype of [simple type, class io.druid.query.aggregation.PostAggregator]: known type ids = [arithmetic, constant, doubleGreatest, doubleLeast, expression, fieldAccess, finalizingFieldAccess, hyperUniqueCardinality, javascript, longGreatest, longLeast, quantilesDoublesSketchToHistogram, quantilesDoublesSketchToQuantile, quantilesDoublesSketchToQuantiles, quantilesDoublesSketchToString, sketchEstimate, sketchSetOper, thetaSketchEstimate, thetaSketchSetOp] (for POJO property \'postAggregations\')\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 856] (through reference chain: io.druid.query.groupby.GroupByQuery["postAggregations"]->java.util.ArrayList[0])', 'errorClass': 'com.fasterxml.jackson.databind.exc.InvalidTypeIdException', 'host': None}
I modified the pydruid code to get this working on our end, creating a new aggregator and postaggregator under /pydruid/utils.
aggregator.py
def quantilesDoublesSketch(raw_column, k=128):
    return {"type": "quantilesDoublesSketch", "fieldName": raw_column, "k": k}
postaggregator.py
class QuantilesDoublesSketchToQuantile(Postaggregator):
    def __init__(self, name: str, field_name: str, fraction: float):
        self.post_aggregator = {
            "type": "quantilesDoublesSketchToQuantile",
            "name": name,
            "fraction": fraction,
            "field": {
                "fieldName": field_name,
                "name": field_name,
                "type": "fieldAccess",
            },
        }
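For the median, the two pieces might be wired together like this (a sketch assuming this patched pydruid and the DataSketches extension loaded on the Druid cluster; fraction=0.5 selects the median):
aggregations={
    # build a quantiles sketch over the raw column
    'height_sketch': aggregators.quantilesDoublesSketch('height'),
},
post_aggregations={
    # read the 0.5 quantile (the median) out of the sketch
    'median_value': postaggregator.QuantilesDoublesSketchToQuantile(
        'median_value', 'height_sketch', 0.5
    ),
}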
This was my first time creating a PR! Hopefully they accept it and publish it officially.
https://github.com/druid-io/pydruid/pull/287

Save DF with JSON string as JSON without escape characters with Apache Spark

I have a dataframe which contains an id column and a JSON string column:
val df = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }""")
).toDF("id", "json")
I want to save it as JSON with the nested JSON as a 'raw' object rather than an escaped string.
When I run
df.write.json("path")
it saves my json column as a string:
{"id":0,"json":"{​\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }​"}
And what I need is:
{"id": 0,"json": {"device_id": 0,"device_type": "sensor-ipad","ip": "68.161.225.1","cca3": "USA","cn": "United States","temp": 25,"signal": 23,"battery_level": 8,"c02_level": 917,"timestamp": 1475600496}}
How can I achieve this? Please note that the structure of the JSON can be different for each row; it may contain additional fields.
You can use the from_json function to parse the JSON string data into a new column:
// get the schema of the json data
// You can also define your own schema
import org.apache.spark.sql.functions._

val json_schema = spark.read.json(df.select("json").as[String]).schema
val resultDf = df.withColumn("json", from_json($"json", json_schema))
Output:
{"id":0,"json":{"battery_level":8,"c02_level":917,"cca3":"USA","cn":"United States","device_id":0,"device_type":"sensor-ipad","ip":"68.161.225.1","signal":23,"temp":25,"timestamp":1475600496}}
{"id":1,"json":{"battery_level":6,"c02_level":1413,"cca3":"NOR","cn":"Norway","device_id":1,"device_type":"sensor-igauge","ip":"213.161.254.1","signal":18,"temp":30,"timestamp":1475600498}}

plotting graph in sapui5 with pandas dataframe

pandas supports DataFrame-to-JSON conversion, so a dataframe can be converted to JSON data as shown below. (1) and (2) are just for reference and have nothing to do with sapui5.
1) For example:
import pandas as pd
df = pd.DataFrame([['madrid', 10], ['venice', 20],['milan',40],['las vegas',35]],columns=['city', 'temp'])
df.to_json(orient="records")
gives:
[{"city":"madrid","temp":10},{"city":"venice","temp":20},{"city":"milan","temp":40},{"city":"las vegas","temp":35}]
and
df.to_json(orient="split")
gives:
{"columns":["city","temp"],"index":[0,1,2,3],"data":[["madrid",10],["venice",20],["milan",40],["las vegas",35]]}
As we have JSON data, it could be used as input for the plot properties.
2) For the same JSON data I have created an API (running on localhost):
http://127.0.0.1:****/graph
The API in Flask (just for reference):
from flask import Flask
import pandas as pd

app = Flask(__name__)

@app.route('/graph')
def plot():
    df = pd.DataFrame([['madrid', 10], ['venice', 20], ['milan', 40], ['las vegas', 35]],
                      columns=['city', 'temp'])
    jsondata = df.to_json(orient='records')
    return jsondata

if __name__ == '__main__':
    app.run()
Postman result:
[
    {"city": "madrid", "temp": 10},
    {"city": "venice", "temp": 20},
    {"city": "milan", "temp": 40},
    {"city": "las vegas", "temp": 35}
]
3) How can I make use of this sample API to fetch the data and then plot a sample graph of city vs temp using sapui5?
I'm looking for an example of how to do this, or any help on how to consume APIs in sapui5.

AvroTypeException: Not an enum: MOBILE on DataFileWriter

I am getting the following error when trying to write Avro records using the built-in AvroKeyValueSinkWriter in Flink 1.3.2 with Avro 1.8.2.
My schema looks like this:
{"namespace": "com.base.avro",
"type": "record",
"name": "Customer",
"doc": "v6",
"fields": [
{"name": "CustomerID", "type": "string"},
{"name": "platformAgent", "type": {
"type": "enum",
"name": "PlatformAgent",
"symbols": ["WEB", "MOBILE", "UNKNOWN"]
}, "default":"UNKNOWN"}
]
}
And I am calling the following Flink code to write data:
var properties = new util.HashMap[String, String]()
val stringSchema = Schema.create(Type.STRING)
val myTypeSchema = Customer.getClassSchema
val keySchema = stringSchema.toString
val valueSchema = myTypeSchema.toString
val compress = true
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
val sink = new BucketingSink[org.apache.flink.api.java.tuple.Tuple2[String, Customer]]("s3://test/flink")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd/HH/mm/"))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB,
sink.setPendingSuffix(".avro")
val writer = new AvroKeyValueSinkWriter[String, Customer](properties)
sink.setWriter(writer.duplicate())
However, it throws the following errors:
Caused by: org.apache.avro.AvroTypeException: Not an enum: MOBILE
at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:177)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:119)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
... 10 more
Please suggest!
UPDATE 1:
I found that this is a kind of bug in Avro 1.8+, based on this ticket: https://issues-test.apache.org/jira/browse/AVRO-1810
It turns out this is an issue with Avro 1.8+; I had to override the version Flink uses:
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3"
The bug can be found here: https://issues-test.apache.org/jira/browse/AVRO-1810

Is there a way to return a JSON object within a xe:restViewColumn?

I'm trying to generate a REST service on an XPage with the viewJsonService service type.
Within a column I need to return a JSON object, and I tried to solve that with this code:
<xe:restViewColumn name="surveyResponse">
    <xe:this.value>
        <![CDATA[#{javascript:
            var arrParticipants = new Array();
            arrParticipants.push({"participant": "A", "selection": ["a1"]});
            arrParticipants.push({"participant": "B", "selection": ["b1", "b2"]});
            return (arrParticipants);
        }]]>
    </xe:this.value>
</xe:restViewColumn>
I was expecting to get this for that specific column:
...
"surveyResponse": [
    { "participant": "A",
      "selection": [ "a1" ]
    },
    { "participant": "B",
      "selection": [ "b1", "b2" ]
    }
]
...
What I am getting is this:
...
"surveyResponse": [
    "???",
    "???"
]
...
When trying to use toJson on the array arrParticipants, the result is not valid JSON:
...
"surveyResponse": "[{\"selection\": [\"a1\"],\"participant\":\"A\"},{\"selection\": [\"b1\",\"b2\"],\"participant\":\"B\"}]"
...
When trying to use fromJson on the array arrParticipants, the result is:
{
"code": 500,
"text": "Internal Error",
"message": "Error while executing JavaScript computed expression",
"type": "text",
"data": "com.ibm.xsp.exception.EvaluationExceptionEx: Error while executing JavaScript computed expression at com.ibm.xsp.binding.javascript.JavaScriptValueBinding.getValue(JavaScriptValueBinding.java:132) at com.ibm.xsp.extlib.component.rest.DominoViewColumn.getValue(DominoViewColumn.java:93) at com.ibm.xsp.extlib.component.rest.DominoViewColumn.evaluate(DominoViewColumn.java:133) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeColumns(JsonViewEntryCollectionContent.java:213) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeEntryAsJson(JsonViewEntryCollectionContent.java:191) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeViewEntryCollection(JsonViewEntryCollectionContent.java:170) at com.ibm.domino.services.rest.das.view.RestViewJsonService.renderServiceJSONGet(RestViewJsonService.java:394) at com.ibm.domino.services.rest.das.view.RestViewJsonService.renderService(RestViewJsonService.java:112) at com.ibm.domino.services.HttpServiceEngine.processRequest(HttpServiceEngine.java:167) at com.ibm.xsp.extlib.component.rest.UIBaseRestService._processAjaxRequest(UIBaseRestService.java:242) at com.ibm.xsp.extlib.component.rest.UIBaseRestService.processAjaxRequest(UIBaseRestService.java:219) at com.ibm.xsp.util.AjaxUtilEx.renderAjaxPartialLifecycle(AjaxUtilEx.java:206) at com.ibm.xsp.webapp.FacesServletEx.renderAjaxPartial(FacesServletEx.java:225) at com.ibm.xsp.webapp.FacesServletEx.serviceView(FacesServletEx.java:170) at com.ibm.xsp.webapp.FacesServlet.service(FacesServlet.java:160) at com.ibm.xsp.webapp.FacesServletEx.service(FacesServletEx.java:138) at com.ibm.xsp.webapp.DesignerFacesServlet.service(DesignerFacesServlet.java:103) at com.ibm.designer.runtime.domino.adapter.ComponentModule.invokeServlet(ComponentModule.java:576) at com.ibm.domino.xsp.module.nsf.NSFComponentModule.invokeServlet(NSFComponentModule.java:1281) at com.ibm.designer.runtime.domino.adapter.ComponentModule$AdapterInvoker.invokeServlet(ComponentModule.java:847) at com.ibm.designer.runtime.domino.adapter.ComponentModule$ServletInvoker.doService(ComponentModule.java:796) at com.ibm.designer.runtime.domino.adapter.ComponentModule.doService(ComponentModule.java:565) at com.ibm.domino.xsp.module.nsf.NSFComponentModule.doService(NSFComponentModule.java:1265) at com.ibm.domino.xsp.module.nsf.NSFService.doServiceInternal(NSFService.java:653) at com.ibm.domino.xsp.module.nsf.NSFService.doService(NSFService.java:476) at com.ibm.designer.runtime.domino.adapter.LCDEnvironment.doService(LCDEnvironment.java:341) at com.ibm.designer.runtime.domino.adapter.LCDEnvironment.service(LCDEnvironment.java:297) at com.ibm.domino.xsp.bridge.http.engine.XspCmdManager.service(XspCmdManager.java:272) Caused by: com.ibm.jscript.InterpretException: Script interpreter error, line=7, col=8: Error while converting from a JSON string at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:785) at com.ibm.jscript.types.FBSObject.call(FBSObject.java:161) at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:219) at com.ibm.jscript.ASTTree.ASTCall.interpret(ASTCall.java:175) at com.ibm.jscript.ASTTree.ASTReturn.interpret(ASTReturn.java:49) at com.ibm.jscript.ASTTree.ASTProgram.interpret(ASTProgram.java:119) at com.ibm.jscript.ASTTree.ASTProgram.interpretEx(ASTProgram.java:139) at com.ibm.jscript.JSExpression._interpretExpression(JSExpression.java:435) at com.ibm.jscript.JSExpression.access$1(JSExpression.java:424) at 
com.ibm.jscript.JSExpression$2.run(JSExpression.java:414) at java.security.AccessController.doPrivileged(AccessController.java:284) at com.ibm.jscript.JSExpression.interpretExpression(JSExpression.java:410) at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:251) at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:234) at com.ibm.xsp.javascript.JavaScriptInterpreter.interpret(JavaScriptInterpreter.java:221) at com.ibm.xsp.javascript.JavaScriptInterpreter.interpret(JavaScriptInterpreter.java:193) at com.ibm.xsp.binding.javascript.JavaScriptValueBinding.getValue(JavaScriptValueBinding.java:78) ... 27 more Caused by: com.ibm.commons.util.io.json.JsonException: Error when parsing JSON string at com.ibm.commons.util.io.json.JsonParser.fromJson(JsonParser.java:61) at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:781) ... 43 more Caused by: com.ibm.commons.util.io.json.parser.ParseException: Encountered " "object "" at line 1, column 2. Was expecting one of: "false" ... "null" ... "true" ... ... ... ... "{" ... "[" ... "]" ... "," ... at com.ibm.commons.util.io.json.parser.Json.generateParseException(Json.java:568) at com.ibm.commons.util.io.json.parser.Json.jj_consume_token(Json.java:503) at com.ibm.commons.util.io.json.parser.Json.arrayLiteral(Json.java:316) at com.ibm.commons.util.io.json.parser.Json.parseJson(Json.java:387) at com.ibm.commons.util.io.json.JsonParser.fromJson(JsonParser.java:59) ... 44 more "
}
Is there any way to get the desired answer?
Well, the best way to achieve the desired result is to use xe:customRestService if you need to return a nested JSON object.
All the other xe:***RestService elements assume that you return a flat JSON construct of parameter/value pairs, where the value is a simple data type (like a boolean, number or string and, oddly enough, arrays) but not a complex data type (like an object).
This means that this result here
...
"surveyResponse": [
{ "participant": "A",
"selection": [ "a1" ]
},
{ "participant": "B",
"selection": [ "b1", "b2" ]
}
]
...
will only be available using xe:customRestService, where you can define the JSON result yourself.
Using the other services, the results are limited to constructions like these:
...
"surveyResponse": true;
...
or
...
"surveyResponse": [
"A",
"B"
]
...
Can't you use the built-in server-side JavaScript function toJson?
You could try intercepting the AJAX call when reading the JSON and then manually de-sanitising the JSON string data.
There are more details here:
http://www.browniesblog.com/A55CBC/blog.nsf/dx/15112012082949PMMBRD68.htm
Personally, I'd recommend against this unless you are absolutely sure the end user can't inject code into the JSON data.