great_expectations - PySpark - ValueError: Unrecognized spark type: DecimalType(20,0)

I am trying to implement the to_be_of_type expectation mentioned here for DecimalType with precision and scale in PySpark.
However, I am getting the following error while testing it:
{
  "success": False,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "meta": {},
    "kwargs": {
      "column": "project_id",
      "type_": "DecimalType(20,0)",
      "result_format": {
        "result_format": "SUMMARY"
      }
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": True,
    "exception_message": "ValueError: Unrecognized spark type: DecimalType(20,0)",
    "exception_traceback": "Traceback (most recent call last):\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/dataset/sparkdf_dataset.py", line 1196, in expect_column_values_to_be_of_type\n success = issubclass(col_type, getattr(sparktypes, type_))\nAttributeError: module \"pyspark.sql.types\" has no attribute \"DecimalType(20,0)\"\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/data_asset/data_asset.py", line 275, in wrapper\n return_obj = func(self, **evaluation_args)\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/dataset/sparkdf_dataset.py", line 1201, in expect_column_values_to_be_of_type\n raise ValueError(f"Unrecognized spark type: {type_}")\nValueError: Unrecognized spark type: DecimalType(20,0)\n"
  },
  "result": {}
}
Is it possible to validate DecimalType with specific precision and scale values?
I am using GE version 0.14.12 and PySpark version 2.4.3.
Let me know if you need any further information.
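For reference, the traceback shows that type_ is resolved with getattr(pyspark.sql.types, type_), so only a bare class name such as "DecimalType" is recognized; the string "DecimalType(20,0)" is looked up as an attribute name and fails. Below is a minimal sketch of what does run without the error, assuming the legacy SparkDFDataset API of GE 0.14.x; note that, judging from the issubclass check in the traceback, this only validates the type itself and does not check the (20,0) precision and scale.

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# hypothetical one-column DataFrame standing in for the real data
df = spark.sql("SELECT CAST(1 AS DECIMAL(20, 0)) AS project_id")

gdf = SparkDFDataset(df)
# pass the bare class name; GE looks it up on pyspark.sql.types via getattr
result = gdf.expect_column_values_to_be_of_type("project_id", "DecimalType")
print(result)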

Related

Is creating a DataFrame from an Arrow Map Type supported?

I am trying to create a DataFrame from an Arrow array which contains a simple map using pyarrow-8.0.0 and polars-0.13.50. I'm running into an error, and I was wondering if there is support for this:
import pyarrow as pa
import polars as pl

if __name__ == "__main__":
    t = pa.map_("string", "string")
    arr = pa.array([[("one", "uno"), ("two", "dos")]], type=t)
    tbl = pa.table([arr], names=["trans"])
    df = pl.DataFrame(tbl)
    print(df.head())
Getting this error:
Traceback (most recent call last):
  File "/home/pdutta/vscode/sparkless/experiments/maptype.py", line 9, in <module>
    df = pl.DataFrame(tbl)
  File "/home/pdutta/.pyenv/versions/sparkless/lib/python3.10/site-packages/polars/internals/frame.py", line 308, in __init__
    self._df = arrow_to_pydf(data, columns=columns)
  File "/home/pdutta/.pyenv/versions/sparkless/lib/python3.10/site-packages/polars/internals/construction.py", line 615, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
ValueError: Cannot create polars series from Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, is_nullable: false, metadata: {} }, Field { name: "value", data_type: Utf8, is_nullable: true, metadata: {} }]), is_nullable: false, metadata: {} }, false) type
Is this supported?
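One possible workaround (a sketch, not a confirmation of Map support): pyarrow can hand the map entries back as plain Python (key, value) tuples via to_pylist(), and those can be loaded into polars as ordinary columns, sidestepping the Arrow Map type entirely.

import pyarrow as pa
import polars as pl

t = pa.map_("string", "string")
arr = pa.array([[("one", "uno"), ("two", "dos")]], type=t)

# each element of rows is a list of (key, value) tuples for one map
rows = arr.to_pylist()
df = pl.DataFrame({
    "key": [k for row in rows for k, _ in row],
    "value": [v for row in rows for _, v in row],
})
print(df)

This flattens the maps, so any grouping per original row would have to be tracked in an extra column.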

How can I correct this error with AWS CloudFormation template

Team, I'm stressing out because I cannot find the error in the following JSON template I'm trying to run in AWS CloudFormation; I'm receiving the following error:
(Cannot render the template because of an error.: YAMLException: end of the stream or a document separator is expected at line 140, column 65: ... e" content="{"version": "4", "rollouts& ... ^
<meta name="optimizely-datafile" content="{"version": "4", "rollouts": [], "typedAudiences": [], "anonymizeIP": true, "projectId":
Please help!!!
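One thing the error seems to point at is the raw double quotes inside the content="..." attribute on that line: the embedded JSON quotes terminate the attribute (and confuse the YAML parser) unless they are escaped first. A rough Python sketch of producing an escaped attribute value; the datafile dict below is a hypothetical stand-in for the real Optimizely payload.

import html
import json

# hypothetical stand-in for the real Optimizely datafile payload
datafile = {"version": "4", "rollouts": [], "typedAudiences": [], "anonymizeIP": True}

# escape the inner double quotes so they cannot terminate the content attribute
attr_value = html.escape(json.dumps(datafile), quote=True)
meta_tag = '<meta name="optimizely-datafile" content="%s">' % attr_value
print(meta_tag)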

Compose Transporter throws an error when collection_filters is set to sync data for the current day from DocumentDB/MongoDB to file/Elasticsearch

I am using Compose Transporter to sync data from DocumentDB to an Elasticsearch instance in AWS. After a one-time sync, I added the following collection_filters in pipeline.js to sync incremental data daily:
// pipeline.js
var source = mongodb({
  "uri": "mongodb <URI>",
  "ssl": true,
  "collection_filters": '{ "mycollection": { "createdDate": { "$gt": new Date(Date.now() - 24*60*60*1000) } }}',
})

var sink = file({
  "uri": "file://mongo_dump.json"
})

t.Source("source", source, "^mycollection$").Save("sink", sink, "/.*/")
I get the following error:
$ transporter run pipeline.js
panic: malformed collection_filters [recovered]
panic: Panic at 32: malformed collection_filters [recovered]
panic: Panic at 32: malformed collection_filters
goroutine 1 [running]:
github.com/compose/transporter/vendor/github.com/dop251/goja.(*Runtime).RunProgram.func1(0xc420101d98)
/Users/JP/gocode/src/github.com/compose/transporter/vendor/github.com/dop251/goja/runtime.go:779 +0x98
When I change collection_filters so that the value of the "$gt" key is a single string token (see below), the malformed error vanishes, but it doesn't fetch any documents:
'{ "mycollection": { "createdDate": { "$gt": "new Date(Date.now() - 24*60*60 * 1000)" } }}',
To check whether something is fundamentally wrong with the way I am querying, I tried a simple string filter, and that works well:
"collection_filters": '{ "articles": { "createdBy": "author name" }}',
I tried various ways to pass the createdDate filter, but I either get the malformed error or no data. However, the same filter in the mongo shell gives me the expected output. Note that I tried both Elasticsearch and file as the sink before asking here.
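For reference, this is the equivalent incremental query expressed outside Transporter, as a minimal sketch assuming pymongo and the same placeholder connection string; the date predicate itself is an ordinary $gt comparison against a real datetime rather than a string.

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("<URI>")  # placeholder: the same elided connection string as in pipeline.js
coll = client["mydb"]["mycollection"]  # "mydb" is a hypothetical database name

# documents created in the last 24 hours, the same window as the filter above
cutoff = datetime.utcnow() - timedelta(hours=24)
for doc in coll.find({"createdDate": {"$gt": cutoff}}):
    print(doc["_id"])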

AvroTypeException: Not an enum: MOBILE on DataFileWriter

I am getting the error message below when I try to write Avro records using the built-in AvroKeyValueSinkWriter in Flink 1.3.2 and Avro 1.8.2.
My schema looks like this:
{"namespace": "com.base.avro",
"type": "record",
"name": "Customer",
"doc": "v6",
"fields": [
{"name": "CustomerID", "type": "string"},
{"name": "platformAgent", "type": {
"type": "enum",
"name": "PlatformAgent",
"symbols": ["WEB", "MOBILE", "UNKNOWN"]
}, "default":"UNKNOWN"}
]
}
And I am calling the following Flink code to write data:
var properties = new util.HashMap[String, String]()
val stringSchema = Schema.create(Type.STRING)
val myTypeSchema = Customer.getClassSchema
val keySchema = stringSchema.toString
val valueSchema = myTypeSchema.toString
val compress = true
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
val sink = new BucketingSink[org.apache.flink.api.java.tuple.Tuple2[String, Customer]]("s3://test/flink")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd/HH/mm/"))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB,
sink.setPendingSuffix(".avro")
val writer = new AvroKeyValueSinkWriter[String, Customer](properties)
sink.setWriter(writer.duplicate())
However, it throws the following error:
Caused by: org.apache.avro.AvroTypeException: Not an enum: MOBILE
    at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:177)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:119)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
    ... 10 more
Please suggest!
UPDATE 1:
It turns out this is a bug in Avro 1.8+, tracked in this ticket: https://issues-test.apache.org/jira/browse/AVRO-1810
I had to override the Avro version Flink uses in my build:
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3"
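For completeness, here is a quick sanity check that the schema and the MOBILE enum value serialize fine on their own outside the Flink/Avro 1.8 code path, written as a sketch assuming the fastavro Python package is available.

import io
import fastavro

schema = {
    "namespace": "com.base.avro",
    "type": "record",
    "name": "Customer",
    "doc": "v6",
    "fields": [
        {"name": "CustomerID", "type": "string"},
        {"name": "platformAgent", "type": {
            "type": "enum",
            "name": "PlatformAgent",
            "symbols": ["WEB", "MOBILE", "UNKNOWN"]
        }, "default": "UNKNOWN"},
    ],
}

buf = io.BytesIO()
# writes one record with platformAgent = MOBILE without raising
fastavro.writer(buf, fastavro.parse_schema(schema), [{"CustomerID": "c-1", "platformAgent": "MOBILE"}])
print(len(buf.getvalue()), "bytes written")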

ARM deployment resourceID with servicebus not working

I am getting the error below when trying to deploy my Azure resource package. I would love to find a way around the "'resourceId': function requires exactly one multi-segmented argument" message.
[ERROR] New-AzureRmResourceGroupDeployment : 2:17:06 PM - Error: Code=InvalidTemplate;
14:17:06 - [ERROR] Message=Deployment template validation failed: 'The template resource
14:17:06 - [ERROR] 'xxxxx/basket-item-changed-topic/basket-telemetry-processor' at line
14:17:06 - [ERROR] '2799' and column '10' is not valid: Unable to evaluate template language
14:17:06 - [ERROR] function 'resourceId': function requires exactly one multi-segmented argument
14:17:06 - [ERROR] which must be resource type including resource provider namespace. Current
14:17:06 - [ERROR] function arguments 'Microsoft.ServiceBus/namespaces/topics,xxxxxx/bask
14:17:06 - [ERROR] et-item-changed-topic'. Please see
This is the template:
{ "comments": "Generalized from resource:
'/subscriptions/fa17ed69-d83f-47bc-8604-fd96cd27d322/resourcegroups/xxxxxxx-Integration-Environment/providers/Microsoft.ServiceBus/namespaces/xxxxx/topics/basket-item-changed-topic/subscriptions/basket-telemetry-processor'.",
"type": "Microsoft.ServiceBus/namespaces/topics/subscriptions",
"name":
"[parameters('subscriptions_basket_telemetry_processor_name')]",
"apiVersion": "2015-08-01", "location": "UK West",
"scale": null, "properties": {
"lockDuration": "00:02:00", "requiresSession": false,
"defaultMessageTimeToLive": "10675199.02:48:05.4775807",
"deadLetteringOnMessageExpiration": true,
"deadLetteringOnFilterEvaluationExceptions": true,
"messageCount": 0, "maxDeliveryCount": 1,
"enableBatchedOperations": true, "status": "Active",
"createdAt": "2017-05-10T14:31:54.2059078Z",
"updatedAt": "2017-05-10T14:31:56.6330818Z",
"accessedAt": "2017-06-23T10:53:20.3815084Z",
"countDetails": { "activeMessageCount": 0,
"deadLetterMessageCount": 0,
"scheduledMessageCount": 0, "transferMessageCount":
0, "transferDeadLetterMessageCount": 0
}, "autoDeleteOnIdle": "10675199.02:48:05.4775807",
"entityAvailabilityStatus": "Available" },
"dependsOn": [
"[resourceId('Microsoft.ServiceBus/namespaces',
parameters('namespaces_xxx_int_name'))]",
"[resourceId('Microsoft.ServiceBus/namespaces/topics',
parameters('topics_basket_item_changed_topic_name'))]" ]
},
You can just use the name of the resource if it's deployed in the same template:
dependsOn: [
  "[parameters('namespaces_xxx_int_name')]",
  "[parameters('topics_basket_item_changed_topic_name')]"
]
Correction: you cannot use dependsOn with resources deployed or existing outside of the template, so my original remark doesn't really make sense.
That was a naming issue with parameters('namespaces_xxx_int_name').
Steps to solve:
1. Remove the parameters from the parameter file (use the default parameters).
2. Add your functions.
3. Find all the related places that cause the error.