great_expectations - PySpark - ValueError: Unrecognized spark type: DecimalType(20,0)

I am trying to implement the to_be_of_type expectation mentioned here for DecimalType with precision and scale in PySpark.
However, I am getting the following error while testing it:
{
  "success": False,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "meta": {},
    "kwargs": {
      "column": "project_id",
      "type_": "DecimalType(20,0)",
      "result_format": {
        "result_format": "SUMMARY"
      }
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": True,
    "exception_message": "ValueError: Unrecognized spark type: DecimalType(20,0)",
    "exception_traceback": "Traceback (most recent call last):\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/dataset/sparkdf_dataset.py", line 1196, in expect_column_values_to_be_of_type\n success = issubclass(col_type, getattr(sparktypes, type_))\nAttributeError: module \"pyspark.sql.types\" has no attribute \"DecimalType(20,0)\"\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/data_asset/data_asset.py", line 275, in wrapper\n return_obj = func(self, **evaluation_args)\n File "/home/spark/.local/lib/python3.7/site-packages/great_expectations/dataset/sparkdf_dataset.py", line 1201, in expect_column_values_to_be_of_type\n raise ValueError(f"Unrecognized spark type: {type_}")\nValueError: Unrecognized spark type: DecimalType(20,0)\n"
  },
  "result": {}
}
Is it possible to validate DecimalType with specific precision and scale values?
I am using GE version 0.14.12 and PySpark version 2.4.3.
Let me know if you need any further information.
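For reference, the traceback shows that type_ is resolved with getattr(pyspark.sql.types, type_), so only a bare class name such as "DecimalType" is recognized; the string "DecimalType(20,0)" is looked up as an attribute name and fails. Below is a minimal sketch of what does run without the error, assuming the legacy SparkDFDataset API of GE 0.14.x; note that, judging from the issubclass check in the traceback, this only validates the type itself and does not check the (20,0) precision and scale.

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# hypothetical one-column DataFrame standing in for the real data
df = spark.sql("SELECT CAST(1 AS DECIMAL(20, 0)) AS project_id")

gdf = SparkDFDataset(df)
# pass the bare class name; GE looks it up on pyspark.sql.types via getattr
result = gdf.expect_column_values_to_be_of_type("project_id", "DecimalType")
print(result)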

Related

Is creating a DataFrame from an Arrow Map Type supported?

I am trying to create a DataFrame from an Arrow array which contains a simple map using pyarrow-8.0.0 and polars-0.13.50. I'm running into an error, and I was wondering if there is support for this:
import pyarrow as pa
import polars as pl

if __name__ == "__main__":
    t = pa.map_("string", "string")
    arr = pa.array([[("one", "uno"), ("two", "dos")]], type=t)
    tbl = pa.table([arr], names=["trans"])
    df = pl.DataFrame(tbl)
    print(df.head())
Getting this error:
Traceback (most recent call last):
  File "/home/pdutta/vscode/sparkless/experiments/maptype.py", line 9, in <module>
    df = pl.DataFrame(tbl)
  File "/home/pdutta/.pyenv/versions/sparkless/lib/python3.10/site-packages/polars/internals/frame.py", line 308, in __init__
    self._df = arrow_to_pydf(data, columns=columns)
  File "/home/pdutta/.pyenv/versions/sparkless/lib/python3.10/site-packages/polars/internals/construction.py", line 615, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
ValueError: Cannot create polars series from Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, is_nullable: false, metadata: {} }, Field { name: "value", data_type: Utf8, is_nullable: true, metadata: {} }]), is_nullable: false, metadata: {} }, false) type
Is this supported?
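One possible workaround (a sketch, not a confirmation of Map support): pyarrow can hand the map entries back as plain Python (key, value) tuples via to_pylist(), and those can be loaded into polars as ordinary columns, sidestepping the Arrow Map type entirely.

import pyarrow as pa
import polars as pl

t = pa.map_("string", "string")
arr = pa.array([[("one", "uno"), ("two", "dos")]], type=t)

# each element of rows is a list of (key, value) tuples for one map
rows = arr.to_pylist()
df = pl.DataFrame({
    "key": [k for row in rows for k, _ in row],
    "value": [v for row in rows for _, v in row],
})
print(df)

This flattens the maps, so any grouping per original row would have to be tracked in an extra column.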

How can I correct this error with AWS CloudFormation template

Team, I'm stressing out because I cannot find the error in the following JSON template I'm trying to run in AWS CloudFormation; I'm receiving the following error:
(Cannot render the template because of an error.: YAMLException: end of the stream or a document separator is expected at line 140, column 65: ... e" content="{"version": "4", "rollouts& ... ^
<meta name="optimizely-datafile" content="{"version": "4", "rollouts": [], "typedAudiences": [], "anonymizeIP": true, "projectId":
Please help!!!
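One thing the error seems to point at is the raw double quotes inside the content="..." attribute on that line: the embedded JSON quotes terminate the attribute (and confuse the YAML parser) unless they are escaped first. A rough Python sketch of producing an escaped attribute value; the datafile dict below is a hypothetical stand-in for the real Optimizely payload.

import html
import json

# hypothetical stand-in for the real Optimizely datafile payload
datafile = {"version": "4", "rollouts": [], "typedAudiences": [], "anonymizeIP": True}

# escape the inner double quotes so they cannot terminate the content attribute
attr_value = html.escape(json.dumps(datafile), quote=True)
meta_tag = '<meta name="optimizely-datafile" content="%s">' % attr_value
print(meta_tag)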

Compose Transporter throws an error when collection_filters is set to sync data for the current day from DocumentDB/MongoDB to file/Elasticsearch

I am using Compose Transporter to sync data from DocumentDB to an Elasticsearch instance in AWS. After a one-time sync, I added the following collection_filters in pipeline.js to sync incremental data daily:
// pipeline.js
var source = mongodb({
  "uri": "mongodb <URI>",
  "ssl": true,
  "collection_filters": '{ "mycollection": { "createdDate": { "$gt": new Date(Date.now() - 24*60*60*1000) } }}',
})

var sink = file({
  "uri": "file://mongo_dump.json"
})

t.Source("source", source, "^mycollection$").Save("sink", sink, "/.*/")
I get the following error:
$ transporter run pipeline.js
panic: malformed collection_filters [recovered]
panic: Panic at 32: malformed collection_filters [recovered]
panic: Panic at 32: malformed collection_filters
goroutine 1 [running]:
github.com/compose/transporter/vendor/github.com/dop251/goja.(*Runtime).RunProgram.func1(0xc420101d98)
/Users/JP/gocode/src/github.com/compose/transporter/vendor/github.com/dop251/goja/runtime.go:779 +0x98
When I change collection_filters so that the value of the "$gt" key is a single string token (see below), the malformed error vanishes, but it doesn't fetch any documents:
'{ "mycollection": { "createdDate": { "$gt": "new Date(Date.now() - 24*60*60 * 1000)" } }}',
To check whether something is fundamentally wrong with the way I am querying, I tried a simple string filter, and that works well:
"collection_filters": '{ "articles": { "createdBy": "author name" }}',
I tried various ways to pass the createdDate filter, but I either get the malformed error or no data. However, the same filter in the mongo shell gives me the expected output. Note that I tried both Elasticsearch and file as the sink before asking here.
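For reference, this is the equivalent incremental query expressed outside Transporter, as a minimal sketch assuming pymongo and the same placeholder connection string; the date predicate itself is an ordinary $gt comparison against a real datetime rather than a string.

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("<URI>")  # placeholder: the same elided connection string as in pipeline.js
coll = client["mydb"]["mycollection"]  # "mydb" is a hypothetical database name

# documents created in the last 24 hours, the same window as the filter above
cutoff = datetime.utcnow() - timedelta(hours=24)
for doc in coll.find({"createdDate": {"$gt": cutoff}}):
    print(doc["_id"])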

AvroTypeException: Not an enum: MOBILE on DataFileWriter

I am getting the error message below when I try to write Avro records using the built-in AvroKeyValueSinkWriter in Flink 1.3.2 and Avro 1.8.2.
My schema looks like this:
{"namespace": "com.base.avro",
"type": "record",
"name": "Customer",
"doc": "v6",
"fields": [
{"name": "CustomerID", "type": "string"},
{"name": "platformAgent", "type": {
"type": "enum",
"name": "PlatformAgent",
"symbols": ["WEB", "MOBILE", "UNKNOWN"]
}, "default":"UNKNOWN"}
]
}
And I am calling the following Flink code to write data:
var properties = new util.HashMap[String, String]()
val stringSchema = Schema.create(Type.STRING)
val myTypeSchema = Customer.getClassSchema
val keySchema = stringSchema.toString
val valueSchema = myTypeSchema.toString
val compress = true
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
val sink = new BucketingSink[org.apache.flink.api.java.tuple.Tuple2[String, Customer]]("s3://test/flink")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd/HH/mm/"))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB,
sink.setPendingSuffix(".avro")
val writer = new AvroKeyValueSinkWriter[String, Customer](properties)
sink.setWriter(writer.duplicate())
However, it throws the following error:
Caused by: org.apache.avro.AvroTypeException: Not an enum: MOBILE
    at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:177)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:119)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
    ... 10 more
Please suggest!
UPDATE 1:
It turns out this is a bug in Avro 1.8+, tracked in this ticket: https://issues-test.apache.org/jira/browse/AVRO-1810
I had to override the Avro version Flink uses in my build:
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3"
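For completeness, here is a quick sanity check that the schema and the MOBILE enum value serialize fine on their own outside the Flink/Avro 1.8 code path, written as a sketch assuming the fastavro Python package is available.

import io
import fastavro

schema = {
    "namespace": "com.base.avro",
    "type": "record",
    "name": "Customer",
    "doc": "v6",
    "fields": [
        {"name": "CustomerID", "type": "string"},
        {"name": "platformAgent", "type": {
            "type": "enum",
            "name": "PlatformAgent",
            "symbols": ["WEB", "MOBILE", "UNKNOWN"]
        }, "default": "UNKNOWN"},
    ],
}

buf = io.BytesIO()
# writes one record with platformAgent = MOBILE without raising
fastavro.writer(buf, fastavro.parse_schema(schema), [{"CustomerID": "c-1", "platformAgent": "MOBILE"}])
print(len(buf.getvalue()), "bytes written")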

ARM deployment resourceID with servicebus not working

I am getting the error below when trying to deploy my Azure resource package. I would love to find a way around the "'resourceId': function requires exactly one multi-segmented argument" message.
[ERROR] New-AzureRmResourceGroupDeployment : 2:17:06 PM - Error: Code=InvalidTemplate;
14:17:06 - [ERROR] Message=Deployment template validation failed: 'The template resource
14:17:06 - [ERROR] 'xxxxx/basket-item-changed-topic/basket-telemetry-processor' at line
14:17:06 - [ERROR] '2799' and column '10' is not valid: Unable to evaluate template language
14:17:06 - [ERROR] function 'resourceId': function requires exactly one multi-segmented argument
14:17:06 - [ERROR] which must be resource type including resource provider namespace. Current
14:17:06 - [ERROR] function arguments 'Microsoft.ServiceBus/namespaces/topics,xxxxxx/bask
14:17:06 - [ERROR] et-item-changed-topic'. Please see
This is the template:
{ "comments": "Generalized from resource:
'/subscriptions/fa17ed69-d83f-47bc-8604-fd96cd27d322/resourcegroups/xxxxxxx-Integration-Environment/providers/Microsoft.ServiceBus/namespaces/xxxxx/topics/basket-item-changed-topic/subscriptions/basket-telemetry-processor'.",
"type": "Microsoft.ServiceBus/namespaces/topics/subscriptions",
"name":
"[parameters('subscriptions_basket_telemetry_processor_name')]",
"apiVersion": "2015-08-01", "location": "UK West",
"scale": null, "properties": {
"lockDuration": "00:02:00", "requiresSession": false,
"defaultMessageTimeToLive": "10675199.02:48:05.4775807",
"deadLetteringOnMessageExpiration": true,
"deadLetteringOnFilterEvaluationExceptions": true,
"messageCount": 0, "maxDeliveryCount": 1,
"enableBatchedOperations": true, "status": "Active",
"createdAt": "2017-05-10T14:31:54.2059078Z",
"updatedAt": "2017-05-10T14:31:56.6330818Z",
"accessedAt": "2017-06-23T10:53:20.3815084Z",
"countDetails": { "activeMessageCount": 0,
"deadLetterMessageCount": 0,
"scheduledMessageCount": 0, "transferMessageCount":
0, "transferDeadLetterMessageCount": 0
}, "autoDeleteOnIdle": "10675199.02:48:05.4775807",
"entityAvailabilityStatus": "Available" },
"dependsOn": [
"[resourceId('Microsoft.ServiceBus/namespaces',
parameters('namespaces_xxx_int_name'))]",
"[resourceId('Microsoft.ServiceBus/namespaces/topics',
parameters('topics_basket_item_changed_topic_name'))]" ]
},
You can just use the name of the resource if it's deployed in the same template:
dependsOn: [
  "[parameters('namespaces_xxx_int_name')]",
  "[parameters('topics_basket_item_changed_topic_name')]"
]
Correction: you cannot use dependsOn with resources deployed or existing outside of the template, so my original remark doesn't really make sense.
That was a naming issue with parameters('namespaces_xxx_int_name').
Steps to solve:
1. Remove the parameters from the parameter file (use the default parameters).
2. Add your functions.
3. Find all the related places that cause the error.