Read CosmosDb items from pyspark (Databricks) with an inconsistent model

Let's say I have two items in CosmosDb:
{
  "Test": {
    "InconsistentA": 10
  },
  "Common": 1
}
{
  "Test": {
    "InconsistentB": 10
  },
  "Common": 2
}
How can I read this data so that I end up with the following schema:
Test: string (the JSON string of the inconsistent part of the model)
Common: int (the consistent part of the model)
I don't know the model in advance, and the Spark Cosmos DB connector (com.microsoft.azure.cosmosdb.spark) only samples the first X items in Cosmos DB to infer the schema.
What I have tried is enforcing the schema this way:
|-- Test: string (nullable = true)
|-- Common: integer (nullable = true)
But the resulting value in the Test column is:
'{ InconsistentA=10 }'
instead of:
'{ "InconsistentA": 10 }'

Related

AWS Glue: Pyspark: Problems while flattening DynamoDB JSON

I need help with reading and converting a JSON element.
From a Glue ETL job, I am using the DynamoDB connector to export data from a cross-account DynamoDB table.
Example Data from DynamoDB:
{
  "header_name": {
    "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78"
  },
  "additional_info": {
    "S": "test additionalIdentifier"
  },
  "biz_group": {
    "S": "64b27e6a-863d-4ee6-9a33-386675ce348a"
  },
  "key_id": {
    "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78"
  },
  "workflow_id": {
    "M": {
      "48281078dd41": {
        "N": "1"
      }
    }
  }
}
There are 2 million unique “workflow_id”s.
I am reading the data using the Glue connector for DynamoDB exports:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.s3.bucket": "s3_bucket",
        "dynamodb.s3.prefix": "prefix-",
        "dynamodb.tableArn": "TABLE_ARN",
        "dynamodb.unnestDDBJson": False,
        "dynamodb.sts.roleArn": "STS_ROLE_ARN",
        "dynamodb.region": "us-west-2",
        "dynamodb.s3.bucketOwner": "ACCOUNT ID BUCKET OWNER",
    },
    transformation_ctx="S3bucket_node1",
)
When the schema is inferred, the values of workflow_id are interpreted as individual columns, so workflow_id gets exploded (pasting only a few below).
root
|-- header_name: string
|-- additional_info: string
|-- biz_group: string
|-- key_id: string
|-- workflow_id: struct
|    |-- 48281078dd41: long
|    |-- 48281078dd42: long
|    |-- ...
I would like to treat the workflow_id element as a string instead of having it interpreted as a struct.
Question: is there a way to convert the workflow_id field to a string, i.e. can the schema be manipulated while reading the data? I tried using a crawler and changing the type to string, but that causes an “internal exception”, and I want to avoid the crawler and keep the number of layers to a minimum.
Use Case:
DynamoDB (cross account)->AWS Glue DDB connectors->dataframe->flatten->Parquet->Redshift Spectrum
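One way around the crawler is to leave the export as-is and collapse the struct back into a JSON string on the Spark side. A minimal sketch, assuming S3bucket_node1 is the DynamicFrame created above and flattened_node is just an illustrative name:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_json

# Convert to a Spark DataFrame and serialize the exploded workflow_id struct
# back into a single JSON string column.
df = S3bucket_node1.toDF()
df = df.withColumn("workflow_id", to_json(col("workflow_id")))

# workflow_id is now typically a string such as '{"48281078dd41":1}'; convert
# back to a DynamicFrame if the rest of the job expects one.
flattened_node = DynamicFrame.fromDF(df, glueContext, "flattened_node")
flattened_node.printSchema()

Note this does not avoid the wide inferred schema, it only collapses it after the fact, so the 2-million-field struct still has to be inferred first.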

How to get String representation of ObjectId from response body?

I am trying to create a simple API which works with MongoDB documents created from this class:
@Document
data class Student(
    @Id val id: ObjectId?,
    val firstName: String,
    val secondName: String
)
And I have a REST controller which returns me Student documents.
{
  "id": {
    "timestamp": 1657005140,
    "date": "2022-07-05T07:12:20.000+00:00"
  },
  "firstName": "Test",
  "secondName": "Test"
}
But I also need a controller which returns documents by id. How can I pass this JSON id, with its timestamp and date, as a request param like /getByName?id= ? Maybe there is a way to get the ID as a single string?
The problem is that my ObjectId was being serialised as an object.
To get a simple single-string value I just needed to add: @JsonSerialize(using = ToStringSerializer::class)
So my document looks like this:
@Document
data class Student(
    @Id
    @JsonSerialize(using = ToStringSerializer::class)
    val id: ObjectId?,
    val firstName: String,
    val secondName: String
)

Export parquet file schema to JSON or CSV

I need to extract the schema of a parquet file into JSON, TXT or CSV format.
It should include the column name and datatype from the parquet file.
For example:
{ "id", "type" : "integer" },
{ "booking_date", "type" : "timestamp", "format" : "%Y-%m-%d %H:%M:%S.%f" }
We can read the schema from the parquet file using .schema, convert it to JSON format, and finally save it as a text file.
input parquet file:
spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)
Extract the schema and write to HDFS/local filesystem:
spark.sparkContext.parallelize(                 # convert the schema string into an RDD
    [spark.read.parquet("/tmp").schema.json()]  # read the schema of the parquet file
).repartition(1) \
 .saveAsTextFile("/tmp_schema/")                # save it to HDFS
Read the output file from hdfs:
$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}

Why are integers (Number) saved as Double in MongoDB

I'm working on a new project and I am trying to figure out why, when Mongoose saves my model, I get a Double instead of an integer.
Ex. {myId: 12345678} becomes {myId: 12345678.0}
My schema contains this:
{
  myId: {
    type: Number
  }
}
Mongoose version: 5.x
Node: 10.x
Any ideas?
The Number schema type is floating point. If you want to store a number as an integer, you can use the mongoose-int32 plug-in:
const mongoose = require('mongoose');
const Int32 = require('mongoose-int32');

const schema = new mongoose.Schema({
  myId: {
    type: Int32
  }
});
If you need 64-bit integer support, use the mongoose-long plug-in.
To avoid the overhead of an additional npm package, you can instead use a getter/setter like this on the Number schema type:
const { Schema } = require('mongoose');

const numberSchema = new Schema({
  integerOnly: {
    type: Number,
    get: v => Math.round(v),
    set: v => Math.round(v),
    alias: 'i'
  }
});
With this in place, CRUD operations on the Number field round the value to an integer, which is stored as Int32 in MongoDB.
More info can be found in the Mongoose v5.8.9 Schema Types documentation.

Performance of sorting by a populated field using Mongoose

I have learned that it is not possible to sort by a populated field in MongoDB during querying. Suppose I have a schema like the one below, with 1 million documents in the collection, and I only need to return 10 records per query, depending on the column being sorted (asc/desc) and the page requested. What are effective solutions to this problem?
Simplified problem:
On the front end I have a data table with the columns firstname, lastname, test.columnA and test.columnB. Each of these columns is sortable by the user.
My initial solution was to query everything in Mongoose, flatten it to JSON, reorder it in JavaScript, and finally respond with only the final 10 records. But this performs increasingly badly as the data set grows.
var testSchema = {
  columnA: { type: String },
  columnB: { type: String },
}
var UserSchema = {
  firstname: { type: String },
  lastname: { type: String },
  test: {
    type: ObjectId,
    ref: 'Test'
  }
}
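One common alternative to sorting in application code (not part of the question above) is to push the join, sort and paging into a MongoDB aggregation pipeline with $lookup, $sort, $skip and $limit, so only the 10 requested documents ever leave the database. The sketch below uses pymongo purely to show the pipeline stages, which are the same ones Mongoose's Model.aggregate() would take; the connection string, database name and collection names are assumptions (Mongoose pluralizes model names, so 'User' and 'Test' are assumed to live in 'users' and 'tests').

from pymongo import MongoClient

# Hypothetical connection and names; replace with your own.
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

page, page_size = 0, 10
pipeline = [
    # Join the referenced Test document onto each user.
    {"$lookup": {"from": "tests", "localField": "test",
                 "foreignField": "_id", "as": "test"}},
    {"$unwind": "$test"},
    # Sort, then page, entirely on the server.
    {"$sort": {"test.columnA": 1}},
    {"$skip": page * page_size},
    {"$limit": page_size},
]
rows = list(db["users"].aggregate(pipeline))

Sorting on a joined field still cannot use an index, so for large collections it may be worth denormalizing test.columnA/test.columnB onto the user documents if those sorts are frequent.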