OrientDB ETL importing date field

I have imported a large table containing demographic information, and my ETL JSON looks like this:
{
  "config": {
    "log": "debug"
  },
  "extractor": {
    "jdbc": {
      "driver": "com.mysql.jdbc.Driver",
      "url": "jdbc:mysql://localhost/avicenna",
      "userName": "xxxxxx",
      "userPassword": "xxxxxxxxxxxx",
      "query": "select * from patients",
      "fetchSize": 1000
    }
  },
  "transformers": [
    { "vertex": { "class": "patients" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:c:/tools/orientdb-community-2.0.5/databases/Avicenna",
      "dbType": "graph",
      "dbAutoCreate": true
    }
  }
}
My patients class in OrientDB is defined as follows
-------------------------------+-------------+
NAME | TYPE |
-------------------------------+-------------+
PatientID | INTEGER |
MaritalStatus | STRING |
DOB | DATE |
Sex | STRING |
-------------------------------+-------------+
Although the DOB field in the MySQL patients table is declared as Date, all imported data nevertheless displays as a full DateTime:
orientdb {db=avicenna}> select from patients limit 3
----+-----+--------+---------+-------------------+-------------+----+------------
# |#RID |#CLASS |PatientID|DOB |MaritalStatus|Sex |out_admitted
----+-----+--------+---------+-------------------+-------------+----+------------
0 |#18:0|patients|1022 |1996-02-29 00:00:00|Single |M |[size=5]
1 |#18:1|patients|1033 |1996-02-02 00:00:00|Single |M |[size=1]
2 |#18:2|patients|1089 |1995-07-21 00:00:00|Single |F |[size=1]
----+-----+--------+---------+-------------------+-------------+----+------------
Is there something I am doing wrong with the import script? And how can I clean up the dates in OrientDB?

I believe there's nothing wrong with your script; that's just the way dates are stored. You can see the difference between DATE and DATETIME properties yourself:
create property V.someDate date
create property V.someDateTime datetime
insert into V set someDate = sysdate(), someDateTime = sysdate()
select from V
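If the goal is just a cleaner display of the date portion, one option (a sketch, assuming the format() method available on date fields in OrientDB 2.x SQL) is to format DOB in the projection:
select PatientID, DOB.format('yyyy-MM-dd') as DOB from patients limit 3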

Related

Flatten nested json with array into single line dataframe in Apache Spark Scala

I am trying to flatten the JSON below into a single-row dataframe. I've seen plenty of articles showing how to flatten a complex/nested JSON object with arrays into multiple rows. However, I don't want to flatten the JSON into multiple rows; I just want a single-row dataframe as shown in the output, with the array indices converted into column names. How can I accomplish this in Apache Spark Scala?
JSON
{
  "name": "John",
  "age": 30,
  "bike": {
    "name": "Bajaj", "models": ["Dominor", "Pulsar"]
  },
  "cars": [
    { "name": "Ford", "models": [ "Fiesta", "Focus", "Mustang" ] },
    { "name": "BMW", "models": [ "320", "X3", "X5" ] },
    { "name": "Fiat", "models": [ "500", "Panda" ] }
  ]
}
OUTPUT
name | age | bike_name | bike_models_0 | bike_models_1 | cars_0_name | cars_0_models_0 | ... | cars_1_name | cars_1_models_0 | ...
John 30 Bajaj Dominor Pulsar Ford Fiesta BMW 320
You can use colname[index] to access values from an array and parentcol.childcol to access a nested column.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val json =
"""
|{
| "name":"John",
| "age":30,
| "bike":{
| "name":"Bajaj", "models":["Dominor", "Pulsar"]
| },
| "cars": [
| { "name":"Ford", "models":[ "Fiesta", "Focus", "Mustang" ] },
| { "name":"BMW", "models":[ "320", "X3", "X5" ] },
| { "name":"Fiat", "models":[ "500", "Panda" ] }
| ]
|}
|
|""".stripMargin
val df = spark.read.option("multiline", true).json(Seq(json).toDS())
df.selectExpr("name", "age",
"bike.name bike_name", "bike.models[0] bike_models_0", "bike.models[1] bike_models_1",
"cars[0].name cars_0_name", "cars[0].models[0] cars_0_models_0", "cars[0].models[1] cars_0_models_1", "cars[0].models[2] cars_0_models_2",
"cars[1].name cars_1_name", "cars[1].models[0] cars_1_models_0", "cars[1].models[1] cars_1_models_1", "cars[1].models[2] cars_1_models_2",
"cars[2].name cars_2_name", "cars[2].models[0] cars_2_models_0", "cars[2].models[1] cars_2_models_1"
).show(false)
/*
+----+---+---------+-------------+-------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+
|name|age|bike_name|bike_models_0|bike_models_1|cars_0_name|cars_0_models_0|cars_0_models_1|cars_0_models_2|cars_1_name|cars_1_models_0|cars_1_models_1|cars_1_models_2|cars_2_name|cars_2_models_0|cars_2_models_1|
+----+---+---------+-------------+-------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+
|John|30 |Bajaj |Dominor |Pulsar |Ford |Fiesta |Focus |Mustang |BMW |320 |X3 |X5 |Fiat |500 |Panda |
+----+---+---------+-------------+-------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+---------------+-----------+---------------+---------------+*/
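If you prefer the Column API over SQL expression strings, the same array/field access can be written with getItem and getField. A partial sketch (the remaining columns follow the same pattern):
import org.apache.spark.sql.functions.col
df.select(
  col("name"), col("age"),
  col("bike.name").as("bike_name"),                                  // nested struct field
  col("bike.models").getItem(0).as("bike_models_0"),                 // array element by index
  col("cars").getItem(0).getField("name").as("cars_0_name"),         // element of an array of structs
  col("cars").getItem(0).getField("models").getItem(0).as("cars_0_models_0")
).show(false)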

how to change the df column name in struct with column value

df.withColumn("storeInfo", struct($"store", struct($"inhand", $"storeQuantity")))
.groupBy("sku").agg(collect_list("storeInfo").as("info"))
.show(false)
+---+---------------------------------------------------+
|sku|info |
+---+---------------------------------------------------+
|1 |[{2222, {3, 34}}, {3333, {5, 45}}] |
|2 |[{4444, {5, 56}}, {5555, {6, 67}}, {6666, {7, 67}}]|
+---+---------------------------------------------------+
When I send it to Couchbase, it looks like this:
{
  "SKU": "1",
  "info": [
    {
      "col2": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "col2": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Can we rename col2 to the value of Store? I want it to look something like the example below, so that the key of each struct is the value of the Store field.
{
  "SKU": "1",
  "info": [
    {
      "2222": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "3333": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Simply put, we can't construct a column exactly as you want. There are two limitations:
The field names of a struct type must be fixed. We can change 'col2' to another name (e.g. 'fixedFieldName' in demo 1), but it can't be dynamic (much like a Java class field name).
The keys of a map type can be dynamic, but all map values must have the same type; see the exception in demo 2.
You may have to change the schema instead; see the outputs of demos 1 and 3.
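For reference, the demos below assume a flat input DataFrame roughly like the following (a sketch reconstructed from the collect_list output in the question; the column names and integer types are assumptions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, map, lit, collect_list}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// assumed sample data matching the grouped output shown in the question
val df = Seq(
  (1, 2222, 3, 34), (1, 3333, 5, 45),
  (2, 4444, 5, 56), (2, 5555, 6, 67), (2, 6666, 7, 67)
).toDF("sku", "store", "inhand", "storeQuantity")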
demo 1
df.withColumn(
"storeInfo", struct($"store", struct($"inhand", $"storeQuantity").as("fixedFieldName"))).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
// output:
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"store":2222,"fixedFieldName":{"inhand":3,"storeQuantity":34}},{"store":3333,"fixedFieldName":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"store":4444,"fixedFieldName":{"inhand":5,"storeQuantity":56}},{"store":5555,"fixedFieldName":{"inhand":6,"storeQuantity":67}},{"store":6666,"fixedFieldName":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
demo 2
df.withColumn(
"storeInfo",
map($"store", struct($"inhand", $"storeQuantity"), lit("Store"), $"store")).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
// output exception:
// The given values of function map should all be the same type, but they are [struct<inhand:int,storeQuantity:int>, int]
demo 3
df.withColumn(
"storeInfo",
map($"store", struct($"inhand", $"storeQuantity"))).
groupBy("sku").agg(collect_list("storeInfo").as("info")).
toJSON.show(false)
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"2222":{"inhand":3,"storeQuantity":34}},{"3333":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"4444":{"inhand":5,"storeQuantity":56}},{"5555":{"inhand":6,"storeQuantity":67}},{"6666":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------+

How to delete subsequent matching rows?

I have a jsonb column where each row has a name and a last_updated key, among others. How would I go about writing a query that leaves only 1 row per name per day?
i.e. this:
id | data
1 | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
2 | {"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}
3 | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4 | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}
5 | {"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}
6 | {"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}
becomes:
id | data
1 | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
3 | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4 | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}
This query will run on some 9 million rows, on roughly 300 names.
Try something like this:
Table
create table test (
id serial,
data jsonb
);
Data
insert into test (data) values
('{"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}');
Query
with latest as (
select data->>'name' as name, max(data->>'last_updated') as last_updated
from test
group by data->>'name'
)
delete from test t
where not exists (
select 1 from latest
where t.data->>'name' = name
and t.data->>'last_updated' = last_updated
);
select * from test;
Example
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=2415e6f2c9c7980e69d178a331120dcd
You might have to index your jsonb column, e.g. create index on test((data->>'name')); you could do the same for last_updated.
I'm assuming that a name doesn't have two identical last_updated values.
If that assumption is not true, you could try this:
with ranking as (
select
row_number() over (partition by data->>'name' order by data->>'last_updated' desc) as sr,
x.*
from test x
)
delete from test t
where not exists (
select 1 from ranking
where sr = 1
and id = t.id
);
In this case, we first give a row number to each name's records; each name's latest last_updated gets sr = 1.
Then we ask the database to delete all records whose id does not belong to an sr = 1 row.
Example: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dba1879a755ed0ec90580352f82554ee
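Note that both queries above keep a single row per name overall. If you need one surviving row per name per day, as in the expected output, a sketch of a variant (same ranking idea, additionally partitioned by the calendar day of last_updated):
with ranking as (
  select
    id,
    row_number() over (
      partition by data->>'name', (data->>'last_updated')::timestamptz::date
      order by data->>'last_updated' desc
    ) as sr
  from test
)
delete from test t
using ranking r
where r.id = t.id
  and r.sr > 1;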

MongoDB sort by field A if field B != null otherwise sort by field C

I face this challenge:
Retrieve documents sorted by field A if field B exists/is not null. Otherwise sort by field C.
In a SQL world, I would do two queries and create a UNION SELECT, but I have no idea how to start with Mongo.
Is map/reduce the correct way to go? Or should I focus on a computed field and use that? I am relatively new to MongoDB and am asking for directions.
Edit: As requested, here is some sample data:
Given:
| ID | FieldA | FieldB | FieldC |
|------------|--------|--------|--------|
| Document 1 | 10 | X | 40 |
| Document 2 | 20 | <null> | 50 |
| Document 3 | 30 | Z | 60 |
Expected result (in this order), with the computed value shown as an extra column:
| ID | FieldA | FieldB | FieldC | "A" if "B" !=<null> else "C" |
|------------|--------|--------|--------|------------------------------|
| Document 1 | 10 | X | 40 | 10 |
| Document 3 | 30 | Z | 60 | 30 |
| Document 2 | 20 | <null> | 50 | 50 |
Thank you,
schube
Given the following documents:
{ "a": 10, "b": "X", "c" : 40 }
{ "a": 20, "b": null, "c" : 50 }
{ "a": 30, "b": "Z", "c" : 60 }
One way of doing this would be like so:
db.collection.aggregate({
$addFields: {
"sortField": { // create a new field called "sortField"
$cond: { // and assign a value that depends on
if: { $ne: [ "$b", null ] }, // whether "b" is not null
then: "$a", // in which case our field shall hold the value of "a"
else: "$c" // or else it shall hold the value of "c"
}
}
}
}, {
$sort: {
"sortField": 1 // sort by our computed field
}
}, {
$project: {
"sortField": 0 // remove "sort" field if needed
}
})
If you had a document without a b field as in:
{ "a": 20, "c" : 50 }
then you'd also need a technique that checks whether the field exists at all.
So your if part inside the $cond could e.g. look like this:
if: { $ne: [ "$b", undefined ] }, // whether "b" is null or doesn't exist at all
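One such technique (a sketch): wrap the field in $ifNull so that a missing "b" and a null "b" are handled the same way:
if: { $ne: [ { "$ifNull": [ "$b", null ] }, null ] }, // true only when "b" exists and is not null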

Loading contents of json array in redshift

I'm setting up Redshift and importing data from Mongo. I have succeeded in using a jsonpaths file for a simple document, but I now need to import from a document containing an array.
{
  "id": 123,
  "things": [
    {
      "foo": 321,
      "bar": 654
    },
    {
      "foo": 987,
      "bar": 567
    }
  ]
}
How do I load the above into a table like so:
select * from things;
id | foo | bar
--------+------+-------
123 | 321 | 654
123 | 987 | 567
Or is there some other way? I can't just store the JSON array in a varchar(max) column, as the contents of things can exceed 64K.
Given
db.baz.insert({
  "myid": 123,
  "things": [
    {
      "foo": 321,
      "bar": 654
    },
    {
      "foo": 987,
      "bar": 567
    }
  ]
});
The following will display the fields you want:
db.baz.find({},{"things.foo":1,"things.bar":1} )
To flatten the result set, use an aggregation like so:
db.baz.aggregate([
    // one output document per element of the "things" array
    { "$unwind": "$things" },
    // lift the nested fields to the top level
    { "$project": { "_id": 0, "id": "$myid", "foo": "$things.foo", "bar": "$things.bar" } }
]);
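Once flattened, each document is a single flat JSON object (e.g. {"id":123,"foo":321,"bar":654}) that a jsonpaths file can map directly. A hedged sketch of the Redshift side, assuming the flattened output is exported as newline-delimited JSON to S3 (the bucket, prefix, and IAM role below are placeholders):
-- map each top-level key to a column of the things table;
-- contents of s3://my-bucket/things_jsonpaths.json:
-- {"jsonpaths": ["$['id']", "$['foo']", "$['bar']"]}
copy things
from 's3://my-bucket/things/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
format as json 's3://my-bucket/things_jsonpaths.json';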