Malformed date field to populate into a new field in Elasticsearch

I have created an index in Elasticsearch with multiple date fields, formatted as yyyy-mm-dd HH:mm:ss. Eventually I found that the date was malformed and wrong data was being populated into the fields. The index has more than 600,000 records and I don't want to lose any data. Now I need to create another field, or a new index with the same date fields, with the format YYYY-MM-ddTHH:mm:ss.Z, and populate all the records into the new index or new fields.
I have used the date processor pipeline shown below, but it fails. Please correct me if anything is wrong here.
PUT _ingest/pipeline/date-malform
{
  "description": "convert malformed date to timestamp",
  "processors": [
    {
      "date": {
        "field": "event_tm",
        "target_field": "event_tm",
        "formats": ["YYYY-MM-ddThh:mm:ss.Z"],
        "timezone": "UTC"
      }
    },
    {
      "date": {
        "field": "vendor_start_dt",
        "target_field": "vendor_start_dt",
        "formats": ["YYYY-MM-ddThh:mm:ss.Z"],
        "timezone": "UTC"
      }
    },
    {
      "date": {
        "field": "vendor_end_dt",
        "target_field": "vendor_end_dt",
        "formats": ["YYYY-MM-ddThh:mm:ss.Z"],
        "timezone": "UTC"
      }
    }
  ]
}
I have created the pipeline and used reindex as below
POST _reindex
{
  "source": {
    "index": "tog_gen_test"
  },
  "dest": {
    "index": "data_mv",
    "pipeline": "some_ingest_pipeline",
    "version_type": "external"
  }
}
I am getting the below error while running the reindex:
"failures": [
  {
    "index": "data_mv",
    "type": "_doc",
    "id": "rwN64WgB936y_JOyjc57",
    "cause": {
      "type": "exception",
      "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: unable to parse date [2019-02-12 10:29:35]",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "java.lang.IllegalArgumentException: unable to parse date [2019-02-12 10:29:35]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "unable to parse date [2019-02-12 10:29:35]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Illegal pattern component: T"
          }
        }
      }
    }
  }
]

You can either use Logstash, as Shailesh Pratapwar suggested, or you can use Elasticsearch reindex + ingest to do the same:
Create an ingest pipeline with the proper date processor in order to fix the date format/manipulation: https://www.elastic.co/guide/en/elasticsearch/reference/master/date-processor.html
Reindex the data from the old index to a new index, applying the date manipulation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
Reindex can also use the Ingest Node feature by specifying a pipeline.
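For example, the reindex failure above shows that the stored values look like 2019-02-12 10:29:35, so a pipeline roughly like the following should be able to parse them and let the date processor re-emit them as ISO 8601 (this is only a sketch: the format string is inferred from the error message, and the same processor would be repeated for vendor_start_dt and vendor_end_dt):
PUT _ingest/pipeline/date-malform
{
  "description": "parse yyyy-MM-dd HH:mm:ss values and emit ISO 8601 timestamps",
  "processors": [
    {
      "date": {
        "field": "event_tm",
        "target_field": "event_tm",
        "formats": ["yyyy-MM-dd HH:mm:ss"],
        "timezone": "UTC"
      }
    }
  ]
}
The reindex request then needs to reference this pipeline by its actual name (date-malform) in the dest block.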

Use Logstash:
Read from Elasticsearch using Logstash.
Manipulate the date format.
Write to Elasticsearch using Logstash.

Related

How to update date field to a specific date format value in mongoDB compass?

My MongoDB Compass document looks like this:
{
  "_id": "123456789",
  "code": "x123",
  "title": "cool",
  "createdDate": 2022-07-06T08:04:52.156+00:00,
  "expiryDate": 2023-12-31T00:00:00.000+00:00
}
I tried to create a MongoDB script in my pipeline to update the "expiryDate" field to a specific date value, "9999-12-31T00:00:00.000Z", in the same format. My update script looks like this (I am trying to run it via my pipeline); the createdDate field takes the current date correctly.
{
  "command": "updateOne",
  "datasource_name": "Austrailia",
  "collection_name": "Sydney",
  "query": {
    "code": "x123"
  },
  "options": {
    "upsert": true
  },
  "update": {
    "$currentDate": {
      "createdDate": {
        "$type": "date"
      }
    },
    "$set": {
      "title": "hot",
      "expiryDate": {
        "$date": "9999-12-31T00:00:00.000"
      }
    }
  }
}
The script is failing with these errors:
{MongoError: Unknown modifier: expiryDate. Expected a valid modifier}
The dollar ($) prefixed field '$date' in 'expiryDate.$date' is not valid for storage.
What would be the correct query syntax to update the date field "expiryDate" to the value specified above, in the same format?

Error uploading documents that have null values for longitude and latitude (geo_point) in Elastic 7.15

I am using the "Upload File" option to upload a CSV file that has some null values for longitude and latitude in Elastic 7.15. The mappings and ingest pipeline are below.
Mapping...
"Latitude": {
  "type": "double"
},
"Longitude": {
  "type": "double"
},
"UniqueID": {
  "type": "keyword"
},
"Unit Number": {
  "type": "long"
},
"User ID": {
  "type": "long"
},
"location": {
  "type": "geo_point"
}
...
Ingest pipeline
...
{
  "set": {
    "field": "location",
    "value": "{{Latitude}},{{Longitude}}"
  }
}
...
The location field is auto-added (combined fields).
When I import the CSV with these settings, I get an error that the documents with empty coordinates could not be imported:
Error: 8: failed to parse field [location] of type [geo_point]
{"message":"1143266,1/4/2021,E BRECKENRIDGE 1000.0 FT E OF BERMUDA,,,1,"2,186,198""}
I would like to be able to import documents that have null values for the coordinates while keeping the type as geo_point, since I am creating a map visualization. If I remove the Set on location, or add the condition "if": "ctx.latitude_field != null && ctx.longitude_field != null" to the Set, I can upload all the docs, but then the map visualization does not show any documents for the location field.
I was able to bypass this issue by adding a new field Location (concat(lat,long)) in the CSV, updating the mapping, and removing the Set.
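For reference, the set processor supports a conditional, and a sketch like the one below (assuming the Latitude/Longitude field names from the mapping above) only builds the geo_point when both coordinates are present, so documents without coordinates are imported with no location field at all, which a geo_point mapping accepts:
{
  "set": {
    "if": "ctx.Latitude != null && ctx.Longitude != null",
    "field": "location",
    "value": "{{Latitude}},{{Longitude}}"
  }
}
Documents that do have coordinates should still show up in the map visualization, since their location field is populated as before.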

Getting an error while loading data with DMS from MongoDB to Elasticsearch, any ideas?

I am trying to use AWS DMS to transfer data from MongoDB to Amazon Elasticsearch.
I am encountering the following log in CloudWatch:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters."
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters."
  },
  "status": 400
}
This is my configuration for the MongoDB source; it has the "_id as a separate column" checkbox enabled.
I tried disabling it, but then it says there is no primary key.
Is there anything you know of that can fix this?
Quick note: I added a mapping of the _id field to old_id, and now it doesn't import any of the other fields, even when I add them in the mapping.
Since Elasticsearch does not support the LOB data type, the other fields are not migrated.
Add an additional transformation rule to change the data type to string:
{
  "rule-type": "transformation",
  "rule-id": "3",
  "rule-name": "3",
  "rule-action": "change-data-type",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "test",
    "table-name": "%",
    "column-name": "%"
  },
  "data-type": {
    "type": "string",
    "length": "30"
  }
}
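For context, this transformation rule goes into the DMS task's table mappings JSON, inside the rules array next to a selection rule. A minimal sketch (the schema name test comes from the rule above; the rule IDs are arbitrary placeholders) would look like:
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "test",
        "table-name": "%"
      },
      "rule-action": "include"
    },
    ... the transformation rule shown above ...
  ]
}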

Elasticsearch date format parsing error

I am trying to specify a date format for an Elasticsearch field according to ISO 8601 as:
YYYY-MM-DDThh:mm:ssTZD
I put the field mapping like so:
"properties": {
  "startDate": {
    "type": "date",
    "format": "YYYY-MM-DD'T'hh:mm:ss'TZD'"
  }
}
When I try to insert a document with this field's value as: "startDate": "2018-01-10T07:07:07+01:00", I get this error:
"type": "mapper_parsing_exception",
"reason": "failed to parse [afield.startDate]",
"caused_by": {
  "type": "illegal_argument_exception",
  "reason": "Invalid format: \"2018-01-10T07:07:07+01:00\" is malformed at \"+01:00\""
}
Is there something wrong with the date I am inserting? I'm following the example given in this W3C guide (https://www.w3.org/TR/NOTE-datetime):
Complete date plus hours, minutes and seconds:
YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
Elasticsearch datetime formats can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html
You want a datetime with no milliseconds. To do that, use the following:
PUT my_index
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "properties": {
        "startDate": {
          "type": "date",
          "format": "yyyyMMdd'T'HHmmssZ"
        }
      }
    }
  }
}
Timezone should be handled in the timestamp, and can be pulled from a query like:
GET /my_index/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2018-01-10 07:07:07",
        "lte": "now",
        "time_zone": "+01:00"
      }
    }
  }
}
For custom date formats in Elasticsearch you have to check the Joda-time documentation.
In your case you can try with yyyy-MM-dd'T'HH:mm:ssZ:
"properties": {
  "startDate": {
    "type": "date",
    "format": "yyyy-MM-dd'T'HH:mm:ssZ"
  }
}
With this you should be able to insert dates and make searches using the dates like the one you used in your example: 2018-01-10T07:07:07+01:00.
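As a quick sanity check, a sketch along these lines (my_index is just a placeholder, and it uses the typeless mapping syntax of newer Elasticsearch versions) creates the mapping and indexes the example value:
PUT my_index
{
  "mappings": {
    "properties": {
      "startDate": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "startDate": "2018-01-10T07:07:07+01:00"
}
On older versions that still use mapping types, the properties block simply sits under the type name, as in the earlier example.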
For all the people out there that have a date in the format:
2016-08-14T00:00:00+02:00
or:
yyyy-MM-ddTHH:mm:ss+HH:MM
you might consider using this format in Elasticsearch:
format: "yyyy-MM-dd'T'HH:mm:ss+02:00"

Is there a possibility to have another timestamp as dimension in Druid?

Is it possible to have a Druid datasource with 2 (or multiple) timestamps in it?
I know that Druid is a time-based DB and I have no problem with the concept, but I'd like to add another dimension that I can work with as a timestamp.
For example, user retention: the metric is tied to a certain date, but I also need to create cohorts based on users' registration dates, roll those dates up to weeks or months, or filter to only certain time periods.
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Although I'd rather wait for official, full support of timestamp dimensions in Druid, I've found the 'dirty' hack I was looking for.
DataSource Schema
First things first: I wanted to know how many users logged in each day, while being able to aggregate by date/month/year cohorts.
Here's the data schema I used:
"dataSchema": {
  "dataSource": "ds1",
  "parser": {
    "parseSpec": {
      "format": "json",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "user_id",
          "platform",
          "register_time"
        ],
        "dimensionExclusions": [],
        "spatialDimensions": []
      }
    }
  },
  "metricsSpec": [
    { "type": "hyperUnique", "name": "users", "fieldName": "user_id" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "DAY",
    "intervals": ["2015-01-01/2017-01-01"]
  }
},
so the sample data should look something like this (each record is a login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
As you can see, my "main" timestamp, against which I calculate these metrics, is the timestamp field, while register_time is just a dimension in string form (ISO 8601 UTC).
Aggregating
And now, for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the Time Format Extraction Function.
The query looks like this:
{
  "intervals": "2016-01-20/2016-07-01",
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "register_time",
      "outputName": "reg_date",
      "extractionFn": {
        "type": "timeFormat",
        "format": "YYYY-MM-dd",
        "timeZone": "Europe/Bratislava",
        "locale": "sk-SK"
      }
    }
  ],
  "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
  "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
  "dataSource": "ds1",
  "queryType": "groupBy"
}
Filtering
The solution for filtering is based on the JavaScript Extraction Function, with which I can transform the date to UNIX time and use it inside, for example, a bound filter:
{
  "intervals": "2016-01-20/2016-07-01",
  "dimensions": [
    "platform",
    {
      "type": "extraction",
      "dimension": "register_time",
      "outputName": "reg_date",
      "extractionFn": {
        "type": "javascript",
        "function": "function(x) {return Date.parse(x)/1000}"
      }
    }
  ],
  "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
  "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
  "dataSource": "ds1",
  "queryType": "groupBy",
  "filter": {
    "type": "bound",
    "dimension": "register_time",
    "outputName": "reg_date",
    "alphaNumeric": "true",
    "extractionFn": {
      "type": "javascript",
      "function": "function(x) {return Date.parse(x)/1000}"
    }
  }
}
I've tried to filter it 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I've double-checked it with various JavaScript REPLs. But hey, I'm no JavaScript expert.
Unfortunately, Druid has only one timestamp column that can be used for rollup. Currently Druid treats all the other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filter.
I guess you might be able to hack it that way.
Hopefully in the future Druid will allow different types of columns, and maybe timestamp will be one of those.
Another solution is to add a longMin-style metric for the timestamp and store the epoch time in that field, or to convert the date-time to a number and store it (e.g. 31st March 2021 08:00 becomes 310320210800).
As of Druid 0.22, the documentation states that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec and transformed back, if needed, at query time.
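A minimal sketch of that approach, reusing the register_time field from the question in the newer dataSchema layout (the name register_time_long and the use of the timestamp_parse expression are illustrative assumptions):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "register_time_long",
      "expression": "timestamp_parse(register_time)"
    }
  ]
},
"dimensionsSpec": {
  "dimensions": [
    "user_id",
    "platform",
    { "type": "long", "name": "register_time_long" }
  ]
}
At query time the long value can be turned back into a readable date with, for example, a timeFormat extraction function or the timestamp_format expression.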