I have a database attached to 4 forests, and I want to create a change document in MarkLogic every time any value in a document changes. The change document should contain the date of the change, the old value, and the new value.
I was able to accomplish that by using pre-commit and post-commit triggers.
The pre-commit trigger captures the old version of the document, and the post-commit trigger has the new version. I compare the two documents and create the change document.
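Independent of the trigger plumbing, the comparison step can be sketched as follows. This is a minimal Python sketch over flat key/value documents; the actual trigger modules would do the equivalent over XML element nodes, and the function and field names here are illustrative, not MarkLogic API:

```python
import datetime

def build_change_document(old_doc, new_doc, when=None):
    """Compare two flat documents and record every changed value.

    Returns a list of change entries, one per element whose value
    differs between the old and new version of the document.
    """
    when = when or datetime.datetime.utcnow().isoformat()
    changes = []
    for key in sorted(set(old_doc) | set(new_doc)):
        old_value = old_doc.get(key)
        new_value = new_doc.get(key)
        if old_value != new_value:
            changes.append({
                "element": key,
                "date": when,
                "old": old_value,
                "new": new_value,
            })
    return changes
```

Elements present in only one version show up with `None` on the other side, which covers added and removed values as well as modifications.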
This works well when updating a single document.
However, I tested this solution by loading 20,000 documents with MLCP from a delimited file. I changed the value of a single element in all documents and loaded the data again.
My triggers were only able to capture 7,000 of the 20,000 changed documents. The rest of the documents failed to load, and MLCP reported the error:
"XDMP-NEWSTAMP Timestamp too new for forest"
I did another test by removing my code from the pre-commit and post-commit triggers, so the triggers did nothing, and loaded the documents again. This time 19,000 of the 20,000 documents were successfully updated, and I got the same XDMP-NEWSTAMP error for the rest.
When I remove the triggers entirely and load the documents, all 20,000 get loaded and updated.
So it seems that executing a large number of triggers creates problems when loading documents.
Is there a solution for this problem?
Am I going the wrong path to accomplish what I need to do?
MLCP Command:
mlcp import -host localhost -port 8000 -username uname -password pwd -input_file_path D:....\file.dsv -delimiter '|' -input_file_type delimited_text -database Overtime -output_collections test
Creating triggers:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
import module namespace trgr="http://marklogic.com/xdmp/triggers" at "/MarkLogic/triggers.xqy";
trgr:create-trigger("PreCommitTrigger", "Trigger that fires when a document is updated",
trgr:trigger-data-event(
trgr:collection-scope("test"),
trgr:document-content("modify"),
trgr:pre-commit()),
trgr:trigger-module(xdmp:database("Overtime"), "/", "preCommit.xqy"),
fn:true(), xdmp:default-permissions()),
trgr:create-trigger("PostCommitTrigger", "Trigger that fires when a document is updated",
trgr:trigger-data-event(
trgr:collection-scope("test"),
trgr:document-content("modify"),
trgr:post-commit()),
trgr:trigger-module(xdmp:database("Overtime"), "/", "postCommit.xqy"),
fn:true(), xdmp:default-permissions())
Loading Trigger documents:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
xdmp:document-insert('/preCommit.xqy',
text{ " '' "});
xdmp:document-insert('/postCommit.xqy',
text{ " '' "})
MarkLogic has the Content Processing Framework (CPF - https://docs.marklogic.com/guide/cpf/quickStart?hq=CPF), which can help you apply transformations to your files. In this case you could have a workflow that manages any inserted file, analyzes it, and creates a DLS (https://docs.marklogic.com/dls) version of it. DLS is a library that lets you version files, which I guess is what you want to do. Hope it helps.
I am trying to check whether any zip file exists in my SFTP folder. The Get Metadata activity works fine if I explicitly provide the filename, but I can't know the file name here, as it is embedded with a timestamp and a sequence number, which are dynamic.
I tried specifying *.zip, but that never works: the Get Metadata activity always returns false even though the zip file actually exists. Is there any way to get this to work? Suggestions please.
A sample file name is below; the last part, 0000000004_20210907080426, is dynamic and will change every time:
TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426
You could possibly do a Get Metadata on the folder and include the Child items under the Field List.
You'll have to iterate with a ForEach using the expression
@activity('Get Folder Files').output.childItems
and then check if item().name (within the ForEach) ends with '.zip'.
I know it's a pain when the wildcard stuff doesn't work for a given dataset, but this alternative ought to work for you.
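The filter the ForEach performs can be sketched outside ADF. This is plain Python standing in for the pipeline expressions; `child_items` mirrors the shape the Get Metadata childItems output returns (a list of objects with `name` and `type` fields):

```python
def zip_files(child_items):
    # Keep only entries that are files and whose name ends with '.zip',
    # mirroring the item().name check done inside the ForEach.
    return [item["name"] for item in child_items
            if item.get("type") == "File" and item["name"].endswith(".zip")]
```

If the resulting list is non-empty, at least one matching zip file exists in the folder.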
If you are using exists in the Get Metadata activity, you need to provide the file name in it.
As a workaround, you can get the child items (with filename *.zip) using the Get Metadata activity.
Pass the output to If Condition activity, to check if the required file exists.
@contains(string(json(string(activity('Get Metadata1').output.childItems))),'.zip')
You can use other activities inside True and False activities based on If Condition.
If no file exists, no child items are found in the Get Metadata activity.
For an SFTP dataset, if you want to use a wildcard to filter files under the specified folderPath, you have to skip that dataset setting and specify the file name in the activity's source settings (the Get Metadata activity).
But a wildcard filter on folders/files is not supported for the Get Metadata activity.
I want to copy a file from the source container to the target container, but only when the source file is new (the latest file is placed in the source). I am not sure how to proceed, nor about the syntax to check whether the source file is newer than the target. Should I use two Get Metadata activities to check the source and target last-modified dates and use an If Condition? I tried a few ways, but it didn't work.
Any help will be handy.
The syntax I used for the condition is giving me the error:
@if(greaterOrEquals(ticks(activity('Get Metadata_File').output.lastModified),activity('Get Metadata_File2')),True,False)
error message
The function 'greaterOrEquals' expects all of its parameters to be either integer or decimal numbers. Found invalid parameter types: 'Object'
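The error message itself points at the cause: the second argument passed to greaterOrEquals is the whole activity object rather than a timestamp, so both sides need to be wrapped as ticks(activity('...').output.lastModified). The comparison being expressed can be sketched in Python, with a rough stand-in for ADF's ticks() (100-nanosecond intervals since 0001-01-01):

```python
import datetime

def ticks(iso_timestamp):
    # Rough stand-in for ADF's ticks(): convert an ISO-8601 timestamp
    # into an integer so two timestamps can be compared numerically.
    dt = datetime.datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    epoch = datetime.datetime(1, 1, 1)
    return int((dt - epoch).total_seconds() * 10_000_000)

def source_is_newer(source_last_modified, target_last_modified):
    # Mirrors @greaterOrEquals(ticks(source...), ticks(target...)):
    # both arguments must be timestamps, not raw activity objects.
    return ticks(source_last_modified) >= ticks(target_last_modified)
```

The key point is that both operands are converted to numbers before the comparison, which is exactly what the failing expression only did for one side.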
You can try one of the Pipeline Templates that ADF offers.
Use this template to copy new and changed files only, by LastModifiedDate. The template first selects the new and changed files by their "LastModifiedDate" attribute, then copies them from the data source store to the data destination store. You can also go to the "Copy Data Tool" to get a pipeline for the same scenario with more connectors.
OR...
You can use Storage Event Triggers to trigger the pipeline with copy activity to copy when each new file is written to storage.
Follow detailed example here: Create a trigger that runs a pipeline in response to a storage event
I developed a small personal information directory that my client accesses and updates through a Django admin interface. That information needs to be searchable, so I set up my Django site to keep that data in a search index. I originally used Haystack and Whoosh for the search index, but I recently had to move away from those tools, and switched to Elasticsearch 5.
Previously, whenever anything in the directory was updated, the code simply cleared the entire search index and rebuilt it from scratch. There are only a few hundred entries in the directory, so that wasn't onerously slow. Unfortunately, attempting to do the same thing in Elasticsearch is very unreliable, due to what I presume to be a race condition of some sort in my code.
Here's the code I wrote that uses elasticsearch-py and elasticsearch-dsl-py:
import elasticsearch
import time

from django.apps import apps
from django.conf import settings
from elasticsearch.helpers import bulk
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl import DocType, Text, Search

# Create the default Elasticsearch connection using the host specified in settings.py.
elasticsearch_host = "{0}:{1}".format(
    settings.ELASTICSEARCH_HOST['HOST'], settings.ELASTICSEARCH_HOST['PORT']
)
elasticsearch_connection = connections.create_connection(hosts=[elasticsearch_host])


class DepartmentIndex(DocType):
    url = Text()
    name = Text()
    text = Text(analyzer='english')
    content_type = Text()

    class Meta:
        index = 'departmental_directory'


def refresh_index():
    # Erase the existing index.
    try:
        elasticsearch_connection.indices.delete(index=DepartmentIndex().meta.index)
    except elasticsearch.exceptions.NotFoundError:
        # If it doesn't exist, the job's already done.
        pass

    # Wait a few seconds to give enough time for Elasticsearch to accept that the
    # DepartmentIndex is gone before we try to recreate it.
    time.sleep(3)

    # Rebuild the index from scratch.
    DepartmentIndex.init()
    Department = apps.get_model('departmental_directory', 'Department')
    bulk(
        client=elasticsearch_connection,
        actions=(b.indexing() for b in Department.objects.all().iterator())
    )
I had set up Django signals to call refresh_index() whenever a Department got saved. But refresh_index() was frequently crashing due to this error:
elasticsearch.exceptions.RequestError: TransportError(400, u'index_already_exists_exception', u'index [departmental_directory/uOQdBukEQBWvMZk83eByug] already exists')
Which is why I added that time.sleep(3) call. I'm assuming the index hasn't been fully deleted by the time DepartmentIndex.init() is called, which causes the error.
My guess is that I've simply been going about this in entirely the wrong way. There's got to be a better way to keep an elasticsearch index up-to-date using elasticsearch-dsl-py, but I just don't know what it is, and I haven't been able to figure it out through their docs.
Searching for "rebuild elasticsearch index from scratch" on google gives loads of results for "how to reindex your elasticsearch data", but that's not what I want. I need to replace the data with new, more up-to-date data from my app's database.
Maybe this will help: https://github.com/HonzaKral/es-django-example/blob/master/qa/models.py#L137-L146
Either way you want to have two methods: one to batch-load all of your data into a new index (https://github.com/HonzaKral/es-django-example/blob/master/qa/management/commands/index_data.py) and, optionally, a synchronization using methods or signals as mentioned above.
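One common pattern (a sketch under assumptions, not the library's prescribed method; the helper names are mine) is to bulk-load each rebuild into a brand-new, timestamped physical index and then atomically repoint an alias at it, instead of deleting and recreating the live index. The pure-data part of that, the index name and the actions payload you would pass to `indices.update_aliases()`, looks like:

```python
import datetime

ALIAS = "departmental_directory"

def new_index_name(alias=ALIAS, now=None):
    # A unique physical index name per rebuild, e.g.
    # departmental_directory-20240101120000
    now = now or datetime.datetime.utcnow()
    return "{0}-{1:%Y%m%d%H%M%S}".format(alias, now)

def alias_swap_actions(old_indices, new_index, alias=ALIAS):
    # Payload for elasticsearch_connection.indices.update_aliases():
    # drop the alias from every old index and add it to the new one in a
    # single atomic request, so searches never hit a half-built index.
    actions = [{"remove": {"index": idx, "alias": alias}} for idx in old_indices]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}
```

Because the swap is atomic, queries against the alias never see a deleted or empty index, and the time.sleep(3) workaround becomes unnecessary; the old index can be deleted after the alias has moved.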
My simple experiment reads from an Azure Storage Table, selects a few columns, and writes to another Azure Storage Table. This experiment runs fine in its workspace (let's call it WorkSpace1).
Now I need to move this experiment as-is to another workspace (call it WorkSpace2) using PowerShell, and I need to be able to run the experiment there.
I am currently using this Library - https://github.com/hning86/azuremlps
Problem :
When I copy the experiment using 'Copy-AmlExperiment' from WorkSpace1 to WorkSpace2, the experiment and all its properties get copied except the Azure Table account key.
Now, this experiment runs fine if I manually enter the account key for the Import/Export modules on studio.azureml.net.
But I am unable to do this via PowerShell. If I export (Export-AmlExperimentGraph) the copied experiment from WorkSpace2 as JSON, insert the account key into the JSON file, and import it (Import-AmlExperiment) into WorkSpace2, the experiment fails to run.
In PowerShell I get an "Internal Server Error : 500".
While running on studio.azureml.net, I get the notification "Your experiment cannot be run because it has been updated in another session. Please re-open this experiment to see the latest version."
Is there any way to move an experiment with external dependencies to another workspace and run it?
Edit: I think the problem has to do with how the experiment handles the account key. When I enter it manually, it's converted into a JSON array comprising RecordKey and IndexInRecord. But when I upload the JSON experiment with the account key, the key remains as-is and does not get resolved into RecordKey and IndexInRecord.
For me, publishing the experiment as a private experiment in the Cortana gallery is one of the most useful options. Only people with the link can see and add the experiment from the gallery. At the link below I've explained the steps I followed.
https://naadispeaks.wordpress.com/2017/08/14/copying-migrating-azureml-experiments/
When the experiment is copied, the password is wiped for security reasons. If you want to programmatically inject it back, you have to set another metadata field to signal that this is a plain-text password, not an encrypted password that you are setting. If you export the experiment in JSON format, you can easily figure this out.
I think I found the issue with why you are unable to restore the credentials.
Export the JSON graph into your local disk, then update whatever parameter has to be updated.
Also, you will notice that the credentials are stored as 'Placeholders' instead of 'Literals'. Hence it makes sense to change them to Literals instead of placeholders.
This you can do by traversing through the JSON to find the relevant parameters you need to update.
I have a standard vanilla database in a folder location, e.g. MyDatabase.mdf and MyDatabase.ldf. My PowerShell script copies these files to the SQL Server data folder, renaming them in the process, e.g. MyProject.mdf and MyProject.ldf.
I then attach the database; however, the logical names of the original vanilla .mdf and .ldf remain. I cannot figure out how to change these with PowerShell. I can do this with a restore, but I'm wondering how to do it with an attach.
$mdfFileName = "DataFolder\MyProject.mdf"
$ldfFileName = "DataFolder\MyProject.ldf"
$sc = New-Object System.Collections.Specialized.StringCollection
$sc.Add($mdfFileName)
$sc.Add($ldfFileName)
$server.AttachDatabase("MyProject", $sc)
As a test, I have tried
$db.LogFiles[0].Name
and this returns the logical name; however, it is only accessible as a getter.
The sample code is missing a lot of functionality. It seems you are using SMO to work with the database. Why not use T-SQL instead? It can be executed with Invoke-SqlCmd, or by using the System.Data.SqlClient classes from .NET.
CREATE DATABASE [MyProject] ON
(FILENAME = 'some\path\MyProject.mdf'), (FILENAME = 'some\path\MyProject.ldf')
FOR ATTACH;
You can call the Rename method on the logical file, followed by the Alter method on the database. You'll then need to refresh your SMO object with the Refresh method to see the changes.
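The T-SQL route extends naturally to the rename as well. Assuming the logical names carried over from the vanilla files are MyDatabase and MyDatabase_log (an assumption here; check sys.database_files for the actual names), something like:

```sql
-- Rename the logical file names after the attach. The NAME values below
-- are assumed; query sys.database_files for the real logical names.
ALTER DATABASE [MyProject] MODIFY FILE (NAME = N'MyDatabase', NEWNAME = N'MyProject');
ALTER DATABASE [MyProject] MODIFY FILE (NAME = N'MyDatabase_log', NEWNAME = N'MyProject_log');
```

Run from PowerShell via Invoke-SqlCmd, this avoids the SMO getter limitation entirely.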