Using each plugin in Nutch separately

I'm using the extractor plugin with Nutch 1.15. The plugin makes use of parsed data.
The plugin works fine when used as a whole. The problem arises when changes are made to the custom-extractors.xml file.
The entire crawling process needs to be restarted even for a small change in the custom-extractors.xml file.
Is there a way to run a single plugin separately on already-parsed data?

Since this plugin is a Parser filter, it must be used as part of the Parse step, and is not stand-alone.
However, there are a number of things you can do.
If you are looking to change the configuration on the fly (affecting only newly parsed documents), you can use the extractor.file property to specify any location on HDFS and replace that file as needed; it will be read by each task.
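Replacing the file in place is easy to script. A minimal sketch, assuming the hdfs CLI is on the PATH and that extractor.file points at /config/custom-extractors.xml (a hypothetical location):

    import subprocess

    # Overwrite the extractor configuration on HDFS in place (-f forces replacement).
    # Newly started parse tasks will read the updated file.
    # /config/custom-extractors.xml is a hypothetical path -- use whatever
    # location your extractor.file property names.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f",
         "custom-extractors.xml", "/config/custom-extractors.xml"],
        check=True,
    )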
If you want to reapply the changes to previously parsed documents, the answer depends on the specifics of your crawl, but you may be able to run the parse step again using nutch parse on old segments (you will need to delete the existing parse folders in the segments first).
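A rough sketch of that re-parse loop, assuming a local crawl layout of crawl/segments/<timestamp> (the path is illustrative), that the parse output lives in the usual crawl_parse, parse_data and parse_text folders, and that the nutch launcher script is on the PATH:

    import shutil
    import subprocess
    from pathlib import Path

    segments = Path("crawl/segments")  # illustrative crawl directory layout

    for segment in segments.iterdir():
        if not segment.is_dir():
            continue
        # Remove the existing parse output so the parse step will run again.
        for parse_dir in ("crawl_parse", "parse_data", "parse_text"):
            shutil.rmtree(segment / parse_dir, ignore_errors=True)
        # Re-run parsing on the segment with the updated extractor configuration.
        subprocess.run(["nutch", "parse", str(segment)], check=True)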

Related

Pick up changes in json files that are being read by pyspark readStream?

I have json files where each file describes a particular entity, including its state. I am trying to pull these into Delta by using readStream and writeStream. This works perfectly for new files. However, these json files are frequently updated (i.e., states are changed, comments added, history items added, etc.), and the changed json files are not picked up by the readStream. I assume that is because readStream does not reprocess items. Is there a way around this?
One thing I am considering is changing my initial write of the json to add a timestamp to the file name so that it becomes a different record to the stream (I already have to do de-duping in my writeStream anyway), but I am trying not to modify the code that writes the json, as it is already being used in production.
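If you do go that route, the change on the writer side is small. A sketch, assuming the writer controls the output file name (the entity id field and output directory are hypothetical):

    import json
    import time

    def write_entity(entity: dict, out_dir: str) -> str:
        # Embed a millisecond timestamp in the name so each update becomes a
        # brand-new file from the stream's point of view (de-duped downstream).
        name = f"{out_dir}/{entity['id']}_{int(time.time() * 1000)}.json"
        with open(name, "w") as f:
            json.dump(entity, f)
        return name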
Ideally I would like to find something like the changeFeed functionality for Cosmos Db, but for reading json files.
Any suggestions?
Thanks!
This is not supported by Spark Structured Streaming: after a file is processed, it won't be processed again.
The closest thing to your requirement exists only in Databricks' Auto Loader: its cloudFiles.allowOverwrites option allows modified files to be reprocessed.
P.S. If you use the cleanSource option for the file source (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources), it may potentially reprocess files, but I'm not 100% sure.
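For reference, a minimal Auto Loader sketch (Databricks only, not vanilla Spark; all paths are placeholders):

    # 'spark' is the ambient SparkSession in a Databricks notebook.
    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.allowOverwrites", "true")  # pick up modified files again
        .load("/mnt/raw/entities/")                    # placeholder input path
    )

    (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/entities/")  # placeholder
        .start("/mnt/delta/entities/")                               # placeholder
    )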

In which file is the _AppInfo data stored in Beckhoff TwinCAT 3 PLC

I'm looking for the 'AppTimestamp' information so it can be used to verify that the code has not been updated/changed by service personnel.
Detect code changes on Beckhoff PLC using C#
At this location I already found part of the information I need, but I was not able to add a comment due to the 'new user' limitations.
You can find the AppTimestamp in the _AppInfo instance.
So just call _AppInfo.AppTimestamp in your program to know the time of the last application start.
Make sure you also check the number of online changes since the last download with the OnlineChangeCnt counter, which you will also find in the _AppInfo instance.
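If you want to read both values from outside the PLC (the linked question does this from C# over ADS), the pyads library can do the same from Python. A sketch, assuming the default first PLC runtime port 851 and the usual system-variable symbol path, both of which may differ in your setup:

    import pyads

    # AMS Net ID and port 851 are assumptions -- adjust for your target system.
    plc = pyads.Connection("5.12.82.20.1.1", 851)
    with plc:
        # Symbol path of the implicit _AppInfo instance; may vary per project.
        app_time = plc.read_by_name(
            "TwinCAT_SystemInfoVarList._AppInfo.AppTimestamp", pyads.PLCTYPE_DT
        )
        changes = plc.read_by_name(
            "TwinCAT_SystemInfoVarList._AppInfo.OnlineChangeCnt", pyads.PLCTYPE_UDINT
        )
        print("last app start:", app_time, "- online changes:", changes)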
There are many places where this value might be saved. TwinCAT saves data to the C:\TwinCAT\3.1\Boot folder; the different files are explained here.
The ProjectName can be found, for example, in the configuration data (CurrentConfig.xml), at the end of the file (TcBootProject/ProjectInfo/ProjectName). The same file contains one date (<TcBootProject CreateTime="2019-06-10T13:14:17">), but it seems to be the build time of the boot project.
I couldn't find the date of AppTimestamp in any file, but perhaps TwinCAT uses the creation time of the files in those folders? Or perhaps it's hidden in the binary somewhere.
When you update the software without updating the boot project, the file Port_851_act.tizip is updated, so you can check its timestamp. When you update the boot project too, Port_851_boot.tizip and other files are also updated.
So basically, to check whether the code was updated by someone, check the modified dates of the files under the Boot directory. Only the .bootdata files should update on their own, as they contain saved persistent data. Of course, the dates can easily be changed with a 3rd-party program, so a more robust solution is to compare the contents of the Port_851.crc file, since it contains the CRC check value of the code: it will always change when the boot project is updated.
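A small sketch of that check in Python (the Boot path and file names follow this answer; the baseline CRC is assumed to have been recorded right after the last authorized download):

    import datetime
    from pathlib import Path

    BOOT = Path(r"C:\TwinCAT\3.1\Boot")

    # When was the active code last replaced (online change / software update)?
    act = BOOT / "Port_851_act.tizip"
    print("code updated:", datetime.datetime.fromtimestamp(act.stat().st_mtime))

    # Timestamps are easy to forge, so compare the CRC file against a baseline.
    current_crc = (BOOT / "Port_851.crc").read_bytes()
    baseline_crc = Path("baseline_Port_851.crc").read_bytes()  # recorded earlier (assumption)
    if current_crc != baseline_crc:
        print("Boot project has changed since the baseline was taken!")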

Symfony: getting form values before and after form handling

Hello, I want to be able to compare values before and after form handling, so that I can process them before flush.
What I do is collect the old values in an array before handleRequest().
I then compare the new values to the old values in the array.
This works perfectly on simple variables, like strings for instance.
However, I want it to work on uploaded files too. I am able to get their full paths and names before handling the form, but when I read the values after checking that the form is valid, I still get the same old values.
I tried calling both $entity->getVar() and $form->getData()->getVar(), and I get the same output...
Hello, I actually found a solution, although it is a departure from the strategy announced in my question, which I realize was somewhat truncated with respect to my objective. That objective was to compare old file names and new file names (those names actually include the full path) for changes, so that I could unlink the old names that no longer appear in the new name list. Basically, to clean up after a file is uploaded to replace another one, without the first file being deleted beforehand, and to save the webmaster the hassle of sorting between uniqid-named files that are still used by the web site and those that are useless.
The problem is that my upload functions, which are very similar to the file upload code shown on the official documentation pages, only seem to take effect at flush time.
So, since what I wanted to do with those files had nothing to do with database operations, I resorted to launching the step-two code after flush, which works fine.
However, I am intrigued by your solutions, as both are strategies I hadn't thought of. Thank you for the suggestions.
That said, I am not sure cloning the whole object would be as straightforward as comparing two arrays of file names.
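The cleanup step itself is just a set difference between the two name lists. The project here is PHP/Symfony, but the logic is language-agnostic; as an illustration in Python, with hypothetical file lists:

    import os

    # old_names / new_names stand in for the full-path lists gathered before
    # and after form handling (hypothetical values, uniqid-style names).
    old_names = ["/uploads/5d2e9a1f3b8c4.pdf", "/uploads/5d2e9a201a7d2.png"]
    new_names = ["/uploads/5d2e9a201a7d2.png", "/uploads/5f1c0b44e9a31.pdf"]

    # Unlink every old file that no longer appears in the new list.
    for stale in set(old_names) - set(new_names):
        if os.path.exists(stale):
            os.unlink(stale)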

How to add a file to ClearCase database, but not in source control?

On my project I have some files that are generated automatically, so you normally wouldn't put them in source control.
But since this generation process takes a long time and the files change fairly periodically, I'd rather keep them in the ClearCase database so as not to impose this process on everyone who wants to compile source that isn't directly related to these files.
So, is there a way to add files to ClearCase UCM without creating a version tree?
More directly, I'd like to know if there is a way to keep only one version per branch, so that delivering the file to the main branch would delete the old version and replace it with the new one.
I know this is a bit unorthodox, but I ask because I'm not interested in the generated files' history and I'd like to save space on the server.
So, is there a way to add files to ClearCase UCM without creating a version tree?
No.
Unless those files are radically different from one generation to the next (or are huge binaries), ClearCase only records the delta, which shouldn't consume too much space.
One trick would be to rename the stream in which the import of the newly generated sources is done and create a new stream, in order not to accumulate a huge version tree over time.
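Scripted with cleartool, that trick might look something like this (stream, project and PVOB names are all hypothetical, and the exact selectors depend on your setup):

    import subprocess

    PVOB = "/vobs/pvob"  # hypothetical project VOB

    def run(*args):
        subprocess.run(["cleartool", *args], check=True)

    # Park the stream that accumulated the generated-file versions under a new name...
    run("rename", f"stream:gen_import@{PVOB}", "gen_import_archived")

    # ...and create a fresh stream in the same project for the next imports.
    run("mkstream", "-in", f"project:my_project@{PVOB}", f"gen_import@{PVOB}")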

eclipse CVS usage: clean timestamps

During synchronisation with the CVS server, Eclipse compares the contents of the files (internally it uses CVS commands, of course). But files without any content change are also shown as different if they have a different timestamp, because they have been "touched". You then always have to check manually, via the file comparison dialog, whether there was really a change or not.
Due to auto-generation, I have some files that always get new timestamps, and therefore I always have to check manually whether they really contain any changes.
In the Eclipse documentation I read:
Update and Commit Operations
There are several flavours of update and commit operations available
in the Synchronize view. You can perform the standard update and
commit operation on all visible applicable changes or a selected
subset. You can also choose to override and update, thus ignoring any
local changes, or override and commit, thus making the remote resource
match the contents of the local resource. You can also choose to clean
the timestamps for files that have been modified locally (perhaps by
an external build tool) but whose contents match that of the server.
That's exactly what I want to do, but I don't know how! There is no further description/manual...
Did anybody use this functionality and can help me (maybe even post a screenshot)?
Thanks in advance,
Mayoares
When you perform a CVS Update on a project (using the context menu Team > Update), Eclipse implicitly updates the timestamps of local files whose contents match those on the server.