Remote stream multiple files in SOLR - streaming

I want to use SOLR's remote-streaming facility to extract and index the content of files.
This works fine if I pass stream.file=xxx as a parameter to the http GET method.
However, I have a lot of these files and want to batch them up (i.e. avoid issuing a separate GET per file).
Is there a way I can do this in SOLR?
e.g. I'd like to be able to POST some xml like this:
<add>
<doc stream_file="filename">
<field name="id">123</field>
</doc>
<doc>...

This was recently asked (and answered) on the solr-user mailing list.

I find that multiple ADDs are fast, so long as you only COMMIT the batch and don't try to COMMIT after every ADD. I would guess the remaining performance penalty is too small to be worth writing your own RequestHandler.
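For example, here's a minimal sketch in Perl with LWP, assuming a default local Solr with the ExtractingRequestHandler at /update/extract (URL, handler path, and file paths are assumptions): each file is still its own request, but nothing is committed until the end.

#!/usr/bin/perl
# Sketch: one extract request per file with commit=false, then a single
# COMMIT at the end. Adjust the URLs for your Solr install.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua    = LWP::UserAgent->new;
my @files = ('/data/doc1.pdf', '/data/doc2.pdf');   # hypothetical paths
my $id    = 100;

for my $file (@files) {
    my $uri = URI->new('http://localhost:8983/solr/update/extract');
    $uri->query_form(
        'stream.file' => $file,
        'literal.id'  => $id++,
        'commit'      => 'false',   # don't commit per ADD
    );
    my $res = $ua->get($uri);
    die "extract failed for $file: ", $res->status_line unless $res->is_success;
}

# One COMMIT for the whole batch.
my $commit = $ua->post('http://localhost:8983/solr/update',
    'Content-Type' => 'text/xml',
    Content        => '<commit/>');
die "commit failed: ", $commit->status_line unless $commit->is_success;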


Add directory to Kodi using the command line

Is it possible to add a directory to Kodi using the command line? I've been looking for this with no luck so far.
What I'm looking for is to automate the process of adding a directory that is normally done manually. For some reason this doesn't appear to be a popular question out there; am I missing something?
Crawling the Kodi/XBMC forums and wiki shows a few options. Here's what I've gathered.
Edit the Database Directly (not recommended)
Kodi stores this information in a SQLite database, but manipulating it yourself would be pretty tricky: it requires knowing both the path of each SQLite database file and the relationship of each column/table in each database file (assuming it's a strictly relational database, which most are).
For example:
sqlite3 <path_to_kodi_preferences>/userdata/Database/MyVideos119.db
sqlite> .tables
actor movie_view studio_link
actor_link movielinktvshow tag
art musicvideo tag_link
bookmark musicvideo_view tvshow
country path tvshow_view
country_link rating tvshowcounts
director_link season_view tvshowlinkpath
episode seasons tvshowlinkpath_minview
episode_view sets uniqueid
files settings version
genre stacktimes writer_link
genre_link streamdetails
movie studio
Edit sources.xml
The official wiki mentions <path_to_kodi_preferences>/userdata/sources.xml for this, but it still assumes you know how to manipulate an XML file programmatically, and the community warns that this is potentially "invasive" and that official addons/plugins aren't allowed to use this technique.
I dove into this and the XML seems like the way to go, for example, to add Videos:
<video>
<default pathversion="1"></default>
<source>
<name>Movies</name>
<path pathversion="1">/home/ubuntu/Movies/</path>
<allowsharing>true</allowsharing>
</source>
<source>
<name>Video Playlists</name>
<path pathversion="1">special://videoplaylists/</path>
<allowsharing>true</allowsharing>
</source>
+ <source>
+ <name>MyCustomDirectory</name>
+ <path pathversion="1">/home/ubuntu/MyCustomDirectory/</path>
+ <allowsharing>true</allowsharing>
+ </source>
</video>
... however, comments suggest Kodi needs to be restarted and that this location still needs to be crawled/refreshed. There may be some "watchdog" add-ons that can do this for you.
Use the Add-On API
Another technique is to use the official Add-On API, such as UpdateLibrary(database, path); however, examples usually involve Python calling the API directly. Here's an example from the PlexKodiConnect GitHub project:
# Make sure Kodi knows we wiped the databases
xbmc.executebuiltin('UpdateLibrary(video)')
if utils.settings('enableMusic') == 'true':
    xbmc.executebuiltin('UpdateLibrary(music)')
Since the simplest solution is often the best, I would recommend working on a way to automate the sources.xml file (sketched below). Modifying XML files is well-documented in nearly all programming languages (to that point, you could technically brute-force it and just inject an XML string at the given place without an XML parser 😈), and then handle the restart and refresh operations once the XML is confirmed as being updated properly.
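Here's a minimal sketch of that idea in Perl with XML::LibXML; the sources.xml location and the <sources>/<video> layout are assumptions based on the snippet above, so adjust for your install:

#!/usr/bin/perl
# Sketch: append a <source> entry to Kodi's sources.xml with XML::LibXML.
use strict;
use warnings;
use XML::LibXML;

my $file = "$ENV{HOME}/.kodi/userdata/sources.xml";   # assumed location
my $doc  = XML::LibXML->load_xml(location => $file);
my ($video) = $doc->findnodes('/sources/video')
    or die "no <video> section found in $file";

my $source = $doc->createElement('source');
for my $pair (['name', 'MyCustomDirectory'],
              ['path', '/home/ubuntu/MyCustomDirectory/'],
              ['allowsharing', 'true']) {
    my $node = $doc->createElement($pair->[0]);
    $node->appendText($pair->[1]);
    $node->setAttribute(pathversion => '1') if $pair->[0] eq 'path';
    $source->appendChild($node);
}
$video->appendChild($source);
$doc->toFile($file, 1);   # write back pretty-printed; restart Kodi afterwards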

LibXML: Comment-out a block of Elements

Is there a way to add/initiate a comment (e.g. $dom->createComment ...) such that it comments out an entire block of xml tags? Basically I want to turn off the content between the comment markers.
For example, it would look like this:
<TT>
<AA>keep</AA>
<!-- comment to blocking
<BB>hideme1</BB>
<CC>hideme2</CC>
-->
<DD>d's content is good</DD>
</TT>
Actually this question is a precursor to my attempt to figure out a method to mark up/label/identify the changes to an XML file in support of new client software functionality, while keeping the ability to remove/back out these XML changes in the rare event the client needs to fall back to the previous software version. (And no, I can't simply point back to the original XML file, because the client is allowed to make minor modifications to existing node text values.) This is all going to be controlled via a Perl script and LibXML's core modules (I can't use modules the client doesn't have).
So basically I've identified three possible types of XML changes resulting from new client software functionality:
1.) ADD new element node(s) (typically to support new sw functionality)
2.) DELETE element node(s), or blocks of them (would be rare, but nevertheless a possibility)
3.) CHANGE node text values (rare, but the new sw may require a new value)
For all three types, the client needs the ability to back out the changes. One thing I was thinking of using is ATTRIBUTES, since the existing XML files don't use them. For example, for an ADD change type, I could include an attribute like ADD="sw version 4.1". This way, if it needs to be removed, I could simply have the Perl script find those attribute strings and delete them (using LibXML methods). Same thing with the CHANGE type: I could use an attribute like CHG="newvalue_oldvalue", then again use straight Perl (or LibXML) to switch back the value based on the contents of the attribute. The DELETE change type is giving me a problem though (as well as the others lol!). I want to be able to "keep" the deleted lines in the XML file solely for the purpose of the software falling back a version (at some later point the Perl script could eventually clean them up/delete them).
I know this is a lot; I'm new to LibXML (but not to Perl). I was just wondering whether any of you have thoughts on how to go about this, or have seen anything resembling this kind of request. I'd be grateful for any kind of advice! Thank you.
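For the comment-out part specifically, here is a minimal sketch of the createComment approach with XML::LibXML, using the element names from your example: the nodes to hide are serialized into a single comment node that replaces them.

#!/usr/bin/perl
# Sketch: "comment out" sibling elements by replacing them with one comment
# node that contains their serialized XML.
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<TT>
<AA>keep</AA>
<BB>hideme1</BB>
<CC>hideme2</CC>
<DD>d's content is good</DD>
</TT>
XML

my @hide = $doc->findnodes('/TT/BB | /TT/CC');
my $body = join "\n", map { $_->toString } @hide;
my $comment = $doc->createComment(" comment to blocking\n$body\n");

# Swap the first hidden node for the comment, then drop the rest.
my $parent = $hide[0]->parentNode;
$parent->replaceChild($comment, $hide[0]);
$parent->removeChild($_) for @hide[1 .. $#hide];

print $doc->toString;

Backing the change out would be the mirror operation: find the comment node, re-parse its text content with XML::LibXML, and replace the comment with the recovered nodes.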

Is there a simple way to extract content from a webpage?

Our build software generates a webpage when the build fails, and lists the users who've committed since the last build. I'd like to have a way to parse the page for members of my team. For example:
Commit
18e1bc67b7e3123987daf8c219a4fbe2003de4
by bob.dole</b><pre>1112233- Description on header is not carried forward to BD doc after PCPROJBILL is ran<br></pre></div></td></tr><tr><td width="16"><img title="The file was modified" height="16" alt="The file was modified" width="16" src="/static/fbfd5d7f/images/16x16/document_edit.png" /></td><td><a>pcbatch/projbill.cpp</a></td></tr><tr class="pane"><td colspan="2" class="changeset"><a name="detail54"></a><div class="changeset-message"><b>
So the script would take a URL as input, search the page for 'bob.dole', and output all of the details associated with him (commit hash, pre-data, etc.) to a file.
Could someone give me an idea of what would be the easiest way to accomplish this? I was thinking of using perl, but I'm not sure if there's something more straightforward.
If I got your question correctly, you want to get the webpage content and parse it to find the user name. If that's the case, I would use PHP.
Use file_get_contents("your_website"); this returns the page as a string for you to parse.
Then you can use strpos() to find the indices of substrings. This will later help you extract the user name using the substr() function.
Hope it helps.
The Perl module you are looking for that helps you search based on nodes is Mojo::DOM.
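For example, here is a sketch with Mojo::UserAgent and Mojo::DOM; the div.changeset-message selector is a guess based on the HTML fragment above, so adjust it to the actual page structure:

#!/usr/bin/perl
# Sketch: fetch the build page and print the changeset blocks mentioning a user.
use strict;
use warnings;
use Mojo::UserAgent;

my $url  = shift @ARGV or die "usage: $0 URL [user]\n";
my $user = shift @ARGV // 'bob.dole';

my $dom = Mojo::UserAgent->new->get($url)->result->dom;

$dom->find('div.changeset-message')->each(sub {
    my $block = shift;
    print $block->all_text, "\n" if $block->all_text =~ /\Q$user\E/;
});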

Trying to figure out what {s: ;} tags mean and where they come from

I am working on migrating posts from the RightNow infrastructure to another service called ZenDesk. I noticed that whenever users added files or even URL links, the XML data I pull from RightNow contains a lot of weird codes like this:
{s:3:""url"";s:45:""/files/56f5be6c1/MUG_presso.pdf"";s:4:""name"";s:27:""MUG presso.pdf"";s:4:""size"";s:5:""2.1MB"";}
It wasn't too hard to write something that parses them and produces normal URLs and links, but I was wondering whether this is something specific to the RightNow service, or a tagging system in wider use. I tried googling for this but got some weird results, so I thought Stack Overflow might have someone who has run into this one.
So, does anyone know what these {s ;} tags are called, and whether there are any particular tools to read them?
Any answers appreciated!
This resembles partial PHP serialized data, as returned by the serialize() call. It looks like someone may have turned each " into "", which could prevent it from parsing properly. If it's wrapped with text like this before the {s: section, it's almost definitely PHP.
a:6:{i:1;a:10:{s:
These letters/numbers mean things like "an array with six elements follows", "a string of length 20 follows", etc.
You can use any PHP instance with unserialize() to handle the data. If those double-quotes are indeed returned by the API, you might need to replace :"" and ""; with " before parsing.
Parsing modules exist for other languages like Python. You can find more information in this answer.
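In Perl, for instance, here is a sketch assuming the CPAN module PHP::Serialization; the string lengths are adjusted to match this shortened sample, since the serializer reads strings by their declared length:

#!/usr/bin/perl
# Sketch: undo the doubled quotes, then unserialize the PHP data.
use strict;
use warnings;
use PHP::Serialization qw(unserialize);
use Data::Dumper;

my $raw = 'a:2:{s:3:""url"";s:31:""/files/56f5be6c1/MUG_presso.pdf"";'
        . 's:4:""name"";s:14:""MUG presso.pdf"";}';

(my $fixed = $raw) =~ s/""/"/g;      # turn each "" back into a single "
my $data = unserialize($fixed);      # hashref: { url => ..., name => ... }
print Dumper($data);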

How can I tell if two image files are the same in Perl?

I have a Perl script I wrote for my own personal use that fetches image files from a website periodically. It then saves these images to a folder. These image files are quite often the same from fetch to fetch, and I'd like to not save duplicates if I can get around it.
My question: What would be the best way to compare/check if they are the same?
My only real thought so far is to open a file handle to the existing one, md5 it, md5 the $response->content from the fetch, and then compare them. Would that work?
Is there a better way?
EDIT:
Wow, already tons of great suggestions. Does it help if I tell you that this script runs daily via cron? I.e. it is guaranteed to always run at the exact same time everyday? Also: I'm looking at the last-modified headers on some of these, and they don't look 100% accurate, i.e. there are some that have a last-modified of over a week ago when I know the image is more recent than that. I'm assuming that's because the image file itself hasn't been modified on the server since then... which doesn't help me much...
Don't open and hash the stored image each time - stash the hash alongside the image when you store it. Compare sizes as well.
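A quick Perl sketch of that idea (Digest::MD5 ships with core Perl; the .md5 sidecar file is just one way to stash the digest):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Save $content to $path only if it differs from what's already stored.
sub save_if_new {
    my ($path, $content) = @_;
    my $new_md5 = md5_hex($content);

    my $stash = "$path.md5";               # hypothetical sidecar file
    if (-e $stash) {
        open my $fh, '<', $stash or die $!;
        chomp(my $old_md5 = <$fh>);
        return 0 if $old_md5 eq $new_md5;  # same image, skip the write
    }

    open my $img, '>', $path or die $!;
    binmode $img;
    print {$img} $content;
    open my $out, '>', $stash or die $!;
    print {$out} "$new_md5\n";
    return 1;                              # new or changed image saved
}

# e.g. save_if_new('images/latest.jpg', $response->content);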
Don't issue a GET request straight away; do a HEAD first and compare the size, last modification date, and any ETags to what you got last time.
There are a number of HTTP headers you can use for this -- if you save the time that you last retrieved the file, you can do a conditional get with
If-Modified-Since: <date>
Or, if the server returns an Etag header with the response, you can store that with the image, (or a collection of all of the etags you have seen for that image), and do:
If-None-Match: <all of your etags here>
If the server supports conditional gets, then you will get a "304 Not Modified" response, with no body.
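A sketch of such a conditional GET with LWP::UserAgent (the URL is hypothetical, and where you stash the saved ETag/Last-Modified values between runs is up to you):

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/image.jpg';        # hypothetical URL

my ($saved_etag, $saved_modified);               # load these from your stash

my %headers;
$headers{'If-None-Match'}     = $saved_etag     if defined $saved_etag;
$headers{'If-Modified-Since'} = $saved_modified if defined $saved_modified;

my $res = $ua->get($url, %headers);
if ($res->code == 304) {
    print "not modified, skipping download\n";
} elsif ($res->is_success) {
    # Save the image plus the validators for next time.
    $saved_etag     = $res->header('ETag');
    $saved_modified = $res->header('Last-Modified');
    # ... write $res->content to disk here ...
}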
Yep, that sounds right.
Depending on how you're getting the file and how frequently, you might also be able to check for HTTP 304 Not Modified and save yourself the download.
md5 would work, but you'd still have to pull the file. Is there any useful metadata in the HTTP headers: Content-Length, Cache-Control directives, ETags, etc.?
There's also the nice fdupes tool for this purpose. I don't know what system you're using, or which systems the tool can be built for.