Fastest method of checking if multiple different strings are a substring of a 2nd string

Context:
I'm creating a program that will sort and rename my media files, which have names such as The.Office.s04e03.DIVX.WaREZKiNG.avi, into an organized folder structure: one folder per TV series, each containing one folder per season, which in turn contains the media files.
The problem:
I am unsure what the best method is for reading a file name and determining which part of that name is the TV show. For example, in "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to keep a list of all TV shows and to check whether each one is a substring of the file name, but as far as I know this means I have to check every single series against the name of every file.
My question: How should I determine if a string contains one of many other strings?
Thanks

The Aho-Corasick algorithm[1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
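As a rough illustration of the "independent information extractors" idea, here is a minimal Python sketch. The two patterns and the function name extract_info are assumptions made up for this example, and the list of format tags is deliberately incomplete; real filenames will need more patterns.

```python
import re

# Hypothetical pattern for season/episode markers such as "s04e03" or "S4E3".
SEASON_EPISODE = re.compile(r"[sS](\d{1,2})[eE](\d{1,2})")
# Hypothetical, incomplete list of video format tags.
VIDEO_FORMAT = re.compile(r"\b(divx|xvid|x264|hdtv)\b", re.IGNORECASE)


def extract_info(filename):
    """Run each extractor independently and collect whatever it finds."""
    info = {}
    match = SEASON_EPISODE.search(filename)
    if match:
        info["season"] = int(match.group(1))
        info["episode"] = int(match.group(2))
        # Assumption: everything before the marker is the show name.
        info["show"] = filename[:match.start()].replace(".", " ").strip()
    fmt = VIDEO_FORMAT.search(filename)
    if fmt:
        info["format"] = fmt.group(1).upper()
    return info
```

Each extractor can fail without affecting the others, so a filename in an unfamiliar format still yields whatever parts were recognised.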

It depends on the overall structure of the filenames. For instance, is the series name always first? If so, a tree structure would work well. Is there a standard separator between words (a period in your example)? If so, you can split the string on it and build a case-insensitive hashtable of interesting words to boost performance.
However, extracting seasons and episodes is more difficult. A simple solution would be to implement an algorithm for each format you uncover, although by using hints you could create an interesting parser if you wanted to (likely overkill, however).
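The split-and-look-up idea might look something like the Python sketch below. It assumes the series name always comes first and that words are separated by periods; KNOWN_SHOWS and find_show are hypothetical names invented for the example.

```python
# Hypothetical show list; keys are lowercased tuples of words.
KNOWN_SHOWS = {
    ("the", "office"): "The Office",
    ("lost",): "Lost",
}


def find_show(filename, separator="."):
    """Return the known show matching the leading words of filename, or None."""
    words = filename.lower().split(separator)
    # Try the longest prefix first so a longer title wins over a shorter one.
    for end in range(len(words), 0, -1):
        show = KNOWN_SHOWS.get(tuple(words[:end]))
        if show is not None:
            return show
    return None
```

Each file then costs a handful of hash lookups (one per leading word count) instead of one substring test per known series.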

Related

Classification of gender for given names

After some research, I have not yet found a suitable open source library or piece of software that I can use to classify a long table of first names by most likely gender.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize I could get only 1/8 of the names classified, while the rest were labeled as Unknown (I made sure that the format was correct, with no lower/upper case ambiguity, etc.).
Any advice would be appreciated. Thank you in advance!
For the record, the best I could find was to do it manually, looking up names on Google or on dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case. This is mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names of my own nationality)
Many names were truncated to form nicknames or had two nearby letters swapped: here a lookup-table (LUT) approach would fail, and one would instead need a score from a model
Several names were not based on the Roman alphabet, and I guess the mapping into Roman characters produced some ambiguities
For those curious of the original dataset, this is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).

Correcting inconsistent name spellings in flex

I am currently working on an assignment to read a BibTex file and store the data about all the categories, authors and their publications, etc...
In the BibTex file, however, many times the same names are spelled in different ways, sometimes even with unknown characters.
Here is an example of those inconsistencies:
The only way I know how to do this is to create regular expressions specific to each case, and even so I don't know if it would work for the unknown characters. However, there are way too many authors to go about doing it this way.
How could I go about automatically detecting and correcting these spelling inconsistencies to correctly save all authors and their respective publications in a flex filter?
Assuming you have a known list of good authors, for each input author, match them against the list using fuzzywuzzy.
If you do not have a list of known authors, you'll need to make one or get a list of names from somewhere such as Wikipedia.
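A minimal Python sketch of matching against a known list follows. Note that it swaps in the standard library's difflib rather than fuzzywuzzy, so that it runs without extra packages; the author list, the function name canonical_author, and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical list of canonical author names.
KNOWN_AUTHORS = ["Donald E. Knuth", "Leslie Lamport", "Edsger W. Dijkstra"]


def canonical_author(raw_name, known=KNOWN_AUTHORS, cutoff=0.8):
    """Return the closest known author, or None if nothing is similar enough."""
    # get_close_matches is case-sensitive and ranks candidates by similarity ratio.
    matches = difflib.get_close_matches(raw_name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff would need tuning on real data: too low and distinct authors get merged, too high and misspellings slip through as new authors.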

Where is list information stored in Apache POI's XWPFDocument?

I want to merge two or more docx files (append them one after another) or move one part of a document (an XWPFParagraph) to another place.
The problem is that lists always break after such an operation. Say we have a list in one document that uses sequence numbers and a list in another document that uses bullets or letters. After the copy, all of the bullets become numbers (or worse, numbers that continue from where the previous list ended).
I have tried several solutions:
-traversing BodyElements and copying Paragraphs and Tables by hand like here.
-attaching a newBody into an existing one like here here
Aside from page-scoped styles they work well, but the lists never do. Does that mean the list symbols are stored as page-scoped information (otherwise they would be copied successfully with the XWPFParagraph)? If so, why and where?
I have dug into the Javadoc: https://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFDocument.html
But I couldn't find anything about lists.
The Word numberings (numbered lists, but bullet lists too) in the Office Open XML file format are stored in /word/numbering.xml of the *.docx ZIP archive. There are abstractNum elements describing the list format and num elements referencing the abstractNum. The numId of a num element is referenced in paragraphs of /word/document.xml to set which numbering format shall be used in that paragraph. Paragraphs referencing the same numId are in the same list.
Paragraphs referencing different numId are in different lists.
In Apache POI there is XWPFNumbering, representing the document part /word/numbering.xml, and XWPFAbstractNum, representing the abstractNum.
Until now there is no way to create an XWPFAbstractNum from scratch without using the low-level ooxml-schemas classes.
Also, as far as I know, there is no simple way to merge the /word/numbering.xml document parts of different Word documents, because of the need to handle the different IDs in /word/numbering.xml as well as their occurrences in /word/document.xml. This is very complex and I do not know of any free library which can do this properly.
In general, as far as I know, there is no simple way to merge different Word documents because of the complex storage in Word file formats. The approaches available in free code are at best only halfway useful (traversing and copying) and at worst wrong and useless (simply attaching multiple document bodies one after the other).
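To see the num-to-abstractNum references described above, one can inspect the archive directly. The following is a minimal Python sketch that does not use Apache POI at all; it only reads /word/numbering.xml with the standard library. The function name num_to_abstract is an assumption for this example, and a document without any lists may have no numbering.xml part, in which case archive.read raises KeyError.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by numbering.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def num_to_abstract(docx_file):
    """Map each numId to the abstractNumId it references."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/numbering.xml"))
    mapping = {}
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        # Attributes are namespace-qualified, hence the {namespace}name form.
        mapping[num.get(f"{{{W}}}numId")] = ref.get(f"{{{W}}}val")
    return mapping
```

Running this on two documents that are about to be merged shows why copying paragraphs alone breaks lists: the same numId value can point at different list definitions in each file.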

iTextSharp comparing 2 PDFs for equality

I am generating and storing PDFs in a database.
The pdf data is stored in a text field using Convert.ToBase64String(pdf.ByteArray)
If I generate the same exact PDF that already exists in the database, and compare the 2 base64strings, they are not the same. A big portion is the same, but it appears about 5-10% of the text is different each time.
What would make 2 pdfs different if both were generated using the same method?
This is a problem because I can't tell if the PDF was modified since it was last saved to the db.
Edit: The 2 pdfs visually appear exactly the same when viewing the actual pdf, but the base64string of the bytes are different
Two PDFs that look 100% the same visually can be completely different under the covers. PDF producing programs are free to write the word "hello" as a single word or as five individual letters written in any order. They are also free to draw the lines of a table first followed by the cell contents, or the cell contents first, or any combination of these such as one cell at a time.
If you are actually programmatically creating the PDFs and you create two PDFs using completely identical code, you still won't get files that are 100% identical. There are a couple of reasons for this. The most obvious is that PDFs support creation and modification dates, which will obviously change depending on when the files are created. You can override these (and confuse everyone else, so I don't recommend this) using something like this:
var info = writer.Info;
info.Put(PdfName.CREATIONDATE, new PdfDate(new DateTime(2001,01,01)));
info.Put(PdfName.MODDATE, new PdfDate(new DateTime(2001,01,01)));
However, PDFs also support a unique identifier in the trailer's /ID entry. To the best of my knowledge iText has no support for overriding this parameter. You could duplicate your PDF, change this manually and then calculate your differences and you might get closer to a comparison.
Then there's fonts. When subsetting fonts, producers create a unique internal name based on the original name and an arbitrary selection of six uppercase ASCII letters. So for the font Calibri the font's name could be JLXWHD+Calibri one time and SDGDJT+Calibri another time. iText doesn't support overriding of this because you'd probably do more harm than good. These internal names are used to avoid font subset collisions.
So the short answer is that unless you are comparing two files that are physical duplicates of each other you can't perform a direct comparison on their binary contents. The long answer is that you can tweak some of the PDF entries to remove unique parts for comparison only but you'd probably be doing more work than it would take to just re-store the file in the database.

Check if NSString contains a common first name on iPhone

I am wondering what the best approach would be to check whether or not a common first name is contained within an NSString on an iPhone app. I've got a sorted flat text file of ~5500 common American first names delimited by new lines. The NSString I am searching within for a name is not very long, most likely the size of a normal sentence.
My original plan was to load the sorted list into memory and then iterate over every word in the NSString performing a binary search of the list to determine whether or not that word was a common name.
Am I better off trying to put this name list into CoreData or a SQLite table and performing a query with that? My understanding is I would not have to load the entire list into memory if I went that route.
I am guessing this situation is a common problem with word dictionaries for word games, so I'm just wondering what the best practice is for fast lookups. Thanks!
SQLite sounds ideal for this in terms of both speed of lookup and minimising memory usage. It would also make it potentially possible to update the first name list over the internet if so desired.
Using Core Data (which is in effect an elaborate wrapper around SQLite) would be overkill in this instance, especially as you don't require the ORM-like capabilities.
An NSSet might be useful as well. Dave DeLong's answer for another question demonstrates that NSSets have constant look-up times, i.e. O(1).
Load your names into an NSMutableSet one by one. This will be the slowest part but will only need to be done once. If your file is a simple line-delimited file of names, it may be easier to use the standard C library for reading the file, since line-by-line input is not well-supported by Cocoa.
After that, simply use [nameSet containsObject:name] to check whether it is in the list.
A couple of drawbacks to this approach:
The name you want to test must be in the same case as the name in the set, that is “paul” and “Paul” are different strings. You can circumvent this by converting all names to lowercase before inserting them into the set, and then also converting the name you want to check into lowercase before checking it against the set.
It might be easier just to go with the already-accepted answer.
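For illustration, the set-based lookup with lowercase normalisation could be sketched as follows. This is Python rather than Objective-C, with a built-in set standing in for NSMutableSet; the function names load_names and find_names and the word-splitting pattern are assumptions made for the example.

```python
import re


def load_names(lines):
    """Build a lowercase set from an iterable of newline-delimited names."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_names(sentence, name_set):
    """Return the words of sentence that appear in name_set, in order."""
    # Assumption: a word is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", sentence)
    return [word for word in words if word.lower() in name_set]
```

Because load_names accepts any iterable of lines, an open file handle can be passed directly, and the set only has to be built once at startup.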