Let's say I have a huge text file which has rows in a format like
Id Name Address
All the records are sorted by Id. If I am searching for an Id, how can I make the search more efficient, either using findstr or by writing something better than findstr?
As a native application, I would not be surprised if findstr has better search performance than almost anything one could implement in PowerShell code or even a compiled .NET module. The problem with findstr is that it is oblivious to the structure of your data. That is, if you search for the record with ID 123 it will happily return records with ID 1234 or address "123 Main Street" as false positives. You could potentially use the /B or /R switches to combat this, but that still doesn't help in the case where you search for an ID that doesn't exist; findstr only stops searching when it reaches the end of the file.
Your ability to perform an optimized search depends on the specific format of the text file. If lines are fixed-length, meaning you can instantly seek to the $nth line by simply calculating $n * $lineLength, then you could quickly search the file for an ID using a binary search.
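For illustration, a minimal sketch of that fixed-length binary search in Python (the record layout - a leading Id field - is assumed from the question):

def find_by_id(path, target_id, line_length):
    with open(path, "rb") as f:
        f.seek(0, 2)                          # jump to end to get file size
        lo, hi = 0, f.tell() // line_length - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * line_length)         # O(1) seek to line `mid`
            line = f.read(line_length).decode("ascii")
            rec_id = int(line.split()[0])     # Id is the first field
            if rec_id == target_id:
                return line.rstrip()
            elif rec_id < target_id:
                lo = mid + 1
            else:
                hi = mid - 1
    return None                               # Id not present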
If lines are variable-length, then there's really no simple way to efficiently search the file other than line-by-line. Even if you've read enough of a line to know the ID doesn't match, you still need to read the rest of the line to know where the next line begins. At best, since the lines are sorted by ID you know that if you encounter a line with an ID greater than the one you're searching for you can abort the search immediately because that ID won't be found.
In the past I have been able to employ a binary search on text files with variable-length lines (fixed-size characters would be very helpful, too, if not required). The key is, for each iteration of the search, to calculate your next offset; if it happens to land on the beginning of a line, great; if not, seek backwards until you can identify the character that begins the line (e.g. the one preceded by a CrLf). Once you've positioned yourself on the start of a line, you can read the ID and determine whether it's a match or in which direction the next iteration of the search needs to look.
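A sketch of that scan-backwards approach, again in Python for illustration; it assumes "\n" (or "\r\n") line endings and a leading integer Id:

def find_by_id_variable(path, target_id):
    with open(path, "rb") as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()          # search window measured in bytes
        while lo < hi:
            mid = (lo + hi) // 2
            while mid > 0:            # back up to the start of a line
                f.seek(mid - 1)
                if f.read(1) == b"\n":
                    break
                mid -= 1
            f.seek(mid)
            line = f.readline()
            rec_id = int(line.split()[0].decode())
            if rec_id == target_id:
                return line.decode().rstrip()
            elif rec_id < target_id:
                lo = mid + len(line)  # continue after this line
            else:
                hi = mid              # continue before this line
    return None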
It's definitely not a quick and simple solution (to write), but, depending on how huge "huge" is, it could yield significant results when searching your file. Although, at that point, it might be better to invest your development time in changing to a more search-friendly way of storing your data, if at all possible.
In the Apache access logs I found the following code as a query string (GET), submitted multiple times each second for quite a while from one IP:
**/OR/**/ASCII(SUBSTRING((SELECT/**/COALESCE(CAST(LENGTH(rn)/**/AS/**/VARCHAR(10000))::text,(CHR(32)))/**/FROM/**/"public".belegtable/**/ORDER/**/BY/**/lv/**/OFFSET/**/1492/**/LIMIT/**/1)::text/**/FROM/**/1/**/FOR/**/1))>9
What does it mean?
Is this an attempt to break in via injection?
I have never seen such a statement and I don't understand its meaning. PostgreSQL is used on the server.
rn and belegtable exist. Some of the other attempts contain other existing fields/tables. Since the application is very custom, I don't know how strangers could have learned the names of existing SQL fields. Very weird.
**/
OR ASCII(
SUBSTRING(
( SELECT COALESCE(
CAST(LENGTH(rn) AS VARCHAR(10000))::text,
(CHR(32))
)
FROM "public".belegtable
ORDER BY lv
OFFSET 1492
LIMIT 1
)::text
FROM 1
FOR 1
)
) > 9
Is this an attempt to break in via injection?
The query in question does not show many of the characteristics of an attempted SQL injection.
An SQL injection typically involves inserting an unwanted action into some section of a bigger query, disguised as a single value. Typically the injected part tries to guess what comes before it, neutralise it, do something malicious, and protect the entire query from syntax errors by also neutralising what comes after the injected piece, which might not be visible to the attacker.
I don't see anything that could work as an escape sequence at the beginning, or anything that would neutralise the rest of the query coming after the injection. What this query does also isn't malicious. An SQL injection would attempt to extract some additional information about the database security, structure and configuration, or - if the attacker had already gathered enough data - it would try to steal the data, encrypt it or otherwise tamper with it, depending on the aim and strategy of the attacker as well as the type of data found in the database. There also wouldn't be much point in looping it like that.
As for the looping part: if someone attempted to put load on the database - as in a DDoS - you'd likely see more than one node doing that, and probably in a more elaborate and better disguised manner, using different and more demanding queries sent at different frequencies.
What does it mean?
It's likely someone's buggy code stuck in an unterminated loop, judging by the LIMIT and OFFSET mechanism, which I've seen used for looping through a set of records by taking one at a time (LIMIT 1) and incrementing which one to get next (OFFSET n). The whole expression always returns true because ASCII() returns the character code of the first character in the string. That string is either the text representation of some number (the LENGTH() of a field) or, via the COALESCE fallback, a space ' ' with ASCII code 32. Since all ASCII digits have codes between 48 and 57, the expression always ends up comparing a number well above 9 to 9, so the check always succeeds.
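A quick illustration of that arithmetic, with Python standing in for the SQL; ASCII(SUBSTRING(x FROM 1 FOR 1)) is just the character code of the first character of x:

def predicate(value):
    # mimics ASCII(SUBSTRING(value FROM 1 FOR 1)) > 9
    return ord(value[0]) > 9

print(predicate(str(1492)))  # first char '1' -> code 49 -> True
print(predicate(" "))        # CHR(32) fallback -> code 32 -> True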
The author of that code might not have anticipated that the loop could run indefinitely, and might have misinterpreted what some of the functions used in that query do. Regardless of what really happened, I think it was a good idea to cut off that IP and avoid needless stress on the database. Double-checking your security setup is always a good idea, but I wouldn't call this an attempted attack. At least not this query alone, as it might be a harmless piece of a bigger, more malicious operation - but that could be said about any query.
We regularly need to perform a handful of relatively simple tests against a bunch of MS Word documents. As these checks are currently done manually, I am striving for a way to automate this. For example:
Check if every page actually has a page number and verify that it is correct.
Verify that a version identifier in the page header is identical across all pages.
Check if the document has a table of contents.
Check if the document has a table of figures.
Check if every figure has a caption.
et cetera. Is this reasonably feasible using PowerShell in conjunction with a Word API?
PowerShell can access Word via its object model/Interop (on Windows, at any rate) and, as I understand it, can also work with the Office Open XML (OOXML) API, so you should be able to write any checks you want on the document content. What is slightly less obvious is how you verify that the document content will result in a particular "printed appearance". I'll start with some comments on the details.
Bear in mind that in the following notes I'm just pointing out a few things that you might have to deal with. If you're examining documents produced by an organisation where people are already broadly following the same standards, it may be easier.
Of the five examples you give, I couldn't say exactly how you would do each one without checking the details, and there could be difficulties with all of them, but for example:
Check if every page actually has a page number and verify that it is correct.
Difficult using either OOXML or the object model, because what you would really be checking is that the header for a particular section has a visible { PAGE } field code. Because that field code might be nested inside other fields that conditionally hide it (an { IF } field, say), it's not so easy to be sure that a page number would actually appear.
Which is what I mean by checking the document's "printed appearance" - if, for example, you can use the object model to print to PDF and have some mechanism that lets PS inspect the PDF's content, that might be a better approach.
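For illustration, a minimal sketch of the header check through the object model, here driven from Python with the pywin32 package (the question asks about PowerShell; the same COM calls exist there). The file path is hypothetical, and as noted above this only detects the presence of a { PAGE } field, not whether it would actually be displayed:

import win32com.client

wdHeaderFooterPrimary = 1   # Word object model constants
wdFieldPage = 33

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(r"C:\docs\report.docx")  # hypothetical path

for section in doc.Sections:
    header = section.Headers(wdHeaderFooterPrimary)
    has_page_field = any(f.Type == wdFieldPage
                         for f in header.Range.Fields)
    print("Section %d: PAGE field present: %s"
          % (section.Index, has_page_field))

doc.Close(False)
word.Quit()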
Verify that a version identifier in the page header is identical across all pages.
Similar problem to the above, IMO. It depends partly on how the version identifier might be inserted. Is it just a piece of text? Could it be constructed from a number of fields? Might it reference Document Properties or Variables, or Custom XML content?
Check if the document has a table of contents.
Perhaps enough to look for a TOC field that does not have certain options, such as a \c option that a Table of Figures would contain.
Check if the document has a table of figures.
Perhaps enough to check for a TOC field that does have a \c option, possibly with a specific parameter such as "Figure".
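A sketch of that field-code test, with the same pywin32/object-model setup as above (wdFieldTOC is the object model's constant 13):

wdFieldTOC = 13  # Word object model constant for TOC fields

def classify_toc_fields(doc):
    # doc is a Word Document COM object, e.g. from the sketch above
    for field in doc.Fields:
        if field.Type == wdFieldTOC:
            code = field.Code.Text            # e.g. ' TOC \o "1-3" \h '
            if "\\c" in code:
                print("Table of figures:", code.strip())
            else:
                print("Table of contents:", code.strip())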
Check if every figure has a caption.
Not sure that you can tell whether a particular image is "a Figure". But if you mean "verify that every graphic object has a caption", you could probably iterate through the inline and floating graphics in the document and verify that there was something that looked like a Word standard caption paragraph within a certain distance of that object. Word has two standard field code patterns for captions AFAIK (one where the chapter number is included and one where it isn't), so you could look for those. You could measure a distance between the image and the caption by ensuring that they were no more than a predefined number of paragraphs apart, or in the case of a floating image, perhaps that the paragraph anchoring the image was no more than so many paragraphs away from the caption.
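And a rough sketch of the caption check for inline graphics only, assuming captions use Word's built-in "Caption" style and sit in the paragraph immediately after the image; floating shapes, custom caption styles, and captions a few paragraphs away would all need extra handling:

def inline_figures_missing_captions(doc):
    missing = []
    for shape in doc.InlineShapes:
        para = shape.Range.Paragraphs(1)   # paragraph holding the image
        following = para.Next()            # None at the end of the document
        has_caption = (following is not None
                       and following.Style.NameLocal == "Caption")
        if not has_caption:
            missing.append(shape)
    return missing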
A couple of more general problems that you might have to deal with:
- just because a document contains a certain feature, such as a ToC field, does not mean that it is visible. A TOC field might have been formatted as hidden. Even harder to detect, it could have been formatted as colored white.
- change tracking. You might have to use the Word object model to "accept changes" before checking whether any given feature is actually there or not. Unless you can find existing code that would help you do that using the OOXML representation of the document, that's probably a strong case for doing checks via the object model.
Some final observations
- for future checks, it is perhaps worth noting that in principle you could create a "DocumentInspector" that users could call from Word's Backstage view to perform checks on a document. I'm not sure you can force users to run it, or that you could create it in PS, but it could be a useful tool.
- longer term, if you are doing a very large number of checks, it is perhaps worth considering whether you could train an ML model to try to detect problems.
In our web application we display a list of pulses, but for linking and such we make every pulse uniquely available. In our CouchDB we give every pulse a unique id by md5'ing its unique attributes, e.g.: www.foo.com/bar/
These md5 sums, though, are extremely long and make for ugly URLs. Is there another way to hash the attributes that requires fewer characters but still guarantees uniqueness?
Thanks a lot
Instead of creating an ugly md5, you could use a method like this to create a random string of a given length from a set of allowed characters, and store it alongside the md5; the 'pretty URL' string is then what you use for retrieving the data from the database. One thing to consider is taking the vowels out of the possible characters, as with them you could end up with bad words :) Also, make sure the string does not already exist in the database, and if it does, just create another one; that won't happen very often, though.
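A minimal sketch of that suggestion in Python: generate a short slug from a vowel-free alphabet and retry on collision. The slug_exists helper is hypothetical; replace it with whatever uniqueness lookup your database supports.

import secrets

ALPHABET = "bcdfghjklmnpqrstvwxyz0123456789"  # vowels removed to avoid bad words

def new_slug(length=8, slug_exists=lambda s: False):
    # keep drawing until we find a slug not already in the database
    while True:
        slug = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if not slug_exists(slug):
            return slug

print(new_slug())  # e.g. 'x7kq2m9p'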
Context:
I'm creating a program which will sort and rename my media files, which are named e.g. The.Office.s04e03.DIVX.WaREZKiNG.avi, into an organized folder structure: a folder for each TV series, each containing a folder for every season, which in turn contain the media files.
The problem:
I am unsure what the best method is for reading a file name and determining what part of that name is the TV show. For example, in "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to keep a list of all TV shows and check whether each one is a substring of the file name, but as far as I know this means I have to check every single series against the name of every file.
My question: How should I determine if a string contains one of many other strings?
Thanks
The Aho-Corasick algorithm[1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
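The extractor idea might look something like this sketch in Python; the pattern below only handles the common sNNeNN convention, and a real parser would need additional patterns and a show-name database:

import re

# series name, then sNNeNN, then whatever release tags follow
PATTERN = re.compile(
    r"^(?P<series>.+?)[. _-]+s(?P<season>\d{1,2})e(?P<episode>\d{1,2})",
    re.IGNORECASE,
)

def extract(filename):
    m = PATTERN.match(filename)
    if m is None:
        return None
    series = re.sub(r"[. _-]+", " ", m.group("series")).strip()
    return series, int(m.group("season")), int(m.group("episode"))

print(extract("The.Office.s04e03.DIVX.WaREZKiNG.avi"))
# -> ('The Office', 4, 3)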
It depends on the overall structure of the filenames. For instance, is the series name always first? If so, a tree structure works well. Is there a standard marker between words (a period, in your example)? If so, you can split the string on those and create a case-insensitive hashtable of interesting words to boost performance.
However, extracting seasons and episodes becomes more difficult. A simple solution would be to implement an algorithm to handle each format you uncover, although by using hints you could create an interesting parser if you wanted to. (Likely overkill, however.)
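As a sketch of the split-plus-hashtable idea (Python here, with a stand-in show list), assuming the series name comes first and words are separated by periods:

KNOWN_SERIES = {"the office", "breaking bad", "doctor who"}  # your show list

def find_series(filename):
    tokens = filename.lower().split(".")
    candidate = ""
    for token in tokens:
        candidate = (candidate + " " + token).strip()
        if candidate in KNOWN_SERIES:  # O(1) case-insensitive lookup
            return candidate
    return None

print(find_series("The.Office.s04e03.DIVX.WaREZKiNG.avi"))  # 'the office'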
What's the secret to instantly pulling up items that match characters typed into the search bar? For instance, if I type the letter "W" into the search bar, all phrases that contain the letter "W" in any character position within the phrase are returned immediately.
So if a database of 20,000 phrases contains 500 phrases with the letter "W", they would appear as soon as the user typed the first character. Then, as additional characters are typed, the list would automatically get shorter.
I can send queries to a SQL server from the iPhone and get this type of response; however, no matter what we try, even taking the suggestions of other users, we still can't get good response times when the database is stored locally on the iPhone.
I know that this performance is available, because there are many other apps out there that display results as soon as you start typing.
Please note that this isn't the same as indexing all words in every phrase, as that only brings up matches where a word starts with the typed character. In this case, we're looking for characters anywhere within words.
I think asynchronous results filtering is the answer. Instead of updating the search results every time the user types a new character, put the db query on a background thread when the first character is typed. If a new character is typed before the query is finished, cancel the old query and start a new one. Finally, you will get to the point where the user stops typing long enough for the query to return. That way, the query itself never blocks the user's typing.
I believe the UISearchDisplayController class offers this type of asynchronous search, though whether you want to use that class or just adopt the asynchronous design pattern from it is up to you.
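The pattern itself isn't iPhone-specific; here is a minimal sketch of it in Python, with a generation counter standing in for real query cancellation:

import threading

class AsyncSearch:
    def __init__(self, run_query, show_results):
        self.run_query = run_query        # the slow part, e.g. a db query
        self.show_results = show_results  # UI callback for fresh results
        self._lock = threading.Lock()
        self._generation = 0

    def text_changed(self, text):
        with self._lock:
            self._generation += 1         # invalidate any in-flight query
            generation = self._generation
        threading.Thread(target=self._search,
                         args=(text, generation), daemon=True).start()

    def _search(self, text, generation):
        results = self.run_query(text)
        with self._lock:
            if generation == self._generation:  # ignore stale results
                self.show_results(results)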
If you're willing to get away from the database for this, you could use a generalized suffix tree with all the terms in your phrases. You can build a suffix tree in linear time and, I believe, use it to find all occurrences of a substring very quickly. The web has lots of pages about suffix trees and suffix arrays; Wikipedia is probably a good place to start.
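A generalized suffix tree is non-trivial to implement; a simpler cousin, the suffix array, captures the same idea in a short sketch (Python here for illustration). Memory grows with the total number of characters, so test whether it fits your 20,000 phrases:

from bisect import bisect_left

def build_index(phrases):
    # every suffix of every phrase, tagged with the phrase it came from
    index = []
    for pid, phrase in enumerate(phrases):
        lowered = phrase.lower()
        for i in range(len(lowered)):
            index.append((lowered[i:], pid))
    index.sort()
    return index

def search(index, phrases, query):
    q = query.lower()
    hits = set()
    i = bisect_left(index, (q,))          # first suffix >= the query
    while i < len(index) and index[i][0].startswith(q):
        hits.add(index[i][1])
        i += 1
    return [phrases[pid] for pid in sorted(hits)]

phrases = ["Walk the dog", "Saw wood", "No match here"]
index = build_index(phrases)
print(search(index, phrases, "w"))  # ['Walk the dog', 'Saw wood']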
I have a fun scheme for you. You can build an index of the characters that exist in each phrase via a 32-bit integer. Flip the bits [0-25] to represent the characters (case-insensitive) a-z that exist in the phrase. Build a second bitmap of the query string. Now you can do comparisons via bitwise operations (& and |) to determine matches. This is very fast and believe it or not SQLite actually supports bitwise operations in queries - so you can even use this scheme to go straight to the database. I have working code that does this built into one of our iPhone applications - Alphagram.
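A sketch of that scheme in Python: one bit per letter, and a phrase survives the pre-filter only if it contains every letter of the query. Survivors still need a real substring check, since the mask ignores order and repetition; in SQLite the same test can be written as (mask & ?) = ? in the WHERE clause.

def letter_mask(text):
    # one bit per letter a-z, case-insensitive; other characters ignored
    mask = 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

phrases = ["Walk the dog", "Saw wood", "No match here"]
masks = [letter_mask(p) for p in phrases]

def candidates(query):
    qm = letter_mask(query)
    # the mask test is a cheap pre-filter; the substring test confirms
    return [p for p, m in zip(phrases, masks)
            if (m & qm) == qm and query.lower() in p.lower()]

print(candidates("w"))  # ['Walk the dog', 'Saw wood']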