I'm looking for a practical way to perform complex operations on text files.
From time to time I need to develop a whole application (usually in C++ or C#/.NET) just to tweak text configuration files (.ini, .xml, .txt, etc.).
Today I need to modify a .txt file that follows a well-known pattern of assigning values to variables. I need to change the value of one specific variable (which appears many times in the file) by multiplying it by a constant. I first thought of using Notepad++ with a regex backreference, but according to this thread: How to do a calculation using regex backreference in notepadpp? it seems to be impossible.
Just as I was about to start developing another heavy desktop tool to accomplish this trivial task, I wondered whether this is really how everyone smarter than me does this kind of thing. I thought there might be a Notepad++ plugin that allows complex operations on text using some kind of scripting language, but I couldn't find one.
Thanks in advance.
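For illustration only, here is a minimal sketch of the task described above as a short script instead of a desktop tool. It assumes the file uses lines of the form `name = number`; the function name `scale_variable` and the sample variable `speed` are hypothetical and would need adapting to the real file format.

```python
import re

def scale_variable(text, name, factor):
    """Multiply every numeric value assigned to `name` by `factor`."""
    # Group 1 keeps "name = " (with its original spacing); group 2 is the number.
    pattern = re.compile(
        r"^(\s*" + re.escape(name) + r"\s*=\s*)(-?\d+(?:\.\d+)?)",
        re.MULTILINE,
    )

    def replace(match):
        scaled = float(match.group(2)) * factor
        # "g" drops a trailing ".0" but keeps only 6 significant digits.
        return match.group(1) + format(scaled, "g")

    return pattern.sub(replace, text)

sample = "speed = 2.5\nname = demo\nspeed = 10\n"
print(scale_variable(sample, "speed", 3))
```

The key point is that `re.sub` accepts a function as the replacement, so the arithmetic that a plain backreference cannot do happens in ordinary code.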
My question is partly linguistic, but closely related to programming (of almost anything, web pages or otherwise).
I would like to know why the word "refactor" was chosen for changing a program or part of one, when another word would probably describe the change more exactly.
IDEs (for example NetBeans or Eclipse) use this word only for renaming some part of the chosen program (project), including moving a file to another place (which, from the point of view of any OS, is probably just a rename).
But renaming is not about changing a factor (a factor is something that does not change when it is renamed).
Closer to the meaning of the word "refactor" (as changing a factor) is manually rewriting some part so that the rewritten part works differently internally (but not what the program does as seen from outside, as described in the topic What is refactoring and what is only modifying code?).
The word "Refactoring" is derived from mathematics where you find an equivalent expression by applying factoring again. The equivalent expression does not change the final outcome but it is much easier to understand, use, or reuse.
There are many refactoring techniques and renaming is one of them. Other techniques include extract method, extract class, move method, move class, pull/push method to super/sub-class and many more.
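As a small illustration of the analogy: in mathematics, ab + ac can be factored into a(b + c) without changing its value. The sketch below shows the "extract method" technique in the same spirit; the function names are made up for this example.

```python
# Before: one function both computes the total and formats it.
def receipt_before(prices, tax_rate):
    total = 0.0
    for price in prices:
        total += price
    total += total * tax_rate
    return "Total: {:.2f}".format(total)

# After "extract method": the computation gets its own name and can be reused.
def total_with_tax(prices, tax_rate):
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)

def receipt_after(prices, tax_rate):
    return "Total: {:.2f}".format(total_with_tax(prices, tax_rate))
```

Both versions return the same string for the same input; only the structure of the code has changed.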
I have a lot of HTML files (10,000's and GBs worth) scraped from a server and I want to check to make sure the server produces the same results after some modifications but ignore kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of number, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
It doesn't run under Linux, but I assume your problem is hard enough to solve that this isn't what you are optimizing for.
I use WinMerge a lot on Windows, and from what I can see some people like Meld on Linux, so perhaps that could do the trick for you:
http://meld.sourceforge.net/
Other examples I found with a quick search were Kompare, xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(I could only post one link, so I wrote the addresses of xxdiff and kdiff3 as text.)
Beyond Compare is commercial software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive; check out the trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!
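If a script fits better than a GUI, the same idea of declaring unimportant differences with regular expressions can be sketched as a normalization step before comparing. This is only a sketch: the timestamp format and the function names are assumptions, and the real rules would depend on what the server actually emits.

```python
import re

# Assumed timestamp shape, e.g. "2010-01-02 03:04:05" or "2010-01-02T03:04:05".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
WHITESPACE = re.compile(r"\s+")
BETWEEN_TAGS = re.compile(r">\s+<")

def normalize(html):
    """Replace differences that don't matter with fixed placeholders."""
    html = TIMESTAMP.sub("TIMESTAMP", html)
    html = WHITESPACE.sub(" ", html).strip()   # runs of whitespace become one space
    return BETWEEN_TAGS.sub("><", html)        # ignore whitespace between tags

def same_after_normalizing(old_html, new_html):
    return normalize(old_html) == normalize(new_html)
```

Files that still differ after normalization can then be handed to one of the diff tools mentioned above for inspection.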
If you port code over from one language to another, how can this be detected?
Say you were porting code from C++ to Java, how could you tell?
What would be the difference between a program designed and implemented in Java, and a near identical program ported over to Java?
If the porting is done properly (by people expert in both languages and ready to translate the source language's idioms into the best similar idioms of the target language), there's no way you can tell that any porting has taken place.
If the porting is done incompetently, you can sometimes recognize goofily-transliterated idioms... but that can be hard to distinguish from people writing a new program in a language they barely know and just goofily transliterating the idioms from the language they do know ;-).
Depending on how much effort was put into hiding the porting, it could be anywhere from very easy to impossible to detect.
I would use pattern recognition for this task. Think about the "features" which would indicate code similarities. Extract these features from each code base and compare them.
e.g:
One feature could be similar symbol names. Extract all symbols using ctags or regular expressions, convert them to lower case, sort and de-duplicate both lists, and compare them.
Another possible feature:
List of class + number of members e.g:
MyClass1 10
...
List of methods + sequence of control blocks, e.g.:
doSth() if, while, if, ix, case
...
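The symbol-name feature could be sketched roughly as follows. The identifier regex and the Jaccard-style score are assumptions chosen for illustration (the regex also picks up language keywords, which a real comparison would probably filter out).

```python
import re

# Rough identifier pattern that fits both C++ and Java source text.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def symbol_set(source):
    """Lower-cased, de-duplicated symbol names found in the source text."""
    return {name.lower() for name in IDENTIFIER.findall(source)}

def symbol_similarity(source_a, source_b):
    """Shared symbols divided by all symbols (1.0 means identical sets)."""
    a, b = symbol_set(source_a), symbol_set(source_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A score close to 1.0 between a C++ file and a Java file would be one hint, though not proof, that one was derived from the other.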
Another easy way is to represent the code as a picture, e.g. load the code as text in Word and set the font size to 1. Human beings are very good at comparing pictures. For other ideas on code visualization you may check http://www.se-radio.net/2009/03/episode-130-code-visualization-with-michele-lanza/
What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface, which you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the docx format.
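For .docx specifically, a converter is not strictly required: the file is a ZIP archive whose main text lives in word/document.xml. The following is a rough sketch using only the Python standard library; the function name is hypothetical, and it ignores headers, footnotes, tabs and line breaks, so it will not match Word's count exactly.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_counts(path_or_file):
    """Return (word_count, char_count) for the main body of a .docx file.

    Raises zipfile.BadZipFile, KeyError or ET.ParseError when the file cannot
    be read, so a failed count can be caught by the caller.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    parts = []
    for element in root.iter():          # document order
        if element.tag == W_NS + "p":    # paragraph start acts as a separator
            parts.append("\n")
        elif element.tag == W_NS + "t":  # run of literal text
            parts.append(element.text or "")
    text = "".join(parts)
    words = len(text.split())
    chars = len(re.sub(r"\s", "", text))  # characters excluding whitespace
    return words, chars
```

Counting characters this way works for non-Latin scripts too, while the whitespace-based word count is only meaningful for scripts that separate words with spaces.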
Mac OS X has support for reading word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably, though it gives up support for docx, you could simply get Antiword and do antiword file.doc | wc -w.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications as a solution, yes, there is a solution.
This site, which is not so pretty design-wise, offers both word and character counts: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few short samples (e.g. 20-50 words each) by counting them yourself first.
I hope this helps.
Regards. Chris
For my app I need to check whether a URL is matched by a regex string, so I created an array with all the regex strings (1000+ strings) and check them using RegexKitLite:
for (NSString * aString in mainDelegate.whiteListArray) {
if (![urlString isMatchedByRegex:aString]) {
It works, but sadly this operation takes a very long time: at least 20 seconds for a web page like google.com.
I've tried using the "normal" RegexKit.framework, because it has a method called (BOOL)isMatchedByAnyRegexInArray:(NSArray *)regexArray which is much faster. I can build the app, but whenever I try to launch it, it crashes with the following error:
dyld: Library not loaded: @executable_path/../Frameworks/RegexKit.framework/Versions/A/RegexKit
Referenced from: /Users/Reilly/Library/Application Support/iPhone Simulator/User/Applications/7E057EA8-5CD1-465B-8102-38A53A9B5F5B/Drowser.app/Drowser
Reason: image not found
I guess it's because RegexKit is not meant for ARM? (To include RegexKit I followed the how-to that comes with the documentation.)
So my questions are:
Do you know of any faster way to check whether a string is matched by any of 1000 regexes?
Or do you know how to use the "normal" RegexKit on the iPhone, or any other regex framework that would do what I need in under a second?
Thanks in advance.
Note: I am the author of RegexKit et al.
This is a fairly complicated answer.. :)
First, matching a thousand regexes with any of the commonly available regex engine implementations is going to be fairly slow, save for perhaps the TCL and TRE regex engines. The reason RegexKit.framework greatly outperforms RegexKitLite for this task is that RegexKit.framework has quite a bit of non-trivial, optimized code for just this task. That code exists because it's used in Safari AdBlock, which needs to perform bulk matches of regexes against URLs. It keeps the list of regexes in sorted order, based on the number of times they made a successful match. This is based on the observation that some regex patterns used in Safari AdBlock match much more frequently than others, and trying those first dramatically reduces the number of regexes that need to be tried to determine if there's a 'hit'. There is also a small negative hit cache, along with a lot of multithreading code to do the matches in parallel. None of this will ever make it into the Lite version, as it is definitely not a lightweight feature: there's probably 60-70KB of code just to implement this one feature, not to mention the huge memory footprint of keeping a thousand compiled regexes around.
Using RegexKitLite to do this kind of pattern matching is bound to be very, very slow. The first problem is that it only keeps a small cache of compiled regexes that have recently been used. By default, the cache is set to just 23, so tossing a thousand regexes at it is going to cause every regex to be compiled each time it's used.
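To make the hit-count ordering and the negative cache concrete, here is a language-neutral sketch of the idea in Python. It is not RegexKit.framework's actual code; the class name `BulkMatcher`, the `max_misses` parameter, and the crude cache eviction are assumptions for illustration, and the multithreading is left out.

```python
import re

class BulkMatcher:
    """Hypothetical helper: compile once, try frequent winners first."""

    def __init__(self, patterns, max_misses=256):
        # Each entry is [hit_count, compiled_regex].
        self._entries = [[0, re.compile(p)] for p in patterns]
        self._misses = set()            # small negative cache of known non-matches
        self._max_misses = max_misses

    def matches_any(self, url):
        if url in self._misses:
            return False
        for entry in self._entries:
            if entry[1].search(url):
                entry[0] += 1
                # Keep the most successful patterns at the front of the list.
                self._entries.sort(key=lambda e: e[0], reverse=True)
                return True
        if len(self._misses) >= self._max_misses:
            self._misses.clear()        # crude eviction keeps the cache small
        self._misses.add(url)
        return False
```

Every pattern is compiled exactly once up front, which also avoids the repeated recompilation that a small compiled-regex cache would cause with a thousand patterns.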
As others have pointed out, RegexKit.framework isn't really set up to be used on the iPhone. Even if you got around the "linking to external frameworks" provision, the default build of RegexKit.framework does not include the arm architecture in its fat binary (it includes ppc, ppc64, i386, and x86_64). What you really need to do is set up a new build target that creates a static library. Not terribly hard to do, really.
I'm afraid that if this kind of pattern matching is something you need to do, you're probably going to have to roll your own regex engine. What you need is a regex engine that can take your thousand regexes and concatenate them together, such as "r1|r2|r3|r4". Most regex engines, and in particular pcre and ICU (the ones used by RegexKit.framework and RegexKitLite, respectively), evaluate such a regex in an almost left-to-right manner. What's needed is an almost DFA-like engine that evaluates all possible states concurrently. See this link for more information. I've built such a regex engine, one that even handles back-references (much easier to do than everyone says) in ~O(M*log2(N)) time (M being the size of the text to match, N being the size of the regex), but it's not finished. If it were, it would cut through this kind of problem like a plasma torch through butter.
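The concatenation step itself is simple to sketch. Note that Python's `re` module is also a backtracking engine, so the example below only illustrates building the "r1|r2|r3|r4" pattern, not the DFA-like evaluation described above; the sample patterns and the `combine` helper are made up.

```python
import re

def combine(patterns):
    """Join patterns into one alternation, wrapping each in a non-capturing
    group so that a '|' inside one pattern cannot leak into its neighbours."""
    return re.compile("|".join("(?:%s)" % p for p in patterns))

# Hypothetical whitelist entries.
whitelist = combine([
    r"^https?://(www\.)?google\.com/",
    r"^https?://example\.org/",
])

print(bool(whitelist.search("http://www.google.com/search")))
```

One caveat with this approach: numbered back-references inside the individual patterns would shift once the patterns are joined, so they would need renumbering or named groups.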
I am aware of at least one person porting RegexKit.framework to the iPhone, though: Mobile Safari AdBlock. AFAIK, it's also a port of the desktop version of Safari AdBlock. I don't know many details, but I think it requires a jail-broken iPhone to install.
In summary, I don't think there are any turn-key solutions available for iPhone development that do anything close to what you need. Your best bet, other than creating your own regex engine, is to look into the TRE regex engine and try some experiments using concatenated regexes. Be prepared to roll up your sleeves, though, as you're going to have to get your hands very dirty and deal with the guts of Cocoa's strings, Unicode encodings, and all kinds of other unpleasant stuff, the kind of stuff that RegexKitLite takes care of for you behind the scenes.
Are you copying RegexKit.framework into the frameworks folder of your iPhone app?
The iPhone does not support embedded frameworks, so the directions there would not work even if it were built for arm. You can only use things that are statically linked, so you will either need to modify RegexKit to build as a static library, or include its source code directly in your project.