How to find $plusargs with same string in different locations - system-verilog

This is a very general issue when integrating a large verification environment.
Our verification development involves a large group spread across different time zones.
The group prefers to use $plusargs instead of the factory mechanism.
Probably the main reason is that it is hard to set up the factory from the command line,
since we have several layers of scripts to start a simulation.
Recently I found that the same string had been used in different environments to control their behavior. In this case, two different scoreboards used the same string to disable some checking, and the test passed. Both of those environments are sometimes created at run time. Also, it is sometimes OK to reuse the same string, so the owner will need to be involved.
Is there any way to find duplication like this in the final elaborated model and report the code locations as a warning?
I thought about creating our own wrapper, but the problem is that we integrate some code that we do not own, as was the case here.
Thanks,

This is a perfect example of how people think they can get things done quicker by not following the recommended UVM methodology, and instead create time-consuming complexity later on.
I see at least two possible options.
Write a script that searches the source code for $plusargs; hopefully string literals were used, so that you can trace duplicates.
You can override $plusargs with PLI code and have it trace duplicates.
The choice depends on whether you are better at writing Perl/Python or C code.
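For the first option, a minimal sketch in Python might look like the following. It assumes the plusargs are passed as string literals to $test$plusargs or $value$plusargs; the function names, the regex, and the *.sv / *.svh file patterns are illustrative choices, not part of any tool. It does not skip commented-out code, and it cannot see strings that are built at run time.

```python
import re
from collections import defaultdict

# Matches $test$plusargs("NAME") and $value$plusargs("NAME=%d", var).
# Only string-literal arguments can be traced this way.
PLUSARG_RE = re.compile(r'\$(?:test|value)\$plusargs\s*\(\s*"([^"=%]+)')


def find_plusargs(sources):
    """Map each plusarg name to the (file, line) locations that use it.

    `sources` is a dict of {filename: file text}.
    """
    locations = defaultdict(list)
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in PLUSARG_RE.finditer(line):
                locations[match.group(1)].append((filename, lineno))
    return locations


def find_duplicates(sources):
    """Return only the plusargs that appear in more than one file."""
    return {
        name: locs
        for name, locs in find_plusargs(sources).items()
        if len({filename for filename, _ in locs}) > 1
    }


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage (hypothetical script name): python find_plusargs.py <dir> [<dir> ...]
    files = {}
    for root in sys.argv[1:]:
        for pattern in ("*.sv", "*.svh"):
            for path in Path(root).rglob(pattern):
                files[str(path)] = path.read_text(errors="ignore")
    for name, locs in find_duplicates(files).items():
        print(f"warning: +{name} is used in several files:")
        for filename, lineno in locs:
            print(f"  {filename}:{lineno}")
```

Since reusing a string is sometimes legitimate, the output is only a list of candidates for the owners to review, not a list of errors.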

How to get nunit filters at runtime?

Does anybody know how to get list of categories (provided with 'where' filter to nunit-console) at runtime?
Depending on this, I need to differently initialize the test assembly.
Is there something static like TestExecutionContext that may contain such information?
The engine doesn't pass information on to the framework about "why" it's running a particular test... i.e. if it's running all tests or if it was selected by name or category. That's deliberately kept as something the test doesn't know about with the underlying philosophy being that tests should just run based on the data provided to them.
On some platforms, it's possible to get the command line that ran the test. With that info you could decode the various options and draw some conclusions, but it seems as if it would be easier to restructure the tests so they didn't need this information.
As a secondary reason, it would also be somewhat complicated to supply the info you want and to use it. A test may have multiple categories. Imagine a test selected because two categories matched, for example!
Is it possible that what you really want to do is to pass some parameters to your tests? There is a facility for doing that of course.
I think this is a bit of an XY problem. Depending on what you are actually trying to accomplish, the best approach is likely to be different. Can you edit to tell us what you are trying to do?
UPDATE:
Based on your comment, I gather that some of your initialization is both time-consuming and not needed unless certain tests are run.
Two approaches to this (or combine them):
1. Do less work in all your initialization (i.e. TestCase, TestCaseSource, SetUpFixture). It's generally best not to create your classes under test or initialize databases. Instead, simply leave strings, ints, etc., which allow the actual test to do the work IFF it is run.
2. Use a SetUpFixture in some namespace containing all the tests that require that particular initialization. If you don't run any tests from that namespace, then the initialization won't be done.
Of course both of the above may entail a large refactoring of your tests, but the legacy app won't have to be changed.

Make A step in a SAS macro timeout after a set interval

I'm on SAS 9.1.3 (on a server) and have a macro looping over an array to feed a computationally intensive set of modelling steps which are appended out to a table. I'm wondering if it is possible to set a maximum time to run for each element of the array. This is so that any element which takes longer than 3 minutes to run is skipped and the next item fed in.
Say for example I'm using a proc nlin with a by statement to build separate models per class on a large data set, and one class is failing to converge; how do I skip over that class?
Bit of a niche requirement, hope someone can assist!
The only approach I can think of here would be to rewrite your code so that it runs each by group separately from the rest, in one or more SAS/CONNECT sessions, have the parent session kill each one after a set timeout, and then recombine the surviving output.
As Dom and Joe have pointed out, this is not a trivial task, but it's possible if you're sufficiently keen on learning about that aspect of SAS. A good place to get started for this sort of thing would be this page:
http://support.sas.com/rnd/scalability/tricks/connect.html
I was able to use the examples there and elsewhere as the basis of a simple parallel processing framework (in SAS 9.1.3, coincidentally!), but there are many details you will need to consider. To give you an idea of the sorts of adventures in store if you go down this route:
Learning how to sign on to your server via SAS/CONNECT within whatever infrastructure you're using (will the usual autoexec file work? What invocation options do you need to use?)
Explaining to your sysadmin/colleagues why you need to run multiple processes in parallel
Managing asynchronous sessions
Syncing macro variables, macro definitions, libraries and formats between sessions
Obscure bugs (I wasn't able to use the usual option for syncing libraries and had to roll my own via call execute...)
One could write a (lengthy) SUGI paper on this topic, and I'm sure there are plenty of them out there if you look around.
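The control flow described in this answer (run each unit of work in its own process, kill it after a timeout, keep the survivors) can be sketched outside SAS. The following is Python, not SAS/CONNECT, and only illustrates the pattern; the function name and the idea of one command per BY group are assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(jobs, timeout_s):
    """Run each job as a separate OS process; skip any that exceed timeout_s.

    `jobs` maps a label (e.g. a BY-group value) to a command list.
    Returns (finished, skipped) lists of labels.
    """
    finished, skipped = [], []
    for label, cmd in jobs.items():
        try:
            # subprocess.run kills the child process if the timeout expires.
            subprocess.run(cmd, timeout=timeout_s, check=True)
            finished.append(label)
        except subprocess.TimeoutExpired:
            skipped.append(label)   # took too long: move on to the next job
        except subprocess.CalledProcessError:
            skipped.append(label)   # failed outright: also skip it
    return finished, skipped


if __name__ == "__main__":
    # Stand-in commands; real jobs would launch one batch run per group.
    demo = {
        "fast": [sys.executable, "-c", "pass"],
        "slow": [sys.executable, "-c", "import time; time.sleep(30)"],
    }
    print(run_with_timeout(demo, timeout_s=5))
```

Only the output of the labels in `finished` would then be recombined.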
In general, SAS runs in a linear manner, so you cannot write a step to monitor another step in the same program. What you could do is run your code in a SAS/CONNECT session and monitor it with the process that started the session. That's not trivial, and the how-to is beyond the scope of Stack Overflow.
For a data step, use the datetime() function to get the current system date and time. This is measured in seconds. You can check the time inside your data step. Stop a data step with the stop; statement.
Now you specifically asked about breaking a specific step inside a PROC. That must be implemented in the PROC by the SAS developer. If it is possible, it will be documented in the procedure's documentation. View SAS documentation at http://support.sas.com/documentation/.
For PROC NLIN, I do not think there is a "break after X" parameter. You can use the trace parameters to track model execution and see what is hanging up. You can then work on changing the convergence parameters to try to speed up slow, badly converging models.

How to tag a scientific data processing tool to ensure repeatability

We develop a data processing tool to extract some scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations that led to a result set.
Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set and how to find the corresponding source from which the tool was built.
The tool is written in C++ and Python; gluing together the C++ parts using Boost::Python. We use CMake as a build system generating Makefiles for Linux. Currently the project is stored in a Subversion repo, but some of us already use git or hg, and we are planning to migrate the whole project to one of them in the very near future.
What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?
Ideas we are already discussing:
Somehow injecting the global revision number
Using a build number generator
Storing the whole sourcecode inside the executable itself
This is a problem I spend a fair amount of time working on. To what @VonC has already written let me add a few thoughts.
I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. However, this general approach is often lacking in scientific data processing environments many of which either remain in, or have grown out of, academia. However, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.
I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of sources were used in each executable and version. These ought to make including the source code in an executable unnecessary.
The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:
We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files, in our case we're looking at HDF5 and XML, in which both the data of interest and the meta-data is stored. The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.
We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.
We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately some of our computations take 1000s of CPU-hours to produce so it is infeasible to reproduce them ab-initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.
We are also making it much harder (I'd like to think impossible but am not sure we are there yet) for users to edit result sets directly. Once someone does that all the provenance information in the world is wrong and useless.
Finally, if you want to read more about the topic, try Googling for 'scientific workflow', 'data provenance' and similar topics.
EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
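As a rough illustration of a program writing its own identification into its output, here is a minimal Python sketch. The tool name, the revision string and the JSON layout are hypothetical; in practice the revision would be injected by the build or by the version control system's keyword expansion, and the container might be HDF5 or XML rather than JSON.

```python
import hashlib
import json

TOOL_NAME = "example-tool"   # hypothetical name
TOOL_REVISION = "r1234"      # hypothetical; would be injected at build time


def with_provenance(results, input_bytes, parameters):
    """Wrap a result set with the meta-data needed to reproduce it."""
    return {
        "metadata": {
            "tool": TOOL_NAME,
            "revision": TOOL_REVISION,
            # A hash identifies the exact input data set that was used.
            "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
            "parameters": parameters,
        },
        "results": results,
    }


def to_json(record):
    """Serialize the record as structured text instead of free-form output."""
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    record = with_provenance([1.0, 2.5], b"raw sensor data", {"threshold": 0.5})
    print(to_json(record))
```

Given such a file, the revision field points back to the sources, and the input hash and parameters say what the tool was run on.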
You need to consider git submodules or hg subrepos.
The best practice in this scenario is to have a parent repo which will reference:
the sources of the tool
the result set generated from that tool
ideally the c++ compiler (won't evolve every day)
ideally the python distribution (won't evolve every day)
Each of those is a component, that is, an independent repository (Git or Mercurial).
One precise revision of each component will be referenced by a parent repository.
The whole process is representative of a component-based approach, and is key to using an SCM (here Software Configuration Management) to its fullest.

machine learning and code generator from strings

The problem: Given a set of hand categorized strings (or a set of ordered vectors of strings) generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
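A minimal sketch of such a routing table in Python, assuming each message has already been reduced to an error type; the error types, team names and the "manual-triage" fallback are made up for illustration.

```python
# Hypothetical fallback recipient for anything not yet in the table.
MANUAL_QUEUE = "manual-triage"


def route(error_type, routes):
    """Return who should be notified, or the manual queue if unknown."""
    return routes.get(error_type, MANUAL_QUEUE)


def learn(error_type, recipient, routes):
    """Record a manual routing decision so it is automatic next time."""
    routes[error_type] = recipient


if __name__ == "__main__":
    # Start with only the most common error types.
    routes = {"DISK_FULL": "storage-team", "AUTH_FAILED": "security-team"}
    print(route("DISK_FULL", routes))      # known type, routed directly
    print(route("LINK_DOWN", routes))      # unknown type, goes to a person
    learn("LINK_DOWN", "network-team", routes)
    print(route("LINK_DOWN", routes))      # now handled automatically
```

The table stays easy to audit, and it grows only when a person has actually made a routing decision.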
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.

Do you create your own code generators?

The Pragmatic Programmer advocates the use of code generators.
Do you create code generators on your projects? If yes, what do you use them for?
In "Pragmatic Programmer" Hunt and Thomas distinguish between Passive and Active code generators.
Passive generators are run-once, after which you edit the result.
Active generators are run as often as desired, and you should never edit the result because it will be replaced.
IMO, the latter are much more valuable because they approach the DRY (don't-repeat-yourself) principle.
If the input information to your program can be split into two parts, the part that changes seldom (A) (like metadata or a DSL), and the part that is different each time the program is run (B)(the live input), you can write a generator program that takes only A as input, and writes out an ad-hoc program that only takes B as input.
(Another name for this is partial evaluation.)
The generator program is simpler because it only has to wade through input A, not A and B. Also, it does not have to be fast because it is not run often, and it doesn't have to care about memory leaks.
The ad-hoc program is faster because it's not having to wade through input that is almost always the same (A). It is simpler because it only has to make decisions about input B, not A and B.
It's a good idea for the generated ad-hoc program to be quite readable, so you can more easily find any errors in it. Once you get the errors removed from the generator, they are gone forever.
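To make the A/B split concrete, here is a small sketch in Python. The field list plays the role of input A, and each text line is input B; the function name generate_parser and the comma-separated record format are assumptions chosen for the example, not anything from the project described in this answer.

```python
def generate_parser(fields):
    """Emit source for a parse_record(line) function specialised to `fields`.

    `fields` is the rarely changing input (A): a list of (name, type_name)
    pairs. The generated function only ever sees the live input (B).
    """
    lines = [
        "# GENERATED CODE - DO NOT EDIT; change the field list and regenerate.",
        "def parse_record(line):",
        "    parts = line.split(',')",
        "    return {",
    ]
    for index, (name, type_name) in enumerate(fields):
        # All decisions about A are made here, once, at generation time.
        lines.append(f"        {name!r}: {type_name}(parts[{index}]),")
    lines.append("    }")
    return "\n".join(lines) + "\n"


# Generate the ad-hoc program and load it.
source = generate_parser([("id", "int"), ("name", "str"), ("score", "float")])
namespace = {}
exec(source, namespace)
parse_record = namespace["parse_record"]

if __name__ == "__main__":
    print(source)                      # readable, so errors are easy to find
    print(parse_record("7,Ada,9.5"))
```

The generated function contains no loop over the field list and no type lookups; that work was done once by the generator.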
In one project I worked on, a team designed a complex database application with a design spec two inches thick and a lengthy implementation schedule, fraught with concerns about performance. By writing a code generator, two people did the job in three months, and the source code listings (in C) were about a half-inch thick, and the generated code was so fast as to not be an issue. The ad-hoc program was regenerated weekly, at trivial cost.
So active code generation, when you can use it, is a win-win. And, I think it's no accident that this is exactly what compilers do.
Code generators, if used widely without proper justification, make code less understandable and decrease maintainability (the same goes for dynamic SQL, by the way). Personally, I use them with some ORM tools, because their usage there is mostly obvious, and sometimes for things like search/parser algorithms and grammar analyzers, which are not designed to be maintained by hand later. Cheers.
In hardware design, it's fairly common practice to do this at several levels of the 'stack'. For instance, I wrote a code generator to emit Verilog for various widths, topologies, and structures of DMA engines and crossbar switches, because the constructs needed to express this parameterization weren't yet mature in the synthesis and simulation tool flows.
It's also routine to emit logical models all the way down to layout data for very regular things that can be expressed and generated algorithmically, like SRAM, cache, and register file structures.
I also spent a fair bit of time writing, essentially, a code generator that would take an XML description of all the registers on a System-on-Chip, and emit HTML (yes, yes, I know about XSLT, I just found emitting it programmatically to be more time-effective), Verilog, SystemVerilog, C, Assembly etc. "views" of that data for different teams (front-end and back-end ASIC design, firmware, documentation, etc.) to use (and keep them consistent by virtue of this single XML "codebase"). Does that count?
People also like to write code generators for e.g. taking terse descriptions of very common things, like finite state machines, and mechanically outputting more verbose imperative language code to implement them efficiently (e.g. transition tables and traversal code).
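The register-description idea above can be sketched in a few lines of Python. The XML schema, register names and output naming conventions here are invented for the example and are not the format the answer's author used; the point is only that several consistent "views" come from one description.

```python
import xml.etree.ElementTree as ET

# Hypothetical register description: the single source of truth.
REGISTERS_XML = """<registers base="0x40000000">
  <register name="CTRL" offset="0x00"/>
  <register name="STATUS" offset="0x04"/>
</registers>"""


def emit_c_header(xml_text):
    """Emit a C view: one absolute-address #define per register."""
    root = ET.fromstring(xml_text)
    base = int(root.get("base"), 16)
    lines = ["/* Generated from the register XML. DO NOT EDIT. */"]
    for reg in root.findall("register"):
        name = reg.get("name")
        addr = base + int(reg.get("offset"), 16)
        lines.append(f"#define REG_{name} 0x{addr:08X}u")
    return "\n".join(lines) + "\n"


def emit_verilog_params(xml_text):
    """Emit a Verilog view: one offset localparam per register."""
    root = ET.fromstring(xml_text)
    lines = ["// Generated from the register XML. DO NOT EDIT."]
    for reg in root.findall("register"):
        name = reg.get("name")
        offset = int(reg.get("offset"), 16)
        lines.append(f"localparam REG_{name}_OFFSET = 32'h{offset:08X};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(emit_c_header(REGISTERS_XML))
    print(emit_verilog_params(REGISTERS_XML))
```

Adding a register to the XML updates every view on the next run, which is what keeps firmware, design and documentation consistent.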
We use code generators for generating data entity classes, database objects (like triggers, stored procs), service proxies etc. Anywhere you see a lot of repetitive code following a pattern and a lot of manual work involved, code generators can help. But you should not use them so much that maintainability becomes a pain. Some issues also arise if you want to regenerate the code.
Tools like Visual Studio and CodeSmith have their own templates for most of the common tasks and make this process easier. But it is also easy to roll your own.
It is often useful to create a code generator that generates code from a specification - usually one that has regular tabular rules. It reduces the chance of introducing an error via a typo or omission.
Yes,
I developed my own code generator for the AAA protocol Diameter (RFC 3588).
It could generate structures and APIs for Diameter messages, reading from an XML file that described the Diameter application's grammar.
That greatly reduced the time to develop a complete Diameter interface (such as SH/CX/RO etc.).
In my opinion, a good programming language would not need code generators, because introspection and runtime code generation would be part of the language, e.g. metaclasses and the new module in Python.
Code generators usually produce more unmanageable code in long-term usage.
However, if it is absolutely imperative to use a code generator (Eclipse VE for Swing development is what I use at times), then make sure you know what code is being generated. Believe me, you wouldn't want code in your application that you are not familiar with.
Writing your own generator for a project is not efficient. Instead, use a generator such as T4, CodeSmith or Zontroy.
T4 is more complex and you need to know a .NET programming language. You have to write your template line by line and you have to handle relational data operations on your own. You can use it from Visual Studio.
CodeSmith is a functional tool and there are plenty of templates ready to use. It is based on T4, and writing your own template takes as much time as it does in T4. There is a trial and a commercial version.
Zontroy is a new tool with a user-friendly interface. It has its own template language and is easy to learn. There is an online template market, and it is growing. You can even publish templates and sell them on the market.
It has a free and a commercial version. Even the free version is enough to complete a medium-scale project.
There might be a lot of code generators out there; however, I always create my own to make the code more understandable and to suit the frameworks and guidelines we are using.
We use a generator for all new code to help ensure that coding standards are followed.
We recently replaced our in-house C++ generator with CodeSmith. We still have to create the templates for the tool, but it seems ideal to not have to maintain the tool ourselves.
My most recent need for a generator was a project that read data from hardware and ultimately posted it to a 'dashboard' UI. In between were models, properties, presenters, events, interfaces, flags, etc. for several data points. I worked up the framework for a couple of data points until I was satisfied that I could live with the design. Then, with the help of some carefully placed comments, I put the "generation" in a Visual Studio macro, tweaked and cleaned the macro, added the data points to a function in the macro to call the generation, and saved several tedious hours (days?) in the end.
Don't underestimate the power of macros :)
I am also now trying to get my head around CodeRush customization capabilities to help me with some more local generation requirements. There is powerful stuff in there if you need on-the-fly decision making when generating a code block.
I have my own code generator that I run against SQL tables. It generates the SQL procedures to access the data, the data access layer and the business logic. It has done wonders in standardising my code and naming conventions. Because it expects certain fields in the database tables (such as an id column and updated datetime column) it has also helped standardise my data design.
How many are you looking for? I've created two major ones and numerous minor ones. The first of the major ones allowed me to generate 1500-line programs (give or take) that had a strong family resemblance but were attuned to the different tables in a database, and to do that fast and reliably.
The downside of a code generator is that if there's a bug in the code generated (because the template contains a bug), then there's a lot of fixing to do.
However, for languages or systems where there is a lot of near-repetitious coding to be done, a good (enough) code generator is a boon (and more of a boon than a 'doggle').
In embedded systems, sometimes you need a big block of binary data in the flash. For example, I have one that takes a text file containing bitmap font glyphs and turns it into a .cc/.h file pair declaring interesting constants (such as first character, last character, character width and height) and then the actual data as a large static const uint8_t[].
Trying to do such a thing in C++ itself, so the font data would auto-generate on compilation without a first pass, would be a pain and most likely illegible. Writing a .o file by hand is out of the question. So is breaking out graph paper, hand encoding to binary, and typing all that in.
IMHO, this kind of thing is what code generators are for. Never forget that the computer works for you, not the other way around.
BTW, if you use a generator, always always always include some lines such as this at both the start and end of each generated file:
// This code was automatically generated from Font_foo.txt. DO NOT EDIT THIS FILE.
// If there's a bug, fix the font text file or the generator program, not this file.
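As a rough sketch of this kind of generator, the Python below packs glyph rows into bytes and wraps the result in that banner at both the start and the end of the file. The input format ('#' for a set pixel, at most 8 pixels per row), the constant names and the function name are assumptions made for the example, not the author's actual tool.

```python
def emit_font_header(source_name, first_char, width, height, glyph_rows):
    """Return the text of a generated header holding packed font data.

    `glyph_rows` is a list of strings such as "#.#."; each row is packed
    into one byte, leftmost pixel in the most significant bit.
    """
    banner = [
        f"// This code was automatically generated from {source_name}. "
        "DO NOT EDIT THIS FILE.",
        "// If there's a bug, fix the font text file or the generator "
        "program, not this file.",
    ]
    data = []
    for row in glyph_rows:
        byte = 0
        for bit, pixel in enumerate(row[:8]):
            if pixel == "#":
                byte |= 0x80 >> bit
        data.append(f"0x{byte:02X}")
    body = [
        "#include <stdint.h>",
        f"static const uint8_t kFontFirstChar = {first_char};",
        f"static const uint8_t kFontWidth = {width};",
        f"static const uint8_t kFontHeight = {height};",
        f"static const uint8_t kFontData[] = {{ {', '.join(data)} }};",
    ]
    # The banner goes at both the start and the end, as recommended above.
    return "\n".join(banner + body + banner) + "\n"


if __name__ == "__main__":
    print(emit_font_header("Font_foo.txt", 65, 4, 2, ["#.#.", "...."]))
```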
Yes I've had to maintain a few. CORBA or some other object communication style of interface is probably the general thing that I think of first. You have object definitions that are provided to you by the interface you are going to talk over but you still have to build those objects up in code. Building and running a code generator is a fairly routine way of doing that. This can become a fairly lengthy compile just to support some legacy communication channel, and since there is a large tendency to put wrappers around CORBA to make it simpler, well things just get worse.
In general, if you have a large number of structures, or just rapidly changing structures that you need to use, but you can't handle the performance hit of building objects through metadata, then you're into writing a code generator.
I can't think of any projects where we needed to create our own code generators from scratch but there are several where we used preexisting generators. (I have used both Antlr and the Eclipse Modeling Framework for building parsers and models in java for enterprise software.) The beauty of using a code generator that someone else has written is that the authors tend to be experts in that area and have solved problems that I didn't even know existed yet. This saves me time and frustration.
So even though I might be able to write code that solves the problem at hand, I can generate the code a lot faster and there is a good chance that it will be less buggy than anything I write.
If you're not going to write the code, are you going to be comfortable with someone else's generated code?
Is it cheaper in both time and $$$ in the long run to write your own code or code generator?
I wrote a code generator that would build hundreds of classes (Java) that would output XML data from a database in a DTD- or schema-compliant manner. The code generation was generally a one-time thing, and the code would then be smartened up with various business rules etc. The output was for a rather pedantic bank.
Code generators are a workaround for programming language limitations. I personally prefer reflection to code generators, but I agree that code generators are more flexible and the resulting code is obviously faster at runtime. I hope future versions of C# will include some kind of DSL environment.
The only code generators that I use are webservice parsers. I personally stay away from code generators because of the maintenance problems for new employees or a separate team after hand off.
I write my own code generators, mainly in T-SQL, which are called during the build process.
Based on meta-model data, they generate triggers, logging, C# const declarations, INSERT/UPDATE statements, data model information to check whether the app is running on the expected database schema.
I still need to write a forms generator for increased productivity, more specs and less coding ;)
I've created a few code generators. I had a passive code generator for SQL stored procedures which used templates. This generated 90% of our stored procedures.
Since we made the switch to Entity Framework I've created an active code generator using T4 (Text Template Transformation Toolkit) inside Visual Studio. I've used it to create basic repository partial classes for our entities. Works very nicely and saves a bunch of coding. I also use T4 for decorating the entity classes with certain attributes.
I use code generation features provided by EMF - Eclipse Modeling Framework.
Code generators are really useful in many cases, especially when mapping from one format to another. I've done code generators for IDL to C++, database tables to OO types, and marshalling code just to name a few.
I think the point the authors are trying to make is that if you're a developer you should be able to make the computer work for you. Generating code is just one obvious task to automate.
I once worked with a guy who insisted that he would do our IDL to C++ mapping manually. In the beginning of the project he was able to keep up, because the rest of us were trying to figure out what to do, but eventually he became a bottleneck. I did a code generator in Perl and then we could pretty much do his "work" in a few minutes.
See our "universal" code generator based on program transformations.
I'm the architect and a key implementer.
It is worth noting that a significant fraction of this generator is generated using this generator.
We use the Telosys code generator in our projects: http://www.telosys.org/
We created it to reduce development time for recurring tasks like CRUD screens, documentation, etc.
For us the most important thing is to be able to customize the generator's templates, in order to create new generation targets if necessary and to adapt existing templates. That's why we have also created a template editor (for Velocity .vm files).
It works fine for Java/Spring/AngularJS code generation and can be adapted for other targets (PHP, C#, Python, etc.).