Usage of CPD (Copy Paste Detector) - cpd

test.c
void fun(){
    printf("int main char");
}
int main()
{
    printf("int main int");
}
I'm running the command like this: run.sh cpd --minimum-tokens 5 --files /opt/test.c --language c, and the output is as follows:
Found a 2 line (5 tokens) duplication in the following files:
Starting at line 1 of /opt/test.c
Starting at line 5 of /opt/test.c
void fun(){
printf("int main char ");
Even though there is no real duplication, it reports the code as duplicated because of the low minimum-tokens value.
Is there any way to specify the command without the minimum-tokens flag?
void fun(){
    printf("int main int");
}
int main(){
    printf("int main int");
}
I specified the command like this: run.sh cpd --minimum-tokens 9 --files /opt/test.c --language c, and the output is as follows:
Added /opt/test.c <--- (No line duplication reported by tool)
This is because I specified a token value greater than the required token value, which is 8. In this case, even though there is duplicated code, the tool does not report any duplication.
So how do I decide on the minimum token size in such a scenario in order to get correct duplication results?


Can the equivalent of Tcl "chan push" be implemented in C code?

I have an embedded Tcl interpreter and want to redirect its stderr and stdout to a console widget in the application.
Using a chan push command for stderr seems to work (not much testing yet), as explained here:
TCL: Redirect output of proc to a file
I could have a file with the required Tcl namespace definition, etc., and do a Tcl_Eval to source that script after creating an interp with Tcl_CreateInterp.
Can I do the same thing using Tcl C library calls instead of running the Tcl commands via a Tcl_Eval?
To implement a channel transformation in C, you first have to define a Tcl_ChannelType structure. Such a structure specifies a name for the transformation and pointers to functions for the different operations that may be done on a channel. Next, you implement the functions that perform those operations. The most important ones are inputProc and outputProc. You also have to implement a watchProc. The pointers for other operations can be set to NULL, if you don't need them.
For your example it may look something like:
static const Tcl_ChannelType colorChannelType = {
    "color",
    TCL_CHANNEL_VERSION_5,
    NULL,
    ColorTransformInput,
    ColorTransformOutput,
    NULL, /* seekProc */
    NULL, /* setOptionProc */
    NULL, /* getOptionProc */
    ColorTransformWatch,
    NULL, /* getHandleProc */
    NULL, /* close2Proc */
    NULL, /* blockModeProc */
    NULL, /* flushProc */
    NULL, /* handlerProc */
    NULL, /* wideSeekProc */
    NULL,
    NULL
};
Then, when you want to push the transformation onto a channel:
chan = Tcl_StackChannel(interp, &colorChannelType, clientData,
                        Tcl_GetChannelMode(channel), channel);
For a complete example from the Tcl sources, see tclZlib.c.
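To give an idea of what one of those driver functions might look like, here is a minimal sketch of an outputProc that simply forwards the data to the channel below it. The ColorTransformState struct and its parent field are hypothetical; a real transformation would rewrite the buffer before passing it on:

#include <tcl.h>

/* Hypothetical per-instance state; this is what you would pass as the
 * clientData argument of Tcl_StackChannel. */
typedef struct ColorTransformState {
    Tcl_Channel parent;    /* the channel this transformation is stacked on */
} ColorTransformState;

/* Minimal outputProc sketch: pass the bytes through unchanged. */
static int
ColorTransformOutput(ClientData instanceData, const char *buf,
                     int toWrite, int *errorCodePtr)
{
    ColorTransformState *state = (ColorTransformState *) instanceData;
    int written = Tcl_WriteRaw(state->parent, buf, toWrite);

    if (written < 0) {
        *errorCodePtr = Tcl_GetErrno();
        return -1;
    }
    return written;
}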
Not really an answer to my question, but maybe it will help someone to see what works: using a Tcl_Eval to run the Tcl code that does the redirection.
proc redir_stdout {whichChan args} {
    switch -- [lindex $args 0] {
        initialize {
            return {initialize write finalize}
        }
        write {
            ::HT_puts $whichChan [lindex $args 2]
        }
        finalize {
        }
    }
}
chan push stderr [list redir_stdout 1]
chan push stdout [list redir_stdout 2]
Both chan push commands use the same proc, but pass a different identifier (1 or 2) to indicate whether stderr or stdout was the originator of the output.
HT_puts is an extension provided by the C code:
Tcl_CreateObjCommand(interp,"HT_puts",putsCmd,(ClientData) NULL,NULL);
int TclInterp::putsCmd(ClientData, Tcl_Interp *interp, int objcnt, Tcl_Obj *CONST *objv)
{
    if (objcnt != 3)
        return TCL_ERROR;
    int length;
    int whichChan;
    Tcl_GetIntFromObj(interp, objv[1], &whichChan);
    //qDebug() << "Channel is $whichChan";
    QString out = Tcl_GetStringFromObj(objv[2], &length);
    QColor textColor;
    if (whichChan == 1)
        textColor = QColor(Qt::red);
    else
        textColor = QColor(Qt::white);
    console->putData(out.toUtf8(), textColor);
    //qDebug() << out;
    return TCL_OK;
}
Text forwarded from stderr gets colored red and text from stdout gets colored white.
And, as I mentioned above, each subsequent command that gets executed via Tcl_Eval needs to have the Tcl_Eval return value processed something like this:
if (rtn != TCL_OK)
{
    QString output = Tcl_GetVar(interp, "errorInfo", TCL_GLOBAL_ONLY);
    console->putData(output.toUtf8(), QColor(Qt::red));
    //qDebug("Failed Tcl_Eval: %d \n%s\n", rtn,
}
This gets what tclsh would normally print to stderr on a TCL_ERROR into the console (instead of the app's stderr).
I was planning to do the equivalent in C to eliminate the need to run Tcl code in the interpreter for the redirect. But, really there's no need for that.
The Tcl_Eval that does the redirection is done right after doing the Tcl_CreateInterp. Any subsequent Tcl_Evals using that interp will have stdout and stderr redirected to my application's console.
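For reference, a rough sketch of that setup order in the C code might look like this; redirScript is a hypothetical string holding the redir_stdout proc and the two chan push commands shown above:

/* Rough sketch of the setup order described above; redirScript is a
 * hypothetical string containing the redir_stdout proc and the two
 * "chan push" commands. */
Tcl_Interp *interp = Tcl_CreateInterp();
Tcl_CreateObjCommand(interp, "HT_puts", putsCmd, (ClientData) NULL, NULL);

if (Tcl_Eval(interp, redirScript) != TCL_OK) {
    /* report the error, e.g. via errorInfo as shown above */
}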
Besides, I'm having trouble understanding how to use Tcl_StackChannel and can't find an example I can follow.
Honestly, can't say that I completely understand the Tcl implementation. I made some assumptions on what gets passed to the proc used in the "chan push" command based on the referenced thread.
It looks like the proc is called with the list specified in the chan push command AND an args list. The first element of the args list is a name like "write" or "initialize". The third element looks like the string to be printed.
Still trying to find a definition of what's passed without having to dig into something like namespace ensemble.
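For what it's worth, my assumption, based on the transchan API (chan push invokes the command prefix as cmdPrefix operation handle ?arg?), is that the calls the proc receives look roughly like this:

# Assumed call shapes for the handler registered with "chan push stderr [list redir_stdout 1]":
#
#   redir_stdout 1 initialize $handle $mode    ;# [lindex $args 0] -> "initialize"
#   redir_stdout 1 write $handle $data         ;# [lindex $args 2] -> the text to print
#   redir_stdout 1 finalize $handle            ;# [lindex $args 0] -> "finalize"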
So, it's likely that this Tcl code isn't the best implementation but it's working so far (with limited testing).

Why does vscode's "Run Doctest" helper filter all of my crate's Doctests?

Expected Behavior:
Clicking "Run doctest" in vscode should execute one test from doctest snippets.
Terminal output SHOULD say ("1 passed;" or "1 failed;"), and "1 filtered out;".
Actual Behavior:
Clicking "Run doctest" in vscode executes 0 tests, and shows that 2 were filtered out.
Terminal output:
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s
Source code:
My misbehaving crate: https://github.com/darrow-olykos/rusty-gists/blob/main/crates/my_cache/src/lib.rs
Behaving crate (where this is not an issue): https://github.com/darrow-olykos/rusty-gists/blob/main/crates/math/src/lib.rs
My machine:
macOS 12
rustc 1.57.0
rust analyzer v0.3.954
What I have done to try to narrow down the scope of the problem:
Running the "same" command in the terminal demonstrates expected behavior. The terminal output shows test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.40s when I run cargo test --doc --package my_cache -- "Cacher<T>::new" --nocapture, which is exactly what the terminal says is ran when I click on "Run Doctest".
Clicking "Run Doctest" in another crate I have, (called "math") in the same repo, demonstrates expected behavior.
Looking at the differences between my misbehaving crate and my working crate:
A. The misbehaving crate has its doctest inside an impl, whereas the other crate's doctest is at the root level of the file.
B. This misbehaving crate's doctest is for a generic struct that accepts a closure type.
C. Executing cargo test from crates/my_cache demonstrates expected behavior, with the following terminal output:
// ... some output omitted
Doc-tests my_cache
running 2 tests
test src/lib.rs - Cacher<T>::new (line 26) ... ok
test src/lib.rs - Cacher<T>::value (line 42) ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.44s
Again, here is the source code:
Misbehaving crate: https://github.com/darrow-olykos/rusty-gists/blob/main/crates/my_cache/src/lib.rs
Behaving crate: https://github.com/darrow-olykos/rusty-gists/blob/main/crates/math/src/lib.rs
Maybe? notable details:
I have modeled my git repo after the structure of https://github.com/rust-analyzer/rust-analyzer, such that I'm using Cargo workspaces and crates/* are members. I have a root Cargo.toml which can specify local dependencies with something like my_cache = { path = "crates/my_cache" }; however, I cannot think of a reason why this would be a contributing factor, because I've proven that my math crate can live in this structure without VSCode getting confused and filtering out its doctests unexpectedly.
My suspicions?:
Something is happening that is causing the doctest to be filtered out when it should not be filtered out.
Maybe the command that claims to be executing when I click Run Doctest isn't the ACTUAL command getting executed.
Maybe the bug has something to do with the closure type. I forget where I read this, but I vaguely recall that Rust closure types are "unnamed" in a way that makes referring to them strange. Unfortunately I cannot find the resource that walked through this in detail (it might have been one covering how the Rust compiler represents data types in memory, but I do not recall the details; maybe someone reading this will know what I'm referring to).
Here is the misbehaving crate's source code, copied here for the sake of longevity in case I make changes to my GitHub repo:
// Credit to: https://doc.rust-lang.org/book/ch13-01-closures.html
// (I've modified their example to use a HashMap instead of a single value)
use std::collections::HashMap;

/// cacher for calculation with two u64's as input and u64 as output
/// can be generalized more
pub struct Cacher<T>
where
    T: FnMut(u64, u64) -> u64,
{
    calculation: T,
    values: HashMap<String, u64>,
}

impl<T> Cacher<T>
where
    T: FnMut(u64, u64) -> u64,
{
    /// Returns a Cacher<T> which can cache results of calculations for the provided closure.
    ///
    /// # Arguments
    ///
    /// `T` - Closure that computes a value. The value is cached based on the args. The cached value is returned on subsequent calls if the args are the same.
    ///
    /// # Examples
    /// ```rust
    /// use my_cache::Cacher;
    /// let mut cacher = Cacher::new(|x, y| x + y);
    /// ```
    pub fn new(calculation: T) -> Self {
        let values = HashMap::new();
        Cacher {
            calculation,
            values,
        }
    }

    /// Returns the value of calculation `T`. The cached value is returned if the `n`, `k` pair was seen before; otherwise the calculation runs and the result is then cached.
    ///
    /// # Examples
    ///
    /// ```rust
    /// use std::rc::Rc;
    /// use std::cell::{RefCell, RefMut};
    ///
    /// use my_cache::Cacher;
    ///
    /// let mut count = Rc::new(RefCell::new(0));
    /// let add = |x, y| {
    ///     let mut count_mut_ref = count.borrow_mut();
    ///     *count_mut_ref += 1;
    ///     x + y
    /// };
    /// let mut cacher = Cacher::new(add);
    ///
    /// assert_eq!(*count.borrow(), 0);
    /// assert_eq!(cacher.value(2, 3), 5); // new calculation, count += 1
    /// assert_eq!(*count.borrow(), 1);
    /// assert_eq!(cacher.value(2, 3), 5); // repeat, use cache
    /// assert_eq!(*count.borrow(), 1);
    /// assert_eq!(cacher.value(2, 4), 6); // new calculation, count += 1
    /// assert_eq!(*count.borrow(), 2);
    /// ```
    pub fn value(&mut self, n: u64, k: u64) -> u64 {
        let key = n.to_string() + &k.to_string();
        let cached_result = self.values.get(&key);
        if let Some(value) = cached_result {
            *value
        } else {
            let v = (self.calculation)(n, k);
            self.values.insert(key, v);
            v
        }
    }
}
This is a bug in a pre-release version of VSCode's rust-analyzer extension (v0.3.954) and can be mitigated by switching back to the latest release version of rust-analyzer.
I realized I should check this once I had posted my question, and confirmed that the unexpected behavior is only present in v0.3.954; the latest release of the rust-analyzer VSCode extension, v0.2.948, works as expected (it does not filter out the doctest unexpectedly).

reading files and folders in order with apache beam

I have a folder structure of the type year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, and so on, up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder.
Is it possible to do this with apache beam?
What I would do is add timestamps to each element according to the file path. As a test I used the following example.
First of all, as explained in this answer, you can use FileIO to continuously match a file pattern. This helps because, per your use case, once you have finished the backfill you want to keep reading newly arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will be like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:
p
    .apply(FileIO.match()
        .filepattern(inputPath)
        .continuously(
            // Check for new files every minute
            Duration.standardMinutes(1),
            // Never stop checking for new files
            Watch.Growth.<String>never()))
    .apply(FileIO.readMatches())
Watch frequency and timeout can be adjusted at will.
Then, in the next step we'll receive the matched file. I will use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it down to the hour and do not account for timezone corrections here. With readFullyAsUTF8String we'll read the whole file (be careful if the whole file does not fit into memory; it is recommended to shard your input if needed) and split it into lines. With ProcessContext.outputWithTimestamp we'll emit downstream a KV of filename and line (the filename is not needed anymore but it helps to see where each file comes from) and the timestamp derived from the path. Note that we're shifting timestamps "back in time", so this can mess with the watermark heuristics and you will get a message such as:
Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
To overcome this I set getAllowedTimestampSkew to Long.MAX_VALUE but take into account that this is deprecated. ParDo code:
.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
String lines[];
String[] dateFields = fileName.split("/");
Integer numElements = dateFields.length;
String hour = dateFields[numElements - 2];
String day = dateFields[numElements - 3];
String month = dateFields[numElements - 4];
String year = dateFields[numElements - 5];
String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
Log.info(ts);
try{
lines = file.readFullyAsUTF8String().split("\n");
for (String line : lines) {
c.outputWithTimestamp(KV.of(fileName, line), new Instant(dateTimeFormat.parseMillis(ts)));
}
}
catch(IOException e){
Log.info("failed");
}
}}))
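Note that the dateTimeFormat helper used in outputWithTimestamp above is not shown in the snippet; a minimal definition, assuming Joda-Time (which Beam already depends on) and the "yyyy-MM-dd HH:mm:ss" strings built from the path, could be:

import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// Parses the "yyyy-MM-dd HH:mm:ss" strings built from the file path
// into epoch millis for outputWithTimestamp.
static final DateTimeFormatter dateTimeFormat =
    DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").withZoneUTC();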
Finally, I window into 1-hour FixedWindows and log the results:
.apply(Window
    .<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO))

.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        String file = c.element().getKey();
        String value = c.element().getValue();
        String eventTime = c.timestamp().toString();

        String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s",
            file, value, eventTime, window.toString());
        Log.info(logString);
    }
}));
For me it worked with .withAllowedLateness(Duration.ZERO), but depending on the order in which files arrive you might need to set it higher. Keep in mind that too high a value will cause windows to stay open longer and use more persistent storage.
I set the $BUCKET and $PROJECT variables and upload two files:
gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/
And run the job with:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
-Dexec.args="--project=$PROJECT \
--path=gs://$BUCKET/data/** \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
Results:
Full code
Let me know how this works. This was just an example to get started; you might need to adjust the windowing and triggering strategies, lateness, etc. to suit your use case.

Why isn't my LLVM alias analysis pass being called?

I'm attempting to do some alias analysis & other memory inspection. I've written a pointless AliasAnalysis pass (that says everything must alias) to attempt to verify that my pass is getting picked up & run by opt.
I run opt with: opt -load ~/Applications/llvm/lib/MustAA.so -must-aa -aa-eval -debug < trace0.ll -debug-pass=Structure
I see my pass being initialized, but never being called (I see only may alias results).
Any ideas as to what to do to debug this? Or what I'm missing? I've read through http://llvm.org/docs/AliasAnalysis.html and don't see anything that I'm missing.
Here's the full source code of my pass:
#define DEBUG_TYPE "must-aa"
#include "llvm/Pass.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Support/Debug.h"

using namespace llvm;

namespace {
  struct EverythingMustAlias : public ImmutablePass, public AliasAnalysis {
    static char ID;
    EverythingMustAlias() : ImmutablePass(ID) {}

    virtual void *getAdjustedAnalysisPointer(AnalysisID ID) {
      errs() << "called getAdjustedAnalysisPointer with " << ID << "\n";
      if (ID == &AliasAnalysis::ID)
        return (AliasAnalysis*)this;
      return this;
    }

    virtual void initializePass() {
      DEBUG(dbgs() << "Initializing everything-must-alias\n");
      InitializeAliasAnalysis(this);
    }

    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
      AliasAnalysis::getAnalysisUsage(AU);
      AU.setPreservesAll();
    }

    virtual AliasResult alias(const Location &LocA, const Location &LocB) {
      DEBUG(dbgs() << "Everything must alias!\n");
      return AliasAnalysis::MustAlias;
    }
  };
}

namespace llvm {
  void initializeEverythingMustAliasPass(PassRegistry &Registry);
}

char EverythingMustAlias::ID = 0;

static RegisterPass<EverythingMustAlias> A("must-aa", "Everything must alias");
INITIALIZE_AG_PASS(EverythingMustAlias, AliasAnalysis, "must-aa",
                   "Everything must alias", false, true, false)
Running opt as above produces:
Args: opt -load /home/moconnor/Applications/llvm/lib/MustAA.so -must-aa -aa-eval -debug -debug-pass=Structure
WARNING: You're attempting to print out a bitcode file.
This is inadvisable as it may cause display problems. If
you REALLY want to taste LLVM bitcode first-hand, you
can force output with the `-f' option.
Subtarget features: SSELevel 8, 3DNowLevel 0, 64bit 1
Initializing everything-must-alias
Pass Arguments: -targetlibinfo -datalayout -notti -basictti -x86tti -no-aa -must-aa -aa-eval -preverify -domtree -verify
Target Library Information
Data Layout
No target information
Target independent code generator's TTI
X86 Target Transform Info
No Alias Analysis (always returns 'may' alias)
Everything must alias
ModulePass Manager
FunctionPass Manager
Exhaustive Alias Analysis Precision Evaluator
Preliminary module verification
Dominator Tree Construction
Module Verifier
===== Alias Analysis Evaluator Report =====
163 Total Alias Queries Performed
0 no alias responses (0.0%)
163 may alias responses (100.0%)
0 partial alias responses (0.0%)
0 must alias responses (0.0%)
Alias Analysis Evaluator Pointer Alias Summary: 0%/100%/0%/0%
168 Total ModRef Queries Performed
0 no mod/ref responses (0.0%)
0 mod responses (0.0%)
0 ref responses (0.0%)
168 mod & ref responses (100.0%)
Alias Analysis Evaluator Mod/Ref Summary: 0%/0%/0%/100%
Note the 163 may alias responses when my pass is returning MustAlias.
Edit: Following a suggestion on the mailing list, I added the following member function, since my pass uses multiple inheritance. It doesn't seem to change anything or get called.
virtual void *getAdjustedAnalysisPointer(AnalysisID ID) {
  errs() << "called getAdjustedAnalysisPointer with " << ID << "\n";
  if (ID == &AliasAnalysis::ID)
    return (AliasAnalysis*)this;
  return this;
}
I changed:
static RegisterPass<EverythingMustAlias> A("must-aa", "Everything must alias");
INITIALIZE_AG_PASS(EverythingMustAlias, AliasAnalysis, "must-aa",
                   "Everything must alias", false, true, false)
to
static RegisterPass<EverythingMustAlias> X("must-aa", "Everything must alias", false, true);
static RegisterAnalysisGroup<AliasAnalysis> Y(X);
Apparently INITIALIZE_AG_PASS only defines the registration function, so it is only useful for a pass that is statically linked into an LLVM executable (or something along those lines). RegisterAnalysisGroup runs when the module is dynamically loaded, so the pass then gets registered with the AliasAnalysis analysis group.

Doxygen \code line numbers

Is there a way to display code line numbers inside a \code ... \endcode block? From the screenshots in the Doxygen manual it would seem that there is, but I was unable to find an option for Doxygen itself, or a tag syntax, to accomplish this.
I need this to be able to write something like "In the above code, line 3" after a code block.
I also tested fenced code blocks, and still get no line numbers.
Short Answer
It seems that, at least in the current version (1.8.9), line numbers are added:
- to C code, only when using the \includelineno tag
- to any Python code
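As a small illustration of that difference (assuming a file example.c that Doxygen can find via EXAMPLE_PATH), a comment like the following produces a plain, unnumbered listing for the \code block and a numbered listing for \includelineno:

/**
 * Without line numbers (plain \code block):
 * \code
 * int add(int a, int b)
 * {
 *     return a + b;
 * }
 * \endcode
 *
 * With line numbers (example.c is resolved through EXAMPLE_PATH):
 * \includelineno example.c
 */
int add(int a, int b);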
Details
Python code formatter
The Python code formatter includes line numbers if g_sourceFileDef evaluates as TRUE:
/*! start a new line of code, inserting a line number if g_sourceFileDef
* is TRUE. If a definition starts at the current line, then the line
* number is linked to the documentation of that definition.
*/
static void startCodeLine()
{
  //if (g_currentFontClass) { g_code->endFontClass(); }
  if (g_sourceFileDef)

(https://github.com/doxygen/doxygen/blob/Release_1_8_9/src/pycode.l#L356)
It's initialized from the FileDef *fd passed into parseCode/parsePythonCode if one was provided (non-zero), or from a new FileDef(<...>) otherwise:
g_sourceFileDef = fd;
<...>
if (fd==0)
{
  // create a dummy filedef for the example
  g_sourceFileDef = new FileDef("",(exName?exName:"generated"));
  cleanupSourceDef = TRUE;
}
( https://github.com/doxygen/doxygen/blob/Release_1_8_9/src/pycode.l#L1458 )
so it seems all Python code gets line numbers included.
C code formatter
The C code formatter has an additional variable, g_lineNumbers, and includes line numbers only if both g_sourceFileDef and g_lineNumbers evaluate as TRUE:
/*! start a new line of code, inserting a line number if g_sourceFileDef
* is TRUE. If a definition starts at the current line, then the line
* number is linked to the documentation of that definition.
*/
static void startCodeLine()
{
  //if (g_currentFontClass) { g_code->endFontClass(); }
  if (g_sourceFileDef && g_lineNumbers)

(https://github.com/doxygen/doxygen/blob/Release_1_8_9/src/code.l#L486)
They are initialized in the following way:
g_sourceFileDef = fd;
g_lineNumbers = fd!=0 && showLineNumbers;
<...>
if (fd==0)
{
  // create a dummy filedef for the example
  g_sourceFileDef = new FileDef("",(exName?exName:"generated"));
  cleanupSourceDef = TRUE;
}
( https://github.com/doxygen/doxygen/blob/Release_1_8_9/src/code.l#L3623 )
Note that g_lineNumbers remains FALSE if the provided fd value was 0.
HtmlDocVisitor
Among the parseCode calls in HtmlDocVisitor::visit there is only one (for DocInclude::IncWithLines, which corresponds to \includelineno) that passes a non-zero fd:
https://github.com/doxygen/doxygen/blob/Release_1_8_9/src/htmldocvisitor.cpp#L540
so this seems to be the only command that will result in line numbers being included in a C code listing.