Code folding in external files with knitr and RStudio

I can find no way to insert syntactically acceptable RStudio-style folds into an external R code file that is set up for use from a knitr document. Or am I missing something? There are several ways this might be done:
1) Allow a code header such as:
## #knitr Q1 ----
or perhaps
## #knitr 'Q1' ----
2) Fold every code chunk (this would be a change in RStudio), but this is not as general as I would ideally like.
3) Allow the inclusion of some kind of comment line in code files that would indicate a fold. I have not been able to find a way to do this that does not add the comment line to the previous code chunk.
[Since initially posting this, I have noticed that the arguments 'from' and 'to' in read_chunk() can be regular expressions that specify the starting and ending lines of code chunks. So this gives one way to allow the insertion of comment lines that can specify folds. It would be nice, however, to be able to use one or more of mechanisms 1-3 above.]

From knitr v1.2.11 and above, RStudio-style code headers are supported consistently in knitr. The rule is basically # ---- label:
one or more hashes (#) at the beginning,
followed by at least four dashes (----),
followed by the chunk label,
and optionally followed by any number of trailing dashes.
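For example, under these rules both of the following are valid chunk headers (the label Q1 is borrowed from the question, for illustration):
# ---- Q1
## ---- Q1 ----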
This is supported in both read_chunk() and purl(), i.e., this style of comments is used in both importing and exporting code in knitr.
For RStudio to support code-folding, however, you will have to add at least four dashes to the end of the comment header, e.g.,
# ---- chunk-label -----------------------------
knitr 1.2.11 is a development version on GitHub; it will eventually become 1.3 on CRAN.

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I liked is that it auto-detects that the first row in my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers, but auto-assigned column1, column2...
What has changed, and why is this the case? I have checked the auto-detect box and use UTF-8.
While the auto-detect option is usually pretty good, there are times that it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. comma, invisible characters like zero-width-non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
...or make separate recipe steps to convert the first row to a header and then bulk-rename the columns to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and some data rows against the original source in a plaintext editor.
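If you want to check for these specific problems yourself before re-importing, a rough Python sketch along these lines can help (the file name is hypothetical, and the character list is an assumption based on the cases above, not Dataprep's actual detection logic):

import csv

# Characters that have caused header auto-detection trouble in the cases
# above; this list is an assumption, not Dataprep's actual detection logic.
SUSPECT = {
    ",": "comma inside a field name",
    "\u200c": "zero-width non-joiner",
    "\x00": "null byte",
}

# "data.csv" is a hypothetical name; point this at your source file.
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    first_row = next(reader, [])

for i, name in enumerate(header):
    for ch, label in SUSPECT.items():
        if ch in name:
            print(f"header column {i}: {label}")

# More data columns than headers is another known trigger.
if first_row and len(first_row) != len(header):
    print(f"{len(header)} headers vs {len(first_row)} data columns")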
When all else fails, you can try a CSV validator... but in my experience they tend to be incredibly opinionated about a file's formatting options, so depending on the system generating the CSV, they could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible the detection step was simply skipped.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
This can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Two closely matching files: get corresponding lines?

I'm in a situation where I'm programmatically generating LaTeX code, and I want my Synctex to point to the correct lines in the original file.
The generation is basically doing template expansion, so the original files are nearly identical to the generated ones, but with some snippets expanded.
I'm wondering, is there a diff tool or library that will easily give me the line number of the original file that corresponds to a given line in the generated one? Can this be extracted from a normal Unix diff somehow?
This is part of a build script, so ideally something easy to run, like bash or python, is preferred to something that needs to be compiled.
Google’s diff-match-patch lib is a neat solution to questions like these: https://github.com/google/diff-match-patch
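If you'd rather avoid an external dependency, Python's standard difflib can compute the same line correspondence; here is a minimal sketch (file names hypothetical):

import difflib

def line_map(original_lines, generated_lines):
    # Map each generated line number (1-based) to its matching original
    # line number, or None for lines produced by template expansion.
    sm = difflib.SequenceMatcher(a=original_lines, b=generated_lines, autojunk=False)
    mapping = {}
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        for offset in range(b2 - b1):
            mapping[b1 + offset + 1] = a1 + offset + 1 if tag == "equal" else None
    return mapping

orig = open("template.tex").read().splitlines()
gen = open("generated.tex").read().splitlines()
print(line_map(orig, gen).get(42))  # original line for generated line 42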

Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.
I have encountered this strange, or maybe intended, behavior.
While trying to index or search a column which contains file names with extensions (e.g. myimage.jpg), the system treats them as URLs and does not tokenize them properly.
I referred to the documentation and saw via ts_debug that the file name is taken as the host part of a URL.
Could someone tell me how to treat all input as normal words in the FTS of PostgreSQL?
Also, as a second request: how can one do contains, startswith, and endswith searches with it?
Update
I have now tried the statement create text search configuration..., copied from pg_catalog.english, removed host, url, and url_path, and then specified that configuration for the ts_debug method. But still no go: myimage.jpg is still identified as a host.
Version
I use version 9.4
tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, urls, hosts, etc).
So after trying to figure this out myself, the issue is that you don't seem to be able to easily customise the parser. From my understanding the parser runs first and generates tokens; those tokens are then matched against dictionaries.
By removing host, url, and url_path from the configuration, all you are doing is making it so that those tokens don't get looked up in a dictionary, resulting in no lexemes from them. This essentially means they don't exist in terms of search, which is not what you want...
Ideally what you need to do is customise the parser to not generate those tokens in the first place, or to also generate overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as the individual components). This doesn't seem to be possible at the moment without writing a custom parser.
The only solution to this would be to pre-parse the text to remove the full stops. Note that if you rely on other token types such as version (e.g. 8.3.0) or email (e.g. name@domain.com), this will break those, so you may need to be a bit clever about how you remove characters.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question: are you talking about partial word matches? You get this a little bit with stemming when using a config like english, so running becomes run, which will match whether you search for run or running. If you're talking about fuzzy matching it gets a little more complicated. I suggest reading this article: http://rachbelaid.com/postgres-full-text-search-is-good-enough/
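For the startswith part specifically, tsquery supports a :* prefix operator (available in 9.4), which that article doesn't dwell on; a minimal sketch using Python's psycopg2 (connection string hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# ':*' marks a prefix match: 'run:*' matches any lexeme starting with 'run',
# so 'running' (stemmed to 'run') is found too.
cur.execute(
    "SELECT to_tsvector('english', %s) @@ to_tsquery('english', %s)",
    ("he was running fast", "run:*"),
)
print(cur.fetchone()[0])  # True

For contains and endswith there is no tsquery equivalent; a trigram index (pg_trgm) combined with LIKE '%...%' is the usual workaround.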

parse text file to find lines which contain a date after 05/05/2011

I would like to be able to list the line numbers in a file which contain a date, in the format dd/mm/yy, that is greater than 05/05/11.
The file in question is Prolog source code; the dates form part of a comment indicating when the modification following the comment was made.
I'm using Emacs, so I thought an Emacs solution would be the obvious way forward, but I'm willing to think more laterally if need be.
a typical line of interest would be:
4.00 21/10/12 Modified to incorporate proportional match tolerances
I would like to report the file and line number, e.g.
filename line no.
my_source_1.pl 37
Or alternatively, just being able to step through the file using regexp I-search, highlighting each dd/mm/yy date that is greater than 05/05/2011, would be very useful.
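Thinking laterally, a small Python script is one way to get the file/line report; a sketch, assuming all two-digit years mean 20xx:

import re
import sys
from datetime import date

CUTOFF = date(2011, 5, 5)
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

for path in sys.argv[1:]:
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            for dd, mm, yy in DATE_RE.findall(line):
                try:
                    found = date(2000 + int(yy), int(mm), int(dd))
                except ValueError:
                    continue  # skip impossible dates such as 31/02/12
                if found > CUTOFF:
                    print(path, lineno)
                    break

Run it as, for example, python find_dates.py my_source_1.pl (the script name is made up).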

How do I configure BeyondCompare to ignore SCM replaced text in comments?

I have some text sequences that are replaced by the SCM (Perforce in my case).
I want to configure BeyondCompare to consider these sequences as unimportant differences, in order to be able to ignore them when I compare files.
In my case it's about Python source files, and the sequences look like:
# $Id: //depot/.../filename#7 $
# $DateTime: 2010/09/01 10:45:29 $
# $Author: username $
# $Change: 1234 $
Sometimes these sequences can appear outside comments, but even in these cases I would like to be able to ignore those lines, because they have not really changed.
You need to define a new grammar element (let's call it "SCM") and mark it as unimportant (see the tutorial here; choose "Basic" and make sure to check "Regular Expression").
The grammar element should be (if I interpret your examples correctly):
^.*\$(Id|DateTime|Author|Change):.*$
This will ignore any line that contains $Id:, $DateTime: etc.
If you only want to ignore lines that start with # $..., use
^\s*#\s*\$(Id|DateTime|Author|Change):.*$
And if you only want to ignore stuff between $ (and treat everything else as important), use
\$[^$\r\n]*\$
or
\$(Id|DateTime|Author|Change)[^$\r\n]*\$
depending on whether you care about those keywords or not.
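To sanity-check these patterns outside Beyond Compare, here is a quick Python test using the standard re module (sample lines taken from the question):

import re

samples = [
    "# $Id: //depot/.../filename#7 $",
    "# $DateTime: 2010/09/01 10:45:29 $",
    "x = 1  # an ordinary comment that should stay important",
]

line_pattern = re.compile(r"^\s*#\s*\$(Id|DateTime|Author|Change):.*$")
span_pattern = re.compile(r"\$(Id|DateTime|Author|Change)[^$\r\n]*\$")

for line in samples:
    print(bool(line_pattern.match(line)), bool(span_pattern.search(line)), line)
# The first two lines match both patterns; the ordinary comment matches neither.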
Beyond Compare's parser doesn't currently (v3/v4) support nested elements, so file format grammars can't be used to mark an SCM sequence as unimportant for a specific file type if the text is already classified as a comment, string, etc.
Beyond Compare 4.0 added support for marking arbitrary text as unimportant across an entire comparison, separate from the grammar.
Load the files you're interested in
Click the Session Settings button (aka Rules w/ umpire icon) or use the Session->Session Settings menu item.
Switch to the Importance tab
Click the + button at the bottom of the Unimportant text list.
Add the plain text or regular expression to the Text to find edit box, and check the Regular Expression checkbox if necessary. In this case the regular expression would be:
\$(Id|DateTime|Author|Change):.*\$
Click Ok.
By default these changes will only affect the current comparison. You can change the combobox at the bottom of the Session Settings dialog from Use for this view only to Also update session defaults to make it affect all future comparisons for all file types.