Search for a pattern in the first 1000 characters of a string - pcre

I want to display the first 1000 characters of a string in which literals have been replaced by a special symbol. I am using the PCRE library to replace the literals. After each replacement I check the length of the string, and if it is > 1000 I stop matching and display the string.
My problem is this: suppose I pass in a string that is 1 GB long and contains no literals. PCRE will then scan the entire string. I want to search for the pattern only in the first 1000 characters. Is there any way to do this?

Just cut off the first 1000 characters of your string and run the substitution on that head, not on the whole text.
If you end up with fewer than 1000 characters after the substitution, cut off the next 1000 characters, run the substitution on them, and concatenate the two results. Repeat in a loop until you have a 1000-character string or reach the end of the whole text.
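The loop described above could be sketched roughly as follows. This is only an illustration: it uses Python's re module in place of PCRE, the function name head_with_substitution and its parameters are made up for the example, and the pattern passed in stands in for whatever "literal" means in your case. It also assumes that a literal never straddles a chunk boundary; if that can happen, the chunks would need to overlap or be cut at a safe position.

```python
import re


def head_with_substitution(text, pattern, replacement, limit=1000, chunk_size=1000):
    """Return the first `limit` characters of `text` after substitution.

    Only as many chunks as needed are scanned, so a huge input with no
    matches is not searched to the end. Hypothetical helper, not a PCRE API.
    """
    regex = re.compile(pattern)
    parts = []
    total = 0  # length of the substituted output collected so far
    pos = 0    # start of the next chunk in the original text
    while total < limit and pos < len(text):
        chunk = text[pos:pos + chunk_size]
        pos += chunk_size
        replaced = regex.sub(replacement, chunk)
        parts.append(replaced)
        total += len(replaced)
    # The last chunk may overshoot the limit, so trim the result.
    return "".join(parts)[:limit]
```

For a 1 GB input without any matches, this reads a single chunk and stops, because the first chunk already yields 1000 characters.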

Related

Alphanumeric substitution with vim

I'm using the VS Code Vim plugin. I have a bunch of lines that look like:
Terry,169,80,,,47,,,22,,,6,,
I want to remove all the alphanumeric characters after the first comma so I get:
Terry,,,,,,,,,,,,,
In command mode I tried:
s/^.+\,[a-zA-Z0-9-]\+//g
But this does not appear to do anything. How can I get this working?
edit:
s/^[^,]\+,[a-zA-Z0-9-]\+//g
\+ is greedy; ^.\+, eats the entire line up to the last ,.
Instead of the dot (which means "any character") use [^,] which means "any but a comma". Then ^[^,]\+, means "any characters up to the first comma".
The problem with your requirement is that you want to anchor at the beginning using ^, so the g flag cannot help: an anchored pattern can match only once per line. The only way I could solve the puzzle is to use an expression: match and preserve the anchored text, then apply the function substitute() with the g flag to the rest.
I managed with the following expression:
:s/\(^[^,]\+\)\(,\+\)\(.\+\)$/\=submatch(1) . submatch(2) . substitute(submatch(3), '[^,]', '', 'g')/
Let me split it in parts. Searching:
\(^[^,]\+\) — first, match any non-commas
\(,\+\) — any number of commas
\(.\+\)$ — all chars to the end of the string
Substituting:
\= — the substitution is an expression
See http://vimdoc.sourceforge.net/htmldoc/change.html#sub-replace-expression
submatch(1) — replace with the first match (non-commas anchored with ^)
submatch(2) — replace with the second match (commas)
substitute(submatch(3), '[^,]', '', 'g') — replace in the rest of the string
The last call to substitute() is simple, it replaces all non-commas with empty strings.
PS. Tested in real vim, not vscode.
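For readers who want to try the same idea outside Vim, here is a rough Python sketch of it: keep the anchored first field and the commas that follow, then strip every non-comma character from the rest via a replacement function. The name blank_fields_after_first is hypothetical, and Python's re syntax differs from Vim's (no backslashes before the group parentheses or the +).

```python
import re


def blank_fields_after_first(line):
    """Keep the first comma-separated field, empty all the others."""
    def rebuild(match):
        head, commas, rest = match.group(1), match.group(2), match.group(3)
        # Same role as substitute(submatch(3), '[^,]', '', 'g') in Vim.
        return head + commas + re.sub(r"[^,]", "", rest)

    return re.sub(r"^([^,]+)(,+)(.+)$", rebuild, line)


print(blank_fields_after_first("Terry,169,80,,,47,,,22,,,6,,"))
```

A line without any comma does not match the pattern and is returned unchanged.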

Inserting hyphens into length limited String using regex

Within a Swift project I have some regex which at present ensures that an input can only be 10 characters long:
"^[\\da-zA-Z]{10,10}$"
I need to tweak this slightly, so that the string which this is working on will have the below format:
#####-####
i.e, inserting a character after the fifth character.
So far I have tried combining what I have with some other regex, however this is incorrect and I can't figure out what I need to do differently to make this work:
"^[\\da-zA-Z]{10,10}$(.{5}),$1-$2"
If you have a string of 10 characters and you want to replace the sixth character with a hyphen, you could use 2 capturing groups.
Capture the first 5 characters in the first group, then match the sixth character, which is the one you want to replace, and capture the last 4 in the second group.
^([\\da-zA-Z]{5})[\\da-zA-Z]([\\da-zA-Z]{4})$
In the replacement use $1-$2 which in total will be 10 characters as in your desired pattern #####-####
Note that {10,10} can be written as {10}
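As a quick way to check the pattern outside of Swift, the same two-group replacement can be sketched in Python. Note the assumptions: the function name hyphenate_sixth is made up, and Python's re module writes group references in the replacement as \1 and \2 rather than $1 and $2.

```python
import re

# Five characters, one character to drop, four characters.
CODE_PATTERN = re.compile(r"^([\da-zA-Z]{5})[\da-zA-Z]([\da-zA-Z]{4})$")


def hyphenate_sixth(code):
    """Replace the sixth character of a 10-character code with a hyphen."""
    return CODE_PATTERN.sub(r"\1-\2", code)


print(hyphenate_sixth("12345X6789"))  # prints 12345-6789
```

Input that is not exactly 10 alphanumeric characters does not match and comes back unchanged.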

How to get rid of some characters from a field of string in Hive?

I want to get rid of some characters in a field that is in string format, for example by replacing punctuation with a blank character. How can I do this, given the list of characters to erase?
Use regexp_replace with a pattern such as '[_.,!?-]' that lists the characters you want to get rid of. This replaces each character in the class with a space:
select regexp_replace('test_string_with-punctuations,.!?','[_.,!?-]',' ');
Output (the trailing ,.!? also become spaces):
test string with punctuations
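The character class itself is not specific to Hive. As a small sketch for experimenting with which characters to list, the same replacement in Python looks like this (blank_out and its parameter name are hypothetical):

```python
import re


def blank_out(text, char_class=r"[_.,!?-]"):
    """Replace every character listed in the class with a single space."""
    # The hyphen is placed last in the class so it is taken literally.
    return re.sub(char_class, " ", text)


print(repr(blank_out("test_string_with-punctuations,.!?")))
```

Printing with repr makes the four trailing spaces visible.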

Text file search for match strings regex

I am trying to understand how regex works and what are the possibilities of working with it.
So I have a txt file and I am trying to search it for 8-character strings that contain numbers. For now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works, but I am trying to find a better option. At the moment it takes too long to read through the output, since it prints entire lines and does not filter them by length.
You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match, and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or with \b (word boundary) if it's part of an HTML document, as your comments seem to suggest. In the \b variant the lookahead uses \w* rather than .* so that the digit has to be inside the 8-character word itself and not somewhere later on the line:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=\w*\d)\w{8}\b'
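To see how the lookahead and the anchors interact, here is a small Python sketch of both variants. The function names are made up for the example, and in the word-boundary variant the lookahead is written as \w*\d so that the required digit has to sit inside the 8-character word; that detail is a choice made for this sketch.

```python
import re

# Variant 1: the 8-character string is alone on its line.
LINE_WITH_DIGIT = re.compile(r"^(?=.*\d)\w{8}$", re.MULTILINE)

# Variant 2: the 8-character string is a word inside longer text.
WORD_WITH_DIGIT = re.compile(r"\b(?=\w*\d)\w{8}\b")


def find_lines(text):
    """Return lines that are exactly 8 word characters with a digit."""
    return LINE_WITH_DIGIT.findall(text)


def find_tokens(text):
    """Return 8-character words that contain at least one digit."""
    return WORD_WITH_DIGIT.findall(text)


print(find_tokens("abcdefgh abc12345 12345678 toolongword1 short1"))
# prints ['abc12345', '12345678']
```

Unlike the original Select-String call, this returns only the matching strings and not the whole lines they appear on.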
The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters, use \S instead of the dot. If you want to match only digits and letters, use the character class [0-9a-z] instead.
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.
What you're currently searching for is a single digit in the range 0-9 followed by a single lowercase letter in the range a-z.
This, for example, will match any 8-character string containing only word characters (letters, digits, and underscore):
\w{8}
I often forget what some regex classes are, so I use this as a point of reference, and it may be useful to you as a learning tool: http://regexr.com/
It can also validate what you're typing inline via a text field so you can see if what you're doing works or not.
If you need more of a tutorial than a reference, I found this extremely useful when I was learning: regexone.com

sed - remove specific subscript from string

Please provide me a sed one-liner that produces this output:
sdc3 sdc2
for this input:
sdc3[1] sdc2[0]
I mean, remove all subscript values from the string.
sed 's/\[[^]]*\]//g'
This reads: replace any string consisting of a literal "[", followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching from treating "[1] sdc2[0]" as a single match in your sample string.
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switched the separator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one
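Both sed expressions can be tried out in Python as well, which may help when experimenting with the character classes. This is only a sketch: the function names are hypothetical, and Python's re syntax uses unescaped parentheses for groups and needs "]" escaped inside a class.

```python
import re


def strip_subscripts(text):
    """Remove every [...] subscript, like sed 's/\\[[^]]*\\]//g'."""
    return re.sub(r"\[[^\]]*\]", "", text)


def to_device_paths(text):
    """Drop the subscript and prefix each name with /dev/."""
    # Group 1 is the name: any run of characters that are neither
    # "[" nor a space, directly followed by a [...] subscript.
    return re.sub(r"([^\[ ]*)\[[^\]]*\]", r"/dev/\1", text)


print(strip_subscripts("sdc3[1] sdc2[0]"))  # prints sdc3 sdc2
print(to_device_paths("sdc3[1] sdc2[0]"))   # prints /dev/sdc3 /dev/sdc2
```

re.sub replaces all matches by default, so no counterpart to the g flag is needed.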