How can I remove all spaces and & from text?
col1
H&M
H & M
H & M
H &M
H& M
output
col1
HM
HM
HM
HM
HM
Help me to fix the following code or give me new one:
df.withColumn('col1', F.regexp_replace("col1", "&", ""))
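A character class that matches either whitespace or `&` covers every row of the table above. The sketch below only checks the pattern with Python's `re` module on the sample values; the helper name `strip_spaces_and_amp` is made up for illustration. The same pattern can be tried as the second argument of `F.regexp_replace`, on the assumption that Spark's regex engine treats this simple class the same way.

```python
import re

# Matches one or more whitespace characters or ampersands in a row.
PATTERN = re.compile(r"[\s&]+")

def strip_spaces_and_amp(text):
    """Remove every whitespace character and '&' from text (illustrative helper)."""
    return PATTERN.sub("", text)

samples = ["H&M", "H & M", "H  & M", "H &M", "H& M"]
cleaned = [strip_spaces_and_amp(s) for s in samples]
print(cleaned)
```

In PySpark this would be written roughly as `df.withColumn('col1', F.regexp_replace('col1', r'[\s&]+', ''))`.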
Also, how can I get a single space in place of the &, whether or not there is white space between it and the surrounding characters?
col1
H&M
H & M
H & M
H &M
H& M
output
col1
H M
H M
H M
H M
H M
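For this second output, one option is to match the `&` together with any whitespace around it and replace the whole match with a single space. This is a sketch in plain Python (`amp_to_space` is a hypothetical helper name), meant only to test the pattern before using it with `F.regexp_replace`.

```python
import re

# '&' together with any whitespace on either side of it.
AMP_WITH_SPACES = re.compile(r"\s*&\s*")

def amp_to_space(text):
    """Replace '&' and its surrounding whitespace with one space (illustrative helper)."""
    return AMP_WITH_SPACES.sub(" ", text)

samples = ["H&M", "H & M", "H  & M", "H &M", "H& M"]
spaced = [amp_to_space(s) for s in samples]
print(spaced)
```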
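Replace all characters except letters. Note that .show() returns None, so call it separately instead of assigning its result to df:
df = df.withColumn('col1', F.regexp_replace('col1', '[^A-Za-z]', ''))
df.show()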
I'm trying to read a csv file with comments and empty lines, from which I need to fetch lines which are not empty or commented.
File looks like this:
Test File for dry run:
#This is a comment
# This is a comment with, comma
# This,is,a,comment with exact number of commas as valid lines
h1,h2,h3,h4
a,b,c,d
e,f,g,h
i,j,k,l
m,n,o,p
Expected Output:
h1 h2 h3 h4
-----------
a b c d
e f g h
i j k l
m n o p
Unsuccessful attempt:
q)("SSSS";enlist ",")0: ssr[;;]each read0 `:test.csv // tried various options with ssr but since '*' wildcard gives error with ssr so not sure of how to use regex here
This provides the required result:
q)("SSSS";enlist",")0:t where not""~/:t:5_read0`:test.csv
h1 h2 h3 h4
-----------
a b c d
e f g h
i j k l
m n o p
To ignore any number of comments you could use:
q)("SSSS";enlist",")0:t where not any each(" ";"#")~\:/:first each t:read0`:test.csv
h1 h2 h3 h4
-----------
a b c d
e f g h
i j k l
m n o p
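For comparison, the same filtering idea (drop blank lines and lines that start with `#`, then parse what is left as CSV) can be sketched in Python. This is not a translation of the q code: the sample text is an assumed reconstruction of the file without its title line, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Assumed file contents: comments, one blank line, then the header and data rows.
SAMPLE = """\
#This is a comment
# This is a comment with, comma
# This,is,a,comment with exact number of commas as valid lines

h1,h2,h3,h4
a,b,c,d
e,f,g,h
i,j,k,l
m,n,o,p
"""

def read_rows(lines):
    """Parse CSV lines, skipping blank lines and lines whose first non-space character is '#'."""
    kept = (ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#"))
    return list(csv.DictReader(kept))

rows = read_rows(io.StringIO(SAMPLE))
print(rows[0])
```

Filtering before parsing means a comment with the "right" number of commas never reaches the CSV parser.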
I have two 1x6 vectors that I am eventually trying to just sum up, but I need to get all of the possible combinations of these vectors before doing so. The vectors will look like so:
V1=[a b c d e f];
V2=[A B C D E F];
What I need is to find all possible combinations of variables that will remain a 1x6 vector. I have been messing around for a while now and I think I have found a way by using various matrices but it seems terribly inefficient. An example of what I am looking for is as follows.
M=[a b c d e f;
A b c d e f;
A B c d e f;
A B C d e f;
A B C D e f;
A B C D E f;
A B C D E F;
. . .]
And so on and so forth until all combinations are found. Unfortunately I am not a MATLAB whiz hence the reason I'm reaching out. I'm sure there has to be a much simpler way than what I have been trying. I hope that my question was relatively clear. Any help is much appreciated! Thanks!
I used cellfun to create the indexes:
V1=['abcdef'];
V2=['ABCDEF'];
VV = [V1;V2];
l = length(V1);
pows = 0:l-1;
x = num2cell(2.^pows);
L = x{end};
rows = cellfun(@(x) reshape([ones(x,L/x);2*ones(x,L/x)],[2*L 1]),x,'Uniformoutput',0);
rows = cell2mat(rows);
cols = repmat(1:l,[2*L 1]);
idxs = sub2ind(size(VV),rows,cols);
M = VV(idxs);
and you get:
M =
abcdef
Abcdef
aBcdef
ABcdef
abCdef
AbCdef
aBCdef
ABCdef
abcDef
AbcDef
...
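The underlying idea is that each of the 6 positions independently takes its value from V1 or V2, giving 2^6 = 64 rows. As a cross-check of that count, here is a minimal Python sketch using `itertools.product`; the function name `all_combinations` is an assumption for illustration, and the rows come out in a different order than in the MATLAB output above (here the last position varies fastest).

```python
from itertools import product

def all_combinations(v1, v2):
    """Return every string that takes position i from either v1 or v2."""
    assert len(v1) == len(v2)
    # zip pairs up (v1[i], v2[i]); product picks one element from each pair.
    return ["".join(choice) for choice in product(*zip(v1, v2))]

combos = all_combinations("abcdef", "ABCDEF")
print(len(combos))
print(combos[:4])
```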
Using octave I want to split a string into its individual characters.
How do I do this?
For example converting
x = "hello"
to
c = [h, e, l, l, o]
Thanks in advance for the help
A string in MATLAB (and Octave) is already an array of characters, so you can index into it directly:
x = 'hello'
x(1)
>> h
I'm using Ruby to read and then print a file to stdout, redirecting the output to a file in Windows PowerShell.
However, when I inspect the files, I get this for the input:
PS D:> head -n 1 .\inputfile
<text id="http://observer.guardian.co.uk/osm/story/0,,1009777,00.html"> <s> Hooligans NNS hooligan
, , , unbridled JJ unbridled passion NN passion
- : - and CC and no DT no executive JJ executiv
e boxes NNS box . SENT . </s>
... yet this for the output:
PS D:> head -n 1 .\outputfile
ΓΏ_< t e x t i d = " h t t p : / / o b s e r v e r . g u a r d i a n . c o . u k / o s m / s t o r y / 0 , , 1 0 0 9 7 7 7 , 0
0 . h t m l " > < s > H o o l i g a n s N N S h o o l i g a n , ,
, u n b r i d l e d J J u n b r i d l e d p a s s i o n N N p a s s i o n
- : - a n d C C a n d n o D T n o e x e c u t i v e J J
e x e c u t i v e b o x e s N N S b o x . S E N T . < / s >
How can this happen?
Edit: since my problem didn't have anything to do with Ruby, I've removed the Ruby-code, and included my usage of the Windows shell.
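In PowerShell, > is effectively the same as | Out-File, and in Windows PowerShell Out-File defaults to Unicode (UTF-16LE) encoding, which is why every character in the output appears to be followed by an extra byte. Try this instead of using >: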
... | Out-File outputfile -encoding ASCII
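To see why the output looks "spaced out", it helps to compare the raw bytes of the two encodings. The sketch below is plain Python, not PowerShell, and the sample string is shortened; it assumes the redirected file starts with a byte-order mark followed by UTF-16LE text, which matches the stray characters at the start of the output above.

```python
import codecs

line = '<text id="x">'

# Byte-order mark plus UTF-16LE: two bytes per ASCII character, the second one NUL.
utf16_bytes = codecs.BOM_UTF16_LE + line.encode("utf-16-le")
# ASCII: one byte per character, no mark at the start.
ascii_bytes = line.encode("ascii")

print(utf16_bytes[:8])
print(ascii_bytes[:3])
print(len(utf16_bytes), len(ascii_bytes))
```

A tool that prints those bytes one at a time shows the two mark bytes as odd symbols and each NUL as a gap between letters.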
Hi all and thanks in advance.
This is my first post here, please let me know if I should do this differently.
I have a large text file containing lines like the following, where fields are separated by several spaces and some fields contain single spaces:
"DATE  TIMESTAMP  T W M  T AL M C  A_B_C"
At first I read this in using the fopen and fgetl commands, so that I get a string:
Readout = DATE  TIMESTAMP  T W M  T AL M C  A_B_C
I want to transform this via e.g. textscan. While I feel I know MATLAB, I am by no means an expert with this command and have trouble using it.
I want to get:
A = 'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
However using the following code:
A = textscan(Readout,'%s');
A = A{1}';
I get:
A = 'DATE' 'TIMESTAMP' 'T' 'W' 'M' 'T' 'AL' 'M' 'C' 'A_B_C'
As I asked in the title, is there a way to ignore the single spaces?
PS:
At the end of writing this I came up with a not very elegant solution, shown below. I would still like to know if there is a nicer one:
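ReadBetter = [];
for n = 1:length(Read)-1
    % Skip a space that is directly followed by a non-space character
    if ~(Read(n) == ' ' && Read(n+1) ~= ' ')
        ReadBetter = [ReadBetter Read(n)];
    end
end
% The loop stops one character early, so append the last character
ReadBetter = [ReadBetter Read(end)];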
Read
ReadBetter
Output:
Read =
DATE  TIMESTAMP  T W M  T AL M C  A_B_C
ReadBetter =
DATE TIMESTAMP TWM TALMC A_B_C
Now I can use ReadBetter with textscan.
Thanks for this awesome webpage and the help I found here, in many other posts
Newer versions of MATLAB have a 'split' option for regexp, similar to Perl's split. Splitting on runs of two or more spaces leaves the single spaces inside a field alone:
>> str = 'DATE  TIMESTAMP  T W M  T AL M C  A_B_C';
>> out = regexp(str, ' {2,}', 'split')
out =
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
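The same split-on-runs idea can be checked outside MATLAB. Here is a minimal Python sketch, assuming the fields are separated by two or more spaces and that single spaces only occur inside a field; the sample line is an assumption built from the question.

```python
import re

# Assumed input: fields separated by two spaces, single spaces inside fields.
line = "DATE  TIMESTAMP  T W M  T AL M C  A_B_C"

# Split only where two or more spaces occur in a row.
fields = re.split(r" {2,}", line.strip())
print(fields)
```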
A simpler solution to parse your string would be to use the function REGEXP to find the indices where you have 2 or more whitespace characters in a row, use these indices to break your string up into a cell array of strings using the function MAT2CELL, then use the function STRTRIM to remove leading and trailing whitespace from each substring. For example:
>> str = 'DATE  TIMESTAMP  T W M  T AL M C  A_B_C';
>> cutPoints = regexp(str,'\s{2,}');
>> cellArr = mat2cell(str,1,diff([0 cutPoints numel(str)]));
>> cellArr = strtrim(cellArr)
cellArr =
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
I think that you are making things too complicated. Just use:
fid = fopen('pathandnameoffile');
C = textscan(fid,'%s','Delimiter','\t');
fclose(fid);
The example above assumes that you have tabs as delimiters. Change it to something else if required.
Here's one way to read your file:
file.dat
DATE TIMESTAMP T W M T AL M C A_B_C
DATE TIMESTAMP T W M T AL M C A_B_C
DATE TIMESTAMP T W M T AL M C A_B_C
DATE TIMESTAMP T W M T AL M C A_B_C
DATE TIMESTAMP T W M T AL M C A_B_C
DATE TIMESTAMP T W M T AL M C A_B_C
MATLAB code:
fid = fopen('file.dat', 'rt');
C = textscan(fid, '%s %s %c%c%c %c%2c%c%c %s');
fclose(fid);
C = [ C{1}, C{2}, ...
cellstr( strcat(C{3},{' '},C{4},{' '},C{5}) ), ...
cellstr( strcat(C{6},{' '},C{7},{' '},C{8},{' '},C{9}) ), ...
C{10}
]
The resulting cell-array:
C =
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'
'DATE' 'TIMESTAMP' 'T W M' 'T AL M C' 'A_B_C'