What is the '...' in sklearn.feature_extraction.text TfidfVectorizer return output? - tfidfvectorizer

When I used TfidfVectorizer to encode a paragraph, I receive the output that include '...' inside, this look like the matrix is too large and python cut off something. This problem happens occasionally, but not always, and cause error in another code because '...' has no meaning.
I run in on Conda with Python 36
My source code is like this:
vectorizer = TfidfVectorizer(stop_words='english',use_idf=True)
# print(data)
X = vectorizer.fit_transform(data).todense()
with open("spare_matrix.txt","w") as file:
file.write(str(X))
file.close()
Please suggest me some ideas for that with thanks

Related

Issues with fopen understanding the filepath I am giving it

Basically, I build a list of files using ls, then want to loop through that list, read in a file, and do some stuff. But when I try to read the file in it fails.
Here is an example
r=ls(['Event_2006_334_21_20_11' '/*.r'])
Event_2006_334_21_20_11/IU.OTAV_1.0.i.r
which is a 1x80 char
fopen(r(1,:))
-1
but
fopen('Event_2006_334_21_20_11/IU.OTAV_1.0.i.r')
12 (or whatever its on)
works. I've tried string(r) and char(r) and sprintf('%s',r). If I just build the string like r = ['Event_2006_334_21_20_11' '/IU.OTAV_1.0.i.r'] it works. So it seems something about combining the different variable types that messes it up but I can't seem to find a workaround. Probably something obvious I'm missing.
Any suggestions?
ls returns a matrix of characters, which means each row contains the same number of characters. To indicate the problem, try:
['-' r(1,:) '-']
You will probably notice some whitespaces in front of the -. Unless you want to print the output to the command line, ls is not really useful. As mentioned by Alex, use dir instead.
A further tip regarding your last comment, concatenate file path using fullfile. It makes sure you get one file separator whenever concatenating:
>> fullfile('myfolder','mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m
>> fullfile('myfolder/','mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m
>> fullfile('myfolder/','/mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m

Removing quotes that seem to be nonexistent in MatLab

I have an application that translates the .csv files I currently have into the correct format I need. But, the files I that I do have seem to have '"' double quotes around them, as seen in this image, which will not work with the program. As a result, I'm using this command to remove them:
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = strrep(current{m,n}, '"', '');
end
end
I'm not entirely sure that this works though, as it spits this back at me as it runs:
Warning: Inputs must be character arrays or cell arrays of strings.
When I open the file in matlab, it seems to only have the single quotes around it, which is normal for any other file. However, when I open it in notepad++, it seems to have '"' double quotes around everything. Is there a way to remove these double quotes in any other way? My code doesn't seem to do anything as seen here:
After using xlswrite to write the replacement cell-array to a .csv file, one appears corrupted. Any idea why?
So, my questions are:
Is there any way to remove the quotes in a more efficient manner or without rewriting to a csv?
and
What exactly is causing the corruption in the xlswrite function? The variable replacement seems perfectly normal.
Thanks in advance!
Regarding the "corrupted" file. That's not a corrupted file, that's an xls file (not xlsx). You could verify this opening the text file in a hex editor to compare the signature. This happens when you chose no file extension or ask excel to write a file which can't be encoded into csv. I assume it's some character which causes the problems, try to isolate the critical line writing only parts of the cell.
Regarding your warning, not having the actual input I could only guess what's wrong. Again, I can only give some advices to debug the problem yourself. Run this code:
lastwarn('')
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = strrep(current{m,n}, '"', '');
if ~isempty(lastwarn)
keyboard
lastwarn('')
end
end
end
This will launch the debugger when a warning is raised, allowing you to check the content of current{m,n}
It is mentionned in the documentation of strrep that :
The strrep function does not find empty strings for replacement. That is, when origStr and oldSubstr both contain the empty string (''), strrep does not replace '' with the contents of newSubstr.
You can use this user function substr like this :
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = substr(current{m,n}, 2, -1);
end
end
Please use xlswrite function like this :
[status,message] = xlswrite(___)
and check the status if it is zero, that means the function did not succeed and an error message is generated in message as string.

Perl Code : Output not displayed properly

I have a perl code where I access multiple txt files and produce output for them.
While I run the code, the output lines on the console are overwritten.
2015-04-21:12-04-54|getFilesInInputDir| ********** name : PEPORT **********
PEPORT4-21:12-04-54|readNFormOutputFile| name :
PEPORT" is : -04-54|readNFormOutputFile| Frequency for name "
Please note, that the second and third line it should have been like
2015-04-21:12-04-54|readNFormOutputFile| name : PEPORT
2015-04-21:12-04-54|readNFormOutputFile| Frequency for name "PEPORT"
Also, after this the code stops processing my files. The code seems fine. May I know what may be the possible cause for this.
Thanks.
Seems like CR/LF versus LF issue. Convert your input from MSWin to Linux by running dos2unix or fromdos, or remove the "\r" characters from within the Perl code.
As choroba says, I guess you are reading a file on Linux that has been generated on Windows. The easiest fix is to replace chomp with s/\s+\z//or s/\p{cntrl}+\z//
Or, if trailing spaces are significant, you can use s/[\r\n]+\z// or, if you are running version 10 or later of Perl 5, s/\R\z//

Piping gdalinfo.exe to matlab/octave with system() does not return any output

I am trying to use the system() function to pipe the output of the gdalinfo (version 1.10 x64) utility directly to Matlab/Octave. The function consistently returns status=0, but does not return any output. For example:
[status output] = system('"C:\Program Files\GDAL\gdalinfo.exe" "E:\DATA\image.tif"')
will only return:
status =
0
output =
''
Any idea why no output is returned?
It appears there is something strange about `gdalinfo.exe'. Several people have reported difficulty piping the output of the program to a textfile - see for example http://osgeo-org.1560.x6.nabble.com/GDALINFO-cannot-pipe-to-text-file-td3747928.html
So the first test would be - can you do something like this:
"C:\Program Files\GDAL\gdalinfo.exe" "E:\DATA\image.tif" > myFile.txt
and see whether the file is created and has any content? If it doesn't, it may be that the program is using a different way to produce output (for example, using stderr instead of stdout). If it is possible to get data into a text file but not directly to matlab, I suppose a workaround would be to write to file, then read that file in separately:
tempFile = tempname; % handy built in function to create temporary file name
execCmd = '"C:\Program Files\GDAL\gdalinfo.exe ';
targetFile = '"E:\DATA\image.tif"';
status = system([execCmd targetFile ' > ' tempFile]);
output = textread( tempFile, '%s' );
system(['del ' tempFile);
Now the output variable will be a cell array with one cell per line in the input file.
This works on my Windows machine if I am in the Octave directory:
[status output] = system('ls bin')
I was having the same issue trying to pipe the output from within C#. It turns out that the ECW plugin breaks the capability (I don't know how). If this plugin is not crucial to you, go into the plugins directory and delete gdal_ECWJP2ECW.dll. You should be able to use '>' and other stuff to dump your output to a file.

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#NR>1 means only look at lines above 1 because you said the sequence starts on line 2
NR>1{
#this for-loop goes through all bases in the line and then performs operations below:
for (i=1;i<=length;i++)
#for each position encountered, the variable "total" is increased by 1 for total bases
total++
}
{
for (i=1;i<=length;i++)
#if the "substring" i.e. position in a line == c or g upper or lower (some bases are
#lowercase in some fasta files), it will carry out the following instructions:
if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
#this increments the c count by one for every c or C encountered, the next if statement does
#the same thing for g and G:
c++; else
if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
g++
}
END{
#this "END-block" prints the gene name and C, G content in percentage, separated by tabs
print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}