How to build unigram, bigram, and trigram models from sentences or training data in file.txt in Python? - python-3.7

I have some code that builds unigrams, bigrams, and trigrams from a sentence, but I want it to build them from file.txt instead. I'm new to programming; what do I need to do?
def ngrams(s, n=2, i=0):
    while len(s[i:i+n]) == n:
        yield s[i:i+n]
        i += 1

txt = 'Python is one of the awesomest languages'
unigram = ngrams(txt.split(), n=1)
a = list(unigram)
bigram = ngrams(txt.split(), n=2)
b = list(bigram)
trigram = ngrams(txt.split(), n=3)
c = list(trigram)
print('unigram:')
print(a)
print('bigram:')
print(b)
print('trigram:')
print(c)

When you want to write a file (notice the 'w' in the open call):
with open('some.txt', 'w') as file:
    file.write("this is an example!")
When you want to read a file (notice the 'r' in the open call):
with open('some.txt', 'r') as file:
    line = file.readline()
    print(line)
This should be a good starting point for you.
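Putting the two pieces together, a minimal sketch of reading a file and building n-grams from it could look like the following. The helper name ngrams_from_file and the UTF-8 encoding are assumptions made for illustration, not part of the original code, and the n-gram function is rewritten as a list comprehension rather than the generator from the question.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a list of lists."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def ngrams_from_file(path, n=2):
    """Read a text file and collect the n-grams of each line.

    Hypothetical helper: n-grams are built per line, so they never
    span a line break. The file is assumed to be UTF-8 encoded.
    """
    result = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # split() with no arguments splits on any whitespace
            # and drops the trailing newline
            result.extend(ngrams(line.split(), n))
    return result
```

Calling ngrams_from_file('file.txt', 1), ngrams_from_file('file.txt', 2), and ngrams_from_file('file.txt', 3) would then give the unigrams, bigrams, and trigrams. If n-grams should be allowed to cross line breaks, one option is to read the whole file with f.read().split() and pass that single token list to ngrams instead.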

Related

Read textfile with a mix of floats, integers and strings in the same column

Loading a well formatted and delimited text file in Matlab is relatively simple, but I struggle with a text file that I have to read in. Sadly I can not change the structure of the source file, so I have to deal with what I have.
The basic file structure is:
123 180 (two integers, white space delimited)
1.5674e-8
.
.
(floating point numbers in column 1, column 2 empty)
.
.
100 4501 (another two integers)
5.3456e-4 (followed by even more floating point numbers)
.
.
.
.
45 String (An integer in column 1, string in column 2)
.
.
.
A simple
[data1,data2]=textread('filename.txt','%f %s', ...
    'emptyvalue', NaN)
does not work.
How can I properly filter the input data? All examples I found online and in the Matlab help so far deal with well structured data, so I am a bit lost at where to start.
As I have to read a whole bunch of these files (more than 100), I would rather not iterate through every single line in every file. I hope there is a much faster approach.
EDIT:
I made a sample file available here: test.txt (google drive)
I've looked at the text file you supplied and tried to draw a few general conclusions -
When there are two integers on a line, the second integer corresponds to the number of rows following.
You always have (two integers (A, B) followed by "B" floats), repeated twice.
After that you have some free-form text (or at least, I couldn't deduce anything useful about the format after that).
This is a messy format so I doubt there are going to be any nice solutions. Some useful general principles are:
Use fgetl when you need to read a single line (it reads up to the next newline character)
Use textscan when it's possible to read multiple lines at once - it is much faster than reading a line at a time. It has many options for how to parse, which it is worth getting to know (I recommend typing doc textscan and reading the entire thing).
If in doubt, just read the lines in as strings and then analyse them in MATLAB.
With that in mind, here is a simple parser for your files. It will probably need some modifications as you are able to infer more about the structure of the files, but it is reasonably fast on the ~700 line test file you gave.
I've just given the variables dummy names like "a", "b", "floats" etc. You should change them to something more specific to your needs.
function output = readTestFile(filename)
    fid = fopen(filename, 'r');
    % Read the first line
    line = '';
    while isempty(line)
        line = fgetl(fid);
    end
    nums = textscan(line, '%d %d', 'CollectOutput', 1);
    a = nums{1}(1);
    b = nums{1}(2);
    % Read 'b' of the next lines:
    contents = textscan(fid, '%f', b);
    floats1 = contents{1};
    % Read the next line:
    line = '';
    while isempty(line)
        line = fgetl(fid);
    end
    nums = textscan(line, '%d %d', 'CollectOutput', 1);
    c = nums{1}(1);
    d = nums{1}(2);
    % Read 'd' of the next lines:
    contents = textscan(fid, '%f', d);
    floats2 = contents{1};
    % Read the rest:
    rest = textscan(fid, '%s', 'Delimiter', '\n');
    % Close the file now that everything has been read
    fclose(fid);
    output.a = a;
    output.b = b;
    output.c = c;
    output.d = d;
    output.floats1 = floats1;
    output.floats2 = floats2;
    output.rest = rest{1};
end
You can read in the file line by line using the lower-level functions, then parse each line manually.
You open the file handle like in C
fid = fopen(filename);
Then you can read a line using fgetl
line = fgetl(fid);
Tokenizing the line on spaces is probably the best first pass, storing each piece in a cell array (because a matrix doesn't support ragged arrays)
colnum = 1;
rem = line;  % remainder of the line still to be tokenized
while ~isempty(rem)
    [token, rem] = strtok(rem, ' ');
    entries{linenum, colnum} = token;
    colnum = colnum + 1;
end
Then you can wrap all of that inside another while loop to iterate over the lines
linenum = 1;
while ~feof(fid)
    % fgetl, strtok, index bookkeeping as above
end
It's up to you whether it's best to parse the file as you read it or read it into a cell array first and then go over it afterwards.
Your cell entries are all going to be strings (char arrays), so you will need to use str2num to convert them to numbers. It does a good job of working out the format so that might be all you need.

embed a number and extension in a variable file name

I want to save data to files whose names contain consecutive numbers, inside a for-loop.
First, I have a function "SetConfeguration.m" in which I specify the input directory and the file name as fields in a structure, as below
StrConf.InputDirectory = 'C:/ElastixMatlab/elx_input';
StrConf.ParameterFilename = 'parameter.%d.txt';
The structure "StrConf" is then used as a parameter in the main function, as below
ParameterFilename = fullfile(Conf.InputDirectory, Conf.ParameterFilename);
for Cpt = 1:NbParameterFiles
    TmpParameterFilename = sprintf(ParameterFilename, Cpt - 1);
    disp('ParameterFilename: '); disp(ParameterFilename);
end
I have the following error:
Warning: Invalid escape sequence appears in format string. See help sprintf for
valid escape sequences.
> In elxElastix at 153
In elxExampleStreet at 93
ParameterFilename :
C:\ElastixMatlab\elx_input\parameter.%d.txt
TmpParameterFilename :
C:
I think you forgot to call the structure StrConf to access the parameters
TmpParameterFilename = sprintf(StrConf.ParameterFilename, Cpt - 1);
disp('ParameterFilename: '); disp(StrConf.ParameterFilename);
Also, I suggest a small change to the for loop so that it runs from 0 to n-1 directly.
ParameterFilename = fullfile(Conf.InputDirectory, Conf.ParameterFilename);
for Cpt = 0:NbParameterFiles-1
    TmpParameterFilename = sprintf(StrConf.ParameterFilename, Cpt);
    disp('ParameterFilename: '); disp(StrConf.ParameterFilename);
end
This way you save an operation in every iteration, since you no longer compute the subtraction Cpt - 1, which makes your code slightly more efficient.
You need to use sprintf before fullfile. The problem is that fullfile is normalizing your path separator from / used in your code, to \ which is the standard on Windows. But \ is also used for escape sequences which sprintf recognizes.
This will work better:
for Cpt = 1:NbParameterFiles
    TmpParameterFilename = fullfile(Conf.InputDirectory, ...
        sprintf(StrConf.ParameterFilename, Cpt - 1));
    disp('ParameterFilename: '); disp(TmpParameterFilename);
end
I think you want
TmpParameterFilename = sprintf('%s%d.txt',ParameterFilename, Cpt-1);
And then ParameterFilename wouldn't have .txt at the end.

How to read digits from file to matrix, no delimiter

I have data stored in the format below, with no delimiter and with digits from the domain {0,1}. Using Octave, I have not managed to read the digits and store them in a matrix. How can I read those digits and store them in a matrix as described below?
Data in File, 32 x 32 digits
00000000000000000000000000000000
00000000001111110000000000000000
...
00000010000000100001000000000000
how to store data
matrix[1, 1:32] = 00000000000000000000000000000000
matrix[2, 1:32] = 00000000001111110000000000000000
. . .
matrix[32, 1:32] = 00000010000000100001000000000000
OR
matrix[1, 1:32] = 00000000000000000000000000000000
matrix[1, 33:64] = 00000000001111110000000000000000
. . .
matrix[1, 993:1024] = 00000010000000100001000000000000
One possible solution is to read the data as a string first:
octave> textread('foo.dat', '%s', 'headerlines', 2)
ans =
{
[1,1] = 00000000000000000000000000000000
[2,1] = 00000000001111110000000000000000
...
}
If these are binary representations of decimals, you may find bin2dec() useful.
This would do the trick (though I don't know how well that third input to fread and arrayfun work with Octave, tested this on Matlab):
fid = fopen('a.txt','rt');
str = fread(fid,inf,'char=>char');
st = fclose(fid);
qrn = str==10|str==13;
str(qrn) = [];
yourMat = reshape(arrayfun(@str2num,str),find(qrn,1)-1,[]).'
Assuming you don't have header lines, you can read the text in as a cell array of strings like so:
C = textread('names.txt', '%s');
Then, in general for all digits from 0 to 9, you can transform this into a matrix like so:
M = vertcat(C{:})-'0';
If performance is an issue you can look into other ways to import the strings, but this should get the job done.
I have never used Matlab, but assuming it reads files the same way Octave does, and if using an external tool is OK, you could add a delimiter by replacing characters in a text editor. You could change every "0" to "0," and every "1" to "1," and then simply load the file.
(This would add a delimiter at the end of every line. In case that creates a problem, you could try replacing your text by pairs instead "00"->"0,0" "10" -> "1,0" and so on)
In case the file is too big for a normal editor, you might even try replacing the characters with sed:
sed -i 's/charactertoreplace/newcharacter/g' yourfile.txt

Separating an array based on whether it contains a phrase or not

I am really just a noob at Matlab, so please don't get upset if I use wrong syntax. I am currently writing a small program in which I put all .xlsx file names from a certain directory into an array. I now want to separate the files into two different arrays based on their name. This is what I tried:
files = dir('My_directory\*.xlsx')
file_number = 1;
file_amount = length(files);
while file_number <= file_amount;
    file_name = files(file_number).name;
    filescs = [];
    filescwf = [];
    if strcmp(file_name,'*cs.xlsx') == 1;
        filescs = [filescs,file_name];
    else
        filescwf = [filescwf,file_name];
    end
    file_number = file_number + 1
end
The idea here is that strcmp(file_name,'*cs.xlsx') checks if file_name contains 'cs' at the end. If it does, it is put into filescs, if it doesn't it is put into filescwf. However, this does not seem to work...
Any thoughts?
strcmp(file_name,'*cs.xlsx') checks whether file_name is identical to *cs.xlsx. If there is no file by that name (hint: few file systems allow '*' as part of a file name), it will always be false. (btw: there is no need for the '==1' comparison or the semicolon on the respective line)
You can use array indexing here to extract the relevant part of the filename you want to compare. file_name(1:5), will give you the first 5 characters, file_name(end-5:end) will give you the last 6, for example.
strcmp doesn't work with wildcards such as *cs.xlsx. See this question for an alternative approach.
You can use regexp to check the final letters of each of your files, then cellfun to apply regexp to all your filenames.
Here, getIndex will be 1 for every file ending with cs.xlsx. The (?=$) part makes sure that cs.xlsx is at the end. Note the ~ in front of cellfun: isempty is true for the names that do not match, so the result has to be negated.
files = dir('*.xlsx')
filenames = {files.name}'; %get filenames
getIndex = ~cellfun(@isempty,regexp(filenames, 'cs.xlsx(?=$)'));
list1 = filenames(getIndex);
list2 = filenames(~getIndex);

strrep MATLAB function

I would like to remove the first letter and convert the second one to lowercase.
Example :
a = 'iSvalid' to a = 'svalid'
I've tried strrep( a,'i','') which gives 'Svalid' but I would like also to convert the first capital letter to lower case.
>> a = 'iSvalid';
>> b = strcat(lower(a(2)), a(3:end))
b =
svalid
You can also use brackets:
>> b = [lower(a(2)) a(3:end)]
b =
svalid
For a general solution, which will e.g. work on cell arrays of strings, or on multiple words in the same string, there is regexprep:
a = 'iSvalid';
%# discard first letter of word, replace second by lower-case version
b = regexprep(a,'\<\w(\w)','${lower($1)}')
b =
svalid
Here is my version of @petrichor's answer. I've separated each function to make the code more readable.
a = 'iSvalid';
b = a(2:end);
b(1) = lower(b(1));