Related
I'm trying to take in a complex csv file in MATLAB and edit/rearrange it. I used tableread to pull out the numerical data that I then manipulated. Afterwards I aimed to take the first 52 rows of text metadata, add three more rows then combine with the numerical data and output as a csv file.
However, I can't seem to find a way to work with the text metadata rows that doesn't ultimately add new lines between them when output.
e.g. this is how my csv outputs:
#HEADER
,,,,,,
Instrument, Data Processed Analysis
,,,,,
Version,1.3.1,1.25
,,,,
Created,Monday, August 16, 2021 3:13:38 PM
,,,
Filename,D5-116.NGXDP
,,,,,
When it should look like:
#HEADER
NGX, Data Processed Analysis
Version,1.3.1,1.25
Created,Monday, August 16, 2021 3:13:38 PM
Filename,D5-116.NGXDP
Here is my code:
`
close
clear
fileName = 'D5-116.NGXDP';
fileOut = fileName(1:end-6);
fileOut = append(fileOut, '.csv');
eEnergy = 1.602176634e-19; %electron energy in coulombs
resisTor = 1e11; %Resistor value for EM
%% Reading and Edit Mass Data - this is too hardcoded right now
T = readtable(fileName, 'FileType', 'text');
T = T{:,:}; %convert from table to matrix
dataOrg(:,1:3) = T(1087:1236, 1:3); %cycle:time:Ar
dataOrg(:,4) = T(935:1084, 3); %39
dataOrg(:,5) = T(783:932, 3); %38
dataOrg(:,6) = T(631:780, 3); %37
dataOrg(:,7) = T(479:628, 3); %36
dataOrg(:,7) = dataOrg(:,7) * eEnergy * resisTor; %36Ar - convert CPS to V
dataOrg(:,3:7) = dataOrg(:,3:7) * 1000; %Convert all to mV
%% Reading the metadata from the file
T2 = fopen(fileName);
metaData = fread(T2, '*char')'; %read content
fclose(T2);
results = strsplit(metaData, '\n'); %split the char data by entry spaces
metaNeat = results(:,1:52)'; %Takes the metadata prior to #collectors
metaNeat(end+1, 1) = {'Blocks, 1'}; %added data
metaNeat(end+1, 1) = {'Cycles, 150'}; %added data
metaNeat(end+1, 1) = {'Cycle, Time, 40, 39, 38, 37, 36'};
for n = 1:length(metaNeat) %scan each row, if there is a comma split them
if contains(metaNeat(n,1), ',') == 1
tempMeta = split(metaNeat(n,1), ',')';
metaNeat(n,1:size(tempMeta, 2)) = tempMeta;
clear tempMeta
end
end
outPut = metaNeat; %metadata section
outPut(end+1:(end+150), 1:7) = num2cell(dataOrg); %mass value section
writecell(outPut,fileOut,'Delimiter',',', 'FileType', 'text')
I'm sure there is a way to do this in three lines of course. My guess is that the strsplit function I'm using to separate the char cell into multiple rows is preserving the \n value. The added lines (e.g. Blocks, 1) do not show gaps in the csv file. Any tips or advice would be appreciated.
Thanks for your time,
I am trying to follow a tutorial on how to make a deeplearning chatbot with pytorch. However, this code is quite complex for me and it has stopped with a "IndexError: list index out of range". I looked the error up and get the gist of what it usually means, but seeing as this code is very complex for me I can't figure out how to solve the error.
this is the source tutorial: [https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/chatbot_tutorial.ipynb#scrollTo=LTzdbPF-OBL9][1]
Line 198 seems to be causing the error
return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH
This is the error log
Start preparing training data ...
Reading lines...
Traceback (most recent call last):
File "D:\Documents\Python\python pycharm files\pythonProject4\3.9 Chatbot.py", line 221, in <module>
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
File "D:\Documents\Python\python pycharm files\pythonProject4\3.9 Chatbot.py", line 209, in loadPrepareData
pairs = filterPairs(pairs)
File "D:\Documents\Python\python pycharm files\pythonProject4\3.9 Chatbot.py", line 202, in filterPairs
return [pair for pair in pairs if filterPair(pair)]
File "D:\Documents\Python\python pycharm files\pythonProject4\3.9 Chatbot.py", line 202, in <listcomp>
return [pair for pair in pairs if filterPair(pair)]
File "D:\Documents\Python\python pycharm files\pythonProject4\3.9 Chatbot.py", line 198, in filterPair
return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH
IndexError: list index out of range
Read 442563 sentence pairs
Process finished with exit code 1
And this is my code copied from my pycharm up to the block with the error. Seeing as its a huge code I could not copy the entire code. The rest of the code can be found in the github source link above.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")
corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("D:\Documents\Python\intents", corpus_name)
def printLines(file, n=10):
with open(file, 'rb') as datafile:
lines = datafile.readlines()
for line in lines[:n]:
print(line)
printLines(os.path.join(corpus, "movie_lines.txt"))
# Splits each line of the file into a dictionary of fields
def loadLines(fileName, fields):
lines = {}
with open(fileName, 'r', encoding='iso-8859-1') as f:
for line in f:
values = line.split(" +++$+++ ")
# Extract fields
lineObj = {}
for i, field in enumerate(fields):
lineObj[field] = values[i]
lines[lineObj['lineID']] = lineObj
return lines
# Groups fields of lines from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
conversations = []
with open(fileName, 'r', encoding='iso-8859-1') as f:
for line in f:
values = line.split(" +++$+++ ")
# Extract fields
convObj = {}
for i, field in enumerate(fields):
convObj[field] = values[i]
# Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
lineIds = eval(convObj["utteranceIDs"])
# Reassemble lines
convObj["lines"] = []
for lineId in lineIds:
convObj["lines"].append(lines[lineId])
conversations.append(convObj)
return conversations
# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
qa_pairs = []
for conversation in conversations:
# Iterate over all the lines of the conversation
for i in range(len(conversation["lines"]) - 1): # We ignore the last line (no answer for it)
inputLine = conversation["lines"][i]["text"].strip()
targetLine = conversation["lines"][i+1]["text"].strip()
# Filter wrong samples (if one of the lists is empty)
if inputLine and targetLine:
qa_pairs.append([inputLine, targetLine])
return qa_pairs
# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")
delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))
# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]
# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
lines, MOVIE_CONVERSATIONS_FIELDS)
# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
writer = csv.writer(outputfile, delimiter=delimiter)
for pair in extractSentencePairs(conversations):
writer.writerow(pair)
# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)
# Default word tokens
PAD_token = 0 # Used for padding short sentences
SOS_token = 1 # Start-of-sentence token
EOS_token = 2 # End-of-sentence token
class Voc:
def __init__(self, name):
self.name = name
self.trimmed = False
self.word2index = {}
self.word2count = {}
self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
self.num_words = 3 # Count SOS, EOS, PAD
def addSentence(self, sentence):
for word in sentence.split(' '):
self.addWord(word)
def addWord(self, word):
if word not in self.word2index:
self.word2index[word] = self.num_words
self.word2count[word] = 1
self.index2word[self.num_words] = word
self.num_words += 1
else:
self.word2count[word] += 1
# Remove words below a certain count threshold
def trim(self, min_count):
if self.trimmed:
return
self.trimmed = True
keep_words = []
for k, v in self.word2count.items():
if v >= min_count:
keep_words.append(k)
print('keep_words {} / {} = {:.4f}'.format(
len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
))
# Reinitialize dictionaries
self.word2index = {}
self.word2count = {}
self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
self.num_words = 3 # Count default tokens
for word in keep_words:
self.addWord(word)
MAX_LENGTH = 10 # Maximum sentence length to consider
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
return ''.join(
c for c in unicodedata.normalize('NFD', s)
if unicodedata.category(c) != 'Mn'
)
# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
s = unicodeToAscii(s.lower().strip())
s = re.sub(r"([.!?])", r" \1", s)
s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
s = re.sub(r"\s+", r" ", s).strip()
return s
# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
print("Reading lines...")
# Read the file and split into lines
lines = open(datafile, encoding='utf-8').\
read().strip().split('\n')
# Split every line into pairs and normalize
pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
voc = Voc(corpus_name)
return voc, pairs
# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
# Input sequences need to preserve the last word for EOS token
return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH
# Filter pairs using filterPair condition
def filterPairs(pairs):
return [pair for pair in pairs if filterPair(pair)]
# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
print("Start preparing training data ...")
voc, pairs = readVocs(datafile, corpus_name)
print("Read {!s} sentence pairs".format(len(pairs)))
pairs = filterPairs(pairs)
print("Trimmed to {!s} sentence pairs".format(len(pairs)))
print("Counting words...")
for pair in pairs:
voc.addSentence(pair[0])
voc.addSentence(pair[1])
print("Counted words:", voc.num_words)
return voc, pairs
# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
print(pair)
MIN_COUNT = 3 # Minimum word count threshold for trimming
I really hope someone can help me fix this problem and help me understand why it happens.
In the end I changed
return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH
to
try:
return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH
except:
return False
And now the code seems to be working.
I followed the tutorial about Reading data with TF and made some tries myself. Now, the problem is that my tests show duplicate data in the batches I created when reading data from a CSV.
My code looks like this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import collections
import numpy as np
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
class XICSDataSet:
def __init__(self, height=20, width=195, batch_size=1000, noutput=15):
self.depth = 1
self.height = height
self.width = width
self.batch_size = batch_size
self.noutput = noutput
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d"%i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(fnames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [[.0],[.0],[.0],[.0],[.0]]
data_tuple = tf.decode_csv(value, record_defaults=record_defaults, field_delim = ' ')
features = tf.pack(data_tuple[:-self.noutput])
label = tf.pack(data_tuple[-self.noutput:])
depth_major = tf.reshape(features, [self.height, self.width, self.depth])
min_after_dequeue = 100
capacity = min_after_dequeue + 30 * self.batch_size
example_batch, label_batch = tf.train.shuffle_batch([depth_major, label], batch_size=self.batch_size, capacity=capacity,
min_after_dequeue=min_after_dequeue)
return example_batch, label_batch
with tf.Graph().as_default():
ds = XICSDataSet(2, 2, 3, 1)
im, lb = ds.trainingset_files_reader(filename, 1)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for i in range(1000):
lbs = sess.run([im, lb])[1]
_, nu = np.unique(lbs, return_counts=True)
if np.array_equal(nu, np.array([1, 1, 1])) == False:
print('Not unique elements found in a batch!')
print(lbs)
I tried with different batch sizes, different number of files, different values of capacity and min_after_dequeue, but I always get the problem. In the end, I would like to be able to read data from only one file, creating batches and shuffling the examples.
My files, created ad hoc for this test, have 5 lines each representing samples, and 5 columns. The last column is meant to be the label for that sample. These are just random numbers. I'm using only 10 files just to test this out.
The default behavior for tf.train.string_input_producer(fnames) is to produce an infinite number of copies of the elements in fnames. Therefore, since your tf.train.shuffle_batch() capacity is larger than the total number of elements in your input files (5 elements per file * 10 files = 50 elements), and the min_after_dequeue is also larger than the number of elements, the queue will contain at least two full copies of the input data before the first batch is produced. As a result, it is likely that some batches will contain duplicate data.
If you only want to process each example once, you can set an explicit num_epochs=1 when creating the tf.train.string_input_producer(). For example:
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d" % i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(
fnames, shuffle=False, num_epochs=1)
# ...
When using the rapid annotator tool brat, it appears that the created annotations file will present the annotation in the order that the annotations were performed by the user. If you start at the beginning of a document and go the end performing annotation, then the annotations will naturally be in the correct offset order. However, if you need to go earlier in the document and add another annotation, the offset order of the annotations in the output .ann file will be out of order.
How then can you rearrange the .ann file such that the annotations are in offset order when you are done? Is there some option within brat that allows you to do this or is it something that one has to write their own script to perform?
Hearing nothing, I did write a python script to accomplish what I had set out to do. First, I reorder all annotations by begin index. Secondly, I resequence the label numbers so that they are once again in ascending order.
import optparse, sys
splitchar1 = '\t'
splitchar2 = ' '
# for brat, overlapped is not permitted (or at least a warning is generated)
# we could use this simplification in sorting by simply sorting on begin. it is
# probably a good idea anyway.
class AnnotationRecord:
label = 'T0'
type = ''
begin = -1
end = -1
text = ''
def __repr__(self):
return self.label + splitchar1
+ self.type + splitchar2
+ str(self.begin) + splitchar2
+ str(self.end) + splitchar1 + self.text
def create_record(parts):
record = AnnotationRecord()
record.label = parts[0]
middle_parts = parts[1].split(splitchar2)
record.type = middle_parts[0]
record.begin = middle_parts[1]
record.end = middle_parts[2]
record.text = parts[2]
return record
def main(filename, out_filename):
fo = open(filename, 'r')
lines = fo.readlines()
fo.close()
annotation_records = []
for line in lines:
parts = line.split(splitchar1)
annotation_records.append(create_record(parts))
# sort based upon begin
sorted_records = sorted(annotation_records, key=lambda a: int(a.begin))
# now relabel based upon the sorted order
label_value = 1
for sorted_record in sorted_records:
sorted_record.label = 'T' + str(label_value)
label_value += 1
# now write the resulting file to disk
fo = open(out_filename, 'w')
for sorted_record in sorted_records:
fo.write(sorted_record.__repr__())
fo.close()
#format of .ann file is T# Type Start End Text
#args are input file, output file
if __name__ == '__main__':
parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
usage=globals()['__doc__'],
version='$Id$')
parser.add_option ('-v', '--verbose', action='store_true',
default=False, help='verbose output')
(options, args) = parser.parse_args()
if len(args) < 2:
parser.error ('missing argument')
main(args[0], args[1])
sys.exit(0)
How do I get to read my file with increment .htm file with correct file format and path?
path:DATA\WEBPAGE_SOURCE\train75_phish_data\1.htm
file:1.htm,2.htm,3.htm....etc
Inside 1.htm,2.htm,3.htm....etc are the soucre code of webpage
I do try with the following example, but got the error when i=21.
data2=fopen(strcat('DATA\WEBPAGE_SOURCE\train75_phish_data\',int2str(i),'.htm'),'r')
I have refer to this, still cannot work, any ideas?
http://www.mathworks.com/help/matlab/ref/fopen.html
Here is my code:
data = importdata('DATA/URL/trainURL')
domain_URL = regexp(data,'\w*://[^/]*','match','once')
[sizeData b] = size(domain_URL);
for i = 1:150
A7_data = domain_URL{i};
data2=fopen(strcat('DATA\WEBPAGE_SOURCE\train75_phish_data\',int2str(i),'.htm'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
img_only = regexp(CharData, '<img.*?>', 'match');
feature7_data=(cellfun(#(n) isempty(n), strfind(img_only, A7_data)))
B7(i)=sum(feature7_data)
end
feature7(B7>=10)=1;
feature7(B7<10&B7>5)=0;
feature7(B7<=5)=-1;
feature7'
Here is my output:
data = importdata('DATA/URL/trainURL') is a list of URL being saved inside
I could not loop the results for i=20, it will come to error when iteration=21, I want to loop until 150, it cnt read the 'data2' for 'i=21'
I think you need to handle possible exceptions that can come in a more principled way. Try this:
data = importdata('DATA/URL/trainURL')
domain_URL = regexp(data,'\w*://[^/]*','match','once')
[sizeData b] = size(domain_URL);
for i = 1:150
A7_data = domain_URL{i};
filename = fullfile('DATA\WEBPAGE_SOURCE\train75_phish_data\',strcat(int2str(i),'.htm'));
if (exist(filename,'file')),
disp(sprintf('file %s exists, processing it',filename));
data2=fopen(filename,'r');
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
img_only = regexp(CharData, '<img.*?>', 'match');
feature7_data=(cellfun(#(n) isempty(n), strfind(img_only, A7_data)))
B7(i)=sum(feature7_data)
else,
disp(sprintf('file %s does not exist, skipping it!',filename));
end
end
feature7(B7>=10)=1;
feature7(B7<10&B7>5)=0;
feature7(B7<=5)=-1;
feature7'
after the line that does the fread.