Trying to load a corpus for CountVectorizer in the sklearn package

I am trying to load a corpus from my local drive into Python with a for loop, reading each text file and saving it for analysis with CountVectorizer. But I am only getting the last file. How do I store the contents of all of the files for analysis with CountVectorizer?
This code only returns the text from the last file in the folder.
folder_path = "folder"

#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        print(txt)

MyList = [txt]

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()

## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyList)

## get col names
ColNames = MyCV1.get_feature_names()
print(ColNames)

## convert DTM to DF
MyDF1 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF1)
This code works, but would not scale to the huge corpus that I am preparing it for.
#import and read text files
f1 = open("folder/animal_1.txt", 'r')
f1r = f1.read()
f2 = open("folder/animal_2.txt", 'r')
f2r = f2.read()
f3 = open("folder/animal_3.txt", 'r')
f3r = f3.read()

#reassemble corpus in python
MyCorpus = [f1r, f2r, f3r]

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()

## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyCorpus)

## get col names
ColNames = MyCV1.get_feature_names()
print(ColNames)

## convert DTM to DF
MyDF2 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF2)

I figured it out. Just gotta keep grinding.
MyCorpus = []

#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        MyCorpus.append(txt)
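Putting it together, here is a minimal sketch of the full pipeline under the same assumptions as above (folder_path points at a directory of .txt files; on scikit-learn 1.0+ get_feature_names() has been renamed to get_feature_names_out()):

import glob
import os

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

folder_path = "folder"  # assumed location of the .txt corpus

# read every text file in the folder into one list of strings
MyCorpus = []
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        MyCorpus.append(f.read())

# build the document-term matrix over the whole corpus
MyCV1 = CountVectorizer()
DTM1 = MyCV1.fit_transform(MyCorpus)
MyDF1 = pd.DataFrame(DTM1.toarray(), columns=MyCV1.get_feature_names_out())
print(MyDF1)

CountVectorizer can also read the files itself: construct it with input='filename' and pass fit_transform the list of paths, which saves you from assembling the list of strings yourself.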

Related

How to process the data from a table.txt file from a series of folders and save the output in the same folder using Matlab?

Could you please help me read the data from a table.txt in a series of subfolders of a directory? In every subfolder, the file to read has the same name, 'table.txt'. I want to process the data and save the output in the same folder.
I can process it using the following code.
a = readmatrix('table.txt');
a4 = a(:,4);
a4 = a4 - mean(a4);
N = 2^(nextpow2(length(a4)));
freq = (abs(fftshift(fft(a4,N))));
t=[0:1e-12:20e-9].';
ts=t(2)-t(1);
F = ((-N/2:N/2-1)/N)*(1/ts);
fmr=[(F(N/2+1:end)/1e9)' freq(N/2+1:end)];
writematrix(fmr, 'fmr.csv');
cd folder
But how to perform the same action on all the subfolders?
Could somebody please help me out?
You can use the "find files in subfolders" behaviour of dir. Something like this:
allTables = dir('**/table.txt');
for ii = 1:numel(allTables)
    thisFolder = allTables(ii).folder;
    inFile = fullfile(thisFolder, allTables(ii).name);
    a = readmatrix(inFile);
    % do stuff ...
    fmr = ...
    outFile = fullfile(thisFolder, 'fmr.csv');
    writematrix(fmr, outFile);
end

How to efficiently rename many files or partially select the names of these files during import?

How can I rename the files efficiently by the number in the name (see picture)? I did not succeed with Windows PowerToys, and I don't want to click each file and rename it to the number (e.g. 290).
Or, how can I read the files in this order and derive a name from each? If I try it with a script (see below), the following error occurs:
ValueError: invalid literal for int() with base 10: '211001_164357_P_Scripted_Powermeasurement_Wavelength_automatic_Powermeter1_0'
Or, how can I select only the numbers (290 to 230, see picture) within the name when reading?
Script:
#import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

data_location = r'C:\Users\...\Characterization_OPO\Data'
data_folder = Path(data_location)
data = {}

allist = list(data_folder.glob('*'))
for i, file in enumerate(allist):
    file = str(file)
    file_name = file.split('\\')[-1]
    wavelength = int(file_name.split('.')[0])
    tmp = pd.read_csv(file, skiprows=20, skipfooter=59, index_col="PixelNo")
    data[f'{wavelength} nm'] = tmp
    #data.plot(x='Wavelength', y='CCD_1', label=f"{wavelength} nm")
Picture:
I removed all the words with Windows PowerRename and then took the last three digits:
for i, file in enumerate(allist):
    file = str(file)
    file_name = file.split('\\')[-1]
    wavelength = int(file_name.split('.')[0])
    tmp = pd.read_csv(file, skiprows=26, skipfooter=5)
    data[f'{wavelength % 1000} nm'] = tmp
    #data.plot(x='Wavelength', y='CCD_1', label=f"{wavelength} nm")

How to read data from a csv file in Locust performance test scripts?

I am trying to read the data from a csv file that contains 1 row and 5 columns, using the following code:
def __init__(self):
    super(data, self).__init__()
    global data
    if data is None:
        with open('var.csv', 'r') as l:
            reader = csv.reader(l)
            data = list(reader)

def on_start(self):
    if len(data) > 0:
        self.my_value = data.pop()
My output is ('sample'), and I want it to be sample.
Change the last line from self.my_value = data.pop() to self.my_value = data.pop()[0].
But you could also use the locust-plugins CSV reader: https://github.com/SvenskaSpel/locust-plugins/blob/master/examples/csvreader_ex.py
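For reference, here is a minimal self-contained sketch of the same fix on a recent Locust version (the HttpUser API, the host, and the endpoint below are assumptions for illustration, not from the question):

import csv

from locust import HttpUser, task, between

# read the csv once at module level instead of guarding a global in __init__
with open('var.csv', 'r') as f:
    data = list(csv.reader(f))

class MyUser(HttpUser):
    host = "http://localhost:8080"  # hypothetical host
    wait_time = between(1, 2)

    def on_start(self):
        if len(data) > 0:
            # pop() returns a whole row (a list); [0] takes its first column
            self.my_value = data.pop()[0]

    @task
    def use_value(self):
        # hypothetical endpoint that uses the value read from the csv
        self.client.get(f"/items/{self.my_value}")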

How to read a Kaggle zip file dataset in Databricks

I want to read a zip file dataset from Kaggle, but I am unable to read it:
import urllib.request
urllib.request.urlretrieve("https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants/downloads/zomato-bangalore-restaurants.zip", "/tmp/zomato-bangalore-restaurants.zip")
Then I run a shell script to extract the file:
%sh
unzip /tmp/zomato-bangalore-restaurants.zip
tail -n +2 zomato-bangalore-restaurants.csv > temp.csv
rm zomato-bangalore-restaurants.csv
Then I got an error:
Archive: /tmp/zomato-bangalore-restaurants.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of /tmp/zomato-bangalore-restaurants.zip or
/tmp/zomato-bangalore-restaurants.zip.zip, and cannot find /tmp/zomato-bangalore-restaurants.zip.ZIP, period.
tail: cannot open 'zomato-bangalore-restaurants.csv' for reading: No such file or directory
rm: cannot remove 'zomato-bangalore-restaurants.csv': No such file or directory
Note: the attempt to download the file from Kaggle is blocked because you are not logged in yet.
Here is a script to download all the competition data sets.
from requests import get, post
from os import mkdir, remove
from os.path import exists
from shutil import rmtree
import zipfile

def purge_all_downloads(db_full_path):
    # Removes all the downloaded datasets
    if exists(db_full_path): rmtree(db_full_path)

def datasets_are_available_locally(db_full_path, datasets):
    # Returns True only if all the competition datasets are available locally in Databricks CE
    if not exists(db_full_path): return False
    for df in datasets:
        # Assumes all the datasets end with the '.csv' extension
        if not exists(db_full_path + df + '.csv'): return False
    return True

def remove_zip_files(db_full_path, datasets):
    for df in datasets:
        remove(db_full_path + df + '.csv.zip')

def unzip(db_full_path, datasets):
    for df in datasets:
        with zipfile.ZipFile(db_full_path + df + '.csv.zip', 'r') as zf:
            zf.extractall(db_full_path)
    remove_zip_files(db_full_path, datasets)

def download_datasets(competition, db_full_path, datasets, username, password):
    # Downloads the competition datasets if not available locally
    if datasets_are_available_locally(db_full_path, datasets):
        print('All the competition datasets have been downloaded, extracted and are ready for you!')
        return
    purge_all_downloads(db_full_path)
    mkdir(db_full_path)
    kaggle_info = {'UserName': username, 'Password': password}
    for df in datasets:
        url = (
            'https://www.kaggle.com/account/login?ReturnUrl=' +
            '/c/' + competition + '/download/' + df + '.csv.zip'
        )
        request = post(url, data=kaggle_info, stream=True)
        # write data to a local file (binary mode, since we stream raw chunks)
        with open(db_full_path + df + '.csv.zip', "wb") as f:
            for chunk in request.iter_content(chunk_size=512 * 1024):
                if chunk: f.write(chunk)
    # extract competition data
    unzip(db_full_path, datasets)
    print('done!')
For more details, refer to "Download the competition data sets directly".
Hope this helps.
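As an illustration, a hypothetical call to the helper above (the competition name, path, dataset names and credentials are placeholders; note the trailing slash on db_full_path, since the helper joins paths with plain string concatenation):

# placeholder arguments, adapt to your competition and local path
download_datasets(
    competition='some-competition',
    db_full_path='/tmp/kaggle/',
    datasets=['train', 'test'],
    username='your_kaggle_username',
    password='your_kaggle_password',
)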

How can I find the same elements in a column in PySpark?

I have a txt file containing a dataset with 4 columns. The first column holds telephone numbers, and I have to find the duplicate telephone numbers. My txt file looks like this:
...
0544147,23,86,40.761650,29.940929
0544147,23,104,40.768749,29.968599
0538333,21,184,40.764679,29.929543
05477900,21,204,40.773071,29.975010
0561554,23,47,40.764694,29.927397
0556645,24,6,40.821587,29.920273
...
and my code is
from pyspark import SparkContext

sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")

data1 = []
lines = rdd_data.collect()
lines = [x.strip() for x in lines]
for line in lines:
    data1.append([float(x.strip()) for x in line.split(',')])

column0 = [row[0] for row in data1]  # the first column, extracted as a list
So I don't know how I can find the duplicate telephone numbers in the first column. I'm very new to PySpark and Python. Thanks in advance.
from pyspark import SparkContext
sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")
rdd_telephone_numbers = rdd_data.map(lambda line: line.split(",")).map(lambda line: int(line[0]))
print(rdd_telephone_numbers.collect())  # [544147, 544147, 538333, 5477900, 561554, 556645]
If you want a step-by-step explanation of the data transformations:
from pyspark import SparkContext
sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")
rdd_data_1 = rdd_data.map(lambda line: line.split(","))
# this will transform every row of your dataset
# you had these data in your dataset:
# 0544147,23,86,40.761650,29.940929
# 0544147,23,104,40.768749,29.968599
# ...........
# now you have a single RDD like this:
# [[u'0544147', u'23', u'86', u'40.761650', u'29.940929'], [u'0544147', u'23', u'104', u'40.768749', u'29.968599'],....]
rdd_telephone_numbers = rdd_data_1.map(lambda line: int(line[0]))
# this will take only the first element of every line of the rdd, so now you have:
# [544147, 544147, 538333, 5477900, 561554, 556645]
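The map above only extracts the numbers. To actually find which telephone numbers occur more than once, one possible continuation (a sketch, not part of the original answer) is to count each number and keep the repeated ones:

# count each telephone number and keep the ones that appear more than once
rdd_counts = (rdd_telephone_numbers
              .map(lambda number: (number, 1))
              .reduceByKey(lambda a, b: a + b))
duplicates = rdd_counts.filter(lambda pair: pair[1] > 1).keys()
print(duplicates.collect())  # with the sample data above: [544147]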