I have a list of entries, records, read from a CSV file. I want to limit the length of the elements to 50 characters and save the result back into the list. My approach does not work:
import csv

def readfile():
    records = []
    with open(fpath, 'r') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        for row in csvreader:
            if len(row) == 0:
                continue
            records.append([row[1]] + [x.strip() for x in row[3]])
    return records

def cut_words(records):
    for lines in records:
        for word in lines:
            word = word[0:50]
    return records
The truncated words do not seem to be saved in the list. Thanks.
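For reference, a minimal sketch of how the truncation could be written so that the shortened strings actually end up back in records (assuming every field should simply be capped at 50 characters):

def cut_words(records):
    # strings are immutable, so reassigning the loop variable `word`
    # never touches the list; rebuild each row with truncated values instead
    return [[word[:50] for word in line] for line in records]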
I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append all the filenames stored in the source:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# converting filenames to tuple
d = [(x,) for x in filenames]
print(d)
Read the first row from each of the files and store it in a list:
# create data by reading from multiple files
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
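As a side note, the same (filename, first row) pairs can be built without the manual index counter by iterating over the filenames directly (a sketch under the same assumptions about the /FileStore/tables/ path and headered CSV files):

# sketch: build (filename, first_row) pairs without a separate counter
data = []
for name in filenames:
    first = spark.read.option("header", "true").csv("/FileStore/tables/" + name).limit(1).collect()
    first_row = str(first[0]) if first else None  # guard against empty files
    data.append((name, first_row))

df = spark.createDataFrame(data, ["filename", "filedata"])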
In Azure Databricks using PySpark, I'm reading file names from a directory. I am able to print the rows I need:
df_ls = dbutils.fs.ls('/mypath/')

for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
Outputs, for example:
mycompany_mytext_2020-12-22_11-34-46.txt
mycompany_mytext_2021-02-01_10-40-57.txt
I want to put those rows into a dataframe but have not been able to make it work. Some of my failed attempts include:
df_ls = dbutils.fs.ls('/mypath/')

for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
        # file_list = row[filename].collect()  # tuple indices must be integers or slices, not str
        # file_list = filename  # last row
        # file_list = filename.collect()  # error
        # file_list = spark.sparkContext.parallelize(list(filename)).collect()  # breaks last row into list of each character
        # col = 'fname'  # this and below generates ParseException
        # df = spark.createDataFrame(data=file_list, schema=col)
The question is, how do I collect the row output into a single dataframe column with a row per value?
You can collect the filenames into a list; Spark expects a nested list (one inner list per row).
The program will be as follows:
df_ls = dbutils.fs.ls('/mypath/')

file_names = []
for row in df_ls:
    if 'mytext' in row.name.lower():
        file_names.append([row.name])

df = spark.createDataFrame(file_names, ['filename'])
display(df)
filename = input('please give name of file')
lines = open(filename).readlines()

for i, line in enumerate(lines, start=1):
    print(str(i), str(line))
I've numbered the lines of the text document.
How do I create another index which shows each word and on which line it appears?
It should look like this:
Numbered lines in the text document below:
1)test
2)this
3)this
4)this
5)dog
6)dog
7)cat
8)cat
9)hamster
10)hamster
# I'm struggling to make this output:
index:
this [2,3,4]
test [1]
dog [5,6]
cat [7,8]
hamster [9,10]
This is a job for a dictionary, where you can store the line numbers for each word easily:
index = dict()
for i, line in enumerate(lines, 1):
    if line not in index:
        index[line] = []
    index[line].append(i)

for word in index:
    print(word.strip(), index[word], sep='\t')
If the numbers (which represent the line numbers) are already contained in the text document, we have to separate the number from the word:
index = dict()
for line in lines:
    i, word = line.strip().split(')')
    if word not in index:
        index[word] = []
    index[word].append(int(i))

for word in index:
    print(word, index[word], sep='\t')
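The same accumulation is often written with collections.defaultdict, which removes the explicit empty-list check (a sketch, assuming lines holds the numbered lines shown above):

from collections import defaultdict

index = defaultdict(list)
for line in lines:
    number, word = line.strip().split(')')
    index[word].append(int(number))

for word, numbers in index.items():
    print(word, numbers, sep='\t')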
I have this csv file called data.csv:
Name, Animal, and Total are the headers of the file:
Name  Animal    Total
Ann   Fish      6
Bob   Cat       4
Jim   Dog, Cat  5
I want to drop the row if any cells contain a comma, so that the result is this:
Name  Animal  Total
Ann   Fish    6
Bob   Cat     4
Here's what I tried to do in Scala:
val data = sc.textFile("file:/home/user/data.csv")
val new_data = data.filter(x => x.contains(","))
Unfortunately, this code did not produce the results I wanted. What can I do? Any help is greatly appreciated.
First, if your input data is CSV, this condition will always be true: x.contains(",").
So, using .textFile, every item in data will be a line from your file. Assuming that your source file separates the fields of each line with a comma (,), you can do something like:
val new_data = data.filter(x => x.split(",").length == 3) // where 3 is the expected number of columns
Here is a good way to do this:
import scala.io.Source

def readCsv(filePath: String): List[List[String]] = {
  val bufferedReader = Source.fromFile(filePath)
  bufferedReader.getLines
    .drop(1)
    .map { line =>
      // split on commas that are not inside quoted fields
      val row = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1).toList
      row
    }
    .toList
}
I'm doing HBase range scans, and I need to return all rows that begin with a particular letter. Given a single letter as input, how do I create the correct startRow and endRow keys?
val letter = "s"
val startRow = letter
// HBase stop rows are exclusive, so "t" covers every key that starts with "s"
val endRow = (letter.charAt(0) + 1).toChar.toString
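For illustration only, the same start/stop logic in Python with the happybase client (the connection host and table name below are assumptions, not part of the question; the point is that the stop row is exclusive, so bumping the letter by one character covers every key beginning with it):

import happybase  # assumed client library, not part of the question

letter = "s"
start_row = letter.encode()
stop_row = chr(ord(letter) + 1).encode()  # "t"; HBase stop rows are exclusive

connection = happybase.Connection("localhost")  # hypothetical host
table = connection.table("my_table")            # hypothetical table name
for key, row in table.scan(row_start=start_row, row_stop=stop_row):
    print(key, row)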