Reading flat file with multi-line string without quote with PySpark

I have a flat file delimited by | (pipe), without quote character. Sample data looks as following:
SOME_NUMBER|SOME_MULTILINE_STRING|SOME_STRING
23|multiline
text1|text1
24|multi
mulitline
text2|text2
25|text3|text4
What I'm trying to do is to load it into a dataframe to look something like this:
SOME_NUMBER  SOME_MULTILINE_STRING  SOME_STRING
23           multilinetext1         text1
24           multimulitlinetext2    text2
25           text3                  text4
I tried specifying the multiLine option, with no luck. Regardless of whether it is set to True or False, the output doesn't change. I suppose what I'm trying to achieve is to tell the reader that I'm expecting multi-line data, and that every record has the same number of columns as specified in the schema.
df_file = spark.read.csv(filePath,
                         sep="|",
                         header=True,
                         enforceSchema=True,
                         schema=df_table.schema,  # I need to explicitly specify the schema
                         quote='',
                         multiLine=True)

To fix that type of PSV (pipe-separated values) file, without quotes but with multiline values (newlines inside cell values), you need a stateful algorithm that loops through the rows and decides where to insert the quotes. This means the operation is not easily parallelizable, so you might as well do it in plain Python on the RDD rows:
def from_psv_without_quotes(path, sep='|', quote='"'):
    # Read the raw lines; zipWithIndex().keys() just preserves the line order.
    rddFromFile = sc.textFile(path)
    rdd = rddFromFile.zipWithIndex().keys()
    headers = rdd.first()
    cols = headers.split(sep)
    schema = ", ".join([f"{col} STRING" for col in cols])
    # A complete record contains exactly as many separators as the header.
    n_pipes = headers.count(sep)

    rows = rdd.collect()
    processed_rows = []
    n_pipes_in_current_row = 0
    complete_row = ""
    for row in rows:
        if n_pipes_in_current_row < n_pipes:
            # Still inside a multi-line record: keep appending physical lines.
            complete_row += row if n_pipes_in_current_row == 0 else "\n" + row
            n_pipes_in_current_row += row.count(sep)
        if n_pipes_in_current_row == n_pipes:
            # The record is complete: quote every field and emit it.
            complete_row = quote + complete_row.replace(sep, f'{quote}{sep}{quote}') + quote
            processed_rows.append(complete_row)
            n_pipes_in_current_row = 0
            complete_row = ""

    processed_rdd = sc.parallelize(processed_rows)
    print(processed_rdd.collect())
    df = spark.read.csv(
        processed_rdd,
        sep=sep,
        quote=quote,
        ignoreTrailingWhiteSpace=True,
        ignoreLeadingWhiteSpace=True,
        header=True,
        mode='PERMISSIVE',
        schema=schema,
        enforceSchema=False,
    )
    return df

df = from_psv_without_quotes('/path/to/unquoted_multiline.psv')
df.show()
I am assuming you are constrained to reading from Hadoop, so this example solution is a naive first attempt. It is really inefficient because of the rdd.collect() etc. I am sure you could do this much more efficiently if you avoided the whole Spark infrastructure and preprocessed the unquoted multiline file with GNU tools like sed and awk; a rough sketch of that idea follows.
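For what it's worth, here is a minimal sketch of that preprocessing done outside Spark, written in plain Python rather than sed/awk (a shell version would follow the same logic); the helper name quote_psv_lines and the paths are made up for illustration:

def quote_psv_lines(in_path, out_path, sep='|', quote='"'):
    """Stream the unquoted PSV file and write a quoted copy, joining
    physical lines until each logical record has the expected number of
    separators. Illustrative only; assumes field values contain no quotes."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        header = src.readline().rstrip('\n')
        n_seps = header.count(sep)
        dst.write(header + '\n')

        record = ''
        seps_seen = 0
        for line in src:
            line = line.rstrip('\n')
            record = line if not record else record + '\n' + line
            seps_seen += line.count(sep)
            if seps_seen == n_seps:
                # Quote every field so a CSV reader with multiLine support
                # can cope with the embedded newlines.
                dst.write(quote + record.replace(sep, quote + sep + quote) + quote + '\n')
                record = ''
                seps_seen = 0

The quoted output file can then be read directly with spark.read.csv(..., sep='|', quote='"', header=True, multiLine=True) and your explicit schema, without the collect-based pass inside Spark.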

Related

PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark, so I do not yet understand the basics and would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append all the filenames stored in the source to it:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# converting filenames to tuples
d = [(x,) for x in filenames]
print(d)
Read the first row of data from each of the files and store it in a list:
# create data by reading the first row from each file
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)

Databricks PySpark: Make Dataframe from Rows of Strings

In Azure Databricks using PySpark, I'm reading file names from a directory. I am able to print the rows I need:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
Outputs, for example:
mycompany_mytext_2020-12-22_11-34-46.txt
mycompany_mytext_2021-02-01_10-40-57.txt
I want to put those rows into a dataframe but have not been able to make it work. Some of my failed attempts include:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
        # file_list = row[filename].collect() # tuple indices must be integers or slices, not str
        # file_list = filename # last row
        # file_list = filename.collect() # error
        # file_list = spark.sparkContext.parallelize(list(filename)).collect() # breaks last row into list of each character
        # col = 'fname' # this and below generates ParseException
        # df = spark.createDataFrame(data = file_list, schema = col)
The question is, how do I collect the row output into a single dataframe column with a row per value?
You can collect the filenames into a list; note that Spark expects a nested list (one inner list per row).
The program will be as follows:
df_ls = dbutils.fs.ls('/mypath/')
file_names = []
for row in df_ls:
    if 'mytext' in row.name.lower():
        file_names.append([row.name])

df = spark.createDataFrame(file_names, ['filename'])
display(df)

How to strip extra spaces when writing from dataframe to csv

I read in multiple sheets (6) from an xlsx file and created individual dataframes. I want to write each one out to a pipe-delimited csv.
ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
Currently outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
I want to strip the trailing blanks.
Suggestion
Apply the method .apply(lambda x: x.str.rstrip()) to your DataFrame (prior to the .to_csv() call) to strip the trailing blanks from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be easily inserted into the output line using '.' chaining. To handle multiple data types, we can enforce the 'object' dtype on import by including the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, each with three columns: the first column has all numbers except an empty cell in row 2; the second column has strings with both a leading and a trailing blank, an empty cell in row 3, and a number in row 4; and the third column has all strings with a leading blank and an empty value in row 4. Integer indexes and integer column names have been included. The text in each sheet is:
       0        1        2
0  11111  valueB1  valueC1
1         valueB2  valueC2
2  33333           valueC3
3  44444    44444
4  55555  valueB5  valueC5
This code reads our .xlsx file testing_xlsx_nums.xlsx into the DataFrame dictionary ind_dim.
Next, it loops through each sheet with a for loop, using the sheet name as the key to reference that sheet's DataFrame. It applies the .str.rstrip() method to the entire sheet/DataFrame by passing the lambda function lambda x: x.str.rstrip() to the .apply() method called on the sheet/DataFrame.
Finally, it outputs the sheet/DataFrame as a .csv with the pipe delimiter using .to_csv() as seen in the OP post.
import pandas as pd

# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')

# loops through sheets, applies rstrip(), outputs as '|'-delimited csv
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')

# loops through sheets, applies rstrip(), outputs as '|'-delimited csv
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed =None for simplicity.
The other methods, which strip leading and both leading & trailing spaces, are .str.lstrip() and .str.strip() respectively. They can also be applied to an entire DataFrame by passing the lambda function lambda x: x.str.strip() to the .apply() method called on the DataFrame (a short sketch follows these notes).
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
Data types not string: When importing our data, pandas will assume data types for columns with a similar data type. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtypeType name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are working with, we can set it equal to itself and pass it to pd.DataFrame() with the argument dtype='str', like: df = pd.DataFrame(df, dtype='str')
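To make the stripping variants in these notes concrete, here is a minimal, self-contained pandas sketch; the DataFrame and its column names are made up for illustration:

import pandas as pd

# Toy frame with leading and trailing blanks (made-up data).
df = pd.DataFrame({
    'column1': [' a ', ' b '],
    'column2': [' x ', ' y '],
}, dtype='str')

# Strip trailing blanks from every field (as in the answer above).
right_stripped = df.apply(lambda x: x.str.rstrip())

# Strip leading blanks, or both sides, across the whole frame.
left_stripped = df.apply(lambda x: x.str.lstrip())
both_stripped = df.apply(lambda x: x.str.strip())

# Or strip just one column directly.
df['column2'] = df.column2.str.strip()

print(both_stripped)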
Hope it helps!
The following (in R) trims left and right spaces fairly easily:
if (!require(dplyr)) {
  install.packages("dplyr")
}
library(dplyr)

if (!require(stringr)) {
  install.packages("stringr")
}
library(stringr)

setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote=TRUE)

# str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote=TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."

how to parse specific portion of text file using two delimiters or strings in SCALA

I have a sample.txt file.
The file contains logs with date and time.
For example,
10.10.2012:
erewwetrt=1
wrtertret=2
ertertert=3
;
10.10.2012:
asdafdfd=1
adadfadf=2
adfdafdf=3
;
10.12.2013:
adfsfsdfgg=1
sdfsdfdfg=2
sdfsdgsdg=3
;
12.12.2012:
asdasdas=1
adasfasdf=2
dfsdfsdf=3
;
I just want to retrieve only the year 2012 data, that is, the part between 12.12.2012: and ;.
How can I do this in Scala or Spark Scala?
Finally, I need to replace = with a comma and save it in CSV format.
How can I do that?
To extract that specific part you can use this:
def main(args: Array[String]): Unit = {
  val text = "10.10.2012:\nerewwetrt=1\nwrtertret=2\nertertert=3\n;\n10.10.2012:\nasdafdfd=1\nadadfadf=2\nadfdafdf=3\n;\n10.12.2013:\nadfsfsdfgg=1\nsdfsdfdfg=2\nsdfsdgsdg=3\n;\n12.12.2012:\nasdasdas=1\nadasfasdf=2\ndfsdfsdf=3\n;"
  val lines = text.split("\n")
  // Skip everything up to the "12.12.2012:" marker, then take lines until the closing ";"
  val extracted = lines.dropWhile(_ != "12.12.2012:").drop(1).takeWhile(_ != ";")
  extracted.foreach(println(_))
}

how to split scala string with regular expression

I came up with a pattern like
val pattern = "(\\w+)\\|(.*)\\|\\[(.*)\\]\\|\"(.*)\"\\|\"(.*)\"\\|\\[(.*)\\]\\|\\[(.*)\\]\\|(.*)\\|\\[(.*)\\]\\|\\[(.*)\\]".r
and I have an original string
var str = """AuthLogout|vmlxapp21a|[13/Jan/2016:16:33:15 +0100]|"66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com"|"abcd_123_portalweb_w "|[]|[41]||[]|[]"""
I then apply the pattern to the string, but the result is always empty.
val items = pattern.findAllIn(str).toList
If I understand what you're trying to do, perhaps using a giant regex isn't the easiest way: You can split by | and get rid of the unwanted separators ([, ], ") using replaceAll:
val str = """AuthLogout|vmlxapp21a|[13/Jan/2016:16:33:15 +0100]|"66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com"|"abcd_123_portalweb_w "|[]|[41]||[]|[]"""
val withoutBoundaries = str.replaceAll("[\"\\]\\[]","")
val result = withoutBoundaries.split("\\|")
result.foreach(println)
Which prints:
AuthLogout
vmlxapp21a
13/Jan/2016:16:33:15 +0100
66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com
abcd_123_portalweb_w
41
If you do want to use a regex here, I'd create sub-regex vars representing the different text parts that you're after, to make this somewhat manageable:
val plain = "(.*)" // no boundary characters
val boxed = s"\\[$plain\\]" // same, encapsulated by square brackets
val quoted = '"' + plain + '"' // same, encapsulated by double quotes
// the whole thing, separated by pipes:
val r = s"$plain\\|$plain\\|$boxed\\|$quoted\\|$quoted\\|$boxed\\|$boxed\\|$plain\\|$boxed\\|$boxed".r
val result = r.findAllIn(str).toList // this list has one item, as expected.
Now, if you want to see what this regex looks like, here it is (but I don't recommend having this in your code):
val r = """(.*)\|(.*)\|\[(.*)\]\|"(.*)"\|"(.*)"\|\[(.*)\]\|\[(.*)\]\|(.*)\|\[(.*)\]\|\[(.*)\]""".r