Escape hyphen when reading multiple dataframes at once in pyspark - pyspark

I can't seem to find any documentation on this in pyspark's docs.
I am trying to read multiple parquets at once, like this:
df = sqlContext.read.option("basePath", "/some/path")\
.load("/some/path/date=[2018-01-01, 2018-01-02]")
And receive the following exception
java.util.regex.PatternSyntaxException: Illegal character range near index 11
E date=[2018-01-01,2018-01-02]
I tried replacing the hyphen with \-, but then I just receive a file not found exception.
I would appreciate help on that

Escaping the - is not your issue. You can not specify multiple dates in a list like that. If you'd like to read multiple files, you have a few options.
Option 1: Use * as a wildcard:
df = sqlContext.read.option("basePath", "/some/path").load("/some/path/date=2018-01-0*")
But this will also read any files named /some/path/data=2018-01-03 through /some/path/data=2018-01-09.
Option 2: Read each file individually and union the results
dates = ["2018-01-01", "2018-01-02"]
df = reduce(
lambda a, b: a.union(b),
[
sqlContext.read.option("basePath", "/some/path").load(
"/some/path/date={d}".format(d=d)
) for d in dates
]
)

Related

How to strip extra spaces when writing from dataframe to csv

Read in multiple sheets (6) from an xlsx file and created individual dataframes. Want to write each one out to a pipe delimited csv.
ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
Currently outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
Want to strip trailing blanks
Suggestion
Include the method .apply(lambda x: x.str.rstrip()) to your output string (prior to the .to_csv() call) to strip the right trailing blank from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be easily inserted to the output code string using '.' referencing. To handle multiple data types, we can enforce the 'object' dtype on import by including the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, with each sheet having three columns: The first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column * with all strings having a leading blank, and an empty value in row 4*. Integer indexes and integer columns have been included. The text in each sheet is:
0 1 2
0 11111 valueB1 valueC1
1 valueB2 valueC2
2 33333 valueC3
3 44444 44444
4 55555 valueB5 valueC5
This code reads in our .xlsx testing_xlsx_dtype.xlsx to the DataFrame dictionary ind_dim.
Next, it loops through each sheet using a for loop to place the sheet name variable as a key to reference the individual sheet DataFrame. It applies the .str.rstrip() method to the entire sheet/DataFrame by passing the lambda x: x.str.rstrip() lambda function to the .apply() method called on the sheet/DataFrame.
Finally, it outputs the sheet/DataFrame as a .csv with the pipe delimiter using .to_csv() as seen in the OP post.
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for sheet in ind_dim:
ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for k, v in ind_dim.items():
v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed =None for simplicity.
The other methods that strip leading and leading & trailing spaces are: .str.lstrip() and .str.strip() respectively. They can also be applied to an entire DataFrame using the .apply(lambda x: x.str.strip()) lambda function passed to the .apply() method called on the DataFrame.
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
Data types not string: When importing our data, pandas will assume data types for columns with a similar data type. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtypeType name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel.() (as seen above). If we want to enforce a single data type on a DataFrame we are working with, we can set it equal to itself and pass it to pd.DataFrame() with the argument dtype='str like: df = pd.DataFrame(df, dtype='str')
Hope it helps!
The following trims left and right spaces fairly easily:
if (!require(dplyr)) {
install.packages("dplyr")
}
library(dplyr)
if (!require(stringr)) {
install.packages("stringr")
}
library(stringr)
setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote=TRUE)
#str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote=TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."

Spark - read text file, string off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors, that can get potentially pretty big (multiple GB). These files may have multiple header and footer rows I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first will return a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
By converting the row to string you will get [100000] and of course this is not a valid Long that's why the casting is failing.

PySpark - ValueError: Cannot convert column into bool

So I've seen this solution:
ValueError: Cannot convert column into bool
which has the solution I think. But I'm trying to make it work with my dataframe and can't figure out how to implement it.
My original code:
if df2['DayOfWeek']>=6 :
df2['WeekendOrHol'] = 1
this gives me the error:
Cannot convert column into bool: please use '&' for 'and', '|' for
'or', '~' for 'not' when building DataFrame boolean expressions.
So based on the above link I tried:
from pyspark.sql.functions import when
when((df2['DayOfWeek']>=6),df2['WeekendOrHol'] = 1)
when(df2['DayOfWeek']>=6,df2['WeekendOrHol'] = 1)
but this is incorrect as it gives me an error too.
To update a column based on a condition you need to use when like this:
from pyspark.sql import functions as F
# update `WeekendOrHol` column, when `DayOfWeek` >= 6,
# then set `WeekendOrHol` to 1 otherwise, set the value of `WeekendOrHol` to what it is now - or you could do something else.
# If no otherwise is provided then the column values will be set to None
df2 = df2.withColumn('WeekendOrHol',
F.when(
F.col('DayOfWeek') >= 6, F.lit(1)
).otherwise(F.col('WeekendOrHol')
)
Hope this helps, good luck!
Best answer as provided by pault:
df2=df2.withColumn("WeekendOrHol", (df2["DayOfWeek"]>=6).cast("int"))
This is a duplicate of:
this

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named like 'Spclty1'...'Spclty6' and another 6 named like 'StartDt1'...'StartDt6'. I want to zip them and collapse into a columns that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6'))))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time (only), but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a column object as second parameter and you are supplying a list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
DF_tmp = DF_tmp.withColumn('specialties', array([
struct(
*(col("Spclty{}".format(i)).alias("spclty_code"),
col("StartDt{}".format(i)).alias("start_date"))
)
for i in range(1, 7)
]
))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.

I am unable parse date info from a csv file into ipython

I am running python 3.5, I have imported pandas. My csv file (payinfo.csv) looks like:
"01 DEC",1234.45,2344,11,1212.66
"01 NOV", 9898.33, 2343,12,1009.33
When I run the following:
dateparse = lambda x: pd.datetime.strptime(x,"%d %b")
pay_data = pd.read_csv('payinfo.csv', parse_dates = ['Date'], date_parse
I always get
"ValueError: time data '“01 DEC”' does not match format '%d %b'
I am a new programmer to python, and would appreciate any help.
I think it was just the double quotes around string that caused that error. Try stripping away any hardcoded (not 'python generated') single or double quote marks with .strip('"')
Example:
a = '"01 DEC"'
# Gives error
#a = pd.datetime.strptime(a,"%d %b")
# string without unneccessary quote marks
a = pd.datetime.strptime(a.strip('"'),"%d %b")
print a
Output:
1900-12-01 00:00:00
You haven't included the headers in the question. But this works:
import io
import pandas as pd
a = io.StringIO(u""""01 DEC",1234.45,2344,11,1212.66
"01 NOV", 9898.33, 2343,12,1009.33""")
dateparse = lambda x: pd.datetime.strptime(x,"%d %b")
df = pd.read_csv(a,header=None, parse_dates=[0], date_parser=dateparse)
print df
You can append custom year to x before converting it to datetime
.strptime(year + x,"%Y%d %b")
Output:
0 1 2 3 4
0 1900-12-01 1234.45 2344 11 1212.66
1 1900-11-01 9898.33 2343 12 1009.33
Thank you both for your input. From your answers I modified the csv file to remove the quotes around the date entry, then things worked fine! I am puzzled because I have used the read_csv method before on similar data that looked like this:
"12/31/2016","The UPS Store","THE UPS STORE 031","10.74","debit","Business Services","Interest Checking","",""
"12/31/2016","Hospice of The East Bay","HOSPICE OF THE EAST","14.00","debit","Clara","Interest Checking","",""
and had no problems – in fact I didn't need to parse the data at all and the reader was able to correctly identify the date. Huh! I guess the real issue was that the date was stored in an unconventional format. In any case, I have the answer and thank you both for your answers.