PySpark doesn't support a delimiter greater than 127

I am using PySpark on AWS EMR with Spark 2.4.3 to read a CSV file with a separator passed as a command-line argument.
The code is as follows.
loadDF = spark.read.csv("s3://TEST/sample.csv", header='false', inferSchema='false', sep=chr(self.delimiter))
If self.delimiter is set to any value less than 127, the CSV file is read without a problem, but I want it to work for delimiter values such as 198, 199, or 200.
Is this a limitation in PySpark?

I found the solution. The limitation is in Spark 2.4.2; the latest Spark 3.0.1, which comes with the latest AWS EMR release, supports any delimiter.
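For reference, a minimal sketch of the read on Spark 3.x, under the assumption (per the answer above) that newer versions accept high code point separators; the S3 path is the one from the question:
# Assumes Spark 3.x (e.g. a recent EMR release), where single-character separators
# above 127 such as chr(198) are accepted
loadDF = spark.read.csv(
    "s3://TEST/sample.csv",
    header=False,
    inferSchema=False,
    sep=chr(198),  # 198/199/200 are the values the question asks about
)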

Related

How to save dataframe as shp/geojson in PySpark/Databricks?

I have a DataFrame that has WKT in one of the columns. That column can be transformed to geojson if needed.
Is there a way to save (output to storage) this data as a geojson or shapefile in Databricks/PySpark?
Example of a DataFrame:
Id    Color     Wkt
1     Green     POINT (3 7)
2     Yellow    POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
The DataFrame can have ~100K rows and more.
I've tried using the GeoPandas library, but it doesn't work:
import os

import geopandas as gpd
from shapely import wkt

# df is a PySpark DataFrame
# Convert it to a pandas dataframe
pd_df = df.toPandas()
pd_df['geometry'] = pd_df['point_wkt'].apply(wkt.loads)
# Convert it to a GeoPandas dataframe
gdf = gpd.GeoDataFrame(pd_df, geometry='geometry')
# The following fails:
gdf.to_file(os.path.join(MOUNT_POINT, output_folder, "shapefile.shp"))
The error is:
Failed to create file /mnt/traces/output_folder/shapefile.shp: No such file or directory
The error makes no sense as the folder /mnt/traces/output_folder/ does exist, and I've successfully saved the PySpark dataframe as CSV to it.
df.write.csv(os.path.join(MOUNT_POINT,output_folder), sep='\t')
I'm able to save the GeoPandas dataframe to a shapefile with the above code when running locally, but not on Spark (Databricks).
If you are using Databricks, then install the libraries with dbutils:
dbutils.library.installPyPI("geopandas")
dbutils.library.installPyPI("shapely")
dbutils.library.installPyPI("geojsonio")
If you are using plain PySpark, install them into your Python environment as usual:
pip3 install shapely
pip3 install geopandas
pip3 install geojsonio
Before writing to the path, please check whether the path is mounted in Databricks:
display(dbutils.fs.ls('/mnt/traces'))
If you use dbutils.fs.ls("/mnt/traces/output_folder/") you'll see the path reported as
dbfs:/mnt/traces/output_folder/shapefile.shp, which takes us to our solution:
Solution: when writing, we CAN use /dbfs/mnt/... as the path instead of /mnt/...:
gdf.to_file("/dbfs/mnt/traces/output_folder/shapefile.shp")
Good luck!
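Since the question also asks about GeoJSON, here is a minimal sketch of the same flow ending in a GeoJSON write; the Wkt column name is taken from the example table and the CRS is an assumption:
import geopandas as gpd
from shapely import wkt

# df is the PySpark DataFrame from the question
pd_df = df.toPandas()
pd_df['geometry'] = pd_df['Wkt'].apply(wkt.loads)
gdf = gpd.GeoDataFrame(pd_df, geometry='geometry', crs='EPSG:4326')  # CRS is an assumption

# Write through the /dbfs/... local path, as in the fix above, selecting the GeoJSON driver
gdf.to_file("/dbfs/mnt/traces/output_folder/data.geojson", driver="GeoJSON")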

How to save a dataframe as a CSV in Scala? The csv function is not working

I am getting the below error while trying to save the dataframe as a CSV file on Cloudera:
scala> df.write.csv("/home/cloudera/Desktop/thakur2")
<console>:31: error: value csv is not a member of
org.apache.spark.sql.DataFrameWriter
df.write.csv("/home/cloudera/Desktop/thakur2")
The csv() method is not available on the DataFrameWriter class for Spark versions earlier than 2.0.
If you are using Spark < 2.0 on Cloudera, then you should use the spark-csv library instead.
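For illustration, a rough sketch of the spark-csv write (shown in PySpark for consistency with the rest of this page; the Scala call has the same shape), assuming the package has been added to the job, e.g. via --packages:
# Assumes spark-csv is on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 ...
df.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("/home/cloudera/Desktop/thakur2")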

pyspark : AnalysisException when reading csv file

I am new to PySpark and am migrating my project to it. I am trying to read a CSV file from S3 and create a DataFrame out of it. The file name is assigned to the variable cfg_file, and I am using a Key variable for reading from S3. I am able to do the same using pandas, but I get an AnalysisException when I read it using Spark. I am using the boto library for the S3 connection.
df = spark.read.csv(StringIO.StringIO(Key(bucket,cfg_file).get_contents_as_string()), sep=',')
AnalysisException: u'Path does not exist: file:
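One likely cause (a guess, not confirmed in the question) is that spark.read.csv expects a path rather than the file's contents, so the S3 object should be pointed at by its URI; a minimal sketch, with the bucket name as an assumption:
# spark.read.csv takes a path (or list of paths), not an in-memory buffer;
# "my-bucket" is a placeholder for the real bucket name
df = spark.read.csv("s3a://my-bucket/{}".format(cfg_file), sep=",")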

How to read .pkl file in pyspark

I have a dictionary saved in .pkl format using the following code in Python 3.x:
import pickle as cpick
OutputDirectory="My data file path"
with open("".join([OutputDirectory, 'data.pkl']), mode='wb') as fp:
    cpick.dump(data_dict, fp, protocol=cpick.HIGHEST_PROTOCOL)
I want to read this file in PySpark. Can you suggest how to do that? Currently I'm using Spark 2.0 and Python 2.7.13.
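One common approach (a sketch, not a definitive answer) is to unpickle the dictionary on the driver with plain Python and then hand it to Spark. The column names below are assumptions, and note that a pickle written with Python 3's highest protocol may not be loadable from Python 2.
import pickle as cpick

# Same path construction as the writer above
with open("".join([OutputDirectory, 'data.pkl']), mode='rb') as fp:
    data_dict = cpick.load(fp)

# Hand the (key, value) pairs to Spark; assumes the values are simple types
df = spark.createDataFrame(list(data_dict.items()), schema=["key", "value"])
df.show()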

reading compressed file in spark with scala

I am trying to read the content of a .gz file in Spark/Scala into a dataframe/RDD using the following code:
val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println);
The .gz file is 28 MB, and when I do the spark-submit using this command:
spark-submit --class sample --master local[*] target\spark.jar
it gives me a Java heap space error in the console.
Is this the best way of reading a .gz file, and if so, how could I solve the Java heap space issue?
Thanks
Disclaimer: that code and description will purely read in a small compressed text file using Spark, collect it to an array of every line, and print every line in the entire file to the console. The number of ways and reasons to do this outside of Spark far outnumber those to do it in Spark.
1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats).
2) Or at least use sc.textFile() instead of wholeTextFiles.
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running local, you're not network bound). Add the --driver-memory option to the spark-shell or spark-submit command to increase memory if you MUST do the collect.
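A minimal sketch of option 1 (shown in PySpark for consistency with the rest of this page; the Scala API uses the same call names), with the file path as a placeholder:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.read.text handles common compression codecs such as gzip transparently
df = spark.read.text("path/to/file.txt.gz")  # one row per line, in a column named "value"
df.show(20, truncate=False)                  # inspect a sample instead of collecting the whole file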