PySpark: saveAsTable on Windows cannot handle a Windows path

I am trying to save a CSV file using a Windows path (with "\" instead of "/"). I think it does not work because of the Windows path.
Is this why the code does not work?
Is there a workaround for the problem?
The code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv", format="csv")

def mul(x):
    return (x, x**2)

run_on_configs_spark()
The error code:
Traceback (most recent call last):
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 426, in <module>
analysis()
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 408, in analysis
run_CDH()
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 420, in run_CDH
max_prob_for_extension=None, max_base_size_B=4096,OP_arr=[0.2],
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 173, in settings_print
dic=get_map_of_worst_seq(params)
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 245, in get_map_of_worst_seq
run_over_settings_spark_test(info_obj)
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 239, in run_over_settings_spark_test
run_on_configs_spark(configs)
File "C:\Users\yuvalr\Desktop\Git_folder\algo_sim\Bucket_analysis\Set_multiple_configurations\spark_parallelized_configs.py", line 17, in run_on_configs_spark
df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")
File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\readwriter.py", line 868, in saveAsTable
self._jwrite.saveAsTable(name)
File "C:\Users\yuvalr\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input ':' expecting {<EOF>, '.', '-'}(line 1, pos 1)
== SQL ==
C:\Users\yuvalr\Desktop\example_csv
-^^^

As I see it, the problem is with your output line. Try this instead:
df.write.csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv")
Yes, I know you're on Windows and expecting backslashes, but PySpark isn't.
Windows is very sensitive to file extensions - without the .csv you'll probably just end up with a folder called example_csv.
You don't need a raw r"" string for this.
Using file:/// makes it doubly clear that this is a file we're talking about.
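A minimal sketch of that suggestion in context, assuming the same df built in the question (the mode and header options are optional additions of mine, not part of the original answer):
# Hedged sketch: write df as CSV to a local Windows path via a file:/// URI
(df.write
    .mode("overwrite")            # replace any previous output
    .option("header", True)       # include column names in the output
    .csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv"))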

As you can see, saveAsTable() expects a table name, not a path; the table is written under the directory configured by spark.sql.warehouse.dir.
saveAsTable(name, format=None, mode=None, partitionBy=None, **options)
Parameters
name – the table name
format – the format used to save
mode – one of append, overwrite, error, errorifexists, ignore (default: error)
partitionBy – names of partitioning columns
options – all other string options
Source: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
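If you want to check where that warehouse directory currently points before saving, a quick sketch (not part of the quoted docs):
# Print the active warehouse location that saveAsTable will use
print(spark.conf.get("spark.sql.warehouse.dir"))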
Workaround (note the doubled backslashes for Windows, e.g. C:\\):
Set spark.sql.warehouse.dir to point at the destination directory, as below:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder\
        .config("spark.sql.warehouse.dir", "C:\\Users\\yuvalr\\Desktop")\
        .appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.saveAsTable("example_csv", format="csv", mode="overwrite")

def mul(x):
    return (x, x**2)

run_on_configs_spark()
Edit 1:
If it is an external table (i.e. the underlying files live at an external path), you can use the option below:
#df.write.option("path","C:\\Users\\yuvalr\\Desktop").saveAsTable("example_csv",format="csv",mode="overwrite")
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder\
        .appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.option("path", "C:\\Users\\yuvalr\\Desktop").saveAsTable("example_csv", format="csv", mode="overwrite")

def mul(x):
    return (x, x**2)

run_on_configs_spark()
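Once the table is registered, you can read it back through the catalog to verify the write; a short sketch, assuming the session above is still active:
# Look up the saved table in the metastore and display it
result = spark.table("example_csv")
result.show()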

Related

udf using class method pyspark

My problem: how can I call a function inside another function in a class using a PySpark UDF?
I am trying to write a PySpark UDF using a method from a class called Anomalie in the file devAM_hive.py:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F               # needed for F.udf / F.col
from pyspark.sql.types import ArrayType, StringType  # needed for the UDF return type
import re

spark = SparkSession.builder.getOrCreate()

class Anomalie():
    def __init__(self):
        self.Anomalie_udf = F.udf(Anomalie.aux, ArrayType(StringType()))

    def aux(texte):
        code_utilisateur = re.findall(r'[\s]*\d{2}.\d{2}.\d{4}[\s]*\d{2}.\d{2}.\d{2}\s(\w?\.?\s?.*)\s\(', texte)
        return code_utilisateur

    def auto_test(self, df):
        df = df.withColumn("name", self.Anomalie_udf(F.col("Description")))
        return df
When I call this from the main file, I get the error "No module named 'devAM_hive'", even though the module in which I defined the class is imported.
from devAM_hive import *
A=Anomalie()
df=A.auto_test(row_data)
df.select("name").show(50)
The error message:
22/04/09 14:30:58 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/worker.py", line 588, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/opt/mapr/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'devAM_hive'
When I call this from the main file, I get "No module named 'devAM_hive'", even though the module in which I defined the class is imported.
Importing works because you import the module on the driver, where the file is available (sitting next to your main file). Running fails because your executors don't have it. What you want is to distribute that module using --py-files; that way the class ends up on the executors' path.
spark = (SparkSession
         .builder
         .appName('Test App')
         .config('spark.submit.pyFiles', '/path/to/devAM_hive.py')
         .getOrCreate()
         )
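If the session is already running, an alternative way to ship the module to the executors is SparkContext.addPyFile; a sketch of mine, with the path being an assumption:
# Distribute devAM_hive.py to every executor from a live session
spark.sparkContext.addPyFile("/path/to/devAM_hive.py")
from devAM_hive import *   # now importable on the driver and the executors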

including external jar into pyspark using pycharm

I'm facing a problem trying to include com.databricks:spark-xml_2.10:0.4.1 in my PySpark code in PyCharm:
import pyspark
from pyspark.shell import sc
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import os

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

if __name__ == '__main__':
    df = sqlContext.read.format('org.apache.spark.sql.xml') \
        .option('rowTag', 'lei:Extension') \
        .load('C:\\Users\\Consultant\\Desktop\\20170501-gleif-concatenated-file'
              '-lei2.xml')
    df.show()
but what it returns is
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/C:/spark-2.4.5-bin-hadoop2.7/python/dependency
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:/spark-2.4.5-bin-hadoop2.7/python/test.py", line 2, in <module>
from pyspark.shell import sc
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\shell.py", line 38, in <module>
SparkContext._ensure_initialized()
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I'd like to add an external jar directly in PyCharm. Is this possible?
Thanks in advance.
You should set your environment variable as the first step of your script:
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1"
)

import pyspark
...
Then, if you want to do this for any script you run, use PyCharm's Run Configurations. You can add a template by following these steps:
Go to Edit Configurations
In Templates, edit the Python template
Add an environment variable like PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-xml_2.10:0.4.1"
Hope it helps.
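Putting the ordering fix together, a rough sketch of a corrected script (I've used SparkSession and the com.databricks.spark.xml format name that spark-xml registers; the rowTag and file path are taken from the question):
import os

# Must be set before anything from pyspark is imported, so that the JVM
# launched by PySpark picks up the --packages argument.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format('com.databricks.spark.xml')
          .option('rowTag', 'lei:Extension')
          .load('C:\\Users\\Consultant\\Desktop\\20170501-gleif-concatenated-file-lei2.xml'))
    df.show()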

Getting "invalid syntax" error while reading data from text file using pyspark

I'm trying to read a text file using PySpark. The data in the file is comma separated.
I've already tried reading the data using SQLContext.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext._active_spark_context
filePath = './data_files/data.txt'
sqlContext = SQLContext(sc)
print(fileData)
schema = StructType([StructField('ID', IntegerType(), False),
StructField('Name', StringType(), False),
StructField('Project', StringType(), False),
StructField('Location', StringType(), False)])
print(schema)
fileRdd = sc.textFile(fileData).map(_.split(",")).map{x => org.apache.spark.sql.Row(x:_*)}
sqlDf = sqlContext.createDataFrame(fileRdd,schema)
sqlDf.show()
I'm getting following error.
File "", line 1
fileRdd = sc.textFile(fileData).map(.split(",")).map{x => org.apache.spark.sql.Row(x:*)}
^ SyntaxError: invalid syntax
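For reference, the failing line mixes Scala syntax into Python; a pure-Python equivalent would look roughly like this (my sketch, reusing the filePath, sqlContext and schema defined above, with ID cast to int to match the IntegerType field):
from pyspark.sql import Row

# Python translation of the Scala-style map/Row line
fileRdd = sc.textFile(filePath) \
    .map(lambda line: line.split(",")) \
    .map(lambda p: Row(int(p[0]), p[1], p[2], p[3]))
sqlDf = sqlContext.createDataFrame(fileRdd, schema)
sqlDf.show()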
I've tried using the following code and it works fine.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext._active_spark_context
sc = SparkContext("local", "first app")
sqlContext = SQLContext(sc)
filePath = "./data_files/data.txt"
# Load a text file and convert each line to a Row.
lines = sc.textFile(filePath)
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0].strip(), p[1], p[2], p[3]))
# The schema is encoded in a string.
schemaString = "ID Name Project Location"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = sqlContext.createDataFrame(people, schema)
schemaPeople.show()

Reading parquet file with PySpark

I am new to PySpark and nothing seems to be working. Please help.
I want to read a parquet file with PySpark. I wrote the following code:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
I got the following error
Py4JJavaError Traceback (most recent call
last) /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Then I tried the following code:
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
SQLContext.read.parquet("my_file.parquet")
Then the error was as follows :
AttributeError: 'property' object has no attribute 'parquet'
You need to create an instance of SQLContext first.
This will work from pyspark shell:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
If you are using spark-submit you need to create the SparkContext in which case you would do this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc.stop()
conf = SparkConf().setMaster('local[*]')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("my_file.parquet")
Try this.
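For what it's worth, on Spark 2.x and later the same read goes through SparkSession, which supersedes SQLContext; a small sketch, not part of the original answers:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("my_file.parquet")
df.show()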

How to get files name with spark sc.textFile?

I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all rows of all files in the directory (right?)
I now want to add a column to each row with the source file's name. How can I do that?
The other option I tried is wholeTextFiles, but I keep getting out-of-memory exceptions.
5 servers 24 cores 24 GB (executor-core 5 executor-memory 5G)
Any ideas?
You can use this code. I have tested it with Spark 1.4 and 1.5.
It gets the file name from the InputSplit and attaches it to each line inside the iterator, using mapPartitionsWithInputSplit on the NewHadoopRDD.
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

val sc = new SparkContext(new SparkConf().setMaster("local"))
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path: String = "file:///home/user/test"
val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)

val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })

linesWithFileNames.foreach(println)
I think it's pretty late to answer this question but I found an easy way to do what you were looking for:
Step 0: from pyspark.sql import functions as F
Step 1: createDataFrame using the RDD as usual. Let's say df
Step 2: Use input_file_name()
df.withColumn("INPUT_FILE", F.input_file_name())
This will add a column to your DataFrame with the source file name.
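A compact end-to-end sketch of those steps (the directory path is the one from the question; the column name is illustrative):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Step 1: build a DataFrame from the directory of text files
df = spark.read.text("/mySource/dir1/*")

# Step 2: attach the originating file's path to every row
df = df.withColumn("INPUT_FILE", F.input_file_name())
df.show(truncate=False)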