SparkDataFrame.dtypes fails if a column has special chars..how to bypass and read the csv and inferschema - pyspark

Inferring Schema of a Spark Dataframe throws error if the csv file has column with special chars..
Test sample
foo.csv
id,comment
1, #Hi
2, Hello
spark = SparkSession.builder.appName("footest").getOrCreate()
df= spark.read.load("foo.csv", format="csv", inferSchema="true", header="true")
print(df.dtypes)
raise ValueError("Could not parse datatype: %s" % json_value)
I found comment from Dat Tran on inferSchema in spark csv package how to resolve this...cann't we still inferschema before dataclean?

Use it like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("test19.csv")
print(df.dtypes)
Output:
[('id', 'int'), ('comment', 'string')]

Related

How to add a file name to a column in a data frame as multiple files are merged together?

How can I add a file_name column to a dataframe, as data is loading into the frame? So, I want the file_name to show for every record in the dataframe.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.
The input_file_name function creates a string column for the file name of the current Spark task.
import org.apache.spark.sql.functions.input_file_name
val df= spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())

Create pyspark dataframe from parquet file

I am quite new in pyspark and I am still trying to figure out who things work. What I am trying to do is after loading a parquet file in memory using pyarrow Itry to make it to pyspark dataframe. But I am getting an error.
I should mention that I am not reading directly through pyspark because the file in in s3 which gives me another error about "no filesystem for scheme s3"
so I am trying to work around. Below I have a reproducible example.
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
parquet_file=pq.ParquetDataset('s3filepath.parquet',filesystem=s3)
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.createDataFrame(parquet_file)
------------------------------------------------------------------
TypeError Traceback (most recent
call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)
/usr/local/spark/python/pyspark/sql/session.py in
createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema =
self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema =
self._createFromLocal(map(prepare, data), schema)
749 jrdd =
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf =
self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(),
schema.json())
TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read parquet files to spark df:
df = sqlC.read.parquet('path_to_file_or_dir')
You can read data from S3 via Spark as long as you have the public and secret keys for the S3 bucket ... this would be more efficient compared to going though arrow via pandas and then converting to spark dataframe because you would have to parallelize the serial read.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly

How to log malformed rows from Scala Spark DataFrameReader csv

The documentation for the Scala_Spark_DataFrameReader_csv suggests that spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
Based on this databricks example you need to explicitly add the "_corrupt_record" column to a schema definition when you read in the file. Something like this worked for me in pyspark 2.4.4:
from pyspark.sql.types import *
my_schema = StructType([
StructField("field1", StringType(), True),
...
StructField("_corrupt_record", StringType(), True)
])
my_data = spark.read.format("csv")\
.option("path", "/path/to/file.csv")\
.schema(my_schema)
.load()
my_data.count() # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
If you are using the spark 2.3 check the _corrupt_error special column ... according to several spark discussions "it should work " , so after the read filter those which non-empty cols - there should be your errors ... you could check also the input_file_name() sql func
if you are not using lower than version 2.3 you should implement a custom read , record solution, because according to my tests the _corrupt_error does not work for csv data source ...
I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupted record column, using the new schema to load the csv and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col
file_path = "/path/to/file"
mode = "PERMISSIVE"
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()
df.count()
df.filter(col("_corrupt_record").isNotNull()).show()

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL above, so it can return something like:
'select col_A, col_B from my_table'
After creating a Dataframe from parquet file, you have to register it as a temp table to run sql queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without creating the table on Spark DataFrame.
//This Spark 2.x code you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Charge the SQL Context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporal table:
ventas.registerTempTable("ventas")
Execute the query (in this line you can use toJSON to pass a JSON format or you can use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in intellij:
def groupPlaylistIds(): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder.appName("FollowCount")
.master("local[*]")
.getOrCreate()
val sc = spark.sqlContext
val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
d.printSchema()
val d1 = d.select("col1").filter(x => x!='-')
val d2 = d1.filter(col("col1").startsWith("searchcriteria"));
d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

Read .csv data in european format with Spark

I am currently doing my first attempts with Apache Spark.
I would like to read a .csv File with an SQLContext object, but Spark won't provide the correct results as the File is a european one (comma as decimal separator and semicolon used as value separator).
Is there a way to tell Spark to follow a different .csv syntax?
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Foo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("inferSchema","true")
.load("data.csv")
df.show()
A row in the relating .csv looks like this:
04.10.2016;12:51:00;1,1;0,41;0,416
Spark interprets the entire row as a column. df.show() prints:
+--------------------------------+
|Col1;Col2,Col3;Col4;Col5 |
+--------------------------------+
| 04.10.2016;12:51:...|
+--------------------------------+
In previous attempts to get it working df.show() was even printing more row-content where it now says '...' but eventually cutting the row at the comma in the third col.
You can just read as Test and split by ; or set a custom delimiter to the CSV format as in .option("delimiter",";")