Reading comma separated text file in spark 1.6 - scala

I have a text file similar to the one below:
20190920
123456789,6325,NN5555,123,4635,890,C,9
985632465,6467,KK6666,654,9780,636,B,8
258063464,6754,MM777,789,9461,895,N,5
And I am using Spark 1.6 with Scala to read this text file:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header","false").option("inferSchema","false").load(path)
df.show()
When I use the above command, it reads only the first column. Is there anything I need to add to read the file with all column values?
Output I got:
20190920
123456789
985632465
258063464

In this case you should provide a schema, so your code will look like this:
import org.apache.spark.sql.types.{StructField, StringType, StructType}
val mySchema = StructType(
List(
StructField("col1", StringType, true),
StructField("col2", StringType, true)
// and other columns ...
)
)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.schema(mySchema)
.option("header","false")
.option("inferSchema","false")
.load(path)

Related

Get the column names of malformed records while reading a csv file using pyspark

I am reading a CSV file using PySpark with a predefined schema.
schema = StructType([
StructField("col1", IntegerType(), True),
StructField("col2", StringType(), True),
StructField("col3", FloatType(), True)
])
df = (spark.sqlContext.read
.schema(schema)
.option("header", True)
.option("delimiter", ",")
.csv(path))
Now in the CSV file there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values whose data type differs from the one defined in the schema.
How do I achieve this?
In pyspark versions >2.2 you can use columnNameOfCorruptRecord with csv:
schema = StructType(
[
StructField("col1", IntegerType(), True),
StructField("col2", StringType(), True),
StructField("col3", FloatType(), True),
StructField("corrupted", StringType(), True),
]
)
df = spark.sqlContext.read.csv(
path,
schema=schema,
header=True,
sep=",",
mode="PERMISSIVE",
columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3|   corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
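To make the ambiguity concrete, here is a small plain-Python sketch (no Spark involved; the variable names are made up for illustration). It parses the two-line file above with the standard csv module and shows that the data row yields four fields for a two-column header, so nothing in the record says which column the extra values belong to.

```python
import csv
import io

# The two-line file from the example above: Euro amounts 0,10 and 1,00
# written with a decimal comma into a comma-delimited file.
raw = "col1,col2\n0,10,1,00\n"

rows = list(csv.reader(io.StringIO(raw)))
header, record = rows[0], rows[1]

# The header declares 2 columns, but the record splits into 4 fields.
print(header)
print(record)

# Any grouping of the 4 fields into 2 columns would be a guess, so only
# the record as a whole can be flagged as malformed.
record_is_malformed = len(record) != len(header)
print(record_is_malformed)
```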

Transform structured data to JSON format using Spark Scala

My "structured data" is shown below, and I need to transform it into the "expected results" form. My "output schema" is shown as well. I would appreciate some help on how to achieve this using Spark Scala code.
Note: The structured data is to be grouped on the columns SN and VIN.
There should be one row per SN and VIN combination; if either SN or VIN changes, the data should go in the next row.
Structured data:
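+-----------------+-------------+--------------------+---+
|VIN              |ST           |SV                  |SN |
+-----------------+-------------+--------------------+---+
|FU74HZ501740XXXXX|1566799999225|44.0                |APP|
|FU74HZ501740XXXXX|1566800002758|61.0                |APP|
|FU74HZ501740XXXXX|1566800009446|23.39               |ASP|
+-----------------+-------------+--------------------+---+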
Expected results:
Output schema:
val outputSchema = StructType(
List(
StructField("VIN", StringType, true),
StructField("EVENTS", ArrayType(
StructType(Array(
StructField("SN", StringType, true),
StructField("ST", IntegerType, true),
StructField("SV", DoubleType, true)
))))
)
)
From Spark 2.1 you can achieve this using struct and collect_list.
val df_2 = Seq(
("FU74HZ501740XXXX",1566799999225.0,44.0,"APP"),
("FU74HZ501740XXXX",1566800002758.0,61.0,"APP"),
("FU74HZ501740XXXX",1566800009446.0,23.39,"ASP")
).toDF("vin","st","sv","sn")
df_2.show(false)
+----------------+-----------------+-----+---+
|vin             |st               |sv   |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+
Use collect_list with struct:
df_2.groupBy("vin","sn")
.agg(collect_list(struct($"st", $"sv",$"sn")).as("events"))
.withColumn("events",to_json($"events"))
.drop(col("sn"))
This will give the wanted output:
+----------------+---------------------------------------------------------------------------------------------+
|vin |events |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}] |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
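For readers who want to see the grouping logic without a Spark session, here is a rough plain-Python equivalent of the groupBy / collect_list / struct / to_json chain above. It is only an illustration of what the aggregation computes, not Spark code; the `rows` list and variable names are assumptions, and `st` is kept as an integer here rather than a double.

```python
import json
from collections import defaultdict

# Same three sample rows as above: (vin, st, sv, sn).
rows = [
    ("FU74HZ501740XXXX", 1566799999225, 44.0, "APP"),
    ("FU74HZ501740XXXX", 1566800002758, 61.0, "APP"),
    ("FU74HZ501740XXXX", 1566800009446, 23.39, "ASP"),
]

# groupBy("vin", "sn") + collect_list(struct(st, sv, sn)):
# one list of event dicts per (vin, sn) pair.
groups = defaultdict(list)
for vin, st, sv, sn in rows:
    groups[(vin, sn)].append({"st": st, "sv": sv, "sn": sn})

# to_json(events), then drop "sn" from the grouping key.
result = [
    (vin, json.dumps(events, separators=(",", ":")))
    for (vin, _sn), events in groups.items()
]

for vin, events in result:
    print(vin, events)
```

As in the Spark output, the same VIN appears once per distinct SN, with the SN value still available inside each event.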
You can get it via SparkSession.
val df = spark.read.json("/path/to/json/file/test.json")
Here, spark is the SparkSession object.

Writing CSV file using Spark and scala - empty quotes instead of Null values

I'm using Spark 2.4.1 and Scala,
and trying to write a DataFrame to a CSV file.
It seems that null values are written to the CSV as "".
Is it possible to remove those empty quotes?
val data = Seq(
Row(1, "a"),
Row(5, "z"),
Row(5, null)
)
val schema = StructType(
List(
StructField("num", IntegerType, true),
StructField("letter", StringType, true)
)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.write.csv("location/")
The output looks like:
1,a
5,z
5,""
And I want it to be:
1,a
5,z
5,
What should I do?
Thanks!
You can use the writer's options; see the CSV-specific options (SaveMode is not related to the answer):
df.write
.option("nullValue", null)
.mode(SaveMode.Overwrite)
.csv("location/")
Try this:
df.write.option("nullValue",None).save("location/")

Handling schema mismatches in Spark

I am reading a csv file using Spark in Scala.
The schema is predefined and I am using it for reading.
This is the sample code:
// create the schema
val schema= StructType(Array(
StructField("col1", IntegerType,false),
StructField("col2", StringType,false),
StructField("col3", StringType,true)))
// Initialize Spark session
val spark: SparkSession = SparkSession.builder
.appName("Parquet Converter")
.getOrCreate
// Create a data frame from a csv file
val dataFrame: DataFrame =
spark.read.format("csv").schema(schema).option("header", false).load(inputCsvPath)
From what I have read, when reading CSV with Spark using a schema there are 3 options:
Set mode to DROPMALFORMED --> this will drop the lines that don't match the schema
Set mode to PERMISSIVE --> this will set the whole line to null values
Set mode to FAILFAST --> this will throw an exception when a mismatch is discovered
What is the best way to combine the options? The behaviour I want is to get the schema mismatches, print them as errors, and ignore those lines in my data frame.
Basically, I want a combination of FAILFAST and DROPMALFORMED.
Thanks in advance
This is what I eventually did:
I added to the schema the "_corrupt_record" column, for example:
val schema= StructType(Array(
StructField("col1", IntegerType,true),
StructField("col2", StringType,false),
StructField("col3", StringType,true),
StructField("_corrupt_record", StringType, true)))
Then I read the CSV using PERMISSIVE mode (the Spark default):
val dataFrame: DataFrame = spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "PERMISSIVE")
.load(inputCsvPath)
Now my data frame has an additional column that holds the raw text of rows with schema mismatches.
I filtered the rows that have mismatched data and printed them:
val badRows = dataFrame.filter("_corrupt_record is not null")
badRows.cache()
badRows.show()
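The behaviour asked for (report the mismatches, keep only the valid rows) can also be sketched in plain Python, independent of Spark. This is only an illustration of the logic; the sample input, `parse_row`, `good_rows` and `bad_rows` are hypothetical names, and the type checks mirror the three-column schema from the question.

```python
import csv
import io

# Hypothetical sample input: the second line has a non-integer col1.
raw = "1,a,x\noops,b,y\n3,c,\n"

def parse_row(fields):
    """Return (col1, col2, col3) or raise ValueError on a schema mismatch."""
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    col1 = int(fields[0])        # col1: integer
    col2 = fields[1]             # col2: string
    col3 = fields[2] or None     # col3: nullable string, empty becomes None
    return (col1, col2, col3)

good_rows, bad_rows = [], []
for fields in csv.reader(io.StringIO(raw)):
    try:
        good_rows.append(parse_row(fields))
    except ValueError:
        # Keep the raw line, much like the _corrupt_record column does.
        bad_rows.append(",".join(fields))

# Report the mismatches; good_rows is what the job would continue with.
for line in bad_rows:
    print("schema mismatch:", line)
```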
Just use DROPMALFORMED and follow the log. If malformed records are present, they are dumped to the log, up to the limit set by the maxMalformedLogPerPartition option.
spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "DROPMALFORMED")
.option("maxMalformedLogPerPartition", 128)
.load(inputCsvPath)

How to use double pipe as delimiter in CSV?

Spark 1.5 and Scala 2.10.6
I have a data file that is using "¦¦" as the delimiter. I am having a hard time parsing through this to create a data frame. Can multiple delimiters be used to create a data frame? The code works with a single broken pipe but not with multiple delimiters.
My Code:
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
val df_1 = sqlContext.read
.format("com.databricks.spark.csv")
.schema(customSchema_1)
.option("delimiter", "¦¦")
.load("example.txt")
Sample file:
12345¦¦ ¦¦10
I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used tab because I have commas in my data.
var df = spark.sqlContext.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.csv(spark.sqlContext.read.textFile("filename")
.map(line => line.split("\\|\\|").mkString("\t")))
Hope this helps someone else.
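The same replace-then-parse idea can be tried outside Spark with the Python standard library. This is a minimal sketch with made-up sample lines; it assumes the data contains no tab characters, and note that a || inside a quoted field would be rewritten as well.

```python
import csv

# Made-up sample lines using "||" as the delimiter; one value contains a comma.
lines = ["name||id", "foo||12", "brian, jr||34"]

# Replace the two-character delimiter with a tab, which a parser that
# only accepts single-character delimiters can handle.
tabbed = [line.replace("||", "\t") for line in lines]

reader = csv.reader(tabbed, delimiter="\t")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]
print(records)
```

Unlike the Scala version, no escaping is needed here: Python's str.replace works on literal text, whereas Scala's String.split takes a regular expression, which is why the pipes are escaped as "\\|\\|" above.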
EDIT:
As of spark 3.0.1 this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
|      name|  id|
+----------+----+
|       foo|  12|
|     brian|  34|
|cray||name| 123|
|      cray|name|
+----------+----+
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation and I checked the Spark 2.0 csv reader and it has the same requirement.
Given all of this, if your data is simple enough where you won't have entries containing ¦¦, I would load your data like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
|   ID|FILLER|CODE|
+-----+------+----+
|12345|      |  10|
+-----+------+----+
We read data with a custom delimiter and set custom column names for the data frame in the following way:
# Hold the new column names separately
headers ="JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>
# '^!~_*>' This is field delimiter, so split string
head = headers.split("^!~_*>")
## Below command splits the S3 file with custom delimiter and converts into Dataframe
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as a parameter to toDF() assigns the new column names to the DataFrame created from the text file with custom delimiters.
Hope this helps.
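The splitting step on its own can be checked in plain Python before running it through Spark. The header and data lines below are hypothetical examples, not taken from the real file.

```python
DELIM = "^!~_*>"

# Hypothetical header and data lines using the custom delimiter.
header_line = "JC^!~_*>Year^!~_*>Date"
data_line = "J001^!~_*>2019^!~_*>2019-09-20"

# str.split treats its argument as a literal string, not a regex,
# so characters like ^ and * need no escaping here.
head = header_line.split(DELIM)
values = data_line.split(DELIM)

# Pair each column name with its value, as toDF(head) does for a whole RDD.
record = dict(zip(head, values))
print(record)
```

One thing to watch for: if a header string ends with the delimiter, split returns a trailing empty string, which would become an extra, empty column name.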
Support for multi-character delimiters was added in Spark 3.0.
https://issues.apache.org/jira/browse/SPARK-24540
The solution proposed above by @lockwobr works in Scala. Anyone working with a Spark version below 3.0 and looking for a solution in PySpark can refer to the code below:
ratings_schema = StructType([
StructField("user_id", StringType(), False)
, StructField("movie_id", StringType(), False)
, StructField("rating", StringType(), False)
, StructField("rating_timestamp", StringType(), True)
])
#movies_df = spark.read.csv("ratings.dat", header=False, sep="::", schema=ratings_schema)
movies_df = spark.createDataFrame(
spark.read.text("ratings.dat").rdd.map(lambda line: line[0].split("::")),
ratings_schema)
I have provided an example, but you can modify it for your logic.