Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a CSV file into a DataFrame. I know what the schema should be because I know the file's layout. I am using the spark-csv package to read the file, and I tried to specify the schema as shown below.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
But when I check the schema of the resulting DataFrame, Spark seems to have used its own schema. Am I doing something wrong? How do I make Spark pick up the schema I specified?
> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)

Try the code below; you do not need to specify the schema. When inferSchema is set to true, Spark infers the column types from the data in your CSV file.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
If you want to manually specify the schema, you can do it as below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("project", StringType, true),
StructField("article", StringType, true),
StructField("requests", IntegerType, true),
StructField("bytes_served", DoubleType, true))
)
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.schema(customSchema)
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

For those interested in doing this in Python, here is a working version.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Column names come from the schema; with header="true" the file's header row is skipped
customSchema = StructType([
    StructField("IDGC", StringType(), True),
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])
productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)
Contents of testProduct.csv:
ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66
Hope this helps.

I'm using the solution provided by Arunakiran Nulu in my analysis (see the code below). Although it assigns the correct types to the columns, all the values returned are null. Previously I tried .option("inferSchema", "true"), which returns the correct values in the DataFrame (although with different types).
val customSchema = StructType(Array(
StructField("numicu", StringType, true),
StructField("fecha_solicitud", TimestampType, true),
StructField("codtecnica", StringType, true),
StructField("tecnica", StringType, true),
StructField("finexploracion", TimestampType, true),
StructField("ultimavalidacioninforme", TimestampType, true),
StructField("validador", StringType, true)))
val df_explo = spark.read
.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.schema(customSchema)
.load(filename)
Result
root
|-- numicu: string (nullable = true)
|-- fecha_solicitud: timestamp (nullable = true)
|-- codtecnica: string (nullable = true)
|-- tecnica: string (nullable = true)
|-- finexploracion: timestamp (nullable = true)
|-- ultimavalidacioninforme: timestamp (nullable = true)
|-- validador: string (nullable = true)
and the table is:
|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
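In some Spark versions, a row comes back as all nulls when one of its fields fails to parse as the declared type, for example when the data does not match timestampFormat. The post does not show the input file, so that cause is only an assumption here. A quick way to check it outside Spark is to try the format on a few raw values. The sketch below uses plain Python, with %Y/%m/%d %H:%M:%S as the strptime equivalent of yyyy/MM/dd HH:mm:ss; the sample values are made up.

```python
from datetime import datetime

# strptime equivalent of the Spark pattern "yyyy/MM/dd HH:mm:ss"
FORMAT = "%Y/%m/%d %H:%M:%S"

def matches_format(value, fmt=FORMAT):
    """Return True if value parses with fmt, False otherwise."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Hypothetical sample values, not taken from the original data
samples = ["2017/03/21 10:15:00", "21/03/2017 10:15", "2017-03-21 10:15:00"]
results = {s: matches_format(s) for s in samples}
for value, ok in results.items():
    print(f"{value!r}: {'ok' if ok else 'does not match'}")
```

If the values copied from the real file do not match, adjusting timestampFormat (or reading those columns as strings first) is the next thing to try.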

The previous solutions use a custom StructType.
With spark-sql 2.4.5 (Scala version 2.12.10) it is also possible to specify the schema as a string, using the schema method:
import org.apache.spark.sql.SparkSession;
val sparkSession = SparkSession.builder()
.appName("sample-app")
.master("local[2]")
.getOrCreate();
val pageCount = sparkSession.read
.format("csv")
.option("delimiter","|")
.option("quote","")
.schema("project string ,article string ,requests integer ,bytes_served long")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

Thanks to the answer by @Nulu, it works for pyspark with minimal tweaking:
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType

customSchema = StructType([
    StructField("project", StringType(), True),
    StructField("article", StringType(), True),
    StructField("requests", IntegerType(), True),
    StructField("bytes_served", DoubleType(), True),
])
# spark is the SparkSession (a SparkContext has no read attribute);
# the outer parentheses let the chained calls span several lines
pagecount = (
    spark.read.format("com.databricks.spark.csv")
    .option("delimiter", " ")
    .option("quote", "")
    .option("header", "false")
    .schema(customSchema)
    .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
)

Schema definition as a simple string
In case someone is interested, here is a schema defined as a simple string, including date and timestamp columns.
Create the data file from a terminal or shell:
echo "
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz
" > data.csv
Defining the schema as String
user_schema = 'timesta TIMESTAMP,date DATE,first_name STRING , last_name STRING'
reading the data
df = spark.read.csv(path='data.csv', schema = user_schema, sep=',', dateFormat='MM/dd/yyyy',timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')
df.show(10, False)
+-----------------------+----------+----------+---------+
|timesta |date |first_name|last_name|
+-----------------------+----------+----------+---------+
|2019-07-02 22:11:11.999|2019-01-01| Suresh | abc |
|2019-01-02 22:11:11.001|2020-01-01| Aadi | xyz |
+-----------------------+----------+----------+---------+
Note that defining the schema explicitly, instead of letting Spark infer it, also improves read performance, because inference requires an extra pass over the data.
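One thing to watch in the listing above: it prints .999 and .001 for inputs ending in .000999 and .000001, which suggests that this Spark version read the six fractional digits as a millisecond count rather than as microseconds. That reading is an inference from the printed output, not something the post states. As a reference point, plain Python parses the same strings as microseconds (%f), so comparing against it is one way to spot the discrepancy:

```python
from datetime import datetime

# Same two timestamps as in data.csv above
raw = ["2019-07-02 22:11:11.000999", "2019-01-02 22:11:11.000001"]

# %f reads up to six fractional digits as microseconds
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in raw]

for s, ts in zip(raw, parsed):
    print(s, "->", ts.microsecond, "microseconds")
```

If sub-millisecond precision matters, it is worth checking a few parsed values like this before trusting the column.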

Here's how you can work with a custom schema, as a complete demo.
Shell code:
echo "
Slingo, iOS
Slingo, Android
" > game.csv
Scala code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{asc, desc} // needed outside spark-shell
val customSchema = StructType(Array(
StructField("game_id", StringType, true),
StructField("os_id", StringType, true)
))
val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show
csv_df.orderBy(asc("game_id"), desc("os_id")).show
csv_df.createOrReplaceTempView("game_view")
val sort_df = sql("select * from game_view order by game_id, os_id desc")
sort_df.show

If your Spark version is 3.0.1, you can use the following Scala snippet:
val df = spark.read.format("csv").option("delimiter",",").option("header",true).load("file:///LOCAL_CSV_FILE_PATH")
Read this way, however, every column is typed as String unless you also set inferSchema or pass a schema.

// import Library
import java.io.StringReader ;
import au.com.bytecode.opencsv.CSVReader
//filename
var train_csv = "/Path/train.csv";
//read as text file
val train_rdd = sc.textFile(train_csv)
//use string reader to convert in proper format
var full_train_data = train_rdd.map{line => var csvReader = new CSVReader(new StringReader(line)) ; csvReader.readNext(); }
//declares types
type s = String
// declare case class for schema
case class trainSchema (Loan_ID :s ,Gender :s, Married :s, Dependents :s,Education :s,Self_Employed :s,ApplicantIncome :s,CoapplicantIncome :s,
LoanAmount :s,Loan_Amount_Term :s, Credit_History :s, Property_Area :s,Loan_Status :s)
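//create a DataFrame from the RDD, using the case class as the schema
val full_train_data_with_schema = full_train_data.mapPartitionsWithIndex { (idx, itr) =>
  // drop the header row, which is the first line of the first partition;
  // the result of drop must be used, since drop does not modify itr in place
  val rows = if (idx == 0) itr.drop(1) else itr
  rows.map(x => trainSchema(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12)))
}.toDF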

In PySpark 2.4 onwards, you can use the header parameter to take the column names from the first row of the file:
data = spark.read.csv('data.csv', header=True)
Similarly, in Scala you can use the header option.

You can also do it like this, using sparkSession and toDF to name the columns:
import org.apache.spark.sql.DataFrame
import sparkSession.implicits._
val pagecount:DataFrame = sparkSession.read
.option("delimiter"," ")
.option("quote","")
.option("inferSchema","true")
.csv("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
.toDF("project","article","requests","bytes_served")

This is one option: pass the column names while loading the CSV. Note that this example uses pandas rather than Spark.
import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))
Output
sepal-length sepal-width petal-length petal-width class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa

Here is my solution:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
val spark = org.apache.spark.sql.SparkSession.builder.
master("local[*]").
appName("Spark CSV Reader").
getOrCreate()
val movie_rating_schema = StructType(Array(
StructField("UserID", IntegerType, true),
StructField("MovieID", IntegerType, true),
StructField("Rating", DoubleType, true),
StructField("Timestamp", TimestampType, true)))
val df_ratings: DataFrame = spark.read.format("csv").
option("header", "true").
option("mode", "DROPMALFORMED").
option("delimiter", ",").
//option("inferSchema", "true").
option("nullValue", "null").
schema(movie_rating_schema).
load(args(0)) //"file:///home/hadoop/spark-workspace/data/ml-20m/ratings.csv"
val movie_avg_scores = df_ratings.rdd.map(_.toString()).
map(line => {
// drop "[", "]" and then split the str
val fields = line.substring(1, line.length() - 1).split(",")
//extract (movie id, rating)
(fields(1).toInt, fields(2).toDouble)
}).
groupByKey().
map(data => {
val avg: Double = data._2.sum / data._2.size
(data._1, avg)
})

Related

Cannot filter a structure of Strings with spark

I'm trying to filter lines from a dataframe with this structure :
|-- age: integer (nullable = true)
|-- qty: integer (nullable = true)
|-- dates: array (nullable = true)
| |-- element: timestamp (containsNull = true)
For example in this dataframe I only want the first row :
+---------+------------+------------------------------------------------------------------+
| age | qty |dates |
+---------+------------+------------------------------------------------------------------+
| 54 | 1| [2020-12-31 12:15:20, 2021-12-31 12:15:20] |
| 45 | 1| [2020-12-31 12:15:20, 2018-12-31 12:15:20, 2019-12-31 12:15:20] |
+---------+------------+------------------------------------------------------------------+
Here is my code :
val result = sqlContext
.table("scores")
result.filter(array_contains(col("dates").cast("string"),
2021)).show(false)
But I'm getting this error :
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(
due to data type mismatch: Arguments must be an array followed by a value of same type as the > array members;
Can anyone help please?
You need to use rlike to check whether each array element contains 2021. array_contains checks for an exact match, not a partial match.
result.filter("array_max(transform(dates, x -> string(x) rlike '2021'))").show(false)
You can explode the ArrayType column and then process it as you want: cast the exploded column to String, then apply your filter:
import org.apache.spark.sql.{Row, SparkSession, functions => f}
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("SparkByExamples")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import java.sql.Timestamp
import java.text.SimpleDateFormat
def convertToTimeStamp(s: String) = {
  // HH is the 24-hour clock; hh is the 12-hour clock and would misread
  // hour 12 when the input has no AM/PM marker
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val parsedDate = dateFormat.parse(s)
  new Timestamp(parsedDate.getTime)
}
val data = Seq(
Row(54, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2021-12-31 12:15:20"))),
Row(45, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2018-12-31 12:15:20"), convertToTimeStamp("2019-12-31 12:15:20")))
)
val Schema = StructType(Array(
StructField("age", IntegerType, nullable = true),
StructField("qty", IntegerType, nullable = true),
StructField("dates", ArrayType(TimestampType, containsNull = true), nullable = true)
))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd, Schema)
df.show()
df.printSchema()
df = df.withColumn("exp",f.explode(f.col("dates")))
df.filter(f.col("exp").cast(StringType).contains("2021")).show()
You can use exists function to check whether dates array contains a date with year 2021:
df.filter("exists(dates, x -> year(x) = 2021)").show(false)
//+---+---+------------------------------------------+
//|age|qty|dates |
//+---+---+------------------------------------------+
//|54 |1 |[2020-12-31 12:15:20, 2021-12-31 12:15:20]|
//+---+---+------------------------------------------+
If you want to use array_contains, you need to transform the timestamp elements into year:
df.filter("array_contains(transform(dates, x -> year(x)), 2021)").show(false)
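The difference between an exact match and a per-element predicate can be seen without Spark. The sketch below is only a plain-Python analogy, using the two rows from the question: list membership (like array_contains) compares whole elements, while any(...) with a year test mirrors exists(dates, x -> year(x) = 2021).

```python
from datetime import datetime

rows = [
    (54, 1, [datetime(2020, 12, 31, 12, 15, 20), datetime(2021, 12, 31, 12, 15, 20)]),
    (45, 1, [datetime(2020, 12, 31, 12, 15, 20), datetime(2018, 12, 31, 12, 15, 20),
             datetime(2019, 12, 31, 12, 15, 20)]),
]

# Exact-match membership: the number 2021 never equals a whole timestamp
exact = [row for row in rows if 2021 in row[2]]

# Per-element predicate: keep rows where any date falls in 2021
by_year = [row for row in rows if any(d.year == 2021 for d in row[2])]

print(len(exact))                   # 0
print([row[0] for row in by_year])  # [54]
```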

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I got this error while using the code below to drop a nested column with PySpark. Why is it not working? I tried using a tilde (~) instead of != as the error suggests, but that doesn't work either. What should I do in this case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm, df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType(
[
StructField('addresses',
StructType(
[StructField("state", StringType(), True),
StructField("street", StringType(), True),
StructField("country", StringType(), True),
StructField("code", IntegerType(), True)]
)
)
]
)
rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
df.show()  # show() prints the table itself and returns None
df.printSchema()
Output:
+--------------------+-----------+----+
| addresses| id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
df2.show()
Output:
+-----------+----+
| id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields, in a struct column, it's a bit more complicated - there are some other similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses| id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
| addresses| id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
Hope this helps!
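One caveat with the last snippet: a set difference does not preserve the order of the struct fields, so the rebuilt struct may list them in a different order. A minimal sketch of an order-preserving alternative in plain Python follows; the field names are the ones from the example above, and in the real code all_columns would come from df.select("addresses.*").columns.

```python
# Field names as returned by df.select("addresses.*").columns in the example above
all_columns = ["state", "street", "country", "code"]
columns_to_remove = ["country", "code"]

# A list comprehension keeps the original field order, unlike set subtraction
remove = set(columns_to_remove)
columns_to_keep = [c for c in all_columns if c not in remove]

# These qualified names can then be passed to struct(*...)
qualified = [f"addresses.{c}" for c in columns_to_keep]
print(qualified)  # ['addresses.state', 'addresses.street']
```

Using a comprehension also avoids the built-in filter, which gets shadowed if pyspark.sql.functions is star-imported in a PySpark version that defines functions.filter. Whether that caused the original error is an assumption, since the question does not show its imports.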

Manually create a pyspark dataframe

I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
[
StructField("time_epocs", DecimalType(), True),
StructField("lat", DecimalType(), True),
StructField("long", DecimalType(), True),
]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
The Spark documentation seems a bit convoluted to me, and I got similar errors when I tried to follow its instructions.
Does anyone know how to do this?
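Part of the problem in the question is plain Python rather than Spark: parentheses around a single value do not make a tuple, so row_in is a list of three scalars, not one row with three fields. The sketch below shows the difference and builds a single three-field row. Using decimal.Decimal values is an assumption about what the asker wanted; as far as I know, DecimalType() defaults to precision 10 and scale 0 and expects Decimal values rather than floats, so a wider type such as DecimalType(20, 6) would also be needed.

```python
from decimal import Decimal

# As written in the question: three scalars, because (x) is just x
row_in = [(1566429545575348), (40.353977), (-111.701859)]
print([type(v).__name__ for v in row_in])  # ['int', 'float', 'float']

# One row with three fields: a list containing a single 3-tuple
fixed_rows = [(Decimal("1566429545575348"), Decimal("40.353977"), Decimal("-111.701859"))]
print(len(fixed_rows), len(fixed_rows[0]))  # 1 3

# fixed_rows can then be passed to spark.createDataFrame(fixed_rows, schema)
```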
Simple dataframe creation:
df = spark.createDataFrame(
[
(1, "foo"), # create your data here, be consistent in the types.
(2, "bar"),
],
["id", "label"] # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to official doc:
when schema is a list of column names, the type of each column will be inferred from data. (example above ↑)
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)
# Example with a datatype string
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
"id int, label string", # add column names and types here
)
# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
T.StructType( # Define the whole schema within a StructType
[
T.StructField("id", T.IntegerType(), True),
T.StructField("label", T.StringType(), True),
]
),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from Pandas dataframe, schema will be inferred from Pandas dataframe's types :
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
{
"col1": [np.random.randint(10) for x in range(10)],
"col2": [np.random.randint(100) for x in range(10)],
}
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
To elaborate on and build off of @Steven's answer:
from pyspark.sql.types import FloatType, StringType, StructField, StructType

field = [
    StructField("MULTIPLIER", FloatType(), True),
    StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create an empty dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame a RDD and schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
(1, "foo"),
(2, "bar"),
],
StructType(
[
StructField("id", IntegerType(), False),
StructField("txt", StringType(), False),
]
),
)
print(df.dtypes)
df.show()
Extending @Steven's answer:
data = [(i, 'foo') for i in range(1000)] # sample data
columns = ['id', 'txt'] # add your column labels here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to specifically define schema then do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
For beginners, a full example importing data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
ShortType,
StringType,
StructType,
StructField,
TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
[
StructField("id", ShortType(), nullable=False),
StructField("string", StringType(), nullable=False),
StructField("datetime", TimestampType(), nullable=False),
]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)

Creating a Spark Vector Column with createDataFrame

I can make a Spark DataFrame with a vector column with the toDF method.
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
df.printSchema()
To create a Spark Vector Column with createDataFrame, you can use following code:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was the incompatible type org.apache.spark.ml.linalg.Vectors.dense, which is not a valid external type for the vector schema. So we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg.
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib

How to use double pipe as delimiter in CSV?

Spark 1.5 and Scala 2.10.6
I have a data file that is using "¦¦" as the delimiter. I am having a hard time parsing through this to create a data frame. Can multiple delimiters be used to create a data frame? The code works with a single broken pipe but not with multiple delimiters.
My Code:
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
val df_1 = sqlContext.read
.format("com.databricks.spark.csv")
.schema(customSchema_1)
.option("delimiter", "¦¦")
.load("example.txt")
Sample file:
12345¦¦ ¦¦10
I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested that. The way it works is that I replace each || with a tab, and the built-in CSV reader can then take a Dataset[String]. I used a tab because I have commas in my data.
var df = spark.sqlContext.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.csv(spark.sqlContext.read.textFile("filename")
.map(line => line.split("\\|\\|").mkString("\t")))
Hope this helps someone else.
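The preprocessing step above is plain string manipulation, so it can be tried without Spark. Here is a minimal sketch in Python (the Scala version does the same with split and mkString); the sample lines are made up, and the approach assumes the data itself contains no tab characters.

```python
import csv
import io

# Hypothetical sample lines using "||" as the field delimiter
lines = ["name||id", "foo||12", "brian, jr.||34"]

# Replace the two-character delimiter with a tab, as in the Scala snippet
tabbed = ["\t".join(line.split("||")) for line in lines]

# Any single-character-delimiter CSV reader can now parse the result
reader = csv.reader(io.StringIO("\n".join(tabbed)), delimiter="\t")
rows = list(reader)
print(rows)  # [['name', 'id'], ['foo', '12'], ['brian, jr.', '34']]
```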
EDIT:
As of spark 3.0.1 this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation and I checked the Spark 2.0 csv reader and it has the same requirement.
Given all of this, if your data is simple enough where you won't have entries containing ¦¦, I would load your data like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+
We read data with a custom delimiter and customized the column names of the data frame in the following way:
# Hold new column names separately
headers ="JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>
# '^!~_*>' This is field delimiter, so split string
head = headers.split("^!~_*>")
## Below command splits the S3 file with custom delimiter and converts into Dataframe
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as a parameter to toDF() assigns the new column names to the dataframe created from the text file with custom delimiters.
Hope this helps.
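Note that Python's str.split treats the separator as a literal string, which is why a delimiter full of regex metacharacters such as ^!~_*> works in the snippet above. With re.split the same delimiter would need escaping. A small sketch with a made-up record:

```python
import re

DELIM = "^!~_*>"

# Hypothetical record using the custom delimiter
line = "JC01^!~_*>2019^!~_*>01/02/2019^!~_*>Paid^!~_*>1200"

# str.split matches the delimiter literally
fields = line.split(DELIM)

# re.split needs the delimiter escaped, because ^ and * are regex metacharacters
fields_re = re.split(re.escape(DELIM), line)

print(fields)  # ['JC01', '2019', '01/02/2019', 'Paid', '1200']
```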
Support for multi-character delimiters was added in Spark 3.0:
https://issues.apache.org/jira/browse/SPARK-24540
The solution proposed above by @lockwobr works in Scala. If you are on a Spark version without multi-character delimiter support and are looking for a solution in PySpark, you can refer to the code below:
ratings_schema = StructType([
StructField("user_id", StringType(), False)
, StructField("movie_id", StringType(), False)
, StructField("rating", StringType(), False)
, StructField("rating_timestamp", StringType(), True)
])
#movies_df = spark.read.csv("ratings.dat", header=False, sep="::", schema=ratings_schema)
movies_df = spark.createDataFrame(
spark.read.text("ratings.dat").rdd.map(lambda line: line[0].split("::")),
ratings_schema)
I have provided an example, but you can modify it for your logic.