I'm using Apache Spark v3.0.1 and Apache Sedona v1.1.1, and I'm trying to read a Shapefile into a SpatialRDD. I first tried the example provided by the Sedona library (more specifically, the code inside the testShapefileConstructor method), and it just worked. However, when I try to read another Shapefile, the actual data is missing even though the metadata is loaded correctly: calling count on the SpatialRDD gives me 0.
The shapefile I'm using is available here. It's the map of a Brazilian state. I tried with data from other states too, so I guess there's something wrong with those files.
And this is the code I used. I'm aware that the contents of the shapefile reside in a folder containing the .shp, .shx, .dbf and .prj files, so the variable path points to that folder.
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader
import org.apache.sedona.sql.utils.{Adapter, SedonaSQLRegistrator}
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.{SparkSession, DataFrame, Encoder}
object Main {
  def main(args: Array[String]) {
    val spark = SparkSession.builder
      .config("spark.master", "local[*]")
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .appName("test")
      .getOrCreate()

    SedonaSQLRegistrator.registerAll(spark)
    SedonaVizRegistrator.registerAll(spark)

    val path = "/path/to/shapefile/folder"
    val spatialRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, path)
    println(spatialRDD.fieldNames)
    println(spatialRDD.rawSpatialRDD.count())

    var rawSpatialDf = Adapter.toDf(spatialRDD, spark)
    rawSpatialDf.show()
    rawSpatialDf.printSchema()
  }
}
Output:
[ID, CD_GEOCODM, NM_MUNICIP]
0
+--------+---+----------+----------+
|geometry| ID|CD_GEOCODM|NM_MUNICIP|
+--------+---+----------+----------+
+--------+---+----------+----------+
root
|-- geometry: geometry (nullable = true)
|-- ID: string (nullable = true)
|-- CD_GEOCODM: string (nullable = true)
|-- NM_MUNICIP: string (nullable = true)
I tried changing the character encoding, as pointed out here, but the results were the same after these attempts:
System.setProperty("sedona.global.charset", "utf8")
and
System.setProperty("sedona.global.charset", "iso-8859-1")
So I still have no idea why this fails to be read. What could be the problem?
Currently Sedona only supports the Shapefile types Point, Polyline, Polygon, and MultiPoint (i.e., types 1, 3, 5, and 8), according to https://github.com/apache/incubator-sedona/blob/master/core/src/main/java/org/apache/sedona/core/formatMapper/shapefileParser/parseUtils/shp/ShapeType.java
But your data might be something else, because the Shapefile specification supports more types: https://en.wikipedia.org/wiki/Shapefile
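If you want to confirm which type your file actually is, the shape type code is stored at byte offset 32 of the 100-byte .shp header as a little-endian integer, so you can check it without any GIS tooling. A minimal sketch in plain Scala (the path below is a placeholder for your .shp file):
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Paths}

// Reads the whole .shp file and decodes the shape type code from the header
// (1 = Point, 3 = PolyLine, 5 = Polygon, 8 = MultiPoint; the teens are the Z variants
// and the twenties the M variants, which Sedona does not read)
val bytes = Files.readAllBytes(Paths.get("/path/to/shapefile/folder/file.shp"))
val shapeType = ByteBuffer.wrap(bytes, 32, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
println(s"Shape type code: $shapeType")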
I had the same problem using the Wegsegment shapefile from https://www.geopunt.be/download?container=wegenregister&title=Wegenregister (which is the Flemish road register).
I could open the file just fine with QGIS, so I exported it from there with GeometryType set to LineString instead of Automatic, and that export read fine in Sedona. I noticed the original had LineStringM features (you can see this if you just add the layer and then hover over it). When I examined the m ordinate (cf. https://gis.stackexchange.com/a/274117/211228), it turned out to be empty, so I don't think I lost anything there. Yours seems to be a Polygon, but exporting with type Polygon instead of Automatic also makes it readable by Sedona.
Related
I'm trying to upgrade from Spark 2.1 to 2.2. When I try to read or write a DataFrame to a location (CSV or JSON), I receive this error:
Illegal pattern component: XXX
java.lang.IllegalArgumentException: Illegal pattern component: XXX
at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282)
at org.apache.commons.lang3.time.FastDatePrinter.init(FastDatePrinter.java:149)
at org.apache.commons.lang3.time.FastDatePrinter.<init>(FastDatePrinter.java:142)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:384)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:369)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:91)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:88)
at org.apache.commons.lang3.time.FormatCache.getInstance(FormatCache.java:82)
at org.apache.commons.lang3.time.FastDateFormat.getInstance(FastDateFormat.java:165)
at org.apache.spark.sql.catalyst.json.JSONOptions.<init>(JSONOptions.scala:81)
at org.apache.spark.sql.catalyst.json.JSONOptions.<init>(JSONOptions.scala:43)
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:279)
I am not setting a default value for dateFormat, so I don't understand where it is coming from.
spark.createDataFrame(objects.map((o) => MyObject(t.source, t.table, o.partition, o.offset, d)))
.coalesce(1)
.write
.mode(SaveMode.Append)
.partitionBy("source", "table")
.json(path)
I still get the error with this:
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder.appName("Spark2.2Test").master("local").getOrCreate()
import spark.implicits._

val agesRows = List(Person("alice", 35), Person("bob", 10), Person("jill", 24))
val df = spark.createDataFrame(agesRows).toDF()

df.printSchema
df.show
df.write.mode(SaveMode.Overwrite).csv("my.csv")
Here is the schema:
root
|-- name: string (nullable = true)
|-- age: long (nullable = false)
I found the answer.
The default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX, and the XXX component is what the exception rejects as an illegal pattern. The format needs to be set explicitly when you write the DataFrame out.
The fix is to change XXX to ZZ, which still includes the timezone:
df.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.mode(SaveMode.Overwrite)
.csv("my.csv")
Ensure you are using the correct version of commons-lang3
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
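If your build uses sbt instead of Maven, the equivalent (under the same assumption that 3.5 is the version your Spark distribution expects) would be:
// build.sbt
libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.5"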
Using commons-lang3-3.5.jar fixed the original error. I didn't check the source code to tell exactly why, but it is not surprising, since the original exception happens at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282). I also noticed the file /usr/lib/spark/jars/commons-lang3-3.5.jar (on an EMR cluster instance), which also suggests 3.5 is the consistent version to use.
I also ran into this problem, and in my case the cause was that I had put a badly formatted JSON file into HDFS. After I uploaded a correctly formatted text/JSON file, everything worked.
I have an issue where we load a JSON file into Spark, store it as Parquet, and then try to access the Parquet file from Impala; Impala complains about the column names because they contain characters that are illegal in SQL.
One of the "features" of the JSON files is that they don't have a predefined schema. I want Spark to create the schema, and then I have to modify any field names that contain illegal characters.
My first thought was to use withColumnRenamed on the fields of the DataFrame, but I believe that only works on top-level fields, so I could not use it since the JSON contains nested data.
So I created the following code, which recreates the DataFrame's schema by going recursively through the structure, and then I use that new schema to recreate the DataFrame.
(Code updated with Jacek's suggested improvement of using the Scala copy constructor.)
def replaceIllegal(s: String): String = s.replace("-", "_").replace("&", "_").replace("\"", "_").replace("[", "_").replace("]", "_")
def removeIllegalCharsInColumnNames(schema: StructType): StructType = {
  StructType(schema.fields.map { field =>
    field.dataType match {
      case struct: StructType =>
        field.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(struct))
      case _ =>
        field.copy(name = replaceIllegal(field.name))
    }
  })
}
sparkSession.createDataFrame(df.rdd, removeIllegalCharsInColumnNames(df.schema))
This works. But is there a better / simpler way to achieve what I want to do?
And is there a better way to replace the existing schema on a DataFrame? The following code did not work:
df.select($"*".cast(removeIllegalCharsInColumnNames(df.schema)))
It gives this error:
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'cast'
I think the best bet would be to convert the Dataset (before you save as a parquet file) to an RDD and use your custom schema to describe the structure as you want.
val targetSchema: StructType = ...
val fromJson: DataFrame = ...
val targetDataset = spark.createDataFrame(fromJson.rdd, targetSchema)
See the example in SparkSession.createDataFrame as a reference; note, however, that it uses an RDD directly, while you're going to create the DataFrame from a Dataset.
val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)
val people =
  sc.textFile("examples/src/main/resources/people.txt").map(
    _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val dataFrame = sparkSession.createDataFrame(people, schema)
dataFrame.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)
But as you mentioned in your comment (that I later merged to your question):
JSON files don't have a predefined schema.
With that said, I think your solution is a correct one. Spark does not offer anything similar out of the box and I think it's more about developing a custom Scala code that would traverse a StructType/StructField tree and change what's incorrect.
What I would suggest changing in your code is to use the copy constructor (a feature of Scala's case classes; see A Scala case class ‘copy’ method example) so that only the incorrect name is changed and the other properties are left untouched.
Using copy constructor would (roughly) correspond to the following code:
// was
// case s: StructType =>
// StructField(replaceIllegal(field.name), removeIllegalCharsInColumnNames(s), field.nullable, field.metadata)
s.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(s))
There are some design patterns in functional languages (in general) and Scala (in particular) that could deal with the deep nested structure manipulation, but that might be too much (and I'm hesitant to share it).
I therefore think that the question is in its current "shape" more about how to manipulate a tree as a data structure not necessarily a Spark schema.
I have a CSV file, test.csv:
col
1
2
3
4
When I read it using Spark, it gets the schema of data correct:
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
df.printSchema
root
|-- col: integer (nullable = true)
But when I override the schema of the CSV file and set inferSchema to false, the SparkSession only partially picks up the custom schema.
val df = spark.read.option("header", "true").option("inferSchema", "false").schema(StructType(List(StructField("custom", StringType, false)))).csv("test.csv")
df.printSchema
root
|-- custom: string (nullable = true)
I mean that only the column name (custom) and the DataType (StringType) are picked up; the nullable flag is ignored, since the column still comes back as nullable = true, which is incorrect.
I am not able to understand this behavior. Any help is appreciated!
Consider this excerpt from the documentation about Parquet (a popular "Big Data" storage format):
"Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons."
CSV is handled the same way for the same reason.
As for what "compatibility reasons" means, Nathan Marz in his book Big Data describes that an ideal storage schema is both strongly typed for integrity and flexible for evolution. In other words, it should be easy to add and remove fields and not have your analytics blow up. Parquet is both typed and flexible; CSV is just flexible. Spark honors that flexibility by making columns nullable no matter what you do. You can debate whether you like that approach.
A SQL table has schemas rigorously defined and hard to change--so much so Scott Ambler wrote a big book on how to refactor them. Parquet and CSV are much less rigorous. They are both suited to the paradigms for which they were built, and Spark's approach is to take the liberal approach typically associated with "Big Data" storage formats.
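You can see this behavior directly with a quick round trip; a minimal sketch, assuming an existing SparkSession named spark and a throwaway path:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a DataFrame whose declared schema says nullable = false ...
val strict = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("a"), Row("b"))),
  StructType(Seq(StructField("col", StringType, nullable = false))))
strict.printSchema()  // |-- col: string (nullable = false)

// ... and watch it come back as nullable = true after a round trip through Parquet (or CSV)
strict.write.mode("overwrite").parquet("/tmp/nullable-roundtrip")  // placeholder path
spark.read.parquet("/tmp/nullable-roundtrip").printSchema()        // |-- col: string (nullable = true)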
I believe the inferSchema option is common to, and applies to, all the columns in a DataFrame. If we want to change the nullable property of a specific column, we can handle it with something like:
setNullableStateOfColumn(df, "col", false)
def setNullableStateOfColumn(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
  // get the current schema
  val schema = df.schema
  // modify the StructField with name `cn`
  val newSchema = StructType(schema.map {
    case StructField(c, t, _, m) if c.equals(cn) => StructField(c, t, nullable = nullable, m)
    case y: StructField => y
  })
  // apply the new schema
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
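Hypothetical usage against the DataFrame from the question, just to show the call and its effect:
// Flip the "custom" column to non-nullable and verify
val dfStrict = setNullableStateOfColumn(df, "custom", nullable = false)
dfStrict.printSchema()
// root
//  |-- custom: string (nullable = false)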
There is a similar thread for setting the nullable property of an element,
Change nullable property of column in spark dataframe
I am having difficulty splitting the contents of a dataframe column using Spark 1.4.1 for a nested gz file. I used the map function to map the attributes of the gz file.
The data is in the following format:
"id": "tag:1234,89898",
"actor":
{
"objectType": "person",
"id": "id:1234",
"link": "http:\wwww.1234.com/"
},
"body",
I am using the following code to split the columns and read the data file.
val dataframe = sc.textFile("filename.dat.gz")
  .map(_.split(","))
  .map(r => (r(0), r(1), r(2)))
  .toDF()
dataframe.printSchema()
But the result is something like:
root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)
This is the incorrect format. I want the schema to be in this format:
----- id
----- actor
---objectType
---id
---link
-----body
Am I doing something wrong? I need to use this code to do some pre-processing on my data set and apply some transformations.
This data looks like JSON. Fortunately, Spark supports the easy ingestion of JSON data using Spark SQL. From the Spark Documentation:
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file.
Here is a modified version of the example from the docs:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val myData = sc.textFile("myPath").map(s => makeValidJSON(s))
val myNewData = sqlContext.read.json(myData)
// The inferred schema can be visualized using the printSchema() method.
myNewData.printSchema()
For the makeValidJSON function you just need to concentrate on some string parsing/manipulation strategies to get it right.
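For instance, if every record turns out to be a JSON object body that is only missing its enclosing braces (roughly what the sample in the question looks like), makeValidJSON could be as small as the sketch below; treat it as a placeholder, since it depends entirely on how your records are actually delimited:
// Naive sketch: drop a trailing comma and wrap the record in braces if they are missing.
// Replace with real parsing logic to match your actual record format.
def makeValidJSON(s: String): String = {
  val trimmed = s.trim.stripSuffix(",")
  if (trimmed.startsWith("{")) trimmed else s"{$trimmed}"
}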
Hope this helps.
I have figured out how to use the spark-shell to show the field names, but it's ugly and does not include the types:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
println(sqlContext.parquetFile(path))
prints:
ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None
You should be able to do this:
sqlContext.read.parquet(path).printSchema()
From Spark docs:
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
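And if you want the field names and types programmatically rather than as printed output, the DataFrame's schema is a StructType that you can walk; a short sketch:
// Inspect the schema as data instead of printing it
val schema = sqlContext.read.parquet(path).schema
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}"))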
OK, I think I have an OK way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is; what if it happens to be empty? I'm sure there has to be a better solution.)
sqlContext.parquetFile(p).first()
At some point prints:
{
optional binary cust_id;
optional binary blar;
optional double foo;
}
fileSchema: message schema {
optional binary cust_id;
optional binary blar;
optional double foo;
}
The result of parquetFile() is a SchemaRDD (1.2) or a DataFrame (1.3), both of which have the .printSchema() method.
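So, under the older API from the question, a one-liner like the following should be enough (parquetFile is deprecated in newer versions in favour of read.parquet):
sqlContext.parquetFile(path).printSchema()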