I am very new to Scala and I have the following issue.
I have a Spark dataframe with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema beforehand, but instead change the existing one.
I have tried the following
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it into an RDD, using map to change the elements, and then turning it back into a dataframe, but I end up with the type Array[WrappedArray] and I am not sure how to handle it.
Using PySpark and NumPy, I could do this with df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))
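For context, here is a minimal runnable sketch (the sample data is assumed, and spark refers to an active SparkSession):
import org.apache.spark.sql.functions.col
import spark.implicits._
// Sample dataframe matching the schema in the question (assumed data).
val df = Seq(
  ("hello", Seq("0.1", "0.2")),
  ("world", Seq("1.5", "2.5"))
).toDF("word", "vector")
// Cast the whole column; each string element becomes a double.
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))
df2.printSchema()
// root
//  |-- word: string (nullable = true)
//  |-- vector: array (nullable = true)
//  |    |-- element: double (containsNull = true)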
I am using Spark to process some data stored in an XML file.
I successfully loaded my data and printed the schema:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.load(myPath+"/myfile.xml")
df.printSchema
This gives me a result that looks like this:
root
|-- _id: string (nullable = true)
|-- _type: string (nullable = true)
|-- creationDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
|-- lastUpdateDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
From this data, I want to extract only certain fields, which should be easy with a select. So I make the following request:
df.select("_id","creationDate._value","lastUpdateDate._value")
But I get the error :
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(_VALUE,StringType,true), StructField(_value,StringType,true);
My problem is that Spark SQL is not case-sensitive, my file contains both the fields _value and _VALUE, and I can't change my input file.
Is there a way to solve this problem with Spark?
spark-xml creates a _VALUE field when an XML tag has no child elements, and this default name conflicts with your existing _value field.
You can change the default _VALUE name with the valueTag option while reading the XML:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.option("valueTag", "anyName")
.load(myPath+"/myfile.xml")
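With the generated value tag renamed, the original select should no longer be ambiguous, since _value now matches only the field that actually exists in your file:
// _value is unambiguous now that the auto-generated field is called anyName.
df.select("_id", "creationDate._value", "lastUpdateDate._value")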
Hope this helps!
I grouped by a few columns and am getting WrappedArray out of these columns, as you can see in the schema. How do I get rid of them so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
Getting a dataFrame:
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema with final_df.printSchema gives us:
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
Sample current output (screenshot omitted).
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output:
-1.0,0.0
-1.0,0.0
In the case where collect_list will always return only one value, use first instead; then there is no need to deal with an array at all. Note that this should be done during the groupBy step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val final_df = df.groupBy(...)
.agg(first($"relev").as("rel"),
first($"relev2").as("rel2"))
Try col(x).getItem:
import org.apache.spark.sql.functions.col

groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
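Applied to both columns, and keeping only the extracted values, a sketch might look like this:
import org.apache.spark.sql.functions.col
// Take the first (and, per the question, only) element of each array.
val flat_df = groupedBy_DF.select(
    groupedBy_DF("collect_list(relev)").as("rel"),
    groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", col("rel").getItem(0))
  .withColumn("rel2", col("rel2").getItem(0))
flat_df.orderBy(col("rel")).show()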
Try split (note that this only applies if rel is a comma-separated string rather than an array):
import org.apache.spark.sql.functions._
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
.withColumn("rel",split("rel",","))
I have a data frame with a structure like this:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
I want to retrieve all npaNumber from all the rows in the dataframe.
My approach was to iterate over all rows in the data frame and extract, for each one, the value stored in the npaNumber field of the npaHeaderData column. So I wrote the following lines:
parquetFileDF.foreach { newRow =>
//To retrieve the second column
val column = newRow.get(1)
//The following line is not allowed
//val npaNumber= column.getAs[String]("npaNumber")
println(column)
}
The content of column printed in each iteration looks like:
[207400956,27FEB17,09.30.00]
But column is of type Any and I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?
Thanks
If you are looking to extract only npaNumber, then you can do:
parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))
You should now have a dataframe with the npaNumber column only.
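If you then need the values on the driver, here is a sketch (this assumes the column fits in driver memory and that spark is your SparkSession):
import spark.implicits._
// Collect all npaNumber values back to the driver.
val npaNumbers: Array[String] = parquetFileDF
  .select($"npaHeaderData.npaNumber")
  .as[String]
  .collect()
npaNumbers.foreach(println)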
You can call select() on the dataframe, which will give you a new dataframe with only the specified column:
val newDataFrame = dataFrame.select(dataFrame("npaHeaderData.npaNumber").as("npaNumber"))
You can do as below, which avoids the [] wrappers when reading data from a data frame.
// ids is a DataFrame with columns {id, name}
val idRDDs = ids.rdd.map(x => x.getAs[String](0))
// Bring the values to the driver and print them.
idRDDs.collect().foreach(println)
The above approach should solve your issue.
I have a UDF which returns an array of tuples:
val df = spark.range(1).toDF("i")
import org.apache.spark.sql.functions.udf

val myUDF = udf((l: Long) => {
Seq((1,2))
})
df.withColumn("udf_result",myUDF($"i"))
.printSchema
gives
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
I want to rename the elements of the struct to something meaningful instead of _1 and _2; how can this be achieved? Note that I'm aware that returning a Seq of case classes would allow me to give proper field names, but using Spark-Notebook (REPL) with Yarn we have many issues using case classes, so I'm looking for a solution without case classes.
I'm using Spark 2, but with untyped DataFrames the solution should also be applicable to Spark 1.6.
It is possible to cast the output of the UDF. E.g., to rename the struct fields to x and y, you can do:
type-safe:
import org.apache.spark.sql.types._

val schema = ArrayType(
StructType(
Array(
StructField("x",IntegerType),
StructField("y",IntegerType)
)
)
)
df.withColumn("udf_result",myUDF($"i").cast(schema))
or, unsafe but shorter, using the string argument to cast:
df.withColumn("udf_result",myUDF($"i").cast("array<struct<x:int,y:int>>"))
both will give the schema
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: integer (nullable = true)
| | |-- y: integer (nullable = true)
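Once the fields are named, they can be referenced directly, for example:
// Access the renamed fields of the first array element.
df.withColumn("udf_result", myUDF($"i").cast("array<struct<x:int,y:int>>"))
  .select($"udf_result".getItem(0).getField("x").as("x"))
  .show()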
My row RDD looks like this:
Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP])]])
I got this from converting a dataframe to an RDD; the dataframe schema was:
root
|-- article_id: long (nullable = true)
|-- sentence: struct (nullable = true)
| |-- sentence: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tokens: string (nullable = true)
| | | |-- ner: string (nullable = true)
| | | |-- pos: string (nullable = true)
Now how do I access elements in the row RDD? In the dataframe I could use df.select("sentence"). I want to access nested elements such as Standford and the other attribute fields.
As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] like you would access any other element in an RDD. Accessing the elements in a row can be done in a couple of ways. Either simply call get like this:
rowRDD.map(row => row.get(2).asInstanceOf[MyType])
or, if it is a built-in type, you can avoid the type cast:
rowRDD.map(row => row.getList(4))
or you might want to simply use pattern matching, like:
rowRDD.map{case Row(field1: Long, field2: MyType) => field2}
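For the schema in this question specifically, here is a sketch (the field positions are assumed from the printed schema):
import org.apache.spark.sql.Row
// Extract the tokens of every attribute in each sentence.
val tokensRDD = rowRDD.map { row =>
  val sentence   = row.getStruct(1)        // the nested `sentence` struct
  val attributes = sentence.getSeq[Row](1) // the `attributes` array
  attributes.map(_.getAs[String]("tokens"))
}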
I hope this helps :)