Spark-xml: cannot read value of an element with attribute - apache-spark-xml

I am trying to use Spark-xml to read the XML file at the link https://www.dropbox.com/s/yg66o0tfwipx3mu/PMC1249490.xml?dl=0
It is a research article, and I am interested in the text of the abstract. The schema of the entire XML file seems to be inferred correctly, but the abstract element is missing the text data: it shows the attribute value (called P1) and only the words enclosed in brackets.
Can anyone help me?
Below is the code I am using:
import pandas as pd
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.13.0 pyspark-shell'
spark = SparkSession.builder.appName("XML_Import").master("local[*]").getOrCreate()
df = spark.read.format('xml').options(rowTag="front").load('PMC1249490.xml')
df.select("article-meta.abstract").show(truncate=False)
+-------------------------------------------------+
|abstract |
+-------------------------------------------------+
|{{P1, [Dictyostelium discoideum, D. discoideum]}}|
+-------------------------------------------------+
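For reference, a small diagnostic sketch (not part of the original question) that prints the inferred schema of the abstract struct; spark-xml normally stores the text of an element that also carries attributes under its valueTag column (default _VALUE), so the printed field names show whether the text was captured at all:
# Diagnostic only: inspect how spark-xml inferred the abstract element.
# Assumes the same SparkSession and file as in the question above.
df = spark.read.format('xml').options(rowTag="front").load('PMC1249490.xml')
df.select("article-meta.abstract").printSchema()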

Related

how to find length of string of array of json object in pyspark scala?

I have one column in a DataFrame with the format '[{jsonobject},{jsonobject}]'; here the length would be 2.
I need to find the length of this array and store it in another column.
I've only worked with PySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
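For completeness, a minimal usage sketch of the answer above (the sample data and column name are made up for illustration); it stores the array length in a new column, as the question asks:
# Hypothetical example data: a string column holding a JSON array of objects.
from pyspark.sql import SparkSession, functions as f, types as t

spark = SparkSession.builder.appName("json_array_length").getOrCreate()
df = spark.createDataFrame([('[{"a": "1"}, {"b": "2"}]',)], ["input"])

json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
# from_json parses the string into an array; size counts its elements (2 here).
df.withColumn("num_objects", f.size(f.from_json(df.input, json_schema))).show()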

Filtering rows which are causing datatype parsing issue in spark

I have a Spark DataFrame with a column Salary as shown below:
|Salary|
|"100"|
|"200"|
|"abc"|
The default datatype is string. I want to convert it to Integer while removing the rows that cause a parsing issue.
Desired Output
|Salary|
|100|
|200|
Can someone please let me know the code for filtering out the rows that will cause a datatype parsing issue?
Thanks in advance.
You can filter the desired field with a regex and then cast the column:
import org.apache.spark.sql.types._
import spark.implicits._  // needed for the $"Salary" column syntax

df.filter(row => row.getAs[String]("Salary").matches("""\d+"""))
  .withColumn("Salary", $"Salary".cast(IntegerType))
You can also do it with Try if you don't like regex:
import scala.util._
df.filter(row => Try(row.getAs[String]("Salary").toInt).isSuccess)
.withColumn("Salary", $"Salary".cast(IntegerType))

Read files with different column order

I have a few CSV files with headers, but I found out that some files have different column orders. Is there a way to handle this with Spark where I can define the select order for each file, so that the master DF doesn't have a mismatch where col x might have values from col y?
My current read -
val masterDF = spark.read.option("header", "true").csv(allFiles:_*)
Extract all the file names and store them in a list variable.
Then define a schema with all the required columns in it.
Iterate through each file with header set to true, so each file is read separately.
unionAll each new dataframe with the existing dataframe.
Example:
file_lst = ['<path1>', '<path2>']
from pyspark.sql.functions import *
from pyspark.sql.types import *
#define schema for the required columns
schema = StructType([StructField("column1", StringType(), True), StructField("column2", StringType(), True)])
#create an empty dataframe
df = spark.createDataFrame([], schema)
for i in file_lst:
    tmp_df = spark.read.option("header", "true").csv(i).select("column1", "column2")
    df = df.unionAll(tmp_df)
#display results
df.show()
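As an alternative sketch (not part of the original answer): if every file has the same column names, just in a different order, DataFrame.unionByName (available since Spark 2.3) aligns columns by name rather than by position, which avoids the empty-dataframe bootstrap. The file paths below are placeholders:
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union_by_name").getOrCreate()

file_lst = ['<path1>', '<path2>']
# Read each file separately, then union them by column name.
dfs = [spark.read.option("header", "true").csv(p) for p in file_lst]
masterDF = reduce(lambda a, b: a.unionByName(b), dfs)
masterDF.show()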

Pass RDD in scala function. Output Dataframe

Say I have the below CSV, and many more like it.
val csv = sc.parallelize(Array(
  "col1, col2, col3",
  "1, cat, dog",
  "2, bird, bee"))
I would like to apply the function below to the RDD to convert it to a DataFrame with the desired logic. I keep running into error: not found: value DataFrame.
How can I correct this?
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
In most cases I would read CSV files directly as a dataframe using Spark's core functionality, but I am unable to in this case.
Any/all help is appreciated.
In order not to get error: not found: value DataFrame, you must add the following import:
import org.apache.spark.sql.DataFrame
and your method declaration should be like this:
def udf(fName : RDD[String]): DataFrame = { ...
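For comparison only, a hypothetical PySpark sketch of the overall task (turning an RDD of CSV lines, header included, into a DataFrame); the sample data comes from the question, everything else is illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_to_df").getOrCreate()
sc = spark.sparkContext

csv = sc.parallelize([
    "col1, col2, col3",
    "1, cat, dog",
    "2, bird, bee",
])

header = csv.first()                             # first line holds the column names
cols = [c.strip() for c in header.split(",")]
rows = (csv.filter(lambda line: line != header)  # drop the header line
           .map(lambda line: [v.strip() for v in line.split(",")]))

df = spark.createDataFrame(rows, cols)           # all columns inferred as strings
df.show()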

Convert dataframe to json in scala

Assuming I have a word count example where I get a dataframe with the word in one column and the count in another column, I want to collect the same and store it as an array of JSON in a Mongo collection.
e.g. for the dataframe:
|Word | Count |
| abc | 1 |
| xyz | 23 |
I should get the json like:
{words:[{word:"abc",count:1},{word:"xyz",count:23}]}
When I tried .toJSON on the dataframe, collected the value as a list, and added it to a dataframe, the result that got stored in my Mongo was a collection of strings rather than a collection of JSON.
Query used:
explodedWords1.toJSON.toDF("words").agg(collect_list("words")).toDF("words")
result : "{\"words\":[{\"word\":\"abc\",\"count\":1},{\"word\":\"xyz\",\"count\":23}]}"
I am new to Scala. Any help will be good. (It will be helpful if no external package is used.)
The absolute best way to store data from dataframes into Mongo is using the
MongoDB Spark Connector (https://docs.mongodb.com/spark-connector/master/).
Just add "org.mongodb.spark" %% "mongo-spark-connector" % "2.2.0" to your sbt dependencies and check the code below:
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("test")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/dbname")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/dbname")
  .getOrCreate()

import spark.implicits._

val explodedWords1 = List(
  ("abc", 1),
  ("xyz", 23)
).toDF("Word", "Count")

MongoSpark.save(explodedWords1.write.option("collection", "wordcount").mode("overwrite"))
However, if you do want the results as a single JSON file, then the script below should do it:
explodedWords1.repartition(1).write.json("/tmp/wordcount")
Finally, if you want the JSON as a list of strings in your Scala code, just use:
explodedWords1.toJSON.collect()
Update:
I didn't see that you wanted all records aggregated into one field ("words").
If you use the code below, then all three methods above still function (swapping explodedWords1 with aggregated):
import org.apache.spark.sql.functions._
val aggregated = explodedWords1.agg(
  collect_list(map(lit("word"), 'Word, lit("count"), 'Count)).as("words")
)
Option 1: explodedWords1
Option 2: aggregated
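For comparison, a PySpark sketch of the same aggregation (assuming the same Word/Count columns; collect_list of a struct builds the nested words array, and toJSON renders it as the desired JSON string):
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("wordcount_json").getOrCreate()
explodedWords1 = spark.createDataFrame([("abc", 1), ("xyz", 23)], ["Word", "Count"])

aggregated = explodedWords1.agg(
    f.collect_list(
        f.struct(f.col("Word").alias("word"), f.col("Count").alias("count"))
    ).alias("words")
)

# Prints: {"words":[{"word":"abc","count":1},{"word":"xyz","count":23}]}
print(aggregated.toJSON().first())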