PySpark does not parse an XML from a file containing multiple XMLs

I am using Spark 2.3.2 with Python 3.7 to parse XML.
I have appended 2 XMLs to one XML file (sample).
When I parse it with:
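import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Must be set before the session is created so the spark-xml package is loaded.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.7.0 pyspark-shell'
conf = pyspark.SparkConf()
# getOrCreate() already returns a SparkSession; no SQLContext wrapper is needed.
spark = SparkSession.builder.config(conf=conf).getOrCreate()
dfSample = (spark.read.format("xml").option("rowTag", "xocs:doc")
            .load(r"sample.xml"))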
I see the data from both XMLs.
However, what I need is to extract the info under "ref-info" tag (along with their corresponding key eids), so my code is:
(dfSample.
withColumn("metaExp", F.explode(F.array("xocs:meta"))).
withColumn("eid", F.col("metaExp.xocs:eid")).
select("eid","xocs:item").
withColumn("xocs:itemExp", F.explode(F.array("xocs:item"))).
withColumn("item", F.col("xocs:itemExp.item")).
withColumn("itemExp", F.explode(F.array("item"))).
withColumn("bibrecord", F.col("item.bibrecord")).
withColumn("bibrecordExp", F.explode(F.array("bibrecord"))).
withColumn("tail", F.col("bibrecord.tail")).
withColumn("tailExp", F.explode(F.array("tail"))).
withColumn("bibliography", F.col("tail.bibliography")).
withColumn("bibliographyExp", F.explode(F.array("bibliography"))).
withColumn("reference", F.col("bibliography.reference")).
withColumn("referenceExp", F.explode(F.array("reference"))).
withColumn("ref-infoExp", F.explode(F.col("reference.ref-info"))).
withColumn("authors", F.explode(F.col("ref-infoExp.ref-authors.author"))).
withColumn("py", (F.col("ref-infoExp.ref-publicationyear._first"))).
withColumn("so", (F.col("ref-infoExp.ref-sourcetitle"))).
withColumn("ti", (F.col("ref-infoExp.ref-title"))).
drop("xocs:item", "xocs:itemExp", "item", "itemExp", "bibrecord", "bibrecordExp", "tail", "tailExp", "bibliography",
"bibliographyExp", "reference", "referenceExp").show())
This extracts the info only from the XML with eid = 85082880163.
When I delete that one and keep only the one with eid = 85082880158, it works.
My file is an XML file containing the 2 lines in the link. I have also tried to merge those 2 into one XML but could not manage it.
What is wrong with my data/approach? (My ultimate plan is to create a file like this containing thousands of different XMLs to be parsed.)

Your trouble lies in this piece of XML:
<xocs:item>
<item>
<bibrecord>
<head>
<abstracts>
<abstract original="y" xml:lang="eng">
<ce:para>
Apple is often affected by [...] intercellular CO <inf>2</inf> concentration [...]
^^^^^^^^^^^^
While this is correct XML (mixed content), the embedded tag <inf>2</inf> seems to break the spark-xml parser. If you remove it (or escape it using the corresponding entities) you will get correct results.
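One way to apply that workaround is a preprocessing pass over the file before spark-xml reads it. The following is a minimal sketch, assuming `inf` is the only offending inline tag and that dropping the tag while keeping its text is acceptable for your data. The names `INLINE_TAGS` and `strip_inline_tags` are hypothetical and exist only for this illustration.

```python
import re

# Assumption: these are the inline tags that occur inside text nodes.
# Add others (for example "sup") if your data contains them.
INLINE_TAGS = ("inf",)

def strip_inline_tags(xml_text, tags=INLINE_TAGS):
    """Remove opening/closing inline tags but keep the text they wrap."""
    names = "|".join(re.escape(t) for t in tags)
    # Matches <inf>, </inf> and <inf attr="..."> but not longer names such as <info>.
    pattern = re.compile(r"</?(?:%s)(?:\s[^>]*)?>" % names)
    return pattern.sub("", xml_text)

sample = "<ce:para>intercellular CO <inf>2</inf> concentration</ce:para>"
cleaned = strip_inline_tags(sample)
print(cleaned)
```

The cleaned text would then be written to a new file and that file passed to `spark.read.format("xml")` instead of the original.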

Related

Add element to the xml string in scala

I have the following relatively simple scenario, but I can't get it working.
I need to append an element to my XML string. Here's the scenario:
val xmlStr = "<return> <numberPin> 123456 </numberPin> </return>"
I need some way to add the date element and return the string below. I would like a solution using a regular expression if possible.
"<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"
You can first create a template XML that is filled in at runtime.
You can do something like the following:
def updateXml (xmlStr:String, dateContent: String) = {
xmlStr.replace("DATE_DATA", dateContent)
}
val xmlStr = "<return> <numberPin> 123456 </numberPin> DATE_DATA </return>"
val dateData = "<date> 2019-09-04 00:00:00 </date>"
updateXml(xmlStr, dateData)
Another alternative is to keep the XML template in a file (useful if the XML content is large), read it in your code, and insert the required data at runtime as in the example above, where the DATE_DATA placeholder in the template is replaced by the method.

How to export a CSV file to a BigQuery table using Java Dataflow?

I want to read a CSV file from a cloud bucket and write it to a BigQuery table with columns, using Dataflow in Java. How can I use the CSV file's header to set the columns while writing to BigQuery?
There are two issues to solve here:
(1) Skipping the header when reading the data, and
(2) Using the header to correctly populate the BigQuery table columns.
For (1), this is not implemented natively as of June 2019, though you could try the options listed at "Skipping header rows - is it possible with Cloud DataFlow?". For (2), the easiest approach would be to read the first line of your CSV in your main program and pass the list of column names in the constructor to a DoFn that converts CSV lines into TableRow objects ready to write to BigQuery.
Your final program would look something like
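public void CsvToBigquery(String csvInputPattern, String bigqueryTable) {
  // Read the header once, outside the pipeline (helper not shown).
  final String[] columns = readAndSplitFirstLineOfFirstFile(csvInputPattern);
  Pipeline p = Pipeline.create(...);
  p
    .apply(TextIO.read().from(csvInputPattern))
    // Drop the header line(s) (filter not shown).
    .apply(Filter.by(new MatchIfNonHeader()))
    .apply(ParDo.of(new DoFn<String, TableRow>() {
      ... // use columns here to build TableRows
    }))
    // BigQueryIO, not BigtableIO: the target is a BigQuery table.
    .apply(BigQueryIO.writeTableRows().to(bigqueryTable)...);
  p.run();
}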
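The header handling in (2) does not depend on the SDK. As a rough illustration in plain Python (not Beam), the header is read once up front and then handed to the per-line converter, which also skips the header line itself. The names `read_header` and `line_to_row` and the sample lines are hypothetical, made up for this sketch. Detecting the header by comparing values is also an assumption that only holds if no data row repeats the column names.

```python
import csv

def read_header(first_line):
    """Split the first CSV line into column names."""
    return next(csv.reader([first_line]))

def line_to_row(columns, line):
    """Convert one CSV line into a dict keyed by column name.

    Returns None for the header line so the caller can filter it out.
    """
    values = next(csv.reader([line]))
    if values == columns:
        return None
    return dict(zip(columns, values))

# Hypothetical sample data standing in for the file in the bucket.
lines = ["id,name", "1,apple", '2,"pear, green"']
columns = read_header(lines[0])
rows = [r for r in (line_to_row(columns, l) for l in lines) if r is not None]
print(rows)
```

In the Java pipeline, the body of `line_to_row` corresponds to what the DoFn would do with the `columns` array passed to its constructor.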
I've done a similar task and used the Apache Commons CSV library in a ParDo function to extract the data from CSV files, then converted the records to TableRow objects for BigQuery.
String fileData = c.element();
BufferedReader fileReader = new BufferedReader(new InputStreamReader(
new ByteArrayInputStream(fileData.getBytes("UTF-8")), "UTF-8"));
CSVParser csvParser = new CSVParser(fileReader,CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
Iterable<CSVRecord> csvRecords = csvParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
TableRow row = new TableRow();
checkAndConvertIntoBqDataType(csvRecord.toMap());
c.output(row);
}

Apache Tika - Parsing and extracting only metadata without reading content

Is there a way to configure Apache Tika so that it extracts only the metadata properties from a file and does not access the file's content? We need this to avoid reading the entire content of larger files.
The code to extract we are using is as follows:
var tikaConfig = TikaConfig.getDefaultConfig();
var metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
BodyContentHandler handler = new BodyContentHandler();
using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
{
parser.parse(stream, handler, metadata, new ParseContext());
Array metadataKeys = metadata.names();
Array.Sort(metadataKeys);
}
With the above code sample, the content is read even when we only want to extract the metadata. We need a way to avoid that.

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell for example, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use the following.
First, get a buffer/list of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]
  // S3 client and list-objects request
  var s3Client = new AmazonS3Client();
  var objectListing: ObjectListing = null;
  var listObjectsRequest = new ListObjectsRequest();
  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)
  // Prepend s3:// to each key and add it to the list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
  } while (objectListing.isTruncated());
  // Remove the base directory name
  files.remove(0)
  // Return the result as a Scala collection
  files.asScala
}
Now pass this list to the following piece of code. Note: sc is the SparkContext, and sc.textFile returns an RDD[String], so the pieces are combined with union.
var rdd: RDD[String] = null  // needs import org.apache.spark.rdd.RDD
for (file <- files) {
  val fileRdd = sc.textFile(file)
  if (rdd != null) {
    rdd = rdd.union(fileRdd)
  } else {
    rdd = fileRdd
  }
}
Now you have a final unified RDD, i.e. rdd.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the file name and the value is the content of the whole file as a string. You can then map this so that the key is the file date, and then use groupByKey.
http://spark.apache.org/docs/latest/programming-guide.html
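To make the "key by file date, then group" step concrete, here is a small sketch in Python rather than Scala. It assumes the path layout from the question (.../YYYY/MM/DD/filename.txt.gz). The names `day_key` and `bundle_by_day` are hypothetical, and the grouping is done on an in-memory list so the example runs without a cluster. The commented lines show roughly where the same key function would plug into wholeTextFiles in PySpark.

```python
import re
from collections import defaultdict

# Assumption: paths end in /YYYY/MM/DD/<filename>, as in the question.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/[^/]+$")

def day_key(path):
    """Return 'YYYY-MM-DD' extracted from a path, or None if it does not match."""
    m = DATE_IN_PATH.search(path)
    return "-".join(m.groups()) if m else None

def bundle_by_day(pairs):
    """Group (path, content) pairs by day, keeping the original file name."""
    bundles = defaultdict(list)
    for path, content in pairs:
        bundles[day_key(path)].append((path, content))
    return dict(bundles)

# With PySpark the key function would be used roughly like this (not run here):
#   rdd = sc.wholeTextFiles("s3n://mybucket/path/txt/1996/04/*/*.txt.gz")
#   daily = rdd.map(lambda kv: (day_key(kv[0]), kv)).groupByKey()

pairs = [
    ("s3n://mybucket/path/txt/1996/04/09/a.txt.gz", "first"),
    ("s3n://mybucket/path/txt/1996/04/09/b.txt.gz", "second"),
    ("s3n://mybucket/path/txt/1996/04/10/c.txt.gz", "third"),
]
bundles = bundle_by_day(pairs)
print(sorted(bundles))
```

Each bundle keeps the (path, content) pairs together, which matches the wish in the question to store the original file names alongside their content.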
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once the data is in HDFS, Spark is quite fast at ingesting it in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating the file list will be a bottleneck. I'd recommend maintaining a file that gets the file path appended to it every time a file is uploaded to S3.
The same goes for output: write to HDFS and then move to S3 (although a direct copy might be equally efficient).

How to specify strings in Weka file?

I am working on a text classification system and I would like to use unigrams as features. When building the ARFF file, I declared a string attribute in which I want to list all the words contained in a message, separated by commas. However, Weka tells me that it "Cannot handle string attributes". I tried defining the relation in the header with StringToWordVector, but it didn't help. How else can I go about this? Many thanks!
If your ARFF file format is correct, the following code can help you:
// dataSource: arff file (path of your arff file)
BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
Instances trainInsts = new Instances(trainReader);
trainInsts.setClassIndex(trainInsts.numAttributes() - 1);
// the filter is used to convert the data from string to numeric
StringToWordVector STWfilter = new StringToWordVector();
FilteredClassifier model = new FilteredClassifier();
model.setFilter(STWfilter);
STWfilter.setInputFormat(trainInsts);
// the converted data
trainInsts = Filter.useFilter(trainInsts, STWfilter);