parse xml with multiple rowtags using spark - scala

I want to parser xml using spark so I am using spark databricks lib. sample xml is as follows:
<Transactions>
<Transaction>
<transid>1111</transid>
</Transaction>
<Transaction>
<transid>2222</transid>
</Transaction>
</Transactions>
<Payments>
<Payment>
<Id>123</Id>
</Payment>
<Payment>
<Id>456</Id>
</Payment>
</Payments>
code to parse:
val transNestedDF = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Transactions").load("trans_nested.xml")
transNestedDF.registerTempTable("TransNestedTbl")
sqlContext.sql("select Transaction[0].transid from TransNestedTbl").collect()
Here I don't have any root tag also I can't define multiple row tags so if I have to process both transactions and payments in single read using above single dataframe then how to achieve that?
need help.

Let's try this with lxml, a python library, which itself uses xpath:
If you don't have it installed, you need to:
pip intall lxml
then:
import lxml.html
pay = """ [your code above] """
doc = lxml.html.fromstring(pay)
tid =doc.xpath('Transactions//transid'.lower()) #or ('//Transactions//transid'.lower()) depending on the structure of the original doc
pid = doc.xpath('Payments//id'.lower()) #same comment
final = ''
for i in tid:
for p in pid:
final = final+i.text+'|'+p.text+' \n'
print(final)
Output:
1111|123
1111|456
2222|123
2222|456

You can't do it in one read, if there is no tag around both of these. If there is any common parent tag, you can use that as rowTag and ignore the rest that is parsed.
You can of course read them separately into two DataFrames. That works fine if you treat them separately. But you lose the association between transactions and payments, unless you can join on some ID.
But then I'd wonder why the XML structure doesn't have any common parent if these are associated.

Related

How to writeback to dataframe using transform_df in palantir foundry?

I created a library for updating description of the columns of the input dataset. This function takes three parameter as input (input_dataset, output_dataset, config file) and eventually writes back the description of output dataset. So now we want to import this library across various use cases. How to go for those cases where we are writing spark transformation i.e taking inputs through transform_df because here we can't assign output to output variable. In that situation how can i call my description library function? How to proceed in those situation in palantir foundry. Any suggestions?
This method isn't currently supported using the #transform_df decorator; you'll have to use the #transform decorator at the moment.
The reasoning behind this resulted from recognizing the need for broader access to metadata APIs like the #transform decorator already allows. Thus it seemed more in line with this pattern to keep it there since the #transform_df decorator is inherently higher-level.
You can always simply move over your transformations from...
from transforms.api import transform_df, Input, Output
#transform_df(
Output("/my/output"),
my_input("/my/input"),
)
def my_compute_function(my_input):
df = my_input
# ... logic ....
return my_input
...to...
from transforms.api import transform, Input, Output
#transform(
my_output=Output("/my/output"),
my_input=Input("/my/input")
)
def my_compute_function(my_input, my_output):
df = my_input.dataframe()
# ... logic ....
my_output.write_dataframe(df)
...in which only 6 lines of code need be changed.

how to read CSV file in scala

I have a CSV file and I want to read that file and store it in case class. As I know A CSV is a comma separated values file. But in case of my csv file there are some data which have already comma itself. and it creates new column for every comma. So the problem how to split data from that.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
var i=0
var length=0
val data=Source.fromFile(file)
for (line <- data.getLines) {
val cols = line.split(",").map(_.trim)
length = cols.length
while(i<length){
//println(cols(i))
i=i+1
}
i=0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use univocity CSV parser for faster stuffs.
You can also use it for creation as well.
Univocity parsers

mirth connect Database Reader automatic column mapping

Please could somebody confirm the following..
I am using Mirth Connect 3.5.08232.
My Source Connector is a Database Reader.
Say, I am using a query that returns multiple rows, and return the result (via JavaScript), as documentation suggests, so that Mirth would treat each row as a separate message. I also use a couple of mappers as source transformers, and save the mapped fields in my channel map (which ends up to contain only those fields that I define in transformers)
In the destination, and specifically, in destination response transformer (or destination body, if it is a JavaScript writer), how do I access the source fields?
the only way I found by trial and error is
var rawMsg = connectorMessage.getRawData();
var xmlMsg = new XML(rawMsg);
logger.info(xmlMsg.some_field); // ignore the root element of rawMsg
Is this the right way to do this? I thought that maybe the fields that were nicely automatically detected would be put in some kind of a map, like sourceMap - but that doesn't seem to be the case, right?
Thank you
If you are using Mapper steps in your transformer to extract the data and put it into a variable map (like the channel map), then you can use any of the following methods to retrieve it from a subsequent JavaScript context (including a JavaScript Writer, and your response transformer):
var value = channelMap.get('key');
var value = $c('key');
var value = $('key');
Look at the Variable Maps section of the User Guide for more information.
So to recap, say you're selecting a column "mycolumn" with a Database Reader. The XML sent to the channel will be something like this:
<result>
<mycolumn>value</mycolumn>
</result>
Then you can choose to extract pieces of that message into specific variables for later use. The transformer allows you to easily drag-and-drop pieces of the sample inbound message.
Finally in your JavaScript Writer (or in any subsequent filter, transformer, or response transformer), just drag the value into the field you want:
And the corresponding JavaScript code will automatically be inserted:
One last note, if you are selecting a lot of variables and don't want to make Mapper steps for each one individually, you can use a JavaScript Step to iterate through the message and extract each column into a separate map variable:
for each (child in msg.children()) {
channelMap.put(child.localName(), child.toString());
}
Or, you can just reference the columns directly from within the JavaScript Writer:
var msg = new XML(connectorMessage.getEncodedData());
var column1 = msg.column1.toString();
var column2 = msg.column2.toString();
...

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[String,Iterable[EmployeeRecord]]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then, I go for processing the records as per the logic of the application. Finally, I get an RDD of type [Iterable[EmployeeRecord]]
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
the I find that in the file every record begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord in each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterator and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also have found this post suggesting the same thing. I wish I saw it a bit earlier.

SQLAlchemy, Psycopg2 and Postgresql COPY

It looks like Psycopg has a custom command for executing a COPY:
psycopg2 COPY using cursor.copy_from() freezes with large inputs
Is there a way to access this functionality from with SQLAlchemy?
accepted answer is correct but if you want more than just the EoghanM's comment to go on the following worked for me in COPYing a table out to CSV...
from sqlalchemy import sessionmaker, create_engine
eng = create_engine("postgresql://user:pwd#host:5432/db")
ses = sessionmaker(bind=engine)
dbcopy_f = open('/tmp/some_table_copy.csv','wb')
copy_sql = 'COPY some_table TO STDOUT WITH CSV HEADER'
fake_conn = eng.raw_connection()
fake_cur = fake_conn.cursor()
fake_cur.copy_expert(copy_sql, dbcopy_f)
The sessionmaker isn't necessary but if you're in the habit of creating the engine and the session at the same time to use raw_connection you'll need separate them (unless there is some way to access the engine through the session object that I don't know). The sql string provided to copy_expert is also not the only way to it, there is a basic copy_to function that you can use with subset of the parameters that you could past to a normal COPY TO query. Overall performance of the command seems fast for me, copying out a table of ~20000 rows.
http://initd.org/psycopg/docs/cursor.html#cursor.copy_to
http://docs.sqlalchemy.org/en/latest/core/connections.html#sqlalchemy.engine.Engine.raw_connection
If your engine is configured with a psycopg2 connection string (which is the default, so either "postgresql://..." or "postgresql+psycopg2://..."), you can create a psycopg2 cursor from an SQL Alchemy session using
cursor = session.connection().connection.cursor()
which you can use to execute
cursor.copy_from(...)
The cursor will be active in the same transaction as your session currently is. If a commit or rollback happens, any further use of the cursor with throw a psycopg2.InterfaceError, you would have to create a new one.
You can use:
def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
# Create Table
df[:0].to_sql(table, engine, if_exists=if_exists)
# Prepare data
output = cStringIO.StringIO()
df.to_csv(output, sep=sep, header=False, encoding=encoding)
output.seek(0)
# Insert data
connection = engine.raw_connection()
cursor = connection.cursor()
cursor.copy_from(output, table, sep=sep, null='')
connection.commit()
cursor.close()
I insert 200000 lines in 5 seconds instead of 4 minutes
It doesn't look like it.
You may have to just use psycopg2 to expose this functionality and forego the ORM capabilities. I guess I don't really see the benefit of ORM in such an operation anyway since it's a straight bulk insert and dealing with individual objects a la an ORM would not really make a whole lot of sense.
If you're starting from SQLAlchemy, you need to first get to the connection engine (also known by the property name bind on some SQLAlchemy objects):
engine = create_engine('postgresql+psycopg2://myuser:password#localhost/mydb')
# or
engine = session.engine
# or any other way you know to get to the engine
From the engine you can isolate a psycopg2 connection:
# get a psycopg2 connection
connection = engine.connect().connection
# get a cursor on that connection
cursor = connection.cursor()
Here are some templates for the COPY statement to use with cursor.copy_expert(), a more complete and flexible option than copy_from() or copy_to() as it is indicated here: https://www.psycopg.org/docs/cursor.html#cursor.copy_expert.
# to dump to a file
dump_to = """
COPY mytable
TO STDOUT
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
# to copy from a file:
copy_from = """
COPY mytable
FROM STDIN
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
Check out what the options above mean and others that may be of interest to your specific situation https://www.postgresql.org/docs/current/static/sql-copy.html.
IMPORTANT NOTE: The link to the documentation of cursor.copy_expert() indicates to use STDOUT to write out to a file and STDIN to copy from a file. But if you look at the syntax on the PostgreSQL manual, you'll notice that you can also specify the file to write to or from directly in the COPY statement. Don't do that, you're likely just wasting your time if you're not running as root (who runs Python as root during development?) Just do what's indicated in the psycopg2's docs and specify STDIN or STDOUT in your statement with cursor.copy_expert(), it should be fine.
# running the copy statement
with open('/path/to/your/data/file.csv') as f:
cursor.copy_expert(copy_from, file=f)
# don't forget to commit the changes.
connection.commit()
You don't need to drop down to psycopg2, use raw_connection nor a cursor.
Just execute the sql as usual, you can even use bind parameters with text():
engine.execute(text('''copy some_table from :csv
delimiter ',' csv'''
).execution_options(autocommit=True),
csv='/tmp/a.csv')
You can drop the execution_options(autocommit=True) if this PR will be accepted