Parsing XML data with Postgres - postgresql

PostgreSQL 9.2.4
I apologize if this is a duplicate topic, but I can't seem to figure it out (yet). I'd like to parse XML data that is stored in a Postgres table.
For example:
select program_information.description FROM program_information WHERE id = 8768787;
gives me output like:
<?xml version="1.0"?>
<ProgramInformation>
<BasicDescription>
<Title type="original">Zla smrt</Title>
<Synopsis length="short">Pet prijateljev, starih nekaj čez dvajset let, v samotni koči najde Knjigo mrtvih. S posnetka, ki so ga napravili arheologi, izvedo, da je bilo starodavno besedilo odkrito med kandarijskimi ruševinami sumerske civilizacije.</Synopsis>
<Keyword type="secondary"></Keyword>
<ParentalGuidance>
<mpeg7:ParentalRating href="rn:mpeg:MPAAParentalRatingCS:PG">
<mpeg7:Name>PG</mpeg7:Name>
</mpeg7:ParentalRating>
</ParentalGuidance>
<CreditsList>
<CreditsItem role="urn:tva:metadata:TVARoleCS:ACTOR">
<PersonName>
<mpeg7:GivenName>Bruce</mpeg7:GivenName>
<mpeg7:FamilyName>Campbell</mpeg7:FamilyName>
</PersonName>
</CreditsItem>
<CreditsItem role="urn:tva:metadata:TVARoleCS:ACTOR">
<PersonName>
<mpeg7:GivenName>Ellen</mpeg7:GivenName>
<mpeg7:FamilyName>Sandweiss</mpeg7:FamilyName>
</PersonName>
</CreditsItem>
<CreditsItem role="urn:tva:metadata:TVARoleCS:ACTOR">
<PersonName>
<mpeg7:GivenName>Betsy</mpeg7:GivenName>
<mpeg7:FamilyName>Baker</mpeg7:FamilyName>
</PersonName>
</CreditsItem>
<CreditsItem role="urn:tva:metadata:TVARoleCS:DIRECTOR">
<PersonName>
<mpeg7:GivenName>Sam</mpeg7:GivenName>
<mpeg7:FamilyName>Raimi</mpeg7:FamilyName>
</PersonName>
</CreditsItem>
</CreditsList>
<ReleaseInformation>
<ReleaseDate>
<Year>1981</Year>
</ReleaseDate>
</ReleaseInformation>
</BasicDescription>
<AVAttributes>
<AudioAttributes>
<NumOfChannels>2</NumOfChannels>
</AudioAttributes>
</AVAttributes>
</ProgramInformation>
What I'd like is that XML parsed into separate columns (title, synopsis, ratings, actors, etc.), with output like:
+----------+-------------+----------------+
| Title | Synopsis | ParentalRating |
+----------+-------------+----------------+
| my title | some descr | rating |
+----------+-------------+----------------+
I've tried with xpath, but so far it has been a dead end... :/
Can anyone guide me to the correct query? Thank you!
M

Your XML document is missing a namespace declaration for the mpeg7 prefix. Is it supposed to be like that? I manually added a namespace declaration to <ProgramInformation> and it worked:
<ProgramInformation xmlns:mpeg7="http://mpeg7.io">
with the following query:
SELECT
XPATH('//BasicDescription/Title/text()', t.xml) AS title,
XPATH('//BasicDescription/Synopsis/text()', t.xml) AS synopsis,
XPATH('//mpeg7:ParentalRating/mpeg7:Name/text()', t.xml, ARRAY[ARRAY['mpeg7', 'http://mpeg7.io']]) AS parentalRating
FROM t;
title | synopsis | parentalrating
--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
{"Zla smrt"} | {"Pet prijateljev, starih nekaj čez dvajset let, v samotni koči najde Knjigo mrtvih. S posnetka, ki so ga napravili arheologi, izvedo, da je bilo starodavno besedilo odkrito med kandarijskimi ruševinami sumerske civilizacije."} | {PG}
(1 row)
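For comparison, the same three lookups can be sketched outside the database with Python's standard library. This is only an illustration, assuming the XML has already been fetched as a string and that the mpeg7 prefix is bound to the placeholder URI http://mpeg7.io added by hand above (it is not the real MPEG-7 namespace). The helper name parse_program and the shortened sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption); it must match the URI declared
# in the document, just as in the XPATH() call above.
NS = {"mpeg7": "http://mpeg7.io"}

# Shortened copy of the document from the question, with the namespace declared.
SAMPLE = """
<ProgramInformation xmlns:mpeg7="http://mpeg7.io">
  <BasicDescription>
    <Title type="original">Zla smrt</Title>
    <Synopsis length="short">Pet prijateljev, starih nekaj čez dvajset let, v samotni koči najde Knjigo mrtvih.</Synopsis>
    <ParentalGuidance>
      <mpeg7:ParentalRating href="rn:mpeg:MPAAParentalRatingCS:PG">
        <mpeg7:Name>PG</mpeg7:Name>
      </mpeg7:ParentalRating>
    </ParentalGuidance>
  </BasicDescription>
</ProgramInformation>
"""


def parse_program(xml_text):
    """Extract title, synopsis and parental rating as one flat record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("BasicDescription/Title"),
        "synopsis": root.findtext("BasicDescription/Synopsis"),
        # Prefixed elements need the prefix-to-URI mapping passed explicitly.
        "parental_rating": root.findtext(
            ".//mpeg7:ParentalRating/mpeg7:Name", namespaces=NS
        ),
    }


record = parse_program(SAMPLE)
print(record)
```

Unlike XPATH() in PostgreSQL, which returns an array for every expression, findtext returns the text of the first match only (or None when nothing matches), so each field is a plain string here.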

Related

How to use dynamic table name for sub query where the dynamic value coming from its own main query in PostgreSQL?

I have formed this query to get the desired output mentioned below:
select tbl.id, tbl.label, tbl.input_type, tbl.table_name,
case when tbl.input_type = 'dropdown' or tbl.input_type = 'searchable-dropdown'
then (select json_agg(opt) from (select * from tbl.table_name) as opt) end as options
from mst_config as tbl;
I want output like below:
id | label | input_type | table_name | options
----+----------------------------------------------------+---------------------+-------------------------+-----------------------------------------------------------
1 | Gender | dropdown | mst_gender | [{"id":1,"label":"MALE"},
| | | | {"id":2,"label":"FEMALE"}]
2 | SS | dropdown | mst_ss | [{"id":1,"label":"something"},
| | | | {"id":2,"label_en":"something"}]
But I'm facing a problem with this part:
select json_agg(opt) from (select * from tbl.table_name) as opt
I wanted "tbl.table_name" to be used as a dynamic table name, but that does not work.
After a lot of searching I found something like EXECUTE format('select * from %s', table_name), where table_name is the dynamic table name. I also tried this inside a Postgres function.
But I ran into trouble with format as well: the table name has to come from the current row of the main query rather than from a variable that already holds it, so this approach did not work either.
I would really appreciate it if anyone could help me out with this. If there are other ways to achieve this output, please let me know as well.
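One possible way around this, sketched here rather than offered as a definitive answer, is to resolve the table name on the client: read mst_config first, then run one query per referenced table. The example uses Python with an in-memory SQLite database purely so that it is self-contained; against PostgreSQL the same idea would go through a database driver instead. The function name load_config_with_options, the allow-list check, the assumption that every lookup table has an id column, and the sample rows are all made up for illustration.

```python
import json
import sqlite3

DROPDOWN_TYPES = {"dropdown", "searchable-dropdown"}


def load_config_with_options(conn):
    """Return the mst_config rows, each with an 'options' list read from its table_name."""
    conn.row_factory = sqlite3.Row
    # Table names cannot be bound as query parameters, so only names that
    # really exist in the database are allowed to be interpolated.
    known_tables = {
        r["name"]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    }
    configs = conn.execute(
        "SELECT id, label, input_type, table_name FROM mst_config ORDER BY id"
    ).fetchall()

    result = []
    for cfg in configs:
        row = dict(cfg)
        row["options"] = None
        if cfg["input_type"] in DROPDOWN_TYPES and cfg["table_name"] in known_tables:
            # Assumes each lookup table has an id column to order by.
            query = 'SELECT * FROM "{}" ORDER BY id'.format(cfg["table_name"])
            row["options"] = [dict(opt) for opt in conn.execute(query).fetchall()]
        result.append(row)
    return result


# Hypothetical sample data mirroring the desired output in the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mst_config (id INTEGER, label TEXT, input_type TEXT, table_name TEXT);
    CREATE TABLE mst_gender (id INTEGER, label TEXT);
    INSERT INTO mst_config VALUES (1, 'Gender', 'dropdown', 'mst_gender');
    INSERT INTO mst_gender VALUES (1, 'MALE');
    INSERT INTO mst_gender VALUES (2, 'FEMALE');
    """
)
rows = load_config_with_options(conn)
print(json.dumps(rows, indent=2))
```

The trade-off is one extra round trip per configured table, which is usually acceptable for a small configuration table; a server-side function that builds the statement dynamically would avoid the round trips.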

What is correct query for obtaining the below information

#id | title | series | series_position | author | img_url | object_key | e_tag | page_number
Given the table columns above, what would be the correct query to obtain the data in the following structure from a single query:
[author1:{
[series1:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]],
[series2:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]],
[series3:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]]]
},
author2:{
[series1:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]],
[series2:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]],
[series3:[{title:title1,series_position:1},{title:title2,series_position:2},{title:title3,series_position:3}]]]
}]
Currently I am doing something like this:
books = Books.query.order_by(Books.author).all()
authors = sorted(set(book.author for book in books))
After that I do something like the pseudocode below to display the data in Jinja:
{% for author in authors %}
<div id="{{ author }}">
{% for book in books %}
{% if author in book.author %}
<li><a>{{ book.title }}, {{ book.series }}, {{ book.series_position }}</a></li>
{% endif %}
{% endfor %}
</div>
{% endfor %}
Below is an image of what the Jinja code above currently renders on the frontend:
Below is the image of the table:
This should give the data structure you want (in dictionary format):
import pandas as pd

conn = ...  # your SQL connection to your database here
df = pd.read_sql('select * from Books', conn)

d = {}
for i in set(df.author):
    d[i] = {}
    df2 = df[df.author == i]
    for k in set(df2.series):
        d[i][k] = []
        df3 = df2[df2.series == k]
        for j in range(len(df3)):
            d[i][k].append({'title': df3['title'].iloc[j],
                            'series_position': df3['series_position'].iloc[j]})
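The same nesting can also be built without pandas. The following is a minimal sketch, assuming each row is available as a mapping with author, series, title and series_position keys; the function name group_books and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_books(rows):
    """Group rows as {author: {series: [{'title': ..., 'series_position': ...}]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["author"]][row["series"]].append(
            {"title": row["title"], "series_position": row["series_position"]}
        )
    # Sort every series by position and convert to plain dicts so the result
    # can be handed to a template or serialised as JSON.
    return {
        author: {
            series: sorted(books, key=lambda b: b["series_position"])
            for series, books in by_series.items()
        }
        for author, by_series in grouped.items()
    }


# Hypothetical sample rows standing in for the Books table.
rows = [
    {"author": "author1", "series": "series1", "title": "title2", "series_position": 2},
    {"author": "author1", "series": "series1", "title": "title1", "series_position": 1},
    {"author": "author2", "series": "series1", "title": "title1", "series_position": 1},
]
books_by_author = group_books(rows)
print(books_by_author)
```

In a template, the outer loop can then iterate over the authors and the inner loops over each author's series and books, with no membership test per book.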

How to return a function result into query?

I have a function called ClientStatus that returns a record with two fields Status_Description and Status_Date. This function receives a parameter Client_Id.
I'm trying to get the calculated client status for all the clients in the table Clients, something like:
| Client_Name | Status_Description | Status_Date |
+-------------+--------------------+-------------+
| Abc | Active | 12-12-2010 |
| Def | Inactive | 13-12-2011 |
Where Client_Name comes from the table Clients, Status_Description and Status_Date from the function result.
My first (wrong) approach was to join the table and the function like so:
SELECT c.Client_Name, cs.Status_Description, cs.Status_Date FROM Clients c
LEFT JOIN (
SELECT * FROM ClientStatus(c.ClientId) as (Status_Description text, Status_Date date)) cs
This obviously didn't work because c.ClientId could not be referenced.
Could someone explain to me how I can obtain the result I am looking for?
Thanks in advance.
I think the following should give the result you expect:
SELECT c.Client_Name, d.Status_Description, d.Status_Date
FROM Clients c, ClientStatus(c.ClientId) d
I have solved my problem by writing the query like this:
SELECT cs.Client_Name, cs.status[1] AS Description, cs.status[2]::date AS Date
FROM (
SELECT Client_Name, string_to_array(translate(
(SELECT ClientStatus(ClientId))::Text, '()', ''), ',') status
FROM Clients
) cs
It is not the most elegant solution but it was the only one I could find to make this work.

postgresql + textsearch + german umlauts + UTF8

I'm really at my wits' end with this problem, and I really hope someone can help me. I am using PostgreSQL 9.3. My database contains mostly German texts, but not only, so it's encoded in UTF-8. I want to set up a full-text search which supports the German language; nothing special so far.
But the search is behaving really strangely, and I can't find out what I am doing wrong.
So, take the following table as an example:
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Table "public.test"
Column | Type | Modifiers
--------+------+-----------
a | text |
sintext=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------------+---------+------------
(...)
public | test | table | paf
Now, let's have a look at some text search examples:
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Tische is the plural of Tisch (table) and Bäume is the plural of Baum (tree). So obviously umlauts do not work, while the text search itself performs well.
But what really confuses me is that a) non-German special characters do match:
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Café');
a
------
Café
and b) if I don't use the German dictionary, there is no problem with umlauts (but of course no real text search either):
select * from test where to_tsvector(a) @@ plainto_tsquery('Bäume');
a
-------------
viele Bäume
So if I use the German dictionary for text search, only the German special characters do not work? Seriously? What the hell is wrong here? I really can't figure it out, please help!
You're explicitly using the German dictionary for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default dictionary isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)

powershell parsing of cdata-section

I'm trying to read an rss feed using powershell and I can't extract a cdata-section within the feed
Here's a snippet of the feed (with a few items cut to save space):
<item rdf:about="http://philadelphia.craigslist.org/ctd/blahblah.html">
<title>
<![CDATA[2006 BMW 650I,BLACK/BLACK/SPORT/AUTO ]]>
</title>
...
<dc:title>
<![CDATA[2006 BMW 650I,BLACK/BLACK/SPORT/AUTO ]]>
</dc:title>
<dc:type>text</dc:type>
<dcterms:issued>2011-11-28T22:15:55-05:00</dcterms:issued>
</item>
And the Powershell script:
$rssFeed = [xml](New-Object System.Net.WebClient).DownloadString('http://philadelphia.craigslist.org/sss/index.rss')
foreach ($item in $rssFeed.rdf.item) { $item.title }
Which produces this:
#cdata-section
--------------
2006 BMW 650I,BLACK/BLACK/SPORT/AUTO
2006 BMW 650I,BLACK/BLACK/SPORT/AUTO
How do I extract the cdata-section?
I tried a few variants such as $item.title."#cdata-section" and $item.title.InnerText which return nothing. I tried $item.title | gm and I see the #cdata-section listed as a property. What am I missing?
Thanks.
Since you have multiple of those, the title property itself would be an array, so the following should work:
$rss.item.title | select -expand "#cdata-section"
or
$rss.item.title[0]."#cdata-section"
based on what you need.