Zend_Search_Lucene - zend-framework

I have a database that I would like to leverage with Zend_Search_Lucene. However, I am having difficulty creating a "fully searchable" document for Lucene.
Each Zend_Search_Lucene document pulls information from two relational database tables (Table_One and Table_Two). Table_One has basic information (id, owner_id, title, description, location, etc.); Table_One has a 1:N relationship to Table_Two (meaning for each entry in Table_One there can be one or more entries in Table_Two). Table_Two contains: id, listing_id, bedrooms, bathrooms, price_min, price_max, date_available. See Figure 1.
Figure 1
Table_One
    id (Primary Key)
    owner_id
    title
    description
    location
    etc...
Table_Two
    id (Primary Key)
    listing_id (Foreign Key to Table_One)
    bedrooms (int)
    bathrooms (int)
    price_min (int)
    price_max (int)
    date_available (datetime)
The problem is, there are multiple Table_Two entries for each Table_One entry. [Question 1] How do I create a Zend_Search_Lucene document where each field is unique? (See Figure 2)
Figure 2
Lucene Document
    id: Keyword
    owner_id: Keyword
    title: UnStored
    description: UnStored
    location: UnStored
    date_registered: Keyword
    ... (other Table_One information)
    bedrooms: UnStored
    bathrooms: UnStored
    price_min: UnStored
    price_max: UnStored
    date_available: Keyword
    bedrooms_1: <- I would prefer not to do this, as it makes the bedrooms harder to search.
Next, I need to be able to do a Range Query on the bedrooms, bathrooms, price_min, and price_max fields (example: finding documents that have between 1 and 3 bedrooms). Zend_Search_Lucene only allows range searches within a single field. From my understanding, this means each field I want to run a range query on can only contain one value (example: bedrooms:"1 bedroom").
What I have now within the Lucene document is the bedrooms, bathrooms, price_min, price_max, and date_available fields stored as space-delimited lists of values.
Example:
Sample Table_One Entry:
| id | owner_id | title | description | location | date_registered |
| 5 | 2 | "Sample Title" | "Sample Description" | "Sample Location" | 2008-01-12 |
Sample Table_Two Entries:
| id | listing_id | bedrooms | bathrooms | price_min | price_max | date_available |
| 10 | 5 | 3 | 1 | 900 | 1000 | 2009-10-01 |
| 11 | 5 | 2 | 1 | 800 | 850 | 2009-08-11 |
| 12 | 5 | 1 | 1 | 650 | 650 | 2009-09-15 |
Sample Lucene Document
id:5
owner_id:2
title: "Sample Title"
description: "Sample Description"
location: "Sample Location"
date_registered: [datetime stamp YYYY-MM-DD]
bedrooms: "3 bedroom 2 bedroom 1 bedroom"
bathrooms: "1 bathroom 1 bathroom 1 bathroom"
price_min: "900 800 650"
price_max: "1000 850 650"
date_available: "2009-10-01 2009-08-11 2009-09-15"
[Question 2] Can you do a Range Query search on the bedrooms, bathrooms, price_min, price_max, and date_available fields as they are shown above, or does each range-query field have to contain only one value (e.g. "1 bedroom")? I have not been able to get the Range Query to work in its current form. I am at a loss here.
Thanks in advance.

I suggest you create a separate Lucene document for each entry in Table_Two. This will cause some duplication of the Table_One information common to these entries, but that is not a high price to pay for a much simpler index structure in Lucene.
Use a boolean query to combine several range queries. The number-valued fields should be something like this:
bedrooms: 3
price_min: 900
and a sample query in Lucene syntax will be:
date_available:[20100101 TO 20100301] AND price_min:[600 TO 1000]
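As an illustration, here is a minimal indexing sketch, assuming the Zend Framework 1 API and the schema from Figure 1 ($rows, the index path, and the join producing one row per Table_Two entry are placeholders, not part of the original question):

<?php
// Sketch only: $rows is assumed to hold one row per Table_Two entry,
// joined with its parent Table_One listing.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::create('/path/to/index');

foreach ($rows as $row) {
    $doc = new Zend_Search_Lucene_Document();

    // Table_One data, duplicated into every document for this listing.
    $doc->addField(Zend_Search_Lucene_Field::keyword('listing_id', $row['listing_id']));
    $doc->addField(Zend_Search_Lucene_Field::unStored('title', $row['title']));
    $doc->addField(Zend_Search_Lucene_Field::unStored('description', $row['description']));

    // Table_Two data: one value per field, so range queries work.
    $doc->addField(Zend_Search_Lucene_Field::keyword('bedrooms', $row['bedrooms']));
    $doc->addField(Zend_Search_Lucene_Field::keyword('price_min', sprintf('%05d', $row['price_min'])));
    $doc->addField(Zend_Search_Lucene_Field::keyword('price_max', sprintf('%05d', $row['price_max'])));
    // Dates as sortable YYYYMMDD strings, matching the query syntax above.
    $doc->addField(Zend_Search_Lucene_Field::keyword(
        'date_available', date('Ymd', strtotime($row['date_available']))));

    $index->addDocument($doc);
}
$index->commit();

One caveat: term range queries compare values as strings, so zero-padding numeric fields (as sketched here) keeps a query like price_min:[00600 TO 01000] ordered correctly; $index->find('bedrooms:[1 TO 3] AND price_min:[00600 TO 01000]') then returns one hit per matching unit.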

Related

Aggregate function to extract all fields based on maximum date

In one table I have duplicate values that I would like to group, exporting only those rows where the value in the "published_at" field is the most up-to-date (the latest date possible). Do I understand correctly that if I use the MAX aggregate function, the other fields I extract will come from the row with the max value found, or will they be taken from the first row found in the table?
Let me demonstrate this with a simple example (in the real-world case I am also joining two different tables). I would like to group by id and extract all fields, but only from the row with the max published_at. My query would be:
SELECT "t1"."id", "t1"."field", MAX("t1"."published_at") as "published_at"
FROM "t1"
GROUP By "t1"."id"
| id | field | published_at |
---------------------------------
| 1 | document1 | 2022-01-10 |
| 1 | document2 | 2022-01-11 |
| 1 | document3 | 2022-01-12 |
The result I want is:
1 - document3 - 2022-01-12
One more question: why am I getting the error "ERROR: column "t1"."field" must appear in the GROUP BY clause or be used in an aggregate function"? Can I use the MAX function on a string-type column?
If you want the latest row for each id, you can use DISTINCT ON. For example:
select distinct on (id) *
from t
order by id, published_at desc
If you just want the latest row in the whole result set, you can use LIMIT. For example:
select *
from t
order by published_at desc
limit 1
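As for the error: PostgreSQL requires every selected column to appear in the GROUP BY clause or inside an aggregate, which "t1"."field" does not in your query. MAX() does work on a string column (it compares by collation order), but aggregating "field" separately would not keep it paired with the max date. If you prefer plain GROUP BY over DISTINCT ON, here is a sketch of the usual join-back pattern, using the t1 layout from the question:

select t1.*
from t1
join (select id, max(published_at) as max_pub
      from t1
      group by id) m
  on m.id = t1.id and m.max_pub = t1.published_at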

Relational database design to represent similarity between rows of the same table

For background purposes: I'm using PostgreSQL with SQLAlchemy (Python).
Given a table of unique references as such:
references_table
-----------------------
id | reference_code
-----------------------
1 | CODEABCD1
2 | CODEABCD2
3 | CODEWXYZ9
4 | CODEPOIU0
...
In a typical scenario, I would have a separate items table:
items_table
-----------------------
id | item_descr
-----------------------
1 | `Some item A`
2 | `Some item B`
3 | `Some item C`
4 | `Some item D`
...
In this typical scenario, the many-to-many relationship between references and items is kept in a junction table:
references_to_items
-----------------------
ref_id (FK) | item_id (FK)
-----------------------
1 | 4
2 | 1
3 | 2
4 | 1
...
In that scenario, it is easy to model and obtain all references that are associated with the same item; for instance, item 1 has references 2 and 4, as per the table above.
However, in my scenario, there is no items_table. But I would still want to model the fact that some references refer to the same (non-represented) item.
I see a possibility to model that via a many-to-many junction table as such (associating FKs of the references table):
reference_similarities
-----------------------
ref_id (FK) | ref_id_similar (FK)
-----------------------
2 | 4
2 | 8
2 | 9
...
Where references with ID 2, 4, 8 and 9 would be considered 'similar' for the purposes of my data model.
However, the inconvenience here is that such a model requires choosing one reference (above, id=2) as a 'pivot', to which multiple others are declared 'similar' in the reference_similarities table: ref 2 is similar to 4 and ref 2 is similar to 8, thus 4 is similar to 8.
So the question is: is there a better design that doesn't involve having a 'pivot' FK as above?
Ideally, I would store the 'similarity' as an Array of FKs as such:
reference_similarities
------------------------
id | ref_ids (Array of FKs)
------------------------
1 | [2, 4, 8, 9]
2 | [1, 3, 5]
...but I understand from https://dba.stackexchange.com/questions/60132/foreign-key-constraint-on-array-member that it is currently not possible to have foreign keys on PostgreSQL array elements. So I'm trying to figure out a better design for this model.
I understand that you want to group items into a set and be able to query the set from any item in it.
You can use a hash function to hash the set, then use the hash as the pivot value.
For example, the set of values (2, 4, 8, 9) would be hashed like this:
hash = (((31*1 + 2)*31 + 4)*31 + 8)*31 + 9
You can refer to Arrays.hashCode in Java to see how to hash a list of values:
int result = 1;
for (Object element : a)
result = 31 * result + (element == null ? 0 : element.hashCode());
Table reference_similarities:
reference_similarities
-----------------------
ref_id (FK) | hash_value
-----------------------
2 | hash(2, 4, 8, 9) = 987204
4 | 987204
8 | 987204
9 | 987204
To query the set, first look up the hash_value for the given ref_id, then get all ref_ids with that hash_value.
The drawback of this solution is that every time you add a new value to a set, you have to rehash the set.
Another option is to write a function in Python that simply produces a unique hash_value whenever a new set is created.
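A concrete version of that last idea, sketched in SQL (table and column names are illustrative): replace the hash with a surrogate group id, so adding a member never requires rehashing.

CREATE TABLE similarity_groups (
    id serial PRIMARY KEY
);

CREATE TABLE reference_similarities (
    group_id int NOT NULL REFERENCES similarity_groups(id),
    ref_id   int NOT NULL REFERENCES references_table(id),
    PRIMARY KEY (group_id, ref_id)
);

-- All references similar to reference 2:
SELECT other.ref_id
FROM reference_similarities mine
JOIN reference_similarities other USING (group_id)
WHERE mine.ref_id = 2 AND other.ref_id <> 2;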

Drools - Finding a single matching condition for a table of products ranked by consumers

I have a table displaying information for the top four ratings of produce in a store, and I want to be able to find specific products in this rating table. Here is the structure of the table:
----------------------------------------------------------------------------
sectId | product_code | product_category | consumer_ranking
10444 | 11222 | PRODUCE | RATING_1
10444 | 45555 | PRODUCE | RATING_1
10444 | 10005 | PRODUCE | RATING_1
20555 | 11344 | PRODUCE | RATING_2
20555 | 94003 | PRODUCE | RATING_2
... and so on.
I wrote a rule to find inserted products which is not working the way I want, i.e. it should find the targeted fact that was inserted into the session. Here is the rule I put together:
rule "find by product codes rating_1"
when
$product_table: ProductRanking( $rank1: this.getProductCodesRankFirst())
$product1 : Product( this.product_code memberOf $rank1, $product_code: product_code )
$product2 : Product( this.product_code == 10444,this.product_code != $product_code ,$product_code2: product_code)
then
System.out.println("Found Products for product_codes "+$product_code+ " "+$product_code2 ) ;
end
Unfortunately, this returns 3 rows. I inserted into the session the product in row 2, i.e. the product with code 45555, and it does find row 2. However, it also brings in row 1 and row 3.
I can see why it's doing that: the SKUs all share sectId 10444. However, I want to bring in only the row
that I inserted, which is sectId 10444 with product_code 45555. How can I achieve that?
I solved it by using a global to filter out the extra products. In the first line, which brings in the rankings, I eliminate the extra matching products this way:
global ProductHelper productHelper

$product_table : ProductRanking( $rank1 : productHelper.getProductCodesRankFirst(),
    productCode != productHelper.getProductCodeFruitCategory() &&
    productCode != productHelper.productCodeVegetableCategory() )
The ProductHelper identifies the product codes I want to eliminate, and hence the extra two products brought in are ignored, creating a single match. I'm sure there is a better way, but since I'm no expert, this is what I was able to come up with.

Count By and Find By in same Query

I'm trying to build a single query using a JPA method name that finds all of the results based on a parameter, then counts based on a second parameter.
Say I have data that looks like this:
ID | Word | Who said it
1 | Apple | Person1
2 | Banana| Person1
3 | Apple | Person1
4 | Apple | Person2
I want to pass in a "Who said it" String and receive a histogram of unique words and how many times they said it. So, if I pass in "Person1", I want to receive:
Apple: 2
Banana: 1
How would I combine findByWhoSaidIt(String whoSaidIt) with countByWord?
Just use a @Query annotation with a native SQL query (assuming the data lives in a table named words):

@Query(value = "select word, count(*) from words where who_said_it = :person group by word",
       nativeQuery = true)
List<Object[]> whatWasSaidBy(@Param("person") String person);

Each returned Object[] holds a word and its count.
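If you would rather avoid native SQL, here is a JPQL sketch of the same query, assuming a hypothetical entity WordEntry with fields word and whoSaidIt:

@Query("select w.word, count(w) from WordEntry w where w.whoSaidIt = :person group by w.word")
List<Object[]> whatWasSaidBy(@Param("person") String person);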

How to filter fields from a jsonb column while querying in PostgreSQL

Here is my table (simplified, only significant columns):
CREATE TABLE details(
    id serial primary key,
    name text,
    address jsonb
);
And some sample Data
# Select * from details
id | name | Address
----+----------+-----------------------------------------------------------
1 | Batman | {"city":"Gotham City","street":"1007 Mountain Drive"}
2 | Superman | {"city":"Metropolis","street":"344 Clinton Street"}
3 | Flash | {"city":"Central City","street":"122 Englewood street"}
Now I would like to select only name and the city field of Address. The query would be:
Select name, Address -> 'city' as Address from details
name | Address
----------+------------------
Batman | "Gotham City"
Superman | "Metropolis"
Flash | "Central City"
But I want it to be filtered as shown below.
name | Address
----------+-------------------------
Batman | {"city":"Gotham City"}
Superman | {"city":"Metropolis"}
Flash | {"city":"Central City"}
Is it possible to select only some fields from a jsonb column? If so, what would the query be?
If you want to include only one field, your query can be fairly simple:
select name, jsonb_build_object('city', address -> 'city') address
from details
However, if you want to include multiple fields, things get more complex. You could, for example, remove unwanted keys one by one with the - operator, like jsonb_column - 'key1' - 'key2':
select name, address - 'street' address
from details
But this will only work when you have fairly few fields inside the JSON column (and they are well defined).
If you want a general solution, you should use some aggregation:
select name,
       (select jsonb_object_agg(e.key, e.value)
        from jsonb_each(address) e
        where e.key in ('city')) as address
from details
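To keep more than one field, just widen the IN list; for example, keeping both city and street from the sample data:

select name,
       (select jsonb_object_agg(e.key, e.value)
        from jsonb_each(address) e
        where e.key in ('city', 'street')) as address
from details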