How to set the FSM configuration for Textricator PDF OCR reader? - itext

I'm trying to use the PDF document parser called Textricator. It can parse a PDF with one of three common PDF libraries (itext5, itext7, pdfbox) and offers three methods: text, table and form. Text is for plain raw text extraction, table reads out structured table data, and form parses less structured forms using a Finite State Machine (FSM).
However, I am not able to use the form parser. Perhaps I simply don't understand how to organize the many configuration states. The documentation is lacking a simple form example, and someone recently posted an attempt to read a very basic table using the form method, but was not able to. I also gave it a shot, but without any success.
Q: Can someone help me configure the state machine in the YML file?
(This is used to parse the demo file from one of that repo's issues, and shown in the copied screenshot below.)
The YML configuration file.
extractor: "pdf.pdfbox"
header:
default: 100
footer:
default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
item:
label: "item"
valueTypes:
- item
- date
- description
- order_number
- quantity
- price
valueTypes:
item:
label: "Item"
date:
label: "Date"
description:
label: "Description"
order_number:
label: "OrderNo"
quantity:
label: "Qty"
price:
label: "Price"
initialState: "INIT"
states:
INIT:
transitions:
-
condition: item
nextState: item
item:
startRecord: true
transitions:
-
condition: date
nextState: date
date:
include: true
transitions:
-
condition: description
nextState: description
description:
include: true
transitions:
-
condition: description
nextState: description
-
condition: order_number
nextState: order_number
-
condition: quantity
nextState: quantity
order_number:
include: true
transitions:
-
condition: order_number
nextState: order_number
-
condition: quantity
nextState: quantity
quantity:
include: true
transitions:
-
condition: price
nextState: price
price:
include: true
transitions:
-
condition: end
nextState: end
end:
include: false
transitions:
-
condition: any
nextState: end
conditions:
item: '73 < ulx < 110 and text =~ /(\\d)*/'
date: '110 < ulx < 181 and text =~ /([0-9\-]*)/'
description: '193 < ulx < 366'
# order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
quantity: '393 < ulx < 459'
price: '459 < ulx < 523'
end: 'text =~ /(Footer)/'
any: "1 = 1"
You may wonder why I am insisting on using the form processor for this simple example. It is because my real-life document will have a much more complex sub-structure of child items under the Description field. This can only (?) be processed efficiently by a state machine, AFAIK.
But, maybe this is not the right tool for the job? So what other options are there?
UPDATE: (2021-05-18)
The author of Textricator has now updated the libraries used and the documentation, and fixed several examples and user issues. Thanks to user mweber I now have a perfectly working parser and no longer need to use awk to handle weird columns.

As Textricator is kind of a hidden gem for PDF parsing imo, I'm happy to see someone using it. I posted a config that works with the sample document to the GitHub issue:
extractor: "pdf.pdfbox"
header:
  default: 100
footer:
  default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price
valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
initialState: "INIT"
states:
  INIT:
    include: false
    transitions:
      -
        condition: item
        nextState: item
      - condition: any
        nextState: INIT
  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date
  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description
  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
      -
        condition: item
        nextState: item
  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price
  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
      -
        condition: description
        nextState: description
      -
        condition: item
        nextState: item
  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end
conditions:
  item: '73 < ulx < 110 and text =~ /(\\d)*/'
  date: '110 < ulx < 181 and text =~ /([0-9\\-]*)/'
  description: '193 < ulx < 366'
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\\-)(([0-9]{2}))/'
  quantity: '393 < ulx < 459'
  price: '459 < ulx < 523'
  end: 'text =~ /(Footer)/'
  any: "1 = 1"
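For comparison with the config in the question, the state changes in this version can be isolated as in the fragment below. This is only a reading of the difference between the two files, not an official explanation. The assumption is that the `any` fallback on INIT lets the parser skip text that appears before the first item, and that the extra transitions back to `item` let a new record start once a row is finished.

```yaml
states:
  INIT:
    include: false            # text seen while waiting is not added to a record
    transitions:
      - condition: item
        nextState: item
      - condition: any        # fallback: stay in INIT until an item number shows up
        nextState: INIT
  description:
    include: true
    transitions:
      - condition: description
        nextState: description
      - condition: order_number
        nextState: order_number
      - condition: quantity
        nextState: quantity
      - condition: item       # added: a new row may start right after a description
        nextState: item
  price:
    include: true
    transitions:
      - condition: end
        nextState: end
      - condition: description  # added
        nextState: description
      - condition: item       # added: the next row starts a new record
        nextState: item
```

The other difference is in the conditions: the order_number pattern is now wrapped in slashes like the other patterns, and its backslash is doubled.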

Related

How to add reference to application/x-www-form-urlencoded request body?

I have built the JSON request below which works fine. But how to provide the $ref if the request media type is application/x-www-form-urlencoded?
For example, you can see the schema Filters which I have created. I want to provide the reference to that in the NewDogRequest.Filter parameter.
openapi: 3.0.2
info:
description: RESTful web services for writing and reading Dogs Data.
version: v1.0
title: Dogs Services
tags:
- name: Dogs
paths:
/dogs:
post:
tags:
- Dogs
summary: Add new dogs data.
description: Use this service when you want to add a new dog.
operationId: addDog
requestBody:
$ref: '#/components/requestBodies/NewDogRequest'
responses:
200:
description: OK
content:
application/json:
schema:
type: string
components:
# ****************** Request Bodies ****************** #
requestBodies:
NewDogRequest:
content:
application/x-www-form-urlencoded:
schema:
type: object
properties:
dogID:
description: >
* Unique dog id.
type: string
filter:
description: >
Specifies the type breed of the dog
type: string
enum:
- German Sheperd
- Husky
- DashHound
default: German Sheperd
# ****************** Schemas ****************** #
schemas:
Filters:
type: string
enum:
- German Sheperd
- Husky
- DashHound
default: German Sheperd
Tried the following, but it didn't work. If you see the image that I have attached the enums are not rendered in the html.
requestBodies:
NewDogRequest:
content:
application/x-www-form-urlencoded:
schema:
type: object
properties:
dogID:
description: >
* Unique dog id.
type: string
filter:
description: >
Specifies the type breed of the dog
$ref: '#/components/schemas/Filters'
Swagger editor image
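A possible explanation, offered as a sketch rather than a confirmed fix: in OpenAPI 3.0, keywords placed next to `$ref` in the same object are ignored, so `description` and `$ref` cannot be combined directly on `filter`. One common workaround is to wrap the reference in `allOf` and keep the description beside it. Whether the enum dropdown then renders is an assumption here, since it depends on the Swagger Editor/UI version.

```yaml
components:
  requestBodies:
    NewDogRequest:
      content:
        application/x-www-form-urlencoded:
          schema:
            type: object
            properties:
              dogID:
                description: >
                  * Unique dog id.
                type: string
              filter:
                description: Specifies the type breed of the dog
                # allOf keeps the description while still pulling in the shared schema
                allOf:
                  - $ref: '#/components/schemas/Filters'
  schemas:
    Filters:
      type: string
      enum:
        - German Sheperd
        - Husky
        - DashHound
      default: German Sheperd
```

If the extra description is not needed, `filter` can also contain nothing but the `$ref` line.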

TYPO3 - cms-forms - Remove Input from Email

I am trying to remove an input field from the generated email. With Powermail it is relatively easy. There I can exclude fields in the typoscript. How could something like this look with cms-forms?
Example powermail
excludeFromPowermailAllMarker {
# On Confirmation Page (if activated)
confirmationPage {
# add some markernames (commaseparated) which should be excluded
excludeFromMarkerNames = datenschutzbestimmungen, agb
}
}
TYPO3 11.5.12
php 8.1.2
This is possible with the form variants introduced in TYPO3 version 9.
Hide a form element in certain finishers and on the summary step:
type: Form
identifier: test
prototypeName: standard
label: Test
finishers:
  -
    identifier: EmailToReceiver
    options:
      subject: Testmail
      recipientAddress: tritum@example.org
      recipientName: 'Test'
      senderAddress: tritum@example.org
      senderName: tritum@example.org
renderables:
  -
    type: Page
    identifier: page-1
    label: 'Page 1'
    renderables:
      -
        type: Text
        identifier: text-1
        label: 'Text 1'
        variants:
          -
            identifier: hide-1
            renderingOptions:
              enabled: false
            condition: 'stepType == "SummaryPage" || finisherIdentifier in ["EmailToSender", "EmailToReceiver"]'
      -
        type: Text
        identifier: text-2
        label: 'Text 2'
  -
    type: SummaryPage
    identifier: summarypage-1
    label: 'Summary step'
The relevant part (which disables rendering of the field on the summary page, in the email to sender finisher and in the email to receiver finisher) is
variants:
  -
    identifier: hide-1
    renderingOptions:
      enabled: false
    condition: 'stepType == "SummaryPage" || finisherIdentifier in ["EmailToSender", "EmailToReceiver"]'

Overwrite YAML-Files of individual forms with typoscript

I tried for a few hours to override some parameters in the YAML file of a form with TypoScript. I found this passage in the manual:
https://docs.typo3.org/c/typo3/cms-form/9.5/en-us/Concepts/FrontendRendering/Index.html?highlight=formdefinitionoverrides#typoscript-overrides
But I could not get it to work. Additionally, I could not find a way to debug my YAML definitions. I tried the hint with
typo3/sysext/form/Classes/Mvc/Configuration/ConfigurationManager.php::getConfigurationFromYamlFile()
but this shows only the prototypes, not the forms.
So I have some questions:
Is there a way to debug the combined YAML and TypoScript configuration of a form?
Does formDefinitionOverrides as described in the manual really work (TYPO3 9.5)?
What is ? The identifier of my form in the YAML file, or the identifier in the frontend with the number of the content element (myFormIdentifier-UidOfMyContentElement)?
Is it possible to work with identifiers instead of array indices? (Multiple nested array indices with up to 10 or more entries are driving me crazy.)
Thanks!
I found how to use labels instead of using numbered arrays:
the yaml:
type: Form
identifier: test1
prototypeName: standard
renderables:
  page1:
    renderingOptions:
      previousButtonLabel: 'Previous step'
      nextButtonLabel: 'Next step'
    type: Page
    identifier: page-1
    label: Step
    renderables:
      field1:
        defaultValue: ''
        type: Text
        identifier: email-1
        label: 'My Email address'
        properties:
          validationErrorMessages:
            -
              code: 1221559976
              message: öasdlkhfö
and the typoscript:
plugin.tx_form {
  settings {
    formDefinitionOverrides {
      # identifier of form
      test1 {
        renderables {
          # first page of form
          page1 {
            renderables {
              field1 {
                label = TEXT
                label.value = Eine ganze andere E-Mailaddresse
              }
            }
          }
        }
      }
    }
  }
}
This works nicely, but you cannot mix the two styles: all fields on one level have to get labels. That makes sense, because it is not possible to mix indices and keys in a PHP array.
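To illustrate the point about not mixing labels and list items, here is a sketch of a page with two fields that are both keyed. The second field, its key `field2` and its identifier `text-2` are made up for this example and are not part of the form above.

```yaml
renderables:
  page1:
    type: Page
    identifier: page-1
    label: Step
    renderables:
      # every sibling on this level uses a key; none of them is a "-" list item
      field1:
        type: Text
        identifier: email-1
        label: 'My Email address'
      field2:                   # hypothetical second field
        type: Text
        identifier: text-2
        label: 'Second field'
```

With this layout, a TypoScript override could address the second field as `page1.renderables.field2`, in the same way as `field1` above.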
Yes, it does work. Here's an example.
The identifier in this example is "myformEN".
With TypoScript you can't do it without this nested array syntax.
TypoScript
plugin.tx_form {
  settings {
    yamlConfigurations.100 = EXT:user_site/Configuration/Form/CustomFormSetup.yaml
    formDefinitionOverrides {
      # identifier of form
      myformEN {
        renderables {
          # first page of form
          0 {
            renderables {
              # number of element in form
              0 {
                # another level (because of "Grid: Row")
                renderables {
                  0 {
                    defaultValue = TEXT
                    defaultValue.value = My Text
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
myformEN.form.yaml
renderingOptions:
  submitButtonLabel: 'submit'
identifier: myformEN
label: 'Inquiry'
type: Form
prototypeName: standard
[…]
renderables:
  -
    renderingOptions:
      previousButtonLabel: ''
      nextButtonLabel: Next
    identifier: page-1
    label: ''
    type: Page
    renderables:
      -
        type: GridRow
        identifier: gridrow-1
        label: 'Grid: Row'
        renderables:
          -
            defaultValue: 'this will be overwritten by TypoScript'
            type: Text
            identifier: article
            label: Article
          -
            defaultValue: ''
            type: Text
            identifier: text-2
            label: 'Amount'
[…]

How to display selected taxonomy of an article in Rest view in drupal 8

I have created a view with a REST export as follows:
Format = Serializer
Settings = Fields , Settings
Fields :
Content: Title
Content: Category
Under Category I have chosen the settings below:
Filter criteria
Content: Published (= Yes)
Content: Content type (= News)
But when I view the JSON, the selected taxonomy term is not displayed in the category field.
I have created the REST view in Drupal 8 with the Article content type for the following fields:
Title
Tag
Below is the exported view configuration. Change the UUID to match your project and import it.
uuid: 435345-345ty-4354-35345f-534534fdf
langcode: en
status: true
dependencies:
config:
- field.storage.node.field_tags
- node.type.article
module:
- node
- rest
- serialization
- user
id: test_rest_view
label: 'test rest view'
module: views
description: ''
tag: ''
base_table: node_field_data
base_field: nid
core: 8.x
display:
default:
display_plugin: default
id: default
display_title: Master
position: 0
display_options:
access:
type: perm
options:
perm: 'access content'
cache:
type: tag
options: { }
query:
type: views_query
options:
disable_sql_rewrite: false
distinct: false
replica: false
query_comment: ''
query_tags: { }
exposed_form:
type: basic
options:
submit_button: Apply
reset_button: false
reset_button_label: Reset
exposed_sorts_label: 'Sort by'
expose_sort_order: true
sort_asc_label: Asc
sort_desc_label: Desc
pager:
type: mini
options:
items_per_page: 10
offset: 0
id: 0
total_pages: null
expose:
items_per_page: false
items_per_page_label: 'Items per page'
items_per_page_options: '5, 10, 25, 50'
items_per_page_options_all: false
items_per_page_options_all_label: '- All -'
offset: false
offset_label: Offset
tags:
previous: ‹‹
next: ››
style:
type: serializer
row:
type: fields
options:
inline: { }
separator: ''
hide_empty: false
default_field_elements: true
fields:
title:
id: title
table: node_field_data
field: title
relationship: none
group_type: group
admin_label: ''
label: ''
exclude: false
alter:
alter_text: false
text: ''
make_link: false
path: ''
absolute: false
external: false
replace_spaces: false
path_case: none
trim_whitespace: false
alt: ''
rel: ''
link_class: ''
prefix: ''
suffix: ''
target: ''
nl2br: false
max_length: 0
word_boundary: false
ellipsis: false
more_link: false
more_link_text: ''
more_link_path: ''
strip_tags: false
trim: false
preserve_tags: ''
html: false
element_type: ''
element_class: ''
element_label_type: ''
element_label_class: ''
element_label_colon: false
element_wrapper_type: ''
element_wrapper_class: ''
element_default_classes: true
empty: ''
hide_empty: false
empty_zero: false
hide_alter_empty: true
click_sort_column: value
type: string
settings:
link_to_entity: false
group_column: value
group_columns: { }
group_rows: true
delta_limit: 0
delta_offset: 0
delta_reversed: false
delta_first_last: false
multi_type: separator
separator: ', '
field_api_classes: false
entity_type: node
entity_field: title
plugin_id: field
field_tags:
id: field_tags
table: node__field_tags
field: field_tags
relationship: none
group_type: group
admin_label: ''
label: ''
exclude: false
alter:
alter_text: false
text: ''
make_link: false
path: ''
absolute: false
external: false
replace_spaces: false
path_case: none
trim_whitespace: false
alt: ''
rel: ''
link_class: ''
prefix: ''
suffix: ''
target: ''
nl2br: false
max_length: 0
word_boundary: true
ellipsis: true
more_link: false
more_link_text: ''
more_link_path: ''
strip_tags: false
trim: false
preserve_tags: ''
html: false
element_type: ''
element_class: ''
element_label_type: ''
element_label_class: ''
element_label_colon: false
element_wrapper_type: ''
element_wrapper_class: ''
element_default_classes: true
empty: ''
hide_empty: false
empty_zero: false
hide_alter_empty: true
click_sort_column: target_id
type: entity_reference_label
settings:
link: false
group_column: target_id
group_columns: { }
group_rows: true
delta_limit: 0
delta_offset: 0
delta_reversed: false
delta_first_last: false
multi_type: separator
separator: ', '
field_api_classes: false
plugin_id: field
filters:
status:
value: '1'
table: node_field_data
field: status
plugin_id: boolean
entity_type: node
entity_field: status
id: status
expose:
operator: ''
group: 1
type:
id: type
table: node_field_data
field: type
relationship: none
group_type: group
admin_label: ''
operator: in
value:
article: article
group: 1
exposed: false
expose:
operator_id: ''
label: ''
description: ''
use_operator: false
operator: ''
identifier: ''
required: false
remember: false
multiple: false
remember_roles:
authenticated: authenticated
reduce: false
is_grouped: false
group_info:
label: ''
description: ''
identifier: ''
optional: true
widget: select
multiple: false
remember: false
default_group: All
default_group_multiple: { }
group_items: { }
entity_type: node
entity_field: type
plugin_id: bundle
sorts:
created:
id: created
table: node_field_data
field: created
order: DESC
entity_type: node
entity_field: created
plugin_id: date
relationship: none
group_type: group
admin_label: ''
exposed: false
expose:
label: ''
granularity: second
header: { }
footer: { }
empty: { }
relationships: { }
arguments: { }
display_extenders: { }
cache_metadata:
max-age: -1
contexts:
- 'languages:language_content'
- 'languages:language_interface'
- request_format
- url.query_args
- 'user.node_grants:view'
- user.permissions
tags:
- 'config:field.storage.node.field_tags'
rest_export_1:
display_plugin: rest_export
id: rest_export_1
display_title: 'REST export'
position: 1
display_options:
display_extenders: { }
path: testreset
pager:
type: some
options:
items_per_page: 10
offset: 0
style:
type: serializer
options:
formats:
json: json
row:
type: data_field
options:
field_options:
title:
alias: ''
raw_output: false
cache_metadata:
max-age: -1
contexts:
- 'languages:language_content'
- 'languages:language_interface'
- request_format
- 'user.node_grants:view'
- user.permissions
tags:
- 'config:field.storage.node.field_tags'
See rest output

How do I query MongoDB collection based on related documents (in Doctrine)

I have a few related collections in my Doctrine ODM project…
# Contract.mongodb.yml
Project\Contract\Domain\Contract:
type: document
repositoryClass: Project\SymfonyBundle\ContractBundle\Repository\ContractRepository
collection: Contracts
fields:
id:
type: id
id: true
strategy: UUID
slug:
type: string
length: 128
unique: true
gedmo:
slug:
separator: -
style: default
fields:
- refNo
- name
name:
type: string
refNo:
type: string
purpose:
type: string
budgetAmount:
type: int
budgetCurrency:
type: string
length: 3
startDate:
type: date_immutable
endDate:
type: date_immutable
provider:
type: string
referenceOne:
provider:
targetDocument: Project\Contract\Domain\Provider
cascade:
- persist
- merge
- detach
- refresh
referenceMany:
reports:
targetDocument: Project\Report\Domain\Report
cascade:
- all
# Provider.mongodb.yml
Project\Contract\Domain\Provider:
type: document
repositoryClass: Project\SymfonyBundle\ContractBundle\Repository\ProviderRepository
collection: Providers
fields:
id:
type: id
id: true
strategy: UUID
slug:
type: string
length: 128
unique: true
gedmo:
slug:
separator: -
style: default
fields:
- name
name:
type: string
unique: true
referenceMany:
users:
targetDocument: Project\User\Domain\User
cascade: []
# User.mongodb.yml
Project\User\Domain\User:
type: document
repositoryClass: Project\SymfonyBundle\UserBundle\Repository\UserRepository
collection: Users
fields:
id:
type: id
id: true
strategy: UUID
What I want to do is get the contracts for a given user, but I can't work out how to query the Contracts collection based on a user. Do I need to make 2 queries? 1 to get the user's providers & then a second to query for contracts that link to one of the providers?
If you're able to advise how I do this in the console as well as Doctrine, I'd appreciate the knowledge.
Thanks in advance for your help :o)
You can use the aggregation pipeline with the $lookup operator to join the documents. See https://docs.mongodb.com/v3.2/reference/operator/aggregation/lookup/
However if this is common, I'd consider re-modeling your documents.