Importing TSV files into DocArray - docarray

I have some data stored in a TSV file. I saw that DocArray offers the method from_csv() to import it, but I can't find any good documentation about it. Is there a parameter to change the delimiter from comma to tab?
The only thing I found in the docs is this:

The docs you found mention a parameter dialect:
define a set of parameters specific to a particular CSV dialect. could be a string that represents predefined dialects in your system, or could be a csv.Dialect class that groups specific formatting parameters together. If you don’t know the dialect and the default one does not work for you, you can try setting it to auto.
Actually, the value auto can successfully infer the TSV file type and delimiter; you just need to do:
# run this command to download a sample file
# wget https://gist.githubusercontent.com/alaeddine-13/76b4aa7805a347cf2cdf12db78e0a81c/raw/a7df1a867e8cf80b4c226f72f219d0b6f2cea8a2/sample.tsv
from docarray import DocumentArray

da = DocumentArray.from_csv('sample.tsv', dialect='auto')
In case you need a specific dialect, you can either provide a csv.Dialect object or provide a dialect name available in Python's list of registered dialects; a sketch of both follows.
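For example, a minimal sketch (the TabDialect class is illustrative, not part of DocArray; 'excel-tab' is one of Python's predefined tab-separated dialects):
import csv
from docarray import DocumentArray

# a dialect name predefined by Python (see csv.list_dialects())
da = DocumentArray.from_csv('sample.tsv', dialect='excel-tab')

# or define a custom dialect and pass it in
class TabDialect(csv.Dialect):
    delimiter = '\t'
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = csv.QUOTE_MINIMAL

da = DocumentArray.from_csv('sample.tsv', dialect=TabDialect())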

Related

Full Text search with multiple synonyms in PostgreSQL

I am implementing Full Text Search with PostgreSQL. I am using the following type of query to search in the document column.
FROM schema.table t0
WHERE t0.document @@ websearch_to_tsquery('error')
I am working on using FTS dictionaries to search for similar words. I came across the C:\Program Files\PostgreSQL\14\share\tsearch_data folder, where I have defined a word and its synonyms in the xsyn_sample.rules file. The file content is as follows.
# Sample rules file for eXtended Synonym (xsyn) dictionary
# format is as follows:
#
# word synonym1 synonym2 ...
#
error fault issue mistake malfunctioning
I want to use this dictionary but don't know how. When I search for 'error', I want to display results for 'error', 'fault', 'issues', 'mistakes', etc., which have similar meanings. Kindly share if you have ever come across this implementation. A few things I am asking:
Is this xsyn_sample.rules file sufficient for this? If not, what other techniques can be used for this type of search?
How do I configure PostgreSQL 14 on my local system to use this dictionary instead of 'simple' or 'english'? I know how to use both of those dictionaries with select plainto_tsquery('english','errors'); and select plainto_tsquery('simple','errors'); queries. Similarly, I want to use my custom dictionary.
Is there any better source for dictionaries to use in Postgres than https://www.postgresql.org/docs/current/textsearch-dictionaries.html ?
Don't edit the example rules file; create your own file mysyn.rules and add the synonyms there. Then create a dictionary that uses the file:
CREATE TEXT SEARCH DICTIONARY mysyn (TEMPLATE = xsyn_template, RULES = mysyn);
Then copy the English text search configuration and add your dictionary:
CREATE TEXT SEARCH CONFIGURATION myconf (COPY = english);
ALTER TEXT SEARCH CONFIGURATION myconf
ALTER MAPPING FOR word, asciiword WITH mysyn, english_stem;
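To sanity-check, you can lexize a word against the new dictionary and build a query with the new configuration (the expected output is a sketch assuming the rules line from the question and default xsyn options):
SELECT ts_lexize('mysyn', 'error');
-- expected: {error,fault,issue,mistake,malfunctioning}

SELECT plainto_tsquery('myconf', 'error');
-- the token 'error' should now expand to its synonyms from mysyn.rules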

Load csv to DB2 database

I'd like to ask if my syntax is correct for loading a CSV file into a DB2 database. I cannot confirm it myself, as I'm having problems configuring DB2 on my local machine. I'd also like to confirm that the placement of the double quotes is correct for both dateformat and timeformat.
Below is my code snippet.
LOGFILE=/mnt/bin/log/myLog.txt
db2 "load from /mnt/bin/test.csv of del modified by coldel noeofchar noheader dateformat=\"YYYY-MM-DD\" timeformat=\"HH:MM:SS\" usedefaults METHOD P(1,2,3,4,5) messages $LOGFILE insert_update into myuser.desctb(DESC_ID,START_DATE,START_TIME,END_DATE,END_TIME)"
If you use modified by coldel then you should also specify the delimiter character. If the delimiter really is a comma, omit the coldel option, as in the sketch below.
Additionally, insert_update is for the IMPORT command (not for the LOAD command), but import is a logged action, which reduces insert throughput. You can use ... replace into ... with the LOAD command. Study the docs for the details.
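For example, the question's command with those two fixes applied (a sketch only; untested, it assumes the file really is comma-delimited and keeps all other options from the question unchanged):
db2 "load from /mnt/bin/test.csv of del modified by noeofchar noheader dateformat=\"YYYY-MM-DD\" timeformat=\"HH:MM:SS\" usedefaults METHOD P(1,2,3,4,5) messages $LOGFILE replace into myuser.desctb(DESC_ID,START_DATE,START_TIME,END_DATE,END_TIME)"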
The quoting seems OK, but the correctness of the formats depends on the values in the data file.
Refer to the LOAD documentation for details; you should study this page and the related pages.
An alternative to LOAD is the INGEST command (available in current Db2 clients), which has insert, replace, merge, and other options, and offers high throughput compared to import.
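A minimal INGEST sketch for the same file (illustrative only; INGEST has its own syntax for field definitions and date/time formats, so consult the INGEST documentation before relying on this):
db2 "INGEST FROM FILE /mnt/bin/test.csv FORMAT DELIMITED INSERT INTO myuser.desctb"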

How do you specify that a parameter accepts wildcard characters in a custom script cmdlet

When you do Get-Help SomeCommand -Full, under each parameter, after the description, there are some additional parameter properties. One of those properties is 'Accept Wildcard Characters?'. When I create my help information for a custom script cmdlet, how do I specify that a parameter accepts wildcards?
In the param section of your script, add the attribute SupportsWildcards().
ex.:
param (
    [SupportsWildcards()][String]$variable
)
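As a quick illustration, a fuller sketch (Get-Thing is a made-up example function; the attribute only affects what Get-Help reports, so the function body still has to implement wildcard matching itself, e.g. with -like):
function Get-Thing {
    <#
    .SYNOPSIS
    Returns items whose names match a pattern.
    .PARAMETER Name
    Name of the item to retrieve. Wildcards are permitted.
    #>
    [CmdletBinding()]
    param (
        # Surfaces as 'Accept wildcard characters? true' in Get-Help -Full
        [SupportsWildcards()]
        [string]$Name
    )
    # The attribute is help metadata only; do the matching explicitly:
    Get-ChildItem | Where-Object { $_.Name -like $Name }
}

Get-Help Get-Thing -Full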
If you want to be able to do this, it will require a few things. First off, you either have to create a .dll file, which you are not doing, or you have to create a module. I am not going to go into all the ins and outs of creating a module; there are already many well-written guides on how to do that available online.
As a part of your module you can include .XML files that provide Help information similarly to the commented help available for individual scripts. The XML style does have some advantages, such as consistency and some advanced features, but does require more effort. Towards this end I would strongly suggest reading Writing Help for Windows PowerShell Modules, as it will explain where to place your XML files, how to structure them, and required headers and what not.
If it were me, I'd probably copy an existing XML help file and edit it to suit my needs for the cmdlet, find and read one of the quick-and-dirty how-tos about creating a module, and then give up on the idea, since (in my opinion) it's not worth the effort just to add that 'Supports Wildcards' flag if this all started out as a basic script with commented help.
But the answer is: create a module and a supporting XML-based help file for your cmdlet. With that, you can add support for the Accepts Wildcards flag for your parameters.

Making stable names for Doxygen HTML docs pages

I need to refer to Doxygen documentation pages. The file names, however, are not stable; they change after every generation. My idea is to create a symlink to each HTML file created by Doxygen, giving each a stable and human-friendly name. Has anyone tried this?
Actually, it might be very easy just to parse the annotated.html file Doxygen produces. Any documented class shows up there as a line like:
<tr><td class="indexkey"><a class="el" href="dd/de6/a00548.html">ImportantClass</a></td>
The hard problem for me is that I would like to have my file names (i.e. the symlinks) be visible on my server like:
http://www.package.com/com.package.my.ImportantClass.html
[Yes, the code is in Java.] So the question actually reads: how do I connect an HTML page generated by Doxygen with the right Java class name and its package name?
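One rough sketch of the whole pipeline (untested and hypothetical; it assumes each class page's <title> contains the fully qualified name followed by "Class Reference", which is what Doxygen typically emits for Java sources):
import os
import re

html_dir = 'html'  # Doxygen output directory
title_re = re.compile(r'([\w.]+) Class Reference</title>')

# Walk the generated pages (they may sit in hashed subdirectories when
# CREATE_SUBDIRS is enabled), recover the qualified class name from the
# page title, and symlink a stable name to the volatile file.
for root, _, files in os.walk(html_dir):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        with open(path, encoding='utf-8') as f:
            m = title_re.search(f.read())
        if m:
            link = os.path.join(html_dir, m.group(1) + '.html')
            if not os.path.lexists(link):
                os.symlink(os.path.relpath(path, html_dir), link)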
You seem to have SHORT_NAMES enabled, which will indeed produce volatile names. When you set SHORT_NAMES to NO in the configuration file (the default), you will get longer names, but these are stable over multiple runs (i.e. they are based on the name, and for functions also on a hash of the parameters).

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested, which I would like to return as hits in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsvector('english','radio')
Unfortunately, the default parser treats radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
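For reference, this is roughly what ts_debug shows for the problematic token (abridged; exact output varies by version):
select alias, token, lexemes from ts_debug('english', 'radio/tested');
-- alias | token        | lexemes
-- file  | radio/tested | {radio/tested}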
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. Is there some way of disabling file, or at least having it recognize both the file token and all the words that it thinks make up the directory names along the way? Or is there a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replace-ing them myself)? That would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext, etc.), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.