Elastic Search bulk indexing using perl

Elastic Search bulk indexing using perl - perl

I have tried the bulk API Perl client for content indexing in Elasticsearch. I am getting Error on the Bulk Ingestion line. Please find the code below:
my $ifileid=0;
my $dir = '/home/bala/input_files/output';
opendir(DIR, $dir) or die $!;
my #arfiles = readdir (DIR);
closedir(DIR);
print scalar #arfiles." Total files\n";
foreach(#arfiles)
{
my $file = $_;
if ($ifileid>1)
{
$doc = {index => 'my_index', type => 'blog_post', id => $ifileid, body => {filename => $file, content => 'bala'}};
push #docs, { create => $doc };
if ($ibulkid==100)
{
# bulk index docs
my $res = $e->bulk(\#docs);
if ( $res->{errors} )
{
die "Bulk index had issues: " . $json->encode( $res->{errors} );
}
$ibulkid=0;
}
$ibulkid++;
}
$ifileid++;
}
I am getting the following error:
Error => Not a HASH reference at /usr/local/share/perl5/Search/Elasticsearch/Role/Client/Direct.pm line 15.

The above usage of bulk api is wrong. bulk takes as input a hashref where the body is a reference to array of actions and documents
For example something on these lines should work:
$action = {index => {_index => 'my_index', _type => 'blog_post', _id => $ifileid}};
$doc = {filename => $file, content => 'bala'};
push #docs, $action;
push #docs,$doc
if ($ibulkid==100)
{
# bulk index docs
my $res = $e->bulk(body => \#docs);
if ( $res->{errors} )
{
die "Bulk index had issues: " . $json->encode( $res->{errors} );
}
$ibulkid=0;
}
$ibulkid++;
}
$ifileid++;

Related

What is the purpose of passing undef to DBI's `do` method in this context?

I don't understand what undef is doing in this snippet:
$dbh->do (qq {
INSERT INTO todo SET t = NOW(), status = 'open', content = ?
}, undef, $content);
Can someone please explain? I think I understand the whole code, but not this where it came from.
use warnings;
use strict;
use lib q(/data/TEST/perl/lib);
use CGI qw(:standard);
use WebDB;
sub insert_item {
my $content = shift;
my $dbh;
$content =~ s/^\s+//;
$content =~ s/^\s+$//;
if ($content ne "") {
$dbh = WebDB::connect();
$dbh->do (qq {
INSERT INTO todo SET t = NOW(), status = 'open', content = ?
}, undef, $content);
$dbh->disconnect();
}
}
sub display_entry_form {
print start_form(-action=> url()),
"To-do item:", br (),
textarea ( -name => "content",
-value => "",
-override => 1,
-rows =>3,
-columns => 80),
br (),
submit(-name=> "choice", -value => "Submit"),
end_form();
}
print header(), start_html(-title=>"To-Do List", -bgcolor => "white"), h2("To-Do List");
my $choice = lc(param ("choice"));
if ($choice eq "") {
display_entry_form();
} elsif ( $choice eq "submit" ) {
insert_item(param("content"));
display_entry_form();
} else {
print p ("Logic error, unknown choice: $choice");
}

The do() method takes 3 arguments: the query, query attributes, and bind data. The undef in your example means that there are no attributes to apply.
See "do()" in DBI on CPAN.
$rows = $dbh->do($statement) or die $dbh->errstr;
$rows = $dbh->do($statement, \%attr) or die $dbh->errstr;
$rows = $dbh->do($statement, \%attr, #bind_values) or die ...

GetOption - Perl - Referencing

So I have stumbled upon a little issue when trying to build out a simple "Airport Search Script" in Perl.
my $filename = '/home/student/perl-basic/topic-07/iata_airports.csv';
my $number = '1';
my $matching;
my $latitude;
my $longitude;
my $word = 'false';
GetOptions (
"filename=s" => \$filename,
"number=i" => \$number,
"matching=s" => \$matching,
"latitude=f" => \$latitude,
"longitude=f" => \$longitude,
"word=s" => \$word
);
sub parse_airports {
my $file = shift;
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
open ( my $fh, "<", $file ), or die "Error opening input file: $!";
my $ra_colnames = $csv->getline ( $fh );
$csv->column_names( #$ra_colnames );
my $ra_airports = $csv->getline_hr_all( $fh );
close ( $fh );
return $ra_airports;
}
sub get_name_matching_airports {
}
my $rah_airports = parse_airports( $filename );
my $rah_airports_found = [];
if ($matching) {
say "Up to $number airports matching $matching in $filename:";
$rah_airports_found = get_name_matching_airports(
airports => $rah_airports,
matching_string => $matching,
word => $word,
);
}
elsif ($latitude && $longitude) {
say "Up to $number airports near [$latitude, $longitude] in $filename:"
}
else {
say "Must have at least --matching, or --latitude and --longitude as arguments";
}
print pp($rah_airports_found);
So where I am struggling is in the "sub get_name_matching_airports"
Because you do not have the file let me explain the file structure.
It is a hash (ALL IATA Airports) with hashes (DETAILS of each airport). There are around 15 keys in each airport hash and one of the keys titles is (NAME). I have opened the file and parsed all the info into a hash ref which is returned at the end of the sub "parse_airports".
In the sub "get_name_matching_airports" I need to find additional airports with similar names based on the argument I passed in, into ($matching).
EXAMPLE: I parse (case-insensitive) "London" as an argument from the command line e.g. ./search_airports2 --matching London. In the sub "get_name_matching_airports" I will need to respond with any airport that has london (case-insensitive) in key(name).
Then push these newly found airports which are similar into the array "rah_airports_found" and in the end print this out.
SO I SOLVED MY PROBLEM WITH THE FOLLOWING CODE:
sub get_name_matching_airports {
my %params = (
airports => undef,
matching_string => undef,
word => undef,
#_
);
my #rah_airports_found;
my $ra_airports = $params{airports};
my $counter = 0;
foreach my $i ( #$ra_airports ) {
if ( $params{word} ) {
if ( $i->{name} eq $params{matching_string} ) {
push #rah_airports_found, $i;
$counter++;
}
}
else {
if ( $i->{name} =~ /$params{matching_string}/i ) {
push #rah_airports_found, $i;
$counter++;
}
if ( defined( $number ) && $counter == $number ) {
return \#rah_airports_found;
}
}
}
return \#rah_airports_found;
}

Example:
for my $Airport_rf (keys %{$rah_airports}) {
if ( $Airport_rf->{NAME} =~ m{\Q$matching\E}xi) {
# do your stuff here
}
}
If you don´t know the exact key of the hashref, you have to match the CLI parameter against all values.

Perl LDAP search - over 1500 member in a group

I want to search with an Perl script and ldap connection all members of a group with over 10.000 member.
I can only find results, if i set $first=0 and $last=1499 and than i get only the first 1500 member of the group.
If i use other parameter for $first and $last, then i got no results.
"$ldapsearchresult = $ldapconnect->search (
Sizelimit => 0,
base => 'any_base',
filter => '(objectClass=*)',
attr => ['member;range=$first-$last'],
);"
Thanks for your help!

You need to search the attribute range as a subtype again and again until the last return '*'.
Here is the code I am using, it is also use paged search in AD.
use Net::LDAP;
use Net::LDAP qw(LDAP_CONTROL_PAGED);
use Net::LDAP::Util qw(ldap_error_name canonical_dn ldap_explode_dn ldap_error_text);
use Net::LDAP::Control::Paged;
my $page_page = Net::LDAP::Control::Paged->new( 'size' => $input{'page'} );
my $finished_search = 0;
my $page_cookie;
my $result;
my #page_search_args = (
'base' => $input{"base"},
'scope' => $input{'scope'},
'filter' => $input{'filter'},
'attrs' => $input{'attrs'},
'control' => [ $page_page ],
'deref' => 'never',
'raw' => qr!^DO_NOT_MATCH!,
);
while (!$finished_search) {
my $msg = $ldap->search(#page_search_args);
if ($msg->is_error()) {
die "ERROR: ",$msg->error,"\n";
last;
} else {
my ($response) = $msg->control(LDAP_CONTROL_PAGED);
$page_cookie = $response->cookie();
$finished_search = 1 if !$page_cookie;
$page_page->cookie($page_cookie);
while (my $entry = $msg->pop_entry()){
$ldap_searches++;
print_all_attributes($entry);
}
}
}
if ($page_cookie) {
$page_page->cookie($page_cookie);
$page_page->size(0);
$ldap->search(#page_search_args);
}
sub add_result {
my $dn = shift;
my $attr = shift;
my $data = shift;
my $res = shift;
$attr =~ s!(;range\=\d+\-\d+)!!i;
#print "removed $1 from $attr" if $1;
foreach my $subtype (keys %{$data}){
$attr = $attr.$subtype if $subtype ne '';
$attr =~ s!(;range\=\d+\-\d+)!!i;
if (defined $$res->{$dn}->{$attr}){
push(#{$$res->{$dn}->{$attr}},#{$data->{$subtype}});
} else {
push(#{$$res->{$dn}->{$attr}},#{$data->{$subtype}});
}
}
return $res;
}
sub print_all_attributes {
my $entry = shift;
foreach my $attr ($entry->attributes()) {
if ($attr =~ /;range=/) {
my $last = 0;my $first = 0;
### $var will look like this --> "member;range=0-1499"
(my $pure_attr,my $range) = split /;/, $attr,2;
(my $junk,$range) = split /=/, $range,2;
($first,$last) = split /-/, $range,2;
$i++;
add_result($entry->dn(),$pure_attr,$entry->get_value($attr,alloptions => 1, asref => 1),\$result) if $last eq '*' or $last >= $parms{'attribute_page'};
### if $last eq "*", indicates this is the last range increment, and
### we do not need to perform another supplemental search
if ($last ne "*") {
my $range_diff = ($last - $first) + 1;
my $increment = $last + $range_diff;
$last = $last + 1;
$attr = "$pure_attr;range=$last-$increment";
$parms{'attrs'} = [$attr];
search_nonpaged(%parms);
}
} else {### if $attr matches range pattern
add_result($entry->dn(),$attr,$entry->get_value($attr,alloptions => 1, asref => 1),\$result);
}
}
return 1;
}
sub search_nonpaged{
my %input = #_;
my #page_search_args = (
'base' => $input{"base"},
'scope' => $input{'scope'},
'filter' => $input{'filter'},
'attrs' => $input{'attrs'},
'deref' => 'never',
'raw' => qr!^DO_NOT_MATCH!,
);
my $msg = $ldap->search(#page_search_args);
if ($msg->is_error()) {
die "ERROR: ",$msg->error,"\n";
}
while (my $entry = $msg->pop_entry()){
$ldap_searches++;
print_all_attributes($entry);
}
}

You maybe able to simplify the program by searching for:
memberOf=CN=GroupOne,OU=Security Groups,OU=Groups,DC=YOURDOMAIN,DC=NET
You will still need to use the paged results control but will not need the range control.
Microsoft Active Directory uses the MaxValRange to control the number of values that are returned in the retrieval of multi-valued attributes of an entry.
By using the filter above, you can avoid the MaxValRange settings.
BY THE WAY: if you want to obtain nested members also, try:
(memberOf:1.2.840.113556.1.4.1941:=CN=GroupOne,OU=Security Groups,OU=Groups,DC=YOURDOMAIN,DC=NET)
This filter uses the LDAP_MATCHING_RULE_IN_CHAIN extensible match.
-jim

I found an easier way to search all member of a AD group:
http://permalink.gmane.org/gmane.comp.lang.perl.modules.ldap/246
use Net::LDAP; use Net::LDAP::Util;
# Connect to AD make sure to specify version 3
$ldap = new Net::LDAP("myGC.yy.xx.com",
port => 3268,
debug => 0,
version => 3
) or die "New failed:$ <at> ";
# Do an anonymous bind. You MAY have to do an authenticated bind in your configuration
$result=$ldap->ldapbind() || die "Bind Failed:$ <at> ";
# Some error trapping
$err=$result->code;
if ($err){
$errname=Net::LDAP::Util::ldap_error_name($err);
$errtxt=Net::LDAP::Util::ldap_error_text($err);
if ($errtxt){
print "($err) $errtxt\n";
}
else
{
if ($errname){
print "($err) $errname\n";
}
else
{
print "ERR: $err\n";
}
}
exit;
}
# The combination of the search base and filter determine which object that you
# retrieve
# set search filter to groups of objects. This is what you want to enumerate NT groups.
$filter="(objectClass=group)";
# Set the search base to the DN of the object that you want to retrieve. BTW, using this method on
# groups with less than 1000 members works as well.
$base='CN=mygroup,DCyyy,DC=xxx,DC=com';
# Set the initial attribute indexes and name
$found=1; $startr=0; $endr=-1; $startattr="member";
while($found){
# Create the attribute range specification
$startr=$endr+1;
$endr=$startr+999;
$attr="$startattr;range=$startr-$endr";
$saveattr=$attr;
<at> attr=("$attr");
# Perform the search
$result=$mesg = $ldap->search(base => "$base",filter => $filter,
attrs => [ <at> attr],
scope => "sub") or die "search died";
# Some error trapping
$err=$result->code;
if ($err){
if (!($err == 1)){
$errname=Net::LDAP::Util::ldap_error_name($err);
$errtxt=Net::LDAP::Util::ldap_error_text($err);
if ($errtxt){
print "($err) $errtxt\n";
}
else
{
if ($errname){
print "($err) $errname\n";
}
else
{
print "ERR: $err\n";
}
}
}
else
{
print "COUNT=$cnt\n";
}
exit;
}
$found=0;
# OK, get the attribute range...so we can update the value of the attribute
# on the next pass
foreach $entry ($mesg->all_entries) {
<at> attr=$entry->attributes;
foreach( <at> attr){
$curattr=$_;
}
}
# Print out the current chunk of members
foreach $entry ($mesg->all_entries) {
$ar=$entry->get("$curattr");
foreach( <at> $ar){
$cnt++;
print "$_\n";
}
$found=1;
if (! <at> $ar[0]){
$found=0;
}
}
# Check to see if we got the last chunk. If we did print toe total and set
# the found flag so we don't search for anymore members
if ($curattr=~/\;range=/){
if ($curattr=~/\-\*/){
print "LASTCOUNT:$cnt\n";
$found=0;
}
}
}

What is the most error resistance way to read a line from a CSV into a Hash in Perl?

I think I'm probably missing something obvious here, so please enlighten me.
Currently I'm reading a CSV file into perl using Text::CSV and it's 'parse' method (outlined below).
csv->parse method:
while (<FILE>) {
if ($csv->parse($_)) {
my #columns = $csv->fields();
'refer to items with: $columns[1]'
}
else {
'Handle the parse error here'
}
}
I'm now looking for a way to read these values into a hash instead of an array. Going through the Text::CSV documentation it seems the most efficient way to do this is by using the 'getline' method (Outlined below), but I'm unsure how to catch errors in a similar manner to the way they are caught using the array approach.
csv->getline method:
my #cols = ("col1", "col2", "col3");
my $item = {};
$csv->bind_columns( \#{$item}{#cols} );
while( $csv->getline($it_fh) ) {
'refer to items using: $item->{col1}'
}
Any hints/tips/links would be great, as my Googleing seems to have come up empty?
EDIT: So here's my understanding of the answer I've accepted, just to clarify what I understand as the fault tolerance of this method.
$csv->column_names( qw(col1 col2 col3) );
my $line;
until ( eof(FILE) ) {
$line++;
my $item = $csv->getline_hr( \*FILE );
if ( $item ) {
# refer to items as $item->{col1}
} else {
my $err = "Line: " . $line . "failed to parse\n"
. "Input: " . $csv->error_input . "\n"
. "Error: " . $csv->error_diag . "\n";
print STDERR $err;
}
}

Well, there's always the straightforward approach:
my #cols = qw(col1 col2 col3);
while ( <FILE> ) {
if ( $csv->parse($_) ) {
my %item;
#item{#cols} = $csv->fields();
# refer to items using $item{col1}
}
else {
# handle the parse error here
}
}
However, I suspect that the following may be a bit more efficient, at least if using the XS implementation of Text::CSV:
$csv->column_names( qw(col1 col2 col3) );
until ( eof(FILE) ) {
my $item = $csv->getline_hr( \*FILE );
if ( $item ) {
# refer to items as $item->{col1}
} else {
# handle the parse error here
}
}

The proper usage of Text::CSV_XS is
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, "<:encoding(UTF-8)", $qfn)
or die("Can't open \"$qfn\": $!\n");
$csv->column_names(qw( col1 col2 col3 ));
while (my $row = $csv->getline_hr($fh)) {
...
}
$csv->eof()
or die("CSV error processing \"$qfn\": ".($csv->error_diag())."\n");
The previously posted version hid errors for no benefit.
If the CSV file has a header row, you can use the following:
my $header = $csv->getline($fh)
or die("No header\n");
$csv->column_names(#$header);

You want Text::CSV::Slurp. From its docs:
use Text::CSV::Slurp;
my $data = Text::CSV::Slurp->load(file => $filename [,%options]);
my $data = Text::CSV::Slurp->load(filehandle => $filehandle [,%options]);
my $data = Text::CSV::Slurp->load(string => $string [,%options]);
$data is now an arrayref of hashrefs.
Edit: I missed that the point of the question is probably the error handling. I don't think it performs any extra validation out of the box.

Perl Working On Two Hash References

I would like to compare the values of two hash references.
The data dumper of my first hash is this:
$VAR1 = {
'42-MG-BA' => [
{
'chromosome' => '19',
'position' => '35770059',
'genotype' => 'TC'
},
{
'chromosome' => '2',
'position' => '68019584',
'genotype' => 'G'
},
{
'chromosome' => '16',
'position' => '9561557',
'genotype' => 'G'
},
And the second hash is similar to this but with more hashes in the array. I would like to compare the genotype of my first and second hash if the position and the choromosome matches.
map {print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n"}sort keys %$cave_snp_list;
map {print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n"}sort keys %$geno_seq_list;
I could do that for the first array of the hashes.
Could you help me in how to work for all the arrays?
This is my actual code in full
#!/software/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Benchmark;
use Config::Config qw(Sequenom.ini);
useDatabase::Conn;
use Data::Dumper;
GetOptions("sam=s" => \my $sample);
my $geno_seq_list = getseqgenotypes($sample);
my $cave_snp_list = getcavemansnpfile($sample);
#print Dumper($geno_seq_list);
print scalar %$geno_seq_list, "\n";
foreach my $sam (keys %{$geno_seq_list}) {
my $seq_used = $geno_seq_list->{$sam};
my $cave_used = $cave_snp_list->{$sam};
print scalar(#$geno_seq_list->{$_}) if sort keys %$geno_seq_list, "\n";
print scalar(#$cave_used), "\n";
#foreach my $seq2com (# {$seq_used } ){
# foreach my $cave2com( # {$cave_used} ){
# print $seq2com->{chromosome},":" ,$cave2com->{chromosome},"\n";
# }
#}
map { print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n" } sort keys %$cave_snp_list;
map { print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n" } sort keys %$geno_seq_list;
}
sub getseqgenotypes {
my $snpconn;
my $gen_list = {};
$snpconn = Database::Conn->new('live');
$snpconn->addConnection(DBI->connect('dbi:Oracle:pssd.world', 'sn', 'ss', { RaiseError => 1, AutoCommit => 0 }),
'pssd');
#my $conn2 =Database::Conn->new('live');
#$conn2->addConnection(DBI->connect('dbi:Oracle:COSI.world','nst_owner','nst_owner', {RaiseError =>1 , AutoCommit=>0}),'nst');
my $id_ind = $snpconn->execute('snp::Sequenom::getIdIndforExomeSample', $sample);
my $genotype = $snpconn->executeArrRef('snp::Sequenom::getGenotypeCallsPosition', $id_ind);
foreach my $geno (#{$genotype}) {
push #{ $gen_list->{ $geno->[1] } }, {
chromosome => $geno->[2],
position => $geno->[3],
genotype => $geno->[4],
};
}
return ($gen_list);
} #end of sub getseqgenotypes
sub getcavemansnpfile {
my $nstconn;
my $caveman_list = {};
$nstconn = Database::Conn->new('live');
$nstconn->addConnection(
DBI->connect('dbi:Oracle:CANP.world', 'nst_owner', 'NST_OWNER', { RaiseError => 1, AutoCommit => 0 }), 'nst');
my $id_sample = $nstconn->execute('nst::Caveman::getSampleid', $sample);
#print "IDSample: $id_sample\n";
my $file_location = $nstconn->execute('nst::Caveman::getCaveManSNPSFile', $id_sample);
open(SNPFILE, "<$file_location") || die "Error: Cannot open the file $file_location:$!\n";
while (<SNPFILE>) {
chomp;
next if /^>/;
my #data = split;
my ($nor_geno, $tumor_geno) = split /\//, $data[5];
# array of hash
push #{ $caveman_list->{$sample} }, {
chromosome => $data[0],
position => $data[1],
genotype => $nor_geno,
};
} #end of while loop
close(SNPFILE);
return ($caveman_list);
}

The problem that I see is that you're constructing a tree for generic storage of data, when what you want is a graph, specific to the task. While you are constructing the record, you could also be constructing the part that groups data together. Below is just one example.
my %genotype_for;
my $record
= { chromosome => $data[0]
, position => $data[1]
, genotype => $nor_geno
};
push #{ $gen_list->{ $geno->[1] } }, $record;
# $genotype_for{ position }{ chromosome }{ name of array } = genotype code
$genotype_for{ $data[1] }{ $data[0] }{ $sample } = $nor_geno;
...
return ( $caveman_list, \%genotype_for );
In the main line, you receive them like so:
my ( $cave_snp_list, $geno_lookup ) = getcavemansnpfile( $sample );
This approach at least allows you to locate similar position and chromosome values. If you're going to do much with this, I might suggest an OO approach.
Update
Assuming that you wouldn't have to store the label, we could change the lookup to
$genotype_for{ $data[1] }{ $data[0] } = $nor_geno;
And then the comparison could be written:
foreach my $pos ( keys %$small_lookup ) {
next unless _HASH( my $sh = $small_lookup->{ $pos } )
and _HASH( my $lh = $large_lookup->{ $pos } )
;
foreach my $chrom ( keys %$sh ) {
next unless my $sc = $sh->{ $chrom }
and my $lc = $lh->{ $chrom }
;
print "$sc:$sc";
}
}
However, if you had limited use for the larger list, you could construct the specific case
and pass that in as a filter when creating the longer list.
Thus, in whichever loop creates the longer list, you could just go
...
next unless $sample{ $position }{ $chromosome };
my $record
= { chromosome => $chromosome
, position => $position
, genotype => $genotype
};
...