Lucene.NET with Windows Azure

I have an ASP.NET MVC 4 site hosted on Windows Azure. I needed full-text search on this site, so I used Lucene.NET. Lucene is using a Windows Azure Blob to store the index files. Currently, a query takes a long time (approx. 1 min). When I look in Fiddler, I notice that 285 requests are fired off to the Blob storage.
My Blob storage currently only has 10 files in it. The largest file is only 177 KB. I also noticed that the Dispose call takes ~20 seconds. Here is my code; I don't feel like I'm doing anything too crazy:
IndexWriter indexWriter = InitializeSearchIndex();
if (indexWriter != null)
{
    foreach (var result in cachedResults)
    {
        var document = new Document();
        document.Add(new Field("Name", result.Name, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("ID", result.ID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("Description", result.Description, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("LastActivity", result.LastActivity, Field.Store.YES, Field.Index.NOT_ANALYZED));
        indexWriter.AddDocument(document);
    }
    indexWriter.Dispose();
}
At the same time, I'm not sure why this is taking so long.

If your search set is small/bounded, you might want to have a look at the cache (preview) version of a Lucene.NET directory I wrote - it will be MUCH faster than a blob-based directory:
https://github.com/ajorkowski/AzureDataCacheDirectory
Of course... if you expect to have an unbounded number of documents etc. this won't be an optimal solution.
I know that Lucene.NET creates a bunch of temp files and then combines them at certain points... Perhaps calling .Optimize() or something similar might combine all the temp files before it actually gets to the point of pushing them up to blob storage (I think this step is obsolete in the newer Lucene.NET versions, though...)
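As a minimal sketch of that idea against the question's own code, assuming a Lucene.NET 3.x IndexWriter that still exposes Optimize() (the call is deprecated or removed in newer versions):

IndexWriter indexWriter = InitializeSearchIndex();
if (indexWriter != null)
{
    foreach (var result in cachedResults)
    {
        // ... build and add documents exactly as in the question ...
    }

    // Merge the intermediate segment files into as few files as possible
    // before they get pushed up to blob storage, then flush and close.
    indexWriter.Optimize();
    indexWriter.Dispose();
}

Fewer, larger segment files generally means far fewer round-trips to blob storage when the index is later opened for searching.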

Related

How to generate an Excel file with multiple sheets using SQL Server 2012?

I need to create a stored procedure or query in SQL Server 2012 that takes a data table from C# code and creates an Excel file with multiple sheets based on the data in that data table.
Using the data below as an example, I have a function GetData that returns a data table.
The goal is to create a file Abc.xlsx with two sheets, the first named Source and the second named Types, and to load into each sheet the data related to it.
So the result will be a new file Abc.xlsx with two sheets, Source and Types, where each sheet has one row of data.
public static DataTable GetData()
{
    DataTable dataTable = new DataTable();
    dataTable.Columns.Add("PartId", typeof(int));
    dataTable.Columns.Add("Company", typeof(string));
    dataTable.Columns.Add("Files", typeof(string));
    dataTable.Columns.Add("Tab", typeof(string));
    dataTable.Rows.Add(1222, "micro", "Abc", "source");
    dataTable.Rows.Add(1321, "silicon", "Abc", "Types");
    return dataTable;
}
Can SQL Server 2012 create Excel files with multiple sheets or not?
I can do that in C#, but not from SQL; so how can I achieve this from SQL Server 2012, by any means?
A SQL stored procedure isn't necessary for this. You can create the .xlsx file directly from C# by iterating over the data rows. Something like:
var xlApp = new Microsoft.Office.Interop.Excel.Application();
var workbook = xlApp.Workbooks.Add();
var data = GetData();
foreach (DataRow row in data.Rows)
{
    // add one worksheet per data row, named after the Tab column
    var worksheet = workbook.Worksheets.Add();
    worksheet.Name = row["Tab"].ToString();
}
workbook.SaveAs("path.xlsx");
There are some more detailed docs about Microsoft.Office.Interop here.
Personally, I would go with a NuGet package for the .xlsx file creation, since with Microsoft.Office.Interop you need Excel from Office actually installed on the machine running the code (it may not be installed on your deployment server, and it may cost more money to install it).
My personal preference for this is this NuGet package.
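For illustration only (the answer links to a package without naming it here), a minimal sketch with EPPlus as one assumed option; the file name and the one-row-per-sheet layout are assumptions taken from the question:

using System.Data;
using System.IO;
using OfficeOpenXml; // EPPlus NuGet package (assumed choice)

ExcelPackage.LicenseContext = LicenseContext.NonCommercial; // only required on EPPlus 5+

var data = GetData();
using (var package = new ExcelPackage())
{
    foreach (DataRow row in data.Rows)
    {
        // one sheet per row, named after the Tab column
        var sheet = package.Workbook.Worksheets.Add(row["Tab"].ToString());
        sheet.Cells[1, 1].Value = row["PartId"];
        sheet.Cells[1, 2].Value = row["Company"];
        sheet.Cells[1, 3].Value = row["Files"];
    }
    package.SaveAs(new FileInfo("Abc.xlsx"));
}

No Excel installation is needed on the server for this approach.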
Good luck!

Splunk to avoid duplication of data pulled by REST API

I have a Splunk instance where I configured a Data Input as "REST API input for polling data from RESTful endpoints".
I have around 20+ endpoints from which I am pulling data in JSON format and loading it into a single index.
However, every poll indexes the same data again: the very first fetch brings 5 values, the next fetch brings another 5, and so on, so the total keeps increasing.
Now my dashboards and reports have run into a duplicate-data problem. How should I avoid it?
As a very unusual workaround, I increased the interval from 1 minute to 1 month, which helps me avoid data duplication.
However, I cannot have stale data for a month... I can still survive with a 1-day interval, but not with 1 month.
Is there any way in Splunk to keep my REST API calls tidy (avoid duplicates) so that my dashboards and reports work on the fly?
Here is a snippet of my inputs.conf file for the REST API input:
[rest://rst_sl_get_version]
auth_password = ccccc
auth_type = basic
auth_user = vvvvvvv
endpoint = https://api.xx.com/rest/v3/xx_version
host = slrestdata
http_method = GET
index = sldata
index_error_response_codes = 0
response_type = json
sequential_mode = 0
sourcetype = _json
streaming_request = 0
polling_interval = 2592000
To remove data that you no longer need or want, you can use the clean command:
splunk clean eventdata -index <index_name>
From the Splunk documentation:
To delete indexed data permanently from your disk, use the CLI clean command. This command completely deletes the data in one or all indexes, depending on whether you provide an argument. Typically, you run clean before re-indexing all your data.
The caveat with this method is that you have to stop Splunk before executing clean. If you wanted to automate the process, you could write a script to stop Splunk, run clean with your parameters, then restart Splunk.
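For example, a minimal sketch of such a script (the index name comes from the question's inputs.conf; the -f flag is assumed here to skip the interactive confirmation):

splunk stop
splunk clean eventdata -index sldata -f
splunk start

Schedule it for a quiet window, since search is unavailable while Splunk is stopped.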
Assuming that every REST API call brings new information, you could code a new response handler attached in splunkweb/etc/app/rest_ta/bin/responsehandlers.py to include a new field in your JSON data (an ID for the report, e.g. a reportTime timestamp). When coding your dashboard, you would then have a field with which you could dynamically get only the latest report, and at the same time keep a history of your data pulls to get more business information.
import datetime
import json


class RestGetCustomField:

    def __init__(self, **args):
        pass

    def __call__(self, response_object, raw_response_output, response_type, req_args, endpoint):
        if response_type == "json":
            output = json.loads(raw_response_output)
            # stamp every record with the date it was pulled so reports can filter on it
            # (print_xml_stream is provided by the rest_ta responsehandlers module)
            for flight in output["Data"]:
                flight.update({"My New Field": datetoday()})
                print_xml_stream(json.dumps(flight))
        else:
            print_xml_stream(raw_response_output)


def datetoday():
    today = datetime.date.today()
    return today.strftime('%Y/%m/%d')
And then you could configure:
response_handler = RestGetCustomField
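Dropped into the stanza from the question, that might look like this (assuming the rest_ta modular input reads the handler class name from this setting):

[rest://rst_sl_get_version]
... existing settings from the question ...
response_type = json
response_handler = RestGetCustomField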
And that's it: the indexed data now has a new field that you can use to identify and/or filter reports.

Bi-directional database syncing for Postgres and MongoDB

Let's say I have a local server running, and I also have an exactly similar server already running on Amazon.
Both servers can CRUD data in their databases.
Note that the servers use both `postgres` and `mongodb`.
Now, when no one is using the Wi-Fi (usually at night), I would like to sync both the Postgres and MongoDB databases so that all writes made to each database on the server get properly applied to the corresponding database on the local machine.
I don't want to use Multi-Master because:
MongoDB does not support this architecture itself, so I would probably need a complex alternative.
I want to control when and how much I sync both databases.
I do not want to use network bandwidth when others are using the internet.
So can anyone point me in the right direction?
Also, if you could list some tools that solve my problem, that would be very helpful.
Thanks.
We have several drivers that would be able to help you with this process. I'm presuming some knowledge of software development and will showcase our ADO.NET Provider for MongoDB, which uses the familiar-looking MongoDBConnection, MongoDBCommand, and MongoDBDataReader objects.
First, you'll want to create your connection string for connecting to your cloud MongoDB instance:
string connString = "Auth Database=test;Database=test;Password=test;Port=27117;Server=http://clouddbaddress;User=test;Flatten Objects=false";
You'll note that we have the Flatten Objects property set to false; this ensures that any JSON/BSON objects contained in the documents will be returned as raw JSON/BSON.
After you create the connection string, you can establish the connection and read data from the database. You'll want to store the returned data in a way that lets you access it easily for future use.
List<string> columns = new List<string>();
List<object> values;
List<List<object>> rows = new List<List<object>>();

using (MongoDBConnection conn = new MongoDBConnection(connString))
{
    //create a WHERE clause that will limit the results to newly added documents
    MongoDBCommand cmd = new MongoDBCommand("SELECT * FROM SomeTable WHERE ...", conn);
    var rdr = cmd.ExecuteReader();
    int results = 0;
    while (rdr.Read())
    {
        values = new List<object>();
        for (int i = 0; i < rdr.FieldCount; i++)
        {
            if (results == 0)
                columns.Add(rdr.GetName(i));
            values.Add(rdr.GetValue(i));
        }
        rows.Add(values);
        results++;
    }
}
After you've collected all of the data for each of the objects that you want to replicate, you can configure a new connection to your local MongoDB instance and build queries to insert the new documents.
connString = "Auth Database=testSync;Database=testSync;Password=testSync;Port=27117;Server=localhost;User=testSync;Flatten Objects=false";
using (MongoDBConnection conn = new MongoDBConnection(connString))
{
    foreach (var row in rows)
    {
        //code here to create comma-separated strings for the columns
        // and values to be inserted in a SQL statement
        String sqlInsert = "INSERT INTO backup_table (" + column_names + ") VALUES (" + column_values + ")";
        MongoDBCommand cmd = new MongoDBCommand(sqlInsert, conn);
        cmd.ExecuteQuery();
    }
}
At this point, you'll have inserted all of the new documents. You could then change your filter (the WHERE clause at the beginning) to filter based on updated date/time and update their corresponding entries in the local MongoDB instance using the UPDATE command.
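A rough sketch of that update step, reusing the columns/rows collections and the driver objects from the snippets above (the backup_table name, the assumption that column 0 holds the document key, and the exact SQL dialect the driver accepts are all assumptions):

using (MongoDBConnection conn = new MongoDBConnection(connString))
{
    foreach (var row in rows) // here `rows` would hold the documents fetched with the updated-date filter
    {
        // build "column = 'value'" pairs for every column except the key column
        var assignments = new List<string>();
        for (int i = 1; i < columns.Count; i++)
        {
            assignments.Add(columns[i] + " = '" + row[i] + "'");
        }
        string sqlUpdate = "UPDATE backup_table SET " + string.Join(", ", assignments)
                         + " WHERE " + columns[0] + " = '" + row[0] + "'";
        MongoDBCommand cmd = new MongoDBCommand(sqlUpdate, conn);
        cmd.ExecuteQuery();
    }
}

Note the naive quoting here; see the second caveat below for formatting values by type.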
Things to look out for:
Be sure that you're properly filtering out new/updated entries.
Be sure that you're properly interpreting the type of each value so that you surround it with quotes (or not) when building the SQL query; one way to do that is sketched below.
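Purely as an illustration (the helper name and the type handling are assumptions, not part of the driver), a small formatter along these lines could be used when building the column_values for the INSERT and UPDATE statements:

using System;
using System.Globalization;

static string ToSqlLiteral(object value)
{
    // map nulls to NULL, quote strings and dates (escaping single quotes),
    // and pass numbers/booleans through using the invariant culture
    if (value == null || value == DBNull.Value)
        return "NULL";
    if (value is string || value is DateTime)
        return "'" + Convert.ToString(value, CultureInfo.InvariantCulture).Replace("'", "''") + "'";
    if (value is bool)
        return (bool)value ? "true" : "false";
    return Convert.ToString(value, CultureInfo.InvariantCulture);
}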
We have a few drivers that might be useful to you. I demonstrated the ADO.NET Provider above, but we also have a driver for writing apps in Xamarin and a JDBC driver (for Java).

SQL Server CE. Delete data from all tables for integration tests

We are using SQL Server CE for our integration tests. At the moment, before every test we delete all data from all tables and then re-seed test data. We drop the database file when the structure changes.
For the deletion of data we need to go through every table in the correct order and issue Delete from table blah, and that is error-prone. Many times I simply forget to add a delete statement when I add new entities. So it would be good if we could automate data deletion from the tables.
I have seen Jimmy Bogard's goodness for deleting data in the correct order. I have implemented that for Entity Framework and it works in full-blown SQL Server. But when I try to use that in SQL CE for testing, I get an exception saying
System.Data.SqlServerCe.SqlCeException : The specified table does not exist. [ ##sys.tables ]
It looks like SQL CE does not have the supporting system tables that hold the required information.
Is there a script that works with the SQL CE version that can delete all data from all tables?
SQL Server Compact does in fact have system tables listing all tables. In my SQL Server Compact scripting API, I have code to list the tables in the "correct" order - not a trivial task! I use QuickGraph; it has an extension method for sorting a DataSet. You should be able to reuse some of that in your test code:
public void SortTables()
{
    var _tableNames = _repository.GetAllTableNames();
    try
    {
        var sortedTables = new List<string>();
        var g = FillSchemaDataSet(_tableNames).ToGraph();
        foreach (var table in g.TopologicalSort())
        {
            sortedTables.Add(table.TableName);
        }
        _tableNames = sortedTables;
        //Now iterate _tableNames and issue DELETE statement for each
    }
    catch (QuickGraph.NonAcyclicGraphException)
    {
        _sbScript.AppendLine("-- Warning - circular reference preventing proper sorting of tables");
    }
}
You must add the QuickGraph DLL files (from CodePlex or NuGet), and you can find the implementation of GetAllTableNames and FillSchemaDataSet here: http://exportsqlce.codeplex.com/SourceControl/list/changesets (in Generator.cs and DbRepository.cs).
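Building on the "Now iterate _tableNames" comment inside SortTables, a minimal sketch of that delete loop (the connection string is a placeholder, and depending on how the graph is built you may need to reverse the sorted list so child tables are emptied before their parents):

using System.Data.SqlServerCe;

using (var conn = new SqlCeConnection(@"Data Source=|DataDirectory|\Tests.sdf"))
{
    conn.Open();
    foreach (var tableName in _tableNames)
    {
        // issue one DELETE per table, in dependency order
        using (var cmd = new SqlCeCommand("DELETE FROM [" + tableName + "]", conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}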

Mass insert with Mongoid

I have a web app currently on Heroku that takes plain text of mostly comma-separated values (or other delimiter-separated values) that a user copies and pastes into a web form; the app then gets the data from each line and saves it to a MongoDB database.
For instance:
45nm, 180, 3
44nm, 180, 3.5
45nm, 90, 7
...
project = Project.first # project embeds_many :simulations
array_of_array_of_data_from_csv.each do |line|
  # e.g. line[0] => 45nm, line[1] => 180, line[2] => 3
  project.simulations.create(:thick => line[0], :ang => line[1], :pro => line[2])
end
For this app's purposes, I can't let the user do any kind of file import; we have to get the data from them through a textarea. And each time, the user can paste up to 30,000 lines. I tried doing that (30,000 data points) on Heroku with some fake data in a console, and it terminated, saying long processes are not supported in the console and to try rake tasks instead.
So I was wondering if anyone knows either why it takes so long to insert 30,000 documents (of course, it may just be that that's the way it is), or knows another way to speedily insert 30,000 documents?
Thanks for your help.
If you are inserting that many documents, you should be doing it as a batch... I routinely insert 200,000-document batches and they get created in a snap!
So, instead of making a loop that "creates" / inserts a new document each time, just have your loop append to an array of documents and then insert that into MongoDB as one big batch.
An example of how to do that with Mongoid can be found in this question.
However, you should keep in mind this might end up being fairly memory intensive (as the whole array of hashes/documents will be in memory as you build it).
Just be careful :)