Searching for a file in a bucket - google-cloud-storage

We are using GCS to store print files that we send to a remote printer. Ideally, we would love to be able to quickly search for files in our GCS bucket the way we do on Windows. For example:
"24x18 little lamb striped blue"
This returns a file in Windows even when the query is an inexact match, for example:
24x18 mary had a little lamb striped background blue text.jpg
In GCS it looks like I have to select the filter "Name contains:" and then enter:
24x18
little lamb
striped
blue
Each of those terms needs to be entered separately. It's a bit time-consuming relative to a native Windows search. Is there a way to get the bucket search working differently? We perform many of these searches daily, so it would be a huge help if we could get this sorted out.
Thanks!
EDIT: I do realize another option is to search:
"24x18*little lamb*striped*blue"
I guess we are just looking to replicate the searches across platforms so files are uniformly found either locally or on GCS.

What does Windows Search do? It scans the files on your computer and indexes them. That is, it creates an entry in its own database with the file information (name, date, type, size, ...).
If you want a similar feature for GCS, you need to build your own index: a database with full-text search capability. MongoDB or Elasticsearch work well for that; PostgreSQL can also do full-text search, though I'm not sure the relevant extension is available on Cloud SQL.
In any case, there is no native feature to index GCS objects and find them quickly; you need to implement that mechanism on your side.
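For the simple case of matching on object names only, a small client-side script can approximate the Windows-style multi-term search by listing the bucket once and filtering locally. A minimal sketch, assuming the google-cloud-storage Python client and a placeholder bucket name:

# List every object name once and keep only the names that contain every
# search term, in any order. "print-files" is a placeholder bucket name.
from google.cloud import storage

def search_bucket(bucket_name: str, query: str) -> list[str]:
    terms = query.lower().split()
    client = storage.Client()
    return [
        blob.name
        for blob in client.list_blobs(bucket_name)
        if all(term in blob.name.lower() for term in terms)
    ]

print(search_bucket("print-files", "24x18 little lamb striped blue"))

For very large buckets, or many searches per day, listing every object each time gets slow; that is where the indexing approach above pays off.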

Related

Best way to report the growth of a file using PowerShell?

I would like to report the database sizes to myself via email every week, compare them to the week before, and display the growth in megabytes and/or %.
I have everything done except the comparison.
Imagine this setup:
SQL Server with 100 databases
There are plenty of ways to do a comparison. I thought about writing the sizes to XML with PowerShell and later reading them out with a second script that reports to me.
Since I am self-taught in PowerShell I might have gaps here, so I am afraid of missing an easy way.
Does anyone have a nice idea of how to compare the sizes?
The report and the calculation I will manage myself later; I just need a good way to do the comparison.
Currently I am on PowerShell 3.0, but I can upgrade to 4.0.
Don't reinvent the wheel. SQL Server already has tools to monitor DB file sizes, and so does Performance Monitor. There are several third-party products available too. Ask your local DBA whether such a system is already in place.
A common practice is to query the server for DB file sizes on, say, a daily basis and store them in a utility DB table with a timestamp. Calculating change volumes, ratios and whatnot can then be done on the T-SQL side. (Not that it is CPU-intensive anyway.)
I would create one CSV file for each database, then write out two columns:
Date,Size
27.08.2014,1024
28.08.2014,1040
29.08.2014,1080
Then you can import the CSV file, sort the rows by date, compare the last two sizes, and send the result by mail.
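To spell out the comparison logic, here is a sketch in Python (the same steps map to Import-Csv, Sort-Object Date and Select-Object -Last 2 in PowerShell); sizes are assumed to be in megabytes, matching the sample rows above:

# Read the per-database CSV, sort by date, compare the last two rows.
import csv
from datetime import datetime

with open("sizes.csv", newline="") as f:
    rows = sorted(
        csv.DictReader(f),
        key=lambda r: datetime.strptime(r["Date"], "%d.%m.%Y"),
    )

previous, latest = rows[-2], rows[-1]
growth_mb = int(latest["Size"]) - int(previous["Size"])
growth_pct = 100.0 * growth_mb / int(previous["Size"])
print(f"{latest['Date']}: {growth_mb} MB ({growth_pct:.1f} %) growth since {previous['Date']}")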

How much data can I store per node in Neo4j?

I need to save big chunks of Unicode text strings in Neo4j nodes. The documentation does not mention anything about the size of data one can store per node.
Does anyone know that?
I just tried the following with the Neo4j web interface:
I wrote a line of 26 characters and duplicated it through 32000 lines, which makes a total of 832000 characters.
I created a node with a property "text", copied my text into it, and it worked perfectly.
I tried again with 64000 lines with white space at the end of each line, for a total of 1728000 characters. I created a new node, then queried the node and copied the result back into a file to check the size (you never know), and wc gave me 1728001 (the extra one is presumably an artifact of the copy/paste).
It didn't seem to complain.
FYI this is equivalent to a text of 345600 words of an average length of 4 plus a space (5 characters), or roughly a book of 1000 pages at 300-350 words per page.
I don't know, however, how this could impact performance if there are many such nodes. If it doesn't work well because of this, you could always use Neo4j just to store information about relationships, with an ID property pointing to another, document-oriented database that holds the text (or simply a path property pointing to a file).
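For what it's worth, the same experiment can be scripted with the official Neo4j Python driver (the connection URI and credentials below are placeholders for a local instance):

# Store a ~1.7 million character string as a node property and read it back.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

big_text = "abcdefghijklmnopqrstuvwxyz\n" * 64000  # 1,728,000 characters

with driver.session() as session:
    session.run("CREATE (d:Doc {text: $text})", text=big_text)
    record = session.run("MATCH (d:Doc) RETURN d.text AS text LIMIT 1").single()
    print(len(record["text"]))  # the full length comes back intact

driver.close()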
Neo4j is by default indexed using Lucene. Lucene was built as a full text search toolbox (with Solr being the de facto search engine implementation). Since Lucene was intended to search over large amounts of text, my suspicion is that you can put as much text into a node as you want and it will work just fine.
Neo4j is a very nice solution for managing relationships between objects. As you may already know, these relationships can have properties, as can the nodes themselves. But I don't think you should store "a big chunk" of data on these nodes. I think Neo4j was intended to be used alongside another database such as MongoDB or even MySQL: you get the information you need first "really fast", then look the rest up in the other engine. On my projects I store usernames, names, dates of birth, IDs, and that kind of information, but not very large text strings.

A custom directory/folder merge tool

I am thinking about developing a custom directory/folder merge tool, partly as a way of learning functional programming and partly to scratch a very personal itch.
I usually work on three different computers and I tend to accumulate lots of files (text, video, audio) locally and then painstakingly merge them for backup purposes. I am pretty sure I have dupes and unwanted files lying around wasting space. I am moving to a cloud backup solution as a secondary backup source and I want to save as much space as possible by eliminating redundant files.
I have a complex, deeply nested directory structure and I want a tool that automatically walks down the folder tree and performs the merge. Another problem is that I use a mix of Linux and Windows, and many of my files have spaces in their names...
My initial thought was that I need to generate hashes for every file and compare using hashes rather than file names (folder names with spaces, as well as file contents, could differ between source and target). Is RIPEMD-160 a good balance between performance and collision avoidance? Or is SHA-1 enough? Is SHA-256/512 overkill?
Which functional programming env comes with a set of ready made libraries for generating these hashes? I am leaning towards OCaml...
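To make the idea concrete, here is a rough sketch of the hash-and-compare step (Python is used purely for illustration; the same approach ports directly to OCaml):

# Walk two trees, hash every file's contents, and report files whose content
# already exists on the other side regardless of name or folder spelling.
# The two root paths are hypothetical.
import hashlib
from pathlib import Path

def hash_file(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so large videos do not have to fit in memory.
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root: Path) -> dict[str, list[Path]]:
    # Map content hash -> list of files with that content under root.
    index: dict[str, list[Path]] = {}
    for p in root.rglob("*"):
        if p.is_file():
            index.setdefault(hash_file(p), []).append(p)
    return index

source = index_tree(Path("/data/laptop"))
target = index_tree(Path("/data/backup"))
for digest, paths in source.items():
    if digest in target:
        print("duplicate content:", paths, "already in", target[digest])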
Check out the Unison file synchronizer.
I don't use it myself, but I have heard quite a few positive reviews. It is mature software built on a solid theoretical foundation.
Also, it is written in OCaml.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz files while others reside in plain text format.
I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as-is.
As for usage, we intend to put a small REST API in front of it and allow people to get a file listing, the latest files, and the files themselves.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at Gluster? It is scalable, provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large and the access pattern is going to be a linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3x replication in 64 MB chunks, in the same format that it's in now.
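A minimal sketch of that approach, simply shelling out to the standard hdfs dfs commands (paths are placeholders; replication factor and block size are configured cluster-side):

# Copy each archived log file into a dated HDFS directory, unchanged.
import subprocess
from pathlib import Path

def put_into_hdfs(local_path: Path, hdfs_dir: str = "/logs/2014/08/27") -> None:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", str(local_path), hdfs_dir], check=True)

for f in Path("/var/log/archive").glob("*.log"):
    put_into_hdfs(f)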
If you are to choose a document database:
On CouchDB you can use the _attachments API to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. You then get a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
Also HDFS is a very nice choice.
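For illustration, a minimal sketch of the CouchDB variant using the standalone attachment API (server URL, credentials and database name are placeholders, and the "logfiles" database is assumed to already exist):

# Store metadata as a document, then attach the raw log file unchanged.
import requests

COUCH = "http://admin:secret@localhost:5984"
DB = "logfiles"

def store_log(doc_id: str, path: str, metadata: dict) -> None:
    # 1. Create a document holding only the metadata used for indexing.
    resp = requests.put(f"{COUCH}/{DB}/{doc_id}", json=metadata)
    resp.raise_for_status()
    rev = resp.json()["rev"]

    # 2. Attach the raw log file as-is; CouchDB stores the bytes untouched.
    with open(path, "rb") as f:
        resp = requests.put(
            f"{COUCH}/{DB}/{doc_id}/logfile",
            params={"rev": rev},
            data=f,
            headers={"Content-Type": "text/plain"},
        )
    resp.raise_for_status()

store_log("web01-2014-08-27", "/var/log/web01/access.log",
          {"host": "web01", "date": "2014-08-27"})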

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory, as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory, because Windows looks at the files in the directory to make sure the name it is creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change, as '8 dot 3' names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
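For reference, a small sketch of making that change programmatically with Python's winreg module (it needs an elevated prompt; the command-line alternative is fsutil behavior set disable8dot3 1):

# Set NtfsDisable8dot3NameCreation = 1 under the FileSystem key. Run elevated.
import winreg

key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Control\FileSystem",
    0,
    winreg.KEY_SET_VALUE,
)
winreg.SetValueEx(key, "NtfsDisable8dot3NameCreation", 0, winreg.REG_DWORD, 1)
winreg.CloseKey(key)
# Reboot afterwards for the setting to take effect.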
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is to move the files into folders with names based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
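A tiny sketch of that layout (a hypothetical helper; it assumes two levels of three characters each, matching the example above, and pads short names on the left as suggested):

# Map a file name to its nested folder, e.g. ABCDEFGHI.db -> ABC/DEF/ABCDEFGHI.db
from pathlib import Path

def nested_path(root: Path, filename: str, levels: int = 2, width: int = 3) -> Path:
    stem = Path(filename).stem.rjust(levels * width, "0")   # pad short names
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, filename)

print(nested_path(Path("data"), "ABCDEFGHI.db"))   # data/ABC/DEF/ABCDEFGHI.db
print(nested_path(Path("data"), "ABCEFGHIJ.db"))   # data/ABC/EFG/ABCEFGHIJ.db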
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR-like fashion with simple delimiters for parsing. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle the small files very well and the drive was better able to cache a 1 MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
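For illustration, a rough sketch of that pack-per-day idea (it uses a small offset index instead of delimiters, which avoids scanning; file names and paths are made up):

# Append each 64k reading to one file per day and keep an offset index beside
# it, so individual readings can still be pulled out on demand.
import json
import os

def append_reading(day_file: str, name: str, payload: bytes) -> None:
    index_file = day_file + ".idx"
    if os.path.exists(index_file):
        with open(index_file) as f:
            index = json.load(f)
    else:
        index = {}
    with open(day_file, "ab") as out:
        offset = out.tell()
        out.write(payload)
    index[name] = {"offset": offset, "length": len(payload)}
    with open(index_file, "w") as f:
        json.dump(index, f)

def read_reading(day_file: str, name: str) -> bytes:
    with open(day_file + ".idx") as f:
        entry = json.load(f)[name]
    with open(day_file, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])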
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally (and again, this requires you to be able to calculate names), you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories...
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. Then, in the background, the application actually takes these files and combines them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), to get rid of the mess you have.
That way you can still make it easy for the researchers to access the files they want, while keeping more control over how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few characters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
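Reading the scheme off the examples, one way to implement it is: strip the extension, drop the first two characters, reverse the rest, and make one directory per remaining character (a sketch, assuming numeric names as above):

# Files in the same leaf directory then differ only in their first two
# characters, so a directory never holds more than 100 numeric names.
from pathlib import Path

def bucketed_path(root: Path, filename: str) -> Path:
    stem = Path(filename).stem
    dirs = list(reversed(stem[2:]))          # empty for 1- or 2-character names
    return root.joinpath(*dirs, filename)

for name in ["1.xml", "24.xml", "12331.xml", "2304252.xml"]:
    print(bucketed_path(Path("data"), name))
# data/1.xml
# data/24.xml
# data/1/3/3/12331.xml
# data/2/5/2/4/0/2304252.xml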
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under Cygwin or MinGW) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, c:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
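A sketch of what the scheduled task could run (the paths follow the c:\Readings example above and are otherwise placeholders):

# Move the full day's folder aside, then recreate an empty one for new readings.
import datetime
import os

readings = r"C:\Readings"
stamp = datetime.date.today().strftime("%Y-%m-%d")

os.makedirs(r"C:\Archive", exist_ok=True)
os.rename(readings, rf"C:\Archive\Readings-{stamp}")
os.makedirs(readings)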
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )