Friday, January 31, 2014

The fantasy of 99.999% availability

Too many storage companies, cloud vendors, service providers, etc. tout the idea of 99.999% uptime. (Please adjust the number of "9"s to match your favorite vendor.) High storage availability is a good thing, and most of us have experienced the horror of not having access to our data. Can you say "abort, retry, ignore"?

However, everyone knows a chain is only as strong as its weakest link. So what is the weak link with unstructured data? The storage people tend to forget that the business users are the start of the chain. They go through the file system, which in turn goes to actual storage of some kind. These business users want to ensure their business data is always available.

That last sentence provides a key insight into this uptime or availability metric. Users want their business data always available, not just their storage system up. Those are, unfortunately, two different things.

Because permissions are not set on unstructured data stored on a file server (see the previous blog posting on why that is so difficult), it is absolutely trivial for that data to disappear. As one customer put it, "I'm one mouse click away from data loss." It is very easy to drag-and-drop critical files into oblivion without noticing. Unless a person notices the loss and fires up a restore before backup recycles its media and destroys the last good copy of the file, it is pretty much gone. The information must be protected at every point in the chain before it makes much sense to pay for yet another "9".

Is it really that reassuring that storage can, 99.999% of the time, tell you your data is gone?

Wednesday, January 22, 2014

Why LTFS is not enough to bring tape tiering to the masses

"Those who ignore history are doomed to repeat it."

IBM's release of the Linear Tape File System, or LTFS, has made it very easy for just about anyone to set up a file system interface to a tape library. Tape libraries have always held the promise that slow, huge, cheap storage could offload a large portion of the unstructured data currently clogging the servers. The passed-around statistic that "60% to 80% of the data stored hasn't even been looked at in over a year" would seem to make this a market made in heaven for tape libraries. Wow, if we could just move this dead data (is the new term Dark Data now?) off to tape, we could recover huge amounts of online capacity and reap huge storage savings. That would be amazing. Hey, now all we have to do is download LTFS and we are all set!

Hold on. Hmm. Hasn't this been tried before? How well did that work out? Why would this time be different?

Over my (ah, cough...) 3000 years in the storage industry, I've seen many companies attempt such a scheme. Almost all of them were based on some type of file system interface to the storage device. Almost all of them made the above claim. All of them failed to gain much traction, and many of those companies are long since out of business. I've personally been involved in at least half a dozen of these projects. (Does anyone remember Magneto-Optical storage?) Yes, there are some special cases where they work, but if you look closely, they are almost always relegated to a single application and, even worse, a single user. Nearly every computer or file server could benefit from moving its old stuff off to tape. Despite all these efforts and the millions (billions?) of dollars spent, why has it never taken off?

Why?

Without getting into the technical details, these systems have two "Time Bombs" that have killed the products and even the industry a couple of times. 

The first one we call operating system "constipation". If users, applications, or other computers are free to make random accesses to such a system, it is inevitable that certain OS resources will end up tied up waiting on the library, leaving nothing for anything else. The operating system simply locks up. You often can't even log in to kill the offending processes, because the system is locked up. Often the only way out is to physically pull the plug. The funny thing is that this looks much like a hardware failure. I've seen users send their systems in for repair only to find out it was the library software that killed it. Too many of these episodes and the product gets rolled back onto the loading dock.

The second, and related, problem has to do with performance. We call this the Century Problem. Even the LTFS documentation warns about "poor performance" if certain operations are attempted. The reality is that it is amazingly easy to access a library in such a way that it will literally take 100 years for the process to finish!
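To get a feel for how easy it is to cross into absurd territory, here is a rough back-of-the-envelope calculation. The numbers are purely illustrative assumptions, not measurements of any particular library or LTFS release:

    # Rough worst-case estimate for a "random access" workload against a tape
    # library exposed as a file system. All numbers are illustrative assumptions.

    files_to_read   = 10_000_000   # a modest archive of ten million small files
    mount_seek_secs = 90           # assumed cost to mount a cartridge and seek to one file
    drives          = 4            # assumed number of tape drives

    # A naive process (a virus scanner, a desktop search indexer, a recursive
    # directory walk) that touches every file in random order can pay the full
    # mount/seek penalty on each access, because the files are scattered
    # across cartridges.
    total_seconds = files_to_read * mount_seek_secs / drives
    years = total_seconds / (60 * 60 * 24 * 365)
    print(f"One pass over the archive: roughly {years:.0f} years")
    # With these assumptions, about 7 years for a single pass. Grow the file
    # count or shrink the drive count a little and a full century is easy to reach.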

There is an entire technical dissertation that could be written on why this occurs, but here is the bottom line:

The last thing you should do is expose a tape library directly as a file system!

Does this mean all is lost? No. An information asset management system can solve this problem, making a tape library (or any other slow, cheap, huge storage device) available to everyone while avoiding both time bombs.

More to come on this one....

One reason why backup is so bad.

Here is a fun question to ask just about any computer user....

"You backup your computer, right?  If a file is accidentally deleted, what does your backup do?"

The answer you will most likely get is, "ah, nothing."  

Then you can ask them,

"What does backup do if you don't notice a file is lost soon enough?"

The experienced users will say, "It will eventually throw away my only good copy of the data!"

Is this a good thing? Is this the state of the art of backup? Yep. Sad to say, when something changes on a file system, it could be either a valid change or a corruption, but backup has no way to tell the difference. It simply ignores the question and keeps making copies of the file system.

Every backup system will eventually need to recycle the capacity it uses (no one has infinite capacity), so if a user doesn't notice a problem in time, backup will, by design, destroy the last valid copy of the data.
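A toy sketch of that arithmetic, with an assumed 30-day retention window (both numbers are made up; the principle holds for any finite retention):

    # The retention-window problem in miniature. Both numbers are assumptions.
    retention_days          = 30   # how long backup keeps old versions
    days_until_user_notices = 45   # how long the accidental deletion goes unnoticed

    if days_until_user_notices > retention_days:
        print("Backup has already recycled the last good copy. The file is gone.")
    else:
        print("A restore is still possible -- if someone remembers to run one.")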

This is how all backup systems work today, and it is nowhere near what customers want. (Yes, even the snapshot-based ones, and don't even get me started on Continuous Data Protection...)

Even small computers have hundreds of thousands of files on them. Do you mean to tell me that the entire backup industry relies on the hope that some human will miraculously notice a problem, soon enough, and manually start a restore to get the file back? Yeah, right...

What information users really want is a system that is smart enough to immediately save a copy of the asset when a valid change is made, but also immediately restore it from the "backup" if a corruption or deletion happens. 

I can tell you from personal experience that this feature of Information Asset Management is the most fun to demonstrate. We have nicknamed it the "fat finger" demo. We show the user that, even bypassing all the security, one can still delete the file. However, in just a second or two, the file automatically reappears. (We also log the fact that the file had to be restored and notify someone that a restore happened.)
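For the curious, here is a minimal sketch of the idea behind that demo. It is not the actual product code; it simply polls a watched folder and copies back anything that goes missing from a protected copy. The folder paths and the one-second poll interval are made-up assumptions:

    # Minimal "fat finger" guard: if a file disappears from the watched folder,
    # restore it from a protected copy and log the event. Concept sketch only;
    # the paths and poll interval below are arbitrary assumptions.
    import shutil
    import time
    from pathlib import Path

    WATCHED   = Path("/data/contracts/active")     # where users work (assumed path)
    PROTECTED = Path("/protect/contracts/active")  # protected copies (assumed path)

    def guard_once() -> None:
        for copy in PROTECTED.iterdir():
            if not copy.is_file():
                continue
            original = WATCHED / copy.name
            if not original.exists():
                shutil.copy2(copy, original)   # put the missing file back
                print(f"RESTORED {original} -- and someone should be told about it")

    if __name__ == "__main__":
        while True:
            guard_once()
            time.sleep(1)   # check roughly once per second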

This is just one example of the power of Information Asset Management.  This problem has been a part of every backup product since the invention of backup products.  Isn't it about time someone fixed it?

Tuesday, January 21, 2014

Securing Unstructured Data? Like propping the door open with a rock!

The amount of time and effort that goes into creating information assets, not to mention the consequences of having them trashed, lost, or otherwise compromised, would seem to warrant some basic effort in securing them.

Oh well.  Let's see what actually happens.

Information assets are stored on file systems. Sharing these assets tends to be important, so they are put on some type of network server with a modern file system. Nearly every file system in use today has ways to limit access and modification rights to files. The Windows NTFS file system has complex access control lists (ACLs) with some very impressive capabilities.

The sad truth is that these are almost never set correctly. In practice, no one manages to set them in any mode other than letting pretty much everyone do anything to any file. It is simply too difficult to keep a file read-only until someone needs to change it, switch it to read/write until the update is done, and then set it back to read-only. Instead, the permissions are set so anyone can change or delete any of the files at any time. This is like installing an electronic combination door lock with retinal scanning, finding it too much of a hassle, and propping the door open with a rock! What could possibly go wrong?
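To see why the rock wins, here is a sketch of what the "correct" routine looks like for a single file on a POSIX-style system (NTFS ACLs require even more ceremony). The file name is just an example:

    # The "do it right" permission dance for ONE file on a POSIX-style system.
    # The path is an example; imagine repeating this for every file and every edit.
    import os
    import stat

    path = "contracts/acme_2014.docx"

    # Normal state: read-only for everyone.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # Someone needs to edit it: grant the owner write access...
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)

    # ...let them make the change...
    with open(path, "a") as f:
        f.write("\nAmendment 3: ...")

    # ...and remember to lock it down again afterwards.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # Multiply that by hundreds of thousands of files and every user who
    # touches them, and it is easy to see why everyone props the door open.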

Does anyone else see the disasters waiting to happen?  This is the reason why an errant mouse click can corrupt or even destroy the very information the company spent so much time and money creating! 

Isn't it a sad state of affairs that the very basic, fundamental, first line of defense for protecting assets isn't done?  No wonder people make so many copies of stuff.  They can't be sure there is a good copy anywhere!

OK, what about backup? We do backups, right? We can restore, can't we? Wait 'til you hear this...

Test your knowledge of storage.

"Take out a clean sheet of paper." 

That statement can strike fear in the hearts of students everywhere. 

Well, how about just a single question?  That shouldn't be too bad.  I'll make it multiple choice as well.  How hard can that be?  OK, here is the question....

A train is traveling west at 50 mph with 20 passengers...  Oh, hold on, wrong question.  Sorry!

Let's say we want to add the capability to perform some type of storage management function on a file. There are lots of functions to choose from, but let's pick archiving just for fun. There is a bit that needs to be set to indicate the file should be archived. Got it? Great.

Now, here is the question:

Which one is harder?

A.  Setting the bit.     (If it is set, the file will be archived)

B.  Storing the bit.     (Saving it someplace)

C.  Interpreting the bit.  (Actually doing the archiving function)



Think about this for a minute.....














Come on, a little longer than that!














OK, did you pick answer C?  Most people think that actually performing the storage management function, such as archiving in this case, is the hardest one to do. 

WRONG!

The correct answer is A.  It turns out that setting the bit is astonishingly difficult.  In many cases, it is pretty much impossible. 

The problem is that knowing if the bit should be set and when it should be set (and when it should NOT be set) is hugely complicated. 

First of all, making this decision at the file level means that hundreds of thousands, or even millions, of files require hundreds of thousands or millions of individual decisions. When are all those decisions made? And once (or if) they are made, what happens when things change?

The storage industry's current strategy is to pass the buck to its customers. Vendors ship a GUI or some type of policy engine, but these don't really help. The problem is that no one has the RIGHT information.

It turns out that the information needed to properly set the bit is the business context. It is only when the business context changes, say to a "closed" state or something similar, that the bit can be set with confidence. Without the business context, people are just guessing.
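To make the contrast concrete, here is a sketch of the two approaches. The container states, ages, and thresholds are made-up examples, not anyone's shipping policy engine:

    # Two ways to decide whether the "archive me" bit gets set.
    # The states and thresholds below are made-up examples.
    from datetime import datetime, timedelta

    def guess_from_file_metadata(last_access: datetime) -> bool:
        # What a generic policy engine can do: guess from file age alone.
        # It has no idea whether the contract is still being negotiated.
        return datetime.now() - last_access > timedelta(days=180)

    def decide_from_business_context(container_state: str) -> bool:
        # With business context, the decision falls out of where the asset
        # is in its lifecycle. No guessing required.
        return container_state in ("closed", "rejected")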

The next posting will show how this inability to set up the configuration for storage management has wreaked havoc on unstructured data. It also explains why backup is so bad.


The Container - A collection of Assets

Users never have just one Information Asset. The next concept that needs to be introduced is the idea of a container. (Yeah, concepts are boring, but hang with me for a minute.) A container is a collection of information assets with the same business context. The business context, roughly, is the point an asset has reached within its designated life cycle.

Extending our contract example from a previous post, one container could hold all the active contracts, another could hold those that have been closed, while yet another could hold those that have been rejected.

OK, this is all well and good, but why set things up this way? Why organize around the lifecycle of the assets? The key reason is that the management of the information, how it should be stored, backed up, protected, secured, etc., is all based on the business context, i.e., where the assets are within the lifecycle. Keeping assets with the same context together makes it much easier to set up and manage the various attributes of the information.

For example, active contracts could benefit from fast storage. Rejected contracts can be kept for six months on very cheap storage and then deleted. Customers may want to access their active contracts via a web portal. The list could go on, but the power of the "container" is the ability to specify, implement, and maintain these various aspects of the information. By setting them up on a container, the configurations don't need to be applied to every file; they apply to whole collections of assets, and those collections can hold hundreds of thousands of files, depending on the activity of the asset.
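As a sketch of what "setting it up on the container" might look like, here is a tiny example. The field names and values are illustrative assumptions, not a real product configuration:

    # Per-container policy instead of per-file settings. Field names and
    # values are assumptions chosen to match the contract example above.
    from dataclasses import dataclass

    @dataclass
    class ContainerPolicy:
        storage_tier: str        # e.g. "fast-disk", "cheap-disk", "tape"
        retention_days: int      # 0 means keep indefinitely
        backed_up: bool
        web_portal_access: bool

    POLICIES = {
        "active contracts":   ContainerPolicy("fast-disk",  0,   True,  True),
        "closed contracts":   ContainerPolicy("tape",       0,   True,  False),
        "rejected contracts": ContainerPolicy("cheap-disk", 180, False, False),
    }

    # Every asset placed in a container simply inherits that container's policy;
    # nothing has to be configured file by file.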

The idea of a container is the cornerstone of solving the storage management portion of the unstructured data problem.  The idea for this came from the shipping industry...

Prior to about 1956, the loading and unloading of ships was a very manual operation. Old movies and photos show longshoremen moving cargo nets full of dry goods and odd-shaped packages. It could take weeks to unload and reload a ship. An enterprising guy named Malcom McLean figured out that if you put things into standardized shipping containers and then move those containers by ship, train, and truck, the productivity of the shipping industry would be vastly improved. Keeping track of a shipping container is much easier and less costly than tracking every item inside it.

The same goes for storage. "Putting all the assets in a container" makes managing that entire collection much easier. There will be much more about this, but the concept of managing data this way has a huge impact on the cost, complexity, and value of storage management.

Sunday, January 19, 2014

What if you managed your information assets as if they were money?



For a more satirical look at the problem, imagine that all the information assets in your company were actual money. In a sense they are, because it took money to generate those assets in the first place. So, how would you control them now? Would you simply let anyone delete or corrupt them? Would you put all your money on a shelf and let anyone in the company "access" it? Would you let your people stuff it into their desks? Take it home? Upload it to some internet sharing site? Hand it off to a contractor? Clearly no business that intends to survive would ever consider managing its financial assets this way. So why manage your information assets this way? Businesses cannot afford to mismanage information assets like that.