Friday, January 31, 2014

The fantasy of 99.999% availability

Too many storage companies, cloud vendors, service providers, etc. tout the idea of 99.999% uptime.  (Please adjust the number of "9"s to match your favorite vendor.)  It is a good idea to have storage availability at a high level.  We have all probably experienced the horror of not having access to our data.  Can you say "Abort, Retry, Ignore"?

However, everyone knows a chain is only as strong as its weakest link.  So, what is the weak link with unstructured data?  The storage people tend to forget that it is the business users who are the start of the chain.  They go through the file system, which in turn goes to actual storage of some kind.  These business users want to ensure their business data is always available.

That last sentence provides a key insight into this uptime or availability metric.  They want their business data always available, not just their storage system.  These, unfortunately, are two different things.

Because the permissions are not set on unstructured data when it is stored on a file server (see the previous blog posting on why that is so difficult), it is absolutely trivial to have that data disappear.  As one customer put it, "I'm one mouse click away from data loss."  It is very easy to simply drag-and-drop critical files out into oblivion without noticing it.  Unless a person notices the loss and fires up a restore before backup overwrites its media and destroys the last good copy of the file, it is pretty much gone.  The information must be protected at all points in the chain before it makes much sense to pay for yet another "9".

Is it really that reassuring that storage can, 99.999% of the time, tell you your data is gone?

Wednesday, January 22, 2014

Why LTFS is not enough to bring tape tiering to the masses

"Those who ignore history are doomed to repeat it."

The release by IBM of the Linear Tape File System, or LTFS, has made it very easy for just about anyone to set up a file system interface to a tape library.  Tape libraries have always held the promise that slow, huge, cheap storage could offload a large portion of the unstructured data currently clogging the servers.  The passed-around statistic that "60% to 80% of the data stored hasn't even been looked at in over a year" would seem to make this a match made in heaven for tape libraries.  Wow, if we could just move this dead data (is the new term Dark Data now?) off to tape, we could recover huge amounts of online capacity, resulting in huge storage savings.  That would be amazing.  Hey, now all we have to do is download LTFS and we are all set!

Hold on.  Hmm.  Hasn't this been tried before?  How well did that work out?  Why is this time different?

Over my (ah, cough...) 3000 years in the storage industry, I've seen many companies attempt such a scheme.  Almost all of them were based upon some type of file system interface to the storage device.  Almost all of them made the above claim.  All of them failed to gain much traction.  Many of these companies are long since out of business.  I've personally been involved in at least half a dozen of these projects.  (Does anyone remember Magneto-Optical storage?)  Why have they all failed to gain much traction?  Yes, there are some special cases where they work, but if you look closely, they are almost always relegated to a single application and, even worse, are only used by a single user.  Nearly every computer or file server could benefit from moving off its old stuff to tape.  Despite all these efforts and the millions (billions?) of dollars spent, why has it never taken off?

Why?

Without getting into the technical details, these systems have two "Time Bombs" that have killed the products and even the industry a couple of times. 

The first one we call operating system "constipation".  If users, applications, or other computers are free to make random accesses to such a system, it is inevitable that certain OS resources will end up being reserved for use by the library, leaving nothing for anything else.  The operating system simply locks up.  You often can't even log in to kill the offending processes because the system is locked up.  Often the only way out is to physically pull the plug from the wall.  The funny thing is that this looks much like a hardware failure.  I've seen users send their systems in for repair only to find out it was the library software that killed them.  Too many of these episodes and the product gets rolled back onto the loading dock.

The second, related problem has to do with performance.  We call this the Century Problem.  Even the LTFS documentation warns about "poor performance" if certain operations are attempted.  The reality is that it turns out to be amazingly easy to access a library in such a way that it will literally take 100 years for the processes to finish!
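To get a feel for how the math reaches a century, here is a back-of-envelope sketch in Python.  Every number in it is an assumption chosen for illustration, not a measurement of any particular library:

```python
# Back-of-envelope: why random, file-at-a-time access through a tape
# library's file system interface can run for roughly a century.
# All numbers below are illustrative assumptions.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

files = 10_000_000        # assumed: files migrated off to the library
accesses_per_file = 2     # assumed: e.g. a scan that stats, then reads, each file
seconds_per_access = 180  # assumed: robot fetch + cartridge load + seek per random I/O

total_seconds = files * accesses_per_file * seconds_per_access
print(f"{total_seconds / SECONDS_PER_YEAR:.0f} years")  # ~114 years
```

Streaming a tape sequentially hides none of these costs; it is the random access pattern that a file system interface invites that blows the time up.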

There is an entire technical dissertation that could be presented as to why this occurs, but here is the bottom line:

The last thing you should do is expose a tape library as a file system!

Does this mean all is lost?  No.  An information asset management system can solve this problem, making a tape library, or other slow, cheap, and huge storage devices, available to everyone while avoiding the time bombs.

More to come on this one....

One reason why backup is so bad.

Here is a fun question to ask just about any computer user....

"You backup your computer, right?  If a file is accidentally deleted, what does your backup do?"

The answer you will most likely get is, "ah, nothing."  

Then you can ask them,

"What does backup do if you don't notice a file is lost soon enough?"

The experienced users will say, "It will eventually throw away my only good copy of the data!"

Is this a good thing?  Is this the state of the art of backup?  Yep.  It is sad to say, but when something changes on a file system, it could be either a valid change or a corruption, and backup has no way to tell the difference.  It simply ignores the question and keeps making copies of the file system.

Every backup system will eventually need to recycle the capacity it uses (no one has infinite capacity), so it is quite likely that if a user doesn't notice a problem, backup will, by design, destroy the last valid copy of your data.
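A tiny simulation makes the failure mode concrete.  This is a sketch with made-up numbers, not any vendor's actual retention logic:

```python
# Sketch: a fixed retention window silently recycling the last good copy
# of a file that was corrupted without anyone noticing. Numbers are made up.
RETENTION_DAYS = 30                    # assumed media recycling window

backups = []                           # (day, content) pairs
content = "valid data"

for day in range(1, 61):
    if day == 10:
        content = "corrupted"          # silent corruption; backup can't tell
    backups.append((day, content))     # the nightly backup keeps copying it
    # recycle media that has aged out of the retention window
    backups = [(d, c) for d, c in backups if day - d < RETENTION_DAYS]

good_days = [d for d, c in backups if c == "valid data"]
print(good_days or "last valid copy destroyed")   # empty by day 39
```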

This is how all backup systems work today, and it is nowhere near what customers want. (Yes, even the snapshot-based backups, and don't even get me started about Continuous Data Protection...)

Even small computers have hundreds of thousands of files on them.  Do you mean to tell me that the entire backup industry relies on the hope that some human will miraculously notice a problem, and soon enough to manually start the restore and get the file back?  Yeah, right...

What information users really want is a system that is smart enough to immediately save a copy of the asset when a valid change is made, but also immediately restore it from the "backup" if a corruption or deletion happens. 

I can tell you from personal experience that this feature of Information Asset Management is the most fun to demonstrate.  We have nicknamed the demo the "fat finger" demo.  We show the user that, even bypassing all the security, one can still delete the file.  However, in just a second or two, the file automatically reappears.  (We also log the fact that the file had to be restored and tell someone that a restore happened.)
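For the curious, the behavior is easy to approximate on a small scale.  The sketch below is not our product, just a toy illustration using the third-party Python watchdog package, and it assumes a trusted replica of the share already exists under BACKUP_DIR (both paths are hypothetical):

```python
# Toy "fat finger" auto-restore: watch a directory and put deleted files
# back from an assumed trusted replica. Requires: pip install watchdog
import shutil
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCHED_DIR = Path("/srv/contracts")   # hypothetical protected share
BACKUP_DIR = Path("/srv/.replica")     # hypothetical trusted copy

class AutoRestore(FileSystemEventHandler):
    def on_deleted(self, event):
        if event.is_directory:
            return
        rel = Path(event.src_path).relative_to(WATCHED_DIR)
        replica = BACKUP_DIR / rel
        if replica.exists():
            shutil.copy2(replica, event.src_path)   # put the file back
            print(f"restored {rel}; a real system would log and notify here")

observer = Observer()
observer.schedule(AutoRestore(), str(WATCHED_DIR), recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```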

This is just one example of the power of Information Asset Management.  This problem has been a part of every backup product since the invention of backup products.  Isn't it about time someone fixed it?

Tuesday, January 21, 2014

Securing Unstructured Data? Like propping the door open with a rock!

The amount of time and effort that goes into creating information assets, not to mention the consequences of having them trashed, lost, or somehow compromised, would seem to warrant some basic effort in securing them.

Oh well.  Let's see what actually happens.

Information assets are stored on file systems.  Sharing these assets tends to be important so they are put on some type of network server in a modern file system.  Nearly every file system in use today has ways to limit access and modification rights to the files.  The Windows NTFS file system has complex access control lists (ACLs) that have some very impressive capabilities. 

The sad truth is that these are almost never set correctly.  Almost no one manages to run in any mode other than letting pretty much everyone do anything to any file.  It is simply too difficult to keep a file read-only until someone wants to change it, set it to read/write until the update is done, then set it back to read-only.  Instead, all the permissions are set so anyone can change or delete any of the files at any time.  This is like installing an electronic combination door lock with retinal scanning, finding it too much of a hassle, and propping the door open with a rock!  What could possibly go wrong?
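To be concrete about what "too difficult" means, here is a sketch of the check-out/check-in discipline almost nobody follows, using POSIX permission bits as a stand-in for the NTFS ACLs discussed above (the file name is hypothetical):

```python
# Sketch: keep a file read-only by default and open it for writing
# only for the duration of an edit. POSIX bits stand in for NTFS ACLs.
import os
import stat
from contextlib import contextmanager

READ_ONLY = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
READ_WRITE = READ_ONLY | stat.S_IWUSR

@contextmanager
def checked_out(path):
    os.chmod(path, READ_WRITE)      # unlock for the update
    try:
        yield path
    finally:
        os.chmod(path, READ_ONLY)   # lock it again, no matter what

# Hypothetical usage:
# with checked_out("contract.docx"):
#     ...edit the file...
```

Every one of those permission changes is a step someone has to remember (or automate); multiply it by every file and every user, and the rock starts to look attractive.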

Does anyone else see the disasters waiting to happen?  This is the reason why an errant mouse click can corrupt or even destroy the very information the company spent so much time and money creating! 

Isn't it a sad state of affairs that the very basic, fundamental, first line of defense for protecting assets isn't done?  No wonder people make so many copies of stuff.  They can't be sure there is a good copy anywhere!

OK, what about backup?  We do backups right?  We can restore it can't we?   Wait 'til you hear this...

Test your knowledge of storage.

"Take out a clean sheet of paper." 

That statement can strike fear in the hearts of students everywhere. 

Well, how about just a single question?  That shouldn't be too bad.  I'll make it multiple choice as well.  How hard can that be?  OK, here is the question....

A train is traveling west at 50 mph with 20 passengers...  Oh, hold on, wrong question.  Sorry!

Let's say we want to add the capability to perform some type of storage management function on a file.  There are lots of functions to choose from, but let's pick archiving just for fun.  There is a bit that needs to be set to indicate the file should be archived.  Got it?  Great.

Now, here is the question:

Which one is harder?

A.  Setting the bit.     (If it is set, the file will be archived)

B.  Storing the bit.     (Saving it someplace)

C.  Interpreting the bit.  (Actually doing the archiving function)



Think about this for a minute.....














Come on, a little longer than that!














OK, did you pick answer C?  Most people think that actually performing the storage management function, such as archiving in this case, is the hardest one to do. 

WRONG!

The correct answer is A.  It turns out that setting the bit is astonishingly difficult.  In many cases, it is pretty much impossible. 

The problem is that knowing if the bit should be set and when it should be set (and when it should NOT be set) is hugely complicated. 

First of all, having to make this decision at the file level means that hundreds of thousands or even millions of files require hundreds of thousands or millions of individual decisions.  When are all those decisions made?  Once (or if) they are made, what if things change?

The current strategy of the storage industry is to pass the buck to its customers.  They ship a GUI or some type of policy engine that doesn't really help.  The problem is that no one has the RIGHT information.

It turns out that the information needed to properly set the bit is the business context.  It is when the business context changes, say to the "closed" state or something similar, that the bit can be set.  Without the business context, people are just guessing.
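Here is a tiny sketch of the point, using the contract example from earlier posts.  The lifecycle states and field names are hypothetical; what matters is that with the business context the decision is a lookup, and without it, it's a guess:

```python
# Sketch: setting the archive bit is trivial once the business context is known.
ARCHIVE_WHEN = {"closed", "rejected"}   # assumed lifecycle states that warrant archiving

def should_archive(state: str) -> bool:
    return state in ARCHIVE_WHEN        # a lookup, not a guess

def on_state_change(asset: dict, new_state: str) -> None:
    asset["state"] = new_state
    asset["archive_bit"] = should_archive(new_state)

contract = {"name": "ACME-2014-001", "state": "active", "archive_bit": False}
on_state_change(contract, "closed")
print(contract["archive_bit"])          # True -- the bit sets itself
```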

The next posting is going to show how this inability to set up the configuration for storage management has wreaked havoc on unstructured data.  It can also explain why backup is so bad.


The Container - A Collection of Assets

Users never have just one Information Asset.  The next concept that needs to be introduced is the idea of a container.  (Yeah, concepts are boring, but hang with me for a minute.)  A container is a collection of information assets with the same business context.  The general idea of a business context is the point an asset has reached within its designated life cycle.

Extending our contract example from a previous post, one container could hold all the active contracts, another could hold those that have been closed, while yet another could hold those that have been rejected.

OK, this is all well and good, but why set things up this way?  Why organize around the lifecycle of the assets?  The key reason is that the management of the information, how it should be stored, backed up, protected, secured, etc., is all based on the business context, aka where the assets are within the lifecycle.  Keeping assets together makes it much easier to set up and manage the various attributes of the information.

For example, active contracts could benefit from fast storage.  Rejected contracts can be kept for 6 months on very cheap storage and then deleted.  Customers may want to access their active contracts via a web portal.  The list could go on, but the power of the "container" is the ability to specify, implement, and maintain various aspects of the information asset.  By setting these up on a container, the configurations don't need to be set up on every file.  They can be set up on collections of assets.  Note that these collections can hold hundreds of thousands of files, depending on the activity of the asset.
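As a sketch of what container-level configuration might look like (the policy names and values below are illustrative, not a real product's schema):

```python
# Hypothetical container policies keyed by business context. One rule per
# container covers every asset inside it; no per-file decisions required.
CONTAINER_POLICIES = {
    "contracts/active": {
        "storage_tier": "fast",
        "backup": "continuous",
        "web_portal_access": True,
    },
    "contracts/closed": {
        "storage_tier": "archive",
        "backup": "weekly",
        "web_portal_access": False,
    },
    "contracts/rejected": {
        "storage_tier": "cheap",
        "retain_days": 180,             # keep 6 months, then delete
        "web_portal_access": False,
    },
}

def policy_for(container: str) -> dict:
    # Every asset inherits the policy of the container it lives in.
    return CONTAINER_POLICIES[container]
```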

The idea of a container is the cornerstone of solving the storage management portion of the unstructured data problem.  The idea for this came from the shipping industry...

Prior to about 1956, the loading and unloading of ships was a very manual operation.  Old movies and photos show longshoremen moving cargo nets full of dry goods and odd-shaped packages.  It could take weeks to unload and reload a ship.  An enterprising guy named Malcom McLean figured out that if you put things into standardized shipping containers, then move those containers on ships, trains, and trucks, the productivity of the shipping industry would be vastly improved.  Keeping track of a shipping container is much easier and less costly than tracking every item inside it.

The same goes for storage.  "Putting all the assets in a container" makes managing the entire collection much easier.  There will be much more about this, but the concept of managing data this way has a huge impact on the cost, complexity, and value of storage management.

Sunday, January 19, 2014

What if you managed your information assets as if they were money?



For a more satirical look at the problem, imagine if all the information assets in your company were actual money.  In reality, this is somewhat true, because it took money to generate these assets in the first place.  So, how would you control your assets now?  Would you simply let anyone delete or corrupt them?  Would you put all your money on a shelf and let anyone in the company “access” it?  Would you let your people simply stuff it into their desks?  Take it home?  Move it up to some internet sharing site?  Hand it off to a contractor?  Clearly, no business that hopes to survive would ever consider managing its financial assets this way.  Why would you manage your information assets this way?

Information Assets, the key to unstructured data management



The idea of an information asset is a powerful new concept designed to bridge the gap between how business views information and how computers manage files.  The information asset is an entirely new perspective on data, necessitated by the fact that business users intuitively include much more than just an individual file when referring to their information.  Simple things, like what is actually in the file and who it is for, are often confusingly and inconsistently encoded in the file name or implied by its directory location.  Other information, such as prior revisions, the template that was used to create the file, associated files such as pictures, scanned copies, or other working documents, not to mention the relevant emails, is simply scattered, if recorded at all.  As one customer put it, “We have more controls, tracking, documentation, rules, checks, and management oversight of our petty cash than we do over our information assets.”  The result is that computers are simply not designed to know what an information asset is and, therefore, are completely unable to manage it in any way.

To be clear, information assets are not simply the collection of files on a file server.  The formal definition of an information asset is the set of all data, rules, and procedures that, collectively, represents a concept meaningful to the business.  This set can include not only a collection of files but other important items such as tracking and descriptive information (e.g. the name of the customer), audit logs, emails, supporting documentation, images, etc.  These can range from a few items to very complex assets that might include thousands of files.
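For the programmers in the audience, a minimal sketch of that definition might look like the following.  The fields are illustrative, drawn from the contract example coming up next:

```python
# Minimal sketch: an information asset as a set of data, rules, and
# procedures that collectively mean something to the business.
from dataclasses import dataclass, field

@dataclass
class InformationAsset:
    name: str                                     # e.g. "ACME master services agreement"
    files: list = field(default_factory=list)     # Word doc, PDF, spreadsheets...
    tracking: dict = field(default_factory=dict)  # customer name, account number...
    audit_log: list = field(default_factory=list) # everything that happened to it
    emails: list = field(default_factory=list)    # relevant customer correspondence
    rules: list = field(default_factory=list)     # validation scripts, approval deadlines
```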

This new concept is best explained by example.  Let’s take the common business concept of a contract: a written agreement between two or more parties enforceable by law.  While that’s its definition, what actually constitutes a contract?  Isn’t it just that Word document?  That seemingly simple question points out why today’s computer systems are so unaware and unprepared to manage this type of asset from a business perspective.

An Example: The Composition Of A Contract

This type of asset, at least on the surface, appears to be pretty simple, probably manifesting itself as just an individual file.  However, even a cursory analysis reveals there’s a bit more to it.  The list below defines some of the components that can make up a contract.


1. The Word document
2. The PDF version suitable to send to the client
3. The client name
4. Client contact information
5. The type of contract
6. The value of the contract
7. The template that was used to create the contract
8. The audit log of everything that happened to it
9. Previous versions of the contract
10. Rejected versions
11. The list of people who need to approve it
12. The sales rep who created it
13. The costing spreadsheet used to create sections
14. The scanned signature page
15. The list of people who have not yet approved it
16. The list of people to be notified when approved
17. Proof that people actually approved it
18. Proof that the customer received the copy
19. A cryptographic signature to detect tampering
20. A mirrored copy for protection
21. A copy to put up on the website
22. Logins of users who can access it via the website
23. Copies of important emails from the customer
24. Any related photos, sketches, or drawings
25. References to previous contracts
26. The customer account number
27. A validation script to run against the contract
28. How long people are given to approve or reject it




The list simply names the many things that can make up the contract asset.  When these items are put into action in the real world of business, the complexities get even more interesting.  Oftentimes, managing something is simply the ability to answer some basic questions about the asset.  Where is it?  Is it on the server, on someone’s computer, on someone’s laptop that is traveling right now, or in some email message in my inbox?  Which one of these 5 copies is the right one?  Which one did we send to the customer?  Did they ever get it?  Who was supposed to approve it?  What state is it in?  Has it been signed yet?  Which template was used to create this?  I have the PDF version; where is the real Word file?  Who added this paragraph?  Did anyone scan in the final signature page?  Where did that go?  What about that costing spreadsheet we used to derive the numbers?  The list could continue, but it should be painfully obvious that successful management of this information asset must encompass more than storing a file out on a file share someplace.

As can be seen from the above, a contract is actually a quite complex entity.  Anyone who has been involved in contract negotiations knows that problems managing just a single one of these assets can be costly, often resulting in lost revenue, follow-up opportunities put in jeopardy, employee productivity impacts, management stomach acid, and damage to company reputation and brand.  Contracts are just one example of the thousands of information assets that businesses struggle with every day.

Friday, January 17, 2014

Is the very definition of unstructured data wrong?



Everyone seems to agree unstructured data is a problem.  Digging down into the real causes must start with challenging the assumptions being made about the problem itself.  The first place to start is with the basic definitions.  Just what is unstructured data?  Who says so?  Why?  Who should really define unstructured data?

At first glance, the question of what unstructured data is appears quite simple.  It is just a pile of files.  I’m going to show that such a definition is completely wrong and is at the heart of the problem.  But first, let’s look into the question of who even gets to set the definition.

The storage industry would seem to be a logical choice for having the responsibility of definition.  They are the ones building the actual technology (file systems, object storage, etc.).  Are they the right people, though?  Who else could be candidates?  The IT organizations are the next choice, but they aren’t much different from the storage people.  IT treats unstructured data as a pile of files that have to be stored and backed up.  They are also missing the point.

Well, OK.  Who’s left?  The group of people who should have the most input into the definition of unstructured data are the ones who actually create it in the first place, are responsible for it once created, get judged by it, and get yelled at over it.  These are the business users and managers actually using unstructured data to run their organizations.  Remember them?  They are the ones paying for all this storage in the first place.  Unfortunately, they seem to be the most ignored by the storage industry.  However, if you spend some time talking to these people, watching what they try to do with storage, and really listening to what they are saying, what you learn can be amazingly revealing.  So, what do THEY talk about when they refer to unstructured data?

The first thing you notice talking to business users about managing their unstructured data is that they view it as something much different from files.  Instead, they refer to what we call information assets, not individual files.  We will drill into what these really are in much more detail, but they are essentially business definitions of their information.  They talk about their contracts, purchase orders, final reports, vendor agreements, marketing plans, etc.  They don’t talk files and directories.  Information assets are at a much higher level.  They are collections of files, tracking data, logs, emails, rules, and supporting items like photographs, people lists, spreadsheets, and contact information that, collectively, are meaningful to that business.  This is where the concept of the information asset came from.  These business people work with these assets but are disappointed to find out that all they get to contend with are individual files.

The next thing you notice is that these same business users quickly transition to talking about collections of their assets.  They start to refer to “all approved contracts”, “paid invoices from last year”, or some such descriptions.  They are actually referring to something incredibly powerful for IT people.  These are what we call containers, which are defined as collections of assets with the same business context.  These containers are where storage management functions can be specified, not on the files individually.  If you listen to these business people carefully, they are showing you how to solve one of the big problems of unstructured data management.  You don’t have to decide how to apply functions to every single file separately; there are simply far too many to make that practical.  Applying them at the container level is much simpler, more efficient, and can actually match the business requirements of that information.

One of the more surprising observations from doing these interviews is just how short they can be.  The people in charge of the assets generally know how they work.  They can tell you very quickly what they really need from storage.  Many will ask for advice on the process, different ways of doing it, how others handle it, etc.  They all complain about it, and most will tell a story or two about some disaster in the past.  The key thing is that a discussion with the actual stewards of the asset can very quickly and efficiently define how storage management should operate.  Now IT can provide that level of service for the business.  This is an example of the long-standing, desired but seldom realized objective of better aligning business and IT.

We will drill down into this new definition in much more detail in later postings but what are the ramifications of such a different view of unstructured data?   The rest of this blog is going to be focused on how the information asset model impacts the storage industry but the ramifications are incredibly profound.  It will send shock waves through the storage industry.  Just about every aspect of unstructured data will be changed.  Many of the core values of some companies will be impacted greatly.  Even the economics of how storage management functions are funded, valued, implemented, configured, and deployed will change, in many cases drastically.   Needless to say, this will make storage people “uncomfortable”!

The biggest challenge the new definition presents concerns the very operating system architecture and implementation of unstructured data.  The core file interfaces down in the kernel of today’s operating systems, the file system implementations, the file sharing protocols like CIFS and NFS, etc., are all incorrect.  The very assumptions these were built on so many years ago were wrong!

That is a pretty bold statement.  It is also kind of scary.  If this is right, and it will take a while to convince you it is right, what can we do about it?  It would seem that an architecture so ingrained into today's computers could not possibly be uprooted and replaced.  Does this mean a solution is simply not possible?

The “file infrastructure must change but the infrastructure can’t possibly change” paradox has kept a lot of good people away from attacking the problem.  Even Microsoft isn’t big enough (foolish enough?) to rototill that much code.  What chance does anyone else have of solving this?  Does this mean we are stuck with all the problems and that’s just life?  No.

All is not lost.  We have shown that, with careful systems engineering, information asset support can be added to today's operating systems (both Windows and Linux) without ANY changes to the infrastructure.  Applications, operating systems, file systems, file sharing protocols, drivers, storage: none of it needs to change.

The ramifications of this new business view of unstructured data will be the focus of future posts…