Friday, January 17, 2014

Is the very definition of unstructured data wrong?



Everyone seems to agree unstructured data is a problem.  Digging down into the real causes must start with challenging the assumptions being made about the problem itself.  The first place to start is with the basic definitions.  Just what is unstructured data?  Who says so?  Why?  Who should really define unstructured data?

At first glance, the question of what unstructured data is appears to be quite simple.  It is just a pile of files.  I’m going to show that such a definition is completely wrong and is at the heart of the problem.  But first, let’s look into the question of who even gets to set the definition.

The storage industry would seem to be the logical choice to own the definition.  They are the ones building the actual technology (file systems, object storage, etc.).  Are they the right people, though?  Who else could be candidates?  IT organizations are the next choice, but they aren’t much different from storage people.  IT treats unstructured data as a pile of files that have to be stored and backed up.  They are also missing the point.

Well, OK.  Who’s left?  The people who should have the most input into the definition of unstructured data are the ones who actually create it in the first place, are responsible for it once created, get judged by it, and get yelled at over it.  These are the business users and managers who actually use unstructured data to run their organizations.  Remember them?  They are the ones paying for all this storage in the first place.  Unfortunately, they seem to be the most ignored by the storage industry.  However, if you spend some time talking to these people, watch what they try to do with storage, and really listen to what they are saying, what you learn can be amazingly revealing.  So, what do THEY talk about when they refer to unstructured data?

The first thing you notice talking to business users about managing their unstructured data is that they view it as something much different than files.  Instead, they refer to what we call information assets, not individual files.  We will drill into what these really are in much more detail but they are essentially business definitions of their information.  They talk about their contracts, purchase orders, final reports, vendor agreements, marketing plans, etc.  They don’t talk files and directories.  Information assets are at a much higher level.  They are collections of files, tracking data, logs, emails, rules, supporting items like photographs, people lists, spreadsheets, contact information, etc., that collectively are meaningful to that business.  This is where the concept of the information asset came from.  These business people work with these assets but are disappointed to find out all they get to contend with are individual files.
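To make the distinction concrete, here is a minimal sketch in Python of how an information asset might be modeled as a named group of mixed items.  The class and field names (`AssetItem`, `InformationAsset`, `kind`, `role`), the helper `asset_for_path`, and the sample paths are all hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AssetItem:
    """One piece of an asset: a file, an email, a log, etc. (hypothetical model)."""
    path: str
    role: str  # e.g. "document", "email", "photo"


@dataclass
class InformationAsset:
    """A business-level unit such as a contract or purchase order (hypothetical model)."""
    name: str
    kind: str  # business type, e.g. "contract"
    items: list[AssetItem] = field(default_factory=list)

    def add(self, path: str, role: str) -> None:
        self.items.append(AssetItem(path, role))


def asset_for_path(assets: list[InformationAsset], path: str) -> InformationAsset | None:
    """Return the asset a given file belongs to, or None if it belongs to none."""
    for asset in assets:
        if any(item.path == path for item in asset.items):
            return asset
    return None


# Sample data: one contract made up of two documents and an email (invented paths).
contract = InformationAsset(name="Acme vendor agreement", kind="contract")
contract.add("/shares/legal/acme/agreement_v3.docx", "document")
contract.add("/shares/legal/acme/signature_page.pdf", "document")
contract.add("/mail/export/acme_negotiation.eml", "email")

owner = asset_for_path([contract], "/mail/export/acme_negotiation.eml")
print(owner.name if owner else "no asset")
```

The lookup at the end is the point of the sketch: a plain file listing cannot say which contract an email belongs to, while the asset grouping can.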

The next thing you notice is these same business users quickly transition to talking about collections of their assets.  They start to refer to “all approved contracts”, “paid invoices from last year”, or some such descriptions.  They are actually referring to something incredibly powerful for IT people.  These are what we call containers, which are defined as collections of assets with the same business context.  These containers are where storage management functions can be specified, not on the files individually.  If you listen to these business people carefully, they are showing you how to solve one of the big problems with unstructured data management.  You don’t have to decide how to apply functions to every single file separately.  There are simply far too many to make that practical.  However, applying them at the container level is now much simpler, more efficient, and can actually match the business requirements of that information.
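The container idea can be sketched in the same spirit.  In the Python below, a policy is attached once to a container and then expanded down to the files of every asset that matches the container's business context.  The names (`Policy`, `Asset`, `Container`, `policy_per_file`), the policy fields, and the retention and backup values are assumptions made up for this example, not a description of any real product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """Storage management settings (hypothetical fields and values)."""
    retention_years: int
    backup: str  # e.g. "daily", "weekly"


@dataclass
class Asset:
    """Minimal stand-in for an information asset: business type, status, and its files."""
    name: str
    kind: str
    status: str
    files: list = field(default_factory=list)


@dataclass
class Container:
    """A collection of assets sharing a business context, with one policy for all of them."""
    label: str
    kind: str
    status: str
    policy: Policy

    def holds(self, asset: Asset) -> bool:
        return asset.kind == self.kind and asset.status == self.status


def policy_per_file(assets, containers):
    """Expand container-level policies down to every file of every matching asset."""
    resolved = {}
    for asset in assets:
        for container in containers:
            if container.holds(asset):
                for path in asset.files:
                    resolved[path] = container.policy
                break  # first matching container wins in this sketch
    return resolved


# One policy decision, made at the container level (illustrative values).
approved = Container("all approved contracts", "contract", "approved", Policy(7, "daily"))
assets = [
    Asset("Acme agreement", "contract", "approved", ["acme.docx", "acme_sig.pdf"]),
    Asset("Globex agreement", "contract", "approved", ["globex.docx"]),
    Asset("Initech draft", "contract", "draft", ["initech.docx"]),
]

resolved = policy_per_file(assets, [approved])
print(len(resolved), "files covered by 1 policy decision")
```

The number of decisions a person has to make scales with the number of containers, not the number of files; the draft contract is left alone because it does not share the container's business context.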

One of the more surprising observations when doing these interviews is just how short they can be.  The people in charge of the assets generally know how they work.  They can tell you very quickly what they really need from storage.  Many will ask for advice on the process, different ways of doing it, how others handle it, etc.  They all complain about it, and most will tell a story or two about some disaster in the past.  The key point is that a discussion with the actual stewards of the asset can very quickly and efficiently define how storage management should operate.  IT can then provide that level of service for the business.  This is an example of the long-standing but seldom-realized objective of better aligning business and IT.

We will drill down into this new definition in much more detail in later postings, but what are the ramifications of such a different view of unstructured data?  The rest of this blog is going to be focused on how the information asset model impacts the storage industry, and the ramifications are incredibly profound.  It will send shock waves through the storage industry.  Just about every aspect of unstructured data will be changed.  The core value of some companies will be greatly impacted.  Even the economics of how storage management functions are funded, valued, implemented, configured, and deployed will change, in many cases drastically.  Needless to say, this will make storage people “uncomfortable”!

The biggest challenge the new definition presents concerns the very operating system architecture and implementation of unstructured data.  The core file interfaces down in the kernel of today’s operating systems, the file system implementations, and file sharing protocols like CIFS and NFS are all incorrect.  The very assumptions these were built on so many years ago were actually wrong!

That is a pretty bold statement.   It is also kind of scary.  If this is right, and it will take a while to convince you it is right, what can we do about it?  It would seem that the architecture so ingrained into today’s computers could not possibly be uprooted and replaced.  Does this mean that a solution is simply not possible?

The “file infrastructure must change but the infrastructure can’t possibly change” paradox has kept a lot of good people away from attacking the problem.  Even Microsoft isn’t big enough (foolish enough?) to rototill that much code.  What chance does anyone else have at solving this?  Does this mean we are stuck with all the problems and that’s just life?  No.

All is not lost.  We have shown that with careful systems engineering, adding information asset support to today’s operating systems (both Windows and Linux) can be done without ANY changes to the infrastructure.  No changes are needed to applications, operating systems, file systems, file sharing protocols, drivers, storage, or anything else for that matter.

The ramifications of this new business view of unstructured data will be the focus of future posts…
