Managing Your Digital Content

Last Modified On :   September 12, 2006 5:47 PM PDT

Introduction
By Richard Winterton

People are creating, capturing, and storing digital content in ever-increasing quantities. The content varies and covers a wide range of data, from audio and video files, to still images, e-mails, presentations, and a variety of other document types. There are several reasons for the increasing volume of content, but two reasons stand out:

  • Performance increase — the continuance of Moore's Law
  • Storage — increasing storage capability and decreasing storage costs

Moore's Law roughly states that the number of transistors on an integrated circuit doubles every 18 months and, as a result, integrated circuits continue to see an exponential increase in computing power. Not only is computing power increasing exponentially, but so are mass storage capability and capacity. A terabyte or more of personal storage for an individual's library in the next few years is not only within reach but likely, because of new technology and the decreasing cost of mass storage devices.

The author Samuel Johnson once said, "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it." Albert Einstein expressed a similar idea by saying, "Information is not knowledge." Random and unorganized information of this magnitude is unmanageable and may be useless unless you can find it. In order to find information and make use of the information, it has to be managed.

To put into context how much data a terabyte is, some have estimated that all of the text-based content in the Library of Congress would be approximately one terabyte in size. Just imagine how much less valuable that library would be if there were no rhyme or reason to how its information was stored. Many people, including myself, do not spend the time or have the discipline to organize the digital content in their own personal libraries to make it more useful. Unless you can actually find, organize, and use the content you have, it is useless.

This document takes a different approach to digital content management:

  • First, this paper presents different techniques to take advantage of Moore's Law and Intel's multi-core processors, using the extra processing capability to help organize digital content in the background.
  • Second, it explains and provides a source code example of an information management platform framework that integrates these techniques into a single application that may help organize personal digital content.

Taking Advantage of Multi-Core
With the doubling of the number of transistors every 18 months, Intel is using the new transistor count to add multi-core functionality to its processors. Starting with Hyper-Threading Technology (HT Technology) and now dual-core technology, multi-core technologies have resulted in higher performance processors but not necessarily more functionality. The typical optimization of an application is to make tasks run in parallel as much as possible and execute as fast as possible. It is up to the application and operating system to take advantage of the multiple cores by using "worker threads" to do useful work. When writing this paper, I looked at the applications running on my desktop computer. They did not come close to taking full advantage of the resources available to them. In some circumstances, such as battery-saving modes or minimizing heat dissipation, it is helpful to minimize the resources used; however, this typically isn't the case. The following applications were running:

  • File Explorer
  • A commercially available C++ development environment and separate source code editor
  • Audio player playing a music CD
  • Virus protection software
  • Word processor

The CPU utilization was averaging about three percent. The hard drive access was about two percent. On a dual-core system, both cores were practically idle. Now consider other possibilities: what if, while I was writing this paper, the audio player was searching for audio files that had not been tagged with content-specific information and possibly tagging them with ID3 tag information? What if the File Explorer used multi-threaded capabilities to index specific files to help find content faster and more efficiently? The background uses of the virus protection software and word processing software are straightforward: scanning new and modified files, spell checking, grammar checking, and so on. What if new features were added, such as looking for relevant content on the Internet as you are writing the document? All of these features are interesting and have big potential if they don't interfere with the current job or "foreground" task.

Besides the enhanced use models mentioned above, the task of managing content that is continually being added, modified, and deleted is not easy. For the information to be useful, the index of the content and its associated information needs to be kept up to date. To keep the data about the disk content up to date, informational data needs to be gathered on the fly. In this case, it is incumbent upon the background tasks to limit the resources being used so as not to interfere with the foreground tasks. But what does this mean in actual practice? It may be much more complex than simply limiting CPU utilization. Resources such as disk utilization, disk cache, CPU cache, and network utilization may all need to be considered. Regulating background tasks is not a trivial job, since it is a delicate balancing act between getting the job done as quickly and efficiently as possible and not being intrusive or even noticed. Step one is to regulate the most noticeable resources: CPU, disk, and network utilization.

The remainder of this document will describe an initial implementation of a performance-regulating class that provides an efficient framework to implement this functionality.

Regulating Background Tasks
Gathering information about the platform state, such as CPU utilization, disk utilization, and network activity, in a non-intrusive manner is not a trivial task. Regulating applications that gather content information to help manage large quantities of data becomes significantly more complex as more content is added, deleted, and modified. The more data there is to manage, the more resources are required to keep the content-characterizing information up to date. Some content management applications available on the market today do not take the management of platform resources into consideration, and they significantly interfere with foreground applications. This creates a poor user experience, not only for the application running in the background but for the platform as a whole, since many users do not distinguish background tasks from overall platform performance. Users may simply know that the response time is poor. Multi-core systems may allow applications to offer the user a much better response time, but only if the applications are effectively multi-threaded. Regulating resource utilization for a background task is an important part of effectively multi-threading an application. For example, we could use information from Microsoft's performance monitor, known as "perfmon." Perfmon provides a way to directly monitor system resources through its performance monitoring API set.

Monitoring resources is the first step in regulating background tasks, but if monitoring resources is all a background task does, it is useless. The background task needs to not only monitor the resources it uses, but it also needs to limit the resources being used by the task, giving the foreground tasks top priority. To help applications with this problem, a performance regulating class has been developed to take advantage of perfmon's functionality and extend perfmon's capabilities to help regulate the resources an application takes.

Performance Regulator Class
The Performance Regulator class is a simple class to help get information from the Microsoft performance monitoring class and regulate specific threads. The following is a listing of the class definition. The Performance Regulator class is defined in cmperfreg.h and is available on the Intel® Software Network Forum.

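/* _____________________________________________________

PerformanceRegulator
_____________________________________________________ */

class PerformanceRegulator
{
public:
    PerformanceRegulator();
    ~PerformanceRegulator();

    // Lists the counters that may be regulated.
    int GetCounterNames(vector<string> &CounterNames);
    // Called from a registered thread to throttle it on the given counter.
    int Regulate(string RegulationId, string Counter);
    // Stops background monitoring of the counter for the registered thread.
    int StopRegulating(string RegulationId, string Counter);
    // Registers a thread for a counter; Throttle is 0 (maximum) to 9 (minimal).
    int RegisterThread(string RegulationId, string Counter, int Throttle);
    int Monitor();
    int AddCounterInstance(string sCounterName);
};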

GetCounterNames provides a list of counters that may be regulated by the Performance Regulator class. This method is provided because an application needs to choose what to regulate and how often it will be regulated. The choice will differ depending on the resources available and what the application is doing. The initial list of counters is fairly small but will incorporate more as the project evolves:

  • Processor total percent idle time
  • Physical disk total disk percent time
  • Page file percent usage

The RegisterThread method provides a way for an application to register a specific thread and monitor a specific counter for that thread. The throttle value is a relative number from zero to nine with zero being the maximum throttling and nine being the minimal amount of throttling.

The Regulate method is called from a thread that has been registered to actually do the regulating or throttling of the counter previously registered. The way a thread is regulated is fairly simple with a background thread of the Performance Regulator class monitoring the registered performance counters. If the throttle value is reached, a Boolean variable is set to true, as shown in the first snippet of code. The regulation of the resource is manifested in the application by calling Regulate. The Regulate method should be called from the thread needing to be regulated.

if ( Counter.Value < (10 * (MIN_THROTTLE - Counter.Throttle)) )
    Block = true;
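
To make the threshold concrete, the comparison above can be worked through for a percentage counter such as processor idle time. This is only a sketch: the article does not give the value of MIN_THROTTLE, so the value 10 used here is an assumption, and the helper name ShouldBlock is hypothetical.

```cpp
// Assumption: MIN_THROTTLE is 10 (the article does not state its value), so
// throttle values 0..9 map to idle-time thresholds of 100% down to 10%.
constexpr int MIN_THROTTLE = 10;

// Hypothetical helper mirroring the comparison in the snippet above for a
// percentage counter such as processor idle time.
constexpr bool ShouldBlock(int counterValue, int throttle)
{
    return counterValue < 10 * (MIN_THROTTLE - throttle);
}
```

Under that assumption, a thread registered with a throttle of 9 is blocked only when idle time falls below 10 percent, a throttle of 5 blocks below 50 percent, and a throttle of 0 blocks whenever the processor is less than 100 percent idle.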

The actual regulation is done by simply waiting on a single object for a calculated period of time. The wait-time calculation for a percentage counter is shown below:

if ( MonitoredCounter.Block == true )
    WaitForSingleObject(MonitoredCounter.Semaphore,
        ((MIN_THROTTLE-1) - MonitoredCounter.Throttle) * 100);
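
The same wait can be sketched in portable C++ for readers who want to experiment outside Windows. This is not the code from the project: a std::condition_variable stands in for the Win32 semaphore, MIN_THROTTLE is assumed to be 10, and the names WaitMilliseconds and MonitoredCounterSketch are hypothetical.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Assumption: MIN_THROTTLE is 10 (the article does not state its value).
constexpr int MIN_THROTTLE = 10;

// Wait time in milliseconds for a throttle value in the range 0..9.
constexpr int WaitMilliseconds(int throttle)
{
    return ((MIN_THROTTLE - 1) - throttle) * 100;
}

// Hypothetical portable stand-in for a monitored counter. A condition
// variable plays the role of the Win32 semaphore handle.
struct MonitoredCounterSketch
{
    bool block = false;  // set by the monitoring thread
    int throttle = 9;    // 0 = maximum throttling, 9 = minimal throttling
    std::mutex mutex;
    std::condition_variable released;

    // Called from the worker thread: if the monitor has flagged this counter,
    // pause for the calculated time or until Release() clears the flag.
    void Regulate()
    {
        std::unique_lock<std::mutex> lock(mutex);
        if (block) {
            released.wait_for(
                lock, std::chrono::milliseconds(WaitMilliseconds(throttle)),
                [this] { return !block; });
        }
    }

    // Called from the monitoring thread when the resource is free again.
    void Release()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            block = false;
        }
        released.notify_all();
    }
};
```

With these assumptions, a throttle of 0 pauses the worker for up to 900 milliseconds per call, a throttle of 5 for up to 400 milliseconds, and a throttle of 9 does not wait at all.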

The full implementation may be found in the file, cperfreg.cpp. The StopRegulating method is provided to stop the monitoring thread from monitoring the counter in the background for the particular registered thread.

Content Indexing and Identification
The Content Management Index class provides two basic functions as an example of how to use the Performance Regulator class in the context of a content-indexing application. The following is a list of public methods that are available in the CMIndex class. The complete class information may be found in the file, cmindex.h.

/* _____________________________________________________

CMIndex
_____________________________________________________ */

class CMIndex
{
public:
    CMIndex();
    virtual ~CMIndex();
    int ReadConfigFile(byte *pFullName, byte **hData, int *pSize);
    int ParseConf(vector<string> &Path, int &Method, int &FileType);
    int ProcessFullPathFiles(bool Clear,
                             vector<CONTENT_SIGNATURE> &ContentSignature,
                             string Filter, vector<string> &Files);
    int CRC32(int &Value, byte *pData, int Count);
    int GetHash(vector<CONTENT_SIGNATURE> &ContentSignature);
    int GetFiles(string Directory, string Filter,
                 vector<string> &Files);
    int OpenContent(char FileType, char *pFileName,
                    CU_FILESIZE &cuFileSize);
    void *ReadContent(unsigned long Start, unsigned long Size);
    int WriteContent(unsigned long Start, unsigned long Size, void *pBuf);
    int CloseContent(char FileType);
    int QuickCP(void *pDest, const void *pSrc, unsigned int Cnt);
};

The two basic methods are ProcessFullPathFiles and GetHash:

  • The ProcessFullPathFiles method recursively parses directories for specific file types and creates a fairly unique hash of the given file including the full path and the file name. This hash, a Cyclic Redundancy Code (CRC32), provides a quick "handle" to a file on disk.
  • The GetHash method reads specific file types and creates a fairly unique CRC32 hash of the data content. The intent of this hash is to recognize the content, allowing the application to identify content that is duplicated but may be on disk with a different name or location.
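
The two kinds of hash described in this list can be illustrated with a short sketch. The article does not say which CRC-32 variant CMIndex uses, so the common reflected polynomial 0xEDB88320 is assumed here, and the function names Crc32, PathHandle, and ContentHash are hypothetical rather than taken from cmindex.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (an assumption;
// the variant used by CMIndex is not stated). Passing a previous result as
// `crc` continues the calculation over the next block of data.
std::uint32_t Crc32(const unsigned char *data, std::size_t count,
                    std::uint32_t crc = 0)
{
    crc = ~crc;
    for (std::size_t i = 0; i < count; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            const std::uint32_t mask = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ mask;
        }
    }
    return ~crc;
}

// Hash of the full path and file name: a quick "handle" to a file on disk.
std::uint32_t PathHandle(const std::string &fullPath)
{
    return Crc32(reinterpret_cast<const unsigned char *>(fullPath.data()),
                 fullPath.size());
}

// Hash of the data content: equal values suggest duplicate content even when
// the names or locations differ. CRC-32 values can collide, so a match is a
// hint to compare the files, not proof that they are identical.
std::uint32_t ContentHash(const std::vector<unsigned char> &content)
{
    return Crc32(content.data(), content.size());
}
```

Two copies of the same file stored under different names produce different path handles but the same content hash, which is what lets an application flag them as likely duplicates.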

Although both of these methods seem rudimentary and by themselves provide very little interest, they are an example of two features taking advantage of the Performance Regulator. The first example of recursively parsing directories for specific file types shows how to use the Performance Regulator to regulate disk access. The second is an example of reading content files and creating a unique hash. This process can be fairly CPU-intensive. The GetHash method is a good example of how to regulate CPU utilization by using CPU Idle time information that is provided by the Performance Regulator.

Content Recognition

The two previous examples, file path indexing and content hashing, are designed to show how to use the Performance Regulator class to regulate resource utilization. They are simple examples of how to use the Performance Regulator. In subsequent articles on advanced content management, two more examples will be provided to show how to use the Performance Regulator class and integrate all of these features into an information management platform framework. One of the new examples will be an audio content recognition capability, able to recognize an audio file's content independent of its encoded format. This feature is very CPU-intensive and requires significant resources during the file decoding and key generation stages used for audio content recognition. The second example will be a still image recognition capability that recognizes still images as the same image even though they have been modified slightly, such as by cropping, tinting, changing hue, or file compression. But how do these examples fit into an information management platform framework?

Information Management Platform Framework Architecture
The Information Management Platform Framework was created with two purposes in mind. The first is to provide a working example of some of the features described in this article. The second is to develop a framework to build upon for incorporating different examples and techniques for content management in the future. Although not all the functionality is available in its current released form, updates will soon be made available. Anyone who wishes to contribute to this framework may do so by downloading the current source code, and sending any modifications or feedback to the content management forum or to richard.winterton@intel.com. Any of the source code submitted to or used from this project may be freely used and distributed as examples or within a product with no warranty for completeness or correctness. The framework consists of three basic files: the framework executable, the framework loader, and the framework configuration file.

The Information Management Platform Framework (IMPF) is designed to be a simple, portable framework that enables different types of plug-ins or modules. Each of the plug-ins is an independent stand-alone module. They are loaded by the platform framework loader and reside in the same process space as the information platform management executable. Modules must conform to the following requirements in order to fit into the framework:

  • Modules need to be defined in the IMPF configuration file (impfconfig.xml)
  • Modules must have an exported function that conforms to the IMPF Initialize function requirements
  • Modules that enable the menu option should have an Execute method that will be activated when the menu item is selected

Information Platform Management Framework

The Information Platform Management Framework (impf.exe) is a simple Windows application that creates a tray icon which, when clicked, displays a menu of items associated with the plug-ins that are loaded by the information platform loader. When an item in the menu is selected, a method from a base class of the appropriate plug-in is executed. All other methods or interfaces are plug-in specific.

Information Platform Management Loader

The Information Platform Management Loader (impl.dll) is just that, a plug-in loader. The loader will read a configuration file and load designated modules in the file. After the module is loaded, an Initialize function will be called. This gives the loaded module the chance to do any initialization outside of the standard OS initialization that is taking place in the DllMain (which preferably is not used to perform extensive initialization).

Plug-in Initialize Function

Each plug-in is required to export an Initialize function. The Initialize function is responsible for exporting classes that are used by other plug-ins within the framework. The plug-in is also responsible for providing a unique "FriendlyName" to be used by the framework as well as other plug-ins as an identifier.

typedef struct _LOADED_CLASSES
{
    string FriendlyName;
    string Owner;
    int Instance;
    int Menu;
    int Loaded;
    BaseLoadClass *pBaseClass;
    void *pDerivedClass;
} *PLOADED_CLASSES, LOADED_CLASSES;

int Initialize(vector<PLOADED_CLASSES> *pClasses);

The Initialize function is called by the framework loader at startup time to load the validated plug-in into the framework. The plug-in module is required to provide a "FriendlyName" for the module that can be used by other modules to gain access to it. The Instance value is a unique number for this group of classes in the running process. The Menu flag is used by the framework to expose a menu item from the "framework container" that allows the user to pop up a user interface for configuring a plug-in module. The FriendlyName is the title of the menu item exposed. The Loaded flag is an indication to the framework to load the module at start time. The final member of the LOADED_CLASSES structure is a void pointer to a plug-in defined class that may be derived from a common base class shared with other plug-ins within the framework.
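
A plug-in's Initialize function might look something like the following sketch. Several details are assumptions rather than part of the published framework: BaseLoadClass is not listed in this article, so a minimal version with a single Execute method is defined here; LocalFilePlugin, the "Local file" name, the meaning given to Owner, and the return values are all hypothetical. The structure tag is also renamed to avoid an identifier reserved by the C++ standard.

```cpp
#include <string>
#include <vector>

using std::string;
using std::vector;

// Assumption: a minimal base class with only the Execute method that the
// framework calls when the plug-in's menu item is selected.
class BaseLoadClass
{
public:
    virtual ~BaseLoadClass() = default;
    virtual int Execute() = 0;
};

// Same members as the LOADED_CLASSES structure shown above.
typedef struct LOADED_CLASSES_TAG
{
    string FriendlyName;
    string Owner;
    int Instance;
    int Menu;
    int Loaded;
    BaseLoadClass *pBaseClass;
    void *pDerivedClass;
} *PLOADED_CLASSES, LOADED_CLASSES;

// Hypothetical plug-in class.
class LocalFilePlugin : public BaseLoadClass
{
public:
    int Execute() override { return 0; }  // a real plug-in would show its UI
};

// Sketch of a plug-in's Initialize function. A real module would also mark
// it for export from the DLL or shared object.
int Initialize(vector<PLOADED_CLASSES> *pClasses)
{
    if (pClasses == nullptr)
        return -1;  // assumed error convention

    static LocalFilePlugin plugin;  // lives as long as the module is loaded
    static LOADED_CLASSES entry;
    entry.FriendlyName = "Local file";  // also the menu item title
    entry.Owner = "localcp.dll";        // assumed to name the owning module
    entry.Instance = 1;
    entry.Menu = 1;                     // expose a menu item
    entry.Loaded = 1;                   // load at start time
    entry.pBaseClass = &plugin;
    entry.pDerivedClass = &plugin;
    pClasses->push_back(&entry);
    return 0;
}
```

Keeping the plug-in object and its LOADED_CLASSES entry in static storage means the pointers handed to the framework stay valid for as long as the module remains loaded.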

Information Management Platform Framework Configuration File
The Information Management Platform Framework configuration file is an Extensible Markup Language (XML) file that is used by the framework loader to dynamically load modules that conform to the framework specification. Each of the modules is required to have the following information contained in the configuration file:

  • Version information: for backwards compatibility.
  • Plug-in type: one of three types of plug-ins — discovery, provider, transport.
  • Module name: the actual name of the binary to be loaded.
  • Menu Flag: a flag to indicate to the framework that a user interface is to be accessed from the plug-in.
  • Load Flag: a flag to the framework to identify if the module is to be loaded by default.
  • Hash: a hash for the framework to validate the plug-in with the configuration file and to help prevent unauthorized plug-ins from entering the framework. (Future)
  • Signature: a unique signature for the XML configuration file.

The XML file information is described by the following table.

Tag          | Example                                          | Description
version      | 0.0.1                                            | major.minor.build
plug-in type | provider, transport, discovery                   | framework provider, framework transport, framework discovery plug-in
module       | Local.dll                                        | .dll or .so name
menu         | Local file Menu Name                             | Friendly name of module
load         | 1 for loaded, 0 for not loaded                   |
hplugin      | 32-bit hex number                                | a CRC32 hash of the plug-in
hsignature   | An encrypted signature of the configuration file | A signature of the XML file for versioning and tamper proofing

A sample XML configuration file is shown below:

<?xml version="1.0" encoding="utf-8"?>
<impf>
  <version>0.0.1</version>
  <plugin type="provider">
    <module>localcp.dll</module>
    <version>0.0.1</version>
    <menu>1</menu>
    <load>1</load>
    <hplugin>201bea31</hplugin>
  </plugin>
  <hsignature>f213ea41f213ea41f213ea43 f213ea47
  f213ea41f213ea41f213ea43 f213ea41</hsignature>
</impf>
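
Once a loader has extracted the text of the version and hplugin elements, it needs to turn them into values it can compare. The sketch below shows one way this might be done; the names VersionInfo, ParseVersion, and ParseHexHash are hypothetical and not part of the framework, and the XML parsing itself is left out.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical holder for the "major.minor.build" version text.
struct VersionInfo
{
    int majorPart = 0;
    int minorPart = 0;
    int buildPart = 0;
};

// Parse a version string such as "0.0.1". Returns false, leaving `out`
// untouched, if the text is not three non-negative numbers separated by dots.
bool ParseVersion(const std::string &text, VersionInfo &out)
{
    std::istringstream in(text);
    VersionInfo v;
    char dot1 = 0;
    char dot2 = 0;
    if (!(in >> v.majorPart >> dot1 >> v.minorPart >> dot2 >> v.buildPart))
        return false;
    if (dot1 != '.' || dot2 != '.')
        return false;
    if (v.majorPart < 0 || v.minorPart < 0 || v.buildPart < 0)
        return false;
    if (in.peek() != std::istringstream::traits_type::eof())
        return false;  // trailing characters after the build number
    out = v;
    return true;
}

// Parse a 32-bit hexadecimal value such as the hplugin text "201bea31".
bool ParseHexHash(const std::string &text, std::uint32_t &out)
{
    if (text.empty() || text.size() > 8)
        return false;
    std::uint32_t value = 0;
    for (char c : text) {
        int digit = 0;
        if (c >= '0' && c <= '9')
            digit = c - '0';
        else if (c >= 'a' && c <= 'f')
            digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            digit = c - 'A' + 10;
        else
            return false;
        value = (value << 4) | static_cast<std::uint32_t>(digit);
    }
    out = value;
    return true;
}
```

A loader could then compare the parsed hplugin value with a CRC32 it computes over the plug-in binary and refuse to load the module when the two differ.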


Summary
As predicted by Moore's Law, computing capability continues to increase at an exponential rate. Consumer hard drive capacity also continues to increase at an accelerated rate. In 1982 the storage capacity of a consumer hard drive was about 10 megabytes (MB). Today it is more than 500 gigabytes (GB). However, information is just a bunch of bits if it isn't organized and you can't find what you want, when you want it. In the days of a 10 MB hard drive, it wasn't too difficult to rely on a manual method of organizing the data by using directories and filenames. But with data sizes 50,000 times greater, better ways of organizing content are critical. Intel is committed to continually improving not only the hardware but the platform as a whole. As a part of this effort to improve the platform, the Information Management Platform Framework has been written. This framework includes a couple of content management plug-ins to seed the forum and to generate interest in innovative ways to improve content management. The content management forum is a vehicle to help the industry and the computing communities in general collaborate and share new and innovative ways to manage and organize all types of digital content.
