Wednesday, June 29, 2005

Migrating data: In search of a lost context

How would you represent the knowledge and expertise that you possess? The answer will vary with each individual. For example, you could write it down in the manner of an encyclopedia, listing terms and what they mean to you. Alternatively, you could create an exhaustive how-to guide that lists a number of activities, with detailed instructions on how someone would go about achieving an end goal.

Knowledge representation has always been a difficult task. For years, researchers in the field of artificial intelligence have struggled to create expert systems built on rules. These rules give the system a detailed set of actions to undertake in response to a given stimulus. The stimulus could be a simple event or a complex situation. For a simple event, such as the press of a key on a keyboard, the system has to process just that one stimulus. With a complex situation, such as the entry of a new customer's data, the system has to process multiple pieces of information and also refer to its historical records to evaluate, for example, the risk associated with accepting the new individual as a customer.
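The difference between a simple event and a complex situation can be sketched in a few lines of Python. This is only an illustration, not how any particular expert system works: the rule table, the stimulus fields (`type`, `key`, `name`) and the risk rule (a customer is referred for review if the historical records show a default under the same name) are all assumptions made up for this example.

```python
# Hypothetical rule-based sketch: each rule pairs a condition with an action.

def is_key_press(stimulus, history):
    return stimulus["type"] == "key_press"

def echo_key(stimulus, history):
    # Simple event: only the stimulus itself is needed.
    return f"insert character {stimulus['key']!r}"

def is_new_customer(stimulus, history):
    return stimulus["type"] == "new_customer"

def assess_customer(stimulus, history):
    # Complex situation: the rule also consults historical records.
    # Assumed rule: any earlier default under the same name means manual review.
    earlier = [rec for rec in history if rec["name"] == stimulus["name"]]
    if any(rec["defaulted"] for rec in earlier):
        return "refer for manual review"
    return "accept customer"

RULES = [
    (is_key_press, echo_key),
    (is_new_customer, assess_customer),
]

def respond(stimulus, history):
    """Return the action of the first rule whose condition matches."""
    for condition, action in RULES:
        if condition(stimulus, history):
            return action(stimulus, history)
    return "no rule matched"

history = [{"name": "a. customer", "defaulted": True}]
print(respond({"type": "key_press", "key": "a"}, history))
print(respond({"type": "new_customer", "name": "a. customer"}, history))
print(respond({"type": "new_customer", "name": "b. customer"}, history))
```

The key press is handled from the stimulus alone, while the customer rule cannot reach a decision without the history.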

Over the last twenty years, database systems have amassed a large amount of data about businesses and their processes. Typically a database, at its core, consists of a data model that seeks to represent all the relevant or meaningful information about a business, its processes, stakeholders, customers and partners. This representation seeks to identify the key entities and their attributes and then map the relationships between them. This is a very complex task and as much an art as a science. Take the simple case of a picture. If the picture were drawn on a single sheet of white paper and consisted of two perpendicular lines, it would be very easy to represent or describe. One could describe it in any of the following ways:
  1. The picture consists of two black lines intersecting each other at right angles on a white piece of paper.
  2. The picture is drawn on an A4 size sheet. It consists of two black lines of length 15 cm each, running parallel to the sides of the sheet. The intersection point is in the middle of the sheet.
  3. The picture is drawn on an A4 size sheet. It consists of two black lines of length 15 cm each, running parallel to the sides of the sheet. The lines intersect each other at a point one-third of the distance from one of their ends. The intersection point is in the middle of the sheet.

As the astute reader has probably noticed, each of these descriptions is valid. However, as one goes down the list, the amount of information present in each description increases. A good data modeler needs to decide which of these descriptions will be adequate for the data model. The more information the description holds, the more questions it can answer later. However, the more detailed the representation, the more space it occupies. Furthermore, it takes more effort and time to create a detailed description (in the case of the picture) or data model (in the case of a database).
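One way to picture the trade-off is to write the three descriptions as progressively richer record types. The sketch below uses Python dataclasses. The class and field names are invented for this illustration and do not come from any real schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class PictureV1:
    # Description 1: two black lines at right angles on white paper.
    line_count: int = 2
    line_color: str = "black"
    paper_color: str = "white"
    angle_degrees: int = 90

@dataclass
class PictureV2(PictureV1):
    # Description 2 adds paper size, line length and where the lines cross.
    paper_size: str = "A4"
    line_length_cm: float = 15.0
    intersection_on_sheet: str = "middle"

@dataclass
class PictureV3(PictureV2):
    # Description 3 adds how far along each line the crossing point lies.
    intersection_fraction_along_line: float = 1 / 3

for model in (PictureV1, PictureV2, PictureV3):
    record = asdict(model())
    print(model.__name__, "stores", len(record), "attributes")
```

Each version answers more questions about the picture, at the cost of more attributes to define, populate and store.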

Once the decision is made on how to represent the picture in the database, everything that is not represented as data in the data model becomes the context. Contextual information that was not captured in description 3 above includes, for example, the texture of the paper, the artist who drew the picture, the time at which the picture was drawn, its age and so forth. Each of these pieces of information could become important at some point in the future. For example, if the picture is put up for auction, the identity of the artist who created it would become very important.

A similar situation is found in database technologies. Quite often, the data that has been stored in a database needs to be utilized at a later point in time for a number of reasons. One such reason is the migration of data from one database to another. Typically the system from which data is being obtained is called the source and the system to which data is being moved is called the target. Moving the data from a source to a target that has the same data model would be a trivial task if no changes were needed. However, more often than not, the target database has a different data model, stricter data integrity requirements, higher data quality needs and so forth. If the target system has a higher data quality requirement, the data from the source system will have to be cleansed before it can be moved into the target system.
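A cleansing step of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the source rows are plain dictionaries with a free-text `author` field, the target expects a non-empty, whitespace-normalized `author_name`, and both field names are hypothetical.

```python
def cleanse_author(raw):
    """Collapse runs of whitespace; return None if nothing is left."""
    cleaned = " ".join((raw or "").split())
    return cleaned or None

def migrate(source_rows):
    """Split source rows into rows fit for the target and rows rejected."""
    migrated, rejected = [], []
    for row in source_rows:
        author = cleanse_author(row.get("author"))
        if author is None:
            # The target's quality rule (assumed here) forbids empty names.
            rejected.append(row)
        else:
            migrated.append({"id": row["id"], "author_name": author})
    return migrated, rejected

source = [
    {"id": 1, "author": "  vivek   pinto "},
    {"id": 2, "author": " "},
]
migrated, rejected = migrate(source)
print(migrated)
print(rejected)
```

Rejected rows are kept rather than dropped, so that someone with knowledge of the original context can repair them by hand.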

Data quality refers to the actual data stored within the data model rather than the data model itself. For example, if the data model stores the name of the author, then the data could be entered into the system as "vivek pinto" or "vivekpinto" or " " or even "vkpinto". Clearly the third entry is empty while the fourth entry is misspelled and thus of inferior quality. However, the second entry could be of poor quality too. In the absence of contextual information such as the first name and last name of the author, it would be very difficult to know that the correct entry should be "vivek pinto" if all one has is entry three or four. To complicate matters further, it would be difficult to know which of the two words in the name is the first name and which is the last name.
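The four author entries above can be run through a small check to show what the data alone reveals and what only context can settle. The function names and category labels are invented for this sketch, and the "context" is assumed to be a separately known first and last name.

```python
def classify(entry):
    """Describe what can be said about a name entry from the data alone."""
    tokens = entry.split()
    if not tokens:
        return "empty"
    if len(tokens) == 1:
        return "single token"
    return "multiple tokens"

def matches_context(entry, first_name, last_name):
    """With context (known first and last name), the entry can be verified."""
    return entry.lower().split() == [first_name.lower(), last_name.lower()]

entries = ["vivek pinto", "vivekpinto", " ", "vkpinto"]
for entry in entries:
    print(repr(entry), "->", classify(entry),
          "| matches context:", matches_context(entry, "vivek", "pinto"))
```

Without the known first and last name, "vivekpinto" and "vkpinto" look equally plausible: both are single tokens, and nothing in the data says which one is wrong.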

Issues similar to the one described above are too numerous to count, and they are real problems that come up during data migration. A few solutions to the problem have been devised, but they are at best limited. More on that later.



(c) 2005 Wonomi Technologies All rights reserved
