Nov 8, 2009

At last - someone we understand ...


It is rare with all the background reading and research that I have to do that I read a contemporary book and say "Yes, this guy gets it" - notice that I say contemporary. There are plenty of good books about the day-to-day business intelligence problems that I see which were written over 20 years ago (The Mythical Man Month by Frederick P Brooks [1975] and Peopleware: Productive Projects & Teams by Tom DeMarco and Timothy Lister [1987] to name but two). So it was with immense pleasure that I read the opening chapters of Netezza Underground: The Unauthorized Tales of Derring-do and Adventures in Resilient Data Warehousing Solutions by David Birmingham [2008]. This book describes, in a hugely readable way, many of the same concepts and ideas that I have been espousing for the last 15 years about how to deliver a successsful data warehouse solution.

David describes the building of data warehouses as the need to think of terabytes, not transactional systems and a discussion of very-large-scale data systems. He says that the rules are different here and yet oddly the same:
  1. Everything is requirements driven
  2. Simplify and clarify
  3. Use correctly powered and scalable systems
  4. Governance
  5. Data management
  6. Strong architectural approach
  7. Building the environment with the expectation of change
  8. Testability
  9. Go Parallel
  10. Never do bulk inside a traditional RDBMS
I would love to drill into each of these in more detail but doing so would simply reprise an excellent book and probably breach copyright! And if the mention of Netezza as a vendor puts you off, then read the book anyway, because whilst the author is an avid enthusiast of the product (and some reviewers dislike the consequentially irreverent style), the approach, techniques and philosophy he describes provide much of the same advice that we have been giving our customers for so long and that can be applied to other technical solutions. Of course if you are interested in implementing a system we are here to help too!
Share/Save/Bookmark

Nov 4, 2009

Data Warehousing Appliances – Fad or Future?

This article was originally written for Conspectus Magazine in December 2006 and has been updated in November 2009 by the original author

Despite all the hype from vendors the basics of data warehousing have remained fundamentally unchanged – extract data from multiple source systems, reformat the information into an easy to query structure, load it into a dedicated database and add an effective user interface to allow users to query the information. The cost of this environment is substantial and directly relates to the complexity of the Extract, Transform and Load (ETL) process and the volume of data held in the system.

The complexity of the ETL process has two cost impacts: the first is in the cost of the initial design and development and is reasonably well understood. The second is the cost of changes over the lifetime of the system, for example if an organisation have four source systems and each system under goes a change once a quarter then the data warehouse support team have to modify and test an interface every three weeks, and all this without any changes in the users requirements. The volume of data also hits the bottom line, not only in the cost of storage but in the size and (more expensive) skills of team required to support it, especially as data explosion forces the business to enter the very large database arena where load time and user query performance are critical.

Against this background it is unsurprising that vendors are looking to compete by reducing storage, improving query times and simplify administration. Oracle have taken steps to enhance their core database engine with features such as Exadata that improve each of these areas and continue to develop their strategy, however more and more is built into the core of its flagship general purpose engine resulting in software that has many features not needed by a specific application. Sybase have taken the more radical step of creating an entirely new database engine called Sybase IQ that does away with some of the limitations required of a general purpose engine to produce a solution that is both much faster in load and user query performance and far more efficient in its disk usage than other general purpose databases. The other traditional database vendors have all upgraded their product suite to chase the Business Intelligance market

Into this market enters the data warehousing appliance vendors, a breed of dedicated hardware and software solutions designed to solve a business’ data warehousing woes. Such systems use low cost commodity components in large volumes with dedicated business intelligence engines to deliver radically faster load times whilst at the same time reducing the query times and simplifying the systems administration process.

The first hurdle for many organisations is that data warehousing appliances are to some extent proprietary and therefore going against a corporate policy of open systems to allow technology re-use, however a solution built on one of the current market leading platforms, Teradata, is no less so. In fact Teradata can be considered one of the original data warehouse appliances and it is the use of the low-cost commodity components and the ability to achieve massive parallelism by the new-comers that differentiates them.

The second hurdle is credibility – the promises of such large benefits (typically query performance of ten to fifty times faster whilst using three to six times less storage on a platform that only requires a small amount of systems administration support) will be doubted, especially by systems and database administrators who have had to work so hard to maintain the performance of the existing solution. Vendors such as Netezza have overcome this challenge with some key accounts by providing a system on the basis that if it meets agreed performance criteria it will be purchased and thus significantly reducing the risk to the purchasing company.

The final obstacle is migration and its associated cost: an existing solution that is built, for example, on an Oracle database, using Oracle Warehouse Builder and Oracle Discoverer is effectively proprietary and therefore more difficult, but not impossible, to migrate.  This is also a reason to review the existing data warehousing architecture now to ensure that when these and other new technologies come along the business will be able to take advantage of them. Companies will be able to get a clear competative advantage if, by architecting their Business Intelligence systems into functional components, they can quickly change and adopt bigger, faster, cheaper technologies.

Those organisations that have overcome the hurdles report that they are achieving the immediate huge performance gains for their queries without the need for tuning the database whilst lowering the disk footprint and reducing the support costs. The systems also continue to deliver benefit as the fast query times allow more complex data models to be queried, which in turn reduces the need for complex ETL to restructure the data. These changes to the data model and to reduce the complexity of the ETL can be made either as part of the migration project (which delivers the largest benefit quickly but at the greatest risk) or as part of the change management process for the source systems (which delivers benefit over a longer time frame but significantly reduces the risk).

Added to this is the emergence of MapReduce, originally develeoped by Google, it is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This is becoming a must-have feature for appliance vendors handling very large data sets.

There are now a significant number of vendors working to produce some form of data warehousing appliance (Netezza, GreenPlum, AsterData, etc. to name but a few) and it is clear that appliances are going to form a key part of data warehouse architectures going forward, the risks of using a smaller vendor and a proprietary solution being outweighed by the business benefit of much more timely information at a significantly reduced cost. Note also that there will be market consolidation and some vendors will disappear.

For further information analyst Curt Monash is just one of a number of analysts who follow this subject and provides regular updates on the market.

Read the original version of the article
Share/Save/Bookmark

Nov 2, 2009

Creating graphics for BI Reports in Ruby On Rails

So your manager wants reports with sexy graphs - what are your options and what are the pros and cons

I have found eight usable packages with different considerations

OpenFlashChart

This is a series of libraries including one for Ruby that allow the generation of great looking charts that are very dynamic (see the demo on the website). Data is easily passed via JSON making it easy to integrate. The only downside is that it requires the client to have an Adobe Flash player installed. This is common enough for it not to be a problem for most people but it does limit the portability. Also as it generates a flash object rather than a JPG, PNG or GIF it means that this is suitable for online output but not useful in RSS feeds etc and when printed all the dynamic features are lost anyway.

Gruff

Gruff provides a simple and easy to use Ruby package that works well with Rails. It does require that ImageMagick/RMagick is installed on the server (HowTo for OSX) and this can be a pain for some people. The advantage is that it produces good looking, native SVG, PNG, JPG, GIF files so they are easily portable and printable

Scruffy


Scruffy is very similar to Gruff although appears to be less well used and not recently updated. It too requires ImageMagick/RMagick installed and whilst it is based on SVG is capable of producing other graph formats.

Sparklines


Sparklines are small graphs that are usually embedded within the text of a paragraph or placed inside tables. The standard output is a png file and it also requires ImageMagick/RMagick to be installed. This size of the graphs makes it useful for small graphs such as indicators rather than main report graphics.

gchartrb & Google Charts
See also http://www.infoq.com/articles/bass-google-charts-gchartrb


Google charts provide perhaps the easiest interface to creating a graph. All the parameters get bundled up in the URL and Google does all the work. Added to this is gchartrb that provides an easy and intuitive way to build the URL. All this means that there is nothing to install on the server side however it also means that to get access to your image you have to be connected to the internet and pass data which your company may regard as secure to the Google servers. This may make it impractical for use with BI data.

SVG::Graph


SVG:::Graph is a pure Ruby library for generating charts, which are a type of graph where the values of one axis are not scalar. The output is an SVG which might not be supported in all browsers or a PNG file

Flot


Flot is a pure Javascript plotting library for jQuery. It produces graphical plots of arbitrary datasets on-the-fly client-side. The package once again takes a JSON string as input and produces great looking graphs that can be dynamically updated

CSS Graph Helper


A Ruby on Rails helper for making simple graphs. The graphs use only CSS/HTML (and an optional gradient image). Great at creating quick simple online graphs but very limited in its capabilities

So:
If you are looking for a simple graph that is easy to generate then gchartrb is for you
If you are looking for dynamic and very attractive content then Flot or OpenFlashChart are your options
If you are looking for small KPI indicator type graphs consider Sparklines
If you are looking for good looking images that can be distributed consider Gruff, Scruffy and SVG::Graph
If you are happy with a quick and easy CSS based solution then look at CSS Graph Helper
Share/Save/Bookmark

Nov 1, 2009

Analysis by ETL ...

Performing source-system analysis by writing ETL is just about the most expensive way you can do it; but many organisations do exactly that. Why? Because they "can't afford" to do their analysis the cheaper way. No, this does not make sense.

This is how it works...

In most data warehousing projects, business analysts are separated from data profilers, data modellers, technical designers, ETL developers, and every other possible sub-division of the task you could imagine.

We sometimes find brilliant business analysts, but sadly, they are rare. More usually the business analysts don't like to get their hands dirty, and present the designers with works of elaborate fiction which have some resemblance to how the system might work in an idealised world. Usually their documents are at a "high level" (actually meaning vague and inaccurate).

Maybe we get some help from data profilers, who run the data profiling tool on a tiny sample of the data. (One tool, in the top right corner of Gartner's Magic Quadrant, limits samples to 999,999 records. I suggest that this is completely useless for a database table of a quarter of a billion rows, and such scale is usual in the sort of enterprise that would feel the need for a data warehouse.) The data profilers dutifully print out the verbose reports from the data profiling tool and the technical designers ignore them (not through malice, or even laziness, but because they know that these reports will not be much use).

The designers are more likely to add a further layer of invention to produce "Mapping Specifications" based on the way the system must work (surely!).

Then the ETL developers build their ETL jobs, generally pretty faithful to the mapping specifications, but often with some elaboration to take care of exceptional cases (most of which never happen in the real data) and with some "corrections" to the mapping specifications which (surely!) must be wrong on these points.

Now we could do with some test data. It is, of course, too expensive to build test data to cover all the test cases (which have not been defined until now) so work stops while negotiations begin to get extracts of real data from the operational systems.

Eventually the data arrives. The people providing it have carefully selected it to represent the cleanest view of the source system. Testing commences.

Actually, what happens now is that analysis commences. The ETL crashes into an idealised version of the real world and... crashes. The data is very unlike what was defined and it just doesn't work. We don't get an end-to-end run because the data formats are not accurate and even the extracts fail. We patch up the extracts, one defect at a time, until - eventually - the data gets as far as the staging area. Unfortunately, the project is now beginning to run late, so we don't have time to correct the mapping specifications. As for the source systems analysis - well those guys are long gone.

So we've stumbled through the extract and now we can start on the transformation. Well, if we couldn't get the source data formats right what chance have we got with the relationships which exist in the operational systems (which have been patched and enhanced over the years, with no regard to good database design, object-oriented principles, user interface good practice, or any discipline which might have given us a nice database to work with)?

Many months later, after many late nights and weekends of work by the ETL team, we finally have something which appears to work.

But this is on the sample data. When the system goes live, the ETL is exposed to the data quality errors, missing reference data, old test data, clever bits of encoding within fields (which the users have invented to overcome shortcomings in the system) and real data volumes.

Everything fails again, there are recriminations, more late nights and weekends. Many months later the system finally goes live and users start running reports. By this time the users have no confidence in the data warehouse - it has got a bit of a reputation - so they question everything which does not match what they had before, even if the new reports happen to be more accurate.

But some good has come out of this. Finally we have a team of people - the ETL developers - who actually understand how the source systems work and where data quality and design flaws exist. But the ETL developers have finished their work, so their contracts end and they walk out of the door taking the knowledge with them. No-one is going to pay for them to write it down.

This all sounds terribly cynical - doesn't it? But sadly it happens, over and over again. The tool vendors encourage this approach by the lie that their tool is the Silver Bullet (see The Mythical Man-Month by F.P. Brooks) that will solve all these ETL problems. The IT analysis and advice companies are accessories to the crime, re-enforcing the myths from the vendors.

It doesn't have to be this way. The application of realistic, sound, rigorous, good practices can overcome all of these problems, but this does require that managers understand how such projects work and don't just believe that pretending to be Alan Sugar or Gordon Ramsay is the best way to get results.

So what makes it better. First of all, we need people with skills across the range required for data warehousing projects. We want ETL developers who can talk to users, we want business analysts who understand the principles of good database design.

Then we carefully dig for real requirements. We don't just gather them. Quantity is not our goal. We don't want a spreadsheet with a list of 1,500 fields that appear on today's reports. We want to know what the users really need, what will save them the most time, what will improve the quality of the information they use. In the terminology of Agile development we want "stories" - the users' definition of what they need, why they want it, how important it is to them and the "acceptance criteria" which will enable us to know when we have delivered what they want.

The people doing the development must be involved at this stage. As they build working relationships with the users they can ask for clarification, they can show what they have discovered, and they can expose what they are building for the users to verify as they go along.

The team may use data profiling tools to find out the extreme edges of the system, the biggest and smallest values, the unusual patterns, the missing reference data, the orphaned transactions. The users will be able to explain some of the apparent strange values, and will discover why they have not always got the results they expected in the past. With the users' involvement, data quality issues can be fixed in the source systems. The data warehousing project is starting to deliver benefits before a line of ETL code has been written.

We may find that some master data management is required. Typically we will find different versions of the product catalogue in different applications. A project to sort out a single product catalogue will also save money in the business before a line of ETL code has been written.

With a proper understanding of the source systems the team can build a data model for the data warehouse which really models the work of the enterprise, not a fictional idealisation of it, and not a mirror of each operational system, but a process neutral data model which will be a valuable investment for years to come.

So, at this stage we have a robust target model (the data warehouse) and a good understanding of the source systems. We can now define the source to target mapping accurately, and, at the same time, build test cases to prove that the mappings work correctly. Yes, we build test cases, we do not just grab some real(ish) data and bash the ETL into it to see what happens. It does take a little time, but we automate our test runs so that the test case we build today can be run over and over again to keep on proving that the system is running correctly. The data volumes needed for this kind of unit testing are tiny, so test runs are completed quickly and can be run every time a new ETL module is added or an existing one is changed. (The ETL tools do not give much help in this, despite their 6-figure price tags.)

When we get a part of the data warehouse built successfully, then we can try some volume tests. Real data is useful here, but if it is not available, we use test data generators to ensure that our ETL and our data warehouse can handle real-world volumes (and a bit more). We may need to do some tuning. This may be as simple as adding some indexes, or it may cause us to rewrite chunks of our ETL code. Either way, we run our automated test suite, which now contains hundreds, maybe thousands of tests, to prove that we have not broken anything. Now the cost of building the tests is really paying us back.

At this stage the users can start running reports on the first part of the data warehouse. We can work with them to reconcile the results back to the source systems and to determine where any discrepancies lie. If we have done the preceding steps well, the new data warehouse will be right more often than not. When it is wrong, we can fix it and run our test suite to make sure we have not broken anything else. The users' involvement here builds confidence in the new data warehouse and smoothes the path to implementation.

So, if the first part of this article was cynical; is the second part hopelessly optimistic? No, it is not. We have done this and made it work, many times.

The keys to success are:
• people in the team with broad and deep skills
• involvement of the users from start to finish
• openness and honesty

The tools we use encourage these characteristics, but no tool is a silver bullet - without the buy-in of the team (users and developers) no tool will make a data warehousing project work.

To develop and manage documents, we use a wiki, and we do not lock down the security within the team (it is secure from competitors and any other possibly malicious access). Everyone in the team - users, managers, developers - can add to or change the wiki at any time, and everyone can see the current content and the history. Mistakes can always be fixed because the wiki keeps a complete history of changes, but the need to revert to history is very rare in our experience.

We use a tracking system for tasks, risks, issues, enhancements, test cases and defects. Again, everyone can see and comment on all of these. If a problem is brewing, then the team knows about it before it becomes a catastrophe.

We do not accept quick and dirty fixes. Technical debt is paid off as soon as possible, it is not allowed to accumulate. We are not afraid to "re-factor" to make our database and our code clean - we can easily prove that it is still working by running our automated tests.

Of course, we also use data modelling, data profiling, ETL, version control, and other tools, but we do not expect the tools to solve all our problems, and we choose free or low-cost tools where they do the job.

And, we do not do our analysis by writing ETL. We do analysis the much less expensive way, using our brains before we write all the code.
Share/Save/Bookmark

Oct 26, 2009

The project wiki - a cost reduction tool

Some readers will be familiar with TED: Ideas Worth Spreading a series of talks on just about everything worthwhile in Technology, Education and Design. I recently revisited Yochai Benklers talk on the new open-source economics from 2005 where he explains how collaborative projects like Wikipedia and Linux represent the next stage of human organisation.

The principle discussed is one it is more productive for a large number of people to work collaboratively on producing content than it is for a single individual to do so.

In Business Intelligence projects there are a number of formal documents that need to be produced to satisfy constraints external to the project (e.g. company business processes). Data Warehousing projects often fail to satisfy two masters, firstly failing to record information and knowledge pertinent to the on-going success of the project and secondly producing too many 'formal' documents that are time consumingly produced by key individuals but of little value to the project itself.

Data Management & Warehousing, my company, uses something called Project Services that combines Trac, SVN and the ideas of the Data Warehouse Documentation Roadmap to combine a Wiki, Version Control, Ticketing and Project/Team Website. Bringing these together and using them optimally is a micro-example of the concepts in Yochai Benklers talk.

A data warehouse team can quickly and efficiently build wiki pages that record much, if not all of the information that is internal to the project. This can include the business definition dictionary, various pieces of the requirements, architecture, analysis, standards and definitions, etc. What is more in this environment it is possible to have links between wiki pages and documents and between wiki pages and the source code itself. This means that users of the system can quickly and intuitively navigate through information and, where there is missing information or inaccuracies they can fix it themselves. In the case of project services there is also a ticketing system for task, risk, issue, enhancement tracking built in to complete the project management and governance aspects. Because this approach makes it easier for users they are more willing to contribute to the overall solution and to follow the required processes.

Implementing and using a tool such as project services is key to creating a BI-on-Rails solution because whilst the strict version control and issue management can be enforced it also allows agile processes to work and strong internal communications within the team.
Share/Save/Bookmark

Oct 22, 2009

Oh what a tangled web we weave, when first we practise to deceive!

The deception in our case is the accuracy of the data and the return on investment of the data warehouse solution. Too often projects make compromises in the implementation of a business intelligence solution that create massive cost and user dis-satisfaction downstream that lead to the failure of BI projects:

For example:

* In the project management setting un-realistic or un-deliverable targets, opting for tactical solutions that soon overtake the main project and the failure to communicate issues to senior management and business users that will affect the delivery timescales and costs.

* In the requirements phase believing that you know better that the user and not getting their sign off that they understand and agree with the requirements.

* In the analysis phase the failure to correctly identify the master sources of information and to do sufficient work to understand how the data is stored in the source system and what data is needed to meet the requirements.

* In the design phase the expediency of allowing the developers to 'get on with their job' by not thoroughly validating the design and checking the sources to ensure that the design meets criteria such as timeliness, reliability and accuracy.

* In the build phase by coding to the bear minimum to get the job done and not to a standard that ensures it will run successfully time after time.

* In the testing phase where as well as the standard data and volume data you test all the data boundary conditions that might occur.

* In the data quality aspects of the programme where data is fixed on the way in to the system or after it has been loaded because it is 'too difficult' to fix it back at the source system.

These are only a few examples but each of them makes the final solution more difficult (a tangled web) and therefore more costly to operate and maintain. If you are in a project where doing the right thing is compromised then remember that the project will have to pay for these 'efficiencies' later. Project managers should always strive to improve any project lifecycle and deliver on time. Too often, however, it becomes expedient to bury the truth and deliver anything to meet an arbitrary date that, in the long term, leaves the business user unsatisfied.

For more on this subject have a look at:

* Black Swans and White Elephants - RoI in Business Intelligence

* Getting to NO

* Technical Debt

By the way the quote is often mistakenly attributed to William Shakespeare but was written by Sir Walter Scott (1771 - 1832) in his poem Marmion - another data quality problem!
Share/Save/Bookmark

Oct 14, 2009

BI Convention over Configuration

The convention over configuration discussion when designing a data warehouse often leads to fanatical discussions by technical people over the 'best' approach to do things.

Convention can be used to define the standard way to design a data model (see Process Neutral Data Modelling) or to implement an ETL transformation (for example a reducing sets approach) amongst other things but inevitably someone will claim to be able to make an individual element 'better' bet by configuration.

For example if we have an ETL mapping and always approach handling data change in a certain way then unless we are really sure of the benefits we should stick to the convention. In our example a DBA offers to make a single ETL mapping faster. The DBA can halve the time - so what is the harm?

Firstly if the new algorithm is as complete but just faster then there is no harm. In fact I would encourage you to update your conventions and re-factor your code so that all of the mappings exploit the algorithm.

But what if the faster algorithm works because it omits some element? Maybe it doesn't need to check all the columns, perhaps there is an assumption of referential integrity, or it joins two steps together where other tables need the intermediate results.

This is where the cost creeps back in. In the future the omitted column is needed, referential integrity is no longer true, the intermediate step is required, etc. At best this causes un-necessary rework. At worse it causes silent corruption of the data that causes a crash or takes months to unravel. It also means that because it does not conform to the standard the person fixing the code has to work through the code to understand it first, and assuming they understand it perfectly, only then they can then fix it - all of this taking more resource time and cost.

The convention over configuration option does not mean that there is only one way to do something, nor does it mean that it is optimal for the individual case - it means that there are a defined set of algorithms for a given process and that these are collectively optimal for the solution - a case of looking at the whole picture rather than individual elements. In the process neutral data modelling technique it will often create a data model with over 1000 tables but there are 10 named approaches to the data modelling issue. How many ETL algorithms will you need? My guess is that it is in the same order of magnitude, about 100:1 against the total number or transformations. What is more the number of ETL transformation algorithms required drops dramatically if the data model is uniform because it is easy to describe a standard ETL algorithm for a standard data model.

Returning to the start of this article - your DBA can do this mapping better? But in what sense? Is it more cost-effective, more maintainable? Easier to understand, or cheaper to maintain? The decision to go the bespoke route for this one case should balance these factors and their costs against the necessity of the performance gain.

Experience tells me that in 99.9% of cases bespoke is not worthwhile in the long term.
Share/Save/Bookmark