Nov 8, 2009

At last - someone we understand ...


It is rare with all the background reading and research that I have to do that I read a contemporary book and say "Yes, this guy gets it" - notice that I say contemporary. There are plenty of good books about the day-to-day business intelligence problems that I see which were written over 20 years ago (The Mythical Man Month by Frederick P Brooks [1975] and Peopleware: Productive Projects & Teams by Tom DeMarco and Timothy Lister [1987] to name but two). So it was with immense pleasure that I read the opening chapters of Netezza Underground: The Unauthorized Tales of Derring-do and Adventures in Resilient Data Warehousing Solutions by David Birmingham [2008]. This book describes, in a hugely readable way, many of the same concepts and ideas that I have been espousing for the last 15 years about how to deliver a successsful data warehouse solution.

David describes the building of data warehouses as the need to think of terabytes, not transactional systems and a discussion of very-large-scale data systems. He says that the rules are different here and yet oddly the same:
  1. Everything is requirements driven
  2. Simplify and clarify
  3. Use correctly powered and scalable systems
  4. Governance
  5. Data management
  6. Strong architectural approach
  7. Building the environment with the expectation of change
  8. Testability
  9. Go Parallel
  10. Never do bulk inside a traditional RDBMS
I would love to drill into each of these in more detail but doing so would simply reprise an excellent book and probably breach copyright! And if the mention of Netezza as a vendor puts you off, then read the book anyway, because whilst the author is an avid enthusiast of the product (and some reviewers dislike the consequentially irreverent style), the approach, techniques and philosophy he describes provide much of the same advice that we have been giving our customers for so long and that can be applied to other technical solutions. Of course if you are interested in implementing a system we are here to help too!
Share/Save/Bookmark

Nov 4, 2009

Data Warehousing Appliances – Fad or Future?

This article was originally written for Conspectus Magazine in December 2006 and has been updated in November 2009 by the original author

Despite all the hype from vendors the basics of data warehousing have remained fundamentally unchanged – extract data from multiple source systems, reformat the information into an easy to query structure, load it into a dedicated database and add an effective user interface to allow users to query the information. The cost of this environment is substantial and directly relates to the complexity of the Extract, Transform and Load (ETL) process and the volume of data held in the system.

The complexity of the ETL process has two cost impacts: the first is in the cost of the initial design and development and is reasonably well understood. The second is the cost of changes over the lifetime of the system, for example if an organisation have four source systems and each system under goes a change once a quarter then the data warehouse support team have to modify and test an interface every three weeks, and all this without any changes in the users requirements. The volume of data also hits the bottom line, not only in the cost of storage but in the size and (more expensive) skills of team required to support it, especially as data explosion forces the business to enter the very large database arena where load time and user query performance are critical.

Against this background it is unsurprising that vendors are looking to compete by reducing storage, improving query times and simplify administration. Oracle have taken steps to enhance their core database engine with features such as Exadata that improve each of these areas and continue to develop their strategy, however more and more is built into the core of its flagship general purpose engine resulting in software that has many features not needed by a specific application. Sybase have taken the more radical step of creating an entirely new database engine called Sybase IQ that does away with some of the limitations required of a general purpose engine to produce a solution that is both much faster in load and user query performance and far more efficient in its disk usage than other general purpose databases. The other traditional database vendors have all upgraded their product suite to chase the Business Intelligance market

Into this market enters the data warehousing appliance vendors, a breed of dedicated hardware and software solutions designed to solve a business’ data warehousing woes. Such systems use low cost commodity components in large volumes with dedicated business intelligence engines to deliver radically faster load times whilst at the same time reducing the query times and simplifying the systems administration process.

The first hurdle for many organisations is that data warehousing appliances are to some extent proprietary and therefore going against a corporate policy of open systems to allow technology re-use, however a solution built on one of the current market leading platforms, Teradata, is no less so. In fact Teradata can be considered one of the original data warehouse appliances and it is the use of the low-cost commodity components and the ability to achieve massive parallelism by the new-comers that differentiates them.

The second hurdle is credibility – the promises of such large benefits (typically query performance of ten to fifty times faster whilst using three to six times less storage on a platform that only requires a small amount of systems administration support) will be doubted, especially by systems and database administrators who have had to work so hard to maintain the performance of the existing solution. Vendors such as Netezza have overcome this challenge with some key accounts by providing a system on the basis that if it meets agreed performance criteria it will be purchased and thus significantly reducing the risk to the purchasing company.

The final obstacle is migration and its associated cost: an existing solution that is built, for example, on an Oracle database, using Oracle Warehouse Builder and Oracle Discoverer is effectively proprietary and therefore more difficult, but not impossible, to migrate.  This is also a reason to review the existing data warehousing architecture now to ensure that when these and other new technologies come along the business will be able to take advantage of them. Companies will be able to get a clear competative advantage if, by architecting their Business Intelligence systems into functional components, they can quickly change and adopt bigger, faster, cheaper technologies.

Those organisations that have overcome the hurdles report that they are achieving the immediate huge performance gains for their queries without the need for tuning the database whilst lowering the disk footprint and reducing the support costs. The systems also continue to deliver benefit as the fast query times allow more complex data models to be queried, which in turn reduces the need for complex ETL to restructure the data. These changes to the data model and to reduce the complexity of the ETL can be made either as part of the migration project (which delivers the largest benefit quickly but at the greatest risk) or as part of the change management process for the source systems (which delivers benefit over a longer time frame but significantly reduces the risk).

Added to this is the emergence of MapReduce, originally develeoped by Google, it is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This is becoming a must-have feature for appliance vendors handling very large data sets.

There are now a significant number of vendors working to produce some form of data warehousing appliance (Netezza, GreenPlum, AsterData, etc. to name but a few) and it is clear that appliances are going to form a key part of data warehouse architectures going forward, the risks of using a smaller vendor and a proprietary solution being outweighed by the business benefit of much more timely information at a significantly reduced cost. Note also that there will be market consolidation and some vendors will disappear.

For further information analyst Curt Monash is just one of a number of analysts who follow this subject and provides regular updates on the market.

Read the original version of the article
Share/Save/Bookmark

Nov 2, 2009

Creating graphics for BI Reports in Ruby On Rails

So your manager wants reports with sexy graphs - what are your options and what are the pros and cons

I have found eight usable packages with different considerations

OpenFlashChart

This is a series of libraries including one for Ruby that allow the generation of great looking charts that are very dynamic (see the demo on the website). Data is easily passed via JSON making it easy to integrate. The only downside is that it requires the client to have an Adobe Flash player installed. This is common enough for it not to be a problem for most people but it does limit the portability. Also as it generates a flash object rather than a JPG, PNG or GIF it means that this is suitable for online output but not useful in RSS feeds etc and when printed all the dynamic features are lost anyway.

Gruff

Gruff provides a simple and easy to use Ruby package that works well with Rails. It does require that ImageMagick/RMagick is installed on the server (HowTo for OSX) and this can be a pain for some people. The advantage is that it produces good looking, native SVG, PNG, JPG, GIF files so they are easily portable and printable

Scruffy


Scruffy is very similar to Gruff although appears to be less well used and not recently updated. It too requires ImageMagick/RMagick installed and whilst it is based on SVG is capable of producing other graph formats.

Sparklines


Sparklines are small graphs that are usually embedded within the text of a paragraph or placed inside tables. The standard output is a png file and it also requires ImageMagick/RMagick to be installed. This size of the graphs makes it useful for small graphs such as indicators rather than main report graphics.

gchartrb & Google Charts
See also http://www.infoq.com/articles/bass-google-charts-gchartrb


Google charts provide perhaps the easiest interface to creating a graph. All the parameters get bundled up in the URL and Google does all the work. Added to this is gchartrb that provides an easy and intuitive way to build the URL. All this means that there is nothing to install on the server side however it also means that to get access to your image you have to be connected to the internet and pass data which your company may regard as secure to the Google servers. This may make it impractical for use with BI data.

SVG::Graph


SVG:::Graph is a pure Ruby library for generating charts, which are a type of graph where the values of one axis are not scalar. The output is an SVG which might not be supported in all browsers or a PNG file

Flot


Flot is a pure Javascript plotting library for jQuery. It produces graphical plots of arbitrary datasets on-the-fly client-side. The package once again takes a JSON string as input and produces great looking graphs that can be dynamically updated

CSS Graph Helper


A Ruby on Rails helper for making simple graphs. The graphs use only CSS/HTML (and an optional gradient image). Great at creating quick simple online graphs but very limited in its capabilities

So:
If you are looking for a simple graph that is easy to generate then gchartrb is for you
If you are looking for dynamic and very attractive content then Flot or OpenFlashChart are your options
If you are looking for small KPI indicator type graphs consider Sparklines
If you are looking for good looking images that can be distributed consider Gruff, Scruffy and SVG::Graph
If you are happy with a quick and easy CSS based solution then look at CSS Graph Helper
Share/Save/Bookmark

Nov 1, 2009

Analysis by ETL ...

Performing source-system analysis by writing ETL is just about the most expensive way you can do it; but many organisations do exactly that. Why? Because they "can't afford" to do their analysis the cheaper way. No, this does not make sense.

This is how it works...

In most data warehousing projects, business analysts are separated from data profilers, data modellers, technical designers, ETL developers, and every other possible sub-division of the task you could imagine.

We sometimes find brilliant business analysts, but sadly, they are rare. More usually the business analysts don't like to get their hands dirty, and present the designers with works of elaborate fiction which have some resemblance to how the system might work in an idealised world. Usually their documents are at a "high level" (actually meaning vague and inaccurate).

Maybe we get some help from data profilers, who run the data profiling tool on a tiny sample of the data. (One tool, in the top right corner of Gartner's Magic Quadrant, limits samples to 999,999 records. I suggest that this is completely useless for a database table of a quarter of a billion rows, and such scale is usual in the sort of enterprise that would feel the need for a data warehouse.) The data profilers dutifully print out the verbose reports from the data profiling tool and the technical designers ignore them (not through malice, or even laziness, but because they know that these reports will not be much use).

The designers are more likely to add a further layer of invention to produce "Mapping Specifications" based on the way the system must work (surely!).

Then the ETL developers build their ETL jobs, generally pretty faithful to the mapping specifications, but often with some elaboration to take care of exceptional cases (most of which never happen in the real data) and with some "corrections" to the mapping specifications which (surely!) must be wrong on these points.

Now we could do with some test data. It is, of course, too expensive to build test data to cover all the test cases (which have not been defined until now) so work stops while negotiations begin to get extracts of real data from the operational systems.

Eventually the data arrives. The people providing it have carefully selected it to represent the cleanest view of the source system. Testing commences.

Actually, what happens now is that analysis commences. The ETL crashes into an idealised version of the real world and... crashes. The data is very unlike what was defined and it just doesn't work. We don't get an end-to-end run because the data formats are not accurate and even the extracts fail. We patch up the extracts, one defect at a time, until - eventually - the data gets as far as the staging area. Unfortunately, the project is now beginning to run late, so we don't have time to correct the mapping specifications. As for the source systems analysis - well those guys are long gone.

So we've stumbled through the extract and now we can start on the transformation. Well, if we couldn't get the source data formats right what chance have we got with the relationships which exist in the operational systems (which have been patched and enhanced over the years, with no regard to good database design, object-oriented principles, user interface good practice, or any discipline which might have given us a nice database to work with)?

Many months later, after many late nights and weekends of work by the ETL team, we finally have something which appears to work.

But this is on the sample data. When the system goes live, the ETL is exposed to the data quality errors, missing reference data, old test data, clever bits of encoding within fields (which the users have invented to overcome shortcomings in the system) and real data volumes.

Everything fails again, there are recriminations, more late nights and weekends. Many months later the system finally goes live and users start running reports. By this time the users have no confidence in the data warehouse - it has got a bit of a reputation - so they question everything which does not match what they had before, even if the new reports happen to be more accurate.

But some good has come out of this. Finally we have a team of people - the ETL developers - who actually understand how the source systems work and where data quality and design flaws exist. But the ETL developers have finished their work, so their contracts end and they walk out of the door taking the knowledge with them. No-one is going to pay for them to write it down.

This all sounds terribly cynical - doesn't it? But sadly it happens, over and over again. The tool vendors encourage this approach by the lie that their tool is the Silver Bullet (see The Mythical Man-Month by F.P. Brooks) that will solve all these ETL problems. The IT analysis and advice companies are accessories to the crime, re-enforcing the myths from the vendors.

It doesn't have to be this way. The application of realistic, sound, rigorous, good practices can overcome all of these problems, but this does require that managers understand how such projects work and don't just believe that pretending to be Alan Sugar or Gordon Ramsay is the best way to get results.

So what makes it better. First of all, we need people with skills across the range required for data warehousing projects. We want ETL developers who can talk to users, we want business analysts who understand the principles of good database design.

Then we carefully dig for real requirements. We don't just gather them. Quantity is not our goal. We don't want a spreadsheet with a list of 1,500 fields that appear on today's reports. We want to know what the users really need, what will save them the most time, what will improve the quality of the information they use. In the terminology of Agile development we want "stories" - the users' definition of what they need, why they want it, how important it is to them and the "acceptance criteria" which will enable us to know when we have delivered what they want.

The people doing the development must be involved at this stage. As they build working relationships with the users they can ask for clarification, they can show what they have discovered, and they can expose what they are building for the users to verify as they go along.

The team may use data profiling tools to find out the extreme edges of the system, the biggest and smallest values, the unusual patterns, the missing reference data, the orphaned transactions. The users will be able to explain some of the apparent strange values, and will discover why they have not always got the results they expected in the past. With the users' involvement, data quality issues can be fixed in the source systems. The data warehousing project is starting to deliver benefits before a line of ETL code has been written.

We may find that some master data management is required. Typically we will find different versions of the product catalogue in different applications. A project to sort out a single product catalogue will also save money in the business before a line of ETL code has been written.

With a proper understanding of the source systems the team can build a data model for the data warehouse which really models the work of the enterprise, not a fictional idealisation of it, and not a mirror of each operational system, but a process neutral data model which will be a valuable investment for years to come.

So, at this stage we have a robust target model (the data warehouse) and a good understanding of the source systems. We can now define the source to target mapping accurately, and, at the same time, build test cases to prove that the mappings work correctly. Yes, we build test cases, we do not just grab some real(ish) data and bash the ETL into it to see what happens. It does take a little time, but we automate our test runs so that the test case we build today can be run over and over again to keep on proving that the system is running correctly. The data volumes needed for this kind of unit testing are tiny, so test runs are completed quickly and can be run every time a new ETL module is added or an existing one is changed. (The ETL tools do not give much help in this, despite their 6-figure price tags.)

When we get a part of the data warehouse built successfully, then we can try some volume tests. Real data is useful here, but if it is not available, we use test data generators to ensure that our ETL and our data warehouse can handle real-world volumes (and a bit more). We may need to do some tuning. This may be as simple as adding some indexes, or it may cause us to rewrite chunks of our ETL code. Either way, we run our automated test suite, which now contains hundreds, maybe thousands of tests, to prove that we have not broken anything. Now the cost of building the tests is really paying us back.

At this stage the users can start running reports on the first part of the data warehouse. We can work with them to reconcile the results back to the source systems and to determine where any discrepancies lie. If we have done the preceding steps well, the new data warehouse will be right more often than not. When it is wrong, we can fix it and run our test suite to make sure we have not broken anything else. The users' involvement here builds confidence in the new data warehouse and smoothes the path to implementation.

So, if the first part of this article was cynical; is the second part hopelessly optimistic? No, it is not. We have done this and made it work, many times.

The keys to success are:
• people in the team with broad and deep skills
• involvement of the users from start to finish
• openness and honesty

The tools we use encourage these characteristics, but no tool is a silver bullet - without the buy-in of the team (users and developers) no tool will make a data warehousing project work.

To develop and manage documents, we use a wiki, and we do not lock down the security within the team (it is secure from competitors and any other possibly malicious access). Everyone in the team - users, managers, developers - can add to or change the wiki at any time, and everyone can see the current content and the history. Mistakes can always be fixed because the wiki keeps a complete history of changes, but the need to revert to history is very rare in our experience.

We use a tracking system for tasks, risks, issues, enhancements, test cases and defects. Again, everyone can see and comment on all of these. If a problem is brewing, then the team knows about it before it becomes a catastrophe.

We do not accept quick and dirty fixes. Technical debt is paid off as soon as possible, it is not allowed to accumulate. We are not afraid to "re-factor" to make our database and our code clean - we can easily prove that it is still working by running our automated tests.

Of course, we also use data modelling, data profiling, ETL, version control, and other tools, but we do not expect the tools to solve all our problems, and we choose free or low-cost tools where they do the job.

And, we do not do our analysis by writing ETL. We do analysis the much less expensive way, using our brains before we write all the code.
Share/Save/Bookmark