Data Quality: Impact on Project Planning

Last year Rob Thomsett published a series of excellent articles on the concept of quality in respect of system or product build. You can find the link here:

For those of you who missed it the first time around it’s well worth a read.

He made the point that quality can be defined as a series of objective attributes, but it’s a subjective choice as to which of those attributes you want and are prepared to pay for.

For some golfers, the sound a driver makes when they tee off is very important. Golf club manufacturers spend lots of money on this quality attribute and people pay extra for it. Others look at the scorecard, see there is no way to adjust their score for “sounds really great off the tee”, and won’t pay for the privilege of having their golf buddies nod approvingly as they smack one down the middle with a satisfying thwack.

Quality attributes for system and product design, drawing on standards such as ISO 25010, include conformity, usability, accuracy, efficiency, flexibility, portability and auditability. Rob adds job impact as a final attribute.

The question I am posing, and the reason for this article, is this: is there more to quality than just the quality of the product or system? Here I am talking about the quality of the data in the system.

Have you ever had this experience? You ring a call centre and the very helpful robot who answers asks you to enter a range of details such as your date of birth, account number etc. After the requisite 15 minutes on hold, during which the company is at pains to tell you seventeen times how important you are to them, you get through to a human who then asks you for your date of birth and account number. You explain through gritted teeth that you have already provided this.

Let’s assume the system which houses your data was built properly and meets the right quality standards above – it’s conformant, usable, flexible, auditable etc. The sponsor for the project that built the system has been lauded for the success of the build.

As a customer, your care factor for this “quality” outcome is zero. Why? Because the call centre human can’t see the data you just entered. Data quality matters; it’s integral to customer experience.

With the shift to cloud-based technology and the ability to gather and process large volumes of data, the key strategic advantage for companies has become their data assets, not their technology assets. For decades traditional companies had a natural barrier to entry for competitors in their legacy tech stacks: it was simply too hard to match the technology incumbency of, say, an insurance company or a bank. Thus the internal focus was on the technology, with big investments in maintaining and improving it. As a consequence many projects considered investments from a technology point of view, with data almost thrown in as an afterthought. In many companies I have seen, the technology asset owner is readily identifiable and understands her obligations. Ask those companies for a register of data owners and you will get an uncomfortable silence.

Great companies today realise data is the key asset, and the technology which enables its use is secondary. When anyone can spin up cloud-based tech in under a day, this traditional barrier to entry for competitors is gone, replaced by data: it is much harder to break into a new market when your competitors know so much about the customer base.

Just like product quality, data quality can be defined as a range of objective attributes, which can be subjectively chosen (and paid for). In the new, data-centric world, intelligent choices still need to be made about which attributes are important. Too often in project planning, insufficient thought is given to these choices.

There is a broad range of definitions of DQ attributes which could be debated at length. I thought I’d pick out a few key ones:

Institutional environment

The institutional and organisational factors that have an influence on the effectiveness and credibility of the creation, retrieval, update and deletion of data.

Libraries are good at this. Your teenage son organising his diary for the next school term – not so much.

Relevance

The degree to which data meets the needs of users. Assessing relevance is a subjective matter dependent upon the varying needs of users.

If I am buying a smart watch, its ability to monitor my blood pressure is not something I would pay for if this data is not relevant to me. For others this could be very relevant, and the price point adjusts accordingly.

Timeliness

The period between the creation of the data and time at which the data becomes available. It often involves a trade-off against accuracy.

Smart electricity meters were deployed across the state of Victoria in the early 2000s, and they now measure home electricity usage every 30 minutes. However, the data is only required to be transmitted to energy retailers and made available to customers on an overnight basis. By comparison, a consumer who installs solar panels or a home battery gets real-time data. Day-old data is much less effective in allowing customers to manage and reduce their energy consumption.
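
For those who like to see this in concrete terms, timeliness can be measured as the gap between when a reading is created and when it becomes available. The sketch below is illustrative only: the timestamps and the five-minute target are made up, and `availability_lag` is a hypothetical helper, not part of any metering system.

```python
from datetime import datetime, timedelta

def availability_lag(created: datetime, available: datetime) -> timedelta:
    """Time between data being created and data becoming available."""
    return available - created

# Hypothetical example: a meter reading taken in the evening
# and published to the customer the next morning.
reading_taken = datetime(2024, 1, 15, 18, 30)
published = datetime(2024, 1, 16, 6, 0)

lag = availability_lag(reading_taken, published)

# Assumed target for a "real-time" use case; choose your own.
target = timedelta(minutes=5)
meets_target = lag <= target

print(f"Lag: {lag}, meets target: {meets_target}")
```

The useful part is not the arithmetic but the target: agreeing on an acceptable lag up front is what turns timeliness into a requirement that can be tested.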

Accuracy

The degree to which the data correctly describe the phenomena they were designed to measure.

When Apollo 11 landed on the moon they reportedly had 20 seconds’ worth of fuel left. The accuracy requirement for their fuel gauge was orders of magnitude higher than that of a family car, with a commensurate increase in cost.

Consistency

Does the data match required formatting standards, is it captured in a repeatable way?

A simple example here is date formatting – DD-MM-YY might contain the same data as YYYY-MM-DD but it’s not easy to combine this data into a single table.
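
As a rough illustration of the date example, the sketch below converts both formats into YYYY-MM-DD before the data is combined. The list of expected formats and the `to_iso` helper are assumptions for the example; a real project would need to agree on which formats its source systems actually use.

```python
from datetime import datetime

# Assumed list of formats seen across source systems (illustrative only).
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%m-%y")

def to_iso(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"Unrecognised date format: {raw!r}")

mixed = ["2021-03-04", "04-03-21"]
normalised = [to_iso(d) for d in mixed]
print(normalised)
```

Note that a two-digit year is itself a data quality risk: Python has to guess the century for "21", and a date of birth could easily be guessed wrong. Anything that matches none of the known formats is rejected rather than silently loaded.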

Accessibility

Can the data be accessed?

In 2015 regulators looking into allegations of manipulation in the FX markets demanded large volumes of phone call recordings from banks. The recordings were often held on digital tapes and had only ever been used for occasional settling of queries around trade details. Systems were not designed for bulk retrieval of calls, and meeting the demands required significant system upgrades and took months.

Uniqueness

Is there a single view of the data set?

How many versions of your name and address does your bank have? How many copies of the same photos of your family trip to Vietnam do you have on your PC? And don’t get me started on my music collection….
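
A first pass at the name and address question can be as simple as grouping records on a cleaned-up key. The sketch below is deliberately crude and entirely illustrative: the records are invented, and `match_key` and `find_duplicates` are hypothetical helpers. It only catches differences in case, punctuation and spacing, not abbreviations or misspellings.

```python
from collections import defaultdict

def match_key(name: str, address: str) -> tuple[str, str]:
    """Crude matching key: lower-case, drop punctuation, collapse spaces."""
    def clean(text: str) -> str:
        kept = "".join(
            ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
        )
        return " ".join(kept.split())
    return clean(name), clean(address)

def find_duplicates(records: list[dict]) -> list[list[dict]]:
    """Group records sharing a matching key; return groups of two or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec["name"], rec["address"])].append(rec)
    return [group for group in groups.values() if len(group) > 1]

# Invented customer records for illustration.
customers = [
    {"id": 1, "name": "Jane Citizen", "address": "1 High St."},
    {"id": 2, "name": "JANE  CITIZEN", "address": "1 High St"},
    {"id": 3, "name": "J. Citizen", "address": "1 High Street"},
]

for group in find_duplicates(customers):
    print([rec["id"] for rec in group])
```

Records 1 and 2 are matched, but record 3 is not, even though a human would probably say it is the same person. Deciding how far to go in closing that gap is exactly the kind of subjective, paid-for choice discussed above.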

In addition to these attributes a range of other questions should be considered when specifying data quality requirements.

Data lineage. Data lineage refers to the origin and transformations that data goes through over time. Data lineage tells the story of a specific piece of data.

If you draw up a new will and it says that the new will invalidates all the old ones, this may not matter too much. If you need records on ownership of land title over time you will want a record of what has changed and when.
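
The land title case can be sketched as an append-only history, where nothing is overwritten and every change is kept with its date and source. This is a minimal illustration under assumed names (`Change`, `TrackedValue`) with invented data, not a description of how any real land registry works.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Change:
    effective: date   # when the change took effect
    value: str        # e.g. the owner's name
    source: str       # where the change came from

@dataclass
class TrackedValue:
    history: list[Change] = field(default_factory=list)

    def record(self, effective: date, value: str, source: str) -> None:
        """Append a change; earlier entries are never modified."""
        self.history.append(Change(effective, value, source))

    def as_of(self, when: date) -> Optional[str]:
        """Value in force on a given date, or None if nothing was recorded yet."""
        earlier = [c for c in self.history if c.effective <= when]
        if not earlier:
            return None
        return max(earlier, key=lambda c: c.effective).value

# Invented ownership history for one property.
owner = TrackedValue()
owner.record(date(1998, 5, 1), "A. Smith", "original title")
owner.record(date(2012, 9, 14), "B. Nguyen", "transfer of sale")

print(owner.as_of(date(2005, 1, 1)))
print(owner.as_of(date(2020, 1, 1)))
```

The will, by contrast, only ever needs the latest value, which is why keeping lineage is a choice to be made rather than a default.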

Confidence in the data – Are Data Governance, Data Protection and Data Security in place? What is the reputation of the data, and is it verified or verifiable? Are there controls on the creation, retrieval, update and deletion of data?

Value of the data – Is there a good cost/benefit case for the data? Is it being optimally used? Does it endanger people’s safety or privacy or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?

In the planning stages for a project a lot of effort is put into the “as is – to be” system states. Far less effort is put into this question for data. A high-level diagnostic of data quality at the outset of a project can pay huge dividends. Remediation of data quality gaps discovered during build (or worse, after go-live) can be expensive in terms of time, cost and reputation.

As examples:

Identification of major gaps in completeness of data in a standing data repository could lead to a whole new scope item in the project plan.
Realisation that data is required in real time when it’s currently available overnight changes system architecture.
Understanding data retrieval requirements can inform GUI design, user access hierarchies and controls.
Design and implementation of appropriate reconciliation processes can add cost but also adds to users’ confidence in the data.
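
A high-level diagnostic does not need to be sophisticated to be useful. As a sketch of the completeness example, the snippet below reports what share of records has a value in each field. The field names and records are invented, and `completeness` is a hypothetical helper; a real diagnostic would run against an extract of the actual repository.

```python
def completeness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Share of records (0.0 to 1.0) with a non-empty value for each field."""
    total = len(records)
    result = {}
    for name in fields:
        filled = sum(1 for rec in records if rec.get(name) not in (None, ""))
        result[name] = filled / total if total else 0.0
    return result

# Invented standing data extract for illustration.
accounts = [
    {"account": "A1", "dob": "1980-02-01", "email": "a@example.com"},
    {"account": "A2", "dob": "1975-11-23", "email": ""},
    {"account": "A3", "dob": None, "email": "c@example.com"},
    {"account": "A4", "dob": "1990-07-30"},
]

report = completeness(accounts, ["account", "dob", "email"])
for name, share in report.items():
    print(f"{name}: {share:.0%}")
```

A result like this at the start of a project is what turns "the data is probably fine" into a scoped and costed item in the plan.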

To come back to the traditional vs new world company comparison earlier – both types of companies have made data quality attribute choices, often by default. Traditional companies tended to focus on completeness and accuracy at the expense of timeliness. When you are using data to calculate a customer’s energy bill, completeness and accuracy matter a lot; if the bill takes a day or so to produce, so be it.

Data-centric consumer companies flip this around – timeliness of data (and the real-time analytics that come with this) is more important than ensuring all data from all sources is captured and reconciled to high levels of accuracy. If you are an on-line media publisher adjusting content to maximise eyeballs on screens, then trends based on data which is mostly right but available now are what matter.

Regardless of what type of company you are, data quality matters. Understanding which DQ attributes matter to you, and whether those attributes exist and can be used, should be an explicit stream of any good project plan, with budget and resources set aside to ensure you get what you need from your data.

About Mike Stockley

Mike is a seasoned program director with significant experience at Executive level within the Financial Services and Energy industries, with a range of global roles. Mike brings several years of direct and current experience in setting up, leading, and executing large-scale transformation programs. He also boasts deep subject matter expertise in markets-related regulatory reform, risk governance, and large-scale portfolio management.