Need to brush up on data quality concepts? You're in the right place. Let's cut through the complexity and focus on what really matters in the world of data quality.
First Things First: Why does Data Quality Matter?
Have you ever wondered, "What can my data do for me?" If the answer is yes, then it's likely that your organisation's data journey has hit a roadblock.
Data-driven decision-making should be a strategic initiative, not a random exploration. Instead of wondering what insights data might reveal, you should focus on how data can directly support your business objectives. To achieve this, your data must be meaningful, timely, representative, and most importantly, reliable.
Quality-controlled data is the most important aspect for effective data-driven decisions.
Need more convincing? Check out our blog ‘Assessing your Data’s quality: Dimensions & Methodologies for evaluation’, which delves deeper into why data quality is crucial for your organisation.
Here's the thing: data quality isn't just about having clean spreadsheets anymore. In our connected world, it's about managing a constant flood of information coming from countless sources. That's where the 5 Vs of Big Data come in…
Understanding the bigger picture: The 5Vs of Big Data
In simple terms, Big Data represents the complex interplay of various data components that an organisation either generates or requires in its operations.
These characteristics are traditionally understood as follows:
1. Volume
Volume refers to the amount of data generated. It matters statistically because the number of data points determines how well the data represents the events behind a trend or inference: the smaller the sample, the less reliable the conclusion drawn from it.
Example: Consider a survey result showing 90% of respondents preferred chocolate ice cream. This seems impressive until you learn only 10 responses were received from 2,000 surveys distributed.
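To put numbers on the ice cream example, here is a minimal sketch that computes the response rate and a 95% Wilson score interval for the observed preference. It assumes that 9 of the 10 respondents chose chocolate (which is what 90% of 10 implies); the function name `wilson_interval` is purely illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


surveys_sent = 2000
responses = 10
chose_chocolate = 9  # assumption: 90% of the 10 responses

response_rate = responses / surveys_sent
low, high = wilson_interval(chose_chocolate, responses)

print(f"Response rate: {response_rate:.1%}")
print(f"95% interval for chocolate preference: {low:.0%} to {high:.0%}")
```

With only 10 responses the interval spans roughly 60% to 98%, so the "90%" headline is far less certain than it sounds. The interval also says nothing about non-response bias, which a 0.5% response rate makes a real concern.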
2. Velocity
Velocity is the time it takes for data to be generated, collected and processed. It should be evaluated from the point of generation to the point of decision-making, rather than focusing solely on technical aspects like data loading times.
Example: Consider financial institutions making stock trading decisions using near real-time data, or emergency services responding to incidents based on 999 call information.
3. Variety
Variety is about the diversity of data formats, i.e. structured, semi-structured and unstructured data.
Structured data: Contains a fixed schema, typically in tabular format like database tables, spreadsheets, and delimited files.
Unstructured data: Lacks a predefined schema, common in text comments, audio files, and images.
Semi-structured data: Combines elements of both, or employs a flexible structure like JSON or XML.
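The three categories are easier to tell apart side by side. The sketch below uses made-up customer feedback data (the field names and values are assumptions for illustration only) and reads it with Python's standard library.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape.
structured = io.StringIO("customer_id,rating\nC001,4\nC002,5\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys, but fields may vary per record.
semi_structured = json.loads(
    '[{"customer_id": "C001", "rating": 4},'
    ' {"customer_id": "C002", "rating": 5, "tags": ["delivery", "late"]}]'
)

# Unstructured: free text with no schema; meaning has to be extracted.
unstructured = "Customer C002 said the parcel was late but the product was great."

print(rows[0]["rating"])               # a string: CSV carries no type information
print(semi_structured[1].get("tags"))  # optional field, present on some records only
print(len(unstructured.split()))       # without parsing, only generic operations apply
```

Each format demands a different kind of quality check: column rules for the table, presence and shape checks for the JSON, and text extraction before anything can be validated in the free-text comment.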
4. Veracity
Veracity represents the truthfulness of data - its accuracy, reliability, and quality. This is by far the most important characteristic for effective data-driven decision-making.
Decisions can only be as good as the evidence presented.
Example: Consider a hospital's patient records. If a patient's blood type is incorrectly recorded as A+ instead of O-, any medical decisions based on this data could be life-threatening. Similarly, if temperature sensors in a manufacturing plant are poorly calibrated and showing incorrect readings, production quality decisions based on this faulty data could lead to defective products.
5. Value
Value pertains to the usefulness of the outcomes that the data evidence produces. A good data strategy supports the business strategy and enables correct decisions to be made from inferences drawn from that evidence.
Example: A retail store collects and analyses customer purchase history, browsing patterns, and inventory data. By using this data effectively, they can optimise their stock levels, create personalised promotions, and predict seasonal demands - directly improving sales and reducing costs. However, if they collect this data but never analyse or act upon it, the data holds no real value despite its volume, velocity, variety, and veracity.
Data Quality Governance
The concept of Quality in data management isn't as straightforward as categorising something as perfect, adequate, or incorrect. It requires a nuanced approach through proper governance.
Given this subjectivity, an organisation needs a governance framework that discovers, understands and mitigates data quality (DQ) issues, and prioritises them from a known state, so that data-driven decisions can be taken effectively within those constraints.
Data quality management is not only about effective data warehousing; it must also be an organisation-wide strategy.
The “Six Dimensions Model” is one of the industry standard tools for DQ governance. It's widely used today as the go-to reference guide, including by government organisations.
Convinced by its importance but need help taking the first step? Let us help you.
What makes my data ‘good’? 6 Data Quality Dimensions you need to know
Want to assess your data quality? Focus on these six key dimensions:
Consistency
Accuracy
Validity
Completeness
Timeliness
Uniqueness
Consistency
Data must follow the "Single Source of Truth" principle - meaning a data point should have the same value wherever it appears. For example, in a data warehouse using medallion architecture, the gold layer maintains this consistency, while the silver layer enforces format standards (like keeping all employee codes in uppercase).
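As a rough illustration, the sketch below applies a silver-layer style formatting rule (uppercase employee codes) and then checks whether the same employee carries the same department in two source systems. The two extracts and the helper `standardise_code` are hypothetical; a real warehouse would run this kind of rule in its transformation layer.

```python
def standardise_code(code: str) -> str:
    """Silver-layer style formatting rule: trim whitespace, force uppercase."""
    return code.strip().upper()


# Hypothetical extracts from two source systems: employee code -> department.
hr_system = {"emp001": "Finance", "Emp002": "Sales", "EMP003": "IT"}
payroll_system = {"EMP001": "Finance", "emp002 ": "Marketing", "EMP003": "IT"}

hr = {standardise_code(k): v for k, v in hr_system.items()}
payroll = {standardise_code(k): v for k, v in payroll_system.items()}

# Consistency check: the same employee should have the same department everywhere.
conflicts = {
    code: (hr[code], payroll[code])
    for code in sorted(hr.keys() & payroll.keys())
    if hr[code] != payroll[code]
}

print(conflicts)
```

Without the formatting step, "Emp002" and "emp002 " would never be matched, and the conflicting departments would go unnoticed.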
Accuracy
Data should be factually correct and logical, e.g. the date of birth of an active student in a primary school cannot be more than 18 years ago. Accuracy is difficult to pinpoint because it needs to be evaluated in the context of a larger picture. In this example, the issue might actually be with the "active" status rather than with the date of birth.
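A plausibility rule like this one is simple to express in code. The following is a minimal sketch over hypothetical student records; the 18-year threshold comes from the example above, while the field names and the fixed reference date are assumptions.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Whole years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical primary school records.
students = [
    {"id": "S1", "date_of_birth": date(2016, 5, 14), "status": "active"},
    {"id": "S2", "date_of_birth": date(1998, 9, 2), "status": "active"},
    {"id": "S3", "date_of_birth": date(1999, 1, 20), "status": "left"},
]

MAX_AGE = 18  # threshold taken from the example above
today = date(2025, 1, 1)  # fixed so the sketch is reproducible

suspect = [
    s["id"]
    for s in students
    if s["status"] == "active" and age_on(s["date_of_birth"], today) > MAX_AGE
]

print(suspect)
```

Note that the check only flags the record for review. It cannot tell whether the date of birth or the "active" status is the field that is wrong; that judgement needs the wider context.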
Validity
Validity is about ascertaining whether a data point is in the correct format and abides by the rules of its data constraints. For example, a record is invalid if the age of a person is negative, or if the value of a field falls outside what the organisation's rules allow.
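Validity rules tend to be mechanical, which makes them good candidates for automation. Here is a small sketch with three assumed rules: a non-negative age, a status drawn from an allowed list, and an employee code pattern. The allowed values and the pattern are invented for illustration; real rules would come from your organisation's data dictionary.

```python
import re

# Hypothetical organisational rules.
ALLOWED_STATUSES = {"active", "inactive", "suspended"}
EMPLOYEE_CODE = re.compile(r"EMP\d{3}")


def validity_errors(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record is valid."""
    errors = []
    age = record.get("age")
    if not isinstance(age, int) or age < 0:
        errors.append("age must be a non-negative integer")
    if record.get("status") not in ALLOWED_STATUSES:
        errors.append("status is not an allowed value")
    if not EMPLOYEE_CODE.fullmatch(str(record.get("employee_code", ""))):
        errors.append("employee_code does not match EMP followed by three digits")
    return errors


print(validity_errors({"age": 34, "status": "active", "employee_code": "EMP001"}))
print(validity_errors({"age": -5, "status": "retired", "employee_code": "E1"}))
```

Returning every violation, rather than stopping at the first, gives data owners the full picture of what needs fixing in a record.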
Completeness
Data is complete when it has all the attributes necessary for its specific use case. It can still be considered complete when optional fields are missing, or when fields that are mandatory elsewhere aren't needed for a particular audience.
For instance, a delivery feedback form without a postcode might be complete for customer satisfaction analysis, even though the postcode is mandatory for delivery purposes.
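Because completeness depends on the use case, one way to model it is to keep a separate list of required fields per use case. The sketch below does this for the feedback form example; the use case names and field names are assumptions.

```python
# Hypothetical required fields per use case.
REQUIRED_FIELDS = {
    "satisfaction_analysis": {"order_id", "rating"},
    "delivery": {"order_id", "postcode"},
}


def missing_fields(record: dict, use_case: str) -> set[str]:
    """Fields required by the use case that are absent or empty in the record."""
    return {
        field
        for field in REQUIRED_FIELDS[use_case]
        if record.get(field) in (None, "")
    }


feedback = {"order_id": "A1001", "rating": 4, "postcode": ""}

print(missing_fields(feedback, "satisfaction_analysis"))
print(missing_fields(feedback, "delivery"))
```

The same record comes back with nothing missing for satisfaction analysis and with `postcode` missing for delivery, which is exactly the distinction described above.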
Timeliness
Timeliness is about ensuring information is current enough to be useful, but "current" means different things in different contexts. What counts as timely depends entirely on the needs of the data consumer and their specific use case. For example, a morning newspaper's articles are considered timely for readers even though they're printed hours before consumption, while television news must deliver real-time updates to the same audience.
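One way to capture this is a freshness tolerance per consumer. In the sketch below the tolerances (12 hours for the newspaper, 5 minutes for live TV) are invented figures for illustration, not industry standards.

```python
from datetime import datetime, timedelta

# Hypothetical freshness tolerances per consumer.
MAX_AGE = {
    "morning_newspaper": timedelta(hours=12),
    "live_tv_news": timedelta(minutes=5),
}


def is_timely(produced_at: datetime, consumed_at: datetime, consumer: str) -> bool:
    """True if the data is no older than the consumer's tolerance."""
    return consumed_at - produced_at <= MAX_AGE[consumer]


produced = datetime(2025, 1, 1, 1, 0)  # story filed at 01:00
consumed = datetime(2025, 1, 1, 8, 0)  # read at 08:00, seven hours later

print(is_timely(produced, consumed, "morning_newspaper"))
print(is_timely(produced, consumed, "live_tv_news"))
```

The same seven-hour-old story passes for the newspaper reader and fails for the live broadcast, so the tolerance belongs with the consumer rather than with the data.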
Uniqueness
Data records should not have unnecessary duplicates within a dataset, as duplications can lead to incorrect analysis and representation of occurrence counts.
Data uniqueness is a measure of how well the organisation understands its enterprise data estate/model. For example, a student's record should appear only once in a school's main admission database. Yet, if that same student joins the school's evening swimming club, what appears to be a duplicate entry might actually be valid. To manage this correctly, schools should add distinguishing attributes that clearly show why multiple records exist for the same student.
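The swimming club scenario can be sketched as a duplicate check whose result depends on which attributes make up the record's key. The records and the `enrolment_type` attribute below are hypothetical.

```python
from collections import Counter

# Hypothetical enrolment records; "enrolment_type" is the distinguishing attribute.
records = [
    {"student_id": "S1", "enrolment_type": "admission"},
    {"student_id": "S1", "enrolment_type": "swimming_club"},
    {"student_id": "S2", "enrolment_type": "admission"},
    {"student_id": "S2", "enrolment_type": "admission"},
]


def duplicates(rows: list[dict], key_fields: tuple[str, ...]) -> list[tuple]:
    """Key values that occur more than once for the given key fields."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


print(duplicates(records, ("student_id",)))
print(duplicates(records, ("student_id", "enrolment_type")))
```

Keyed on `student_id` alone, both students look duplicated. Once the distinguishing attribute is part of the key, only the genuinely repeated admission record for S2 remains.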
Final Takeaway
Remember: Data quality isn't about perfection; it's about fitness for purpose. Your organisation needs a clear governance framework that helps you identify, understand, and address data quality issues. And while tools and warehouses are important, true data quality is an organisation-wide commitment.
The most crucial thing? Start at the source. Catching and correcting quality issues at the point of data entry is always more efficient than cleaning up messy data later.
Need more tailored guidance? Contact us using the form below.