High-Quality Data
Last updated
Last updated
Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data Cleaning and Transformations.
A data governance-driven framework for data quality lifecycle
Data governance is essentially the rules and practices that govern the management of data in an organization. It's like having traffic laws for data; they help ensure that the data flows smoothly, remains well-organized, and is used properly.
Data quality is crucial because it ensures that the information used for making decisions is accurate, up-to-date, and reliable. In this lifecycle, we see that data is not just collected and used; it's carefully managed every step of the way, from ensuring it meets certain standards to continuously monitoring its quality. Organizations strive to maintain high data quality to make informed decisions, much like how a chef ensures every ingredient and every step in cooking is perfected to serve the best meal.
Certainly, let's delve a bit deeper into each step of the data quality lifecycle, expanding on the kitchen analogy to make the concepts more relatable and clearer.
This is the rulebook for managing data. It includes a data dictionary (definitions of data terms), metadata (data about data), data models (blueprints of how data is organized), and business rules (guidelines on handling data).
Imagine you're opening a restaurant. Before you start, you need a plan for how your kitchen will operate. In data governance, this plan includes:
Data Dictionary: This is like having a menu for your staff that explains each dish in detail, what ingredients are needed, and how they should be combined.
Reference & Meta Data: This can be likened to the standard measurements and kitchen scales you use to ensure consistency in your recipes.
Data Models & Business Rules: These are the kitchen protocols that tell your staff how to handle various cooking scenarios. For instance, if a key ingredient is missing, what should be the substitute?
Roles & Responsibilities: Just as in a restaurant where you have chefs, sous chefs, and dishwashers, in data governance, different people are assigned specific roles like data analysts or data stewards to manage the data efficiently.
Data Quality Analytics
Think of this as the quality control process, involving:
Descriptive Analytics: Describing the current state of data quality.
Diagnostic Analytics: Figuring out the reasons behind any data issues.
Predictive Analytics: Forecasting future data quality under different scenarios.
Prescriptive Analytics: Providing solutions for improving data quality.
This is where you measure and ensure that every dish you serve is up to your restaurant's standards:
Descriptive Analytics: The head chef inspects ingredients for freshness and quality before they are used, ensuring they meet the restaurant's high standards.
Diagnostic Analytics: If a dish is sent back by a customer, the chef needs to figure out what went wrong. Was the dish undercooked? Were the ingredients not fresh?
Predictive Analytics: Based on past experiences, the chef can predict which dishes might not be well-received if certain ingredients are missing or substituted.
Prescriptive Analytics: It’s the chef’s experience that then prescribes what to do next time to prevent the same mistake or to make a dish better.
This involves collecting data and giving it meaning. It includes filtering, cleaning, enriching, and generating metadata. It's like prepping ingredients before cooking a meal; the quality of your ingredients will affect the final dish.
Think of this as sourcing the best ingredients. You need to:
Filter & Sample: Just as you would select the freshest fish or ripest fruit, filtering, and sampling in data acquisition means you select only the most relevant and high-quality data.
Clean & Enrich: This is like washing and marinating your ingredients to enhance their flavor, where data is cleaned of errors and enriched with additional information to make it more useful.
Semantics & Meta Data Generation: You add labels to ingredients to indicate their use-by date and best combination with other ingredients. Similarly, metadata is generated to add meaning to the data, explaining its use and relevance.
TIA stands for Transformation, Integration, and Aggregation. This is about converting data to a usable format, combining different data sources, and compiling data for use. It’s akin to mixing ingredients to create a specific flavor or dish.
Now that your ingredients are prepped, you need to cook them to create your dishes:
Transformation: This could be cutting vegetables or marinating meat — you're transforming raw data into a format suitable for analysis.
Integration: Just like blending ingredients to make a sauce, integrating data means combining different datasets to create a more comprehensive view.
Aggregation & Semantics Generation: You take individual dishes and put them together to create a complete meal for a table. Aggregating data involves compiling and summarizing it for final analysis.
This is the process of storing data, retrieving it when needed, and deleting it when it's no longer useful. Imagine a well-organized kitchen where ingredients are stored, used, and thrown away after their expiry date.
This is your storage and waste management plan. You need to keep ingredients and dishes at the right temperature and dispose of anything that’s expired.
This step assesses the data quality using dashboards and metrics. It's like a chef tasting the dish to ensure it’s up to standard before serving.
Before the meal reaches a customer's table, a final check is done for taste and presentation. In data terms, you use dashboards to visually inspect data quality and apply measures and metrics to ensure it meets your quality thresholds.
Ongoing monitoring to ensure continuous data quality through dashboards, setting thresholds for quality, and issue management. This would be similar to a quality assurance team in a restaurant continuously checking the quality of the food.
Your restaurant may have a feedback form that diners fill out. This feedback helps you continuously monitor the quality of your food and service. Similarly, in data quality monitoring, you constantly check the data for errors, track how often issues occur, and have plans in place to address any problems.
Lastly, with the rise of big data, just as restaurants now have to deal with online orders and food delivery apps in addition to dine-in customers, data governance and quality need to adapt to new types of data and storage systems like NoSQL databases, which are designed for big data's variety, velocity, and volume.
In essence, just like running a successful restaurant involves much more than just cooking food, managing data quality in the era of big data requires a detailed, process-oriented approach that ensures the data used for making decisions is of the highest quality. Through an organized lifecycle of data governance, organizations can maintain the integrity and utility of their data, resulting in better business outcomes and decisions.