What is the difference between data cleansing and data quality?

Where data doesn't meet business requirements, it is 'cleansed'. Our choice of words is deliberate: we don't ask you to send us your data tables to cleanse; we clean enterprise data. Data cleaning refers to the process of identifying and deleting redundant, obsolete and trivial data objects within an enterprise data landscape.

This process is carried out at a much wider scale, at the data-object level. Instead of looking within a single data table, we compare all existing data tables (and other data objects) with one another to identify which of them are redundant, obsolete or trivial.
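The object-level comparison described above can be sketched in a minimal way. This is an illustrative assumption about how such a comparison might work, not the actual method used: each table's content is fingerprinted so that exact copies (even with rows in a different order) can be matched, and empty tables can be flagged as trivial.

```python
import hashlib

def table_fingerprint(rows):
    """Hash a table's content in a row-order-independent way."""
    canonical = "\n".join(sorted(",".join(map(str, r)) for r in rows))
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_redundant(tables):
    """Return (copy, original) pairs of tables with identical content."""
    seen = {}
    redundant = []
    for name, rows in tables.items():
        fp = table_fingerprint(rows)
        if fp in seen:
            redundant.append((name, seen[fp]))
        else:
            seen[fp] = name
    return redundant

# Hypothetical tables for illustration only.
tables = {
    "customers": [("1", "Ada"), ("2", "Bo")],
    "customers_backup_2019": [("2", "Bo"), ("1", "Ada")],  # redundant copy
    "temp_export": [],  # empty: a candidate for 'trivial'
}
print(find_redundant(tables))  # → [('customers_backup_2019', 'customers')]
```

Real comparisons would also consider schemas, lineage and access patterns; content hashing only catches exact duplicates.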

Some of the aspects we look at are described below. It is difficult to analyze data spread across different data sources, and data warehousing provides a solution to this issue. It helps to collect, store and manage data from a variety of sources in a central location called a data warehouse. The data warehouse receives data from transactional systems and various relational databases.

Finally, this data is processed and analyzed to extract meaningful business insights. The data should be cleaned and transformed before it is loaded into the warehouse, because data extracted from multiple sources can contain meaningless values: dummy values, contradictory data and missing data are all considered meaningless and must be removed from the dataset. Irrelevant observations are records that do not fit the specific problem you are trying to analyze.
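The filtering of meaningless data described above can be sketched as a simple record filter. The field names, dummy values and date logic below are illustrative assumptions:

```python
# Placeholder values often used where real data is missing.
DUMMY_VALUES = {"N/A", "NULL", "", "9999", "test"}

def is_meaningful(record):
    """Reject records with dummy, missing or contradictory values."""
    name = record.get("name")
    if not name or name in DUMMY_VALUES:
        return False
    # Contradictory data: an order shipped before it was placed
    # (ISO date strings compare correctly as plain strings).
    placed, shipped = record.get("placed"), record.get("shipped")
    if placed and shipped and shipped < placed:
        return False
    return True

records = [
    {"name": "Ada", "placed": "2021-01-02", "shipped": "2021-01-05"},
    {"name": "NULL", "placed": "2021-01-03", "shipped": "2021-01-06"},  # dummy value
    {"name": "Bo", "placed": "2021-01-10", "shipped": "2021-01-01"},    # contradictory
]
clean = [r for r in records if is_meaningful(r)]
print([r["name"] for r in clean])  # → ['Ada']
```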

For example, if you want to analyze data regarding millennial customers but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient and minimize distraction from your primary target, as well as produce a more manageable and more performant dataset. Structural errors arise when you measure or transfer data and notice strange naming conventions, typos or inconsistent capitalization. These inconsistencies can cause mislabeled categories or classes.
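Fixing structural errors like the ones above usually means mapping the many spellings of a category onto one canonical label. A minimal sketch, with invented category values:

```python
def normalize_category(value):
    """Collapse inconsistent capitalization, spacing and known
    variant spellings into a single canonical label."""
    canonical = {"n/a": "not_applicable", "na": "not_applicable"}
    cleaned = value.strip().lower().replace("-", "_")
    return canonical.get(cleaned, cleaned)

# 'Not-Applicable', ' N/A' and 'na' are all the same category,
# just recorded inconsistently.
raw = ["Not-Applicable", " N/A", "na", "active", "ACTIVE"]
print([normalize_category(v) for v in raw])
# → ['not_applicable', 'not_applicable', 'not_applicable', 'active', 'active']
```

In practice the `canonical` mapping is built by profiling the distinct values in a column first, so that no legitimate category is accidentally merged.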

Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the quality of the data you are working with.

However, sometimes the appearance of an outlier will prove a theory you are working on, so this step is needed to determine whether the value is valid. If an outlier proves to be irrelevant to the analysis or is a mistake, consider removing it. There are a couple of ways to deal with missing data, such as dropping the affected observations or imputing the missing values; neither is optimal, but both can be considered. At the end of the data cleaning process, you should be able to answer some basic validation questions about your data.
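The outlier and missing-data steps above can be sketched with the standard library. The 1.5 × IQR rule used here is one common convention, not the only way to flag outliers, and the sample values are invented:

```python
import statistics

def iqr_outliers(values):
    """Flag values outside 1.5 * IQR of the quartiles
    (a common rule of thumb, not a universal definition)."""
    q = statistics.quantiles(values, n=4)
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

amounts = [10, 12, 11, 13, 12, 11, 300]  # 300 looks like a data-entry error
print(iqr_outliers(amounts))  # → [300]

# Missing data: neither option is ideal. Dropping loses rows;
# imputing (here, with the median) bakes in an assumption.
observed = [10, 12, None, 13, None, 11]
dropped = [v for v in observed if v is not None]
median = statistics.median(dropped)
imputed = [median if v is None else v for v in observed]
print(imputed)  # → [10, 12, 11.5, 13, 11.5, 11]
```

Which approach is appropriate depends on why the data is missing and how much of it there is; median imputation, for instance, shrinks the spread of the column.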

Before you get there, it is important to create a culture of quality data in your organization. To do this, document the tools you might use to create this culture and what data quality means to you. Try Tableau for free to create beautiful visualizations with your data. Determining the quality of data requires examining its characteristics, then weighing those characteristics according to what is most important to your organization and the applications for which the data will be used.
