Data Quality

This site aims to provide end-to-end training and solutions for Oracle EDQ (OEDQ). As the name suggests, OEDQ is Oracle's enterprise data quality tool. Before delving into OEDQ, it helps to understand the basics of data quality and the terminology commonly used in the field. If you are already familiar with data quality, you can jump directly to the Oracle Enterprise Data Quality tab; otherwise, read on.

The first question that comes to mind is what exactly data quality is and why organizations now place so much emphasis on it. Data quality means analyzing data to check whether it meets the business requirements and modifying it where it does not. According to Wikipedia, “Data quality includes the processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.” It involves analyzing data, which includes inspecting, cleaning, transforming and modeling it with the goal of highlighting useful information, suggesting conclusions and supporting decision making.

The examples below make this clearer.

1. A customer name contains special characters.

Ex: Cust_name = Rajes%h

2. A salary column loaded from a file may contain currency symbols such as $ or Rs.

Ex: $134, Rs1234

3. A male customer has Ms as his title.

Ex: Ms Rajkumar Kumar

4. A married woman has Ms as her title where the business rule expects Mrs.

Ex: Ms Rajni Rathore (should be Mrs Rajni Rathore)

5. A company does business at many levels, and the same customer's name looks different in different systems.

Ex:

Name          Email                  Phone   Address
Dipen Mehta   dipen.mehta@gmail.com  123456  New York City
Deepen Mehta  dipen.mehta@gmail.com  123456  New York City

Both records refer to the same customer, but they may be treated as two different people.

6. People use different names for the same thing.

Ex: The column Country_Name can have values such as

USA
U.S.
U.S.A.
America

However, all of these refer to the United States of America.


7. The same company is recorded under different names.

Ex: The column Company_Name can have values such as

International Business Machine
IBM
JP Morgan
JP Morgan LTD

Here the first two values refer to one company and the last two to another.

Terminology

Data Profiling: Analyzing data to discover its structure, completeness, patterns, clarity and the relationships within it.

Ex: Counting how many columns contain NULL values, finding the distinct values of a column (for example, a Gender column containing Male, Female, Unknown and blank), or finding the pattern of phone numbers (1111-11-1111).
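The profiling checks described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how OEDQ implements profiling; the function name profile_column and the sample values are assumptions made for this example.

```python
import re
from collections import Counter

def profile_column(values):
    """Return simple profiling statistics for one column of string values.

    Hypothetical helper for illustration; not part of OEDQ.
    """
    # Treat None and whitespace-only strings as missing.
    missing = sum(1 for v in values if v is None or not v.strip())
    present = [v.strip() for v in values if v is not None and v.strip()]
    # Replace every digit with 9 and every ASCII letter with A to expose the format.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in present
    )
    return {
        "missing": missing,
        "distinct": sorted(set(present)),
        "patterns": dict(patterns),
    }

# Sample data invented for this sketch.
gender = ["Male", "Female", "Unknown", " ", None, "Male"]
phone = ["1111-11-1111", "2222-22-2222", "12345"]

gender_profile = profile_column(gender)
phone_profile = profile_column(phone)
print(gender_profile["missing"], gender_profile["distinct"])
print(phone_profile["patterns"])
```

A pattern that occurs only once, such as the five-digit phone value here, is usually the first place to look for bad data.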

Standardization: Converting, enhancing or modifying data to meet the business requirements.

Ex: The salary should be in the range of 10000-500000 per month.

Gender should have only Male and Female as its values (remove everything else).
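A minimal Python sketch of such rules follows, reusing the salary range and gender values above and the country names from the earlier example. The lookup tables and function names are assumptions for illustration; a real project would take them from the business requirements.

```python
# Hypothetical lookup tables; real ones come from the business requirements.
COUNTRY_SYNONYMS = {
    "usa": "United States of America",
    "us": "United States of America",
    "america": "United States of America",
}
ALLOWED_GENDERS = {"male": "Male", "female": "Female"}
SALARY_MIN, SALARY_MAX = 10000, 500000

def standardize_country(value):
    """Map known spellings to one standard name; leave unknown values unchanged."""
    key = value.replace(".", "").strip().lower()
    return COUNTRY_SYNONYMS.get(key, value.strip())

def standardize_gender(value):
    """Return Male or Female, or None for anything else (the value is removed)."""
    if value is None:
        return None
    return ALLOWED_GENDERS.get(value.strip().lower())

def salary_in_range(amount):
    """Check the monthly salary against the allowed range."""
    return SALARY_MIN <= amount <= SALARY_MAX

for raw in ["USA", "U.S.", "U.S.A.", "America"]:
    print(raw, "->", standardize_country(raw))
print(standardize_gender("Unknown"), standardize_gender(" female "))
print(salary_in_range(9999), salary_in_range(10000))
```

Unknown country values are passed through unchanged here so that they can be reviewed later rather than silently overwritten.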

Data Matching: Identifying equivalent and duplicate records. Duplicates come in two forms.

  • Physical Duplicates: The rows are exactly identical. These are easy to identify.

Name          Email                  Phone   Address
Dipen Mehta   dipen.mehta@gmail.com  123456  New York City
Dipen Mehta   dipen.mehta@gmail.com  123456  New York City

  • Logical Duplicates: The rows do not have the same values but represent the same entity. These are much harder to identify.

Name          Email                  Phone   Address
Dipen Mehta   dipen.mehta@gmail.com  123456  New York City
Deepen Mehta  dipen.mehta@gmail.com  123456  New York City
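The difference between the two kinds of duplicate can be sketched in Python as follows. The matching rule used here (same email and phone plus a similar name, compared with the standard library's difflib) and the 0.8 threshold are assumptions chosen for illustration, not OEDQ's matching logic. The record row_d is invented for the example.

```python
from difflib import SequenceMatcher

def is_physical_duplicate(a, b):
    """Two rows are physical duplicates when every field is identical."""
    return a == b

def is_logical_duplicate(a, b, name_threshold=0.8):
    """Assumed rule: same email and phone, and names that are similar enough."""
    same_contact = (
        a["email"].lower() == b["email"].lower() and a["phone"] == b["phone"]
    )
    name_similarity = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    return same_contact and name_similarity >= name_threshold

row_a = {"name": "Dipen Mehta", "email": "dipen.mehta@gmail.com",
         "phone": "123456", "address": "New York City"}
row_b = dict(row_a)                        # exact copy of row_a
row_c = dict(row_a, name="Deepen Mehta")   # same person, name spelled differently
row_d = {"name": "Rajni Rathore", "email": "rajni@example.com",  # invented record
         "phone": "654321", "address": "New York City"}

print(is_physical_duplicate(row_a, row_b))
print(is_physical_duplicate(row_a, row_c))
print(is_logical_duplicate(row_a, row_c))
print(is_logical_duplicate(row_a, row_d))
```

In practice the threshold and the fields that must agree need tuning against real data, since a rule that is too loose merges different customers and one that is too strict misses true duplicates.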

Data Correction: Identifying and correcting issues such as invalid values, stray spaces, special characters and invalid formats.
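As a rough Python sketch, the name and salary examples from the start of this post could be corrected like this. The function names and the set of characters allowed in a name are assumptions for illustration; names with non-ASCII letters would need a wider rule.

```python
import re

def clean_name(name):
    """Drop characters that are not expected in a name, then tidy the spacing.

    Assumption: only ASCII letters, spaces, periods, apostrophes and hyphens are valid.
    """
    cleaned = re.sub(r"[^A-Za-z .'-]", "", name)
    return re.sub(r"\s+", " ", cleaned).strip()

def parse_salary(raw):
    """Extract the numeric amount from text such as '$134' or 'Rs1234'.

    The currency symbol is discarded, so keep it separately if it matters.
    Returns None when no number is found.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None

print(clean_name("Rajes%h"))
print(parse_salary("$134"), parse_salary("Rs1234"), parse_salary("N/A"))
```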

 

Missing Data: Identifying missing values for important attributes.
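A small Python sketch of this check is shown below. Which attributes count as important is a business decision; the REQUIRED tuple and the sample records are assumptions made for this example.

```python
# Hypothetical list of attributes the business treats as mandatory.
REQUIRED = ("name", "email", "phone")

def missing_attributes(record, required=REQUIRED):
    """Return the required fields that are absent, None, or blank in a record."""
    return [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]

complete = {"name": "Dipen Mehta", "email": "dipen.mehta@gmail.com", "phone": "123456"}
incomplete = {"name": "Dipen Mehta", "email": "  ", "phone": None}

print(missing_attributes(complete))
print(missing_attributes(incomplete))
```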
