Understanding Data Quality With Disney

If you were born and raised in Florida like me, you got used to seeing Disney everywhere! Even if you weren’t born in Florida, you’ve probably taken a vacation to the theme parks, or perhaps recently subscribed to Disney+. That’s the inspiration for this article – data literacy using shared experiences that most of us can relate to as humans!

The last thing you need in life is low-quality data that gives you inaccurate insight or takes you in the wrong direction. As companies of all sizes and across industries collect increasingly large amounts of data, the focus on data quality and data management has become more important than ever. That’s where understanding common data quality themes can help you ensure that your data is used appropriately and that decisions and strategies are based on accurate information.

The challenge is not data availability, but how to manage data as an input to strategy and decision-making. The following sections summarize common data quality themes, with the aim of making these topics approachable for the average business user.

1. Data Normalization

Do all of your data values dot their i’s and cross their t’s? Data normalization is a way of making sure everything fits together in the picture, and it is critical to data integrity.

Not all data is equal. Normalization is the process of organizing data in tables so that redundancy is eliminated and integrity increases. If you’re making decisions on denormalized data, you could be making the wrong decisions.

Are “FastPass”, “FP”, and “F.P+” the same thing? Not in terms of data. If you counted the number of “FastPass” entries given out without normalizing the data, you would miss the abbreviated alternates, undercount, and possibly find yourself with longer wait times.

Which one is right? Is the data in the correct format? Is there an industry standard for how the data should be normalized? If you discover a normalization issue, have a conversation with your data team, and work together to fix it.
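As a rough illustration of the pass-name example above, here is a minimal Python sketch that maps several spellings to one agreed label before counting. The mapping table, the function name `normalize_label`, and the sample values are all assumptions made up for this example; your data team would decide which spelling is the standard.

```python
from collections import Counter

# Hypothetical lookup: every known spelling (lowercased) -> one agreed label.
CANONICAL = {
    "fastpass": "FastPass",
    "fp": "FastPass",
    "f.p+": "FastPass",
}

def normalize_label(raw: str) -> str:
    """Return the agreed label for a raw value, or the trimmed value if unknown."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

# Made-up sample of how the same pass might be typed in different systems.
raw_values = ["FastPass", "FP", "F.P+", "fp ", "Standby"]

counts_before = Counter(raw_values)
counts_after = Counter(normalize_label(value) for value in raw_values)

print(counts_before["FastPass"])  # counts only the exact spelling
print(counts_after["FastPass"])   # counts every variant that maps to the label
```

Counting before normalizing only sees the one exact spelling, while counting afterwards picks up every variant in the lookup. Values that are not in the lookup pass through unchanged, so nothing is silently dropped.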

2. Misclassified Data

Is Mickey a mouse or a rat? Does it matter if they are both classified as rodents? Well, the answer is: it could matter. Misclassified data is data that has been assigned the wrong category or label, so it misrepresents what it describes.

In business, we make many data-informed decisions, but what if we misrepresented the data? Using the picture, if we returned brand values for Darth Vader, we would see Apple. This is very easy to spot here, but in a dataset of millions of rows, would we have caught it so easily?
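One way to catch this kind of error at scale is to compare each record against a trusted reference list instead of relying on someone spotting it by eye. The sketch below is only an illustration: the reference table, the sample records, and the function name `find_misclassified` are assumptions invented for this example.

```python
# Hypothetical trusted reference: character -> brand it belongs to.
REFERENCE_BRAND = {
    "Mickey Mouse": "Disney",
    "Darth Vader": "Star Wars",
    "Buzz Lightyear": "Pixar",
}

# Made-up records as they might appear in a report.
records = [
    {"character": "Mickey Mouse", "brand": "Disney"},
    {"character": "Darth Vader", "brand": "Apple"},  # misclassified on purpose
    {"character": "Buzz Lightyear", "brand": "Pixar"},
]

def find_misclassified(rows, reference):
    """Return rows whose brand disagrees with the reference table."""
    return [
        row
        for row in rows
        if row["character"] in reference
        and row["brand"] != reference[row["character"]]
    ]

misclassified = find_misclassified(records, REFERENCE_BRAND)
for row in misclassified:
    print(row["character"], "is labeled", row["brand"])
```

The check is only as good as the reference list, so characters missing from the reference are skipped rather than flagged.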

3. Data Relationships

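We love data, data loves data. Related data tables need a way to find each other, and the way we do this is to share a common value between them, similar to doing a VLOOKUP in Excel. This shared value is called a key: it is the primary key in the table where it uniquely identifies each row, and a foreign key in the table that refers back to it.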

Data relationships give you the ability to explore related data and cross-functional data from other sources. In the picture, if [table 2] included all movies for these brands, we could easily pull all Disney movies that starred Mickey, for example, by selecting Mickey from [table 1] and returning all of the movies from [table 2].
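To make the lookup idea concrete, here is a small Python sketch of two related tables that share a key. The column name `character_id`, the sample rows, and the helper `movies_for` are assumptions for illustration only; they are not the tables from the picture.

```python
# Table 1 (hypothetical): one row per character, identified by character_id.
characters = [
    {"character_id": 1, "name": "Mickey"},
    {"character_id": 2, "name": "Darth Vader"},
]

# Table 2 (hypothetical): movies, each pointing back to a character_id.
movies = [
    {"title": "Fantasia", "character_id": 1},
    {"title": "Steamboat Willie", "character_id": 1},
    {"title": "The Empire Strikes Back", "character_id": 2},
]

def movies_for(name):
    """Find the character's key in table 1, then return matching titles from table 2."""
    ids = {row["character_id"] for row in characters if row["name"] == name}
    return [row["title"] for row in movies if row["character_id"] in ids]

print(movies_for("Mickey"))
```

The shared `character_id` plays the same role as the lookup column in a VLOOKUP: it is the only thing the two tables need to have in common.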

4. Missing Data Relationships

HELP, my data left me! Missing data relationships cause big problems when you try to create meaning between two data sets. As with VLOOKUPs, you would get back some ugly #N/A errors.

What if on paper you didn’t exist? Your family tree would have some major gaps. Missing data relationships work the same way. In the picture, we would not be able to use any additional data in [table 2] related to Pixar, simply because no matching record exists there. What would happen if these were your customer sales records?
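A quick way to find these gaps before they show up as errors is to list the keys in one table that have no match in the other. The following Python sketch assumes two hypothetical tables joined on `brand_id`; the rows, the revenue labels, and the function name `unmatched_rows` are made up for the example.

```python
# Table 1 (hypothetical): every brand we expect to report on.
brands = [
    {"brand_id": 1, "brand": "Disney"},
    {"brand_id": 2, "brand": "Pixar"},
    {"brand_id": 3, "brand": "Star Wars"},
]

# Table 2 (hypothetical): extra detail per brand. Note that brand_id 2 is absent.
brand_details = [
    {"brand_id": 1, "segment": "Animation"},
    {"brand_id": 3, "segment": "Sci-Fi"},
]

def unmatched_rows(left, right, key):
    """Return rows from `left` whose key value never appears in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

orphans = unmatched_rows(brands, brand_details, "brand_id")
print([row["brand"] for row in orphans])
```

Any brand printed here would come back empty in a lookup, which is the code equivalent of an #N/A cell.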

5. Extract Transform Load (ETL)

Data can be like frames in a movie. All the pieces are there, but it has to be put together to tell a story. This is what data teams call ETL: Extract Transform Load.

The data in your dashboard has been meticulously put together, and no matter how big or small the dashboard is, one small piece in the wrong place can change the meaning entirely! There’s a lot of trust involved when you use a dashboard, and behind the scenes, there’s a lot of pressure on data teams to build it perfectly.

The best dashboard is one where the business users do not even know how complex it was to put together. Do you know how each of the 535,680 frames in Fantasia was put together, or did you just enjoy the story?
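For readers curious what the three ETL steps can look like, here is a deliberately tiny Python sketch: it extracts rows from CSV text, transforms them by trimming spaces and converting counts to numbers, and loads them into an in-memory SQLite table. The column names, the sample numbers, and the table name `visits` are assumptions invented for illustration; real pipelines are far larger.

```python
import csv
import io
import sqlite3

# Made-up raw export, with the stray spaces that real exports often contain.
RAW_CSV = """character,park_visits
Mickey, 120
Minnie,95
Goofy, 80
"""

def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names and convert visit counts from text to integers."""
    return [(row["character"].strip(), int(row["park_visits"])) for row in rows]

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE visits (character TEXT, park_visits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

total = conn.execute("SELECT SUM(park_visits) FROM visits").fetchone()[0]
print(total)
```

If the transform step were skipped, the visit counts would stay as text and the dashboard total could not be trusted, which is the "one small piece in the wrong place" problem in miniature.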

6. Incomplete Data – Null Values

Danger, we have a Null() value! Null values are placeholders that mean “no value here”, sitting where proper data should be.

Whether they result from incomplete data or from accidents in some poor ETL, null() values love to hide in data. What would Mickey be with a Null() instead of a face? We would miss out on emotional expressions that help tell the story, not to mention he would be missing his ears! One null() value just changed an iconic brand image!

Imagine the havoc null() values could be causing in your data. It’s good to do quality reviews, and as a business user, it’s good to understand how to review the underlying data.
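A simple quality review is to count how many nulls each column contains. The sketch below uses Python’s `None` to stand in for null; the sample rows and the function name `null_counts` are assumptions made up for this example.

```python
# Made-up rows; None plays the role of a null value.
rows = [
    {"character": "Mickey", "ears": 2, "expression": "smiling"},
    {"character": "Minnie", "ears": 2, "expression": None},
    {"character": "Goofy", "ears": None, "expression": None},
]

def null_counts(rows):
    """Count, per column, how many rows hold a null (None) value."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts

nulls = null_counts(rows)
for column, count in nulls.items():
    print(column, count)
```

A column with an unexpectedly high count is a good place to start a conversation with your data team.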

7. Incomplete Data – Missing Values

Was there supposed to be some data there? Missing values are not always apparent when you are looking at a dashboard.

Sure, you would notice Mickey missing his face, but would you notice a single missing window in the exterior of Cinderella’s castle? Probably not, but if you were in the room with the missing window, it would be pretty dark.

Details matter, and when it comes to data, the devil is in the details. It never hurts to check the underlying data when you are making a data-based decision. Does the data make sense? Is anything obviously missing? It will be impossible to review every value in a data set, but it never hurts to keep an open mind and review the underlying data.
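Unlike nulls, missing values leave no marker behind, so one practical check is to compare what you have against what you expected to have. As an illustration, the Python sketch below looks for days with no record in a date range. The dates and the function name `missing_dates` are assumptions for this example; the same idea works for any list of expected items, such as store numbers or customer IDs.

```python
from datetime import date, timedelta

def missing_dates(present, start, end):
    """Return every date between start and end (inclusive) that has no record."""
    total_days = (end - start).days + 1
    expected = {start + timedelta(days=offset) for offset in range(total_days)}
    return sorted(expected - set(present))

# Made-up example: daily records exist for only three of five days.
present = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 4)]
gaps = missing_dates(present, date(2021, 1, 1), date(2021, 1, 5))

for day in gaps:
    print(day.isoformat())
```

A dashboard total built on these records would look perfectly normal, which is why the gap only shows up when you check against an expected list.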

8. Duplicate Data

Attack of the clones! Duplicate data occurs when records have the same content or when one record inadvertently shares data with another record.

Duplicate data can give extra credit to a single value, similar to the above picture crediting Baby Yoda twice in the rolling credits. While a movie contains many scenes featuring a recurring character, you do not credit them for every appearance in the credits.

It’s important to understand the context behind how you are using the data. Let’s say your sales staff makes phone calls from a list. Because of data duplication, you may end up calling the same customer multiple times, and that customer may feel harassed by the phone calls.

A duplicate can be the result of a database error, a data management system error, or a problem with your data processing. If you encounter this issue, you can work with your data team to review the underlying data. Additionally, you may want to consider counting distinct values when calculating unique records.
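For the call-list scenario, a minimal Python sketch of counting distinct customers might look like the following. It assumes, purely for illustration, that two rows are the same customer when their phone numbers contain the same digits; the sample rows and the function names `phone_key` and `deduplicate` are made up for this example.

```python
# Made-up call list; the first two rows are the same customer typed differently.
call_list = [
    {"customer": "Ana Ruiz", "phone": "407-555-0101"},
    {"customer": "ana ruiz", "phone": "(407) 555-0101"},
    {"customer": "Ben Lee", "phone": "407-555-0102"},
]

def phone_key(phone):
    """Keep only the digits so differently formatted numbers compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())

def deduplicate(rows):
    """Keep the first row seen for each distinct phone number."""
    seen = set()
    unique = []
    for row in rows:
        key = phone_key(row["phone"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

unique_calls = deduplicate(call_list)
print(len(call_list), "rows,", len(unique_calls), "distinct customers")
```

Which field defines "the same customer" is a business decision, so it is worth agreeing on that rule with your data team before trusting any distinct count.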
