Tuesday, July 5, 2022
  • Login
  • Register
IT Skills You Need
  • Home
  • Monitoring
  • DevOps
  • AI
  • Big-data
  • Automation
  • Agile
  • ITIL
  • AIOps
No Result
View All Result
  • Home
  • Monitoring
  • DevOps
  • AI
  • Big-data
  • Automation
  • Agile
  • ITIL
  • AIOps
No Result
View All Result
IT Skills You Need
No Result
View All Result
Home Datascience

The How-To Guide for Cleaning and Preparing Data for Analysis

admin by admin
21 July 2021
in Datascience
0 0
0
The How-To Guide for Cleaning and Preparing Data for Analysis
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter

Tidying up your data is part science, part art, and all work. If you’re lucky, you’ll get your hands on some perfectly formatting data (Slack does a nice job, for example). But more often than not, you’ll need to do some data cleaning before it is ready for analysis. Survey data is a particularly common clean-up challenge, but even pulling data from Google Analytics will require some clean-up.To help you on this journey, we’ve created this opinionated guide to preparing your data for analysis. The goal is to get to a flat table of data — which will serve as the raw materials that need to analyze data. Think of it like the kind of data you’d use as input to a PivotTable in Excel.One of the seminal discussions of this problem is by Hadley Wickham, who wrote a paper, summarized in this post, called Tidy Data. He says”:Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

Tweet to share with a friend

A couple of important things to understand as you try to create a beautiful, analyzeable block of data.Each row is an observation about a thing. It is helpful to get a clear idea of what a row represents in your data before you start the cleaning/preparation process. Rows might represent individual people. Or transactions. Or activity for a product for a given day. If you are rows are too aggregated (e.g. “financial results by year”) you won’t be able to do a lot of analysis on the data table.Columns, then, are the descriptive information about each of the rows. We call them measures (variables that you add up, average, or are otherwise used for calculations) or dimensions (variables that can be put into buckets or describe the things in the rows). Date/time is a special kind of dimension. Often a data set will have lots of columns. It is nice to have a lot of data; it is even nicer to work with only the columns you need.A nice and tidy data table
A cleaned and prepared data table should both look tidy (i.e. a flat table with column headers) and be tidy (i.e. values in each cell are consistent when they represent the same things and don’t include multiple values in a cell).How do we get there?While we have outlined a complete data preparation process below, but you can also use the associated Google spreadsheet to practice our growing collection of tips and tricks. Here are the steps to use for getting to a clean, tidy, and analytics-ready data table:Get organizedPractice time-saving skillsPrep your columnsFix columnsFix rowsRestructure data1. Get Organized (The Easy Stuff)You want the first row of your data table to be the names for the columns. If there is other stuff in your data table above these data field labels, delete those rows.Remove grid and color formatting. As a general rule (soon to be broken in the next section), color and grid formatting is distracting. Data for analysis is your raw materials, not the final canvas.Add a sheet to document your changes. It is possible that you’ll need to clean this data again; you’ll want a record of what you did the first time. This is also a good place to keep lookup tables, references, and links to sources.Here’s an example:

2. Practice Time-Saving SkillsThere are a few spreadsheet skills that will save you a ton of time with cleaning data:Learn to move quickly around the page. Holding the Command + Arrow key will let you move around the page without smashing on the arrow keys. This key combination jumps you to the next empty cell of a row (left or right arrow) or up and down a column. This is a quick way to move to the top of your table, or jump to the last column or last rowCommand + Shift + Arrow moves in the same way as described above, but it also selects/highlights the set of cells that you moved through. We will be using a lot of formulas to clean the data. However, once you’ve updated the data, you’ll often want to copy the results without the formula. In order to paste just values, learn the key combination: Command + Shift + VIn general, you’ll want to do the following when you clean a data column:Add a new column to the right of the column you want to clean. Give it the same column label as the original — or a new label if you are constructing new values.Add a formula at the top of the new column (see examples below) to adjust or revise the values in the original column.Test the formula on 20-30 rows in the new column to make sure it is giving you the result you want.Copy the formula to the bottom of the new column. This is where you use the key combinations Command + Arrow key and Shift + Command + Arrow keyCopy the new column in its entirety and then Paste as Values to replace the formulas with the static values.Delete the original column.3. Prep Your ColumnsThe majority of data preparation and cleaning involves evaluating and fixing one column at a time. Therefore, it is useful to get organized about the different types of columns you are working with.Find the columns that have useful dimensions. Dimensions are ways that you might categorize the things represented by the rows.Color these in blue.Find measure columns that are either empty or all the same value. Consider deleting these columns because they don’t provide useful information for analysis.Change the data field names to give them a label that is around 5-15 characters — abbreviations can be confusing, long labels will be hard to show in your visualizations.Next, find the columns that have measures. These are the values you might add up, average, or do other calculations on.Color these in orange.Find measure columns that are either empty or all the same value. You can probably delete these columns.Change the data field names to something that isn’t too long (like the whole question from a survey) or too short (a hard-to-decipher acronym).Select your full range of data and turn on the data filter function. We’ll want to use this for sorting and filtering.Three types of data cleaning/preparation problemsThe majority of data preparation involves “fixing” columns. A column of data can have many types of problems: data formats, extraneous information, missing information. Section 4 provides a list of common problems, a solution, and a link to the example.Your data may have individual rows that need to be fixed or removed. Section 5 highlights a few common issues here.Finally, your data may need to be restructured. This is where things get complicated. You may need to turn certain columns into rows (or rows into columns). Data tables can be too granular for analysis (e.g. “transactional-level data”) or too summarized. In Section 6 we’ll provide a few examples of these problems.4. Fix ColumnsProblem: Values are broken into sections, with empty cells that imply values in between.Solution: IF() function can fill in the implied values. Spreadsheet example

Problem: Cells contain multiple values and require complex logic to break out the values.Solution: Use MID(), LEFT(), RIGHT() functions to identify break points and parse the field values. Spreadsheet example

Problem: Cells contain multiples that are consistently delineated by a comma or other characterSolution: SPLIT() function. Spreadsheet example

Problem: ALLCAPS or other capitalization of words.Solution: Change text with LOWER(), UPPER(), or PROPER() functions. Spreadsheet example

Problem: Values for the desired data field are spread across multiple columns, E.g. first name, last name.Solution: Use CONCATENATE() function to consolidate multiple values into a single column. Spreadsheet example

Problem: Numbers are being treated as text.Solution: Convert text into numbers with VALUE() function. Spreadsheet example

Problem: You want a date field that uses date values (i.e. year, month, day) from other columns.Solution: Use DATE() function to construct a date value. Spreadsheet example

Problem: The values in a column include information that is unnecessary.Solution: Use a combination of CLEAN(), FIND(), and MID() functions to remove extra characters. Spreadsheet example

Problem: Values in a column need to be converted to 0 or 1 values for calculations. Solution: Use a combination of IF() and ISBLANK() functions. Spreadsheet example

Problem: ZIP codes are being displayed as numbers and therefore losing the preceding ‘0’ for some codes.Solution: Identify “shortened” ZIP codes and use the CONCATENATE() function to append the missing zero. Spreadsheet example

Problem: The values in a column need to be converted into a label that is more completer or easier to read, e.g. country names rather than country abbreviations.Solution: Use the LOOKUP() function. Spreadsheet example

Problem: The values in a column need to “bucketed” into groups, e.g. age ranges.Solution: Either use a nested IF() function and reference a lookup table or use the SWITCH() function. Spreadsheet example

Problem: Some measure values are represented by non-numeric values such as “—” or “..” or “NULL”Solution: Identify the columns with issues using the COUNT() and COUNTA() functions. Use the data filtering feature to find the problematic values. Spreadsheet example

5. Fix RowsProblem: The header row is repeated multiple times throughout the table.Solution: Use data filtering to filter to only these rows and delete all but one. Spreadsheet example

Problem: Some cells are merged. This is not conducive to data analysis.Solution: In Excel, select all cells in the spreadsheet, then use the menu option for ‘Merge & Center > Unmerge Cells’. Otherwise, Look for patterns as to when cells get merged and use data filter in the column header to filter and manually un-merge cells. Spreadsheet example

Problem: The table has a summary row or rows that adds up or averages values all the values in a column. These rows are not helpful for tidy data.Solution: Identify how these rows are labeled. Use the data filter in the column header to narrow to only these rows and delete.7. 🚧 Restructuring Data 🚧There are three common ways that data needs to get restructured. This is where things get hard and where there isn’t a formula to make the necessary changes.The same metric is broken out into separate columns. We will often see data tables where a metric, like sales, is broken into separate columns by a dimension like month/quarter, geographic region, product, etc. This structure is not conducive to analysis. The preferred structure is to have a column that represents that dimension and a single column for the metric.Data has been pre-aggregated. Your data table needs to include enough details to support the slicing and dicing you want to do. If a table has already been summarized, all that information will be hidden behind aggregated values. In this case, you may need to get back to the original data source to access disaggregated data.Data is too granular. The flip side of the previous problem is when you are dealing with very detailed data. Data that shows individual transactions or activities, like weblog data, maybe too granular for analysis. This data needs to be pre-aggregated so that it is more manageable. In web analytics data, for example, rather than having each row represent an action by a user, each row could be summarized to describe all the actions of each user.
Get Your Free Workspace

Tags: Data science

Get real time update about this post categories directly on your device, subscribe now.

Unsubscribe
Exclusive Interview with Srikanth Velamakanni, Co-Founder, Fractal Analytics – Fractal Becomes a Unicorn!

Exclusive Interview with Srikanth Velamakanni, Co-Founder, Fractal Analytics – Fractal Becomes a Unicorn!

10 January 2022
Become a Web 3 Developer in 2022

Become a Web 3 Developer in 2022

21 June 2022
5 Steps to Become a Citizen Developer

5 Steps to Become a Citizen Developer

23 December 2020
AI FinTech: Making Data Science a Competitive Differentiator

AI FinTech: Making Data Science a Competitive Differentiator

12 April 2021
Top Data Science Projects to add to your Portfolio in 2021

Top Data Science Projects to add to your Portfolio in 2021

13 December 2021
How to run a business during economic uncertainty

How to run a business during economic uncertainty

23 May 2022
SAFe Program Board 101: Everything You Need To Know

SAFe Program Board 101: Everything You Need To Know

21 December 2020

Stock price using LSTM and its implementation

6 December 2021

[Live Sessions] Ask us anything!

11 January 2021

Knowledge Distillation: Theory and End to End Case Study

4 January 2022

Handy Tips #31: Detecting invalid metrics with Zabbix validation preprocessing

9 June 2022

Food for Agile Thought #315: Fulfilling Working Experiences, Sustainable Disruption, Toxic Company Culture, Jira: Razor Blades?

24 October 2021

Predictive Analytics: The it Plays in Real Estate Sector

8 November 2021

Building an ETL Data Pipeline Using Azure Data Factory

16 June 2022

An Intelligent Digital Workforce: The Great Integrator

29 September 2020

Introduction to Cloud Computing for Machine Learning Beginners

10 January 2022
IT Skills You Need

© 2022 IT SKILLS YOU NEED

Navigate Site

  • Activity
  • Classifieds
  • Groups
  • Home
  • Members

Follow Us

No Result
View All Result
  • Home

© 2022 IT SKILLS YOU NEED

Welcome Back!

Login to your account below

Forgotten Password? Sign Up

Create New Account!

Fill the forms bellow to register

*By registering into our website, you agree to the Terms & Conditions and Privacy Policy.
All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In

Add New Playlist

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy and Cookie Policy.