10. Data Wrangling & Cleaning

We will use a case study to walk through the steps that come before building machine learning models, so that the data is robust enough for making predictions.

The problem statement comes from an online education platform. We'll look at the factors that help us select the most promising leads, i.e. the leads that are most likely to convert into paying customers.

Our ultimate goal: use the data from previous leads, some who converted into customers and many who did not, to build a model that scores incoming leads for preferential retargeting.

The data dictionary for the data set is here

10.1 Load data and High level review

10.1.1 Assign 'handles'

10.2 Review Target Variable

We have no idea what kind of variable it is...
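One way to find out what kind of variable the target is: look at its storage type, how many distinct values it takes, and how those values are distributed. The following is a minimal sketch, assuming pandas; the column name `converted` and the toy values are hypothetical stand-ins for the real leads data.

```python
import pandas as pd

# Toy stand-in for the leads data; the column name `converted` is hypothetical.
leads = pd.DataFrame({"converted": [1, 0, 0, 1, 0, 0, 0, 1]})

target = leads["converted"]

# Storage type: an integer dtype alone does not make the variable numeric in meaning.
target_dtype = target.dtype

# Number of distinct values: exactly two suggests a binary (nominal) target.
n_unique = target.nunique()

# Share of each class, to see how balanced the target is.
class_share = target.value_counts(normalize=True)

print(target_dtype, n_unique)
print(class_share)
```

If the target turns out to have only two values, it is better treated as a class label than as a quantity.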

10.3 Variable Types

10.3.1 Categorical Variables [Non-Numeric]

10.3.2 Numeric Variables
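A first split into non-numeric and numeric fields can be made from the column dtypes. This is a sketch only, assuming pandas; apart from `totviz`, the column names and values are hypothetical.

```python
import pandas as pd

# Toy data; all column names except `totviz` are hypothetical.
leads = pd.DataFrame({
    "lead_source": ["ads", "organic", "referral"],
    "totviz": [3.0, 0.0, 7.0],
    "converted": [1, 0, 1],
})

# Non-numeric columns are candidates for categorical treatment.
categorical_cols = leads.select_dtypes(exclude="number").columns.tolist()

# Numeric columns still need a manual check: some may be coded categories.
numeric_cols = leads.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

The dtype split is only a starting point; the data dictionary should have the final say on what each field means.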

10.4 Cleaning Numeric Fields

Ratio and interval variables are usually cleaned in much the same way.

Some of the integer fields that only take the values 1/0 are really nominal variables.
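Integer columns that hold only 1/0 can be flagged programmatically so they are not cleaned as if they were quantities. A minimal sketch, assuming pandas; `do_not_email` and `converted` are hypothetical column names and the values are made up.

```python
import pandas as pd

# Toy data; `do_not_email` and `converted` are hypothetical 1/0 fields.
leads = pd.DataFrame({
    "totviz": [3, 0, 7, 1],
    "do_not_email": [0, 1, 0, 0],
    "converted": [1, 0, 1, 0],
})

numeric = leads.select_dtypes(include="number")

# A numeric column whose non-null values are all in {0, 1} is a binary flag,
# i.e. a nominal variable stored as an integer.
binary_cols = [
    col for col in numeric.columns
    if set(numeric[col].dropna().unique()) <= {0, 1}
]

print(binary_cols)
```

Note that `totviz` contains both 0 and 1 here but is not flagged, because it also takes other values.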

10.4.1 totviz

The total number of visits made by the customer on the website.

10.4.1.1 Null treatment of Numeric Values

Is this value missing because it wasn't recorded or because it doesn't exist?
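The answer to that question drives the treatment. The sketch below, assuming pandas and NumPy with made-up values, first measures how much of `totviz` is missing and then shows two possible fills: zero if a missing value means "no visits", or the median if the visit simply wasn't recorded. Which one applies is a judgment call about the data, not something the code can decide.

```python
import numpy as np
import pandas as pd

# Toy values for `totviz`; NaN marks a missing entry.
leads = pd.DataFrame({"totviz": [3.0, np.nan, 0.0, 7.0, np.nan]})

# How many values are missing, and what share of the column is that?
n_missing = int(leads["totviz"].isna().sum())
share_missing = leads["totviz"].isna().mean()

# Option A: the value does not exist (no visits happened), so fill with 0.
totviz_zero = leads["totviz"].fillna(0)

# Option B: the value exists but was not recorded, so impute the median.
median_visits = leads["totviz"].median()
totviz_median = leads["totviz"].fillna(median_visits)

print(n_missing, share_missing, median_visits)
```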

10.4.1.2 Dropping all records with null values
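Dropping records is the bluntest option, so it helps to check how many rows it costs. A minimal sketch, assuming pandas and NumPy; `lead_source` is a hypothetical column and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data; `lead_source` is a hypothetical column name.
leads = pd.DataFrame({
    "totviz": [3.0, np.nan, 0.0, 7.0],
    "lead_source": ["ads", "organic", None, "referral"],
})

rows_before = len(leads)

# Drop every record that has a null in any column.
complete_rows = leads.dropna()

# Milder alternative: drop only records where `totviz` itself is null.
totviz_known = leads.dropna(subset=["totviz"])

rows_dropped = rows_before - len(complete_rows)
print(rows_dropped, len(totviz_known))
```

Restricting the drop with `subset` keeps records that are incomplete elsewhere but still usable for the field being cleaned.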

10.4.1.3 Outlier treatment of Numeric Values

Outlying values don't add significant

Depends on usage and practice. What is seen and expected in reality?

What you want to understand is the generally acceptable range of values.
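When no domain-specific range is available, one common rule of thumb is the 1.5 × IQR fence. It is shown here only as an illustration, assuming pandas and made-up visit counts; it is not necessarily the rule used for this data set, and a range taken from real-world practice should take precedence.

```python
import pandas as pd

# Made-up visit counts with one implausibly large value.
visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 250], name="totviz")

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = visits.quantile(0.25)
q3 = visits.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (visits < lower) | (visits > upper)

# One possible treatment: cap values at the fences instead of dropping rows.
capped = visits.clip(lower=lower, upper=upper)

print(visits[is_outlier].tolist())
print(capped.max())
```

Capping keeps the record in the data set while limiting the influence of the extreme value; dropping the record is the alternative when the value is clearly an error.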

10.5 Categorical Values

10.5.1 Low density missing values
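When only a small share of a categorical field is missing, filling with the most frequent category is one simple option. A minimal sketch, assuming pandas; `lead_source` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical field with a single missing entry.
source = pd.Series(
    ["ads", "ads", "organic", None, "ads", "referral"], name="lead_source"
)

# Share of missing values; low here, so a simple fill is defensible.
share_missing = source.isna().mean()

# Most frequent category (mode) and the filled series.
most_common = source.mode().iloc[0]
source_filled = source.fillna(most_common)

print(share_missing, most_common)
```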

10.5.2 High density missing values
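When most of a categorical field is missing, filling with the mode would invent a lot of data. Two alternatives are sketched below, assuming pandas: keep the field with an explicit placeholder category, or drop the field. The column names, the values, and the 50% threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy data; both column names are hypothetical.
leads = pd.DataFrame({
    "lead_source": ["ads", "organic", "ads", "referral", "ads"],
    "occupation": [None, None, "student", None, None],
})

# Share of missing values per column.
missing_share = leads.isna().mean()

# Arbitrary illustrative cut-off for "high density" of missing values.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()

# Option A: keep the column and make missingness an explicit category.
occupation_flagged = leads["occupation"].fillna("Unknown")

# Option B: drop the sparse columns altogether.
leads_reduced = leads.drop(columns=sparse_cols)

print(sparse_cols)
```

Option A preserves the possibility that missingness itself is informative; Option B is simpler when the field carries too little signal to keep.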

10.5.3 Reducing the number of categories
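One way to reduce the number of categories is to lump rarely seen levels into a single "Other" level. A minimal sketch, assuming pandas; the field name, the values, and the minimum count of 2 are hypothetical.

```python
import pandas as pd

# Hypothetical field with two frequent levels and three rare ones.
source = pd.Series(
    ["ads"] * 5 + ["organic"] * 3 + ["referral", "event", "partner"],
    name="lead_source",
)

counts = source.value_counts()

# Illustrative cut-off: levels seen fewer than `min_count` times are rare.
min_count = 2
rare_levels = counts[counts < min_count].index

# Keep frequent levels as they are; replace rare ones with "Other".
source_reduced = source.where(~source.isin(rare_levels), "Other")

print(source_reduced.value_counts())
```

The cut-off is a modelling choice: a higher value gives fewer, better-populated categories at the cost of detail.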

10.6 Methods not covered

The following references are offered to the class because these topics weren't covered, either because of the data set or because of their advanced nature.

You may want to take a look and see if you need them for your Term Project.

  1. Parsing dates
  2. Character Encoding Challenges
  3. Human Error in Data Entry
  4. Cleaning text data
  5. Another one on text data