Exercise - Clean weather data to analyze rocket launch criteria

Now that we have the data imported, we need to apply a machine learning practice known as "cleaning the data." We take data that looks incorrect or messy and clean it up by changing the value or deleting it altogether. Common examples of cleaning data are:

  • Ensuring that there are no null values
  • Making every value in a column look the same

We clean data because most analysis and machine learning tools either fail or produce misleading results when the data is inconsistent or when many of its values are null.
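As a rough illustration of the two examples above, here is a minimal sketch using pandas. The `sample` DataFrame and its values are made up for this illustration and are not part of the launch data.

```python
import pandas as pd

# Hypothetical sample with two problems: missing values and inconsistent labels.
sample = pd.DataFrame({
    "Condition": ["Fair", "fair ", None, "Cloudy"],
    "High Temp": [75.0, None, 80.0, 78.0],
})

# Count the missing (null) values in each column.
missing_per_column = sample.isna().sum()

# Make every value in a column look the same: trim whitespace and unify casing.
sample["Condition"] = sample["Condition"].str.strip().str.title()

print(missing_per_column)
print(sample["Condition"].unique())
```

After this runs, "fair " and "Fair" are the same label, and the null counts show where values still need to be filled in.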

Data cleaning

The first step in cleaning your data is to replace all missing values with something. Replacing these values usually requires subject matter expertise. But in this case, you'll use your best judgment. Some rows (remember, rows represent days) are missing weather or launch data.

To get started, first get an overview of the launch data by running this command in your .ipynb file:

launch_data.info()

Of 300 rows, some columns have missing information:

RangeIndex: 300 entries, 0 to 299
Data columns (total 26 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Name                          60 non-null     object        
 1   Date                          300 non-null    datetime64[ns]
 2   Time (East Coast)             59 non-null     object        
 3   Location                      300 non-null    object        
 4   Crewed or Uncrewed            60 non-null     object        
 5   Launched?                     60 non-null     object        
 6   High Temp                     299 non-null    float64       
 7   Low Temp                      299 non-null    float64       
 8   Ave Temp                      299 non-null    float64       
 9   Temp at Launch Time           59 non-null     float64       
 10  Hist High Temp                299 non-null    float64       
 11  Hist Low Temp                 299 non-null    float64       
 12  Hist Ave Temp                 299 non-null    float64       
 13  Precipitation at Launch Time  299 non-null    float64       
 14  Hist Ave Precipitation        299 non-null    float64       
 15  Wind Direction                299 non-null    object        
 16  Max Wind Speed                299 non-null    float64       
 17  Visibility                    299 non-null    float64       
 18  Wind Speed at Launch Time     59 non-null     float64       
 19  Hist Ave Max Wind Speed       0 non-null      float64       
 20  Hist Ave Visibility           0 non-null      float64       
 21  Sea Level Pressure            299 non-null    object        
 22  Hist Ave Sea Level Pressure   0 non-null      float64       
 23  Day Length                    298 non-null    object        
 24  Condition                     298 non-null    object        
 25  Notes                         3 non-null      object 

You can see that Hist Ave Max Wind Speed, Hist Ave Visibility, and Hist Ave Sea Level Pressure have no data.

It makes sense that Wind Speed at Launch Time, Temp at Launch Time, Launched?, Crewed or Uncrewed, Time (East Coast), and Name have only 59 or 60 non-null values, because the data includes only 60 launches. The remaining rows are the days before and after each launch.
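If you prefer numbers over reading the info() table, a sketch like the following counts missing values directly. The `demo` DataFrame is a small hypothetical stand-in for launch_data. In your notebook, you would call the same methods on launch_data.

```python
import pandas as pd

# Hypothetical stand-in for launch_data: five days, only one of them a launch day.
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=5),
    "Launched?": [None, None, "Y", None, None],
    "High Temp": [70.0, 72.0, 75.0, float("nan"), 71.0],
    "Hist Ave Visibility": [float("nan")] * 5,
})

# Number of missing values in each column.
missing_counts = demo.isna().sum()

# Columns that contain no data at all.
empty_columns = missing_counts[missing_counts == len(demo)].index.tolist()

# Number of rows that record a launch.
launch_days = int(demo["Launched?"].notna().sum())

print(missing_counts)
print(empty_columns, launch_days)
```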

Here are a few ways we'll clean the data:

  • The rows that don't have Y in the Launched column didn't have a rocket launch, so make those missing values N.
  • For rows missing information on whether the rocket was crewed or uncrewed, assume uncrewed. Uncrewed is more likely because there were fewer crewed missions.
  • For missing wind direction, mark it as unknown.
  • For missing condition data, assume it was a typical day and use Fair.
  • For any other missing data, use a value of 0.

In the next cell, paste and run this code:

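# Fill the missing values with appropriate defaults.
# Assigning the result back avoids calling fillna(inplace=True) on a single
# column, which newer pandas versions warn about and might not apply.
launch_data['Launched?'] = launch_data['Launched?'].fillna('N')
launch_data['Crewed or Uncrewed'] = launch_data['Crewed or Uncrewed'].fillna('Uncrewed')
launch_data['Wind Direction'] = launch_data['Wind Direction'].fillna('unknown')
launch_data['Condition'] = launch_data['Condition'].fillna('Fair')
# Fill everything that's still missing with 0
launch_data = launch_data.fillna(0)
launch_data.head()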

Try running launch_data.info() again to see the changes that you just made to the data.
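You can also confirm the result programmatically. The following is a minimal sketch on a hypothetical three-row `demo` DataFrame that applies the same kind of fills and then checks that no null values remain. In your notebook, the same check would be launch_data.isna().sum().sum().

```python
import pandas as pd

# Hypothetical stand-in for a few rows of launch_data.
demo = pd.DataFrame({
    "Launched?": ["Y", None, None],
    "Condition": ["Cloudy", None, "Fair"],
    "Max Wind Speed": [10.0, None, 16.0],
})

# Column-specific defaults first, then 0 for everything that's left.
demo["Launched?"] = demo["Launched?"].fillna("N")
demo["Condition"] = demo["Condition"].fillna("Fair")
demo = demo.fillna(0)

# Total number of null values across the whole DataFrame.
remaining_nulls = int(demo.isna().sum().sum())
print(remaining_nulls)
```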

Note

You're changing the data that's stored in the launch_data variable, not the data saved in the Excel file. If you find that you modified or removed any data that you didn't mean to, you can re-run your notebook to bring the original data back in.

Data manipulation

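Because most machine learning algorithms expect numerical inputs, convert the categorical text columns into numbers. The label encoder assigns integers to the labels in sorted order, so with the labels Crewed and Uncrewed, a crewed rocket becomes 0 and an uncrewed rocket becomes 1.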

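# As part of the data cleaning process, convert text data to numbers, because most machine learning algorithms require numerical input
# Assumes preprocessing was imported earlier in the notebook (from sklearn import preprocessing)
label_encoder = preprocessing.LabelEncoder()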

# Three columns have categorical text info, and we convert them to numbers
launch_data['Crewed or Uncrewed'] = label_encoder.fit_transform(launch_data['Crewed or Uncrewed'])
launch_data['Wind Direction'] = label_encoder.fit_transform(launch_data['Wind Direction'])
launch_data['Condition'] = label_encoder.fit_transform(launch_data['Condition'])
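To see which number each label received, you can inspect the encoder's classes_ attribute. The sketch below uses a hypothetical `demo` DataFrame and, as a design choice that differs from the code above, keeps one encoder per column in an `encoders` dictionary. Refitting a single encoder replaces its classes_, so separate encoders are needed if you want to look up or reverse the mappings later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two of the text columns in launch_data.
demo = pd.DataFrame({
    "Crewed or Uncrewed": ["Uncrewed", "Crewed", "Uncrewed"],
    "Condition": ["Fair", "Cloudy", "Fair"],
})

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
for column in ["Crewed or Uncrewed", "Condition"]:
    encoder = LabelEncoder()
    demo[column] = encoder.fit_transform(demo[column])
    encoders[column] = encoder

# classes_ is sorted, and a label's position in it is its integer code.
crew_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Crewed or Uncrewed"].classes_)
}
print(crew_mapping)
print(demo)
```

Each encoder also provides inverse_transform, which converts the integer codes back into the original labels.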

Let's look at all the data again and verify that it has been cleaned.

launch_data.head()