Ethics in data science and machine learning


It's important to understand that ethics plays a role in every part of the data science lifecycle. You should consider the ethics of your decisions at each step, from framing your central question through making your model available.

In the berry example, you discovered that a significant piece of data was missing from the training and testing datasets. You didn't know about thimbleberries or that there were six types of berries instead of only five. Although the subject of berry identification might seem trivial, the phenomenon represents a much larger problem. In higher-stakes fields, such as rocket launch safety, missing data can skew results and even be life-threatening. For example, did you know that men and women exhibit vastly different heart attack symptoms? In recent health studies, huge populations of people were omitted from initial data collection, and that omission affected models for heart attack symptoms that were used in healthcare.

Ethics and rocket launch safety

The knowledge and expertise of NASA scientists and collaborators help ensure the highest probability of a safe and successful rocket launch. You might not have access to the same resources, but you can try to be as ethical as possible with the limited data that's available to you.

In the remaining modules of this learning path, you'll explore how weather data that's publicly available can help you understand what a successful launch day looks like. The dataset that you'll work with contains information about 64 crewed and uncrewed rocket launches. With this data, you can look at the weather on those 64 launch days to try to get an accurate understanding of what the weather needs to be to support a successful launch.

The dataset you'll use contains one unsuccessful rocket launch, which was pushed back because of weather. Think about the thimbleberry example. If you don't have a complete representation of data, you won't know when to look out for new categories. With the berry example, you're missing the fact that there are six berry types, and you haven't identified thimbleberries. With the NASA data, you're missing launch dates that were pushed back.
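
One way to make this kind of gap visible is to count how many examples each outcome has before you build anything on the data. The following is a minimal sketch, not part of the learning path's dataset or code. The label names, the helper function `underrepresented_classes`, and the threshold of five examples are all assumptions chosen for illustration.

```python
from collections import Counter


def underrepresented_classes(labels, min_count=5):
    """Return each class that appears fewer than min_count times, with its count.

    min_count is an arbitrary, hypothetical threshold, not an established rule.
    """
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Hypothetical labels mirroring the description above: 64 launches in total,
# one of which was pushed back because of weather.
labels = ["launched"] * 63 + ["pushed_back"] * 1

# Report any outcome with too few examples to learn from reliably.
print(underrepresented_classes(labels))
```

A check like this can flag an outcome that has too few examples, but it can't reveal a category that never appears in the data at all, such as thimbleberries in the berry example. Finding those gaps still depends on subject matter expertise and on collecting more data.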

Data science problems require rigor and iteration. With each new level of knowledge we gain from our data, we learn what other data might be missing, what new questions to ask, and how we might prioritize the data to yield more accurate understandings of the world around us.

An analysis that considers only one negative example isn't the kind of analysis NASA would rely on when real lives are at risk. More data and subject matter expertise would be required before such an analysis could be used for any kind of real decision making. However, the dataset you'll work with in the next modules in the learning path provides an introduction to the type of analysis that might be used as a starting point.