When Does Imbalanced Data Require More Than Cost-Sensitive Learning?

Dragos Margineantu

Most classification algorithms expect the frequency of examples from each class to be roughly the same. However, this is rarely the case for real-world data, where the class probability distribution is very often nonuniform (that is, imbalanced). For these applications, the main problem is usually that the costs of misclassifying examples belonging to rare classes differ significantly from the costs of misclassifying examples from classes that are represented in a higher proportion in the data. Cost-sensitive learning studies and provides methods for the design and evaluation of classification algorithms for arbitrary cost functions. This paper outlines an issue that can occur in the imbalanced data setting but, to our knowledge, has not been studied in the cost-sensitive learning literature: the situation in which the class probability distribution of the training data differs significantly from the class probability distribution of the test data. We will present a brief overview of cost-sensitive learning methods applied to imbalanced data, and we will extend the existing theoretical results to the setting in which the training and test class priors are different.
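To make the setting concrete, the following is a minimal sketch, not the method developed in the paper, of one standard way to account for differing class priors: the posteriors estimated on the training distribution are reweighted by the ratio of test to training priors and renormalized, and the class with minimum expected cost is then chosen. The function names, the cost matrix, and the numbers are hypothetical and chosen only for illustration; the sketch assumes that the class-conditional distributions are the same in training and test data and that only the priors change.

```python
def adjust_posteriors(posteriors, train_priors, test_priors):
    """Re-estimate P(class | x) under new class priors (Bayes' rule).

    Assumes p(x | class) is unchanged between training and test data.
    """
    # Reweight each posterior by test_prior / train_prior, then renormalize.
    weighted = [p * (q / r)
                for p, q, r in zip(posteriors, test_priors, train_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def min_cost_class(posteriors, cost):
    """Return the index of the prediction with minimum expected cost.

    cost[i][j] is the cost of predicting class i when the true class is j.
    """
    expected = [sum(c * p for c, p in zip(row, posteriors)) for row in cost]
    return min(range(len(expected)), key=expected.__getitem__)


# Hypothetical two-class example: class 1 is rare at test time.
posteriors = [0.7, 0.3]       # classifier output, trained on balanced data
train_priors = [0.5, 0.5]
test_priors = [0.95, 0.05]
cost = [[0.0, 10.0],          # predicting 0 when the truth is 1 costs 10
        [1.0, 0.0]]           # predicting 1 when the truth is 0 costs 1

adjusted = adjust_posteriors(posteriors, train_priors, test_priors)
print(min_cost_class(posteriors, cost))  # decision that ignores the prior shift
print(min_cost_class(adjusted, cost))    # decision after the prior correction
```

In this illustrative case the two decisions differ: the same cost matrix leads to a different prediction once the change in class priors is taken into account, which is the kind of situation the abstract describes.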
