pd.cut()

pd.cut() is a handy way of converting a continuous range of numeric values into categorical values. First, let me point out that pd is not a built-in Python name. It is the conventional alias for the pandas library. You need to import it, and while we are at it, let's import NumPy as well:

import pandas as pd
import numpy as np

Pandas greatly simplifies reading and writing Comma Separated Value (.csv) files. Next, we need an example CSV file to read in. This particular file separates its fields with semicolons rather than commas, hence the sep=';' argument below. Why not use the wine quality dataset from: http://www3.dsi.uminho.pt/pcortez/wine/

wine_df = pd.read_csv("winequality-red.csv", sep=';')
wine_df.describe()
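If you don't have the file downloaded yet, here is a minimal sketch of the same read_csv call against an in-memory string. The three rows are made-up values in the same semicolon-separated layout, not real records from the dataset, and the names sample and sample_df are just for illustration:

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; the numbers are invented for illustration
# and only mimic the semicolon-separated layout of the real file.
sample = io.StringIO(
    '"fixed acidity";"alcohol";"quality"\n'
    '7.4;9.4;5\n'
    '7.8;9.8;5\n'
    '11.2;12.8;6\n'
)

# sep=';' tells pandas to split fields on semicolons instead of commas.
sample_df = pd.read_csv(sample, sep=';')

print(sample_df.dtypes)      # column types inferred by pandas
print(sample_df.describe())  # summary statistics for the numeric columns
```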

Imagine you wanted to group these wines by their alcohol content. In a language like Perl, you would do something along the lines of this:

# Assuming you put all the individual records' alcohol content into an array called @records:
# (foreach aliases each element, so assigning to $alcohol overwrites the value in @records)

foreach my $alcohol (@records) {
   if ($alcohol < 10) {
      $alcohol = "low";
   } elsif ($alcohol >= 10 && $alcohol < 12) {
      $alcohol = "medium";
   } elsif ($alcohol >= 12) {
      $alcohol = "high";
   }
}

That is quite oversimplified, but you get the idea. In Python, with pandas, it's a whole lot easier and more flexible:

wine_df['alcohol_category'] = pd.cut(wine_df['alcohol'], bins=[0., 10., 12., np.inf], labels=['low', 'medium', 'high'])
wine_df.sample(n=10)

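bins=[] sets the category boundaries from a list of edges. The first value is the lower edge of the first category, and the second value is its upper edge. For the second category, the second value becomes the lower edge and the third the upper edge, and so on, so three categories need four edges. By default each bin excludes its lower edge and includes its upper edge, so a wine at exactly 10.0 lands in 'low' here; pass right=False if you want the < 10 and >= 10 behavior of the Perl version. Values that fall outside every bin become NaN. The category names are specified by labels=[], which needs one name per bin.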
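To see how values sitting exactly on a boundary are handled, here is a small sketch. The alcohol values are hypothetical, picked to land on and around the edges, and it compares the default behavior with right=False:

```python
import numpy as np
import pandas as pd

# Hypothetical alcohol values chosen to sit on and around the bin edges.
alcohol = pd.Series([9.5, 10.0, 11.9, 12.0, 14.0])
bins = [0., 10., 12., np.inf]
labels = ['low', 'medium', 'high']

# Default (right=True): intervals are (0, 10], (10, 12], (12, inf]
right_closed = pd.cut(alcohol, bins=bins, labels=labels)

# right=False: intervals are [0, 10), [10, 12), [12, inf)
left_closed = pd.cut(alcohol, bins=bins, labels=labels, right=False)

# Side-by-side comparison; only the rows at 10.0 and 12.0 differ.
print(pd.DataFrame({'alcohol': alcohol,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

Only the boundary values change category, which is easy to miss on real data where few records sit exactly on an edge.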

pd.cut() returns a pandas Series object with a categorical dtype, which is then assigned to the new DataFrame column ['alcohol_category'].
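You can check what comes back for yourself. This is a sketch using a tiny made-up DataFrame (df and its values are hypothetical stand-ins for the wine data), and it also shows one thing the new column is good for, counting wines per category:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; only 'alcohol' matters here.
df = pd.DataFrame({'alcohol': [9.4, 9.8, 10.5, 11.2, 12.8, 13.1]})

categories = pd.cut(df['alcohol'], bins=[0., 10., 12., np.inf],
                    labels=['low', 'medium', 'high'])

print(type(categories).__name__)  # the container type returned by pd.cut
print(categories.dtype)           # the dtype of the values inside it

# Assign the result as a new column, then count rows per category.
df['alcohol_category'] = categories
counts = df['alcohol_category'].value_counts()
print(counts)
```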

Note that the last item in the bins=[] list is np.inf. This is NumPy's floating-point infinity, so the last category has no upper limit. Negative infinity is written -np.inf (the older alias np.NINF was removed in NumPy 2.0).
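As a quick sketch of why the infinite edges matter, the example below uses hypothetical values (and made-up labels 'below', 'mid', 'above') to compare bins that are open on both ends with bins that start at 0:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one below zero.
values = pd.Series([-5.0, 3.0, 11.0, 250.0])

# Open-ended on both sides: every finite value falls into some bin.
open_ended = pd.cut(values, bins=[-np.inf, 0., 10., np.inf],
                    labels=['below', 'mid', 'above'])
print(open_ended.tolist())

# With a finite lowest edge, anything at or below that edge matches no bin
# and comes back as NaN instead of a label.
bounded = pd.cut(values, bins=[0., 10., np.inf], labels=['mid', 'above'])
print(bounded.isna().tolist())
```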
