# Some very simple classification using scikit-learn
We are going to predict from transactional data whether customers are loyal. We will do this by training a classifier that will predict whether a given customer will return in 2015.

## Load relevant modules

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

## Data Preparation
Get the data from http://liacs.leidenuniv.nl/~takesfw/DSPM/data/sales.csv and read it with pandas. The file has no header row, so the `names` argument provides the column headers, and `parse_dates` converts `saleDateTime` to a datetime column.

In [2]:
df = pd.read_csv(
    'http://liacs.leidenuniv.nl/~takesfw/DSPM/data/sales.csv', 
    names=[
        'saleId', 'saleDateTime', 'accountName', 'coins', 'currency', 
        'priceInCurrency', 'priceInEUR', 'methodId', 'ip', 'ipCountry'
    ],
    parse_dates=['saleDateTime']
)
df

,saleId,saleDateTime,accountName,coins,currency,priceInCurrency,priceInEUR,methodId,ip,ipCountry
0,22497,2010-01-01 08:00:31,214001065a105d24da76ad30f,14000,EUR,10.00,10.000,40,21.213.175.133,CY
1,22499,2010-01-01 11:53:24,1f9d7ef7422a94de52ada4870,35000,EUR,20.00,20.000,40,151.36.162.144,CY
2,22500,2010-01-01 11:56:50,d459930438f5610a5d915a767,14000,EUR,10.00,10.000,40,151.36.162.144,CY
3,22506,2010-01-01 13:31:41,d35bf11e9d005f64d8027521b,28000,EUR,20.00,20.000,40,155.196.146.232,GR
4,22507,2010-01-01 13:33:54,d35bf11e9d005f64d8027521b,6000,EUR,5.00,5.000,40,155.196.146.232,GR
...,...,...,...,...,...,...,...,...,...,...
747106,2559412,2015-05-14 03:40:51,b158d12fbe85dc2bc50bad52c,2100,EUR,2.00,2.000,2000,144.76.183.73,FR
747107,2559413,2015-05-14 03:41:30,b7f401f488a08b823bd7d9c6c,4000,EUR,4.00,4.000,2000,154.95.162.120,GF
747108,2559425,2015-05-14 19:01:32,7ba22509dff6afa562d5e73b4,1100,PLN,4.99,1.228,2000,1.50.212.16,PL
747109,2559427,2015-05-14 23:11:26,0b7ce8a6f37f0a19726469ce8,4000,EUR,4.00,4.000,2000,218.131.46.249,RE
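
As a minimal sketch of how `names` and `parse_dates` behave, the same call can be tried on a small in-memory file. The two rows and the reduced column list below are made up for illustration and are not taken from the sales data.

```python
import io

import pandas as pd

# Two made-up, headerless rows standing in for the real sales file.
raw = io.StringIO(
    "1,2014-12-31 23:59:00,alice,100\n"
    "2,2015-01-01 00:01:00,bob,200\n"
)

toy = pd.read_csv(
    raw,
    names=['saleId', 'saleDateTime', 'accountName', 'coins'],  # supplies the header
    parse_dates=['saleDateTime'],  # parses the column into datetimes
)

print(toy.dtypes)
```

Because `names` is given, pandas treats the first line as data rather than as a header, which is why both rows survive.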


## Feature construction
Slice the data so that only transactions before 2015 are included. We want to predict whether customers return in 2015, so information from 2015 must not be included in the features; otherwise the target would leak into the inputs.


In [3]:
learning_set = df[df['saleDateTime'].dt.year < 2015]
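
The effect of this year-based slice can be sketched on a toy frame. The accounts and dates below are invented for illustration only.

```python
import pandas as pd

# Invented transactions: 'a' buys before and in 2015, 'b' only before, 'c' only in 2015.
toy = pd.DataFrame({
    'accountName': ['a', 'a', 'b', 'c'],
    'saleDateTime': pd.to_datetime([
        '2013-06-01', '2015-02-01', '2014-12-31', '2015-03-15',
    ]),
})

# Rows usable for features (strictly before 2015) and rows that define the target.
before_2015 = toy[toy['saleDateTime'].dt.year < 2015]
in_2015 = toy[toy['saleDateTime'].dt.year == 2015]

print(before_2015['accountName'].tolist())
print(in_2015['accountName'].tolist())
```

Note that an account such as `c`, which only appears in 2015, has no rows in the pre-2015 slice and therefore gets no feature row at all.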

Create some very basic features:
- number of transactions
- mean priceInCurrency per transaction
- maximum number of coins bought in one transaction

Each row in the following table is one customer (the result is similar to SQL's `GROUP BY` clause), and each column is one feature.

In [4]:
features = learning_set.groupby(['accountName']).agg({'coins': 'max', 'priceInCurrency': 'mean', 'saleId': 'nunique'})
features

accountName,coins,priceInCurrency,saleId
000113b526fbc8d19f03fccb9,6100,4.666667,3
00027408fd5b8ff231814dbf2,31000,3.598167,120
0002c9dd969f5e5831d492524,15000,8.000000,5
0004dd02811925311678f7c85,6100,4.338000,5
00066a451be9a6bc5fda16fd9,2100,5.617500,8
...,...,...,...
fffaa6a39be1ab2f86fd2571f,31000,20.000000,1
fffb2efd72b3318b86173d294,7200,8.000000,1
fffcd5cce92731953647fe62f,17000,4.454262,122
fffd3e63c839efd5435be9619,14000,15.000000,2
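
One optional variation, shown here only as a sketch on invented data, is pandas' named aggregation. It computes the same three features but gives each column a descriptive name instead of reusing the source column name. The names `max_coins`, `mean_price`, and `n_transactions` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Invented transactions for two accounts.
toy = pd.DataFrame({
    'saleId': [1, 2, 3, 4],
    'accountName': ['a', 'a', 'b', 'b'],
    'coins': [100, 300, 50, 20],
    'priceInCurrency': [1.0, 3.0, 5.0, 1.0],
})

# Named aggregation: output_column=(input_column, aggregation).
toy_features = toy.groupby('accountName').agg(
    max_coins=('coins', 'max'),
    mean_price=('priceInCurrency', 'mean'),
    n_transactions=('saleId', 'nunique'),
)

print(toy_features)
```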


The following array holds the account name of every transaction made in 2015, so a customer appears once per transaction. These are the loyal customers, and they form the positive class that the classification model should identify.

In [5]:
target = df[df['saleDateTime'].dt.year == 2015]['accountName'].values

In [6]:
target

array(['d99b7d0aa59c953e20cf3b5d7', '85d9e4def318173c70568f269',
       'bffbcb55d031774bbcbeca576', ..., '7ba22509dff6afa562d5e73b4',
       '0b7ce8a6f37f0a19726469ce8', 'cc7267f27c95acc46be8b8a8b'],
      dtype=object)

In [7]:
features['target'] = features.index.isin(target)
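
The labeling step relies on `Index.isin`, which marks each customer in the feature table as `True` when the account name occurs anywhere in the array. A small sketch with invented account names:

```python
import pandas as pd

# Invented feature table indexed by account name.
toy_features = pd.DataFrame({'n_transactions': [3, 1, 7]}, index=['a', 'b', 'c'])

# Invented 2015 account names; duplicates and accounts without features are harmless.
returned_in_2015 = ['c', 'a', 'c', 'zzz']

toy_features['target'] = toy_features.index.isin(returned_in_2015)

print(toy_features)
```

Duplicate names do not matter because `isin` only tests membership, and names that are absent from the index (here `zzz`) are simply ignored.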

Count the customers who did and did not return in 2015. Note the class imbalance: a model that always predicts `False` would already reach an accuracy of over 90%.

In [8]:
features['target'].value_counts()

False    62895
True      5616
Name: target, dtype: int64
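
The baseline accuracy follows directly from these two counts. The arithmetic, as a short sketch:

```python
# Class counts copied from the value_counts() output above.
n_not_returned = 62895
n_returned = 5616

# A model that always predicts False is right for every non-returning customer.
baseline_accuracy = n_not_returned / (n_not_returned + n_returned)

print(f"{baseline_accuracy:.3f}")  # prints 0.918
```

Any classifier trained on this data should be judged against this figure rather than against 50%.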

In [9]:
train, test = train_test_split(features)
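
By default, `train_test_split` holds out 25% of the rows and shuffles them with an unseeded random generator, so every run produces a different split. A possible refinement, sketched here on invented data and not part of the original notebook, is to fix `random_state` for reproducibility and to pass `stratify` so that both parts keep the same class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced data: 20 positives and 80 negatives.
toy = pd.DataFrame({
    'x': range(100),
    'target': [True] * 20 + [False] * 80,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,          # same as the default
    stratify=toy['target'],  # keep the class ratio in both parts
    random_state=0,          # arbitrary seed, chosen only for reproducibility
)

print(toy_train['target'].mean(), toy_test['target'].mean())
```

With stratification, both parts contain 20% positives, which matters most when the positive class is rare.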

## Classification

In [10]:
classifier = RandomForestClassifier()

In [11]:
classifier.fit(X=train.drop(columns='target'), y=train['target'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [12]:
y_pred = classifier.predict(X=test.drop(columns='target'))

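The trained model's performance on the test set is summarized in the confusion matrix below. Rows are actual classes and columns are predicted classes, matching the layout of scikit-learn's `confusion_matrix`. Because neither the split nor the forest is seeded, the exact counts change from run to run.

|    | Predicted Negatives | Predicted Positives |
|:---|:---:|:---:|
| Actual Negatives | 15428 | 350 |
| Actual Positives | 1220 | 130 |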

In [13]:
confusion_matrix(y_true=test['target'], y_pred=y_pred)

array([[15428,   350],
       [ 1220,   130]], dtype=int64)
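
To put these counts in perspective, accuracy, precision, and recall can be derived from the four cells. The sketch below copies the numbers from the output above and uses only plain arithmetic.

```python
# Counts copied from the confusion matrix printed above.
# scikit-learn's layout: rows are actual classes, columns are predicted classes.
tn, fp = 15428, 350   # actual negatives: predicted False / predicted True
fn, tp = 1220, 130    # actual positives: predicted False / predicted True

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of correct predictions
baseline = (tn + fp) / total   # accuracy of always predicting False
precision = tp / (tp + fp)     # share of predicted returners who did return
recall = tp / (tp + fn)        # share of actual returners the model found

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

For this run, the accuracy of about 0.908 is slightly below the always-`False` baseline of about 0.921, and the model finds fewer than 10% of the returning customers. With these three simple features, accuracy alone is therefore a misleading measure of quality.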