Some notes from the machine-learning-projects course on Coursera

  1. Do your best at each step of a machine learning system:
    1. train
    2. dev/test
    3. real world
  2. If the system fails in the real world, go back and check all the previous steps.
  3. Always create a SINGLE NUMBER metric so you can quickly choose which machine learning algorithm is best.
  4. If the current metrics (e.g., precision, recall) cannot be captured in one number, either create a new combined number, or split them into optimizing metrics plus a satisficing metric. For example, given Precision, Recall, and Running time: optimize [Precision, Recall] subject to the constraint that the running time is below some threshold T. Any algorithm whose running time exceeds T is discarded outright, without further consideration.
  5. Always keep the purpose of the machine learning task in view and aim for it. Change things until the system works for that purpose; do not use a metric that looks nice but does not capture the goal correctly.
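
The rule in item 4 (optimize some metrics subject to a hard limit on running time) can be sketched roughly as below. This is only an illustration: the `Candidate` class, the `pick_best` helper, the use of F1 as the single number combining precision and recall, and all the figures are my own assumptions, not something from the course.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    precision: float
    recall: float
    runtime_ms: float  # hypothetical unit for the running time


def f1(c: Candidate) -> float:
    """Harmonic mean of precision and recall (one possible single-number metric)."""
    if c.precision + c.recall == 0:
        return 0.0
    return 2 * c.precision * c.recall / (c.precision + c.recall)


def pick_best(candidates, max_runtime_ms):
    """Discard every candidate slower than the threshold T, then maximize F1."""
    feasible = [c for c in candidates if c.runtime_ms <= max_runtime_ms]
    if not feasible:
        return None
    return max(feasible, key=f1)


# Made-up numbers, for illustration only.
candidates = [
    Candidate("A", precision=0.95, recall=0.90, runtime_ms=80),
    Candidate("B", precision=0.98, recall=0.85, runtime_ms=95),
    Candidate("C", precision=0.99, recall=0.97, runtime_ms=1500),
]

best = pick_best(candidates, max_runtime_ms=100)
print(best.name if best else "no candidate meets the runtime limit")
```

Candidate C has the best precision and recall, but it is never considered because it breaks the running-time limit.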

About linear regression and L1/L2 regularization (Lasso and Ridge)

These are for my own memory:

  1. The data should be normalized (standardized) before applying them, because the penalty depends on the scale of each feature.
  2. The observations are assumed to be i.i.d., and classical linear regression assumes Gaussian (normally distributed) errors.
  3. L1 and L2 are designed for cases with many correlated (multicollinear) variables, where the coefficients become very large, for example in high-order polynomial regression.
  4. L1 is called L1 because its penalty is the L1 norm of the coefficient vector, ||w||_1 (the sum of absolute values). Likewise, L2 uses the L2 norm, ||w||_2 (in practice its square, the sum of squared coefficients).
  5. L1 sets some coefficients exactly to zero, so it removes features and reduces their number. L2 only shrinks coefficients toward zero without eliminating them. L1 is more computationally expensive than L2.
  6. L1 and L2, again, make the model less complex. They constrain the size of the coefficients so that they cannot grow arbitrarily large.
  7. A larger alpha (or lambda) parameter pushes more coefficients toward zero (with L1, exactly to zero).
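
To see why L1 removes features while L2 only shrinks them, here is a small sketch under a simplifying assumption of mine: the features are orthonormal (X^T X = I). In that special case, minimizing ½||y − Xw||² + λ||w||_1 soft-thresholds each least-squares coefficient, and minimizing ½||y − Xw||² + (λ/2)||w||_2² divides each one by (1 + λ). The function names and the coefficient values are made up for illustration; real data needs a solver such as the ones in scikit-learn.

```python
def lasso_shrink(b_ols: float, lam: float) -> float:
    """Soft-thresholding: the L1 solution for one coefficient (orthonormal case)."""
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b_ols: float, lam: float) -> float:
    """Proportional shrinkage: the L2 solution for one coefficient (orthonormal case)."""
    return b_ols / (1.0 + lam)


ols_coefficients = [3.0, -0.4, 0.1, -2.0]  # made-up least-squares estimates

for lam in (0.5, 2.5):
    lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
    ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
    zeros = sum(1 for b in lasso if b == 0.0)
    print(f"lambda={lam}: lasso={lasso} ({zeros} zeroed), ridge={ridge}")
```

With the larger lambda, more of the L1 coefficients land exactly on zero, while the L2 coefficients get smaller but never reach zero.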

 

quick set-up pyspark (apache spark) in 1 minute under Mac

Thanks to Homebrew, it is easy. Just run:

brew install apache-spark

Add these lines to the end of your ~/.bash_profile:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

 

Open a new terminal and type:

pyspark

That's all, folks. Jupyter/IPython Notebook will launch automatically in your default browser.

Update

Instead of running all the commands above, simply run:

pip install pyspark

That also does the magic!
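
A quick way to confirm that the pip install made the package visible to the Python interpreter you actually use is to look it up without importing it. A minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found by the current Python interpreter."""
    return importlib.util.find_spec(package) is not None


if is_installed("pyspark"):
    print("pyspark is importable from this interpreter")
else:
    print("pyspark not found; check which pip/python you used")
```

If it reports "not found" right after a successful install, the usual cause is that pip and python point to different environments.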

true meaning of precision and recall

To evaluate and compare the performance of classification algorithms, people tend to use precision and recall.

 

Basically, here are the meanings:

  1. Precision: the bigger the better. It measures how well the algorithm avoids false positives. It is the fraction of items predicted positive that are actually positive: TP / (TP + FP).
  2. Recall (i.e., sensitivity): how well the algorithm avoids false negatives. It is the fraction of the actually positive items (in the labeled training/test set) that the algorithm predicts as positive: TP / (TP + FN).

Some references

1. Google Course

2. Deep Learning: A Practitioner's Approach, Josh Patterson & Adam Gibson, O'Reilly, 2017

 

Update:

Another easy-to-grasp explanation comes from the Apple documentation:

Precision and recall are actually two metrics. But they are often used together. Precision answers the question: Out of the items that the classifier predicted to be true, how many are actually true? Whereas, recall answers the question: Out of all the items that are true, how many are found to be true by the classifier?
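
The two questions above translate directly into counts of true positives (TP), false positives (FP), and false negatives (FN). Here is a small sketch in plain Python. The function name, the toy labels, and the choice to return 0.0 when a denominator is zero are my own assumptions for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute (precision, recall) for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Out of the items predicted positive, how many are actually positive?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Out of the items that are actually positive, how many were found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 5 real positives, 3 real negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the classifier predicts 4 positives and 3 of them are right, so precision is 3/4; it finds 3 of the 5 real positives, so recall is 3/5.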

 

Install Gitlab on Mac using Docker

Following the official GitLab instructions, pull the Docker image with this command:

docker pull gitlab/gitlab-ce

However, the guide for running that image is written for Linux and does not work on Mac. To make it work on Mac, use the following command, modified as needed for your setup:

docker run --hostname gitlab.example.com --publish 443:443 --publish 80:80 --publish 2200:22 --name gitlab --restart always -v logs:/var/log/gitlab -v data:/var/opt/gitlab gitlab/gitlab-ce:latest

On the first run, access GitLab at http://localhost and create a new password for the root account. After that, the login is root/<your new password>.

Good extension for scikit-learn

For me, mlxtend helps to visualize scikit-learn results in a nice way, for example the confusion matrix; see http://rasbt.github.io/mlxtend/#examples. The snippet here plots decision regions for several classifiers:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions

# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
                              weights=[2, 1, 1], voting='soft')

# Loading some example data
X, y = iris_data()
X = X[:,[0, 2]]

# Plotting Decision Regions

gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))

labels = ['Logistic Regression',
          'Random Forest',
          'RBF kernel SVM',
          'Ensemble']

for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         labels,
                         itertools.product([0, 1],
                         repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y,
                                clf=clf, legend=2)
    plt.title(lab)

plt.show()