- Do your best at each step of the machine learning system.
- Test it in the real world.
- If it fails in the real world, go back and re-check all the previous steps.
- Always create a SINGLE-NUMBER metric to quickly choose which machine learning algorithm is best.
- If the current metrics (e.g., precision, recall) cannot be captured in that single number, create a new one, or split them into an optimizing metric plus a satisficing constraint. For example, with precision, recall, and running time: optimize [precision, recall] subject to the constraint that the running time stays below some threshold T. Any algorithm whose running time exceeds T is discarded completely, without reconsideration.
- Always look at the purpose of the machine learning task and aim for it. Change things to make that goal work; do not use a metric that is nice to look at but cannot capture the goal correctly.
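A minimal sketch of the optimizing-metric-plus-satisficing-constraint idea above. The model names and all the numbers here are made up for illustration:

```python
# Pick the best model using one optimizing metric (F1, a single number
# combining precision and recall), subject to a satisficing constraint
# (running time below a threshold T). Candidates whose running time
# exceeds T are discarded outright. All numbers are hypothetical.

def f1(precision, recall):
    """Harmonic mean of precision and recall: a single-number metric."""
    return 2 * precision * recall / (precision + recall)

# (name, precision, recall, running time in ms) -- made-up results
candidates = [
    ("model_a", 0.90, 0.60, 120),
    ("model_b", 0.80, 0.75, 90),
    ("model_c", 0.95, 0.85, 400),  # best F1, but too slow
]

T = 200  # satisficing threshold on running time (ms)

feasible = [c for c in candidates if c[3] <= T]     # enforce the constraint
best = max(feasible, key=lambda c: f1(c[1], c[2]))  # optimize the single number

print(best[0])  # -> model_b (model_c is discarded despite the best F1)
```

Note that model_c never gets reconsidered once it violates the constraint, exactly as in the rule above.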
These notes on L1/L2 regularization are for my own memory:
- The data must be normalized before applying L1/L2 regularization.
- The data must be i.i.d. and must follow a Gaussian distribution (normal distribution).
- L1 and L2 are designed for when there are many multicollinear variables, whose coefficients blow up under high-order regression.
- L1 is called L1 because of its norm ||vector||: it uses the L1 norm of the vector. The same goes for L2.
- L1 will drive some coefficients to exactly zero and remove those features, so it reduces the number of features, while L2 only makes some coefficients very close to zero. L1 is computationally intensive relative to L2.
- L1 and L2, again, make the model less complex. They put bounds on the coefficients so they can never grow unboundedly large.
- A big alpha (or lambda) parameter will drive more features toward zero.
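A small numpy sketch of the L1-vs-L2 difference noted above. In the special case of an orthonormal design matrix, the penalized solutions have closed forms in terms of the OLS coefficients: lasso (L1) soft-thresholds them, producing exact zeros, while ridge (L2) only rescales them toward zero. The coefficient values below are made up for illustration:

```python
import numpy as np

# Closed forms for an orthonormal design, given OLS coefficients b:
#   L1 (lasso): soft-threshold  sign(b) * max(|b| - lam, 0)  -> exact zeros
#   L2 (ridge): shrink          b / (1 + lam)                -> small, never zero

def lasso_orthonormal(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def ridge_orthonormal(beta_ols, lam):
    return beta_ols / (1.0 + lam)

beta = np.array([3.0, -0.5, 0.2, -2.0])  # hypothetical OLS coefficients
lam = 1.0                                # the alpha/lambda parameter

print(lasso_orthonormal(beta, lam))  # two coefficients become exactly 0
print(ridge_orthonormal(beta, lam))  # all shrunk by half, none exactly 0
```

Raising lam zeroes out more coefficients under L1, matching the last bullet above.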
Thanks to Homebrew, it is easy; just do this:
brew install apache-spark
Then add these lines to the end of your ~/.bash_profile:
Open the terminal and type:
That’s all, folks. The Jupyter/IPython Notebook will automatically be launched in your favorite browser.
Instead of running all the commands above, simply running:
pip install pyspark
also does the magic!
In order to evaluate and compare the performance of classification algorithms, people tend to use precision and recall.
Basically, here are their meanings:
- Precision: the bigger the better. It measures how well the algorithm avoids false positives (i.e., whether the number of false positives is large or not). In other words, it is the ratio of true positives among all the items the classifier predicted to be positive.
- Recall (i.e., sensitivity): how well the algorithm avoids false negatives (whether the number of false negatives is large or not). In other words, it is the ratio of true positives among the items that are actually positive in the training/test set.
(Reference: Deep Learning – A Practitioner’s Approach, Josh Patterson & Adam Gibson, O’Reilly, 2017.)
Another easy-to-grasp explanation is from the Apple documentation:
Precision and recall are actually two metrics. But they are often used together. Precision answers the question: Out of the items that the classifier predicted to be true, how many are actually true? Whereas, recall answers the question: Out of all the items that are true, how many are found to be true by the classifier?
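The two definitions above can be written directly in terms of the confusion-matrix counts. The labels and predictions below are made up for illustration:

```python
# Precision = TP / (TP + FP): of the items predicted positive, how many
# are actually positive. Recall = TP / (TP + FN): of the items that are
# actually positive, how many the classifier found.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical ground truth and classifier output
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

p, r = precision_recall(y_true, y_pred)
print(p, r)  # -> 0.75 0.75 (TP=3, FP=1, FN=1)
```

Here 3 of the 4 predicted positives are real (precision), and 3 of the 4 actual positives were found (recall).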
Following the official instructions from GitLab, install the docker image with this command:
docker pull gitlab/gitlab-ce
However, the guide to run that docker image is for Linux and does not work on Mac. To make it work on Mac, the following command must be used, modified according to your usage:
docker run --hostname gitlab.example.com --publish 443:443 --publish 80:80 --publish 2200:22 --name gitlab --restart always -v logs:/var/log/gitlab -v data:/var/opt/gitlab gitlab/gitlab-ce:latest
On the first run, access GitLab at http://localhost and create a new password for the root account. Then, the default login is root/<your new password>.
For me, this helps visualize scikit-learn results in a nice way, for example, the confusion matrix here: http://rasbt.github.io/mlxtend/#examples
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions

# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
                              weights=[2, 1, 1], voting='soft')

# Loading some example data
X, y = iris_data()
X = X[:, [0, 2]]

# Plotting Decision Regions
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))

labels = ['Logistic Regression', 'Random Forest', 'RBF kernel SVM', 'Ensemble']
for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         labels,
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()
This is a free book, and a very good one.