Machine Learning Basics Webinar - Prateek Narang
Webinar Contents
 Intro to Machine Learning
 What is Machine Learning / Machine Intelligence?
 A few interesting applications of Machine Learning
 Supervised, Unsupervised and Reinforcement Learning
 Hands-on Session
 Some Python Basics
 Working with Numpy & Pandas
 Steps involved in Machine Learning
 Building your first Machine Learning Algorithm
 K-Nearest Neighbours
 Working with Datasets (MNIST Dataset) - Handwritten Digit Recognition
 LIVE Project
Machine Intelligence
Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. What this means, in most cases, is that an algorithm is given a set of data and infers information about the properties of the data, and that information allows it to make predictions about other data that it might see in the future. In simple terms, it gives predictive power to computers!
Machine Learning vs Artificial Intelligence
 Artificial Intelligence is a system which interacts with its surroundings
 AI systems have sensors to collect data from the surroundings
 Machine Learning can be considered the "brain of AI", which processes the input data
 The ML algorithm frames an appropriate answer, which is sent back to the surroundings.
Example of a Self-Driving Car

Input data - a set of images captured by the sensor

Processing by the Machine Learning algorithm - a model trained on images processes each frame and looks for any obstacles

Output - a required 'float' value giving the required acceleration of the car.
Why the Hype?

 Every minute, up to 300 hours of video are uploaded to YouTube.
 Facebook users send an average of 31.25 million messages and view 2.77 million videos every minute.
 More data has been created in the past two years than in the entire previous history of the human race.
 At the moment less than 0.5% of all data is ever analyzed and used - just imagine the potential here.
Machine Learning In Industry
Google Page Ranking
Google - Natural Language Search Queries
Netflix Suggestions
Tinder, for you to "chill"
Tesla Self-Driving Cars
Political Campaigns (Sentiments of People)
Spam Filtering
Google AdSense (Ads based upon your history)
Bioinformatics (Predicting Cancers, IBM Watson)
Apple Siri - Speech Recognition and Talking
Chatbots like Tay, Ruuh
Machine Learning at Home
Google “Allo”
Snapchat Filters
Google Home and Amazon “Alexa”
Facebook Photo Tagging
PRISMA
Recommendations on Amazon, Flipkart
and many more…
Different Machine Learning Approaches
Supervised Learning
 The algorithm gets a set of labeled data called Training Data
 Predictions are made on a set of unlabeled data called Testing Data

Example - spam filtering in emails, obstacle detection in images, classifying fruits
 The algorithm is trained as a model, which is based on various parameters called features.
| Color(X) | Sweetness(Y) | Label |
|----------|--------------|-------|
| 0.80     | 0.90         | Apple |
| 0.80     | 0.84         | Apple |
| 0.10     | 0.27         | Lemon |
| 0.30     | 0.47         | Lemon |
| 0.83     | 0.83         | Apple |
| 0.60     | 0.97         | Apple |
Unsupervised Learning
 The algorithm doesn't get a set of labeled data
 The algorithm automatically extracts hidden patterns from the data.
 Mostly used to group data into various sets; similar data is clubbed into a single set called a cluster.
 Example: clustering algorithms group a set of related data points into a single cluster
| Color(X) | Sweetness(Y) |
|----------|--------------|
| 0.80     | 0.90         |
| 0.80     | 0.84         |
| 0.10     | 0.27         |
| 0.30     | 0.47         |
| 0.83     | 0.83         |
| 0.60     | 0.97         |
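The clustering idea can be sketched with a minimal K-Means loop in plain NumPy. The two blobs of points below are hypothetical, not the fruit data above:

```python
import numpy as np

np.random.seed(0)
# Two blobs of unlabeled 2-D points (hypothetical data)
data = np.vstack([np.random.randn(50, 2) + [0.0, 0.0],
                  np.random.randn(50, 2) + [5.0, 5.0]])

k = 2
# Deterministic init: one starting center taken from each blob
centers = data[[0, 50]].copy()
for _ in range(10):
    # Assign every point to its nearest center
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Move each center to the mean of its assigned points
    centers = np.array([data[labels == i].mean(axis=0) for i in range(k)])

print(centers.round(1))  # one center near (0, 0), the other near (5, 5)
```

After a few iterations the similar points end up clubbed into the same cluster, without any labels being provided.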
Reinforcement Learning
 It has a feedback element to improve its performance
 Based upon the idea of "reward", the algorithm will move in a direction that achieves maximum reward.
 Good application: teaching a machine to play games like Tic-Tac-Toe, Chess etc.
 The algorithm uses moves tried in the past which led to successful results
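A minimal sketch of the reward idea, using a hypothetical two-armed bandit (the arm probabilities and the epsilon-greedy rule here are illustrative assumptions, not from the webinar):

```python
import random

random.seed(1)
true_prob = [0.3, 0.8]   # hypothetical win rates, unknown to the agent
counts = [0, 0]
values = [0.0, 0.0]      # running average reward observed per arm

for step in range(2000):
    # epsilon-greedy: mostly repeat the best past move, sometimes explore
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = 0 if values[0] > values[1] else 1
    reward = 1 if random.random() < true_prob[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print([round(v, 2) for v in values])  # the agent learns that arm 1 pays more
```

The agent moves toward the action with the maximum observed reward, exactly the feedback loop described above.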
Popular Techniques
 K-Means
 K-Nearest Neighbours
 Regression
 Decision Trees
 Naive Bayes
 Neural Networks
 Support Vector Machines (SVM)
 Deep Learning
Open Source Packages
 Scikit-Learn
 TensorFlow
 Pytorch
Developer Checklist
 Basic Python 2.7+, Pip, Jupyter Notebook installed
 Numpy - mathematical operations
 Pandas - working with CSVs (Excel-style sheets), data reading and writing
 Matplotlib - plotting graphs
Python Basics
 Lists
 Dictionaries
 Sorting
 Lambda Function
 Range Function
# Working with Lists
a = [ 1,2,4,5,6,"Hello"]
print a
# Slicing in Lists
print a[1:2]
print a[2:]
print a[:4]
print a[:]
# Dictionaries in Python (Hashmaps in C++, JSON in JavaScript)
prices = {
"mango":100,
"apple":120,
"banana":[10,20,30]
}
print type(prices)
print prices["mango"]
#Iterate over all the keys
print prices.keys()
print prices.values()
# Loops
i = 1
while i <= 10:
    print i
    i += 1
# Range function: range(start, end, jump)
range(1,10,2)
x = "Python"
if x == "Python":
    print "yes"
else:
    print "no"
[1, 2, 4, 5, 6, 'Hello']
[2]
[4, 5, 6, 'Hello']
[1, 2, 4, 5]
[1, 2, 4, 5, 6, 'Hello']
<type 'dict'>
100
['mango', 'apple', 'banana']
[100, 120, [10, 20, 30]]
1
2
3
4
5
6
7
8
9
10
yes
# Sorting in Lists
a = [5,4,3,1,2]
a = sorted(a,reverse=True)
print a
[5, 4, 3, 2, 1]
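The Lambda Function item from the Python basics list combines with sorted via the key argument; here is a small sketch using hypothetical fruit prices:

```python
# Sort a list of (fruit, price) pairs by price using a lambda as the key
prices = [("mango", 100), ("apple", 120), ("banana", 10)]
by_price = sorted(prices, key=lambda item: item[1])
print(by_price)  # [('banana', 10), ('mango', 100), ('apple', 120)]
```

The lambda tells sorted which part of each element to compare, instead of comparing the whole pair.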
Math in Python - Packages & Imports
import math
print math.log10(100)
print math.sqrt(2)
2.0
1.41421356237
from math import sqrt as sq
sq(100)
10.0
Scientific Computation
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# Numpy is a library actually written in C, with a Python interface
# Arrays in Numpy
arr = np.array([1,2,3,4])
# it is of fixed size, it is not dynamic
print type(arr)
<type 'numpy.ndarray'>
# 2D arrays in Numpy
a = np.zeros((4,4))
print a
a[ : ,0] = 2
a[1, :] = 3
print a
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
[[ 2. 0. 0. 0.]
[ 3. 3. 3. 3.]
[ 2. 0. 0. 0.]
[ 2. 0. 0. 0.]]
# Unique and Argmax Functions
arr = np.asarray([1,2,3,5,3,7,4,2,1,7,7])
b = np.unique(arr,return_counts=True)
print b
index = b[1].argmax()
print b[0][index]
(array([1, 2, 3, 4, 5, 7]), array([2, 2, 2, 1, 1, 3]))
7
Plotting Graphs using Matplotlib
from jupyterthemes import jtplot
jtplot.style()
a = np.asarray(range(100))
plt.figure(0)
plt.plot(a)
plt.figure(1)
plt.plot(a**2,color='green')
plt.plot(a**3,color='red')
plt.show()
# Scatter Plots
# Random Values in the Range 0 and 1
arr = np.random.random((100,2))
print arr.shape
print arr
plt.figure(0)
# First parameter is the X coordinate, second parameter is Y
plt.scatter(arr[:,0],arr[:,1],color='yellow')
plt.show()
# We are using the Scatter Plot
(100, 2)
[[ 0.00291436  0.01448286]
 [ 0.51031075  0.65767026]
 [ 0.97074722  0.83606565]
 ...
 [ 0.96437541  0.14919605]]
Probability Distribution
Random Variable
A random variable is a variable whose possible values are numerical outcomes of a random experiment.
For example: 1) a random variable could denote the number of characters in all the books in the world, 2) the length of movie names in all the movies released so far, 3) the outcomes of a dice-throw experiment.
Mean and Expectation
μ = E(X)
where μ is the mean and E(X) is the expected value of X.
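For the dice-throw example above, the expectation is just the probability-weighted average of the outcomes:

```python
# Expected value of a fair six-sided die: each face appears with probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
mean = sum(x * (1.0 / 6) for x in outcomes)
print(mean)  # close to 3.5
```

So over many throws, the average outcome converges to 3.5, even though 3.5 itself can never be rolled.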
Normal/Gaussian Distribution
Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.
Standard Normal Distribution
Multivariate Normal Distribution
Example in 2 dimensions
KNN Algorithm
 The K-nearest-neighbour (KNN) algorithm measures the distance between a query scenario and a set of scenarios in the data set.
 KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x,y) and would like to capture the relationship between x and y.
 This method is used for both classification and regression.
Training Data
| Color(X) | Sweetness(Y) | Label |
|----------|--------------|-------|
| 0.80     | 0.83         | Apple |
| 0.80     | 0.85         | Apple |
| 0.10     | 0.27         | Lemon |
| 0.30     | 0.47         | Lemon |
| 0.83     | 0.87         | Apple |
| 0.60     | 0.97         | Apple |
Test Data
| Color(X) | Sweetness(Y) | Actual Label | Predicted Label |
|----------|--------------|--------------|-----------------|
| 0.91     | 0.75         | Apple        | Apple           |
| 0.11     | 0.25         | Lemon        | Lemon           |
Code
# Lemons are sour - average sweetness will be low, and they have a low value for color
# Red values are higher, yellow lower
# Sweetness is higher, sourness lower
mean_01 = np.array([3.0,4.0])
# 2 x 2 covariance matrix
cov_01 = np.array([[1.0,0.5],[0.5,1.0]])
mean_02 = np.array([0.0,0.0])
cov_02 = np.array([[1.0,.5],[0.5,0.6]])
dist_01 = np.random.multivariate_normal(mean_01,cov_01,200)
dist_02 = np.random.multivariate_normal(mean_02,cov_02,200)
print dist_01.shape
print dist_02.shape
# print dist_01
(200, 2)
(200, 2)
# Try to make a scatter plot of these points
plt.figure(0)
for x in range(dist_01.shape[0]):
    plt.scatter(dist_01[x,0],dist_01[x,1],color='red')
    plt.scatter(dist_02[x,0],dist_02[x,1],color='yellow')
plt.show()
# Training Data Preparation
# 400 samples - 200 for apples, 200 for lemons
labels = np.zeros((400,1))
labels[200:] = 1.0
X_data = np.zeros((400,2))
X_data[:200,:] = dist_01
X_data[200: ,:] = dist_02
# print X_data
# print labels
KNN Algorithm :)
# Distance of the query_point to all other points in the space: O(N) time per query + sorting
# Total complexity is O(Q.N) for Q query points
# Euclidean Distance
def dist(x1,x2):
    return np.sqrt(((x1 - x2)**2).sum())
x1 = np.array([0.0,0.0])
x2 = np.array([1.0,1.0])
print dist(x1,x2)
1.41421356237
def knn(X_train,query_point,y_train,k=5):
    vals = []
    for ix in range(X_train.shape[0]):
        v = [ dist(query_point,X_train[ix,:]), y_train[ix]]
        vals.append(v)
    # vals is a list containing [distance, label] pairs
    updated_vals = sorted(vals, key=lambda v: v[0])
    # Let us pick up the top K values
    pred_arr = np.asarray(updated_vals[:k])
    pred_arr = np.unique(pred_arr[:,1],return_counts=True)
    # Largest occurrence
    index = pred_arr[1].argmax() # index of the largest frequency
    return pred_arr[0][index]
q = np.array([0.0,4.0])
predicted_label = knn(X_data,q,labels)
print predicted_label
## Run a loop over the testing data (split the original data into 2 sets - Training, Testing)
# Find predictions for Q query points
# If predicted outcome == actual outcome -> Success, else Failure
# Accuracy = (Successes) / (Total no. of testing points) * 100
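The accuracy recipe in the comments above can be sketched end to end. This standalone version uses hypothetical two-class data and a simple 1-nearest-neighbour predictor so the block runs on its own:

```python
import numpy as np

np.random.seed(0)
# Hypothetical 2-class data: class 0 near (0,0), class 1 near (5,5)
X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 5.0])
y = np.array([0] * 40 + [1] * 40)

# Split the original data into 2 sets - Training (60 points), Testing (20 points)
idx = np.random.permutation(80)
train, test = idx[:60], idx[60:]

def predict(q):
    # 1-nearest neighbour: return the label of the closest training point
    d = np.linalg.norm(X[train] - q, axis=1)
    return y[train][d.argmin()]

successes = sum(predict(X[i]) == y[i] for i in test)
accuracy = successes / float(len(test)) * 100
print(accuracy)  # near 100 for this well-separated data
```

The same loop applies unchanged to the knn function defined earlier; only the predictor call differs.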
Project Work - Handwritten Digit Recognition on the MNIST Dataset (using KNN)
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
ds = pd.read_csv('./train.csv')
print ds.shape
data = ds.values
print data.shape
(42000, 785)
(42000, 785)
y_train = data[:, 0]
X_train = data[:, 1:]
# X_train = (X_train - X_train.mean(axis=0))/(X_train.std(axis=0) + 1e-03)
print y_train.shape, X_train.shape
plt.figure(0)
idx = 104
print y_train[idx]
plt.imshow(X_train[idx].reshape((28, 28)), cmap='gray')
plt.show()
(42000,) (42000, 784)
2
def dist(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum())

def knn(X_train, x, y_train, k=5):
    vals = []
    for ix in range(X_train.shape[0]):
        v = [dist(x, X_train[ix, :]), y_train[ix]]
        vals.append(v)
    updated_vals = sorted(vals, key=lambda x: x[0])
    pred_arr = np.asarray(updated_vals[:k])
    pred_arr = np.unique(pred_arr[:, 1], return_counts=True)
    pred = pred_arr[1].argmax()
    # return pred_arr[0][pred]
    return pred_arr, pred_arr[0][pred]
idq = int(np.random.random() * X_train.shape[0])
q = X_train[idq]
res = knn(X_train[:10000], q, y_train[:10000], k=7)
print res
print y_train[idq]
plt.figure(0)
plt.imshow(q.reshape((28, 28)), cmap='gray')
plt.show()
((array([ 3.]), array([7])), 3.0)
3
Subscribe to us on YouTube for more such tutorials.
Download Project
Data files and complete code can be downloaded from Github