Practical Machine Learning, AI & Data Science Part B: Intermediate ML using R in SQL and ML Servers
This was a 5 star course. Rafal is a world-class teacher who brings the right combination of practical, technical and theoretical experience to the course. I have a Masters in Analytics and have worked on an Analytics Project for 3 years and yet I still learnt so much from this course. Without a doubt the best course I have been on.
Intermediate Machine Learning in R on SQL Server and Microsoft ML Server – a 3 day training course with Rafal Lukawiecki
This live, instructor-led course is fully up-to-date for 2019 and can be attended in the classroom or via Skype.
This course focuses on R and the technologies of Microsoft Machine Learning Server, Azure ML, SQL Server, and Azure SQL Database, whilst teaching you everything you need to know to start using machine learning, and to apply data science for analytics. It follows on from an Introduction to Machine Learning, AI & Data Science with Azure ML. Together, the Introductory course and this Intermediate course make up the Practical Machine Learning course. Most of the course is also applicable to Python programmers, as the key ML Server libraries are the same.
Who is the Intermediate Machine Learning training course for?
You will be expected to have some understanding of machine learning. If you have attended a course on machine learning before, for example the introductory course, or if you are versed in model validity, accuracy, and reliability, this course is suitable for you. If you’re not sure, ask yourself these questions: Can I explain the difference between cross-validation and hold-out testing? Do I know which business metrics correspond to precision and which to recall? Is model accuracy more important than reliability? How does a boosted decision tree work? If in doubt, please attend the introductory course first, or simply take the full 5 days’ training.
What you will learn
You will learn all the concepts and tools that you need to know! Rafal Lukawiecki will teach you:
everything essential to starting data science, ML, and AI projects
all fundamental concepts
how to avoid common pitfalls
how to work fast yet accurately
what is really useful and practical
what is more theoretical but still important
what hype you should be wary of.
You will be able to ask any questions related to your industry and you will get relevant, pragmatic, no-nonsense answers, helping you get ahead with your own projects.
Rafal has been delivering ML, data mining, and data science projects for customers in retail, banking, entertainment, healthcare, manufacturing, education, and government sectors for more than 10 years, and has trained more than 800 data scientists worldwide. He’s a highly-respected presenter, capable of holding your attention. Above all, you’ll be learning from a machine learning practitioner.
The training comprises 50% lectures, 30% demos and 20% tutorials.
You are encouraged to follow the demos on your machine, and you will be challenged to find answers to 3 larger problems during the tutorials. While they are a hands-on part of the course, if you prefer not to practice, you are welcome to use that time for additional Q&A, or to analyse your own data. We will provide you with all the necessary data sets, and we will explain what free or evaluation edition software needs to be installed to follow the course on your own laptop.
We provide pre-built machines, but if you’d rather use your own laptop, please tell us in advance.
You will need an Azure account (even a free one) during the course. You can copy course experiments and data into your workspace for learning and for future reference after the course.
About Rafal Lukawiecki
As Data Scientist at Project Botticelli Ltd, Rafal focuses on making advanced analytics and artificial intelligence easy and useful for his clients.
He can help you find valuable, meaningful patterns and statistically valid correlations using data mining and machine learning, and he is also known for his work in business intelligence, data protection, enterprise architecture, and solution delivery.
Rafal has been a popular speaker at major IT conferences since 1998, and he has had the honour of sharing keynote platforms with Bill Gates and Neil Armstrong. A natural educator, he explains complex concepts in simple terms in his enjoyably energetic style.
This course is available as live, instructor-led training in the classroom, or you can join the live class by Skype.
Intermediate Machine Learning in R on SQL Server and Microsoft ML Server
Working with R
There is a large number of tools that you can use with R, and we begin the day focusing on the essential ones. You will also learn how to organise your workflow. Topics include:
Why is RStudio better than RTVS 2017
R Tools for Visual Studio 2017 (please note, there is no RTVS for VS 2019)
Microsoft Machine Learning Server vs SQL Machine Learning Services (Azure and Server)
Package dependency management
Snapshots using MRAN Time Machine
Projects, files, scripts, history, version control using git
Notebooks and RMarkdown
Data Preparation in R
R uses data frames, data tables, and tibbles, amongst others, while ML Server adds XDFs and the ability to work with data stored natively in Hadoop, Spark, and SQL Server. While most data preparation should be done as close to the source as possible, preferably using SQL, you will need to learn how to perform some transformations in R. Topics include:
Data frames, tables, tibbles
Reading files and ODBC data
XDFs and connecting to data in ML Server
Scaling data access using ML Server to overcome R/Python memory and parallelism limitations
Plots and Visualisations in R
One of the strengths of R is the ease of creating accurate (and good looking!) plots. As a bare minimum you need to understand how to use the most popular visualisation package, ggplot2, and some of the built-in base functions. Topics include:
Base boxplots, histograms, scatter plots
ggplot2: grammar of graphics
Combining visualisations into layers
Surfacing R graphics in Power BI and SQL Server
Plotting big data using ML Server
Clustering, Segmentation, Anomaly Detection
Segmentation is the main application of unsupervised learning using clustering algorithms. You will also learn how to apply this technique for anomaly (outlier) detection and data preprocessing. Topics include:
Introduction to segmentation
Clustering algorithms (k-means, EM, hierarchical, and others)
Working with k-means
Preparing data for clustering, incl. categorical, non-numeric data
Informal yet practical introduction to Principal Component Analysis (PCA)
Validating cluster goodness of fit using plots and metrics
Anomaly detection with clustering, PCA and SVMs
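To make the k-means topic above concrete, here is a minimal sketch of the standard assign-and-update loop (Lloyd's algorithm). It is written in plain Python rather than R purely for illustration, since the course notes that most of its content also applies to Python programmers. The function name `kmeans`, its parameters, and the toy data are hypothetical and are not taken from the course material or from ML Server.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k observations picked at random as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the solution is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Two well-separated groups of made-up observations.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

In practice the inputs would usually be scaled first, and categorical attributes encoded numerically, because the Euclidean distance used here is sensitive to the units of each column.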
Without doubt, classifiers are the most important and most often used category of machine learning algorithms, and the foundation of algorithmic data science and of most of today’s Artificial Intelligence. We will focus on several variants of the most important classification algorithm, the decision tree, while progressively interpreting the results and improving its performance. After introducing neural networks and logistic regression we will also compare the performance of all of these classifiers on our test dataset. Topics include:
Introduction to classifiers
Two-class (binary) vs multi-class
Decision trees, forests, and boosting
Implementing simple decision trees in plain R
Visualising plain decision trees
Decision Forests and Boosting in ML Server
Overfitting (overtraining) concerns
Pruning and Complexity Penalty (CP), regularisation weight and other hyperparameters
Minimum support and the size of the tree
Avoiding overfitting through hyperparameter tuning
Implementing parallelised logistic regression on big data using ML Server
Validation of classifiers will be your key concern, because classifiers are used so often, and because their accuracy is not easy to balance with business requirements, such as restricted resources, or a required level of business performance. Building on your understanding of model validity (introduced in Part 1 of this course), you will learn how to balance an acceptable number of false positives with false negatives by using classification (confusion) matrices, metrics of precision and recall, by plotting ROC (Receiver Operating Characteristic) curves, and by measuring their business impact using profit and cost charts. Attendees have commented in the past that this is the most important module of the entire course. Topics include:
Charting precision-recall and sensitivity-specificity
Balancing precision-recall with business goals and constraints
ROC curves and lift charts in detail
Other measures of accuracy, including AUC, and F1 scores
Class imbalance problem (fraud analytics and rare event prediction)
What exactly does cross-validation tell us?
Measuring quality of cross-validation
Optimising binary classifier prediction probability thresholds for a given business target
Refining models to improve accuracy and reliability
Refining Complexity Penalty through cross-validation using caret package
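As an illustration of the ideas in this module, the sketch below counts the cells of a confusion matrix at a given probability threshold, derives precision, recall and F1 from them, and then picks the threshold that maximises a simple profit figure. It is plain Python with hypothetical function names, and the labels, scores, gain per true positive and cost per false positive are made-up numbers, not course data.

```python
def confusion_counts(actual, scores, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(actual, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(actual, scores, gain_tp, cost_fp):
    """Pick the observed score that maximises profit = gain * TP - cost * FP."""
    def profit(t):
        tp, fp, _, _ = confusion_counts(actual, scores, t)
        return tp * gain_tp - fp * cost_fp
    return max(sorted(set(scores)), key=profit)


# Made-up labels (1 = positive) and predicted probabilities.
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

tp, fp, fn, tn = confusion_counts(actual, scores, threshold=0.5)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
threshold = best_threshold(actual, scores, gain_tp=10, cost_fp=4)
print(tp, fp, fn, tn, precision, recall, f1, threshold)
```

Raising the assumed cost of a false positive pushes the chosen threshold up, trading recall for precision, which is the balance with business constraints that this module describes.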
Considered by some as the numerical equivalent of classifiers, regression is a large subject of its own. We will introduce its simple but very popular form, linear regression, followed by the Generalised Linear Model and other forms of regression, and finally, the more precise, but also prone-to-overfitting, decision tree variants. Topics include:
Introduction to simple regressions in R
Linear regression (classic)
Generalised Linear Models (GLM)
Dealing with non-normal data (Gamma distribution)
Ordinal and multinomial regressions
Advice on working with (star) ratings and Likert scales
Regression decision trees and other ensemble regression algorithms
Regression as a building block of other algorithms
Unlike classifiers, regressions are easier to assess. You will learn about basic tests of classical linear regressions that are easy to perform in R, and about measuring the quality of machine learning, non-linear regressions. Topics include:
Measuring linear regression quality
Homoscedasticity, multicollinearity and other concerns
Common diagnostic plots
Making prettier regression validation scatterplots in ggplot2
Measuring machine learning regression quality
R-squared (Coefficient of Determination), RMSE, MAE, RAE, RSE
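The regression quality measures listed above can be computed directly from actual and predicted values, as in this minimal plain-Python sketch. It assumes that RAE and RSE mean relative absolute error and relative squared error, as they do in Azure ML; in other R contexts RSE can instead mean residual standard error. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Common regression quality measures; assumes actual values are not all equal."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Baseline: errors of a naive model that always predicts the mean.
    total_abs = sum(abs(a - mean_actual) for a in actual)
    total_sq = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / total_sq
    return {
        "MAE": sum(abs_errors) / n,               # mean absolute error
        "RMSE": math.sqrt(sum(sq_errors) / n),    # root mean squared error
        "RAE": sum(abs_errors) / total_abs,       # relative absolute error
        "RSE": rse,                               # relative squared error
        "R2": 1 - rse,                            # coefficient of determination
    }


# Made-up observations and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in the units of the target, while RAE, RSE and R-squared compare the model against always predicting the mean, which makes them easier to compare across datasets.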
Deployment to Production
If you plan on using your models for prediction, rather than just for the exploration of data, or if you want to embed them as Artificial Intelligence in your applications, you need to deploy your models to production and maintain them on an on-going basis. Since we focus on the Microsoft ML Server and SQL ML Services (both Azure Database and Server), you will learn about the powerful and fast PREDICT T-SQL statement, and other supported mechanisms for deploying your models. We will also discuss how to deploy models as a web service, using these, and other Microsoft and non-Microsoft techniques. Topics include:
What needs to be deployed, and when?
PREDICT T-SQL statement
Model storage, management and serialisation concerns
Deploying web services using mrsdeploy and operationalisation server clusters
Consuming web services API from R
Consuming web services using Swagger and REST
On-going maintenance and model updates
Relationship to Azure ML
Please note: we reserve the right to amend the order of the modules to best suit the dynamic character of the class and to answer questions as they arise. Some subjects will only be covered if time allows, but your satisfaction is guaranteed.