Log in to your Azure account, select Machine Learning, and follow these steps:
- Create a new Experiment
- Drag and drop datasets and modules, and modify their properties
- Create a workflow. Pre-configured ML modules with R and Python scripts are available that:
  - follow best practices for solving problems in a specific domain
  - have pre-built data schemas
  - contain domain-specific data processing and feature engineering
  - include training algorithms and evaluation metrics
Workflow Concerns:
- A single Experiment:
  - can become too complex
  - is difficult to navigate
  - is easy to break
  - every iteration has to run through the entire graph
- A multi-step workflow:
  - keeps each step in a separate Experiment
  - uses saved datasets / readers / writers
  - lets iterations run through individual steps separately
Machine Learning Templates from Azure Machine Learning Gallery
- Text classification (text tagging, text categorisation, etc.). For example, assigning a piece of text to one or more classes from a pre-defined set:
  - categorise articles
  - organise web pages into hierarchical categories
  - filter email spam
  - predict user intent from search queries
  - route support tickets to teams
  - sentiment analysis
  - feedback analysis
- Fraud detection
- Retail forecasting
- Predictive maintenance
Text Processing:
- Language detection
- Normalisation:
  - Convert words into normalised forms:
    - Down-casing: The -> the
    - Stemming: plays -> play
    - Lemmatisation: mice -> mouse
  - Stop-word removal: the, a, to, with, etc.
  - Special-character removal: for example, ignore all non-alphanumeric characters
  - Ignore numbers, emails, URLs, etc.
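The normalisation steps above can be sketched in a few lines of Python. This is an illustrative helper, not an Azure ML module (in Azure ML Studio the Preprocess Text module covers most of this); the stop-word list and function name are made up, and stemming/lemmatisation would need a library such as NLTK:

```python
import re

# Hypothetical helper illustrating normalisation: down-casing,
# special-character removal, stop-word removal, ignoring numbers.
STOP_WORDS = {"the", "a", "an", "to", "with", "of", "and", "in"}  # tiny illustrative list

def normalise(text: str) -> list[str]:
    text = text.lower()                       # down-casing: The -> the
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop non-alphanumeric characters
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

print(normalise("The Player scored 3 goals, with EASE!"))
# ['player', 'scored', 'goals', 'ease']
```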
Feature Extraction:
- 'Bag-of-words': the document is treated as a set of words, regardless of order and grammar
- Split text into bi-grams, tri-grams, n-grams
- Score how each feature correlates with your topic (e.g. sport, business, etc.)
- Move from symbols to numbers:
  - term occurrence (the number of times a word or n-gram appears in a document)
  - term frequency
  - Inverse Document Frequency (IDF):
    - the inverse of the fraction of documents in the whole training set that contain the word or n-gram
    - downgrades the importance of uninformative words like 'which', 'the', etc.
  - TF-IDF:
    - words that are frequent within a document but appear in only a small number of documents achieve a high value
- Dimensionality reduction (the vocabulary used in the documents can be too large: the 'curse of dimensionality'). Options:
  - filter-based feature selection
  - wrapper-based feature selection
  - feature hashing
  - topic modelling (LDA)
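The n-gram and TF-IDF steps above can be written from scratch in a few lines. This is a toy sketch on made-up, already-tokenised documents (in Azure ML Studio the Extract N-Gram Features from Text module does this for you):

```python
import math

# Toy corpus: each document is already a tokenised bag of words.
docs = [
    ["goal", "match", "team"],
    ["market", "stock", "team"],
    ["goal", "goal", "match", "team"],
]

def ngrams(tokens, n):
    """Split a token list into n-grams (n=2 gives bi-grams, etc.)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)            # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf

# 'goal' is frequent in the last document but rare across the corpus -> high score
print(round(tf_idf("goal", docs[2], docs), 3))   # 0.203
# 'team' appears in every document -> IDF downgrades it to 0
print(tf_idf("team", docs[0], docs))             # 0.0
```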
Model Training:
- Two-class learners:
  - Two-Class Logistic Regression
  - Two-Class Support Vector Machine
  - Two-Class Boosted Decision Tree
- Multi-class learners:
  - One-vs-All Multiclass
  - Multiclass Logistic Regression
  - Multiclass Decision Forest
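To make the One-vs-All idea concrete, here is a toy sketch that trains one binary model per class and predicts with the highest-scoring one. The hand-rolled perceptron, features, and labels are all invented for illustration; in Azure ML Studio the modules listed above replace all of this:

```python
# One-vs-All: train one binary model per class ("this class" vs "everything
# else") and predict the class whose binary model scores highest.
# The perceptron below is just a stand-in for any 2-class learner.

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Binary learner: y in {0, 1}; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) > 0 else 0
            err = yi - pred
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def one_vs_all_train(X, y, classes):
    # one binary problem per class: relabel y as 1 for that class, 0 otherwise
    return {c: train_perceptron(X, [1 if yi == c else 0 for yi in y])
            for c in classes}

def one_vs_all_predict(models, x):
    def score(w):
        return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return max(models, key=lambda c: score(models[c]))

# Toy bag-of-words counts for the vocabulary ["goal", "stock", "cpu"]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = ["sport", "business", "tech"]
models = one_vs_all_train(X, y, ["sport", "business", "tech"])
print(one_vs_all_predict(models, [0, 0, 1]))  # tech
```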
Model Evaluation:
- Split the data into Train, Development, and Test subsets
- Compare and visualise the results of two trained models, e.g. one trained on n-grams and one on uni-grams
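A quick sketch of the split step (the 60/20/20 ratio and helper name are assumptions for illustration; in Azure ML Studio the Split Data module does this inside the graph):

```python
import random

def split_data(rows, seed=42):
    """Shuffle and split into Train (60%), Development (20%), Test (20%)."""
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_train, n_dev = int(0.6 * len(rows)), int(0.8 * len(rows))
    return rows[:n_train], rows[n_train:n_dev], rows[n_dev:]

train, dev, test = split_data(list(range(100)))
print(len(train), len(dev), len(test))  # 60 20 20
```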
Finally
Deploy your model on Azure as a web service.
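The deployed web service is then called over REST. This sketch only builds the request and does not send it; the URL, API key, and input column names are placeholders you would copy from your own service's API help page, and the payload shape follows the Azure ML Studio (classic) request-response pattern as I understand it:

```python
import json
import urllib.request

def build_request(url: str, api_key: str, text: str) -> urllib.request.Request:
    # "Inputs" / "GlobalParameters" is the classic Azure ML request shape;
    # check your service's API help page for the exact column names.
    payload = {
        "Inputs": {"input1": {"ColumnNames": ["text"], "Values": [[text]]}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + api_key},
    )

# req = build_request("https://<region>.services.azureml.net/.../execute?api-version=2.0",
#                     "<your-api-key>", "great product, would buy again")
# urllib.request.urlopen(req) would then return the scored result as JSON
```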
Easy!
:)