Friday, 28 August 2015

Machine Learning with Windows Azure

This is a brief overview of how to use Machine Learning on the Azure cloud.

Log in to your Azure account, select Machine Learning and follow these steps:


  1. Create a new Experiment.
  2. Drag and drop datasets and modules, and modify their properties.
  3. Create a workflow. Pre-configured ML modules with R and Python scripts are available that:
    1. Follow best practices that may help to solve a problem in a specific domain.
    2. Have pre-built data schemas.
    3. Contain domain-specific data processing and feature engineering.
    4. Include training algorithms and metric calculations.

Workflow Concerns:

  • A Single Experiment:
    • can become too complex
    • is difficult to navigate
    • is easy to mess up
    • every iteration has to run through the entire graph
  • Multi-Step Workflow:
    • each step lives in a separate Experiment
    • steps are connected through saved datasets / readers / writers
    • each step can be iterated on separately

Machine Learning Templates from Azure Machine Learning Gallery

  • Text classification (text tagging, text categorisation, etc.), i.e. assigning a piece of text to one or more classes from a pre-defined set. For example:
    • categorise articles
    • organise web pages into hierarchical categories
    • filter email spam
    • user intent prediction for search queries
    • support ticket team routing
    • sentiment analysis
    • feedback analysis
  • Fraud detection
  • Retail forecasting
  • Predictive maintenance

Text Processing:

  1. Language detection
  2. Normalisation:
    1. Convert words into normalised forms:
      1. Lowercasing: The -> the
      2. Lemmatisation: mice -> mouse
      3. Stemming: plays -> play
  3. Stop-word removal: the, a, to, with, etc.
  4. Special character removal: for example, drop all non-alphanumeric characters
  5. Ignore numbers, emails, URLs, etc.
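
The steps above can be sketched in a few lines of plain Python. This is only an illustration, not what the Azure modules do internally: the stop-word list and the regular expressions are simplified assumptions, and language detection, lemmatisation and stemming are left out because they need language-specific resources.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "to", "with", "of", "and", "in", "is"}

# Simplified patterns for the token types that should be ignored.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text):
    """Lowercase, then drop URLs, emails, special characters, numbers and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()
    return [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]


sample = "The team plays 3 games; write to info@example.com or see http://example.com!"
print(preprocess(sample))
# ['team', 'plays', 'games', 'write', 'or', 'see']
```

URLs and emails are removed before the special characters, otherwise stripping the punctuation would break them into ordinary-looking words.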

Feature Extraction:

  • 'Bag-of-words': the document is treated as a collection of words, regardless of order and grammar
    • Split the text into bi-grams, tri-grams and, more generally, n-grams
    • Score how each of them correlates with your topic (e.g. sport, business, etc.)
  • Move from symbols to numbers:
    • term occurrence (number of times words or n-grams appear in document)
    • term frequency
    • Inverse Document Frequency (IDF)
      • the inverse of the fraction of documents in the training data set that contain the word or n-gram
      • downweights uninformative words like 'which', 'the', etc.
    • TF-IDF
      • words that are frequent in a document but appear in only a small number of documents get a high value
  • Dimensionality Reduction (the vocabulary used in the documents can be too large, the 'curse of dimensionality'). Options:
    • filter-based feature selection
    • wrapper-based feature selection
    • feature hashing
    • topic modelling (LDA)
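
To make the move from symbols to numbers concrete, here is a minimal sketch of n-gram splitting and TF-IDF weighting in plain Python. It assumes tf = count / document length and idf = log(N / df); other variants (smoothing, log-scaled tf) are just as common, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute a {term: weight} dictionary for each tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up toy corpus, already tokenised and lowercased.
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "market", "closed", "higher"],
    ["the", "team", "won", "the", "match"],
]
weights = tf_idf(docs)
print(weights[0]["the"])   # 0.0, because "the" occurs in every document
print(ngrams(docs[1], 2))  # ['the market', 'market closed', 'closed higher']
```

A word such as "market", which appears in a single document, ends up with a higher weight than "match", which appears in two.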

Model Training:

  • Binary-class Learners:
    • 2-class Logistic Regression
    • 2-class Support Vector Machine
    • 2-class Boosted Decision Tree
  • Multi-class Learners:
    • One-vs-All Multiclass
    • Multi-class Logistic Regression
    • Multi-class Decision Forest
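
The one-vs-all idea is easy to show in code: train one binary learner per class (that class against all the others) and predict the class whose learner is most confident. The sketch below is plain Python with a hand-written logistic regression, not the Azure module itself; the three-word vocabulary, the toy counts and the hyperparameters are assumptions chosen for illustration.

```python
import math


def train_binary(X, y, epochs=200, lr=0.5):
    """Fit a 2-class logistic regression with plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_one_vs_all(X, labels):
    """Train one binary model per class: that class against all the others."""
    return {
        c: train_binary(X, [1.0 if label == c else 0.0 for label in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    def score(model):
        w, b = model
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=lambda c: score(models[c]))


# Toy term counts over a hypothetical vocabulary: ["goal", "stock", "election"].
X = [[2, 0, 0], [1, 0, 0], [0, 2, 0], [0, 1, 0], [0, 0, 2], [0, 0, 1]]
labels = ["sport", "sport", "business", "business", "politics", "politics"]

models = train_one_vs_all(X, labels)
print(predict(models, [3, 0, 0]))  # sport
```

Any binary learner from the list above could take the place of `train_binary`; the wrapper only needs a score per class.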

Model Evaluation:

  • Split data into Train, Development and Test subsets
  • Compare and visualise the results of two trained models, e.g. one built on n-grams and one built on unigrams
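
A three-way split and a simple comparison metric could look like the sketch below. The 60/20/20 proportions, the fixed seed and the use of accuracy as the metric are assumptions; pick whatever suits your data.

```python
import random


def split_dataset(rows, dev_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train, dev and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


train, dev, test = split_dataset(range(10))
print(len(train), len(dev), len(test))  # 6 2 2
```

Tune and compare the two models on the development subset, and keep the test subset for a final check of the model you choose.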


Deploy your model on Azure as a web service. 



