Friday, 28 August 2015

Machine Learning with Windows Azure

This is a brief overview of using Machine Learning on the Azure cloud.

Log in to your Azure account, select Machine Learning, and follow these steps:

  1. Create new Experiment
  2. Drag-n-drop datasets and modules and modify their properties
  3. Create a workflow. Pre-configured ML modules with R and Python scripts are available that:
    1. Follow best practices that may help solve a problem in a specific domain.
    2. Have pre-built data schemas.
    3. Contain domain specific data processing and feature engineering.
    4. Include training algorithms and calculation metrics.

Workflow Concerns:

  • A Single Experiment:
    • could be too complex
    • difficult to navigate
    • easy to mess up
    • iterations have to go through the entire graph
  • Multi-Step Workflow:
    • each step in a separate Experiment
    • uses saved datasets / readers / writers
    • iterations come through individual steps separately

Machine Learning Templates from Azure Machine Learning Gallery

  • Text classification (text tagging, text categorisation, etc). For example, assign a piece of text to one or more classes from a pre-defined set:
    • categorise articles
    • organise web pages into hierarchical categories
    • filter email spam
    • search query user intent prediction
    • support ticket team routing
    • sentiment analysis
    • feedback analysis
  • Fraud detection
  • Retail forecasting
  • Predictive maintenance

Text Processing:

  1. Language detection
  2. Normalisation:
    1. Convert words into normalised ones:
      1. Down case: The -> the
      2. Lemmatisation: mice -> mouse
      3. Stemming: plays -> play
  3. Stop-words removal: the, a, to, with, etc
  4. Special character removal: for example, ignore all Non-Alpha-Numeric
  5. Ignore numbers, emails, URLs, etc
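The normalisation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list and regexes are tiny, illustrative assumptions, and lemmatisation/stemming are omitted since they typically need a dedicated library.

```python
import re

# Illustrative (deliberately tiny) stop-word list.
STOP_WORDS = {"the", "a", "an", "to", "with", "of", "and", "in"}

def normalise(text):
    text = text.lower()                                 # down case: The -> the
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text)   # drop emails and URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # drop numbers and non-alpha chars
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word removal

print(normalise("The URL https://example.com got 42 hits, with a spike!"))
# -> ['url', 'got', 'hits', 'spike']
```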

Feature Extraction:

  • 'Bag-of-words' - the document is treated as a set of words, disregarding order and grammar
    • Split text into Bi-grams, tri-grams, n-grams
    • Score how it correlates with your topic (e.g. sport, business, etc)
  • Move from symbols to numbers:
    • term occurrence (number of times words or n-grams appear in document)
    • term frequency
    • Inverse Document Frequency (IDF)
      • the inverse of the rate of documents that contain the word or n-gram within the whole training set of documents
      • downgrades the importance of useless words like 'which', 'the', etc
    • TF-IDF
      • frequent words that appear only in a small number of documents achieve high value
  • Dimensionality Reduction (the vocabulary used in documents could be too large - the 'curse of dimensionality'). Options:
    • filter-based feature selection
    • wrapper-based feature selection
    • feature hashing
    • topic modelling (LDA)
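A from-scratch sketch of the TF, IDF and TF-IDF scores described above may make the intuition concrete. The toy corpus and the +1 smoothing term are assumptions for illustration; a real experiment would use a library vectoriser.

```python
import math

# Toy corpus: three tokenised documents (two 'sport', one 'business').
docs = [
    ["match", "won", "by", "the", "team"],
    ["stock", "prices", "fell", "in", "the", "market"],
    ["the", "team", "signed", "a", "new", "player"],
]

def tf(term, doc):
    # Term frequency: occurrences within one document, length-normalised.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse of the rate of documents containing the term;
    # +1 smoothing avoids division by zero for unseen terms.
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + containing))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# 'the' appears in every document, so its TF-IDF is driven down, while
# 'stock' appears in only one document and scores higher.
print(tf_idf("the", docs[1], docs), tf_idf("stock", docs[1], docs))
```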

Model Training:

  • Binary-class Learners:
    • 2-class Logistic Regression
    • 2-class Support Vector Machine
    • 2-class Boosted Decision Tree
  • Multi-class Learners:
    • One-vs-All Multiclass
    • Multi-class Logistic Regression
    • Multi-class Decision Forest
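The One-vs-All idea above can be sketched as follows: one binary learner is trained per class, and prediction picks the class whose learner scores highest. The tiny gradient-descent logistic regression below stands in for any 2-class learner; it is an illustrative assumption, not Azure ML's implementation.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=200):
    # Minimal 2-class logistic regression trained with per-sample SGD.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def score(model, x):
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_one_vs_all(X, y, classes):
    # One binary learner per class: label 1 for "this class", 0 for the rest.
    return {c: train_logistic(X, [1 if yi == c else 0 for yi in y]) for c in classes}

def predict(models, x):
    return max(models, key=lambda c: score(models[c], x))

# Toy 2-feature data with three linearly separable classes.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [0, 5], [1, 6]]
y = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_all(X, y, {"a", "b", "c"})
print([predict(models, x) for x in X])
```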

Model Evaluation:

  • Split data into Train, Development and Test subsets
  • Compare and visualise the results of two trained models, e.g. n-grams vs. uni-grams
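The Train/Development/Test split might look like this in Python; the 70/15/15 ratio and fixed seed are assumptions for illustration, not values from the post.

```python
import random

def split(data, dev_frac=0.15, test_frac=0.15, seed=42):
    # Shuffle with a fixed seed so the split is reproducible, then slice.
    data = list(data)
    random.Random(seed).shuffle(data)
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    dev = data[:n_dev]
    test = data[n_dev:n_dev + n_test]
    train = data[n_dev + n_test:]
    return train, dev, test

train, dev, test = split(range(100))
print(len(train), len(dev), len(test))  # -> 70 15 15
```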

Finally

Deploy your model on Azure as a web service. 

Easy!
:)


Wednesday, 26 August 2015

Content Enrichment

Content enrichment is about manipulating crawled content before it is added to the search index. For example, add a sentiment analysis score to indexed social activity.
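The sentiment example might be sketched as a pipeline stage that enriches a crawled document before it reaches the index. The document shape, field names, and the trivial word-list scorer below are all illustrative assumptions, not a real crawler or indexing API.

```python
# Tiny illustrative word lists standing in for a real sentiment model.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "broken", "hate"}

def enrich(doc):
    # Add a sentiment score to the crawled document; the enriched copy
    # (not the original) is what would be sent to the search index.
    words = doc["body"].lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {**doc, "sentiment": score}

doc = {"id": 1, "body": "Love the great new release"}
print(enrich(doc))  # original doc is left unmodified
```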

Supporting information that often accompanies enriched content includes:

  • Version control
  • Technical metadata (formats, format versions, validation rules, etc)
  • Provenance data (processing history)

Questions:


  • How standardised is the enrichment information?
  • How volatile is enriched information?
  • When is the content enhanced (by author, during submission, during editorial, etc)?
  • Where does enhanced information live (embedded, externally)?

Key challenges:


  • What is the master source/copy of the information?
  • Is the information normalised or de-normalised (repeating parent metadata across child elements)?
  • How is the information synchronised across multiple systems?




Online Encyclopedia of Statistical Science (Free)
