Log in to your Azure account, select Machine Learning, and follow these steps:
- Create a new Experiment
- Drag and drop datasets and modules, and modify their properties
- Create a workflow. Pre-configured ML modules with R and Python scripts are available that:
  - follow best practices for solving problems in a specific domain
  - have pre-built data schemas
  - contain domain-specific data processing and feature engineering
  - include training algorithms and evaluation metrics
Workflow Concerns:
- A single Experiment:
  - can become too complex
  - is difficult to navigate
  - is easy to break
  - every iteration has to run through the entire graph
- A multi-step workflow:
  - keeps each step in a separate Experiment
  - uses saved datasets / readers / writers
  - lets iterations run through individual steps separately
Machine Learning Templates from Azure Machine Learning Gallery
- Text classification (text tagging, text categorisation, etc.). For example, assigning a piece of text to one or more classes from a pre-defined set:
  - categorise articles
  - organise web pages into hierarchical categories
  - filter email spam
  - predict user intent from search queries
  - route support tickets to teams
  - sentiment analysis
  - feedback analysis
- Fraud detection
- Retail forecasting
- Predictive maintenance
Text Processing:
- Language detection
- Normalisation:
  - Convert words into normalised forms:
    - Down-casing: The -> the
    - Stemming: plays -> play
    - Lemmatisation: mice -> mouse
  - Stop-word removal: the, a, to, with, etc.
  - Special-character removal: for example, ignore all non-alphanumeric characters
  - Ignore numbers, emails, URLs, etc.
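The normalisation steps above can be sketched in a few lines of Python. This is an illustrative helper, not an Azure ML module (in Azure ML Studio the Preprocess Text module covers most of this); the stop-word list and function name are made up, and stemming/lemmatisation would need a library such as NLTK:

```python
import re

# Hypothetical helper illustrating normalisation: down-casing,
# special-character removal, stop-word removal, ignoring numbers.
STOP_WORDS = {"the", "a", "an", "to", "with", "of", "and", "in"}  # tiny illustrative list

def normalise(text: str) -> list[str]:
    text = text.lower()                       # down-casing: The -> the
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop non-alphanumeric characters
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

print(normalise("The Player scored 3 goals, with EASE!"))
# ['player', 'scored', 'goals', 'ease']
```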
Feature Extraction:
- 'Bag-of-words': the document is treated as a set of words, regardless of order and grammar
- Split text into bi-grams, tri-grams, n-grams
- Score how each feature correlates with your topic (e.g. sport, business, etc.)
- Move from symbols to numbers:
  - term occurrence (the number of times a word or n-gram appears in a document)
  - term frequency
  - Inverse Document Frequency (IDF):
    - the inverse of the fraction of documents in the whole training set that contain the word or n-gram
    - downgrades the importance of uninformative words like 'which', 'the', etc.
  - TF-IDF:
    - words that are frequent within a document but appear in only a small number of documents achieve a high value
- Dimensionality reduction (the vocabulary used in the documents can be too large: the 'curse of dimensionality'). Options:
  - filter-based feature selection
  - wrapper-based feature selection
  - feature hashing
  - topic modelling (LDA)
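The n-gram and TF-IDF steps above can be written from scratch in a few lines. This is a toy sketch on made-up, already-tokenised documents (in Azure ML Studio the Extract N-Gram Features from Text module does this for you):

```python
import math

# Toy corpus: each document is already a tokenised bag of words.
docs = [
    ["goal", "match", "team"],
    ["market", "stock", "team"],
    ["goal", "goal", "match", "team"],
]

def ngrams(tokens, n):
    """Split a token list into n-grams (n=2 gives bi-grams, etc.)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)            # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf

# 'goal' is frequent in the last document but rare across the corpus -> high score
print(round(tf_idf("goal", docs[2], docs), 3))   # 0.203
# 'team' appears in every document -> IDF downgrades it to 0
print(tf_idf("team", docs[0], docs))             # 0.0
```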
Model Training:
- Two-class learners:
  - Two-Class Logistic Regression
  - Two-Class Support Vector Machine
  - Two-Class Boosted Decision Tree
- Multi-class learners:
  - One-vs-All Multiclass
  - Multiclass Logistic Regression
  - Multiclass Decision Forest
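To make the One-vs-All idea concrete, here is a toy sketch that trains one binary model per class and predicts with the highest-scoring one. The hand-rolled perceptron, features, and labels are all invented for illustration; in Azure ML Studio the modules listed above replace all of this:

```python
# One-vs-All: train one binary model per class ("this class" vs "everything
# else") and predict the class whose binary model scores highest.
# The perceptron below is just a stand-in for any 2-class learner.

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Binary learner: y in {0, 1}; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) > 0 else 0
            err = yi - pred
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def one_vs_all_train(X, y, classes):
    # one binary problem per class: relabel y as 1 for that class, 0 otherwise
    return {c: train_perceptron(X, [1 if yi == c else 0 for yi in y])
            for c in classes}

def one_vs_all_predict(models, x):
    def score(w):
        return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return max(models, key=lambda c: score(models[c]))

# Toy bag-of-words counts for the vocabulary ["goal", "stock", "cpu"]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = ["sport", "business", "tech"]
models = one_vs_all_train(X, y, ["sport", "business", "tech"])
print(one_vs_all_predict(models, [0, 0, 1]))  # tech
```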
Model Evaluation:
- Split the data into Train, Development, and Test subsets
- Compare and visualise the results of two trained models, e.g. one trained on n-grams and one on uni-grams
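A quick sketch of the split step (the 60/20/20 ratio and helper name are assumptions for illustration; in Azure ML Studio the Split Data module does this inside the graph):

```python
import random

def split_data(rows, seed=42):
    """Shuffle and split into Train (60%), Development (20%), Test (20%)."""
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_train, n_dev = int(0.6 * len(rows)), int(0.8 * len(rows))
    return rows[:n_train], rows[n_train:n_dev], rows[n_dev:]

train, dev, test = split_data(list(range(100)))
print(len(train), len(dev), len(test))  # 60 20 20
```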
Finally
Deploy your model on Azure as a web service.
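The deployed web service is then called over REST. This sketch only builds the request and does not send it; the URL, API key, and input column names are placeholders you would copy from your own service's API help page, and the payload shape follows the Azure ML Studio (classic) request-response pattern as I understand it:

```python
import json
import urllib.request

def build_request(url: str, api_key: str, text: str) -> urllib.request.Request:
    # "Inputs" / "GlobalParameters" is the classic Azure ML request shape;
    # check your service's API help page for the exact column names.
    payload = {
        "Inputs": {"input1": {"ColumnNames": ["text"], "Values": [[text]]}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + api_key},
    )

# req = build_request("https://<region>.services.azureml.net/.../execute?api-version=2.0",
#                     "<your-api-key>", "great product, would buy again")
# urllib.request.urlopen(req) would then return the scored result as JSON
```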
Easy!
:)