See PDF Version for the schedule.
MultiStrategy Ensemble Learning, Ensembles of Bayesian Classifiers, and the Problem of False Discoveries
Abstract
This talk covers an ensemble of my research contributions that I
believe are likely to resonate with a current audience.
Ensemble Learning combines the predictions of multiple classifiers to enhance accuracy relative to any individual classifier. I will show that combining established ensemble learning techniques further enhances accuracy without computational overhead.
Naive Bayes is a popular approach to classification learning due to its computational efficiency, strong theoretical foundation and capacity to predict probabilities rather than just the most probable outcome. I will present a simple extension that creates an ensemble of naive-Bayes-like classifiers, improving naive Bayes' accuracy without undue computational burden.
Finally, I will discuss false discoveries, a problem that plagues many modern pattern discovery systems. Quite simply, many state-of-the-art approaches to pattern discovery are prone to 'discover' patterns that do not exist. I will explain why this is so and discuss approaches to overcome the problem.
Speaker Biography
Geoff Webb holds a research chair in the Faculty of Information
Technology at Monash University, where he heads the Centre for
Research in Intelligent Systems. Prior to Monash he held appointments
at Griffith University and then Deakin University, where he received a
personal chair. His primary research areas are machine learning, data
mining, and user modelling. He is known for his contribution to the
debate about the application of Occam's razor in machine learning and
for the development of numerous methods, algorithms and techniques for
machine learning, data mining and user modelling. His commercial data
mining software, Magnum Opus, is marketed internationally by Rulequest
Research. Many of his learning algorithms are included in the
widely-used Weka machine learning workbench. He is editor-in-chief of
the highest impact data mining journal, Data Mining and Knowledge
Discovery, co-editor of the Encyclopedia of Machine Learning (to be
published by Springer) and a member of the editorial boards of Machine
Learning and ACM Transactions on Knowledge Discovery in Data.
2) Prof David Powers, Flinders University of South Australia
Minors as Miners: Modelling and Evaluating Ontological and Linguistic Learning
Abstract
Growing up is in large measure learning about the world
and our social and linguistic environment. We might call
this data mining, although it is far more multimodal and
immersive than most applications. This paper describes
computational research into how children learn, with a
particular focus on evaluation in both supervised and
unsupervised paradigms.
Conversely, we gain additional insight into association mining by considering psycholinguistic experiments that quantify how human associations, by both adults and children, relate to a variety of association measures. Learning and evaluation are not dealt with in isolation; rather, a program of formal and application-based evaluation is expounded and exemplified to show how to evaluate discovered patterns with and without a gold standard. In this context, some serious issues with current evaluation techniques and accuracy measures are highlighted and unbiased techniques are identified.
Speaker Biography
David Powers is Professor of Computer Science and Director of the
Artificial Intelligence and Language Technology Laboratories at
Flinders University. Since the 1970s, David has been focused on the
idea of getting computers to communicate in everyday language, and to
learn about the world like babies. This includes learning about the
sound systems and grammars of languages as well as about the way
meaning connects to the world. For this reason, much of David's focus
has been on using real and simulated robots to ground meaning, and
more recently the Thinking Head.
David has also worked on developing psychologically plausible models of child learning, using techniques from neuropsychology to monitor and understand the learning process. However, much of David's work concerns user-centric applications of his research, including several products in various stages of commercialization. Applications include controlling your home or your wheelchair by talking or thinking; searching the web by exploring the universe Star Trek style; and correcting typing, recognition and translation errors using syntactic and semantic information.
Volume, Velocity and Variety - Key Challenges for Mining Large Volumes of Multimedia Information
Abstract
New challenges are emerging, as both government and commercial
organisations attempt to exploit the potentially important information
in their ever increasing volumes of collected data. This presentation
will focus on some of the major challenges involved in the processing
and analysis of large multimedia databases. The presentation will
present and discuss a range of data mining and visual analytic tools
and techniques that DSTO have either developed or acquired to help
organisations uncover potentially interesting patterns of behaviours,
trends, links and associations that exist in their data.
Speaker Biography
After completing his Ph.D. in Mathematics at the University of
Hertfordshire (UK) in 1987, Richard joined Logica Space and Defence
Systems in London where he worked as a mathematical modeller. In 1991
he emigrated to Australia to join the Defence Science and Technology
Organisation, where he has spent the last seventeen years working on
intelligence-related R&D. Richard is currently the Head of
Intelligence Analysis Discipline at DSTO, a research group that
provides IT-related scientific advice to the Australian Intelligence
Community and allied agencies.
Jiuyong Li, Peter Christen, Vladimir Estivill-Castro and Artak Amirbekyan
Various organisations, such as hospitals, medical administrations and insurance companies, have collected large amounts of data over the years. However, the gold nuggets in these data are unlikely to be discovered if the data remain locked in data custodians' storage. A major risk of sharing data among different organisations is revealing the private information of individuals in the data.
Data sanitation is not enough for protecting privacy in data. Data anonymisation is often used in data publishing to minimise the risk of revealing private information. Many models have been proposed for data anonymisation in the last few years. These models ensure that the probability of identifying an individual or learning her sensitive information is less than a maximal threshold. In many cases, optimal anonymisation is computationally infeasible. Many efficient algorithms have been proposed to anonymise data for various applications. Significant progress has been made in data anonymisation, but many challenges remain. A typical challenge is to balance strong protection and good data utility.
A major task for data sharing is data linkage (also called data matching or entity resolution), since useful information is normally spread over various data sources, possibly across several organisations. Several protocols and methods have been developed in the past decade to link separate data sets without the data sources having to reveal identifying information. Significant developments have been made in the automatic linking of large-scale and distributed data sets. However, many challenges still have to be solved before privacy-preserving data linkage can be applied to match large real-world data collections in practice.
An alternative approach to data publishing is secure data exchange. Secure Multiparty Computation (SMC) based techniques have been widely used to compute aggregated results from multiple parties without revealing anything from any individual party. However, there are many challenges here. Many solutions have proved very difficult to implement, even for a task as simple as checking which of two numbers is the larger. Data mining must be efficient and secure for very large datasets; therefore, it seems that expertise is required to ensure that, in implementations, the information that is leaked is innocuous.
In this tutorial, we will discuss fundamental models and protocols, major technologies, current developments and research challenges in the above three directions.
1. Introduction to privacy and data sharing
2. Data anonymisation
3. Privacy preserving data linkage
4. SMC based data mining
Accepted Papers
The programme will be available here soon.