The Australasian Data Mining Conference: AusDM 2008

Stamford Grand, Glenelg, Adelaide, 27-28 November 2008

http://ausdm08.togaware.com/

See PDF Version for the schedule.

Keynote Presentations

1) Prof Geoff Webb, Monash University

MultiStrategy Ensemble Learning, Ensembles of Bayesian Classifiers, and the Problem of False Discoveries

Abstract
This talk covers an ensemble of my research contributions that I believe are likely to resonate with a current audience.

Ensemble Learning combines the predictions of multiple classifiers to enhance accuracy relative to any individual classifier. I will show that combining established ensemble learning techniques further enhances accuracy without computational overhead.

Naive Bayes is a popular approach to classification learning due to its computational efficiency, strong theoretical foundation and capacity to predict probabilities rather than just the most probable outcome. I will present a simple extension that creates an ensemble of naive-Bayes-like classifiers, improving on naive Bayes' accuracy without undue computational burden.

Finally, I will discuss false discoveries, a problem that plagues many modern pattern discovery systems. Quite simply, many state-of-the-art approaches to pattern discovery are prone to 'discover' patterns that do not exist. I will explain why this is so and discuss approaches to overcome the problem.

Speaker Biography
Geoff Webb holds a research chair in the Faculty of Information Technology at Monash University, where he heads the Centre for Research in Intelligent Systems. Prior to Monash he held appointments at Griffith University and then Deakin University, where he received a personal chair. His primary research areas are machine learning, data mining, and user modelling. He is known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous methods, algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, is marketed internationally by Rulequest Research. Many of his learning algorithms are included in the widely-used Weka machine learning workbench. He is editor-in-chief of the highest impact data mining journal, Data Mining and Knowledge Discovery, co-editor of the Encyclopedia of Machine Learning (to be published by Springer) and a member of the editorial boards of Machine Learning and ACM Transactions on Knowledge Discovery in Data.


2) Prof David Powers, Flinders University of South Australia

Minors as Miners: Modelling and Evaluating Ontological and Linguistic Learning

Abstract
Growing up is in large measure learning about the world and our social and linguistic environment. We might call this data mining, although it is far more multimodal and immersive than most applications. This paper describes computational research into how children learn, with a particular focus on evaluation in both supervised and unsupervised paradigms.

Conversely, we gain additional insight into association mining by considering psycholinguistic experiments that quantify how human association, by both adults and children, relates to a variety of association measures. Learning and evaluation are not dealt with in isolation; rather, a program of formal and application-based evaluation is expounded and exemplified to show how to evaluate discovered patterns with and without a gold standard. In this context, some serious issues with current evaluation techniques and accuracy measures are identified, and unbiased techniques are identified to address them.

Speaker Biography
David Powers is Professor of Computer Science and Director of the Artificial Intelligence and Language Technology Laboratories at Flinders University. Since the 1970s, David has been focused on the idea of getting computers to communicate in everyday language, and to learn about the world like babies. This includes learning about the sound systems and grammars of languages as well as about the way meaning connects to the world. For this reason, much of David's focus has been on using real and simulated robots to ground meaning, and more recently the Thinking Head.

David has also worked on developing psychologically plausible models of child learning, using techniques from neuropsychology to monitor and understand the learning process. However, much of David's research is about user-centric applications of his research, including several products in various stages of commercialization. Applications include controlling your home or your wheelchair by talking or thinking; searching the web by exploring the universe star-trek style; and correcting typing, recognition and translation errors using syntactic and semantic information.


3) Dr Richard Price, Defence Science and Technology Organisation

Volume, Velocity and Variety - Key Challenges for Mining Large Volumes of Multimedia Information

Abstract
New challenges are emerging as both government and commercial organisations attempt to exploit the potentially important information in their ever-increasing volumes of collected data. This presentation will focus on some of the major challenges involved in the processing and analysis of large multimedia databases. It will present and discuss a range of data mining and visual analytic tools and techniques that DSTO have either developed or acquired to help organisations uncover potentially interesting patterns of behaviour, trends, links and associations that exist in their data.

Speaker Biography
After completing his PhD in Mathematics at the University of Hertfordshire (UK) in 1987, Richard joined Logica Space and Defence Systems in London, where he worked as a mathematical modeller. In 1991 he emigrated to Australia to join the Defence Science and Technology Organisation, where he has spent the last seventeen years working on intelligence-related R&D. Richard is currently the Head of Intelligence Analysis Discipline at DSTO, a research group that provides IT-related scientific advice to the Australian Intelligence Community and allied agencies.


 

Tutorial

Privacy preserving data sharing and mining

Jiuyong Li, Peter Christen, Vladimir Estivill-Castro and Artak Amirbekyan

Various organisations, such as hospitals, medical administrations and insurance companies, have collected a large amount of data over the years. However, the gold nuggets in these data are unlikely to be discovered if the data remain locked in the data custodians' storage. A major risk of sharing data among different organisations is revealing the private information of individuals in the data.

Data sanitisation alone is not enough to protect privacy in data. Data anonymisation is often used in data publishing to minimise the risk of revealing private information. Many models have been proposed for data anonymisation in the last few years. These models ensure that the probability of identifying an individual or learning her sensitive information is less than a maximal threshold. In many cases, optimal anonymisation is computationally infeasible. Many efficient algorithms have been proposed to anonymise data for various applications. Significant progress has been made in data anonymisation, but many challenges remain. A typical challenge is to balance strong protection against good data utility.
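As a concrete illustration of the threshold idea, one widely cited anonymisation model is k-anonymity, under which every record must share its quasi-identifier values with at least k-1 other records, so that a record can be singled out with probability at most 1/k. The abstract does not say which models the tutorial covers, so this is only a minimal sketch: the function names, field names and sample records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty data set)."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical, already generalised records (age bands, masked postcodes).
records = [
    {"age": "30-39", "postcode": "50**", "diagnosis": "flu"},
    {"age": "30-39", "postcode": "50**", "diagnosis": "asthma"},
    {"age": "40-49", "postcode": "51**", "diagnosis": "flu"},
    {"age": "40-49", "postcode": "51**", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["age", "postcode"], k=2))
print(is_k_anonymous(records, ["age", "postcode"], k=3))
```

The check only verifies the property. Finding a generalisation that satisfies it while losing as little information as possible is the hard part, which is where the protection-versus-utility trade-off mentioned above arises.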

A major task for data sharing is data linkage (also called data matching or entity resolution), since useful information is normally spread across various data sources, possibly in several organisations. Several protocols and methods have been developed in the past decade to link separate data sets without requiring the data sources to reveal identifying information. Significant advances have been made in the automatic linking of large-scale and distributed data sets. However, many challenges still have to be solved before privacy-preserving data linkage can be applied to match large real-world data collections in practice.
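One simple way to picture linkage without exchanging identifying values is for each data source to replace its identifiers with keyed hashes before comparison. The sketch below assumes that the two sources share a secret key that the matching party never sees, and that exact matching after light normalisation is good enough. It is not one of the protocols referred to above, and all names, records and the key are hypothetical.

```python
import hashlib
import hmac


def encode(value, secret_key):
    """Keyed hash (HMAC-SHA256) of a normalised identifying string."""
    normalised = " ".join(value.lower().split())
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def encode_records(records, secret_key):
    """Map each keyed hash to the data source's own local record id."""
    return {encode(value, secret_key): record_id for record_id, value in records.items()}


def match(encoded_a, encoded_b):
    """Pairs of local record ids whose keyed hashes agree."""
    common = encoded_a.keys() & encoded_b.keys()
    return sorted((encoded_a[code], encoded_b[code]) for code in common)


# Hypothetical key shared by the two data sources only.
key = b"shared-secret-for-illustration"
hospital = {"h1": "Jane  Smith 1970-01-01", "h2": "Bob Lee 1982-05-17"}
insurer = {"i7": "jane smith 1970-01-01", "i9": "Ann Wu 1990-11-30"}

print(match(encode_records(hospital, key), encode_records(insurer, key)))
```

Exact hashing of this kind breaks on a single typing error and offers no protection if the key leaks, which hints at why approximate matching and scalability remain open challenges.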

An alternative to data publishing is secure data exchange. Techniques based on Secure Multiparty Computation (SMC) have been widely used to compute aggregated results from multiple parties without revealing any party's private input. However, there are many challenges here. Many solutions have proved very difficult to implement, even for a task as simple as checking which of two numbers is larger. Data mining must be both efficient and secure for very large datasets, so expertise is required to ensure that whatever information an implementation does leak is innocuous.
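A textbook example of computing an aggregate without revealing individual inputs is a secure sum based on additive secret sharing. The sketch below simulates all parties inside one process, assumes that the parties follow the protocol honestly, and assumes non-negative inputs whose total is smaller than the modulus. The modulus and function names are illustrative choices, not taken from the tutorial.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative; must exceed the largest possible total


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that add up to it modulo modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the protocol: party i sends share j of its value to party j,
    each party publishes only the sum of the shares it received, and the
    published partial sums add up to the total."""
    n = len(private_values)
    all_shares = [make_shares(value, n, modulus) for value in private_values]
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    return sum(partial_sums) % modulus


print(secure_sum([120, 45, 300]))
```

Each individual share is a uniformly random number, so a party learns nothing about another party's value from the single share it receives. Operations such as comparison have no equally simple protocol, in line with the implementation difficulties noted above.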

In this tutorial, we will discuss fundamental models and protocols, major technologies, current developments and research challenges in the above three directions.

Outline

1. Introduction to privacy and data sharing

2. Data anonymisation

3. Privacy preserving data linkage

4. SMC based data mining



Accepted Papers

The program will be available here soon.

  1. Utilizing WiFi Signals in Large-scale Indoor Spatial Infrastructure with Robust Probabilistic Models
    Kha Tran
  2. Evaluation of Malware clustering based on its dynamic behaviour
    Ibai Gurrutxaga, Olatz Arbelaitz, Jesus M. Perez, Javier Muguerza, Jose I. Martin, and Inigo Perona
  3. Service-independent payload analysis to improve intrusion detection in network traffic
    Inigo Perona, Ibai Gurrutxaga, Olatz Arbelaitz, Jose I. Martin, Javier Muguerza, and Jesus M. Perez
  4. ShrFP-Tree: An Efficient Tree Structure for Mining Share-Frequent Patterns
    Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, and Young-Koo Lee
  5. On Inconsistencies in Quantifying Strength of Community Structures
    Wen Haw Chong
  6. Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
    Peter Christen and Ross Gayler
  7. A New Technique for Evaluating Imbalanced Datasets
    Cheng Weng and Josiah Poon
  8. Classification of Brain-Computer Interface Data
    Omar Al Zoubi, Irena Koprinska and Rafael Calvo
  9. LBR-Meta: An Efficient Algorithm for Lazy Bayesian Rules
    Zhipeng Xie
  10. Rare Association Rule Mining via Transaction Clustering
    Yun Sing Koh and Russel Pears
  11. Identifying Stock Similarity Based on Multi-event Episodes
    Abhi Dattasharma, Praveen Tripathi, and Sridhar Gangadharpalli
  12. Priority Driven K-Anonymisation for Privacy Protection
    Xiaoxun Sun, Hua Wang, and Jiuyong Li
  13. Clustering and Classification of Maintenance
    Brett Edwards and Richi Nayak
  14. S2MP: Similarity Measure for Sequential Patterns
    Hassan Saneifar, Sandra Bringay, Anne Laurent, and Maguelonne Teisseire
  15. Graphics Hardware based Efficient and Scalable Fuzzy C-Means Clustering
    S.A. Arul Shalom, Manoranjan Dash, and Minh Tue
  16. Kernel-based visualisation of genes with the Gene Ontology
    Hamid Ghous, Paul Kennedy, Daniel Catchpoole, and Simeon Simoff
  17. Combining Structure and Content Similarities for XML Document Clustering
    Tien Tran and Richi Nayak
  18. Customer Event Rate Estimation Using Particle Filters
    Harsha Honnappa
  19. Structure-Based Document Model with Discrete Wavelet Transforms and Its Application to Document Classification
    Supphachai Thaicharoen, Tom Altman, and Krzysztof Cios
  20. Comparison of visualization methods of genome-wide SNP profiles in childhood acute lymphoblastic leukemia
    Ahmad Al-Oqaily, Paul Kennedy, Daniel Catchpoole, and Simeon Simoff
  21. wFDT - Weighted Fuzzy Decision Trees for Prognosis of Breast Cancer Survivability
    Umer Khan, Hyunjung Shin, and Minkoo Kim
  22. Categorical Proportional Difference: A Feature Selection Method for Text Categorization
    Mondelle Simeon and Robert Hilderman
  23. Exploratory Mining over Organisational Communications Data
    Alan Allwright and John Roddick
  24. Mining Medical Specialist Billing Patterns for Health Service Management
    Yin Shan, David Jeacocke, D. Wayne Murray, and Alison Sutinen