Bio-medical Data Mining with Visual Data Flow

J-D. Fekete, T-N. Do

---oOo---

Dataset          #Datapoints   #Dimensions   #Classes   Evaluation method
Colon tumor      62            2000          2          leave-1-out
Ovarian Cancer   253           15154         2          10-fold
Lung Cancer      181           12533         2          32 training, 149 testing
Leukemia         72            7129          2          38 training, 34 testing
Table 1. Dataset description
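
The leave-1-out and 10-fold evaluation methods listed in Table 1 can be illustrated with a minimal sketch. The helper names `leave_one_out` and `k_fold` are hypothetical, and the folds here are contiguous and unshuffled; how the folds were actually drawn (shuffling, stratification) is not stated on this page.

```python
def leave_one_out(n):
    """Yield (train_indices, test_indices), holding out each sample once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def k_fold(n, k=10):
    """Yield (train, test) over k contiguous folds whose sizes differ by at most 1.

    Assumption: no shuffling or stratification is applied.
    """
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Sample counts taken from Table 1.
colon_splits = list(leave_one_out(62))   # Colon tumor: 62 rounds, 1 test sample each
ovarian_splits = list(k_fold(253, 10))   # Ovarian Cancer: 10 rounds

print(len(colon_splits), len(ovarian_splits))
print([len(test) for _, test in ovarian_splits])
```

With 253 samples and 10 folds, three folds hold 26 test samples and seven hold 25, so every sample is tested exactly once.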


A visual data flow combining SVM-1, the decision tree algorithm C4.5, scatterplots and parallel coordinates addresses:
- a very large number of dimensions (many thousands)
- user feedback: optimising parameters, bringing in human knowledge
- results: accuracy and comprehensibility
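
The feature-selection step of such a flow can be sketched as follows, assuming that SVM-1 denotes a 1-norm (L1-regularised) linear SVM, which this page does not spell out. The solver below is a plain proximal subgradient method with no bias term, chosen only to keep the example short; it is not the authors' implementation. The function names, the toy data, and the parameters `lam`, `lr` and `epochs` are all hypothetical.

```python
import random


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def l1_svm_select(X, y, lam=0.2, lr=0.1, epochs=200):
    """Minimise mean hinge loss + lam * ||w||_1 by proximal subgradient steps.

    Returns the weight vector and the indices of features with non-zero weight.
    Assumption: linear model without a bias term, labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # hinge loss is active for this sample
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        # Gradient step followed by the L1 proximal (soft-threshold) step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    return w, selected


# Toy data (not one of the datasets above): feature 0 equals the label,
# the other 49 features are small uniform noise.
rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(20)]
X = [[float(yi)] + [rng.uniform(-0.1, 0.1) for _ in range(49)] for yi in y]

w, selected = l1_svm_select(X, y)
X_reduced = [[row[j] for j in selected] for row in X]  # input for plots or a tree
print(selected)  # [0]
```

The L1 penalty drives the weights of uninformative dimensions to exactly zero, so the surviving indices form a reduced dataset small enough to inspect with scatterplots or parallel coordinates and to pass on to a decision tree learner.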


1) Colon tumor:
- without feature selection (2000 dims): 6 errors with SVM algorithms
- feature selection with SVM-1: 8 dims
- visualization with scatterplots, parallel coordinates: 4 errors
- classification with decision tree algorithm C4.5




Figure 1. Visual data flow for classifying Colon tumor




Figure 2. Visualization of Colon tumor with Scatterplot-2D




Figure 3. Visualization of Colon tumor with Parallel coordinates




Figure 4. Decision tree of Colon tumor



2) Ovarian cancer:
- without feature selection (15154 dims): 100% accuracy with SVM algorithms
- feature selection with SVM-1: 9 dims
- visualization with scatterplots, parallel coordinates: 100% accuracy
- classification with decision tree algorithm C4.5



3) Lung cancer:
- without feature selection (12533 dims): 2 errors with SVM algorithms
- feature selection with SVM-1: 6 dims
- visualization with scatterplots, parallel coordinates: 5 errors
- classification with decision tree algorithm C4.5




4) Leukemia (AML-ALL):
- without feature selection (7129 dims): 2 errors with SVM algorithms
- feature selection with SVM-1: 8 dims
- visualization with scatterplots, parallel coordinates: 2 errors
- classification with decision tree algorithm C4.5
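
The error counts reported above can be converted to accuracies with a small calculation. This sketch assumes that errors are counted over the evaluation samples from Table 1 (all 62 samples for leave-1-out on Colon tumor, 149 test samples for Lung cancer, 34 test samples for Leukemia); the page does not state this explicitly. The name `accuracy` and the table layout are hypothetical.

```python
def accuracy(errors, n_evaluated):
    """Fraction of correctly classified samples."""
    return (n_evaluated - errors) / n_evaluated


# (dataset, samples assumed evaluated, errors with all dims,
#  errors reported on the visualization line after feature selection)
reported = [
    ("Colon tumor", 62, 6, 4),
    ("Lung cancer", 149, 2, 5),
    ("Leukemia", 34, 2, 2),
]

rows = [
    (name, accuracy(err_full, n_eval), accuracy(err_selected, n_eval))
    for name, n_eval, err_full, err_selected in reported
]

for name, acc_full, acc_selected in rows:
    print(f"{name:12s} all dims: {acc_full:.1%}   selected dims: {acc_selected:.1%}")
```

Ovarian cancer is left out because its results are already reported as 100% accuracy.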




Last updated June 28, 2007 by Thanh-Nghi Do