Robust ensemble of handcrafted and learned approaches for DNA-binding proteins


Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two data sets or tasks, which makes them difficult to compare. The objective of this study is to create a robust system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks. Effective DNA-BP classifiers require powerful protein representations and feature extraction methods. In this work, experiments were performed that combined and compared descriptors extracted from some of the best matrix/image protein representations. A separate SVM was trained on each descriptor and evaluated. Convolutional neural networks with different parameter settings were also fine-tuned on two matrix representations of proteins. Decisions were fused using the weighted sum rule and evaluated to experimentally derive the most powerful general-purpose DNA-BP classifier system. In a broad and fair comparison across four different data sets representing a variety of DNA-BP classification tasks, the best ensemble proposed here produced results comparable, if not superior, to those reported in the literature, thereby demonstrating the robustness of the proposed system. The results presented here will be useful for future comparisons.
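The weighted sum rule mentioned above can be illustrated with a minimal sketch. It assumes each classifier (for example, an SVM trained on one descriptor, or a fine-tuned CNN) outputs one real-valued score per protein, and that scores are z-score normalized before fusion so that classifiers with different output ranges are comparable. The normalization step, the function names, the weights, and the toy scores are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def zscore(scores):
    """Normalize one classifier's scores to zero mean and unit variance.

    Assumption: this normalization is a common choice before fusion,
    not necessarily the one used by the authors.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores identical: the classifier carries no ranking information.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def weighted_sum_fusion(score_lists, weights):
    """Fuse per-classifier scores with the weighted sum rule.

    score_lists: one sequence of scores per classifier, all the same length.
    weights: one non-negative weight per classifier (hypothetical values).
    Returns one fused score per sample.
    """
    weights = np.asarray(weights, dtype=float)
    stacked = np.vstack([zscore(s) for s in score_lists])
    if stacked.shape[0] != weights.shape[0]:
        raise ValueError("need exactly one weight per classifier")
    # Weighted average over classifiers, computed column by column.
    return weights @ stacked / weights.sum()


# Toy scores for four proteins from two hypothetical classifiers.
svm_scores = [1.2, -0.4, 0.3, -1.5]
cnn_scores = [0.9, 0.2, 0.6, 0.1]

# The SVM is given twice the weight of the CNN purely for illustration.
fused = weighted_sum_fusion([svm_scores, cnn_scores], weights=[2.0, 1.0])

# Proteins with a higher fused score are ranked as more likely DNA-binding.
ranking = np.argsort(-fused)
```

In practice the weights would be chosen on held-out data rather than fixed by hand, and the fused scores would be thresholded or ranked depending on the evaluation metric.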

Keywords: support vector machines, convolutional neural networks, pseudo amino acid composition, heterogeneous ensembles, protein representations.
