Learning Adaptive Hidden Layers for Mobile Gesture Recognition

Ting-Kuei Hu, Yen-Yu Lin, Pi-Cheng Hsiu


This paper addresses two obstacles to accurate gesture recognition on mobile devices. First, gesture recognition performance is highly dependent on feature selection, but the optimal features typically vary from gesture to gesture. Second, diverse user behaviors and mobile environments result in extremely large intra-class variations.

We tackle these issues by introducing a new network layer, called an adaptive hidden layer (AHL), to generalize a hidden layer in deep neural networks and dynamically generate an activation map conditioned on the input. To this end, an AHL is composed of multiple neuron groups and an extra selector. The former compiles multi-modal features captured by mobile sensors, while the latter adaptively picks a plausible group for each input sample.

The AHL is end-to-end trainable and can generalize an arbitrary subset of hidden layers. Through a series of AHLs, the expressive power of exponentially many forward paths allows us to choose proper multi-modal features in a sample-specific fashion and to resolve the problems caused by the unfavorable variations in mobile gesture recognition. The proposed approach is evaluated on a benchmark for gesture recognition and on a newly collected dataset. Its superior performance on both demonstrates its effectiveness.


@inproceedings{AHL,
title = {Learning Adaptive Hidden Layers for Mobile Gesture Recognition},
author = {Hu, Ting-Kuei and Lin, Yen-Yu and Hsiu, Pi-Cheng},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2018}
}



Adaptive Hidden Layer

An AHL generalizes a hidden layer in DNNs and can dynamically generate an appropriate activation map for a given input. It is composed of multiple neuron groups and an extra selector. Each training sample can be well processed by at least one group, while the selector, which implements softmax normalization, dynamically picks a plausible group for each input sample.
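The structure described above can be illustrated with a minimal NumPy sketch of a forward pass. This is not the authors' implementation: the class name `AdaptiveHiddenLayer`, the fully connected groups, the ReLU activation, the linear selector, and the soft (weighted-sum) combination of group outputs are all assumptions made for illustration. The text only states that an AHL has multiple neuron groups and a softmax-normalized selector.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class AdaptiveHiddenLayer:
    """Toy AHL: several neuron groups plus a softmax selector (hypothetical)."""

    def __init__(self, in_dim, out_dim, num_groups, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and one bias vector per neuron group.
        self.W = rng.normal(0.0, 0.1, size=(num_groups, in_dim, out_dim))
        self.b = np.zeros((num_groups, out_dim))
        # Selector parameters: one score per group (assumed linear selector).
        self.Ws = rng.normal(0.0, 0.1, size=(in_dim, num_groups))
        self.bs = np.zeros(num_groups)

    def forward(self, x):
        """Map x of shape (batch, in_dim) to (activations, selector weights)."""
        # Group activations with ReLU, shape (batch, num_groups, out_dim).
        h = np.maximum(0.0, np.einsum("bi,kio->bko", x, self.W) + self.b)
        # Selector weights, shape (batch, num_groups); each row sums to 1.
        p = softmax(x @ self.Ws + self.bs)
        # Assumption: soft selection, i.e. a selector-weighted sum of groups.
        out = np.einsum("bk,bko->bo", p, h)
        return out, p


layer = AdaptiveHiddenLayer(in_dim=8, out_dim=4, num_groups=3)
out, p = layer.forward(np.ones((2, 8)))
print(out.shape, p.shape)  # (2, 4) (2, 3)
```

Because the selector weights depend on the input, different samples can emphasize different groups. Stacking several such layers with K groups each gives K^n possible group combinations across n layers, which is one way to read the "exponentially many forward paths" mentioned in the abstract.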


IsoGD dataset

This dataset includes 47,933 RGB-D gesture videos. Each RGB-D video represents one gesture only, and there are 249 gesture labels performed by 21 different individuals.
Because labels for the testing subset are not available, our approach and the competing approaches are trained on the training subset and evaluated on the validation subset.

Network                        Accuracy
C3D (RGB only)                 37.30%
C3D (Depth only)               40.50%
C3D (RGB+Depth)                49.20%
C3D+ConvLSTM (RGB only)        43.88%
C3D+ConvLSTM (Depth only)      44.66%
C3D+ConvLSTM (RGB+Depth)       51.02%
Ours (RGB only)                44.88%
Ours (Depth only)              48.96%
Ours (RGB+Depth)               54.14% (54.50% for the updated version)

Our collected dataset

We collected a dataset of mobile hand gestures. Each gesture was recorded both as video captured by smartphone cameras and as 3-axis acceleration (ACCE) captured by smartwatch accelerometers.

We implemented a prototype system on a Samsung Note5 and a Moto360. A demonstration video of our system is shown on the right.

Network            Accuracy
DAE + HOG          81.52%
DAE + ACCE         76.24%
Multi-modal DAE    86.48%
Ours               90.57%

Code and Dataset

  • The code for reproducing the results of our approach on the IsoGD dataset and on our collected dataset will be available on GitHub.

  • Our collected dataset can be downloaded from the link below.