Bag of features

Power-law distribution in encoded MFCC frames of speech, music, and environmental sound signals

Publication Type:

Conference Paper

Source:

Int. World Wide Web Conf. (WWW), Workshop on Advances in Music Information Retrieval (AdMIRe), Lyon, France, p.895-902 (2012)

URL:

http://www2012.wwwconference.org/proceedings/forms/companion.htm#8

Abstract:

Many sound-related applications use Mel-Frequency Cepstral Coefficients (MFCCs) to describe audio timbral content. Most research efforts dealing with MFCCs have focused on the study of different classification and clustering algorithms, the use of complementary audio descriptors, or the effect of different distance measures. The goal of this paper is to focus on the statistical properties of the MFCC descriptor itself. For that purpose, we use a simple encoding process that maps a short-time MFCC vector to a dictionary of binary code-words. We study and characterize the rank-frequency distribution of such MFCC code-words, considering speech, music, and environmental sound sources. We show that, regardless of the sound source, MFCC code-words follow a shifted power-law distribution. This implies that a few code-words occur very frequently while many occur rarely. We also observe that the inner structure of the most frequent code-words presents characteristic patterns; for instance, close MFCC coefficients tend to have similar quantization values in the case of music signals. Finally, we study the rank-frequency distributions of individual music recordings and show that they present the same type of heavy-tailed distribution found in the large-scale databases. We exploit this fact in two supervised semantic inference tasks, genre and instrument classification, where we obtain classification results similar to those obtained from all frames of a recording while using just 50 properly selected frames. Beyond this particular example, we believe that the power-law distribution of MFCC frames could have important implications for future audio-based applications.
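The encoding-and-counting pipeline described above can be illustrated with a short Python sketch. The snippet below is an assumption-laden illustration, not the paper's exact procedure: it binarizes each MFCC frame against per-coefficient medians to form code-words and computes their rank-frequency counts; the median thresholding and the use of librosa are choices made here for concreteness.

from collections import Counter

import librosa
import numpy as np

def mfcc_codewords(path, n_mfcc=13):
    """Map each short-time MFCC vector to a binary code-word."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    thresholds = np.median(mfcc, axis=1, keepdims=True)     # per-coefficient threshold (an assumption)
    bits = (mfcc > thresholds).astype(int)
    # Each frame becomes one binary string, e.g. '1011010010110'.
    return [''.join(map(str, bits[:, t])) for t in range(bits.shape[1])]

def rank_frequency(codewords):
    """Code-word counts sorted from most to least frequent."""
    return sorted(Counter(codewords).values(), reverse=True)

# freqs = rank_frequency(mfcc_codewords('recording.wav'))
# Plotting rank vs. frequency on log-log axes should then reveal the
# shifted power law reported in the abstract.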

Zipf's law in short-time timbral codings of speech, music, and environmental sound signals

Publication Type:

Journal Article

Source:

PLoS ONE, PLoS, Volume 7, Issue 3, p.e33993 (2012)

URL:

http://dx.plos.org/10.1371/journal.pone.0033993

Abstract:

Timbre is a key perceptual feature that allows us to discriminate between different sounds. Timbral sensations are highly dependent on the temporal evolution of the power spectrum of an audio signal. To quantitatively characterize such sensations, the shape of the power spectrum has to be encoded in a way that preserves certain physical and perceptual properties; therefore, it is common practice to encode short-time power spectra using psychoacoustical frequency scales. In this paper, we study and characterize the statistical properties of such encodings, here called timbral code-words. In particular, we report on rank-frequency distributions of timbral code-words extracted from 740 hours of audio coming from disparate sources such as speech, music, and environmental sounds. Analogously to text corpora, we find a heavy-tailed, Zipfian distribution with an exponent close to one. Importantly, this distribution is found independently of different encoding decisions and regardless of the audio source. Further analysis of the intrinsic characteristics of the most and least frequent code-words reveals that the most frequent code-words tend to have a more homogeneous structure. We also find that the speech and music databases have distinctive code-words, while such database-specific code-words are not present in the environmental sounds. Finally, we find that a Yule-Simon process with memory provides a reasonable quantitative approximation of our data, suggesting the existence of a common, simple generative mechanism behind all considered sound sources. Our results provide new evidence towards understanding both sound generation and perception processes and, at the same time, suggest a potential path to enhance current audio-based technological applications by taking advantage of the found distribution.

Notes:

Supplementary information can be found at the PLoS ONE web site.
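The Yule-Simon process with memory mentioned in the abstract is straightforward to simulate. The Python sketch below is a simplified illustration rather than the paper's exact model: innovation happens with probability alpha, and otherwise a past code-word is copied, with a finite copying window standing in for whatever memory kernel the paper uses.

import random
from collections import Counter

def yule_simon_with_memory(n_steps, alpha=0.1, window=1000, seed=0):
    """With probability alpha emit a brand-new word; otherwise copy a
    word drawn uniformly from the last `window` emissions."""
    rng = random.Random(seed)
    sequence, next_word = [0], 1
    for _ in range(n_steps - 1):
        if rng.random() < alpha:
            sequence.append(next_word)   # innovation: a new code-word enters
            next_word += 1
        else:
            sequence.append(rng.choice(sequence[-window:]))  # memory-limited copy
    return sequence

freqs = sorted(Counter(yule_simon_with_memory(100_000)).values(), reverse=True)
# For small alpha, the rank-frequency curve of the simulated words
# approaches a Zipfian distribution with exponent close to one, the
# behaviour the abstract reports for timbral code-words.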

Visual Registration Method for a Low Cost Robot

Publication Type:

Conference Paper

Source:

7th International Conference on Computer Vision Systems, Lecture Notes in Computer Science, Springer, Volume 5815, Liège, Belgium, p.204-214 (2009)

ISBN:

3-642-04666-5

Keywords:

Registration; Bag of features; robot localization

Abstract:

An autonomous mobile robot must face the correspondence, or data association, problem in order to carry out tasks like place recognition or unknown environment mapping. To put two maps into correspondence, most methods first extract early features from robot sensor data, then search for matches between features, and finally estimate the transformation that relates the maps from those matches. However, finding explicit matches between features is a challenging and computationally expensive task. In this paper, we propose a new method to align obstacle maps without searching for explicit matches between features. The maps are obtained from a stereo pair. We use a vocabulary tree approach to identify putative corresponding maps, followed by a Newton minimization algorithm to find the transformation that relates both maps. The proposed method is evaluated on a typical office dataset and shows good performance.
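To make the alignment step concrete, here is a rough Python sketch of correspondence-free registration in the spirit of the abstract: obstacle points from one map are scored against a distance field built from the other, and the rigid transform (theta, tx, ty) is refined iteratively. The grid rasterization and its scale are assumptions, and SciPy's derivative-free Nelder-Mead minimizer stands in for the paper's Newton scheme.

import numpy as np
from scipy.ndimage import distance_transform_edt
from scipy.optimize import minimize

def distance_field(points_b, grid_size=256, scale=10.0):
    """Rasterize map B's obstacle points and compute, for every cell,
    the distance to the nearest obstacle."""
    grid = np.ones((grid_size, grid_size), dtype=bool)
    idx = np.clip((points_b * scale + grid_size // 2).astype(int), 0, grid_size - 1)
    grid[idx[:, 1], idx[:, 0]] = False             # obstacle cells are zeros
    return distance_transform_edt(grid), scale, grid_size

def align(points_a, points_b):
    """Estimate the rigid transform (theta, tx, ty) taking map A onto map B."""
    field, scale, n = distance_field(points_b)

    def cost(p):
        theta, tx, ty = p
        c, s = np.cos(theta), np.sin(theta)
        pts = points_a @ np.array([[c, -s], [s, c]]).T + np.array([tx, ty])
        idx = np.clip((pts * scale + n // 2).astype(int), 0, n - 1)
        return np.sum(field[idx[:, 1], idx[:, 0]] ** 2)  # squared distance to obstacles

    return minimize(cost, x0=np.zeros(3), method='Nelder-Mead').x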
