Dance Video Datasets for Artificial Intelligence

Mark Gituma
Jun 4, 2019

One of the limitations encountered when recognizing and classifying human activities in video is the limited number of training samples. Fortunately, there has been an effort to alleviate this issue as organizations have been releasing their annotated data to the public.

A complete dataset dedicated entirely to dance is not the norm; however, as dance is a subset of human actions, it is usually contained within general human activity recognition datasets. Unfortunately, the dance examples form only a small subset of the provided data, so it is necessary to aggregate the examples across the individual sources. By aggregating, it is possible to collect a large enough sample for dance recognition and classification.

This blog is geared towards identifying the different sources of dance datasets available for human action recognition and classification. With that said, let’s look at the first dance dataset out there.

UCF 101

The UCF dataset was created by the University of Central Florida (hence the name UCF) and has 101 action categories across 13,320 videos. It was released in 2012 and is based on videos collected from YouTube. The videos contain variations in camera motion, object appearance, pose, object scale, viewpoint, cluttered backgrounds and illumination conditions.

Within the 101 categories, the only category representing a type of dance is Salsa Spins. There are 133 such videos in total, which is about 1% of the full dataset (13,320 videos).

The dataset can be downloaded as a zip file from the UCF website (https://www.crcv.ucf.edu/data/UCF101.php). The zip file is about 6 GB and on decompression the dataset occupies approximately 14 GB of disk space.
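As a quick sanity check after extraction, a short script can count the salsa clips. The sketch below assumes the standard UCF101 layout of one folder per class with clips named v_<Class>_g<group>_c<clip>.avi; the folder name SalsaSpin and the root path are assumptions that may need adjusting to the local setup.

from pathlib import Path

# Assumed extraction root and class folder name; adjust to the local layout.
UCF_ROOT = Path("UCF-101")
DANCE_CLASS = "SalsaSpin"

clips = sorted((UCF_ROOT / DANCE_CLASS).glob("v_*.avi"))
print(f"{DANCE_CLASS}: {len(clips)} clips")

# Clip names encode the source group, e.g. v_SalsaSpin_g01_c01.avi.
groups = {clip.stem.split("_")[2] for clip in clips}
print(f"{len(groups)} distinct source groups")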

ActivityNet

ActivityNet is a video benchmark for human activity understanding. There are currently 2 releases, 1.2 and 1.3; the focus here is on release 1.3, which was published in March 2016. ActivityNet contains about 200 activity classes, with 10,024 training videos, 4,926 validation videos and a further 5,044 testing videos.

The data pertaining to the ActivityNet dataset is encoded within a json file containing 3 different keys, i.e. database, taxonomy and version.

  • The version field contains the current dataset release.
  • The taxonomy contains the parent child relationship of every activity in the dataset.
  • The database is the core of the dataset and is as follows:
{ "87hsTxVtn-A": {
"duration": 235.89,
"subset": "validation",
"resolution": "1920x1080",
"url": "https://www.youtube.com/watch?v=87hsTxVtn-A",
"annotations": [{
"segment": [3.680048861154446, 231.84307825273012],
"label": "Belly dance"
}]
}
}

Here, the key of the json object is the uuid of the YouTube url. The value of the uuid key is another json object containing:

  • The duration - which is the length of the video.
  • The subset - determines whether the data belongs to the train, validation or test set.
  • The url - contains the YouTube url for the video and can be used to download the video.
  • The annotations - contains an array of dictionary objects made up of a segment key and a label key. The segment key is a tuple of start and end times, while the label is the class label of the given segment. The reason an array is used for the annotations is that a single video can have multiple segments with the same label or multiple segments with different labels.

From the above example, there is a single belly dance segment between 3.6 seconds and 231.8 seconds in the given video, which has a spatial resolution of 1920x1080 and belongs to the validation set. The data can be retrieved by downloading the 235.89 second video at https://www.youtube.com/watch?v=87hsTxVtn-A.

The fact that the dataset is distributed as a json file means that some form of preprocessing needs to be applied to actually download the YouTube videos. Luckily, ActivityNet provides its own script to do this, which can be found in the ActivityNet github repo (https://github.com/activitynet/ActivityNet).
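For illustration, a minimal sketch of that preprocessing step might look like the following. It assumes a local copy of the 1.3 annotation file (the file name here is a placeholder) and a hand-maintained set of dance labels taken from the taxonomy; only Belly dance, which appears in the example above, is listed, so the set should be extended before real use.

import json

# Placeholder name for the downloaded ActivityNet 1.3 annotation file.
with open("activity_net_1_3.json") as f:
    data = json.load(f)

# Dance labels copied by hand from the taxonomy; extend with the remaining dance classes.
DANCE_LABELS = {"Belly dance"}

dance_videos = {}
for video_id, info in data["database"].items():
    segments = [a["segment"] for a in info["annotations"] if a["label"] in DANCE_LABELS]
    if segments:
        dance_videos[video_id] = {
            "subset": info["subset"],
            "url": info["url"],
            "segments": segments,
        }

print(f"{len(dance_videos)} videos with dance annotations")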

Class Distribution

An analysis of the class distribution within the Dance category of the ActivityNet dataset was conducted, and the results are as follows:

There are a total of 503 videos associated with dance in the 1.3 release, which is about 2.52% of the total number of videos in the dataset (19,994). Only 5 classes contain a form of dance, which is 2.5% of the total class list of 200 activity classes. This is not a lot in terms of depth and diversity when considering the data requirements for deep learning, and is precisely the reason aggregation over different sources is useful.

Kinetics 600

The Kinetics dataset was created by DeepMind and was designed to be a baseline for human activity recognition. The inspiration was derived from ImageNet, which contains 1000 classes with over 1000 images per class. The Kinetics 600 (K600) dataset is an expansion of the Kinetics 400 (K400) dataset released in 2017.

The K600 dataset consists of about 500,000 video clips, and covers 600 human action classes with at least 600 video clips for each action class. Each clip is about 10s long and was generated after several rounds of human annotation on YouTube videos. The dataset covers a large range of classes including human-object interactions, human-human interactions and single human motions.

According to The Kinetics Human Action Video Dataset paper published alongside the K400 data, a parent-child grouping is provided as a guideline and is not meant to be exclusive. Within this guideline, the dancing category has 18 classes of dance, e.g. zumba, krumping, salsa, tango etc. In the newer Kinetics 600 release, the total number of dance classes increases to 21.

As with the ActivityNet dataset, the Kinetics data can be downloaded from the DeepMind kinetics page (https://deepmind.com/research/open-source/open-source-datasets/kinetics/). There are 4 different zip files available for download: the training, validation, holdout test and test sets. Within each zip file there is a json file and a csv file. The structure of the json file is as follows:

{
  ...
  "--07WQ2iBlw": {
    "annotations": {
      "label": "javelin throw",
      "segment": [
        1.0,
        11.0
      ]
    },
    "duration": 10.0,
    "subset": "val",
    "url": "https://www.youtube.com/watch?v=--07WQ2iBlw"
  }
  ...
}

This has a similar structure to the ActivityNet json file, with some differences. The objects sit in the root json structure as opposed to being nested under a key, i.e. the database key in the ActivityNet dataset. As there is no taxonomy key, the parent-child guidelines published in the Kinetics paper are used to extract the classes within the dance category.
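A minimal sketch of that extraction step is shown below. The split file names are placeholders, and the dance class names are a hand-picked subset of the grouping in the paper rather than the full list of 21.

import json

# Placeholder names for the json files found inside the downloaded zips.
SPLIT_FILES = ["kinetics_train.json", "kinetics_val.json", "kinetics_test.json"]

# Hand-picked subset of the dancing group from the Kinetics paper; extend as needed.
DANCE_CLASSES = {"salsa dancing", "tango dancing", "zumba", "krumping", "belly dancing"}

dance_clips = []
for path in SPLIT_FILES:
    with open(path) as f:
        split = json.load(f)
    for clip_id, info in split.items():
        # Note: labels may be withheld for the test split.
        if info["annotations"]["label"] in DANCE_CLASSES:
            dance_clips.append((clip_id, info["url"], info["annotations"]["segment"]))

print(f"{len(dance_clips)} dance clips across the splits")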

There are various ways to download the data, such as the kinetics downloader by the Showmax streaming platform. It allows downloading individual classes, classes within a category, or the entire dataset. However, the shortcoming of the codebase is that it focuses on the K400 dataset and does not have a download option for the newer K600 dataset.

Class Distribution

A quick analysis was conducted on the class distribution within the dance category which can be seen in the following chart.

The above represents the classes within the K600 dance category, of which there are 21 in total. The total number of data points aggregated from the train, validation and test splits was found to be 19,105, which represents about 3.98% of the entire dataset (480,173 clips). This is the largest dance dataset encountered so far; however, more work needs to be conducted to determine the quality of the dance videos within it.

Let’s Dance

The Let’s Dance dataset is provided by the Georgia Tech College of Computing and is a direct result of the Let’s Dance: Learning From Online Dance Videos research paper. It’s the first dataset encountered here that is focused completely on dance.

According to the original paper, the dataset consists of 1000 videos containing 10 dynamic and visually overlapping dances: Ballet, Flamenco, Latin, Square, Tango, Breakdancing, Foxtrot, Quickstep, Swing and Waltz. There are meant to be 100 videos per class, each approximately 10 seconds long at 30 fps. The videos were taken from YouTube at a spatial resolution of 720p.

The dataset can be downloaded from the project’s webpage (https://www.cc.gatech.edu/cpl/projects/dance/) and is currently divided into 4 partitions: Original Frames, Optical Flow, Skeletons Visualized and Skeletons JSON.

The focus here is on the Original Frames partition, which, as the name suggests, contains the video frames of the different dance classes. The zip file is about 26 GB and expands to 403 GB when extracted. The frames are arranged in one folder per type of dance. The implication of using only frame level data is that there is no sound, hence no audio modality is available and classification can only be performed based on spatial and/or temporal input modalities.

Data Analysis

A quick analysis was conducted on the extracted Original Frames dataset which, as mentioned, occupies approximately 403 GB. Within the folders there are 16 different dance classes as opposed to the original 10 described in the paper; the new additions are rumba, jive, samba, pasodoble, cha and tap. It should also be noted that “breakdancing” was renamed to “break” in the recent update.

Within each folder, the individual frames of the videos have been extracted, hence the large folder size. The filenames have the format <youtubeUUID>_<startFrame>_<relativeFrame>.jpg, e.g. 5xxTkB5bGy4_046_0026.jpg, where the prefix 5xxTkB5bGy4 is assumed to be the unique YouTube uuid from the URL. The middle value 046 is assumed to be the starting frame within the video, and the last value 0026 is assumed to be the relative frame count, which starts from 0001 and increments to the last frame number in the segment. Thus it is assumed the same video can have multiple segments grouped according to the startFrame, i.e. videos with the same YouTube uuid can have different start frame values but exactly the same relative frame values, e.g. 5xxTkB5bGy4_046_0026.jpg and 5xxTkB5bGy4_199_0026.jpg.

With these assumptions, the frames with the same YouTube uuid and middle value were grouped together, and the following class distribution was calculated over the resulting video instances.
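Under those assumptions, the grouping can be expressed roughly as follows, where the root folder name is a placeholder for the extracted Original Frames directory.

from collections import defaultdict
from pathlib import Path

# Placeholder path to the extracted Original Frames partition, one folder per dance class.
FRAMES_ROOT = Path("lets_dance_frames")

videos_per_class = defaultdict(set)
for class_dir in FRAMES_ROOT.iterdir():
    if not class_dir.is_dir():
        continue
    for frame in class_dir.glob("*.jpg"):
        # e.g. 5xxTkB5bGy4_046_0026.jpg -> the pair (uuid, start frame) identifies a video instance.
        youtube_uuid, start_frame, _relative = frame.stem.rsplit("_", 2)
        videos_per_class[class_dir.name].add((youtube_uuid, start_frame))

for dance, videos in sorted(videos_per_class.items()):
    print(f"{dance}: {len(videos)} videos")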

It was found that the average number of videos per class was 91 with a standard deviation of 7. This is less than reported in the original paper, which could be attributed to the dataset being a newer version, or to the assumptions made during this calculation not being entirely accurate. The total number of videos in the newer dataset was found to be 1463, the increase being due to the additional classes.

YouTube 8M

The YouTube 8M dataset is one of the largest scale labelled video datasets to date and was released by Google. It contains approximately 6.1 M YouTube video ids resulting in over 350,000 hours of video, covering 3862 classes with an average of 3 labels per video.

The dataset itself is more complex than the above mentioned sources in that it does not ship the videos themselves, but rather the feature vectors produced by a deep learning classifier at a temporal resolution of 1 frame per second. There is a dance category within the dataset, but not much further analysis has been conducted on this data due to its significant difference from the other sources.
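For readers who want to inspect the format anyway, a minimal sketch for reading the video-level records is given below. The feature names follow the public YouTube 8M starter code and, along with the file name, should be treated as assumptions to verify against the actual release.

import tensorflow as tf

# Assumed video-level feature names from the public starter code; verify against the release.
FEATURES = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}

# Placeholder file name for one of the downloaded video-level tfrecord shards.
dataset = tf.data.TFRecordDataset("train0000.tfrecord")
for raw_record in dataset.take(1):
    example = tf.io.parse_single_example(raw_record, FEATURES)
    labels = tf.sparse.to_dense(example["labels"]).numpy()
    print(example["id"].numpy(), labels)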

Secondary Data Augmentation

Even though the focus is on dance datasets, the principles of dance with regards to individual human motion, human-human interactions and, to a limited extent, human-object interactions can be extended to other forms of body motion. Examples are activities such as backflips, swinging legs, swinging arms, contorting etc. (derived from the Kinetics dataset).

This augmentation is possible as dance is largely a motion based activity and as such, we are looking to eliminate spatial features with regards to object interactions and focus mainly on the dynamics of the human body through time.

Reliance on spatial features, e.g. human-object interactions, has been shown to be problematic in previous human activity recognition experiments, in that the network simply learns associations of objects within a video section and pools the results of the object recognition to infer an activity. In layman’s terms, if a video contains a ball, a field, 22 individuals on the field and 2 goal posts, the network reasonably assumes the video contains a soccer match without really using any of the temporal features in the video.

With dance this is not possible, as it is mostly motion driven; hence an algorithm trained to understand dance cannot take significant advantage of spatial objects to recognize the dance moves or what type of dance is being performed.

Conclusion

As we have seen, there are quite a few sources of dance data around; however, in each source the dance category is usually a small fraction of the whole (with the exception of the Let’s Dance dataset). Thus, by aggregating data across the different sources, a substantial dataset can be generated covering a wide range of classes with a larger number of samples in each class.

Aggregation poses a few challenges, one being duplication, in that a YouTube url appearing in one dataset might also be present in another. With ActivityNet, Kinetics and, to a lesser extent, the Let’s Dance dataset, this can be resolved by a simple filter that excludes duplicate urls. However, this is a naive approach in that the chosen segments within the video might be different. A more thorough approach would be to apply a frame matching algorithm on the segments of interest, after which videos that contain a significant number of duplicated frames across the datasets can be discarded. More information on this procedure can be found in the Kinetics paper.
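As a first pass, the naive url filter could look like the sketch below, which keys the aggregated entries on the YouTube video id; the record layout is hypothetical and only meant to illustrate the idea.

# Hypothetical aggregated records of the form (source, youtube_id, label, segment).
records = [
    ("activitynet", "87hsTxVtn-A", "Belly dance", (3.68, 231.84)),
    ("kinetics", "87hsTxVtn-A", "belly dancing", (10.0, 20.0)),
]

seen_ids = set()
deduplicated = []
for source, youtube_id, label, segment in records:
    if youtube_id in seen_ids:
        # Naive: drop later occurrences even though their segments may differ.
        continue
    seen_ids.add(youtube_id)
    deduplicated.append((source, youtube_id, label, segment))

print(f"kept {len(deduplicated)} of {len(records)} records")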

Even though aggregation improves the amount of data available, the total data focusing on dance is still quite low considering the variation present in everyday human dance movement.

Thus, Dancelogue is committed to providing high quality dance data with greater variation and more samples per class. It is hoped that the data will be used by researchers to further the field of human action recognition and classification within dance.

The original version of this post appeared on the Dancelogue blog; content from the blog is usually posted on Medium after 2 weeks.

