Streaming with MP4 Container

ISO/IEC 14496-12:2015(E) specifies the ISO base media file format that MP4 builds on, including fragmented MP4 (fMP4).

 

MP4 is a popular multimedia container format, and many video and audio files use it. It should not be confused with the MPEG-4 video codecs: MP4 is the container, not the compression method.

It is an instance of the more general format defined by ISO/IEC 14496-12:2004 (MPEG-4 Part 12: ISO base media file format), which is directly based on the QuickTime File Format. In my opinion, the ISO base media file format is over-engineered. ISO/IEC 14496-15 specifies how H.264 (AVC) streams are stored in this file format, while ISO/IEC 14496-10 specifies the H.264 (AVC) codec itself.

Regular MP4
One key point is that the media data (e.g. video frames, audio samples) are interleaved in the mdat box and indexed by the stco box. The indices are absolute file offsets (i.e. offsets from the start of the MP4 file). Because the index cannot be completed until the size and position of every sample are known, it is impossible to create a regular MP4 file on the fly during streaming unless moof boxes are used.

The way a player finds media data samples in an MP4 file can be summarized as follows:

  1. The Time to Sample Box ('stts') provides the interval of every sample. "Time to sample" means mapping times (i.e., durations or deltas) to samples; one time value can be mapped to multiple samples.
  2. The Sample Size Box ('stsz') provides the size (i.e. number of bytes) of every sample. At this point the player knows the timing and size of every sample, but not yet the locations of the samples.
  3. The Sample to Chunk Box ('stsc') provides the map between chunks and samples (i.e. which samples each chunk has); "sample to chunk" means grouping samples into chunks. This box, together with the two above, determines the size of each chunk.
  4. The Chunk Offset Box ('stco') provides the offset (i.e. the location) of each chunk. With this, the location of every sample is determined. This is the trickiest box.

With the above information, the decoder can find the chunk that holds each sample, and hence the sample's location. For example, suppose the Sample to Chunk Box (stsc) has the following entries:

first_chunk    samples_per_chunk    sample_description_index
1              23                   1
2              21                   1
3              20                   1
...

Sample 50 is the 6th sample of chunk 3 (50 - 23 - 21 = 6). Say the Chunk Offset Box (stco) shows that the offset of chunk 3 is 5808400; the decoder then looks up the sizes of samples 45, 46, 47, 48, and 49 and adds their sum to 5808400 to obtain the offset of sample 50.
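The lookup for sample 50 can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up for this article, the three tables are already parsed into plain lists, every sample is assumed to be 100 bytes, and the offsets of chunks 1 and 2 are invented (only the chunk 3 offset comes from the example above).

```python
from typing import List, Tuple

def expand_stsc(stsc: List[Tuple[int, int, int]], chunk_count: int) -> List[int]:
    """Return the number of samples in every chunk (index 0 is chunk 1)."""
    per_chunk: List[int] = []
    for i, (first_chunk, samples_per_chunk, _desc_index) in enumerate(stsc):
        # An stsc entry applies until the next entry's first_chunk,
        # or until the last chunk for the final entry.
        next_first = stsc[i + 1][0] if i + 1 < len(stsc) else chunk_count + 1
        per_chunk.extend([samples_per_chunk] * (next_first - first_chunk))
    return per_chunk

def sample_offset(sample_number: int,
                  stsc: List[Tuple[int, int, int]],
                  stsz: List[int],
                  stco: List[int]) -> int:
    """Absolute file offset of a sample; sample numbers start at 1."""
    per_chunk = expand_stsc(stsc, len(stco))
    first_sample = 1  # number of the first sample in the current chunk
    for chunk_index, count in enumerate(per_chunk):
        if sample_number < first_sample + count:
            # Add the sizes of the samples that precede it in the same chunk.
            preceding = stsz[first_sample - 1 : sample_number - 1]
            return stco[chunk_index] + sum(preceding)
        first_sample += count
    raise ValueError("sample number out of range")

# Hypothetical tables: 3 chunks holding 23 + 21 + 20 = 64 samples.
stsc = [(1, 23, 1), (2, 21, 1), (3, 20, 1)]
stco = [48, 2900000, 5808400]   # only the chunk 3 offset is from the text
stsz = [100] * 64               # assumed: every sample is 100 bytes

offset_50 = sample_offset(50, stsc, stsz, stco)
print(offset_50)  # 5808400 + 5 * 100 = 5808900
```

A real player would usually build a per-sample offset table once instead of walking the chunks on every lookup; the loop here is kept naive so that it mirrors the arithmetic in the text.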

The Decoding Time to Sample Box (stts) lists the time delta (i.e. the duration) of each sample.
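As a rough illustration of how the stts deltas turn into timestamps, the sketch below expands run-length (sample_count, sample_delta) entries into one decode time per sample. The entries and the timescale of 90000 are made-up values; in a real file the timescale is read from the media header (mdhd).

```python
from typing import List, Tuple

def decode_times(stts: List[Tuple[int, int]]) -> List[int]:
    """Expand (sample_count, sample_delta) entries into per-sample decode times."""
    times: List[int] = []
    t = 0
    for sample_count, sample_delta in stts:
        for _ in range(sample_count):
            times.append(t)      # decode time of this sample, in timescale units
            t += sample_delta    # the delta is the duration of this sample
    return times

# Hypothetical entries: 3 samples of 3000 ticks, then 2 samples of 1500 ticks.
stts = [(3, 3000), (2, 1500)]
timescale = 90000  # assumed ticks per second

times = decode_times(stts)
print(times)                                # [0, 3000, 6000, 9000, 10500]
print([t / timescale for t in times][:2])   # first two times in seconds
```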

The MP4 file format can also be used for video streaming by using Movie Fragment boxes. The minimum structure for such streaming is depicted in the following figure:

Important: "The fields in the objects are stored with the most significant byte first, commonly known as network byte order or big-endian format."
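To make the byte-order point concrete, here is a minimal sketch of reading one box header with Python's struct module, where the ">" prefix selects big-endian. The function name is made up for this article, and the ftyp payload is a hand-built example rather than bytes from a real file.

```python
import struct
from typing import Tuple

def read_box_header(buf: bytes, pos: int = 0) -> Tuple[str, int, int]:
    """Return (box_type, box_size, header_size) for the box starting at pos."""
    # 32-bit big-endian size followed by a 4-character type code.
    size, box_type = struct.unpack_from(">I4s", buf, pos)
    header_size = 8
    if size == 1:
        # A size of 1 means a 64-bit "largesize" follows the type.
        (size,) = struct.unpack_from(">Q", buf, pos + 8)
        header_size = 16
    elif size == 0:
        # A size of 0 means the box extends to the end of the file.
        size = len(buf) - pos
    return box_type.decode("latin-1"), size, header_size

# Hand-built 20-byte ftyp box: size, type, major brand, minor version, one brand.
ftyp = struct.pack(">I4s4sI4s", 20, b"ftyp", b"isom", 0, b"isom")

print(read_box_header(ftyp))  # ('ftyp', 20, 8)
print(ftyp[:4].hex())         # 00000014, most significant byte first
```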

 

 

If H.264 codec is used for video compression, the following structure applies:

 

 

 

ISO Viewer is an extremely useful tool for muxing and demuxing MP4 files (not streams).

 


Table 1 Box types, structure, and cross-reference (ISO/IEC 14496-12)

Indentation shows nesting; * marks a mandatory box.

Box             Mandatory  Section   Description
ftyp            *          5.3.1     file type and compatibility
moov            *          5.3.2     container for all the meta-data
  mvhd          *          5.3.4     movie header, overall declarations
  trak          *          5.3.5     container for an individual track or stream
    tkhd        *          5.3.6     track header, overall information about the track
    tref                   5.3.7     track reference container
    edts                   5.3.26    edit list container
      elst                 5.3.27    an edit list
    mdia        *          5.3.8     container for the media information in a track
      mdhd      *          5.3.9     media header, overall information about the media
      hdlr      *          5.3.10    handler, declares the media (handler) type
      minf      *          5.3.11    media information container
        vmhd               5.3.12.2  video media header, overall information (video track only)
        smhd               5.3.12.3  sound media header, overall information (sound track only)
        hmhd               5.3.12.4  hint media header, overall information (hint track only)
        dinf    *          5.3.13    data information box, container
          dref  *          5.3.14    data reference box, declares source(s) of media data in track
        stbl    *          5.3.15    sample table box, container for the time/space map
          stsd  *          5.3.17    sample descriptions (codec types, initialization etc.)
          stts  *          5.3.16.2  (decoding) time-to-sample
          ctts             5.3.16.3  (composition) time to sample
          stsc  *          5.3.18.3  sample-to-chunk, partial data-offset information
          stsz             5.3.18    sample sizes (framing)
          stz2             5.3.18    compact sample sizes (framing)
          stco  *          5.3.20    chunk offset, partial data-offset information
          stss             5.3.21    sync sample table (random access points)
          stsh             5.3.22    shadow sync sample table
          padb             5.3.24    sample padding bits
          stdp             5.3.23    sample degradation priority
  mvex                     5.3.29    movie extends box
    trex        *          5.3.30    track extends defaults
moof                       5.3.31    movie fragment
  mfhd          *          5.3.32    movie fragment header
  traf                     5.3.33    track fragment
    tfhd        *          5.3.34    track fragment header
    trun                   5.3.35    track fragment run
mdat                       5.3.3     media data container
free                       5.3.21    free space
skip                       5.3.21    free space
  udta                     5.3.28    user-data, copyright etc.

 

In theory, an MP4 file can contain PCM audio using audio object type = 0. However, there seems to be no media player that supports MP4 files with embedded PCM audio. Some media players tolerate such files by playing only the video and ignoring the audio; some players do not play such files at all.

When muxing H.264 packets into an MP4 file, it is important to omit the start code (0x00, 0x00, 0x00, 0x01); each NAL unit is stored with a length prefix instead. Many players will not play MP4 files that contain the start code.

Fragmented MP4

Fragmented MP4 (fMP4) is an international standard and is widely supported. It was originally meant to support live streaming, because it does not require the client app (the streaming recipient) to know the entirety of the stream. A regular MP4 holds precise information about the media payloads (video, audio): the exact location, duration, and size of every sample. This allows the client app (e.g., a video player) to play and seek to any position easily, but it also means a regular MP4 cannot be used for live streaming. An fMP4 consists of fragments that are independent of each other, so a player can start playing the stream from any fragment at any moment. This is great for live streaming and also for fault-tolerant recording, because an fMP4 file is a valid file at any moment, unlike a regular MP4, which requires knowledge of the entire stream (a regular MP4 is not valid until the recording ends in an orderly way).
If an fMP4 is not used for live streaming, a player can find the length of the video by reading the entire file first instead of reading and playing fragments one by one. Some players do exactly this; others do not. More importantly, most players cannot seek within a video in fMP4 form. This is the most significant caveat of using fMP4, and it is why fMP4 cannot replace regular MP4. It can still serve as a backup recording file in case the recording ends unexpectedly.
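How a player might compute the length by scanning the whole file can be sketched as follows. This is a simplified illustration, not how any particular player works: it assumes a single track, 32-bit box sizes, and sample durations stored explicitly in each Track Fragment Run Box (defaults inherited from tfhd or trex are ignored). The helper names and the test fragment are made up for this article.

```python
import struct
from typing import Iterator, Tuple

def iter_boxes(buf: bytes, start: int, end: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in buf[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        if size < 8:
            break  # 64-bit and to-end-of-file sizes are not handled here
        yield box_type, pos + 8, pos + size
        pos += size

def trun_duration(buf: bytes, pos: int) -> int:
    """Sum the explicit sample durations of the trun whose payload starts at pos."""
    ver_flags, sample_count = struct.unpack_from(">II", buf, pos)
    flags = ver_flags & 0xFFFFFF
    pos += 8
    if flags & 0x000001:
        pos += 4  # skip data_offset
    if flags & 0x000004:
        pos += 4  # skip first_sample_flags
    if not flags & 0x000100:
        return 0  # durations come from defaults, which this sketch ignores
    # Count the 32-bit fields present per sample; duration is the first one.
    fields = sum(1 for bit in (0x100, 0x200, 0x400, 0x800) if flags & bit)
    return sum(struct.unpack_from(">I", buf, pos + i * fields * 4)[0]
               for i in range(sample_count))

def total_duration(buf: bytes) -> int:
    """Total duration, in timescale units, over all fragments in the buffer."""
    total = 0
    for top_type, s0, e0 in iter_boxes(buf, 0, len(buf)):
        if top_type != b"moof":
            continue
        for traf_type, s1, e1 in iter_boxes(buf, s0, e0):
            if traf_type != b"traf":
                continue
            for box_type, s2, _e2 in iter_boxes(buf, s1, e1):
                if box_type == b"trun":
                    total += trun_duration(buf, s2)
    return total

def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Helper for the demo: wrap a payload in a box header."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Demo: two identical fragments, each with two samples of 3000 ticks.
trun = make_box(b"trun", struct.pack(">IIi", 0x000301, 2, 0)
                + struct.pack(">IIII", 3000, 10, 3000, 20))
fragment = make_box(b"moof", make_box(b"traf", trun)) + make_box(b"mdat", bytes(30))

print(total_duration(fragment + fragment))  # 12000
```

The point of the sketch is the cost: the whole file has to be walked before the length is known, which is what a regular MP4 avoids by keeping everything in one index.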

Fault-tolerant recording of this kind supports unattended continuous recording.

A Track Fragment Box contains exactly one Track Fragment Header Box and zero or more Track Fragment Run Boxes. Each Track Fragment Run Box points to a chunk of data in the Media Data Box ('mdat').

Each Track Fragment Run Box has:

unsigned int(32) sample_count

An array of sample_count entries, each with the following structure (a field is present only when the corresponding flag is set):

unsigned int(32) sample_duration

unsigned int(32) sample_size

The first Track Fragment Run Box needs to have data_offset.

For simplicity, one can have only one Track Fragment Run Box.
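As an illustration of that layout, the sketch below serializes a single Track Fragment Run Box that carries data_offset plus a duration and size for every sample. The function name and sample values are made up for this article; the flag values are the trun flags as I understand them from ISO/IEC 14496-12, and the data_offset passed in is a placeholder, since its real value is relative to a base offset that depends on the Track Fragment Header Box.

```python
import struct
from typing import List, Tuple

# trun flag bits (tr_flags), as understood from ISO/IEC 14496-12.
DATA_OFFSET_PRESENT = 0x000001
SAMPLE_DURATION_PRESENT = 0x000100
SAMPLE_SIZE_PRESENT = 0x000200

def build_trun(samples: List[Tuple[int, int]], data_offset: int) -> bytes:
    """Build a version 0 trun box from (sample_duration, sample_size) pairs."""
    version = 0
    flags = DATA_OFFSET_PRESENT | SAMPLE_DURATION_PRESENT | SAMPLE_SIZE_PRESENT
    body = struct.pack(">I", (version << 24) | flags)  # version + 24-bit flags
    body += struct.pack(">I", len(samples))            # sample_count
    body += struct.pack(">i", data_offset)             # signed 32-bit data_offset
    for duration, size in samples:
        body += struct.pack(">II", duration, size)     # one entry per sample
    # All fields are big-endian; the box size includes the 8-byte header.
    return struct.pack(">I4s", 8 + len(body), b"trun") + body

# Two hypothetical samples: (duration in timescale units, size in bytes).
trun_box = build_trun([(3000, 1200), (3000, 800)], data_offset=100)

print(len(trun_box))        # 36
print(trun_box[:12].hex())  # 000000247472756e00000301
```

Using one run per fragment, as suggested above, keeps the writer simple: the samples are laid out back to back in mdat and a single data_offset locates the first one.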

Each Track Fragment Box corresponds to an entry in the Sample To Chunk Box.

 

 

 

This article was updated on 20:59:42 2024-07-11