Streaming with MP4 Container
ISO/IEC 14496-12:2015(E) has the details for MP4, including fragmented MP4 (fMP4).
MP4 is a popular multimedia container format; many video and audio files use it. It should not be confused with the MPEG-4 video codec.
It is an instance of the more general format defined by ISO/IEC 14496-12:2004 (MPEG-4 Part 12: ISO base media file format), which is directly based on the QuickTime File Format. In my opinion, the ISO base media file format is over-engineered. ISO/IEC 14496-15 specifies how H.264 (AVC) video is carried in this file format, while ISO/IEC 14496-10 specifies the H.264 (AVC) codec itself.
Regular MP4
One key point is that the media data (e.g., video frames and audio samples) are interleaved in the mdat box and indexed by the stco box. The indices are absolute file offsets (i.e., offsets from the start of the MP4 file). Because every chunk offset must be known before the index can be written, it is impossible to create an MP4 file on the fly during streaming unless moof boxes are used.
The way a player finds media data samples in an MP4 file can be summarized as follows:
- Time to Sample Box ('stts') gives the duration of every sample. "Time to sample" means mapping times (i.e., durations or deltas) to samples. One duration can be mapped to multiple samples.
- Sample Size Box ('stsz') gives the size (i.e., the number of bytes) of every sample. At this point the player knows the timing and size of every sample, but it still does not know where the samples are located.
- Sample to Chunk Box ('stsc') gives the mapping between chunks and samples (i.e., which samples each chunk contains). "Sample to chunk" means grouping samples into chunks. Together with the sample sizes, this box determines the size of each chunk.
- Chunk Offset Box ('stco') gives the offset (i.e., the location) of each chunk. With this, the location of every sample is determined. This is the trickiest box.
With the above information, the decoder can find the chunk that holds each sample, and hence the sample's location. For example, consider the following Sample to Chunk Box (stsc) entries (first chunk, samples per chunk, sample description index):
1 23 1
2 21 1
3 20 1
...
Sample 50 is the 6th sample of chunk 3 (50 - 23 - 21 = 6). If the Chunk Offset Box (stco) gives an offset of 5808400 for chunk 3, the decoder looks up the sizes of samples 45, 46, 47, 48, and 49 and adds their sum to 5808400 to obtain the offset of sample 50.
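The lookup just described can be sketched in a few lines of Python. This is only an illustration, not a full demuxer: the function names are made up for this note, the three tables are assumed to be already parsed into plain lists, and the offsets of chunks 1 and 2 and the uniform 100-byte sample sizes are invented values.

```python
def locate_sample(stsc, sample_number, chunk_count):
    """Return (chunk_number, position_in_chunk), both 1-based.

    stsc is a list of (first_chunk, samples_per_chunk, description_index).
    Each entry applies up to the chunk before the next entry's first_chunk;
    the last entry applies through chunk_count (the number of stco entries).
    """
    first_sample = 1  # number of the first sample in the current chunk
    for i, (first_chunk, per_chunk, _desc) in enumerate(stsc):
        last_chunk = stsc[i + 1][0] - 1 if i + 1 < len(stsc) else chunk_count
        for chunk in range(first_chunk, last_chunk + 1):
            if sample_number < first_sample + per_chunk:
                return chunk, sample_number - first_sample + 1
            first_sample += per_chunk
    raise ValueError("sample number out of range")


def sample_offset(stsc, stco, stsz, sample_number):
    """Absolute file offset of a sample (sample numbers are 1-based)."""
    chunk, pos = locate_sample(stsc, sample_number, len(stco))
    first_in_chunk = sample_number - (pos - 1)
    # Sizes of the samples that precede this one inside the same chunk.
    preceding = stsz[first_in_chunk - 1 : sample_number - 1]
    return stco[chunk - 1] + sum(preceding)


# The stsc entries from the example above; everything else is hypothetical.
stsc = [(1, 23, 1), (2, 21, 1), (3, 20, 1)]
stco = [48, 2904200, 5808400]   # only the chunk 3 offset comes from the text
stsz = [100] * 64               # made-up: every sample is 100 bytes

print(locate_sample(stsc, 50, len(stco)))   # (3, 6)
print(sample_offset(stsc, stco, stsz, 50))  # 5808900
```

A real player would usually precompute the first sample number of every chunk once instead of walking the table for each lookup.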
The Decoding Time to Sample Box (stts) lists the time delta (i.e., the duration) of every sample.
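Since stts stores runs of (sample_count, sample_delta) pairs, the decoding time of a sample is the sum of the deltas of all samples before it. A minimal sketch, assuming the box is already parsed into a list of tuples; the function name and the example values are made up, and times are in the track's timescale units:

```python
def decode_time(stts, sample_number):
    """Decoding time of a 1-based sample, in timescale units.

    stts is a list of (sample_count, sample_delta) runs.
    """
    elapsed = 0
    remaining = sample_number - 1  # how many samples come before this one
    for count, delta in stts:
        if remaining < count:
            return elapsed + remaining * delta
        elapsed += count * delta
        remaining -= count
    raise ValueError("sample number out of range")


# Hypothetical table: 10 samples of 1000 ticks, then 5 samples of 2000 ticks.
stts = [(10, 1000), (5, 2000)]
print(decode_time(stts, 12))  # 12000
```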
The MP4 file format can also be used for video streaming by using Movie Fragment boxes. The minimum structure for such streaming is depicted in the following figure:
Important: "The fields in the objects are stored with the most significant byte first, commonly known as network byte order or big-endian format."
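To make the byte-order point concrete, here is a small sketch that walks the top-level boxes of a buffer. Every box starts with a 32-bit big-endian size followed by a four-character type; a size of 1 means a 64-bit size follows, and a size of 0 means the box runs to the end of the data. The function name and the sample bytes are made up for illustration.

```python
import struct


def iter_boxes(data, start=0, end=None):
    """Yield (type, offset, size) for each box found in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # ">" selects big-endian (network byte order) in the struct module.
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:      # 64-bit "largesize" follows the type field
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:    # box extends to the end of the data
            size = end - pos
        if size < header:
            raise ValueError("corrupt box size at offset %d" % pos)
        yield box_type.decode("latin-1"), pos, size
        pos += size


# A made-up 24-byte buffer: a 16-byte 'ftyp' box followed by an empty 'moov'.
data = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0) + struct.pack(">I4s", 8, b"moov")
print(list(iter_boxes(data)))  # [('ftyp', 0, 16), ('moov', 16, 8)]
```

Container boxes such as moov can be walked the same way by calling the function again on the range just past the container's header.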
If H.264 codec is used for video compression, the following structure applies:
ISO Viewer is an extremely useful tool for muxing and demuxing MP4 files (not streams).
Table 1 Box types, structure, and cross-reference (ISO/IEC 14496-12)
| Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Mandatory | Section | Description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ftyp | | | | | | * | 5.3.1 | file type and compatibility |
| moov | | | | | | * | 5.3.2 | container for all the meta-data |
| | mvhd | | | | | * | 5.3.4 | movie header, overall declarations |
| | trak | | | | | * | 5.3.5 | container for an individual track or stream |
| | | tkhd | | | | * | 5.3.6 | track header, overall information about the track |
| | | tref | | | | | 5.3.7 | track reference container |
| | | edts | | | | | 5.3.26 | edit list container |
| | | | elst | | | | 5.3.27 | an edit list |
| | | mdia | | | | * | 5.3.8 | container for the media information in a track |
| | | | mdhd | | | * | 5.3.9 | media header, overall information about the media |
| | | | hdlr | | | * | 5.3.10 | handler, declares the media (handler) type |
| | | | minf | | | * | 5.3.11 | media information container |
| | | | | vmhd | | | 5.3.12.2 | video media header, overall information (video track only) |
| | | | | smhd | | | 5.3.12.3 | sound media header, overall information (sound track only) |
| | | | | hmhd | | | 5.3.12.4 | hint media header, overall information (hint track only) |
| | | | | dinf | | * | 5.3.13 | data information box, container |
| | | | | | dref | * | 5.3.14 | data reference box, declares source(s) of media data in track |
| | | | | stbl | | * | 5.3.15 | sample table box, container for the time/space map |
| | | | | | stsd | * | 5.3.17 | sample descriptions (codec types, initialization etc.) |
| | | | | | stts | * | 5.3.16.2 | (decoding) time-to-sample |
| | | | | | ctts | | 5.3.16.3 | (composition) time to sample |
| | | | | | stsc | * | 5.3.18.3 | sample-to-chunk, partial data-offset information |
| | | | | | stsz | | 5.3.18 | sample sizes (framing) |
| | | | | | stz2 | | 5.3.18 | compact sample sizes (framing) |
| | | | | | stco | * | 5.3.20 | chunk offset, partial data-offset information |
| | | | | | stss | | 5.3.21 | sync sample table (random access points) |
| | | | | | stsh | | 5.3.22 | shadow sync sample table |
| | | | | | padb | | 5.3.24 | sample padding bits |
| | | | | | stdp | | 5.3.23 | sample degradation priority |
| | mvex | | | | | | 5.3.29 | movie extends box |
| | | trex | | | | * | 5.3.30 | track extends defaults |
| moof | | | | | | | 5.3.31 | movie fragment |
| | mfhd | | | | | * | 5.3.32 | movie fragment header |
| | traf | | | | | | 5.3.33 | track fragment |
| | | tfhd | | | | * | 5.3.34 | track fragment header |
| | | trun | | | | | 5.3.35 | track fragment run |
| mdat | | | | | | | 5.3.3 | media data container |
| free | | | | | | | 5.3.21 | free space |
| skip | | | | | | | 5.3.21 | free space |
| | udta | | | | | | 5.3.28 | user-data, copyright etc. |
In theory, an MP4 file can contain PCM audio by using audio object type = 0. However, there seems to be no media player that supports MP4 files with embedded PCM audio. Some media players tolerate such files by playing only the video and ignoring the audio; some players do not play such files at all.
When muxing H.264 packets into an MP4 file, it is important to omit the start code (0x00, 0x00, 0x00, 0x01); in MP4, each NAL unit is preceded by its length instead. Many players will not play MP4 files that contain start codes.
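As an illustration of dropping the start codes, the sketch below splits a start-code-delimited (Annex B) H.264 byte stream into NAL units and writes each one with a length prefix. It assumes a 4-byte big-endian length field, which is a common choice but is actually declared in the track's sample description, and the function names and input bytes are made up. In a real muxer, parameter sets (SPS/PPS) are typically stored in the sample description rather than left in the samples.

```python
import struct


def split_annexb(stream):
    """Split an Annex B byte stream into NAL units without start codes."""
    starts = []
    i = stream.find(b"\x00\x00\x01")
    while i != -1:
        starts.append(i)
        i = stream.find(b"\x00\x00\x01", i + 3)
    nals = []
    for k, s in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else len(stream)
        # A NAL unit never ends in 0x00, so trailing zeros here are either
        # padding or the extra leading zero of a 4-byte start code.
        nal = stream[s + 3 : end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
    return nals


def to_length_prefixed(stream):
    """Replace start codes with 4-byte big-endian NAL unit lengths."""
    return b"".join(struct.pack(">I", len(n)) + n for n in split_annexb(stream))


# Made-up input: two tiny NAL units, one per start-code style.
annexb = b"\x00\x00\x00\x01\x67\x42" + b"\x00\x00\x01\x68\xce"
print(to_length_prefixed(annexb).hex())  # 0000000267420000000268ce
```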
Fragmented MP4
Fragmented MP4 (fMP4) is an international standard and is widely supported. It was originally meant to support live streaming because it does not require the client app (the streaming recipient) to know the entirety of the stream. A regular MP4 carries precise information about the media payloads (video, audio): the exact location, duration, and size of every sample. This allows the client app (e.g., a video player) to play and seek to any position easily, but it also means a regular MP4 cannot be used for live streaming. fMP4 fragments are independent of each other, so a player can start playing the stream from any fragment at any moment. This is great for live streaming and also for fault-tolerant recording, because an fMP4 file is valid at any moment, unlike a regular MP4, which requires knowledge of the entire stream (a regular MP4 is not valid until the recording ends in an orderly way).
If an fMP4 file is not used for live streaming, a player can find the length of the video by reading the entire file first instead of reading and playing fragments one by one. Some players do exactly this; others do not. Most importantly, most players cannot seek within a video in fMP4 form. This is the most significant caveat of using fMP4, and it is why fMP4 cannot replace regular MP4. It can, however, be used as a backup recording file in case the recording ends unexpectedly.
Fault-tolerant recording of this kind supports unattended continuous recording.
A Track Fragment Box contains exactly one Track Fragment Header Box and zero or more Track Fragment Run Boxes. Each Track Fragment Run Box points to a chunk of data in the Media Data Box ('mdat').
Each Track Fragment Run Box has:
unsigned int(32) sample_count
An array of sample_count entries with the following structure:
    unsigned int(32) sample_duration
    unsigned int(32) sample_size
The first Track Fragment Run Box needs to have data_offset.
For simplicity, one can use only one Track Fragment Run Box.
Each Track Fragment Box corresponds to an entry in the Sample To Chunk Box.
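As a rough sketch of the layout above, the following builds a single Track Fragment Run Box ('trun') that carries data_offset plus a duration and size for each sample. The function name and the sample values are made up. The flag constants are the ones the standard defines for signalling which optional fields are present, but check them against the spec before relying on this.

```python
import struct

# tr_flags bits that announce the optional fields used below.
DATA_OFFSET_PRESENT = 0x000001
SAMPLE_DURATION_PRESENT = 0x000100
SAMPLE_SIZE_PRESENT = 0x000200


def build_trun(samples, data_offset):
    """Serialize a version-0 'trun' box.

    samples is a list of (sample_duration, sample_size) pairs.
    """
    flags = DATA_OFFSET_PRESENT | SAMPLE_DURATION_PRESENT | SAMPLE_SIZE_PRESENT
    # One 32-bit word holds the 8-bit version (0 here) and the 24-bit flags.
    body = struct.pack(">I", flags)
    # sample_count is unsigned; data_offset is a signed 32-bit integer.
    body += struct.pack(">Ii", len(samples), data_offset)
    for duration, size in samples:
        body += struct.pack(">II", duration, size)
    # Box header: total size (including the 8-byte header) and the type.
    return struct.pack(">I4s", 8 + len(body), b"trun") + body


# Hypothetical run of two samples whose data starts 100 bytes into the fragment.
box = build_trun([(3000, 1200), (3000, 800)], data_offset=100)
print(len(box), box[4:8])  # 36 b'trun'
```

In a real fragment the data_offset value depends on the size of the enclosing moof box, so it is usually filled in after the rest of the fragment has been laid out.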