Wireshark dump

There’s a lot of info in the Wireshark dump, so the first thing to do is to filter the address to my IP cam’s IP so we only see what it sends/receives.

We can see there’s some sort of ping-pong going on at first.

sequenceDiagram
participant c as Client
participant s as IP Cam (Server)

c->>s: DESCRIBE
s->>c: Reply

c->>s: SETUP
s->>c: Reply

c->>s: PLAY
s->>c: Reply

s-->>c: Video Data

The protocol tab of wireshark informs us that this comms is RTSP 1.0.

Note: This capture was done by running the retina crate, which is what I’m contributing to. It only supports RTSP 1.0 which is why we’re going to be looking at RFC 2326, not RFC 7826 which is RTSP 2.0.

Looking at RTSP comms

DESCRIBE

Nothing too special about this packet except the fact that we’re asking to receive the content in application/sdp.

SETUP

Interesting to note that retina has hard-coded the Transport option as

RTP/AVP/TCP;unicast;interleaved={proposed_channel_id}-{proposed_channel_id + 1}

PLAY

The interesting part here is sending npt=0.000-, which I guess means start playing video immediately (0.000) & continuously (-).

Analyzing the first RTP payload

Once the camera replies to the PLAY request, it starts sending out video data i.e. H.264 inside RTP packets. We can now filter wireshark capture by setting ip.src_host == {camera_ip} so we see one-way comms from IP camera to my system, and inspect the RTP payload.

H.264 video in RTP packets

We can use RFC 6184 to understand what the payload of the RTP packets will be.

From Section 5.2:

A receiver can identify the payload structure by the first byte of the RTP packet payload

and also,

This byte is always structured as a NAL unit header. The NAL unit type field indicates which structure is present.

The first byte of the RTP payload will tell us what type of payload structure it is. Read Section 5.2 to understand the three types of payload structures.

RTP payload

Here’s the first few bytes

0000   3c 87 4d 00 1f e7 40 28 02 dd 80 a5 05 05 05 f0
0010   00 00 03 00 10 00 00 03 02 8b 01 00 02 dc 68 00
0020   02 25 51 7f ff 02 80 00 00 00 01 28 ee 3c 80 00
0030   00 00 01 25 b8 40 00 10 5b fc 84 a5 08 ec d2 43
0040   b6 58 0b ec 17 2e 74 65 73 80 73 90 8b d5 7a f5
0050   9e ea 8c 8e 55 09 80 cf 50 19 4b 8a 08 ac 2f 3c

First byte of RTP payload

As mentioned in the RFC, we’ll analyze the first byte (0x00) which is 3c.

3c in binary is 00111100.

The RFC told us that this is always structured as a NAL unit header, which is defined in Section 1.3 as:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

The first byte represented as a NAL unit header.

+---------------+
|0|0|1|1|1|1|0|0|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

Payload structure identification

We’re interested in the Type value, which is 11100. This in decimal is 28, which is, acc. to Table 1 of Section 5.2, a FU-A packet.

Second byte of RTP payload

Now we know that the first RTP packet is of type FU-A, we can look at its doc to move forward. The format is defined in Figure 14 as:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

One key point of FU-A packets:

Fragmentation is defined only for a single NAL unit and not for any aggregation packets.

Meaning a FU-A packet (which can consist of multiple packets i.e. fragmented) can contain data of only a single NAL unit, not multiple.

^ Remember that.

Now, we’ve already read the first byte, which is the FU indicator, let’s read the second byte i.e. FU header.

The second byte (0x01) is 87, which in binary is 10000111.

Format of FU header:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

S (start bit)
E (end bit)
R (reserved bit)
Type (NALU payload type) <-- interest

Our second byte as FU header is:

+---------------+
|1|0|0|0|0|1|1|1|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

Since it is the first fragment of FU-A, the start bit is 1, and the NAL payload type is 00111, which in decimal is 7.

The NAL unit type of 7 is for SPS.

Now we know the following:

  • The packet is a FU-A packet
  • The FU-A packet has the header of SPS, meaning the data after FU header is supposed to be entirely for SPS

Let’s start reading the packet further.

Reading next bytes

Emulation Prevention Bytes (EPBs)

There are emulation prevention bytes (EPBs) at byte 0x12 and 0x17.

Start code

There is a start code at 0x27 till 0x2a i.e. 0x00_00_00_01.

Next NALU (PPS header & data)

At 0x2b, we get the header of next NALU 0x28, or 00101000, where type of NALU is 01000, which is 8 i.e. PPS. Subsequent bytes are PPS data.

Start code

There is a start code at 0x2f till 0x32 i.e. 0x00_00_00_01.

Next NALU header (IDR header & data)

At 0x33 is next NALU header 25, or 00100101, where type of NALU is 00101, which is 5 for IDR. Subsequent bytes are IDR data.

How FU-As should be

Notice that the FU-A packets are fragmented. And remember when I quoted the RFC which explicitly said that:

Fragmentation is defined only for a single NAL unit and not for any aggregation packets.

This means that when we have received a FU-A packet i.e. we receive all fragmented packets for that FU-A, starting from the first packet with start bit set to 1, until the final packet with end bit set to 1, will be of a single NALU.

FU-A example

[3c 87 ...] # first fragmented packet of FU-A
[3c 07 ...], [3c 07 ..], .. , [3c 07] # middle fragmented packets
[3c 47 ...] # last fragmented packet of FU-A

The payload of the entire FU-A packet should be of a single NALU, which is made clear by the last digit in second byte 7, meaning these packets were for SPS.

What our FU-A is

So we know what the payload of the FU-A should be, i.e. data for SPS.

However

We’ve received an Annex B stream in the FU-A packet, which contains SPS, PPS and IDR slice.

Normally, IDR slices are big so they’re commonly sent over FU-A packets, but my camera is bundling up SPS & PPS with the IDR slice into an Annex B stream, and sending that through FU-A, which acc. to the RFC, ain’t what we was supposed to do dawg.

This is essentially what we’re receiving.

[
3c           # FU-A
87           # start of FU-A for SPS
..           # SPS data
(start code) # end of last NALU
28           # single NALU for PPS
..           # PPS data
(start code) # end of last NALU
25           # single NALU for IDR
..           # IDR data
]

[
3c           # FU-A
07           # middle packets of FU-A for SPS
..           # IDR data
]

[
3c           # FU-A
47           # end of FU-A for SPS
..           # IDR data
]

Funny how the FU-A fragments after the first one still say that they’ll have SPS, but they instead have IDR slice since that’s the last NALU in the Annex B stream.

Conclusion

The camera is not conforming to the RFC, which is why retina isn’t able to work. I’ll need to come up with a solution for parsing the Annex B stream in FU-A packets so retina works with my camera.

Spoiler alert: I did make a PR before but the code didn’t feel good, which is why I’m gonna attempt doing it again.

Update: Opened up a PR for adding support for my camera.