Pico PIO Camera

Hooking up an OV7670 to Pico's PIO

This is a lengthy post which goes into lots of detail. The tl;dr is that the code is here: https://github.com/usedbytes/camera-pico-ov7670

In this post I’ll describe how I went about connecting an OV7670 camera module to the Raspberry Pi Pico that drives my Pi Wars at Home 2022 ‘bot, M0o+. It’s all wrapped up in a separate library which should be easy-ish to incorporate into other projects.

Trying to get this camera working was the very first thing I did, as a sort of Pico “feasibility study” to decide if entering a Pico-based robot was practical. Initially I just bit-banged the interface and did get some results without too much effort, but things have escalated a fair amount since then:

OmniVision OV7670

The OmniVision OV7670 is a 640x480 (0.3 megapixels) CMOS camera, which dates from 2005(!). It’s apparently now discontinued but despite this is still readily available from all the usual places you’d look for cheap breakout boards.

You can get a module with an OV7670, a lens assembly, the necessary voltage regulators and 0.1" headers for less than £5 from a UK seller on eBay.

The interface to the camera consists of an i2c bus for configuring things like resolution and white balance, and a streaming parallel video interface for outputting the image from the sensor.

There are dozens of pages around the internet on interfacing this module, so I’ll focus on just the information that’s needed to make sense of the rest of this post.

Video interface

The video interface can be split into two parts:

  • The synchronisation signals: VSYNC, HREF and PCLK, which need to be used to keep track of which part of the frame the data is for.
  • The pixel data, which comes out on 8 signals, D0-D7. One byte of data is transferred for each PCLK cycle.

The camera “pushes” the images from the sensor continuously, at a fixed rate, whether the receiver is ready for them or not. With a bare OV7670 chip there’s no way to “pull” the frame at your own pace, because the sensor doesn’t have anywhere to store the pixel data¹. This means on the receiving side, you have to “keep up” with the data coming from the camera.

To receive a frame from the camera, the process is something like this:

for each frame:
   wait_for_falling_edge(VSYNC)
   for each line in frame:
      wait_for_rising_edge(HREF)
      for each byte in line:
         wait_for_rising_edge(PCLK)
         byte = read_pins_parallel(D0, 8)

Frame timing diagram

Timing diagram created with WaveDrom

The meaning of each byte depends on the configured pixel format. There are a few options available, but I only care about RGB565 and YUYV. In both of these formats, two bytes are used for each pixel, so the number of PCLK cycles in each line of the image is width * 2. However, for YUYV you need a complete run of 4 bytes to reconstruct 2 pixels, whereas for RGB565 you can take each 2-byte pixel individually.

For RGB565:

Signal   Byte 0   Byte 1
D7       R4       G2
D6       R3       G1
D5       R2       G0
D4       R1       B4
D3       R0       B3
D2       G5       B2
D1       G4       B1
D0       G3       B0

For YUYV:

Signal   Byte 0   Byte 1   Byte 2   Byte 3
D7..D0   Y0       U01      Y1       V01

YUYV Detour

A quick detour to YUV if you’re not familiar. This will be important when we get to packing/unpacking of the pixel data.

YUV is just a different way to represent colours. Where RGB splits each pixel into “redness”, “greenness” and “blueness”, YUV instead uses:

  • Y: Luminance/Luma - The brightness of the pixel
  • U/Cb: Chroma Blue - Blueness minus Luma
  • V/Cr: Chroma Red - Redness minus Luma

YUYV is a specific way of storing a YUV image. Because human eyes are much more sensitive to changes in brightness/luminance than they are to changes in colour/chrominance, it’s very common to throw away either 50% or 75% of the chroma samples as a crude form of data compression.

YUYV is a “4:2:2” YUV format, meaning 50% of the chroma data is discarded. Each pixel has its own Y value, but each pair of pixels in a line share a pair of U/V values. So in each set of 4 bytes received from the camera, we get 2 bytes of Y data, one for each of two consecutive pixels; and one byte each of U and V data, which apply to both of the pixels in the pair.

The sample layout in 4 YUYV pixels

In a robotics context, YUV can be useful because if you just take the Y values, then you have a decent grayscale image, for something like line following; and if you take either or both of U and V, then you can easily identify colors (more on that in my earlier post about Mini Mouse).

Converting between RGB and YUV is possible, but it means a few multiplications and additions per pixel, so if you want one or the other, it saves some processing if you can ask for the appropriate format from the camera directly.

Register Settings

The last point I want to make on the OV7670 itself is how much of a nightmare it is to configure. The datasheets/integration guides which are available on the internet don’t provide enough information to effectively configure the camera, and in some cases contradict each other.

There are various sources of “golden” register values, which I believe are sourced from OmniVision themselves, and they all contain tons of values and register addresses which either aren’t documented, or are marked as “reserved” in the data sheet(s).

After much frustrating messing around, I settled on using Adafruit’s reference values from their driver, which is a fairly minimal set and gives me the functionality I need (downscaling to 80x60 and both YUYV and RGB565 output).

Hooking up the Pico and PIO

PIO on the Pico has totally flexible pin-mapping, so the camera can really be wired to whatever pins you want. However, for parallel input the 8 data pins should be wired to a consecutive run of 8 GPIOs, so that they can be shifted in using a single in PINS, 8 PIO instruction.

The other catch is that the OV7670 needs an external clock source, fed into the XCLK pin. You can generate that on the Pico in any number of ways, but I opted to use one of the GPOUT clock generators in the RP2040, and the only pin which can be used for this on a Pico is GP21, with clk_gpout0.

Connections for a single parallel-input SM

Converting the little frame program pseudocode above to PIO code is quite straightforward:

for each frame:
   wait_for_falling_edge(VSYNC)
   for each line in frame:
      wait_for_rising_edge(HREF)
      for each byte in line:
         wait_for_rising_edge(PCLK)
         byte = read_pins_parallel(D0, 8)

We need a counter for the number of lines, and a counter for the number of bytes in a line, which we’ll send in via the TX FIFO. Then we just loop over lines and bytes, pushing the data into the RX FIFO:

.program camera_parallel
.wrap_target
pull                          ; Pull number of lines from FIFO
out Y, 32                     ; Store number of lines in Y
pull                          ; Pull bytes-per-line from FIFO
                              ; Note: We leave bytes-per-line in OSR, to reload X each line 

wait 1 pin PIN_OFFS_VSYNC     ; Wait for VSYNC to go high, signalling frame start

loop_line:                    ; For each line in frame
mov X, OSR                    ; Reload X with bytes-per-line
wait 1 pin PIN_OFFS_HREF      ; Wait for start of line

loop_byte:                    ; For each byte in line
wait 0 pin PIN_OFFS_PCLK      ; Wait for PCLK to go low
wait 1 pin PIN_OFFS_PCLK      ; Wait for PCLK to go high (rising edge)
in PINS, 8                    ; Shift in 1 byte of data
jmp x-- loop_byte             ; Next byte

wait 0 pin PIN_OFFS_HREF      ; Wait for end of line

jmp y-- loop_line             ; Next line

.wrap                         ; Next frame

Aside: If “PIO v2” ever comes around, I think a “wait for edge” instruction would be a good addition.

This needs corresponding code for the CPU to load the programs, set up the PIO, drain the RX FIFO, etc. Instead of reading the data from the FIFOs with the CPU directly, we can use DMA to copy from the PIO to a buffer in memory.

All that is left as an exercise for the reader I’m afraid, because I don’t have a decent copy of that implementation any more. My final full code is on GitHub but it’s significantly different from the program above for reasons described below.

Saving Pins

The approach above works great, but it uses loads of pins on the Pico, a total of 14!

  • SCL
  • SDA
  • XCLK - Must be GP21
  • VSYNC
  • HREF
  • PCLK
  • D0..D7 - Must be contiguous

I can’t really afford to spend this many pins on my Pico, when I’ve also got multiple motors to drive and other sensors to talk to. So, I’ve added a 74165 parallel-in, serial-out shift register, which turns the 8 data pins into 1 data pin and 1 clock pin: a net saving of 6 pins.

I did try using the XCLK signal as the shift register clock, rather than driving the shift register clock from a separate Pico pin – which would have saved an additional pin – but I couldn’t get the timing to work reliably, and my logic analyser isn’t fast enough to debug it.

I haven’t drawn a proper schematic, but this should give you the idea:

OV7670 Module   74165     Pico            Pico (example)
SIOC            -         User decision   1
SIOD            -         User decision   0
VSYNC           -         base_pin + 2    18
HREF            -         base_pin + 3    19
PCLK            SH/nLD    base_pin + 4    20
XCLK            -         User decision   21
D7              A         -               -
D6              B         -               -
D5              C         -               -
D4              D         -               -
D3              E         -               -
D2              F         -               -
D1              G         -               -
D0              H         -               -
-               CLK       base_pin + 1    17
-               Qh        base_pin        16
-               CLK_INH   GND             -

Connections for a single serial-input SM

So what does this change in the PIO code? Well, instead of getting a whole byte of data with a single in PINS 8 instruction, now we’re going to need to manually shift the data from the external shift register, one bit at a time.

The program above is already using the two registers Y and X for the line and byte counters, so there isn’t a spare register to use to count bits. For now, let’s get the idea by just repeating the “read a bit” code 8 times, using side-set to control the shift register CLK pin:

Confession: I haven’t actually tested these PIO snippets, but I hope they serve as useful illustrations

.program camera_serial_unrolled_loop
.side_set 1 opt                  ; Use side-set bit 0 to drive CLK on the shift register
.wrap_target
pull                             ; Pull number of lines from FIFO
out Y, 32                        ; Store number of lines in Y
pull                             ; Pull bytes-per-line from FIFO
                                 ; Note: We leave bytes-per-line in OSR, to reload X each line 

wait 1 pin PIN_OFFS_VSYNC        ; Wait for VSYNC to go high, signalling frame start

loop_line:                       ; For each line in frame
mov X, OSR                       ; Reload X with bytes-per-line
wait 1 pin PIN_OFFS_HREF         ; Wait for start of line

loop_byte:                       ; For each byte in line

wait 1 pin PIN_OFFS_PCLK         ; Wait for PCLK to go high (otherwise inputs are "transparent")
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 0, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 1, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 2, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 3, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 4, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 5, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 6, clear CLK
nop side 1                       ; side set CLK
in PINS, 1 side 0                ; Grab bit 7, clear CLK
wait 0 pin PIN_OFFS_PCLK side 0  ; Wait for PCLK to go low

jmp x-- loop_byte                ; Next byte

wait 0 pin PIN_OFFS_HREF         ; Wait for end of line

jmp y-- loop_line                ; Next line

.wrap                            ; Next frame

This obviously wastes a lot of code space (and is a bit ugly). Instead we can use a second State Machine (SM) for the “inner” byte loop, giving us a brand new X register to count bits in. This is also useful for other reasons, explained later.

So:

  • SM0 - Handles the “frame” timing
  • SM1 - Just shifts in individual bytes, controlled by SM0

Connections for using two SMs for serial-input

The SM0 program is nearly the same as before, but inside the loop_byte loop it just triggers (and waits for) SM1 using IRQ 5:

.program camera_serial_sm_trigger
.wrap_target
pull                          ; Pull number of lines from FIFO
out Y, 32                     ; Store number of lines in Y
pull                          ; Pull bytes-per-line from FIFO
                              ; Note: We leave bytes-per-line in OSR, to reload X each line 

wait 1 pin PIN_OFFS_VSYNC     ; Wait for VSYNC to go high, signalling frame start

loop_line:                    ; For each line in frame
mov X, OSR                    ; Reload X with bytes-per-line
wait 1 pin PIN_OFFS_HREF      ; Wait for start of line

loop_byte:                    ; For each byte in line
irq wait 5                    ; Trigger SM1 to shift in byte, and wait for it to finish the byte
jmp x-- loop_byte             ; Next byte

wait 0 pin PIN_OFFS_HREF      ; Wait for end of line

jmp y-- loop_line             ; Next line

.wrap                         ; Next frame

The SM1 program uses the X register to loop 8 times, but is otherwise similar to before. Things get a little more complex here, using IRQ flags, side-set and delay cycles, but it’s hopefully still understandable:

Note that wait 0 irq 4 rel is used, so that this same code can be used on different state machines. More on that below.

.program camera_serial_byte_loop
.side_set 1 opt                  ; Use side-set bit 0 to drive CLK on the shift register

.wrap_target
set X, 7                         ; Load the bit counter
wait 0 irq 4 rel                 ; Wait to be triggered by IRQ (4 + SM number), but don't clear it

wait 1 pin PIN_OFFS_PCLK         ; Wait for PCLK to go high (otherwise inputs are "transparent")

nop side 1 [1]                   ; side set CLK
loop_bit:
in PINS, 1 side 0 [0]            ; Grab a bit of data, clear CLK
jmp x-- loop_bit side 1 [1]      ; Next bit

wait 0 pin PIN_OFFS_PCLK side 0  ; Wait for PCLK to go low
irq set 4 rel                    ; Tell the frame loop that the byte is finished (relative, like the wait above)

.wrap

This uses 9 fewer instructions in total than the single SM “unrolled” version, but uses up 2 State Machines.

I refer to the Raspberry Pi Pico C/C++ SDK documentation and plead innocence:

There is no need to feel guilty about dedicating a state machine solely to a single I/O task, since you have 8 of them!

Multi-plane data

The other reason for using multiple state machines is that it lets us easily use different buffers for the Y, U and V data. This is handy because, if you only want a greyscale image, you could ignore the U/V data entirely.

Or, if you only want the V data to look for “red” things (spoiler alert), then you can get the PIO/DMA to put all that V data into its own buffer, and you don’t need to worry about skipping over the Ys and Us in your image processing code.

To do this, we tweak the above SM0 program ever so slightly, so that instead of always triggering SM1 for each byte, we can replace that loop with different code for different data layouts, triggering different SMs.

Connections for using four SMs for multi-plane serial-input

Importantly, this means that the X counter in SM0 doesn’t count bytes any more, it counts “chunks”, where the meaning of a “chunk” depends on the pixel format. For YUYV, a “chunk” would be 4 bytes, whereas for RGB565 it might be only 2 bytes.

Using a separate state machine for each of Y, U and V, giving three separate buffers, SM0 looks like this (note “chunks-per-line”, not “bytes-per-line”):

.program camera_serial_sm_chunk_trigger
.wrap_target
pull                          ; Pull number of lines from FIFO
out Y, 32                     ; Store number of lines in Y
pull                          ; Pull chunks-per-line from FIFO
                              ; Note: We leave chunks-per-line in OSR, to reload X each line 

wait 1 pin PIN_OFFS_VSYNC     ; Wait for VSYNC to go high, signalling frame start

loop_line:                    ; For each line in frame
mov X, OSR                    ; Reload X with chunks-per-line
wait 1 pin PIN_OFFS_HREF      ; Wait for start of line

loop_chunk:                   ; For each chunk in line
irq wait 5                    ; Trigger SM1 to shift in Y0
irq wait 6                    ; Trigger SM2 to shift in U01
irq wait 5                    ; Trigger SM1 to shift in Y1
irq wait 7                    ; Trigger SM3 to shift in V01
jmp x-- loop_chunk            ; Next chunk

wait 0 pin PIN_OFFS_HREF      ; Wait for end of line

jmp y-- loop_line             ; Next line

.wrap                         ; Next frame

Because the camera_serial_byte_loop shift register program above used a relative IRQ number in wait 0 irq 4 rel, we can use that exact same code on state machines 1, 2 and 3 to handle the Y, U and V data respectively. SM1 gets triggered twice per chunk, because there are 2 Luma samples per chunk.

If instead we wanted to put YUYV data into a single buffer, we can use just SM0 and SM1 and replace the pixel loop with:

loop_chunk:                   ; For each chunk in line
irq wait 5                    ; Trigger SM1 to shift in Y0
irq wait 5                    ; Trigger SM1 to shift in U01
irq wait 5                    ; Trigger SM1 to shift in Y1
irq wait 5                    ; Trigger SM1 to shift in V01
jmp x-- loop_chunk            ; Next chunk

Or separating Y from UV:

loop_chunk:                   ; For each chunk in line
irq wait 5                    ; Trigger SM1 to shift in Y0
irq wait 6                    ; Trigger SM2 to shift in U01
irq wait 5                    ; Trigger SM1 to shift in Y1
irq wait 6                    ; Trigger SM2 to shift in V01
jmp x-- loop_chunk            ; Next chunk

Or RGB565, with only 2 bytes per chunk instead of 4:

loop_chunk:                   ; For each chunk in line
irq wait 5                    ; Trigger SM1 to shift in Byte 0
irq wait 5                    ; Trigger SM1 to shift in Byte 1
nop                           ; Padding, to keep the loop the same length as the YUYV versions
nop
jmp x-- loop_chunk            ; Next chunk

In fact, we can patch that loop_chunk loop at runtime depending on which pixel format the code wants!

Final Code

With all that background above on how the code works, you should be well equipped to understand the eventual implementation at https://github.com/usedbytes/camera-pico-ov7670.

I’ve tried to make that a library which can be easily integrated into other Pico projects, but I haven’t put much effort into making it particularly configurable - for example the resolution is fixed at 80x60 pixels, and it doesn’t have the “parallel input” implementation, only serial input using an external shift register.

Feel free to submit patches to address some of the shortcomings!

In particular, adding the necessary code to support parallel input would be really simple based on the code above. I don’t have a board wired up for that circuit any more, but maybe I’ll come back to that another day.

Integration woes

I did all of the development of this code on a very densely populated breadboard, and it worked great!

Densely packed breadboard with ESP32, Pico and camera

When I came to transfer this over to the robot, I ran into tons of problems with signal integrity. Apparently, in all my frustration/despair, I didn’t save a single picture showing the problem!

I’m still not 100% sure what went wrong, but I re-soldered the circuit 4 times, which was super tedious when there are 20+ signals to wire on stripboard by hand. I also tried various different cables (including plugging in directly) to connect the camera to the board.

Wrapping a ribbon cable in foil (not electrically connected) seemed to work, but in the end I’ve settled on using separated jumper wires, and lots of fiddling with PIO delay cycles and OV7670 clock rates, to get a set-up which is working reliably.

I wasted nearly 2 weeks of making time just getting the camera to work “in-situ” on the robot, so I’m not terribly motivated to sink more time into it just now. However, with those problems solved I was able to get the robot to follow a red blob, which was super satisfying :-)

I’ll write more about the image processing as I start to come up with solutions for the Pi Wars challenges.


  1. There are similar, more expensive modules which include an FPGA and a framebuffer RAM, which you can then “pull” the frame from using SPI, but with PIO there should be no need for this: the Pico is fast enough to capture the streaming data from the sensor. ↩︎