cat2mqtt: Quick beginner-friendly custom object detection with TensorFlow/Keras

Beginner tutorial on quickly creating custom model and detecting objects in a video stream using TensorFlow/Keras, Teachable Machine and Python

Since I found it a bit difficult to find instructions for really simple object detection using something like TensorFlow, Keras, OpenCV, or PyTorch, I've decided to write up my approach.

tl;dr: Use https://teachablemachine.withgoogle.com/ to quickly get a working model, then use the provided code sample and adjust it to your needs. Unless you have a very specific need not covered by Teachable Machine, this is the fastest approach.

The Problem

My cat (Lara) loves to sleep on the bedroom windowsill; it's in the top 3 of her favorite places to stay, and it's the only one not in my view when working from home. If I'm tied up in meetings or similar, I would like to be able to quickly know whether she's there or not. While this isn't information I absolutely must have, I thought it would be a great exercise in determining whether something like a pet is present in a physical location.

Ideally this would also integrate with Home Assistant, whose state I can then show on my Stream Deck. To achieve this I tried a few things:

Attempts

Attempt #1: Add a pressure sensor

Could I modify a simple ZigBee door sensor to work with a car seat pressure sensor and just detect whether she's on it or not? If you thought that would be a simple solution, you're very wrong.

The pressure sensor is very unreliable, because she sleeps in various positions, moves around, or sometimes just distributes her weight differently. Even putting the sensor in her most frequented spot didn't always yield good results. I have a heatmap of her movements, so I know the optimal position to put it in:

The same problems apply to other similar sensor approaches.

Attempt #2: Add a camera

As you can see from the previous image, this is exactly what I did. And this obviously works for seeing whether she's there or not, but it has a few downsides:

  • It requires opening the stream and determining the state "manually".
  • It cannot be integrated into Home Assistant to return a binary True or False state.
  • It cannot be integrated with the Stream Deck either, and even if it could, the small button size would make the picture hard to see.

That brings me to the obvious next step:

Attempt #3: Create a custom machine learning model trained to recognize if there's a cat on the windowsill or not based on the camera RTSP stream

Because why not.

I haven't had much experience with machine learning and object detection. While I wasn't interested in doing a deep dive, I wanted to quickly verify how well this approach works. So I started with a simple goal: "Quickly have code running that uses some machine learning library to determine, from the RTSP stream image, whether my cat is on the windowsill or not." Already at the first word, "quickly", there seemed to be a lot of roadblocks.

While machine learning has progressed a lot in recent years, most of the tutorials and setups are still somewhat complex and not beginner-friendly. As I was looking into how to do this, I stumbled upon a few libraries and platforms and looked into them, but I never found anything with a really simple description of how to achieve what I wanted.

I have some pictures of her on the windowsill, and photos of it empty - how difficult could it be to generate a model from those? Things like TensorFlow seem very complex for generating models; there are a lot of tutorials on the topic, but I found them too complex for my use case. I found the official TensorFlow tutorial for this with TensorFlow Lite, but I had some issues getting Lite support working on my M1 MacBook Pro (by the way, this has been a recurring theme with all the platforms).

I tried OpenCV, whose documentation seems a bit lacking. Likewise, I tried PyTorch, which seems to be marketed as a more user-friendly approach to ML, but I didn't get far with a custom model there either. Detecting things using existing models seems to be easy in most of these; custom models are where it gets complicated or under-documented.

Enter Teachable Machine

Teachable Machine is an amazing tool that I'm surprised I didn't stumble upon more often when researching this.

I think the website itself does a great job of explaining what it is and how it works. But the basic premise is: create categories, add images to each category, click a button, get a usable custom model. It even provides source code that can be used as-is.

Preparing data

The camera I have is a UniFi Protect one. The UniFi Protect interface offers recording on motion detection, which is ideal for collecting relevant data. I had more than enough videos of various movements. Here's a sample:

I downloaded some of these videos and converted them to images using FFmpeg:

    find . -name "*.mp4" -exec ffmpeg -i {} -r 0.75 %05d.png \;

Then I did the same for recordings without Lara, and created a new project on Teachable Machine.
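The same extraction step can be scripted from Python. A minimal sketch, assuming ffmpeg is on the PATH; the per-video output folders are my addition so that consecutive videos don't overwrite each other's numbered frames:

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(video: Path, out_dir: Path, fps: float = 0.75) -> list[str]:
    """Build the ffmpeg invocation: sample `fps` frames per second as numbered PNGs."""
    return ["ffmpeg", "-i", str(video), "-r", str(fps), str(out_dir / "%05d.png")]

def extract_all(source: Path, fps: float = 0.75) -> None:
    """Extract frames from every .mp4 under `source`, one folder per video."""
    for video in sorted(source.rglob("*.mp4")):
        out_dir = video.with_suffix("")  # e.g. clips/cam1.mp4 -> clips/cam1/
        out_dir.mkdir(exist_ok=True)
        subprocess.run(ffmpeg_cmd(video, out_dir, fps), check=True)
```

Running `extract_all(Path("recordings"))` once for the "cat" clips and once for the "empty" clips gives two ready-to-upload image sets.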

The images get cropped by Teachable Machine, but we still end up with something like this, which is perfectly adequate for detection:

Generating model

Create two categories, with the empty one being last - completely unrecognized input will then fall into this last category. Model creation can take a minute or two, depending on the sample size.

Assuming you're looking to use this on a computer and not an IoT device, download the Keras model:

Using the model

In the downloaded zip file there are two files: keras_model.h5 and labels.txt. You can use the source code shown on the download page to run the model. For my use case, I used it to detect the object - in this case "Lara" or empty - and report the detection, along with the confidence (between 0 and 1), to MQTT, which I can then easily integrate into Home Assistant. My code can be found here: https://github.com/nikolak/cat2mqtt
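The inference step boils down to mapping the model's output vector back to a class name. A sketch of that mapping, assuming labels.txt uses Teachable Machine's "index name" line format; the commented Keras usage follows the export's sample code in spirit but isn't run here (frame capture and the exact preprocessing are elided):

```python
def parse_labels(text: str) -> list[str]:
    """labels.txt lines look like '0 Lara' - keep just the class names, in order."""
    return [line.split(" ", 1)[1].strip() for line in text.splitlines() if line.strip()]

def top_prediction(scores, labels):
    """Map the model's output vector to a (label, confidence) pair."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best]

# Hypothetical usage with the downloaded files (not executed here):
# from tensorflow.keras.models import load_model
# model = load_model("keras_model.h5", compile=False)
# scores = model.predict(frame_batch)[0]  # frame_batch: one preprocessed RTSP frame
# with open("labels.txt") as f:
#     label, confidence = top_prediction(scores, parse_labels(f.read()))
```

The (label, confidence) pair is exactly what gets pushed to MQTT in the next step.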

Result:

This is one of those things that doesn't have much to show in the end. I just ended up with some simple data in MQTT.

In home assistant:

Home Assistant config is done like this:

mqtt:
  binary_sensor:
    - name: "Lara on windowsill"
      state_topic: "vision/bedroom_windowsill/label"
      payload_on: "Lara"
      payload_off: "Background"
      device_class: presence
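Feeding that sensor is just a matter of publishing the winning label (and, optionally, the confidence) to matching topics. A minimal sketch - the base topic mirrors the config above, while the /confidence sub-topic and the retain flag are my assumptions; `client` can be a paho-mqtt Client or anything with a compatible publish method:

```python
def detection_messages(base_topic: str, label: str, confidence: float):
    """(topic, payload) pairs for one detection; confidence rounded for readability."""
    return [
        (f"{base_topic}/label", label),
        (f"{base_topic}/confidence", f"{confidence:.3f}"),
    ]

def publish_detection(client, base_topic: str, label: str, confidence: float) -> None:
    # `client` is duck-typed: e.g. a paho.mqtt.client.Client instance works here.
    for topic, payload in detection_messages(base_topic, label, confidence):
        # Retained messages let Home Assistant pick up the last state after a restart.
        client.publish(topic, payload, retain=True)
```

With payload_on set to "Lara" and payload_off to "Background", the binary sensor flips as soon as the label topic changes.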

Note:

You might be better off using Docker to run this, but if you're on an M1/ARM CPU, Docker isn't supported at the time of writing - so remote deployment is the easiest way to get it running. I got it working on my M1 MacBook Pro using conda. I did run into some issues: I had to install an older protobuf version (pip install protobuf==3.19.4), and NumPy was also being weird, so I had to run pip install numpy --upgrade. Other than that, the instructions from Apple cover the rest: https://developer.apple.com/metal/tensorflow-plugin/

My code

You can find my code here. While I use it to detect my cat on the windowsill and report it to MQTT, the code is not hardcoded to detect only that. You can provide your own model and labels.txt, and it will report whatever it finds. I've made it configurable enough that you can pick the MQTT topic to report to, and similar options.

The code is licensed under MIT, so feel free to adapt it to your needs.