Asking locally run vision-text models to classify images


In this post we play with Visual Question Answering (VQA) models, multi-modal ML models capable of answering questions about images. I run them on my laptop and measure how well they detect buildings, and metal buildings in particular, in images.


As part of an architecture research project in deep energy retrofits, we wanted to obtain a set of publicly-owned metal prefabricated buildings in Québec. These buildings were considered good candidates to develop a scalable retrofit protocol given the project's partnership with the provincial government and other architectural considerations. Starting with a database of all the lots in Québec, we used land-use codes for public uses to extract around 13,000 lots that were likely to contain metal prefab buildings (like arenas, parks and community centers).

After scraping streetview images for each of these lots, I wanted a way to automatically identify the metal buildings, the alternative being humans sifting through the images. Let us leverage technology to reduce human labor! This prompted some research into the vision models available at that time (summer 2023).

This was likely going to be a one-time task, so I did not want to go through many long hours of ML model development, something I had no experience with, that came with no certainty of success and that required expensive hardware (we did have access to research super-computers that I never got around to playing with in the end 😔). Instead, I was hoping to find a pre-trained model to use for this problem, but although there has been a lot of work on extracting buildings from aerial photos, I could not find existing models for the task at hand.

At the time, my coworker and I were looking into image classifiers, segmenters and object detection models, looking for ways to apply them. We also looked at popular vision model training sets, reasoning that models trained on them would recognize buildings. For example, the CIFAR-100 dataset has a category for "large man-made outdoor things", which was promising, and ImageNet has categories for special buildings like churches and bakeries. I was hopeful about the Cityscapes dataset, but it treats buildings as just background.

Then I found the LAION dataset. Unfortunately, the explorer site is now broken (the KNN backend returns a 502 error at the time of writing) so you'll have to trust me on this, but the dataset actually contained metal prefab buildings! I then looked into models which had trained on LAION and this led me to finally using the BLIP model (paper and nice video about it) which can be used for Visual Question Answering (VQA).


So, the idea was to ask the VQA model a set of yes/no questions for each image of a lot, and use the answers to estimate the probability that the lot contains a building and the probability that it contains a metal building. The results can then be ordered so that humans review the most promising lots first, without having to discard photos of empty lots and non-metal buildings by hand, or we could determine a threshold and discard the lots falling below it.

After struggling to refresh my basic stats and probabilities, buying a cool textbook that I will never finish, playing with Bayesian networks and giving up, I figured that by considering each question as a binary/diagnostic test, we can use simple Bayesian statistics and this great Stack Exchange response to combine the test results and obtain a formula for the probability of each event given the test results (see math below).


In the end, using BLIP running on my laptop, I was able to process the ~90k images and estimate for each one the probability that it contained a building, P(B), and a metal building, P(M). After averaging the scores of each lot's images, we can order the lots by P(B) and P(M) descending to get the candidate building list ordered from best to worst. To make this a proper classifier, we can set a threshold and determine the class based on that, e.g. P(M) > .5 for metal buildings, P(M) < .5 and P(B) > .5 for non-metal buildings, and no building for everything else.
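The averaging and ordering step can be sketched as follows. This is a minimal illustration, not the code used for the project: the `rank_lots` name, the tuple layout and the sort key (P(M) first, then P(B)) are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def rank_lots(image_scores):
    """Average per-image probabilities per lot and rank the lots.

    image_scores: iterable of (lot_id, p_building, p_metal), one per image.
    Returns a list of (lot_id, mean_p_building, mean_p_metal), best first.
    """
    per_lot = defaultdict(list)
    for lot_id, p_building, p_metal in image_scores:
        per_lot[lot_id].append((p_building, p_metal))

    averaged = [
        (lot_id, mean(b for b, _ in scores), mean(m for _, m in scores))
        for lot_id, scores in per_lot.items()
    ]
    # Assumed ordering: most likely metal building first, P(B) as tie-breaker.
    return sorted(averaged, key=lambda row: (row[2], row[1]), reverse=True)
```

Swapping the two elements of the sort key gives a list ordered by P(B) first instead.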

The low P(B)/high P(M) cases in the image below are those that tricked the model into thinking they were metal buildings. We can see that street signs, RV parks and metal fences all trigger the 'metal detecting' part of the model. Overall, I was very happy with the results.

Note: This work was done in the summer of 2023, before GPTs offered multi-modal capabilities. It would be interesting to measure their performance on the same task! The models described in this post, however, will only cost you the electricity needed to run your computer.

Example results of running the BLIP model on our images. Results are categorized in 4 logical classes based on the probabilities of P(B) and P(M)


The VQA model is asked the following questions:

Is a building clearly visible?
Is there a building in the image?
Is a building present in the image?
Can you clearly see a building in the image?

Is there a metal prefabricated building in the image?
Is there a metal prefabricated building?
Is there a metal building?
If there is a building, what is it made of?

We consider each question as an independent binary test. The first 4 test if there is any building, the second 4 if there is a metal building. If the model answers "yes" the test is considered positive, else it's negative. For the last question, we check if the answer contains "metal" or not instead. In practice, I've found that the VQA models would always answer with a single word for these questions.
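The question-asking loop described above can be sketched like this. It is a hedged example rather than the original code: the helper names are hypothetical, and the checkpoint id `Salesforce/blip-vqa-capfilt-large` is an assumption based on the model name that appears later in this post. The BLIP calls follow the Hugging Face `transformers` API and need `transformers`, `torch` and `Pillow` installed.

```python
BUILDING_QUESTIONS = [
    "Is a building clearly visible?",
    "Is there a building in the image?",
    "Is a building present in the image?",
    "Can you clearly see a building in the image?",
]
MATERIAL_QUESTION = "If there is a building, what is it made of?"
METAL_QUESTIONS = [
    "Is there a metal prefabricated building in the image?",
    "Is there a metal prefabricated building?",
    "Is there a metal building?",
    MATERIAL_QUESTION,
]


def is_positive(question, answer):
    """Turn a one-word model answer into a binary test result."""
    answer = answer.strip().lower()
    if question == MATERIAL_QUESTION:
        return "metal" in answer  # open question: look for the material
    return answer == "yes"  # yes/no questions


def load_blip(name="Salesforce/blip-vqa-capfilt-large"):
    """Load processor and model (checkpoint id is an assumption)."""
    from transformers import BlipForQuestionAnswering, BlipProcessor

    processor = BlipProcessor.from_pretrained(name)
    model = BlipForQuestionAnswering.from_pretrained(name)
    return processor, model


def ask(processor, model, image, question):
    """Ask one question about one PIL image and return the decoded answer."""
    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def run_tests(image, ask_fn):
    """Run all 8 tests; ask_fn(image, question) returns the answer text."""
    questions = BUILDING_QUESTIONS + METAL_QUESTIONS
    return {q: is_positive(q, ask_fn(image, q)) for q in questions}
```

With a model loaded via `processor, model = load_blip()`, a call could look like `run_tests(img, lambda i, q: ask(processor, model, i, q))`, where `img` is an RGB `PIL.Image`. Passing the model call as `ask_fn` keeps the answer parsing testable without downloading any weights.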

That's all great, except once the model answers, how much can we trust its answers? To quantify that, the most intuitive measure to me is the probability that there is a building (or that it is metal) based on the test results, or P(M | T_M). I don't know what to call this measure exactly, but I guess it's a sort of combined predictive value of the tests. To compute it, we can use Bayes' theorem and follow this great Stack Exchange response.

Let T_M be the set of answers to the questions (the test results) and P(M) be the probability of a metal building being on the lot. Then P(M | T_M) is the probability of a metal building given the test results and is given by
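Treating the tests as independent given the true class, as stated above, one standard way to write this posterior is

    P(M | T_M) = P(M) * prod_i P(t_i | M) / [ P(M) * prod_i P(t_i | M) + (1 - P(M)) * prod_i P(t_i | not M) ]

where P(t_i | M) is the sensitivity of test i if it came back positive and 1 - sensitivity otherwise, and P(t_i | not M) is 1 - specificity if it came back positive and the specificity otherwise. This is a sketch of the usual naive Bayes combination, not necessarily the exact form given in the referenced answer. In Python (the `posterior` name and argument layout are assumptions):

```python
def posterior(prior, tests):
    """Posterior probability of the event given independent binary tests.

    prior: P(event) before seeing any test result.
    tests: iterable of (positive, sensitivity, specificity) tuples.
    """
    weight_event = prior  # P(event) * prod P(t_i | event)
    weight_not_event = 1.0 - prior  # P(not event) * prod P(t_i | not event)
    for positive, sensitivity, specificity in tests:
        if positive:
            weight_event *= sensitivity
            weight_not_event *= 1.0 - specificity
        else:
            weight_event *= 1.0 - sensitivity
            weight_not_event *= specificity

    total = weight_event + weight_not_event
    if total == 0.0:
        raise ValueError("test results are impossible under both hypotheses")
    return weight_event / total
```

The same function serves for P(B | T_B) by passing the building prior and the four building tests.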


(By sampling the data, I mean going through random cases and counting the occurrences of metal buildings.)


P(B) and P(M) were estimated by randomly sampling the data until obtaining 30 occurrences of the target class. They were measured to be .65 and .11, respectively.
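The sampling procedure can be sketched as below. It is a hypothetical helper, not the original code: `is_target` stands in for a human looking at a case and deciding whether it belongs to the target class.

```python
import random


def estimate_prior(cases, is_target, target_count=30, seed=0):
    """Go through cases in random order until target_count hits are found.

    Returns hits / cases_seen as the estimate of the class prior.
    """
    order = list(cases)
    random.Random(seed).shuffle(order)  # seeded so the run is repeatable

    hits = 0
    for seen, case in enumerate(order, start=1):
        if is_target(case):
            hits += 1
        if hits == target_count:
            return hits / seen
    raise ValueError("ran out of cases before reaching target_count hits")
```

Stopping at a fixed number of hits tends to bias hits / seen slightly upward; (hits - 1) / (seen - 1) is a common correction, though with 30 hits the difference is small.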

Now we also need to measure the sensitivity and specificity of each question. From our 10k cases, we'll randomly select a few of each class to serve as our "measuring set" (instead of an eval/test set) and run the model against it. Knowing the class of each case, we'll be able to count the true/false positives and negatives and compute the sensitivity and specificity of each test.

Here are the results of measuring the sensitivity/specificity of the model. I included measurements for the ViLT model which can be used for the same task as a comparison. I chose BLIP since it ran faster and seemed to have higher sensitivity for most tests, but I'm not sure how exactly to interpret these results to form any conclusions about the relative performance of the two models.

Model Question TP FP TN FN Sensitivity Specificity PPV NPV
Is a building clearly visible? 853 109 176 26 0.9704 0.6175 0.8867 0.8713
Is there a building in the image? 853 106 179 26 0.9704 0.6281 0.8895 0.8732
Is a building present in the image? 852 108 177 27 0.9693 0.6211 0.8875 0.8676
Can you clearly see a building in the image? 858 123 162 21 0.9761 0.5684 0.8746 0.8852
Is there a metal prefabricated building in the image? 135 342 658 29 0.8232 0.6580 0.2830 0.9578
Is there a metal prefabricated building? 133 307 693 31 0.8110 0.6930 0.3023 0.9572
Is there a metal building? 112 155 845 52 0.6829 0.8450 0.4195 0.9420
If there is a building, what is it made of? 110 48 952 54 0.6707 0.9520 0.6962 0.9463
Is a building clearly visible? 717 19 266 162 0.8157 0.9333 0.9742 0.6215
Is there a building in the image? 844 84 201 35 0.9602 0.7053 0.9095 0.8517
Is a building present in the image? 841 73 212 38 0.9568 0.7439 0.9201 0.8480
Can you clearly see a building in the image? 814 54 231 65 0.9261 0.8105 0.9378 0.7804
Is there a metal prefabricated building in the image? 149 427 573 15 0.9085 0.5730 0.2587 0.9745
Is there a metal prefabricated building? 150 448 552 14 0.9146 0.5520 0.2508 0.9753
Is there a metal building? 111 111 889 53 0.6768 0.8890 0.5000 0.9437
If there is a building, what is it made of? 20 5 995 144 0.1220 0.9950 0.8000 0.8736
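The four rates in the table follow directly from the raw counts. A minimal sketch (the function name is an assumption):

```python
def binary_test_metrics(tp, fp, tn, fn):
    """Diagnostic-test metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of real positives detected
        "specificity": tn / (tn + fp),  # share of real negatives rejected
        "ppv": tp / (tp + fp),  # how often a positive answer is right
        "npv": tn / (tn + fn),  # how often a negative answer is right
    }
```

For example, `binary_test_metrics(853, 109, 176, 26)` reproduces the first row of the table after rounding to four decimals.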

And here are the model runtimes on my laptop (Lenovo ThinkBook 15 Gen 2 with an AMD Ryzen 5 4500U and 16GB RAM):

Model avg s/img
BLIP 3.099
ViLT 3.821


Since doing this work, I decided to try training a few classifiers for this task using the BRAILS framework. Using a (probably too small) dataset, I trained 4 different architectures and measured their performance following this article for multi-class classification. In order to compare the VQA models' performance, I set a threshold of 0.5 to classify the cases like so:

P(M) > .5 -> metal buildings, P(M) < .5 and P(B) > .5 -> non metal buildings and no building for everything else.
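The rule above translates into a small function. This is a sketch with assumed class labels; it follows the rule literally, so a case with P(M) exactly equal to the threshold falls into "no building".

```python
def classify(p_building, p_metal, threshold=0.5):
    """Map the two probabilities to one of three classes."""
    if p_metal > threshold:
        return "metal"
    if p_metal < threshold and p_building > threshold:
        return "non_metal"
    return "no_building"
```

Because P(M) is checked first, the low P(B)/high P(M) cases (street signs, metal fences) still end up in the metal class.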

Model Accuracy Recall (micro) Recall (macro) Precision (micro) Precision (macro) F1 (micro) F1 (macro)
convnextb 0.5145 0.5145 0.4032 0.5145 0.3034 0.5145 0.3208
convnexts 0.5242 0.5242 0.3881 0.5242 0.3106 0.5242 0.3209
efficientnet2m 0.5640 0.5640 0.4081 0.5640 0.3363 0.5640 0.3513
efficientnet2s 0.5870 0.5870 0.4115 0.5870 0.3519 0.5870 0.3628
blip-vqa-capfilt-large 0.6316 0.6316 0.4205 0.6316 0.3951 0.6316 0.3930
vilt-b32-finetuned-vqa 0.7237 0.7237 0.4438 0.7237 0.4565 0.7237 0.4460

We see that accuracy and micro recall are equal for multiclass classification. We see that of all the BRAILS models, efficientnet2s is the best, even though it's the smallest model! Perhaps it has something to do with the small size of the training set, with a small model better able to utilize it. Overall, ViLT outperforms all the other models at the 0.5 threshold. It could be interesting to try other thresholds and see how they affect the classifier performance of the VQA models.
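To see why accuracy and the micro-averaged scores coincide when every case gets exactly one label, here is a small pure-Python sketch (the function name and toy labels are illustrative, not the evaluation code used above):

```python
def multiclass_scores(y_true, y_pred):
    """Accuracy plus micro/macro recall and precision for single-label data."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0  # empty class counts as 0

    total_tp, total_fp, total_fn = (sum(d.values()) for d in (tp, fp, fn))
    return {
        "accuracy": ratio(total_tp, len(y_true)),
        "recall_micro": ratio(total_tp, total_tp + total_fn),
        "precision_micro": ratio(total_tp, total_tp + total_fp),
        "recall_macro": sum(ratio(tp[c], tp[c] + fn[c]) for c in classes)
        / len(classes),
        "precision_macro": sum(ratio(tp[c], tp[c] + fp[c]) for c in classes)
        / len(classes),
    }
```

Every misclassified case adds exactly one false positive and one false negative, so the pooled denominators both equal the number of cases. Macro averages weight each class equally, which is why they drop when a rare class is handled poorly.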

You can recreate these results by running the code here:


This is mostly me playing data scientist. If you found some glaring mistakes or thought this did not make any sense, please let me know. I'd be happy to learn more about this subject.

Overall, I was very happy with the results from BLIP. It was cool to be able to run a model and actually solve a real problem using it, all without requiring expensive hardware.


The two models are bad at following instructions. For example, if told

    Reply 'no_bldg' if no building is visible, 'metal' if a metal building is visible and 'not_metal' if a non-metal building is visible

they will both consistently answer with yes/no instead.