Multimodality

For MTLLM to have true neurosymbolic capabilities, it must handle multimodal inputs and outputs: it should be able to understand text, images, and videos. This section describes how MTLLM handles multimodal inputs.

Image

MTLLM accepts images as inputs. To pass an image to an MTLLM function or method, wrap the image file in the Image type provided by mtllm, as in the following example:

import:py from mtllm.llms, OpenAI;
import:py from mtllm, Image;

glob llm = OpenAI(model_name="gpt-4o");

enum Personality {
    INTROVERT: 'Person who is shy and reticent' = "Introvert",
    EXTROVERT: 'Person who is outgoing and socially confident' = "Extrovert"
}

obj 'Person'
Person {
    has full_name: str,
        yod: 'Year of Death': int,
        personality: 'Personality of the Person': Personality;
}

can get_person_info(img: 'Image of Person': Image) -> Person
by llm();

with entry {
    person_obj = get_person_info(Image("person.png"));
    print(person_obj);
}

Input Image (person.png):

# Output
Person(full_name='Albert Einstein', yod=1955, personality=Personality.INTROVERT)

In the example above, an image of a person (Albert Einstein) is passed to the get_person_info function. The function returns a Person object containing the full name, year of death, and personality of the person in the image.
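The same pattern should extend to functions that take an image together with ordinary arguments. The following is a minimal sketch rather than an example from the MTLLM documentation: describe_image and its question parameter are hypothetical names introduced here for illustration, and the sketch assumes that Image and str parameters can be combined in one signature, in the same way the Video example in this section combines Video and str.

```jac
import:py from mtllm.llms, OpenAI;
import:py from mtllm, Image;

glob llm = OpenAI(model_name="gpt-4o");

# Hypothetical function (not part of the original example):
# takes an image plus a free-form question and returns a text answer.
can describe_image(img: 'Image to Inspect': Image, question: 'Question about the Image': str) -> str
by llm();

with entry {
    # "person.png" is the same placeholder file used in the example above.
    answer = describe_image(Image("person.png"), "What is this person best known for?");
    print(answer);
}
```

Because the return type is str, the result is free-form text. The answer depends on the model, so no expected output is shown.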

Video

Similarly, MTLLM accepts videos as inputs. To pass a video to an MTLLM function or method, wrap the video file in the Video type provided by mtllm, as in the following example:

import:py from mtllm.llms, OpenAI;
import:py from mtllm, Video;

glob llm = OpenAI(model_name="gpt-4o");

can is_aligned(video: Video, text: str) -> bool
by llm(method="Chain-of-Thoughts", context="Mugen is the moving character");

with entry {
    video = Video("mugen.mp4", 1);
    text = "Mugen jumps off and collects few coins.";
    print(is_aligned(video, text));
}

Input Video (mugen.mp4): mugen.mp4

# Output
True

In the example above, a video of a character (Mugen) is passed to the is_aligned function together with a text description. The function checks whether the text is aligned with the video and returns a boolean value indicating the result.
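Since is_aligned is an ordinary function from the caller's point of view, one Video object can presumably be checked against several candidate descriptions. The sketch below illustrates that usage under stated assumptions: the second caption is a hypothetical string added for contrast, and the Video("mugen.mp4", 1) call is copied unchanged from the example above.

```jac
import:py from mtllm.llms, OpenAI;
import:py from mtllm, Video;

glob llm = OpenAI(model_name="gpt-4o");

# Same declaration as in the example above.
can is_aligned(video: Video, text: str) -> bool
by llm(method="Chain-of-Thoughts", context="Mugen is the moving character");

with entry {
    video = Video("mugen.mp4", 1);

    # The first caption comes from the example above;
    # the second is a hypothetical caption added for illustration.
    captions = [
        "Mugen jumps off and collects few coins.",
        "Mugen stands still for the whole clip."
    ];

    # Check each caption against the same video.
    for caption in captions {
        print(caption, "->", is_aligned(video, caption));
    }
}
```

Each iteration makes a separate call to the model, so the cost grows with the number of captions. The results depend on the model and are not shown here.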

Audio

We are working on adding support for audio inputs to MTLLM. Stay tuned for updates on this feature.
