Democratizing AI: Collaborative AI Development with InstructLab

While at the All Things Open conference last fall, Open at Intel host Katherine Druckman spoke with Carol Chen from Red Hat's Open Source Program Office about her work on InstructLab and applying open source methodology to training large language models. 
 

“I think there's a lot of talk about democratizing AI, and people are worried about AI taking away jobs. I don't think that's the main worry, at least not for the near future. I worry for my parents, for example. They don't know the difference between something they see that's AI-generated and something real.”

— Carol Chen, Community Architect, Red Hat 

 

Carol's Background and Role at Red Hat

Katherine Druckman: Hey Carol, thank you so much for joining me. I really appreciate it. We're here at All Things Open, and I appreciate you taking the time out of your schedule. You are giving a talk. I would love it if you could introduce yourself and tell us a little bit about who you are and what you do at Red Hat and then tell us about your talk. 

Carol Chen: Okay, thanks. First, thank you very much for having me. This is my first time at All Things Open. I've been working for Red Hat for more than eight years now, but because I'm based in Finland, I attend more events in Europe than in America. All Things Open is a great event. I work for the Open Source Program Office, so I support a lot of the upstream open source projects that many Red Hat products are based on. The new project that I've been involved in since the middle of May is called InstructLab, which is what my talk today will be based on. 

Katherine Druckman: Fabulous, so tell us a little bit about the talk. Just give us a preview. It'll be recorded and released later, I'm guessing. 

Carol Chen: The title of the talk is Applying Open Source Methods to Building and Training Large Language Models. Nowadays, AI is all the rage, and so is the training of LLMs. 

Katherine Druckman: Yes, it is. Yep. 

Carol Chen: It is. Yes, and the training of LLMs tends to be quite inaccessible for normal small companies or organizations, or for people like me, who for the longest time was afraid of using LLMs such as ChatGPT and similar tools. We want to have an open, trusted approach, applying open source development methods, the methodology as well as the concepts we're familiar with in open source communities, to training large language models. 

It's a collaboration with IBM. IBM Research did a paper on this methodology, and Red Hat took that and implemented it in this project. The talk will go into more of the details, the text-based synthetic data generation and so on. But the highlight is that somebody like me, who didn't know anything about LLMs, didn't know what Hugging Face is, was able to download this command-line app, go through the workflow, and try things out for myself: what it means to generate synthetic data, to fine-tune a model, seeing the results for myself, all within a few hours of starting on the project. Again, I'm not a developer currently. I used to be, a long time ago. 

Katherine Druckman: Right. Same. In a previous life. 

Carol Chen: Exactly. 

Katherine Druckman: Recovering engineers. 

Carol Chen: I know, oh, that's a good one, recovering engineers. But I wasn't worrying, "Oh, would I be able to do this?" I was able to do it, to learn a lot of things through the whole process, and to show something for it. I think there's a lot of talk about democratizing AI, and people are worried about AI taking away jobs. I don't think that's the main worry, at least not for the near future. I worry for my parents, for example. They don't know the difference between something they see that's AI-generated and something real. 

Katherine Druckman: Yes, content authenticity. 

Carol Chen: Exactly. 

Katherine Druckman: It's a big one. 

AI and Open Source

Carol Chen: We want to be able to educate people that AI is a tool just like any other tool. It can be misused, but it can also be used for very good things. How do you differentiate between those, and how do you not be afraid of new technology? I think it's the same thing as the whole cloud revolution, when people were like, "What's this thing in the sky?" It's not in the sky, but- 

Katherine Druckman: What does the cloud even mean? 

Carol Chen: Right, exactly, and then came blockchain and things like that. AI is just one of those new things we're dealing with, where we have to help not just ourselves, but also society, to break it down and make it accessible. 

Katherine Druckman: AI is not a new concept. We've been applying AI to a lot of things for a long time, but, I don't know about you, every time I look at some kind of social media, I see all these creepy Uncanny Valley videos, or not even Uncanny Valley, things that are just creepy from the start. The average person who is less plugged into technology, that's what they're seeing. They're seeing these really creepy-looking videos of some sort of food turning into puppies, and they see it all the time. If that is your familiarity, or your point of entry for thinking about AI, that's not a good place to start. 

Carol Chen: Right, yeah. 

Challenges and Opportunities in AI

Katherine Druckman: I think that's why people are getting kind of nervous. They don't think about, well, you can use computer imaging to diagnose diseases or something. They don't think about it in those terms. They think about this is weird and unsettling and I don't like it. 

Carol Chen: Yeah, and the funny thing is, you're right, AI has been around for a long time. In my graduate studies, I actually looked into machine learning and expert systems. Of course, at that time it was more like predictive AI; the new wave now is all about generative AI. 

In that sense, technology sometimes comes in waves, and this wave is just so strong, and so many things are happening at the same time. I find it hard myself to wrap my head around all these things, and we're not even talking about my parents' generation, or my own friends and people I associate with. They'll be like, "What's this AI thing? Why should I care about it?" I think we need to be part of the conversation about where all this is going, because it's one thing to have this great technology, but if we are not part of the narrative, people are going to abuse it. There'll be people who use it for all kinds of weird stuff. I think there's also a social responsibility for companies like ours, and doing things the open source way builds trust around what we're doing with AI. 

InstructLab: Making AI Accessible

Katherine Druckman: Yeah, I like that. That absolutely needs to come into the conversation as early as possible. I wanted to get back a little bit to InstructLab. Ethics and being responsible with our AI aside, there are a lot of developers and engineers out there who, for whatever reason, have been told to, or want to, add AI capability to something or build a new AI application. Giving people a starting point that is a little more approachable, like you mentioned, is interesting. Somebody like you or I, who used to be engineers in our daily lives but maybe aren't anymore, or maybe are, but this just isn't our expertise: helping people like that get started is interesting. I wondered if you could tell us a little more about it, and what kind of person you see getting the most benefit out of it? 

Carol Chen: On the community level, what we're trying to do is make the process of fine-tuning AI a collaborative effort. We have what we call a community build. The base model we work with is from IBM; it's the Granite model, which is actually open source licensed, because we want to trust the sources that this model is trained with ... Because with a lot of the large language models out there, we don't know what goes into them. 

Katherine Druckman: Yeah. 

Carol Chen: Right, so with this trusted base model, we then accept contributions from the community to add skills and knowledge to the model. This is a carefully curated process, because we want the community, when they use this improved model, to be able to trust that what went in is good stuff. We want to make sure there's no harmful information, no personal information, that kind of thing. In this whole process, anybody can contribute, for example by helping to review the pull requests for contributions. The contribution itself, like I mentioned just now, is a taxonomy-based representation of the data, the knowledge and skills that you want to train the model with. We use YAML files to represent that. Very easy, text-based, very accessible. 

Katherine Druckman: Readable. 

Carol Chen: Readable, accessible, exactly. You don't have to worry, because when we talk about RAG ... 

Katherine Druckman: Retrieval augmented generation. 

Carol Chen: Yeah, you have to have knowledge about how to represent that in vectors, that kind of thing. I don't even know how to do that. With InstructLab, we are using these YAML files. To represent a certain context of knowledge that you want to contribute, you give it some seed examples, which are in question-and-answer format, prompts and responses. Then there's a process called synthetic data generation, which takes these 10 to 15 seed examples and generates hundreds of thousands of data points, because you need a certain amount of data to be able to move a model. This whole process is part of the InstructLab workflow, and with this large amount of synthetically generated data, you can then fine-tune the model. 
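The taxonomy contribution Carol describes is, at its core, a small YAML file of seed question-and-answer pairs. The sketch below is a rough, hypothetical illustration of that shape; the field names and example content are assumptions for illustration, not taken from the interview, so check the InstructLab taxonomy repository for the exact current schema.

```yaml
# Hypothetical qna.yaml sketch for an InstructLab taxonomy contribution.
# Field names are illustrative; consult the InstructLab taxonomy
# documentation for the authoritative schema.
created_by: your-github-username        # contributor attribution
task_description: Answer questions about orchestral percussion
seed_examples:                          # a real contribution has 10-15 of these
  - question: What family of instruments does the timpani belong to?
    answer: The timpani is a pitched percussion instrument.
  - question: How does a player change the pitch of a timpani?
    answer: By tightening or loosening the drumhead, typically with a foot pedal.
```

Synthetic data generation then expands a handful of seed pairs like these into a much larger training set, which is used to fine-tune the model.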

All this is very accessible, even on a laptop. We have a laptop workflow that uses quantized models. Quantized models tend to be kind of, I guess, compressed and- 

Katherine Druckman: Smaller? 

Carol Chen: Yeah, smaller, exactly. You can run them on your laptop. That may not necessarily give you an accurate result, it's kind of low fidelity, but it still gives you a good idea of the kind of changes you're making to the model. You can test things out before you submit to the full model and do an end-to-end run of the training. 

It's like how we do unit tests: we want to run something locally and make sure we're at least on the right track before we- 

Katherine Druckman: Shorten the feedback loop. 

Carol Chen: Exactly, exactly. Yeah, it's the same concept. Again, all this makes it easy for people to start contributing and making a difference in language models, which are not easily accessible in the usual way. If you want to change, in effect, what goes into Llama, it's not possible, right? Whereas with this, you can make incremental changes with your own little contributions, a collaborative way of improving a model with the community. But that's one side of things. The other side is if you have more sensitive data that you don't necessarily want to be a collaborative part of some model, and you want to use it for your own organization or company. For example, you have sensitive data about your clients, or, a good example we've heard, healthcare data. 

Katherine Druckman: Oh, absolutely. Yeah. 

Carol Chen: Patient data, you definitely don't want that to be sucked up into some cloud model somewhere. You can use InstructLab to run the models locally and fine-tune them with your own data for your own use case, whether that's tracking a patient's medical history or getting some kind of diagnosis from the information you have. But it's contained, and you have control over it, basically. 

Personal Journey into AI

Katherine Druckman: How very interesting. How did you get personally involved in the conversation about AI coming from an OSPO perspective? That's interesting to me. 

Carol Chen: I know. Even though I was a software engineer two lifetimes ago, I've mainly been doing this kind of community management, community herding, for more than 10 years. I've been involved in many kinds of open source projects, but honestly, in the past two years, even with the whole AI wave, I myself have not been too eager to try a lot of this new technology. The funny thing is, I was thinking one day: if I myself have this kind of hesitation or fear, what about people who are not like me, who aren't in the tech industry? For me, almost in a way to overcome my own fears, I needed to jump into it and find out what it's about. I was also very lucky to be partially involved in the release of this project. Like I said, I downloaded it and tried it out for myself. I'm like, "Okay, this is actually very interesting," and I don't need deep developer skills and knowledge to get started, which makes it just so approachable, so easy. 

I can introduce this to my friends who are not necessarily techies and show these concepts to them so that they can easily understand what it means to train large language models, what fine-tuning even means, or RAG, or whatever. 

Katherine Druckman: Yeah. If you can unpack it for yourself, you can help unpack it for other people. 

Carol Chen: Exactly. 

Katherine Druckman: Yeah. I understand what you mean. That's interesting, so who would you recommend actually check out InstructLab? 

Carol Chen: I would say it is for everyone, because you don't necessarily have to want to use AI or even to train a model. I used it myself to understand the concepts of AI, and we've actually been talking to some universities and educational institutions about using InstructLab as a tool to promote the understanding of AI. Because, honestly, in-depth AI involves a lot of math, a lot of really hardcore, deep technical stuff. For most people, just having a general understanding is very helpful, whether it's for apps they're using or for recommending AI agents to optimize workflows. If they have that better basic understanding, then they're not going to fear that AI is taking over jobs, and they'll actually know how to utilize AI in a proper way, whether for work optimization or even personal life enhancement. 

AI Ethics and Open Source

Katherine Druckman: I wonder, so at this event, the OSI is releasing the first version of its open source definition for AI. I wonder how closely you've been following the conversation around that. 

Carol Chen: Honestly, I haven't been following it as closely as I should have, because I joined this project quite recently and have mostly been focused on learning about the project itself and making sure I understand what I'm doing. I do know that it's important to have a definition that we can all agree on, because with open source, it has been established what the practices and licenses are, what open source means, and how we work in the open source model. I think the definition is definitely important, but I also wonder ... I was quite surprised, honestly, that it's already happening so soon. 

Katherine Druckman: You see it as an ongoing conversation. Yes, I understand what you mean. We're very early to be having it. 

Carol Chen: Exactly. Of course, it is an iterative process, but I don't know if there's been enough input and feedback, even in the initial attempt to reach that definition, to say, "Hey, let's already have the first round of the definition." 

Katherine Druckman: Yeah, I think that's an interesting observation. Consensus is absolutely important, especially in community-driven activities of any kind. Right? 

Carol Chen: Right. 

Katherine Druckman: Yeah, I'm particularly curious about your perspective, obviously, coming from an OSPO, because you're talking about things like compliance, and you're very familiar with licensing and why one might pick one license over another. Then when you talk about defining something like this, it's a bit newer. 

Carol Chen: So new, right? We're very familiar with code licenses. 

Katherine Druckman: But we know what that means. If you say, "I'm releasing this project under Apache 2.0," you know what that means. You know exactly, as an OSPO person, what that means. 

Carol Chen: What does it mean for an AI model? 

Katherine Druckman: If I say this model is compliant with this definition, do you fully understand what that means? How much more research do you have to do? 

Carol Chen: I would barely understand it at this point. 

Katherine Druckman: I think a lot of us are in the same boat. 

Carol Chen: It's definitely something I want to follow more closely and hopefully be involved in. Not personally, in terms of influence, but through Red Hat, Intel, these different organizations, to make sure that we understand each other and have proper input and feedback channels, so that we can agree on this new definition together. 

Katherine Druckman: Yeah. Fantastic. Well, thank you so much. Is there anything that you wanted to talk about that I didn't yet ask you? 

Carol Chen: I think there are a lot of things we could talk about. I'm just excited to see so many things at this event, and the progress. Even though AI is exciting, there's just so much else in the open source space. 

Katherine Druckman: There always is, right? 

Carol Chen: Yes. 

Katherine Druckman: I know. That's the best thing about coming to events like this. 

Carol Chen: Right, yeah. 

Katherine Druckman: Getting everybody together in the same spaces to kind of really hash out the difficult questions. 

Carol Chen: Yes, exactly. 

Katherine Druckman: Good stuff. 

Carol Chen: Following projects I've followed for years, seeing what they're doing and how they're developing, it's so cool. I really like this open source community. I've been using Linux for actually more than 20 years now. 

Katherine Druckman: I have also been around a while. 

Carol Chen: There's always something new and I just really enjoy and appreciate the collaboration with community members and the open conversations that we can have with each other. That just motivates me to continue what I'm doing. 

Katherine Druckman: Just like the one we're having right now. 

Carol Chen: Exactly. Yes. Yes. Thank you for giving me this opportunity. 

Katherine Druckman: Yeah, thank you for joining me. Well, thank you so much and have a great rest of the conference and hopefully talk to you again soon. 

Carol Chen: Yeah, thank you. 

Katherine Druckman: You've been listening to Open at Intel. Be sure to check out more about Intel’s work in the open source community at Open.Intel, on X, or on LinkedIn. We hope you join us again next time to geek out about open source.  

About the Guest

Carol Chen, Community Architect, Red Hat 

Carol Chen is a community architect at Red Hat, supporting and promoting various upstream communities such as InstructLab, Ansible, and ManageIQ. She was actively involved in open source communities while previously working for Jolla and Nokia. In addition, she has experience in software development and integration from her 12 years in the mobile industry. Carol has spoken at events around the world, including DevConf.CZ in the Czech Republic and the OpenInfra Summit in China. On a personal note, Carol plays the timpani in an orchestra in Tampere, Finland, where she now calls home. 

About the Host

Katherine Druckman, Open Source Security Evangelist, Intel  

Katherine Druckman, an Intel open source security evangelist, hosts the podcasts Open at Intel, Reality 2.0, and FLOSS Weekly. A security and privacy advocate, software engineer, and former digital director of Linux Journal, she's a long-time champion of open source and open standards. She is a content creator with over a decade of experience in engineering, content strategy, product management, user experience, and technology evangelism. Find her on LinkedIn.