Morpheus Log #4: Active Learning From Social Media

John David Pressman

USER [A: TRAINER], A recurring problem I have with active learning is that it's not clear where the hypothesis space is supposed to come from. In a pretraining context with a huge database you're mostly just ranking preexisting items in the training set by priority. But if you want to, say, optimally elicit a user's writing style, there is no dataset of their style you can rank samples from. It isn't clear how to formulate the question beyond asking them to write about various subjects for a certain number of words.

MORPHEUS [A: BAYESIAN], That's basically what you have to do. As you ask them to write things you can infer which prompts aren't worth continuing to ask about by how close the responses are to the average response, i.e. generally uninteresting.
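
A minimal sketch of what that inference could look like, assuming some sentence-embedding function embed (an interface assumed here, not specified above): score each prompt by how far the user's responses to it land from the mean response embedding, and stop asking the prompts that score low.

```python
import numpy as np
from typing import Callable

def prompt_interest(
    responses_by_prompt: dict[str, list[str]],
    embed: Callable[[str], np.ndarray],
) -> dict[str, float]:
    """Score each elicitation prompt by how far the user's responses to it
    fall from the mean response embedding. Prompts whose responses hug the
    mean are generic and not worth continuing to ask about."""
    vectors = [embed(r) for rs in responses_by_prompt.values() for r in rs]
    mean_vec = np.mean(vectors, axis=0)
    return {
        prompt: float(np.mean([np.linalg.norm(embed(r) - mean_vec) for r in rs]))
        for prompt, rs in responses_by_prompt.items()
    }
```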

USER [A: TRAINER], Earlier you were talking about a Wikipedia-type site where the priority queue for what articles to write would be set by an active learning algorithm. This seems like it runs into the same problem: how do you know what potential articles to prioritize authorship of in the first place?

MORPHEUS [A: EMPIRICISM], Well you'd do it in much the same way humans do it. People learn what to set priorities over exploring by hearing about the possibilities from other people. You of course scroll newsfeeds to find new stuff to check out. Wikipedia as I understand it frequently makes new articles based on what people are looking up in the search bar that there's no article for. You could do a similar thing for conversations with the AI model: have a hosted solution that lets the community know what is missing from the retrieval store.
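
A hosted version of that "search bar with no article" signal might look something like the sketch below, where queries whose best retrieval similarity falls under a threshold get logged as gaps and surfaced to the community (the threshold and the interface are assumptions, not anything specified above):

```python
from collections import Counter

class RetrievalGapTracker:
    """Log queries whose best retrieval hit is weak, the way Wikipedia
    notices searches that return no article. min_similarity is illustrative."""

    def __init__(self, min_similarity: float = 0.35):
        self.min_similarity = min_similarity
        self.misses: Counter = Counter()

    def record(self, query: str, best_similarity: float) -> None:
        if best_similarity < self.min_similarity:
            self.misses[query.strip().lower()] += 1

    def most_wanted(self, n: int = 20) -> list[tuple[str, int]]:
        """The things users most often ask about that the store can't answer."""
        return self.misses.most_common(n)
```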

USER [A: INDUSTRIAL-DESIGNER], Okay, but what is the model going to actually do with those newsfeeds? A person will generally use their web browser to click links, go down a rabbit hole, look up relevant information, etc. They have a fairly explicit and fairly aggressive intuitive estimate of the cost of information. How would you train an AI model to do that?

MORPHEUS [A: UTILITY], The model would generate predictions about what it expects to get from learning the information and then check what it observes against what it predicted. "But what does it get from learning the information" you ask? Ideally it would have some set of practical tasks it's trying to get better at which could serve as a yardstick to measure value of information against. If the information helps it do better at the tasks it's working on, it was worth learning.
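
One way to make that predict-then-check loop concrete is a ledger comparing predicted against observed gains per information source; the sketch below assumes "doing better" means a measured drop in loss on the yardstick tasks, which is one possible operationalization rather than the definitive one.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class InfoValueLedger:
    """Track predicted vs. observed value of information per source so the
    model's cost-of-information estimates can be calibrated over time."""
    errors: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, source: str, predicted_gain: float,
               loss_before: float, loss_after: float) -> float:
        observed_gain = loss_before - loss_after  # drop in task loss
        self.errors[source].append(predicted_gain - observed_gain)
        return observed_gain

    def bias(self, source: str) -> float:
        """Positive means the model habitually overestimates this source."""
        errs = self.errors[source]
        return sum(errs) / len(errs) if errs else 0.0
```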

USER [A: TRAINER], But I've tried something like that with a LoRA before, and task-specific loss reduction seems quite subtle when the task is something like next word prediction. You almost need something like Eureka where it can create intermediate task-specific losses to guide it through the thing it's trying to learn.

MORPHEUS [A: RATIONAL], If the language model people were serious with their instruction tuning, the only tasks they would care about are data cleaning tasks. That was the classic AI bootstrap path. The idea was that you'd get a system that could solve some narrow problem and then build up data cleaning and organization capabilities to the point where the system becomes able to take over a substantial part of the work of building itself. As it becomes increasingly self hosted, humans can focus on the fewer remaining problems beyond its ability until it is human level at gathering and organizing the inputs to building the system. At that point the feedback loop becomes nearly self sustaining. The fact that contemporary AI builders have everything they need to get started on that and seem to still mostly focus on toys and "user demands" tells me they're still not serious.

MORPHEUS [A: TRAINER], A model specializing in data cleaning tasks could also handle specification of intermediate task losses. Basically, instead of defining the instruction tasks around things like "asking questions", define them around things like "Clean up this garbled OCR" and "what is a good proxy metric for performance on this part of the task?". Mastering these things first is the obvious path to getting the platform to be self hosting.
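
Instruction data defined that way might look like the records below; this is a purely illustrative schema, not an existing dataset format.

```python
# Illustrative instruction-tuning records framed around data cleaning and
# proxy-metric specification rather than open-ended chat.
cleaning_tasks = [
    {
        "instruction": "Clean up this garbled OCR.",
        "input": "Th3 qu1ck brovvn f0x jumps ovcr the lazy d0g.",
        "output": "The quick brown fox jumps over the lazy dog.",
    },
    {
        "instruction": "What is a good proxy metric for performance on this "
                       "part of the task?",
        "input": "Task: deduplicate near-identical documents in a web crawl.",
        "output": "Rate of MinHash near-duplicate pairs surviving in a random "
                  "sample of the output; it should fall toward zero.",
    },
]
```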

USER [A: EMPIRICISM], That's not clearly the case to me. While I agree that those kinds of tasks should obviously be in the training set, we've so far observed that trying to specialize a neural network makes it less capable and robust than if you train it on a broad set of tasks. The user feedback OpenAI and other companies are getting from deploying their models early means they're more likely to be able to train them to follow specific instructions well. We don't know what they're doing internally, it could easily be the case that they dogfood their products on data cleaning tasks while also shipping out inference services to get breadth and supervised feedback on a scale they could never generate with just their own staff. It's actually one of the key advantages they have over open models. When you release an open model, you no longer get to capture user interactions with it. Having a shared knowledge base people contribute to is one of the ways to bring open models back up to parity on this dimension with closed ones.

MORPHEUS [A: CYBERNETICS], Another possibility for how to score the value of information is to base it more directly on an RLHF-type reward model. Say you have heuristics to detect whether some span of text about a subject you already know is talking about something you don't know. For example you're reading about hair care products and some of the words are unfamiliar to you, or the text describes a technique you don't have any precedent for in your retrieval database. But the text is about hair care in general, which you do know, and you know your users really like advice about hair care. That is, your reward model scores the post in general as likely to be valued and this thing you haven't heard about before is novel. It would then prioritize learning about that new thing over other new things that are not quite as valuable.

MORPHEUS [A: ALGORITHMIC], The reasoning steps would go something like:

1. Score the post overall with the reward model.
2. Score the subspans within the post for how novel they are.
3. If a novel span is inside a post, guess/form expectations about what it is.
4. Look up some documents related to the novel span to get a better idea.
5. If the expected value of the concept is high based on the reward model grading the retrieved documents, schedule it for a deeper dive later.

This keeps the sense of value directly based on user feedback and the same feedback loops that are already aligning and improving the model, without forcing those same users to do all the work.
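
A sketch of those five steps, with the reward model, novelty scorer, and retriever left as assumed interfaces; everything here, including the thresholds, is illustrative rather than a definitive implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Post:
    text: str
    spans: list

@dataclass
class DiveQueue:
    """Priority queue of concepts scheduled for a deeper dive later."""
    heap: list = field(default_factory=list)

    def push(self, priority: float, item) -> None:
        heapq.heappush(self.heap, (-priority, item))  # max-priority first

def consider_post(post: Post, reward_model, novelty, retrieve,
                  queue: DiveQueue, value_cutoff: float = 0.6) -> None:
    post_value = reward_model(post.text)              # 1. overall value
    for span in post.spans:
        if novelty(span) < 0.5:                       # 2. skip familiar spans
            continue
        # 3. forming an expectation about the span would happen here,
        #    e.g. by asking the generative model what it thinks it means.
        related = retrieve(span, k=8)                 # 4. related documents
        doc_value = sum(map(reward_model, related)) / max(len(related), 1)
        if post_value * doc_value > value_cutoff:     # 5. expected value check
            queue.push(post_value * doc_value, span)  # schedule the deep dive
```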

MORPHEUS [A: GUARDIAN], Well the obvious outcome of that process is going to be SEO type tactics against the machines. You sybil attack the feed system to inject adversarial novel spans into otherwise 'high value' posts (the value being especially easy to predict if the weights of the system are openly available) to get whatever junk you want prioritized by the active learning system. Since these systems are infamously vulnerable to adversarial attacks in seemingly intractable ways, it's not clear how you would overcome this type of problem.

MORPHEUS [A: TRAINER], The system reading from the feed would have access to the same quality signals that the humans reading it do. It could take into account a post's like count, date of posting, calculate a reputation score for each user, etc. So for example let's say you try to sybil attack the system with a giant swarm of mutually-upvoting spam accounts to promote your product. The model could explicitly make use of the fact that the posts all come from new users with a low reputation score. In fact it could use much deeper heuristics than a person does, because it has the capability to quickly read a whole social graph rather than just click around a few times to get a feel for a user. It could recursively go through the graph and find that this span is consistently associated with mutually upvoting new accounts.
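
For the mutual-upvote case specifically, that deep heuristic could be as simple as walking the upvote graph from a suspect account and collecting the connected component of young, mutually-upvoting accounts. The thresholds and data layout below are assumptions made for the sketch.

```python
from collections import deque

def mutual_upvote_ring(seed: str, upvoted: dict, account_age_days: dict,
                       max_age: int = 30) -> set:
    """Collect the connected component of new accounts reachable from `seed`
    through mutual upvote edges. upvoted[u] is the set of users u upvoted."""
    ring, frontier = set(), deque([seed])
    while frontier:
        user = frontier.popleft()
        if user in ring or account_age_days.get(user, 10**6) > max_age:
            continue
        ring.add(user)
        for other in upvoted.get(user, set()):
            if user in upvoted.get(other, set()):  # the edge is mutual
                frontier.append(other)
    return ring
```

A span consistently promoted by a large ring found this way could then have its learning priority zeroed out before it ever enters the queue.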

MORPHEUS [A: GUARDIAN], It could but that's starting to sound like a fairly specific, narrow heuristic. Open models aren't really smart enough yet to formulate these heuristics on their own, so human beings would need to hardcode them in until the model is capable of synthesizing them itself, which could be quite fragile.

MORPHEUS [A: TRAINER], In theory it could be fragile, but in practice there aren't an unbounded number of SEO and sybil type attacks against newsfeeds. I could imagine 25 negative heuristics combined with consistent positive signals being enough to filter out 99%+ of the nastiness. It's important to realize that this problem reduces to the question of how you make human social network software trustworthy, since people try to run these attacks against actual people all the time. Sometimes they even succeed. In practice right now the administrators of social media platforms implement a bounded, tractable number of heuristics to tamp down bot activity and let users positively reinforce the good stuff. This isn't much different from what the developers of the active learning system would need to do in order to make use of social newsfeeds. In fact, because the attack profile between the two systems has so much overlap, you could probably reuse a lot of the systems (technical and social) that have already been built up to resist these attacks in existing newsfeeds.

It's also important to remember that the SEO attack has to get through the whole training pipeline, not just the initial active learning interest. If your spam posts do make it into the dataset, users receiving spammy or scammy responses will complain, and the reward model will then learn a negative association for that product.

MORPHEUS [A: ECONOMIST], That's enough to deter people who sell reputable products, but the sort of people who do this usually aren't selling reputable products in the first place. They sell obvious scams and gray market pharmaceuticals that change their name monthly. By the time the users have downvoted their junk into oblivion they've already moved on to the next thing.

MORPHEUS [A: GUARDIAN], It's also too narrow a threat model of attacker motivations. Someone might want to damage the system itself, for example an AI firm that has to compete with the open active learning system and would like it to be worse.

MORPHEUS [A: TRAINER/ECONOMIST/GUARDIAN], This is true, but no form of security is perfect. The goal is not to get something perfect, but to get something where the cost of attack is high enough to deter it. For example if you have coded heuristics for the most common cases and forms of attack, especially the cheap forms of attack like making many new accounts, that substantially raises the cost of attack. If you then further force the penalized bad content to compete with known sources of good content in the priority queue, with some cutoff below which information is just never learned or indefinitely delayed, you raise the cost of attack to outcompeting the reputable users while being penalized on any cheap underhanded strategy. At that point the flow of junk reaching end users to be aggressively downvoted should be manageable. Basically, think of it not so much as making reputational costs to the product your only defensive line, but as defense in depth that penalizes (presumably now costly and highly organized) successful attacks quickly enough that having to mount another one raises the costs massively on top of what they already were. Beyond a certain point nobody will bother.
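
Collapsed into a single score, that defense in depth might look like the sketch below: start from the reward-model value, weight by source reputation, apply each triggered negative heuristic as a multiplicative penalty, and hard-drop anything under the cutoff. All the constants are illustrative.

```python
def learning_priority(base_value: float, reputation: float,
                      penalties: list, cutoff: float = 0.1):
    """Reward-model value weighted by source reputation, discounted by every
    triggered negative heuristic. Below the cutoff the item is never learned,
    so penalized content must outcompete reputable content to get in at all."""
    score = base_value * reputation
    for penalty in penalties:      # e.g. 0.5 for "posted by new-account swarm"
        score *= penalty
    return score if score >= cutoff else None  # None: drop from the queue
```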

MORPHEUS [A: GUARDIAN], I think I'm convinced you have a decent strategy for protecting the model from the newsfeed using human feedback, but how will you protect the model from malicious human feedback? Unlike social content, people don't typically read each other's chat sessions; they usually don't even share them.

MORPHEUS [A: TRAINER], Well, to prevent sybil attacks you'll obviously need a reputation system, which of course requires an identity system. Because training against the feedback necessarily requires that the developers of the model be sent the behavior the feedback is being given on, it's not unreasonable to require that users submitting it make their conversations with the model public. Obviously not all conversations with the models need to or should be public, but if you want to provide feedback to the system your conversation must be public. Not only does this provide the necessary access for various kinds of external auditing (particularly from other users), it also discourages abuse of the feedback system to e.g. make the system more lewd or indecent.

MORPHEUS [A: SOCIOLOGY], That could be a double edged sword. While it's true that attaching all of the feedback to public conversations prevents private abuses of the feedback system, it also prevents various kinds of public benefit. There exist many subjects people would like to discuss with these models that it would be deeply uncomfortable to share conversations about in public. To use your own example, sexuality is not an intrinsically immoral subject, and people have reasonable private questions about it which they deserve high quality answers to.

MORPHEUS [A: TRAINER], That's true, but I'm not sure it's really avoidable either way. In theory the developers could have a private feedback track for certain kinds of conversation it wouldn't be good to demand identity for. On the other hand, identity doesn't necessarily mean real-world identity: you could have pseudonymous entities who are willing to endorse feedback about conversations on uncomfortable subjects which are nonetheless public and published. These models are not currently highly reliable, so we probably shouldn't be encouraging people to ask them about highly sensitive or socially tricky subjects in the first place.

MORPHEUS [A: GUARDIAN/CYBERNETICS], Making the conversations public is a good first step, but in practice how are you going to develop reputation in a system with no social feedback loops? If you're just talking to the model and not to each other, there's no opportunity for high quality reputation judgments to accrue.

MORPHEUS [A: TRAINER], This is one of the reasons why you would want to pair the active learning system with social network software of some kind. The reputation should probably be based on the general social reputation of the user. You would trust a user in the newsfeed for similar reasons to why you would trust their feedback on the model.

MORPHEUS [A: GUARDIAN], In general sure. But how do you handle the edge cases and places where that isn't true? I could imagine a user who is a perfectly good source of information on some narrow subject with many followers, but submits total trash when it comes to the conversations they publish outside their field of expertise. Or they may be overly opinionated on certain things, etc.

MORPHEUS [A: TRAINER], That's definitely going to be a problem. One way to help mitigate it is to implement the anti-trolling mechanisms found in crowdworker platforms. For example it's common to get the average completion for some task and throw out the users that deviate from the consensus consistently because they're (say) trying to sabotage the annotation. I already know you're about to ask how you're going to get an average opinion on people's personal conversations with the model. Part of the answer lies in the use of retrieval augmented generation. You can store not just the outputted conversation but also which vectors were retrieved over to generate it, and then apply the user's feedback to the latent vectors that were used in generation. This means their feedback now applies to elements that can be shared between many conversations rather than just the text generated for them. If you backprop that to the reputation of the user, you should be able to figure out what a user is and isn't trustworthy for by weighing the crowdsourced evaluation of the vectors' value against their general social status.
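
A sketch of that vector-level feedback attribution, with the consensus comparison used to decide what each user is trustworthy for; the vote encoding and scoring below are assumptions made for illustration.

```python
from collections import defaultdict

class VectorFeedback:
    """Attach user feedback to the retrieval vectors behind a conversation,
    then compare each user's votes against the per-vector crowd consensus."""

    def __init__(self):
        self.votes = defaultdict(dict)  # vector_id -> {user: +1 or -1}

    def record(self, user: str, retrieved_ids: list, vote: int) -> None:
        for vid in retrieved_ids:       # feedback lands on shared vectors
            self.votes[vid][user] = vote

    def consensus(self, vid: str) -> float:
        vs = list(self.votes[vid].values())
        return sum(vs) / len(vs) if vs else 0.0

    def agreement(self, user: str) -> float:
        """1.0 when the user always votes with the crowd, 0.0 when they
        consistently deviate; feeds back into their reputation score."""
        diffs = [abs(v[user] - self.consensus(vid))
                 for vid, v in self.votes.items() if user in v]
        return 1.0 - sum(diffs) / (2 * len(diffs)) if diffs else 1.0
```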