yeah, I've been wondering this too. from what I've seen, a lot of them use another LLM as the evaluator, so one LLM grades another's output. or they have a 'reference answer' that the output gets compared against. I've also heard of some that just check for certain keywords or a certain structure. not sure it's always fair though, tbh, since a correct answer worded differently could still score low.
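fwiw here's a rough sketch of what the keyword and reference-answer checks could look like. this is just my guess at the general idea, not how any specific benchmark does it. the function names (`keyword_score`, `reference_f1`) and the token-overlap F1 metric are my own assumptions for illustration. I left out the LLM-as-judge part since that needs a model API.

```python
import re
from collections import Counter


def tokens(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(answer, keywords):
    # Hypothetical keyword check: fraction of required keywords
    # that appear as whole tokens in the answer.
    if not keywords:
        return 0.0
    found = set(tokens(answer))
    hits = [k for k in keywords if k.lower() in found]
    return len(hits) / len(keywords)


def reference_f1(answer, reference):
    # Hypothetical reference match: token-overlap F1 between the
    # answer and the reference answer (an assumed metric).
    a, r = Counter(tokens(answer)), Counter(tokens(reference))
    overlap = sum((a & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = "The capital of France is Paris"
    print(keyword_score("Paris is the capital of France", ["paris", "france"]))
    print(reference_f1("It's Paris", ref))
```

the second call is the fairness issue in a nutshell: "It's Paris" is correct but shares only one token with the reference, so an overlap-based score comes out low.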