Test ~100 AI models against YOUR specific prompts. Get deterministic scores, real API costs, and stability metrics. Built this after discovering that the "best" model for my RAG pipeline was being outperformed by one that cost 10x less. No LLM-as-judge. No voting. Just reproducible results for your actual use case. • 18 scoring modes • Real cost/efficiency calculations from API pricing • Vision & document support • Beginner-friendly yet capable of deep, complex use • Free tier available
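To make "deterministic scoring" and "real API costs" concrete, here is a minimal sketch of the general idea, not OpenMark AI's actual implementation: the model names, pricing numbers, and `run_model` helper are hypothetical placeholders, and real per-token prices vary by provider and date.

```python
# Hypothetical sketch: deterministic exact-match scoring plus cost-per-correct-answer
# derived from per-token API pricing. Not OpenMark AI's actual code or model list.

from dataclasses import dataclass

# Illustrative per-million-token prices in USD (made up for this example).
PRICING = {
    "model-a": {"input": 2.50, "output": 10.00},
    "model-b": {"input": 0.15, "output": 0.60},
}

@dataclass
class RunResult:
    answer: str
    input_tokens: int
    output_tokens: int

def run_model(model: str, prompt: str) -> RunResult:
    """Placeholder for a real provider API call returning the answer and token usage."""
    raise NotImplementedError("wire this to your provider's SDK")

def cost_usd(model: str, r: RunResult) -> float:
    # Real cost comes straight from published per-token pricing, no estimation.
    p = PRICING[model]
    return (r.input_tokens * p["input"] + r.output_tokens * p["output"]) / 1_000_000

def benchmark(model: str, cases: list[tuple[str, str]]) -> dict:
    """Deterministic scoring: same prompts and expected answers always give the same score."""
    correct, total_cost = 0, 0.0
    for prompt, expected in cases:
        result = run_model(model, prompt)
        total_cost += cost_usd(model, result)
        if result.answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {
        "model": model,
        "accuracy": correct / len(cases),
        "total_cost_usd": round(total_cost, 4),
        "cost_per_correct": round(total_cost / max(correct, 1), 4),
    }
```

Comparing accuracy against cost-per-correct-answer across models is how a cheaper model can come out ahead of a "flagship" one for a specific use case.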
Comments (2)
This is super compelling, especially the focus on reproducible results and real cost efficiency. Testing models against your own prompts without LLM-as-judge feels like a much more honest way to choose the right model.
Built OpenMark AI after finding a cheaper model beat a 'flagship' one for my task. Stop trusting generic benchmarks: test models on YOUR prompts with deterministic scoring, real costs & 100+ models.
@kean this is very timely. I could have used this when I chose gpt-4o for a client's agentic flow several months ago, but found out months later that 4.1-mini was performing better for his use case AND much cheaper...
@theaspirinv thank you! This is verbatim what happened to me 8 months ago. I built a RAG pipeline and found that cheaper models actually performed better! So I made this benchmarking tool. Now I regularly use it to check for drift.