Conversation

Now the question is: how do I compare the logic in a few hundred similar but far from identical applications? So far the plan is:

  1. Use pyelftools to parse the binaries.
  2. Find a particular string in the data segment and calculate its address.
  3. Use Capstone to disassemble the entire code segment.
  4. Find the instruction accessing this string, indicating the function I’m interested in.
  5. Identify the boundaries of the function containing this instruction.
  6. Analyze other instructions of the function, finding out which global variables it accesses.
  7. Check whether it passes any of these global variables to another function which can be identified by the parameters it receives.
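Steps 1 and 2 could look roughly like the sketch below. It is a minimal illustration, not the code actually used here: the helper names `string_addresses` and `string_addresses_in_elf` are made up, and the choice to search every allocated, non-executable section is an assumption about where the string lives.

```python
def string_addresses(section_data: bytes, section_addr: int, needle: bytes):
    """Return the virtual address of every occurrence of needle in one section."""
    addresses = []
    pos = section_data.find(needle)
    while pos != -1:
        # Virtual address = section load address + offset inside the section.
        addresses.append(section_addr + pos)
        pos = section_data.find(needle, pos + 1)
    return addresses


def string_addresses_in_elf(path: str, needle: bytes):
    """Search all allocated, non-executable sections of an ELF file."""
    # Deferred import so the pure helper above works without pyelftools.
    from elftools.elf.constants import SH_FLAGS
    from elftools.elf.elffile import ELFFile

    found = []
    with open(path, "rb") as f:
        for section in ELFFile(f).iter_sections():
            flags = section["sh_flags"]
            if section["sh_type"] == "SHT_NOBITS":
                continue  # .bss and friends have no file contents
            if not flags & SH_FLAGS.SHF_ALLOC or flags & SH_FLAGS.SHF_EXECINSTR:
                continue  # not mapped at runtime, or code rather than data
            found += string_addresses(section.data(), section["sh_addr"], needle)
    return found
```

The address returned here is what the code in step 4 has to end up loading into a register.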

Don’t get me wrong, Capstone is a great tool. But here the steps starting with 4 are unnecessarily complicated. First of all, loading an address into a register takes two instructions on both ARM and MIPS, and Capstone won’t help me figure this out. And Capstone isn’t much of a help for finding function boundaries either.
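To illustrate the two-instruction problem: on MIPS an address is typically built with `lui` (upper 16 bits) followed by `addiu` (sign-extended lower 16 bits), so neither instruction alone contains the string address. The sketch below pairs them up by decoding the raw instruction words directly. It is deliberately naive and rests on assumptions: the function name `find_address_loads` is hypothetical, the scan is linear, and it does not notice when a register is overwritten by some other instruction between the two halves.

```python
import struct

OP_LUI = 0x0F    # lui rt, imm       -> rt = imm << 16
OP_ADDIU = 0x09  # addiu rt, rs, imm -> rt = rs + sign_extend(imm)


def find_address_loads(code: bytes, big_endian: bool = True):
    """Return (offset, address) for each addiu that completes a lui/addiu pair.

    Naive on purpose: linear scan, no control flow, no tracking of
    registers clobbered by instructions other than lui.
    """
    fmt = ">I" if big_endian else "<I"
    upper = {}  # register number -> value set by the most recent lui
    loads = []
    for offset in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from(fmt, code, offset)
        opcode = word >> 26
        rs = (word >> 21) & 0x1F
        rt = (word >> 16) & 0x1F
        imm = word & 0xFFFF
        if opcode == OP_LUI:
            upper[rt] = imm << 16
        elif opcode == OP_ADDIU and rs in upper:
            # The 16-bit immediate of addiu is sign-extended.
            low = imm - 0x10000 if imm & 0x8000 else imm
            loads.append((offset, (upper[rs] + low) & 0xFFFFFFFF))
    return loads


# lui $a0, 0x41 ; addiu $a0, $a0, -0x7ff0
code = bytes.fromhex("3c040041" "24848010")
print([(off, hex(addr)) for off, addr in find_address_loads(code)])
# prints [(4, '0x408010')]
```

Note the sign extension: the upper half is 0x41, not 0x40, even though the final address is 0x408010. Searching the disassembly for the address split into two plain 16-bit halves would miss this.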

I tried spimdisasm, and it solves both issues. Unfortunately, it only does MIPS and I don’t see anything comparable for the ARM platform.

As to proper decompilers, the scenario “decompile an entire file automatically and reasonably quickly, doesn’t have to be good” seems to be an uncommon one. RetDec for example works for ten minutes before simply giving up.

@WPalant Why not Ghidriff/BinDiff/Diaphora?

@buherator Nah, diffing tools aren’t going to be of much use. These binaries are way too different. They aren’t being built from a shared code base, despite sharing much code.

But I didn’t know that Ghidra exposes an API which can be used by command line tools. That may be good enough.

@WPalant Not really grasping the situation but Pigaios by @joxeankoret may also be interesting?

https://github.com/joxeankoret/pigaios

@buherator Yeah, I doubt that this will work. From what I have seen, matching any kind of AST onto these binaries will produce lots of false negatives. They all have these one or two functions I want to look into (I think), but they are always somewhat different. The strings are the only factor that I’m reasonably confident to be stable, and even here I want to account for some variation.
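Accounting for some variation in the strings could be done with fuzzy matching from the standard library. This is only a sketch under assumptions: the helper names, the minimum string length of 6 and the similarity cutoff of 0.8 are arbitrary illustrative choices, and the example string is invented.

```python
import difflib
import re


def printable_strings(blob: bytes, min_len: int = 6):
    """Extract runs of printable ASCII, similar to the Unix `strings` tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, blob)]


def closest_strings(blob: bytes, expected: str, cutoff: float = 0.8):
    """Return up to three strings from blob that resemble the expected one."""
    return difflib.get_close_matches(
        expected, printable_strings(blob), n=3, cutoff=cutoff
    )
```

Calling `closest_strings(data, "Failed to open config file %s")` would still find a variant such as `"Failed to open config file: %s"`, while unrelated strings fall below the cutoff.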


@WPalant
I don't compare ASTs, in general. Actually, one of the most used heuristics in Pigaios is... string matching.
@buherator

@WPalant Right, then Ghidra API sounds like a good choice indeed. Also note the Ghidra VersionTracker has several "atomic" matchers (like string matcher) which may also be useful, and _maybe_ they can be configured by creating a Ghidriff plugin?

@WPalant
I believe you are misunderstanding how binary diffing tools work: they aren't designed for patch diffing, they are designed for finding similarities. At least Diaphora and BinDiff, when working on binaries, will only match what is shared or similar between the binaries.
@buherator


@joxean @buherator The functions that I’m interested in are going to be different. I’ve already analyzed some variants, and I need to know what else is there – or where the differences are not essential. Identifying these functions within each binary despite the differences is actually the first challenge here…


@WPalant
Identifying the functions you mention, in my opinion, is actually the work of these tools. Feel free to reinvent the wheel if you feel like doing so, but keep in mind that you're doing that.
@buherator


@joxean Thank you, I am aware. 😅

Reading the descriptions of these tools, they are absolutely not meant for my use case. The goal as stated is identifying changes as a piece of software evolves, not comparing semi-related binaries built for wildly different architectures. I believe you of course that these tools can be made to work for me, and I will take a closer look. But the question is always: what’s the cost? Since I’m unfamiliar with any of these tools, I expect to sink a significant amount of time into this. BinDiff’s statement that Ghidra isn’t really supported (no, I didn’t pay for an IDA license) already doesn’t sound good.

On the other hand, I am already quite familiar with Ghidra and its output. So I expect to figure things out quickly here. I generally happen to be better at coding up a quick solution than at figuring out someone else’s massive codebase (which it is ultimately going to be because anything other than the primary purpose of the tool isn’t going to be well-documented). This is quite likely reinventing the wheel, but it’s going to be fairly quick reinventing, which for this one-time occasion might be a better choice.

I hope you see that we have very different starting points here.

@buherator
