I started on these again 2 days ago, since until now I had no time to spend on these formats. There are 3 of them, one for each DirectX version, but the structure is the same in all of them.
Important ->> the file endianness is still unknown to me. Some values read as
little endian and some as big endian. I hope I'm wrong, but so far I have found no way to prove otherwise.
Here is a short summary of what I know so far. I need help especially with one specific part.
File structure at a glance:
1 - texture names and/or shader properties
2 - compiled shaders
3 - data table 1 (48 byte)
4 - data table 2 (88 byte)
5 - referenced geometry, shaders, emitters in plain text + 16 byte string: can be unique or generic
6 - data table 3 (length of each line not defined)
Now let's talk about part 1.
The file has some fields at the start that don't seem to give sizes, offsets or anything like that (at least for now).
Then there is an array of fields that do have some relation to the textures, but since the values look like 01020300 and so on, they don't make sense to me yet. This section starts with relative offsets to the text strings. One byte gives the number of texture text strings + 1, another byte gives the number of shader properties (if any), and another byte covers some other kind of properties (if any). That's pretty much the first part. A lot of it is blank data, though.
part 2 - compiled shaders. This part makes up most of the file, but there is not much in it beyond an ID string for each shader; for example, the string RD11 is always present in compiled shaders. A section of each compiled shader has relative offsets that point to names such as sampler0, outdoorLightHemisphereDir, and so on, in the same sequence as the text strings. After these strings comes the Microsoft HLSL shader compiler string, then some text like TEXCOORD0, and then straight unknown data. Someone with knowledge of shaders might know what this is.
I have found no actual reference to how many compiled shaders exist in a file, or to which texture each one belongs.
part 3. The next part is the most interesting. Before this section there are 4 fields, in little endian. The first one holds unknown info; all I know is that its value grows the more strings are stored here. The next field varies too, but I don't know whether that is also due to the number of strings. The next field seems to be constant per version: the same in all dx9 files, the same in all dx10 files and the same in all dx11 files, but different from one version to another, ex: dx09 is 0x00000001, dx11 is 0x00000002, etc. The last field before the actual data is the number of strings stored here. After these fields you have a lot of strings that god knows what they mean. The only thing I noticed and could prove is that they're stored in ascending order, but to notice this you have to switch from little endian to big endian. The ascending order holds up to the 8th byte. Each string here is 48 bytes.
I don't want to post all my theories about what this part is, because right now I have many. This is the part I'm having trouble with; as you will see, it does not seem to relate to anything. The number of compiled shaders is not the same as the number of strings here; there are far more strings here.
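For anyone who wants to check the part 3 layout on their own files, here is a minimal Python sketch. It assumes the 4 fields are 4 bytes each (the example values like 0x00000001 suggest that) and that you already know the offset where they start. The names parse_table1, is_ascending_by_prefix, unknown_a and unknown_b are made up for illustration; they are not from the format.

```python
import struct

RECORD_SIZE = 48  # each part 3 string is 48 bytes, per the post


def parse_table1(buf, offset=0):
    """Read the 4 little-endian fields, then the 48-byte records after them.

    Assumption: each of the 4 fields is a 4-byte unsigned integer.
    Field names are hypothetical; only the last one (count) is understood.
    """
    unknown_a, unknown_b, version_tag, count = struct.unpack_from("<4I", buf, offset)
    start = offset + 16
    records = [
        buf[start + i * RECORD_SIZE : start + (i + 1) * RECORD_SIZE]
        for i in range(count)
    ]
    return (unknown_a, unknown_b, version_tag, count), records


def is_ascending_by_prefix(records, prefix_len=8):
    """Check the ascending order seen when the first 8 bytes are read big endian.

    Comparing equal-length byte strings left to right gives the same order
    as comparing them as big-endian unsigned integers.
    """
    keys = [r[:prefix_len] for r in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

If is_ascending_by_prefix returns True for every database you try, that supports reading the first 8 bytes as a big-endian sort key, even though the count field in front of them is little endian.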
part 4 - data table 2. This one is pretty easy at first glance. A little-endian field gives the number of lines here too. Note that the number of lines here is the same as in part 3, so if part 3 has 9000, here we will have 9000. Each line is 88 bytes.
Each line begins with a 16 byte sequence, let's call it string 1. This string 1 is unique, not in the sense that it appears only once across the file, but in the sense that it might work as an ID, or as a parent node if you like. The next 16 bytes can contain another string (call it string 2) and the next 16 always contain a string, call it string 3. The remaining 40 bytes of each line are unknown, mostly blank data.
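The 88-byte line described above can be split like this. It's a small Python sketch of the layout as I described it (16 + 16 + 16 + 40 bytes); Table2Line, parse_table2_line and count_string1 are hypothetical names, and the 40-byte tail is kept raw because its meaning is unknown.

```python
from collections import Counter
from typing import NamedTuple

LINE_SIZE = 88  # each part 4 line is 88 bytes, per the post


class Table2Line(NamedTuple):
    string1: bytes  # 16 bytes, works like an ID / parent node
    string2: bytes  # 16 bytes, may be empty
    string3: bytes  # 16 bytes, always present
    tail: bytes     # 40 bytes, unknown, mostly blank


def parse_table2_line(line):
    """Split one 88-byte line into its three 16-byte strings and the tail."""
    if len(line) != LINE_SIZE:
        raise ValueError(f"expected {LINE_SIZE} bytes, got {len(line)}")
    return Table2Line(line[0:16], line[16:32], line[32:48], line[48:88])


def count_string1(lines):
    """Count how many lines share each string 1 value."""
    return Counter(parse_table2_line(line).string1 for line in lines)
```

The counts from count_string1 can then be compared with the repeat byte stored after each string ID in part 6.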
To give you an idea:
string 1/string ID appears in 3 parts: 4, 5 and 6. In part 4 it appears as many times as needed, in parts 5 and 6 only once.
string 2/generic type 1 appears in 2 parts: in part 4 as many times as needed and in part 5 only once.
string 3/generic type 2 appears in 2 parts, 4 and 5, and in both it appears as many times as needed.
(replace part 1 with string 2/generic type 1 and part 2 with string 3/generic type 2)
I will explain the relationship between parts 4, 5 and some of part 6 at the end.
part 5 - 2 fields before the data. One field gives the string length (16 bytes) and the next is more interesting: it is the number of strings of the first type (string ID) that exist in this section. Each of these appears only once in this part.
For the most part, what you get here are repeated strings of the third type (generic). Example:
That string, which is stored as plain text in the file, appears many times. Another example:
And that works the same for string IDs and the other generic string type.
part 6. Actually, this part is inside part 5, but I have separated it to make it easier to understand, so there are no fields determining anything, just data. This one was pretty fun. The structure of the data section is like this:
pair of bytes (number of pairs can vary) - string ID - byte 1 - 3 blank bytes.
The string ID is the one that appears in parts 4 and 5. The byte after it is the number of times the string is repeated in part 4.
Unknown bytes: pairs of 2 bytes, in little endian. This is more or less how I worked it out:
To understand what I mean: we said in this example that if part 3 has 9000 strings, then part 4 will have 9000. We also said that in part 5 we get to know how many unique string IDs exist; let's say, in this case, 100.
So we have this data:
9000 strings in parts 3 and 4. 100 unique strings. Each string is 16 bytes, there is always one byte after it and then 3 blank bytes, so in total 20 bytes that are surely present for each one. (I found all this by looking one by one, so I know they're always there.)
Now we do some math. Let's say that this data part is 20000 bytes. 20 bytes x 100 unique strings = 2000. 20000 - 2000 = 18000, which divided by 2 gives us 9000. This is the same as the number of strings. After testing it with many shader databases, it works. One little discrepancy is that if there are 9000 pairs, as in this case, you probably won't find a pair with the value 9000. You'll find it by searching for one less, 8999 instead of 9000.
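The same arithmetic as a tiny Python check, so it can be run against several databases. The function name expected_pair_count is made up; the 20 fixed bytes per entry (16 + 1 + 3) and the 2 bytes per pair come from the description above.

```python
ENTRY_FIXED_SIZE = 16 + 1 + 3  # string ID + count byte + 3 blank bytes
PAIR_SIZE = 2                  # each unknown pair is 2 bytes


def expected_pair_count(section_size, unique_ids):
    """Bytes left after the fixed 20 bytes per unique ID, divided by 2."""
    remaining = section_size - ENTRY_FIXED_SIZE * unique_ids
    if remaining < 0 or remaining % PAIR_SIZE:
        raise ValueError("section size does not fit the assumed layout")
    return remaining // PAIR_SIZE


# Numbers from the example: 20000 bytes, 100 unique string IDs.
print(expected_pair_count(20000, 100))  # 9000
```

About the 8999 discrepancy: one possible reading (just a guess, not proven) is that the pairs are zero-based indexes, so 9000 of them would run from 0 to 8999.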
So, after this, I know that parts 3 to 6 are all connected. The string IDs and generic strings 1 and 2 exist across all shader databases.
The lines of part 3 have discrepancies, though. Of the 48 bytes in a line, the first 28 are the same across all the shader databases and the last 20 are not. Those last 20 bytes have some interesting fields that look like offsets, but for now I'm stuck. I thought these fields might be pointers in the range from 0 to the number of pairs, let's say from 0 to 9000, but I have not found a consistent pattern: some numbers appear, some don't, and some appear in different places, i.e. not in the columns you would expect them in.
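One way to test the pointer idea a bit more systematically is to measure, per column, how often the value falls inside the expected range, and to do it for both byte orders. This Python sketch assumes the last 20 bytes are five 4-byte unsigned fields, which is only a guess; tail_fields and in_range_ratio are hypothetical names.

```python
import struct


def tail_fields(record, order="<"):
    """Read bytes 28..47 of a 48-byte record as five 4-byte unsigned ints.

    Assumption: the 20-byte tail is five uint32 fields.
    order is "<" for little endian or ">" for big endian.
    """
    return struct.unpack_from(order + "5I", record, 28)


def in_range_ratio(records, limit, order="<"):
    """For each of the five columns, the share of records with value < limit."""
    if not records:
        return [0.0] * 5
    hits = [0] * 5
    for record in records:
        for column, value in enumerate(tail_fields(record, order)):
            if value < limit:
                hits[column] += 1
    return [h / len(records) for h in hits]
```

A column that really holds indexes into the pairs should score close to 1.0 with limit set to the pair count in one byte order and much lower in the other. Columns that score low in both orders are probably something else, or not 4 bytes wide.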
Any help is appreciated guys, especially with part 3. Compiled shaders are out of my league.