Parsing XML files and strings

Author:  HenryEx [ Sun Sep 03, 2017 11:14 pm ]
Post subject:  Parsing XML files and strings

I seem to encounter this from time to time and iirc have never found a good answer to this.

I'm trying to parse an XML file with QBMS (i know... not really the tool of choice) and having trouble with the numbers in the file. Luckily, the XML i'm trying to parse has only one tag per line. The tag looks something like this:
<Tag id="4" idx="35" variable1="CanBeNumbersOrString" text="Lorem Ipsum">TAG_VALUE</Tag>

Let's say i want to read out the idx in a tag. So i do this:

  get DATA line 0
    # get index
    string TEMP = DATA
    string TEMP 0| "idx=\""
    string TEMP 0% "\""
    string INDEX = TEMP

So, how do i now change the string INDEX = "35" into long INDEX = 0x23?

Actually, i tried outputting what i get to a text file to check it (with "putct INDEX string -1 MEMORY_FILE" and logging the memfile at the end) and i don't even get any output there, not even string numbers, it seems. I must be doing something wrong.

Okay no, it's not that i get nothing: if i do the above, i actually get the byte value written to the text (it outputs 0x23) instead of 0x33 0x35, which is ASCII for '35'. That's despite working entirely in strings in QBMS. Now i'm completely confused.
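For reference, the two representations being mixed up here can be shown outside QuickBMS. The short Python sketch below is only an illustration (Python is not used anywhere in this thread, and the variable names are made up): the text "35" is the two ASCII bytes 0x33 0x35, while the number 35 is the single byte 0x23.

```python
# Illustration only: the text "35" versus the numeric value 35.
text = "35"

# As text, the field is two ASCII characters: '3' (0x33) and '5' (0x35).
ascii_bytes = text.encode("ascii")

# As a number, the same field is the value 35, which is 0x23 in hex.
value = int(text)
raw_byte = bytes([value])

print(ascii_bytes.hex())  # 3335
print(hex(value))         # 0x23
print(raw_byte)           # b'#'
```

So a single 0x23 byte in the output suggests the field was written as a number rather than as text.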

Author:  HenryEx [ Mon Sep 04, 2017 12:14 am ]
Post subject:  Re: Convert string number to number value

Okay so i think i worked it out. It's two factors messing me up and giving me wildly inconsistent results when i try to parse the tag.
One: When you cut down a string to only numbers, it seems to automatically become a number instead of a string. Good to know.
Two: Knowing the above, using string VAR = VAR is a very bad idea.
I'm leaving the following here for future reference.

So, when trying to parse this example tag and output values from it to text, for example:
<Tag id="4" idx="35" variable1="CanBeNumbersOrString" text="Lorem Ipsum">TAG_VALUE</Tag>

You can do the following for pure string fields:
  get DATA line 0
    # get text
    string TEMP = DATA
    string TEMP 0| "text=\""
    string TEMP 0% "\""
    string VNTEXT = TEMP
    putct VNTEXT string -1 MEMORY_FILE

You can't do that for number fields though, since your "string" is a number as soon as you cut it down to the value, and then the string operation takes it as an ASCII value and you'll put down null bytes or other funky control characters. So for number fields you do:
  get DATA line 0
    # get index
    string TEMP = DATA
    string TEMP 0| "idx=\""
    string TEMP 0% "\""
    math INDEX = TEMP
    putct INDEX string -1 MEMORY_FILE
(Putting a number down as a string with putct properly converts it to a string)

If you have fields that can hold text OR numbers, like the variable1 field, you have to do YET another thing:
  get DATA line 0
    # get var1
    string TEMP = DATA
    string TEMP 0| "variable1=\""
    string TEMP 0% "\""
    set VAR1 string TEMP
    putct VAR1 string -1 MEMORY_FILE

That was a pain to figure out, since i wanted to use a value from the tags in array look-up and it didn't really work.

Author:  aluigi [ Mon Sep 04, 2017 5:44 pm ]
Post subject:  Re: Convert string number to number value

Parsing strings with quickbms is a real challenge; luckily it's something that rarely happens.
There are some scripts I made in which I "parsed" xml data:

As you can see they use all different solutions and the xml format of the input files was different.
Probably the first one is more similar to your situation.

There is the "S" option of the String command that does a very good job separating the elements of a string, but it can't interpret 'text="Lorem Ipsum"' as one element because the " char is not at the beginning (and this is the correct behaviour).
Using the sscanf option "s" will give even more problems.

Long story short, what do you think about the following?
get MYLINE line

string ID = MYLINE
string ID | "id=\""
string ID % "\""

string IDX = MYLINE
string IDX | "idx=\""
string IDX % "\""

string VARIABLE1 = MYLINE
string VARIABLE1 | "variable1=\""
string VARIABLE1 % "\""

string TEXT = MYLINE
string TEXT | "text=\""
string TEXT % "\""

print "ID %ID%"
print "IDX %IDX%"
print "TEXT %TEXT%"

Author:  HenryEx [ Tue Sep 05, 2017 5:05 pm ]
Post subject:  Re: Convert string number to number value

Yea, that's basically what i use now, except i use the detour over a TEMP variable because not every tag i read always has all the variables present and i read multiple tags, so at the start of each loop i assign all vars a default value and only update the value if the searched string isn't empty.
The examples with FindLoc are very useful, in case i encounter some files where multiple tags aren't separated by a line break. I can search for the opening tag and the closing tag and read between these offsets instead. Granted, that also only works if i know the order in which the tags appear, if there's various ones.

But since you mentioned the split command: I noticed that some of the variables can have multiple text values in a row, separated by a certain sign, like an underscore. Like the text for one of them might be "String1_String2_String3" or something. Is there some way to split a string at a certain delimiter? The problem here is that i don't know how many delimiters are present, if any at all. Looking at the documentation of the S command though, i don't really get how it works, or if it even does the thing i want here?
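To pin down what is being asked for here, this is the intended result sketched in Python. It is not QuickBMS code and says nothing about how the String S option behaves; `split_field` is a hypothetical helper name. The point is that the number of delimiters does not need to be known in advance, and a value without any delimiter simply comes back as one part.

```python
def split_field(value, delimiter="_"):
    """Split value at every delimiter; without a delimiter, return one part."""
    return value.split(delimiter)


parts = split_field("String1_String2_String3")
single = split_field("NoDelimiterHere")

print(parts)       # ['String1', 'String2', 'String3']
print(len(parts))  # 3
print(single)      # ['NoDelimiterHere']
```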

And since i'm sometimes parsing XML text, is there an easy way to convert the XML escape characters like &amp; or &quot; ? Or is my best choice to run this on every single string:
string XMLTEXT replace "&lt;" "<"
string XMLTEXT replace "&gt;" ">"
string XMLTEXT replace "&amp;" "&"
string XMLTEXT replace "&quot;" "\""
string XMLTEXT replace "&apos;" "'"
Does the String Replace function replace every instance in the string or just the first one? I have no way of knowing if the string even has any escape characters in the first place, but i assume if none are found the string is left unaltered.
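One detail worth noting about the replacement order, independent of the tool: if "&amp;" is replaced before the other entities, a doubly escaped text such as "&amp;quot;" first becomes "&quot;" and is then wrongly turned into a quote. Handling "&amp;" last avoids that. The Python sketch below only illustrates this ordering; `unescape_xml` is a hypothetical name, and it does not answer how QuickBMS's own replace behaves.

```python
def unescape_xml(text):
    """Replace the five predefined XML entities, handling '&amp;' last."""
    replacements = (
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&quot;", '"'),
        ("&apos;", "'"),
        ("&amp;", "&"),  # last, so "&amp;quot;" ends up as "&quot;"
    )
    for entity, char in replacements:
        # str.replace in Python replaces every occurrence and leaves
        # the text unchanged when the entity is not present.
        text = text.replace(entity, char)
    return text


print(unescape_xml("a &lt; b &amp;&amp; c"))  # a < b && c
print(unescape_xml("&amp;quot;"))             # &quot;
print(unescape_xml("no entities here"))       # no entities here
```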

Author:  HenryEx [ Tue Sep 05, 2017 7:16 pm ]
Post subject:  Re: Parsing XML files and strings

I've written up an example of searching for an XML tag across multiple lines (i came across at least one tag that had a line break in it after all); the extracted string includes the opening and closing tags themselves.
# assumed to run inside a for/next loop, since it uses break
findloc TAG_OFF string "<Tag" 0 ""
if TAG_OFF != ""
  findloc TAG_END string "</Tag>"
  xmath TAG_SZ "TAG_END - TAG_OFF + 6"  # include end tag in size
  goto TAG_OFF 0
  getdstring XMLSTRING TAG_SZ 0
else
  break  # no more tags
endif
Assuming i want to remove any possible line breaks in my string: can i do this directly via string remove like string XMLSTRING - "\x0D\x0A" or do i have to make a variable first like this:
string CR = "0x0D"
string LF = "0x0A"

Author:  aluigi [ Wed Sep 06, 2017 4:09 pm ]
Post subject:  Re: Parsing XML files and strings


The "_" command removes spaces-like chars from beginning and end

Author:  aluigi [ Thu Sep 07, 2017 5:17 pm ]
Post subject:  Re: Parsing XML files and strings

quickbms 0.8.1 will be released this week-end so I have decided to check if there is any possibility of adding a sort of universal parser for strings and formats like XML and JSON.
Obviously it's impossible to parse XML and JSON with a tool like quickbms because they are nested structures while quickbms works step-by-step and is designed only for binary data.
Anyway having "something" able to easily handle a file/string like the one you provided is for sure better than nothing and better than using work-arounds in bms language :)
I will let you know if such (experimental!) feature will be available or not in the upcoming new version.

Author:  aluigi [ Thu Sep 07, 2017 10:09 pm ]
Post subject:  Re: Parsing XML files and strings

The new feature works very well in my tests.
I had to use a work-around to retrieve the value of the tags, but it works well for the general cases in which it will be used.
I leave an example script based on your sample that demonstrates how it works:
get SIZE asize
getdstring TMP SIZE

string RET X TMP
print "Tags and parameters found: %RET%"

if RET & ",Tag,"
    print "The content of the html/xml tag is %Tag%"
endif

if RET & ",variable1,"
    print "The content of variable1 is %variable1%"
endif

Basically the code considers the input as a sequence of parameters and values (par=val) and every parameter will be a new variable.
RET will contain the list of parameters that have been found in the input, they are all separated by a comma and there is a comma at both beginning and end of the variable to allow easy searching of desired parameters like in the example (",variable1," or ",variable" or "1," and so on).
If the parameter exists in the list then you can read its content like a bms variable.
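The reason for the comma at both ends of the list can be shown with a small Python sketch. This is an illustration only, not QuickBMS code: `make_list` and `has_param` are hypothetical helpers, and the parameter names and their order are assumed from the sample tag rather than taken from real output. Searching for ",name," matches only a whole entry, while a search without the commas also matches partial names.

```python
def make_list(names):
    """Build a comma-delimited list with a comma at both beginning and end."""
    return "," + ",".join(names) + ","


def has_param(ret, name):
    """Exact membership test: the name must be enclosed by commas."""
    return ("," + name + ",") in ret


# Assumed content, based on the sample tag from this thread.
ret = make_list(["Tag", "id", "idx", "variable1", "text"])

print(ret)                          # ,Tag,id,idx,variable1,text,
print(has_param(ret, "variable1"))  # True
print(has_param(ret, "variable"))   # False
print(",variable" in ret)           # True (prefix search)
```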

It works recursively but can't create "levels" of variables so it's up to you to provide a valid input.

There are no plans yet to implement this feature in the Get command to read the xml fields directly from the input file, mainly because this is just a generic experimental feature to make life easier in those rare cases in which it's necessary to parse some xml data.

Author:  aluigi [ Sun Sep 10, 2017 12:12 pm ]
Post subject:  Re: Parsing XML files and strings

Quickbms 0.8.1 is out and the following is an additional example script:
    get INPUT line
    string RET X INPUT
    if RET & ",Tag,"
        print "\n%Tag%: %id% %idx%"
        if RET & ",variable1,"
            print "variable1: %variable1%"
        endif
        if RET & ",text,"
            print "text: %text%"
        endif
    endif

Author:  HenryEx [ Mon Sep 11, 2017 12:09 am ]
Post subject:  Re: Parsing XML files and strings

Wow, that simplifies tag input reading by a lot, especially if there's like 10 different possible variables to check. I'll update right away and rewrite my script soon, to see if there's any problems.

Could you give an example on how the String J command would be used? The readme doesn't go into detail on that one. If VAR2 is a variable, does it just output the string "{ "variablename": "value" }" to VAR1 or what happens?

Author:  aluigi [ Mon Sep 11, 2017 8:44 am ]
Post subject:  Re: Parsing XML files and strings

Well, J is more like a useless thing I added just because there is already a similar feature for html/xml ('T'), it's just a sort of formatter/beautifier.
Take this input example:
{"var":"hello","test":[{"blah":"blah value","num":1234},{"bool":false,"myfloat":123.456}],"a b c d":[1,2,3,4,9999]}
And this is the example script:
get SIZE asize
getdstring VAR SIZE
string RET J VAR
print "%RET%"

  var   hello
  test   [
      blah   blah value
      num   1234
      bool   false
      myfloat   123.456
  a b c d   [

If you have an xml/html page you can use the same script, replacing 'J' with 'T'. You can also try using 't' for html-only data, which will show a sort of text-only page with all the tags filtered out.

All times are UTC