question

AlanBarclay-1668 asked LesHay-2099 answered

Help using VB.Net Regex to split file at specific header text

Hi Guys, my PC developed an issue, and as a result a large collection of files got merged into single files.
I have decided to create a VB.NET application that will scan a set path and process each file it finds, e.g.:

 ' Needs: Imports System.IO and Imports System.Text.RegularExpressions
 Sub Process_File(ByVal sFileName As String)
     Dim sData As String
     Using SR As New StreamReader(sFileName)
         sData = SR.ReadToEnd() ' End Using closes and disposes the reader
     End Using
    
     ' Regex.Escape stops header bytes being read as regex metacharacters
     Dim sPattern As String = Regex.Escape("wwww" + Chr(x) + Chr(y) + Chr(z)) '*** <- This is the specific search pattern : w = 4 char string, x, y & z = byte value
     Dim uCNT As Integer = 0 ' Integer rather than UInt16: offsets can exceed 65,535
     Dim uLen As Integer = 0
    
     Dim matches As MatchCollection = Regex.Matches(sData, sPattern)
    
     For Each match As Match In matches
         TextOutput(String.Format("{0} : {1} : {2}", match.Index, match.Length, match.Value) + vbCrLf)
    
         Dim sOutFile As String = String.Format("{0} Split {1}{2}", IO.Path.GetFileNameWithoutExtension(sFileName), uCNT, IO.Path.GetExtension(sFileName))
    
         ' Each piece runs from this match to the next match, or to the end of the data
         If uCNT < matches.Count - 1 Then
             uLen = matches(uCNT + 1).Index - matches(uCNT).Index
         Else
             uLen = sData.Length - matches(uCNT).Index
         End If
    
         Using SW As New StreamWriter(sOutFile)
             ' Substring is 0-based like match.Index (Strings.Mid is 1-based)
             SW.Write(sData.Substring(matches(uCNT).Index, uLen))
         End Using
    
         uCNT += 1
     Next
 End Sub

*Note: the example may have typos and may not be complete, as I'm using the site's HTML editor :-/

Everything appears to be working: the recursion supplies all the required files and the regex detects all the required headers. However, the returned index doesn't appear to be the correct position (it's about 3000 bytes short per item, i.e. item 2 would be 6000 bytes out?). The output files generated obviously don't work, as their headers aren't at the beginning of the files.

The help/documentation for Regex isn't any use :-\ or is no longer supported???

Any solution ideas would be most appreciated, and to all you guys wondering why I would want to do this, I ask: why don't you just supply a solution? :-?

*Note: the merged files are binary, or contain a load of binary data, so there are escape characters everywhere. I just want to unmerge the merged files! :-)

Thanks Guys


dotnet-visual-basic

Hi @AlanBarclay-1668 ,
Could you please provide a simple example of the file's content, the result you want, and the result you are getting so far?
This will help us better understand your situation.

AlanBarclay-1668 answered

Hi JiachenLiMFST, sorry for the delay getting back, life took a detour. Unfortunately I can't provide a sample because of the data contained within the files, and since each file contains several files in one, their size is excessive.
To help you imagine what's going on and the possible file structure, I can only suggest you picture a single MP3 file made up of 4-5 different MP3 files all joined together. I want to split these into single files, so using Regex I set 'ID3' as the search pattern. Regex returns the correct number of 'ID3's, however their positions are out by around 3000 bytes for the first match, 6000 for the second match, 9000 for the third and so on. If I split the file at these positions I'll be chopping the end off the previous split :-/
It appears that what should be a simple task has issues, maybe because the data files contain some binary/encrypted data???

Thanks for the response anyway
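
One possible explanation for the drift, offered as an assumption since the files themselves can't be shared: `StreamReader` decodes the bytes as text (UTF-8 by default), so any run of bytes that happens to form a multi-byte character collapses into a single character, and `match.Index` then counts characters rather than bytes. A minimal sketch of the effect follows. The module and function names are hypothetical, and ISO-8859-1 is used only because it maps every byte to exactly one character.

```vb
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module OffsetDriftSketch
    ' Decodes the bytes with the given encoding and returns the character
    ' index of the first "ID3" in the resulting string, or -1 if absent.
    Function HeaderIndex(data As Byte(), enc As Encoding) As Integer
        Dim m As Match = Regex.Match(enc.GetString(data), "ID3")
        Return If(m.Success, m.Index, -1)
    End Function
End Module
```

For the five bytes `C3 A9 49 44 33` (a two-byte UTF-8 sequence followed by "ID3" at byte offset 2), `HeaderIndex(data, Encoding.UTF8)` gives 1, while `HeaderIndex(data, Encoding.GetEncoding("ISO-8859-1"))` gives 2, which is the true byte offset. If that is what is happening here, reading and writing with a one-byte-per-character encoding, or searching the raw bytes directly, should keep the offsets aligned.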


LesHay-2099 answered

Hi

OK, I don't know if this is on the right track.
I have used a jpg image file to create some random byte data by taking random chunks, adding the test header to each, and merging all the chunks. The code then displays the original indexes and the found indexes for comparison.

Then I have tried to separate the chunks back to the originals (I have only dealt with getting the start indexes in this code). The code is a bit messy and can no doubt be improved on. However, it does get the actual accurate indexes.

 Option Strict On
 Option Explicit On
 Imports System.Text
 Public Class Form1
     Dim r As New Random
     Dim header() As Byte
    
     ' for sample, used a jpg file
     Dim sample() As Byte = IO.File.ReadAllBytes("C:\Users\lesha\Desktop\ABC.dat")
     Dim bytes() As Byte
     Dim StartingIndx As New List(Of Integer)
     Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
         ' create random test blocks and merge them
         ' keeping track of indexes
         ' ===========================
         header = Encoding.UTF8.GetBytes("wwwwABC")
         bytes = getblock()
    
         StartingIndx.Add(0)
         ' add 4 more test blocks
         For i As Integer = 1 To 4
             StartingIndx.Add(bytes.Length)
             bytes = bytes.Concat(getblock).ToArray()
         Next
         ' ===========================
    
         ' try and get indexes of blocks which
         ' correspond to originals
         Dim EndingIndx As List(Of Integer) = GetIndexes()
    
         ' compare original indexes with found
         ' indexes (StartingIndx with EndingIndx)
    
         TextBox1.Text = "StartingIndx.Length= " & StartingIndx.Count.ToString & vbCrLf
         TextBox1.AppendText("EndingIndx.Length= " & EndingIndx.Count.ToString & vbCrLf & vbCrLf)
         Dim shortest As Integer = StartingIndx.Count - 1
         If EndingIndx.Count - 1 < shortest Then shortest = EndingIndx.Count - 1
         For i As Integer = 0 To shortest
             TextBox1.AppendText(StartingIndx(i).ToString.PadLeft(12) & EndingIndx(i).ToString.PadLeft(12) & vbCrLf)
         Next
     End Sub
     Function GetIndexes() As List(Of Integer)
         Dim inx As New List(Of Integer)
         Dim start As Integer = 0
         Dim x As Integer = 0
    
         Do
             ' Search only where a complete header can still fit
             Dim count As Integer = bytes.Length - start - header.Length + 1
             If count <= 0 Then Exit Do
             x = Array.IndexOf(Of Byte)(bytes, header(0), start, count)
             If x < 0 Then Exit Do
             ' Compare the rest of the header at the candidate position
             Dim found As Boolean = True
             For i As Integer = 1 To header.Length - 1
                 If bytes(x + i) <> header(i) Then
                     found = False
                     Exit For
                 End If
             Next
             If found Then
                 inx.Add(x)
                 start = x + header.Length ' skip past the matched header
             Else
                 start = x + 1 ' step one byte so an overlapping header is not missed
             End If
         Loop
         Return inx
     End Function
     Function getblock() As Byte()
         Return header.Concat(sample.Take(r.Next(1000, sample.Length - 1000))).ToArray()
     End Function
 End Class
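
The code above stops at finding the start indexes, so here is a rough sketch of the remaining step: cutting the byte array at those indexes and writing each piece out. It is only an illustration under stated assumptions. `SplitSketch`, `SplitAt` and `WriteChunks` are hypothetical names, the offsets are assumed to be in ascending order, and the "Split n" file naming is borrowed from the question.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module SplitSketch
    ' Cuts data into chunks, each starting at one of the given offsets and
    ' running to the next offset (or to the end of the data for the last one).
    Function SplitAt(data As Byte(), offsets As List(Of Integer)) As List(Of Byte())
        Dim chunks As New List(Of Byte())
        For i As Integer = 0 To offsets.Count - 1
            Dim chunkEnd As Integer = If(i < offsets.Count - 1, offsets(i + 1), data.Length)
            ' VB array bounds are inclusive, so the upper bound is length - 1
            Dim chunk(chunkEnd - offsets(i) - 1) As Byte
            Array.Copy(data, offsets(i), chunk, 0, chunk.Length)
            chunks.Add(chunk)
        Next
        Return chunks
    End Function

    ' Writes each chunk beside the source file as "<name> Split <n><ext>".
    Sub WriteChunks(sourcePath As String, chunks As List(Of Byte()))
        Dim folder As String = Path.GetDirectoryName(Path.GetFullPath(sourcePath))
        For i As Integer = 0 To chunks.Count - 1
            Dim outName As String = String.Format("{0} Split {1}{2}",
                Path.GetFileNameWithoutExtension(sourcePath), i, Path.GetExtension(sourcePath))
            File.WriteAllBytes(Path.Combine(folder, outName), chunks(i))
        Next
    End Sub
End Module
```

With the form code above, a call along the lines of `SplitSketch.WriteChunks(path, SplitSketch.SplitAt(bytes, GetIndexes()))` would be the intended use. Because everything stays as raw bytes from `File.ReadAllBytes` through to `File.WriteAllBytes`, no text encoding is involved and the offsets stay exact.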