Help using VB.Net Regex to split file at specific header text

Alan Barclay 6 Reputation points
2022-06-23T15:48:34.583+00:00

Hi Guys, my PC developed an issue, and as a result a large collection of files got merged into single files.
I have decided to create a VB.NET application that will scan a set path and process each file it finds, e.g.:

Sub Process_File(ByVal sFileName As String)
  Dim sData As String
  Using SR As New StreamReader(sFileName)
    sData = SR.ReadToEnd()   ' End Using closes and disposes the reader
  End Using

  Dim sPattern As String = "wwww" + Chr(x) + Chr(y) + Chr(z) '*** <- This is the specific search pattern : w = 4 char string, x, y & z = byte value
  Dim uCNT As Integer = 0
  Dim uLen As Integer = 0

  Dim matches As MatchCollection = Regex.Matches(sData, sPattern)

  For Each match As Match In matches
    TextOutput(String.Format("{0} : {1} : {2}", match.Index, match.Length, match.Value) + vbCrLf)

    Dim sOutFile As String = String.Format("{0} Split {1}{2}", IO.Path.GetFileNameWithoutExtension(sFileName), uCNT, IO.Path.GetExtension(sFileName))

    ' length of this piece: up to the next header, or to the end of the data
    If uCNT < matches.Count - 1 Then
      uLen = matches(uCNT + 1).Index - matches(uCNT).Index
    Else
      uLen = sData.Length - matches(uCNT).Index
    End If

    Using SW As New StreamWriter(sOutFile)
      ' Match.Index is 0-based, so use Substring rather than the 1-based Strings.Mid
      SW.Write(sData.Substring(matches(uCNT).Index, uLen))
    End Using

    uCNT += 1
  Next
End Sub

*Note: the example may have typos and may not be complete; I'm using the site's HTML editor :-/

Everything appears to be working: the recursion supplies all the required files and the regex detects all the required headers. However, the returned index isn't the correct position. It's about 3000 bytes short per item (i.e. item 2 would be 6000 bytes out?), so the output files obviously don't work, as their headers aren't at the beginning of the files.

The help/documentation for Regex isn't any use :-\ or is no longer supported ???

Any solution ideas would be most appreciated, and to all you guys wondering why I would want to do this, I ask: why don't you just supply a solution :-?

*Note: the merged files are binary, or contain a load of binary data, so there are escape characters everywhere. I just want to unmerge the merged files! :-)
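
Because the files are binary, one likely cause of the drifting index is text decoding: StreamReader decodes the bytes as UTF-8 by default, and a multi-byte sequence becomes a single character, so a character index in the string can be smaller than the byte offset in the file. The sketch below is only an illustration on made-up data (the module name, function names and sample bytes are assumptions, not taken from the real files):

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module DecodeDriftDemo
    ' Hypothetical sample: the 3 bytes C3 A9 41 followed by the text "ID3".
    ' C3 A9 is one two-byte UTF-8 character, so decoding shrinks it to 1 char.
    Function SampleData() As Byte()
        Dim prefix As Byte() = {&HC3, &HA9, &H41}
        Return prefix.Concat(Encoding.ASCII.GetBytes("ID3")).ToArray()
    End Function

    ' Character index of marker after decoding the bytes as UTF-8 text,
    ' which is what StreamReader does by default.
    Function CharIndexOf(data As Byte(), marker As String) As Integer
        Dim text As String = Encoding.UTF8.GetString(data)
        Return text.IndexOf(marker, StringComparison.Ordinal)
    End Function

    Sub Main()
        Dim data As Byte() = SampleData()
        ' The header really starts at byte offset 3 (after the 3 prefix bytes).
        Console.WriteLine("byte offset of header: 3")
        ' After decoding, the same header is found at character index 2.
        Console.WriteLine("character index after decoding: " & CharIndexOf(data, "ID3").ToString())
    End Sub
End Module
```

Here the index is already one short after a single two-byte sequence; in a large binary file the shortfall would keep growing with every such sequence before each header. Writing the text back with StreamWriter re-encodes it as well, so the output bytes may not match the input bytes either.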

Thanks Guys

VB

2 answers

  1. Alan Barclay 6 Reputation points
    2022-06-30T12:14:39.107+00:00

    Hi JiachenLiMFST, sorry for the delay getting back, life took a detour. Unfortunately I can't provide a sample because of the data contained within the files, and as each file contains several files in one, their size is excessive.
    To help you imagine what's going on, and possibly the file structure, I can only suggest you imagine a single MP3 file made up of 4-5 different MP3 files all joined together. I want to split these into single files, so using Regex I set 'ID3' as the search pattern. Regex returns the correct number of 'ID3's, however their positions are out by around 3000 bytes for the first match, 6000 for the second, 9000 for the third and so on... If I split the file at these positions I'll be chopping the end off the previous split :-/
    It appears that what should be a simple task has issues, maybe because the data files contain some binary/encrypted data ???
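
    If Regex is to stay in the picture, one option (a sketch under assumptions, not a confirmed fix for these particular files) is to read the file as bytes and decode it with ISO-8859-1, which maps every byte to exactly one character, so Match.Index lines up with the byte offset. The module name, function name and sample bytes below are made up for illustration:

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Text
    Imports System.Text.RegularExpressions

    Module Latin1RegexSketch
        ' ISO-8859-1 maps each byte value 0-255 to exactly one character,
        ' so a character index in the decoded string equals the byte offset.
        Private ReadOnly Latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' Returns the byte offset of every occurrence of header inside data.
        Function FindHeaderOffsets(data As Byte(), header As Byte()) As List(Of Integer)
            Dim text As String = Latin1.GetString(data)
            ' Escape the header so bytes such as "." or "(" are matched literally.
            Dim pattern As String = Regex.Escape(Latin1.GetString(header))
            Dim offsets As New List(Of Integer)
            For Each m As Match In Regex.Matches(text, pattern)
                offsets.Add(m.Index)
            Next
            Return offsets
        End Function

        Sub Main()
            ' Hypothetical data: two "ID3" headers separated by non-text bytes.
            Dim data As Byte() = {&H49, &H44, &H33, &HC3, &HA9, &HFF, &H49, &H44, &H33, &H0}
            Dim header As Byte() = Encoding.ASCII.GetBytes("ID3")
            Console.WriteLine(String.Join(", ", FindHeaderOffsets(data, header)))   ' prints 0, 6
        End Sub
    End Module
    ```

    With real files, data would come from IO.File.ReadAllBytes, and the pieces should be written back as bytes rather than as text.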

    Thanks for the response anyway


  2. LesHay 7,126 Reputation points
    2022-06-30T13:11:39.097+00:00

    Hi

    OK, I don't know if this is on the right track.
    I have used a jpg image file to create some random byte data by taking random chunks, adding the test header to each, and merging all the chunks. The code then displays the original indexes and the found indexes for comparison.

    Then, I have tried to separate the chunks back to the originals (I have only dealt with getting the start indexes in this code). The code is a bit messy and can no doubt be improved on; however, it does get the actual accurate indexes.

    Option Strict On  
    Option Explicit On  
    Imports System.Text  
    Public Class Form1  
    	Dim r As New Random  
    	Dim header() As Byte  
      
    	' for sample, used a jpg file  
    	Dim sample() As Byte = IO.File.ReadAllBytes("C:\Users\lesha\Desktop\ABC.dat")  
    	Dim bytes() As Byte  
    	Dim StartingIndx As New List(Of Integer)  
    	Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load  
    		' create random test blocks and merge them  
    		' keeping track of indexes  
    		' ===========================  
    		header = Encoding.UTF8.GetBytes("wwwwABC")  
    		bytes = getblock()  
      
    		StartingIndx.Add(0)  
    		' add 4 more test blocks  
    		For i As Integer = 1 To 4  
    			StartingIndx.Add(bytes.Length)  
    			bytes = bytes.Concat(getblock).ToArray()  
    		Next  
    		' ===========================  
      
    		' try and get indexes of blocks which  
    		' correspond to originals  
    		Dim EndingIndx As List(Of Integer) = GetIndexes()  
      
    		' compare original indexes with found  
    		' indexes (StartingIndx with EndingIndx)  
      
    		TextBox1.Text = "StartingIndx.Length= " & StartingIndx.Count.ToString & vbCrLf  
    		TextBox1.AppendText("EndingIndx.Length= " & EndingIndx.Count.ToString & vbCrLf & vbCrLf)  
    		Dim shortest As Integer = StartingIndx.Count - 1  
    		If EndingIndx.Count - 1 < shortest Then shortest = EndingIndx.Count - 1  
    		For i As Integer = 0 To shortest  
    			TextBox1.AppendText(StartingIndx(i).ToString.PadLeft(12) & EndingIndx(i).ToString.PadLeft(12) & vbCrLf)  
    		Next  
    	End Sub  
    	Function GetIndexes() As List(Of Integer)
    		Dim inx As New List(Of Integer)
    		Dim start As Integer = 0

    		' last position at which a complete header can still start
    		Dim lastStart As Integer = bytes.Length - header.Length

    		Do While start <= lastStart
    			' find the next candidate: a byte equal to the first header byte
    			Dim x As Integer = Array.IndexOf(Of Byte)(bytes, header(0), start, lastStart - start + 1)
    			If x < 0 Then Exit Do

    			' compare the remaining header bytes at the candidate position
    			Dim found As Boolean = True
    			For i As Integer = 1 To header.Length - 1
    				If bytes(x + i) <> header(i) Then
    					found = False
    					Exit For
    				End If
    			Next

    			If found Then
    				inx.Add(x)
    				start = x + header.Length   ' skip past the matched header
    			Else
    				start = x + 1               ' false candidate: resume at the next byte
    			End If
    		Loop
    		Return inx
    	End Function
    	Function getblock() As Byte()  
    		Return header.Concat(sample.Take(r.Next(1000, sample.Length - 1000))).ToArray()  
    	End Function  
    End Class  
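
    The code above stops at the start indexes. A minimal sketch of the remaining step, cutting the byte array at those indexes and writing each piece out as bytes, might look like the following. SplitAt and WriteBlocks are hypothetical helper names, and the "Split n" file naming simply mirrors the scheme used in the question:

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq

    Module SplitSketch
        ' Cuts data into blocks, each running from one start index up to the next
        ' (the last block runs to the end of the data). Assumes starts is ascending.
        Function SplitAt(data As Byte(), starts As List(Of Integer)) As List(Of Byte())
            Dim blocks As New List(Of Byte())
            For i As Integer = 0 To starts.Count - 1
                Dim blockEnd As Integer = If(i < starts.Count - 1, starts(i + 1), data.Length)
                Dim block(blockEnd - starts(i) - 1) As Byte
                Array.Copy(data, starts(i), block, 0, block.Length)
                blocks.Add(block)
            Next
            Return blocks
        End Function

        ' Writes each block next to the source file as "<name> Split <n><ext>".
        Sub WriteBlocks(sourceFile As String, blocks As List(Of Byte()))
            Dim folder As String = Path.GetDirectoryName(Path.GetFullPath(sourceFile))
            For i As Integer = 0 To blocks.Count - 1
                Dim name As String = String.Format("{0} Split {1}{2}",
                    Path.GetFileNameWithoutExtension(sourceFile), i, Path.GetExtension(sourceFile))
                ' WriteAllBytes keeps the data byte-for-byte, with no text encoding step
                File.WriteAllBytes(Path.Combine(folder, name), blocks(i))
            Next
        End Sub

        Sub Main()
            ' Tiny made-up example: 5 bytes with blocks starting at offsets 0 and 2.
            Dim data As Byte() = {1, 2, 3, 4, 5}
            Dim blocks As List(Of Byte()) = SplitAt(data, New List(Of Integer) From {0, 2})
            Console.WriteLine(String.Join(", ", blocks.Select(Function(b) b.Length)))   ' prints 2, 3
        End Sub
    End Module
    ```

    In the form above, the call would be something like SplitAt(bytes, GetIndexes()) followed by WriteBlocks with the path of the merged file.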
    