officeextractor's People

Contributors

bormm, dign17, seal-mb, sicos1977, sicos2002

officeextractor's Issues

Embedded Object Name

Hello there,
I was wondering if there is a way to display the names of the embedded objects after the extraction instead of just "Embedded object.docx", "Embedded Word document.doc" etc.
Thanks.

Doesn't throw exceptions

Logger.WriteToLog($"Cant check for embedded object because an error occured, error: {exception.Message}");

I assume the intention was not merely to log thrown exceptions, unless you are using the attached test project.
Using the project in another application (for example via the NuGet package) can be problematic if the user wants to surface errors without resorting to the workaround of regex-matching the exceptions out of the generated log file (or using version 1.10.0, which still throws exceptions).
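
A minimal sketch of the behaviour requested above: log the error and then rethrow it so the calling application still sees the exception. The `Guarded` class, the `Run` method and the `log` callback are hypothetical names used for illustration only; they are not part of OfficeExtractor.

```csharp
using System;

public static class Guarded
{
    // Runs the action; on failure, writes a log line and rethrows the
    // original exception so callers can handle it themselves.
    public static T Run<T>(Func<T> action, Action<string> log)
    {
        try
        {
            return action();
        }
        catch (Exception exception)
        {
            log($"An error occurred, error: {exception.Message}");
            throw; // a bare throw preserves the original stack trace
        }
    }
}
```

With this pattern the log file is still written, but a consumer of the NuGet package can catch the exception directly instead of parsing the log.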

PasswordProtectedChecker Dependency not Reflected in NuGet-Package

I don't know how this stuff works, but the assembly can no longer be used without "PasswordProtectedChecker, Version=1.3.0.0, Culture=neutral, PublicKeyToken=61be40572173cc31". However, the NuGet package does not declare this dependency and so does not install it automatically.

Extracted file order does not match original oleObject order

When there are 10 or more embedded objects, the order of the extracted files does not match the order of the original oleObject files.

Original Filename    New Filename
oleObject1.bin       Embedded object.bin
oleObject10.bin      Embedded object_2.bin
oleObject11.bin      Embedded object_3.bin
oleObject2.bin       Embedded object_4.bin
oleObject3.bin       Embedded object_5.bin
oleObject4.bin       Embedded object_6.bin
oleObject5.bin       Embedded object_7.bin
oleObject6.bin       Embedded object_8.bin
oleObject7.bin       Embedded object_9.bin
oleObject8.bin       Embedded object_10.bin
oleObject9.bin       Embedded object_11.bin

I have fixed this in a local copy of Extractor.cs by ordering the list by the oleObject id extracted from each part's URI:

                // Extracts the numeric id of an oleObject part (e.g. 10 from
                // ".../oleObject10.bin") so the parts can be ordered numerically.
                Func<string, int> getOleObjectId = p => {
                    string fileName = p.Substring(p.LastIndexOf('/') + 1);
                    string retrievedId = fileName.ToLowerInvariant().Replace("oleobject", "").Replace(".bin", "");
                    int objectId;
                    // Names without a parseable id sort after all numbered parts.
                    return int.TryParse(retrievedId, out objectId) ? objectId : int.MaxValue;
                };

                // Get the embedded parts in the correct (numeric) order.
                var parts = package.GetParts()
                    .Where(p => p.Uri.ToString().StartsWith(embeddingPartString))
                    .OrderBy(p => getOleObjectId(p.Uri.ToString()))
                    .ToList();

                foreach (var packagePart in parts)

Strong Name Release?

Now that OpenMcdf is available with a strong name, will OfficeExtractor also be released (including on nuget.org) with a strong name?

PasswordProtectedChecker Dependency

Hello Kees,

with the latest OfficeExtractor release 1.8.1 you added a PasswordProtectedChecker dependency, which relies on the iTextSharp-LGPL package. I can think of scenarios where the current usage of public Result IsFileProtected(string fileName) inside the Extractor class is convenient, but it also introduces a risk of conflicting dependencies. In addition, it is now necessary to comply with the LGPL because of a dependency of OfficeExtractor rather than because of the component itself.

For simplicity, could you consider moving the usage of PasswordProtectedChecker outside of OfficeExtractor?

With kind regards

Dmitry

Embedded Excel workbook differs from original, appearing empty when opened

  1. Create a Word or PowerPoint file
  2. Embed an Excel workbook, can be either .xls or .xlsx
  3. Open with viewer and let it export embeddings
  4. Open the exported Excel
  5. Excel app opens but its window is blank, implying that it cannot read any sheets from the workbook

I've tried this on my own separately and found the only way to export with fidelity to original is to use OLE automation, open the workbook in a new instance of Excel, invoke a Save operation, and then close the instance.

Is there a better way?

Usage question

How could I incorporate this into a PowerShell script that checks for the existence of macros and extracts them if they exist?

Embedded image objects in excel documents cannot be extracted

Hello Kees,

we recently noticed that some embedded image formats (especially bmp and jpg) are not exported from Excel sheets (.xls, 97-2003 version). I also tested the newer Excel container (.xlsx), where embedded .png and .tif work fine, but extracting .bmp and .jpg throws an exception reporting an invalid NULL value (ViewerForm.cs line 85).

For testing I used the 1.4.4 release and the file attached.
EmbeddedBitmap.xlsx

For some strange reason I could not upload the .xls file as a zip, so sorry for this.

I guess this issue is somehow related to "Embedded image objects in word 97-2003 documents cannot be extracted #2" posted by Marco before. It would be nice if you have a solution for this one too.

With kind regards

Dmitry

Fresh Copy from GitHub does not work

I pulled down the master version to check things out and could not get it to work due to several package-related issues.

Test Name: DocWith2EmbeddedImages
Test FullName: OfficeExtractorTest.OfficeExtractorTest.ExtractionTests.DocWith2EmbeddedImages
Test Source: \vmware-host\Shared Folders\Documents\projects\OfficeExtractor-master\OfficeExtractorTest\ExtractionTests.cs : line 71
Test Outcome: Failed
Test Duration: 0:00:00

Test Name: DocWith2EmbeddedImages
Test Outcome: Failed
Result StackTrace:
at OfficeExtractor.Extractor..ctor(Stream logStream)
at OfficeExtractorTest.ExtractionTests.DocWith2EmbeddedImages() in \vmware-host\Shared Folders\Documents\projects\OfficeExtractor-master\OfficeExtractorTest\ExtractionTests.cs:line 73
Result Message:
Test method OfficeExtractorTest.ExtractionTests.DocWith2EmbeddedImages threw exception:
System.IO.FileLoadException: Could not load file or assembly 'PasswordProtectedChecker, Version=1.3.6.0, Culture=neutral, PublicKeyToken=754cbba67bd582b0' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)

I tried tag 1.12 and received similar results. I was able to get tag 1.10 to work properly.

Extracting ole objects - cannot load a non-seekable stream

Hi,

I'm trying to use the OfficeExtractor NuGet package (v1.15.0) in a .NET 5 project to extract PDF objects from docx files. However, the extractor keeps failing with an "OpenMcdf.CFException: Cannot load a non-seekable Stream".

Here's a snippet of the stack trace thrown in my unit test:
[image: stack trace screenshot]

I've done some research and you might want to change packagePartStream to packagePartMemoryStream in the constructor of the CompoundFile.

using (var compoundFile = new CompoundFile(packagePartStream))

Here's a reference to a related issue, where the suggestion also is to copy to a MemoryStream.
dotnet/runtime#28258 (comment)
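
The copy-to-MemoryStream workaround suggested above could look roughly like the following sketch. The `SeekableStreams` class and the `EnsureSeekable` method are hypothetical names chosen for illustration; only `Stream.CanSeek`, `Stream.CopyTo` and `MemoryStream` come from the .NET base class library.

```csharp
using System.IO;

public static class SeekableStreams
{
    // Returns the input unchanged when it already supports seeking;
    // otherwise copies it into a MemoryStream rewound to position 0.
    public static Stream EnsureSeekable(Stream source)
    {
        if (source.CanSeek)
            return source;

        var buffer = new MemoryStream();
        source.CopyTo(buffer);
        buffer.Position = 0;
        return buffer;
    }
}
```

The result of `EnsureSeekable(packagePartStream)` could then be passed to the `CompoundFile` constructor in place of the original package part stream. The trade-off is that the whole embedded object is held in memory.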

Kind regards!

Extracting OLE objects with UNICODE characters in their names.

OfficeExtractor throws an exception if the name of the embedded OLE object contains Unicode characters, e.g. a ZIP file with Chinese characters in its name. The reason is that the name stored in the OLE object can only be ANSI, so the Chinese characters are converted to question marks. The question mark is not allowed in a file name, which leads to a file exception.
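
One possible mitigation, sketched under the assumption that replacing the offending characters is acceptable, is to sanitize the name before writing the file. `FileNames` and `Sanitize` are hypothetical names and not part of OfficeExtractor. Because `Path.GetInvalidFileNameChars()` is platform dependent, the question mark is added to the invalid set explicitly.

```csharp
using System.IO;
using System.Linq;

public static class FileNames
{
    // Replaces every character that is not allowed in a file name
    // (plus '?', regardless of platform) with the given replacement.
    public static string Sanitize(string name, char replacement = '_')
    {
        char[] invalid = Path.GetInvalidFileNameChars()
            .Concat(new[] { '?' })
            .ToArray();

        return new string(name
            .Select(c => invalid.Contains(c) ? replacement : c)
            .ToArray());
    }
}
```

This does not recover the original Unicode name, which is already lost in the ANSI conversion; it only avoids the exception when the file is saved.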

Excel::IsPasswordProtected seems broken or just not needed?

Hi there,

First of all: thank you for this great component!

I have some questions/issues about password-protected files, especially Excel files.

  1. The tests DocxWithPassword, PptxWithPassword and XlsWithPassword currently do not pass.
    I am wondering whether they got broken or never worked.

  2. Excel::IsPasswordProtected throws on Excel95, stating that the stream "WorkBook" doesn't exist.
    Maybe I would fix this, but:

  3. I was unable to protect even an Excel 97 file in a way that makes IsPasswordProtected return true. Even when I protect it from opening, it returns false. It also returns false for your XlsWithPassword test.
    Maybe again, this could be fixed, but:

  4. Extraction also works for fully password-protected Excel files that I can't even open in Excel. All files are correctly exported.

  5. The protection of the Xlsx test file (Excel >= 2007) is correctly detected, but the detection is just by checking for the "EncryptedPackage" stream. So for what revision of Excel is all the code with the "WorkBook" stream? I haven't tested if the export would be possible for such a protected 2007 document. I guess the protection is working in that case.

So in the end: is IsPasswordProtected really needed? Is it needed for Excel > 97?

Support msg (outlook) files

Thank you for your library, very useful.
Could you support extracting embedded files from the msg file type (Outlook email messages)?

Usage question

How would I use this method so I can plug it into my PowerShell script that looks for macros inside Office files? I am not real savvy with Visual Studio. I can use the compiled DLL file, I guess, but how do I make calls to it from my script?

How to Use or Invoke?

I have downloaded from github, and have it as a project. A solution file exists, so it seems complete. However, I have not been able to determine how to run it against an MS Office document.

Usage? [C:\path\to]OfficeExtractor [flags?] MySpreadSheet.xlsx

Thanks!

Embedded image objects in word 97-2003 documents cannot be extracted

Hi Kees,

we just noticed that all embedded image objects (jpg, png) within Word 97 files (stored using Office 2007) are not extracted. I have seen that your test documents contain other Office file objects and text files, but no images. Is there something special about images?

I tried the 1.4.1 and 1.4.3 releases using the OfficeViewer with the attached test file.

Do you know anything about that?
word97-with-embedded-images.zip

Best regards
Marco

Build Instruction

Not having programmed in the Microsoft world for the past 21 years, I'd love to build your great tools under Windows 10.

However, I've not been able to accomplish this task due to errors in dependencies.

Is there a step-by-step how-to anywhere that teaches a new-old-bie how to build OfficeExtractor?

Thank you

Best regards
Lukas

public interface with MemoryStreams

Hi,

I would like to use your library in our project, but we're working with MemoryStreams. Is there a chance to get a public interface of the Extract() method that accepts and returns a MemoryStream, rather than a file and an output directory?

Or alternatively, would you accept a pull request if I branch a version and implement it myself?

While extracting entry '...docx' there was an error: Unsupported OleNative AnsiUserType 'Bitmap Image' foundUnsupported OleNative AnsiUserType

Hello, while using OfficeExtractor, it returns this error message:

While extracting entry '...docx' there was an error: Unsupported OleNative AnsiUserType 'Bitmap Image' foundUnsupported OleNative AnsiUserType

We are trying to produce a file that reproduces this error without including personal data. As a first question: is this issue known, and could it be fixed?
