
Unzip


Elixir library to stream zip file contents. Works with remote files. Supports Zip64.

Overview

Unzip tries to solve the problem of unzipping files from different types of storage (AWS S3, SFTP servers, in-memory, etc.). It separates the storage layer from the zip implementation: Unzip can stream zip contents from any storage that implements the Unzip.FileAccess protocol. You can selectively stream files from a zip without reading the complete archive, which saves bandwidth and decompression time when you only need a few files. For example, if a zip file contains 100 files and we only want one of them, Unzip accesses only that particular file.

Installation

def deps do
  [
    {:unzip, "~> x.x.x"}
  ]
end

Usage

# Unzip.LocalFile implements Unzip.FileAccess
zip_file = Unzip.LocalFile.open("foo/bar.zip")

# `new` reads the list of file entries from the central directory found at the end of the zip
{:ok, unzip} = Unzip.new(zip_file)

# Alternatively if you have the zip file in memory as binary you can
# directly pass it to `Unzip.new(binary)` to unzip
#
# {:ok, unzip} = Unzip.new(<<binary>>)

# returns the metadata of the already-read file entries
file_entries = Unzip.list_entries(unzip)

# returns decompressed file stream
stream = Unzip.file_stream!(unzip, "baz.png")
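
The stream emits decompressed chunks as iodata. A minimal sketch of consuming it (assuming the archive contains an entry named `baz.png`), either streaming to disk or collecting into a binary:

```elixir
# Stream the decompressed entry straight to disk
# without holding the whole file in memory
Unzip.file_stream!(unzip, "baz.png")
|> Stream.into(File.stream!("baz.png"))
|> Stream.run()

# Or collect it as a single binary (fine for small entries)
binary =
  Unzip.file_stream!(unzip, "baz.png")
  |> Enum.to_list()
  |> IO.iodata_to_binary()
```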

Supports the STORED and DEFLATE compression methods and the Zip64 specification.

Sample implementations of the Unzip.FileAccess protocol

For AWS S3 using ExAws

defmodule Unzip.S3File do
  defstruct [:path, :bucket, :s3_config]
  alias __MODULE__

  def new(path, bucket, s3_config) do
    %S3File{path: path, bucket: bucket, s3_config: s3_config}
  end
end

defimpl Unzip.FileAccess, for: Unzip.S3File do
  alias ExAws.S3

  def size(file) do
    %{headers: headers} = S3.head_object(file.bucket, file.path) |> ExAws.request!(file.s3_config)

    size =
      headers
      |> Enum.find(fn {k, _} -> String.downcase(k) == "content-length" end)
      |> elem(1)
      |> String.to_integer()

    {:ok, size}
  end

  def pread(file, offset, length) do
    {_, chunk} =
      S3.Download.get_chunk(
        %S3.Download{bucket: file.bucket, path: file.path, dest: nil},
        %{start_byte: offset, end_byte: offset + length - 1},
        file.s3_config
      )

    {:ok, chunk}
  end
end


# Using S3File

aws_s3_config = ExAws.Config.new(:s3,
  access_key_id: ["key_id", :instance_role],
  secret_access_key: ["key", :instance_role]
)

file = Unzip.S3File.new("pets.zip", "pics", aws_s3_config)
{:ok, unzip} = Unzip.new(file)
files = Unzip.list_entries(unzip)

Unzip.file_stream!(unzip, "cats/kitty.png")
|> Stream.into(File.stream!("kitty.png"))
|> Stream.run()

For zip file in SFTP server

defmodule Unzip.SftpFile do
  defstruct [:channel_pid, :connection_ref, :handle, :file_path]
  alias __MODULE__

  def new(host, port, sftp_opts, file_path) do
    :ok = :ssh.start()

    {:ok, channel_pid, connection_ref} =
      :ssh_sftp.start_channel(to_charlist(host), port, sftp_opts)

    {:ok, handle} = :ssh_sftp.open(channel_pid, file_path, [:read, :raw, :binary])

    %SftpFile{
      channel_pid: channel_pid,
      connection_ref: connection_ref,
      handle: handle,
      file_path: file_path
    }
  end

  def close(file) do
    :ssh_sftp.close(file.channel_pid, file.handle)
    :ssh_sftp.stop_channel(file.channel_pid)
    :ssh.close(file.connection_ref)
    :ok
  end
end

defimpl Unzip.FileAccess, for: Unzip.SftpFile do
  def size(file) do
    {:ok, file_info} = :ssh_sftp.read_file_info(file.channel_pid, file.file_path)
    {:ok, elem(file_info, 1)}
  end

  def pread(file, offset, length) do
    :ssh_sftp.pread(file.channel_pid, file.handle, offset, length)
  end
end


# Using SftpFile

sftp_opts = [
  user_interaction: false,
  silently_accept_hosts: true,
  rekey_limit: 1_000_000_000_000,
  user: 'user',
  password: 'password'
]

file = Unzip.SftpFile.new('127.0.0.1', 22, sftp_opts, '/home/user/pics.zip')

try do
  {:ok, unzip} = Unzip.new(file)
  files = Unzip.list_entries(unzip)

  Unzip.file_stream!(unzip, "cats/kitty.png")
  |> Stream.into(File.stream!("kitty.png"))
  |> Stream.run()
after
  Unzip.SftpFile.close(file)
end

unzip's People

Contributors: akash-akya, fireproofsocks, olivermt, tmdvs

unzip's Issues

CSV file chunk read with `file_stream!` |> Enum.to_list() returns deeply nested lists

Hey, I'm seeing some weird behaviours as I use this library to parse CSV files nested inside a ZIP file.

The zip file itself is handled well and list_entries works fine. When I access one of the CSV files, however, instead of a smooth stream of chunked strings, each chunk is returned as a list of lists (nested 10 deep in one case), with the actual string buried deep within.

I suspect this is something to do with the Stream.transform in unzip but I'm not experienced enough with stream processing to figure it out - rapidly getting more experienced by the hour at this point, ha.
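
A possible workaround, assuming the chunks are iodata (nested lists of binaries) rather than flat binaries, is to normalize each chunk before parsing (`data.csv` is a hypothetical entry name):

```elixir
unzip
|> Unzip.file_stream!("data.csv")
# flatten each (possibly deeply nested) iodata chunk
# into a plain binary before handing it to the CSV parser
|> Stream.map(&IO.iodata_to_binary/1)
```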

Happy to supply the file.

Cheers

Suggestions of improvements for the project

Thanks for unzip, I'm happy to use your project!

I wonder if you would accept PR for the 2 following changes:

  1. Setting up a CI with GitHub actions (CI is important to help maintenance!)
  2. Removing some warnings due to recent changes in the project (see below)

Let me know what you think, happy to help at some point.

Warnings

I got the following warnings with Elixir 1.13.4:

warning: use Mix.Config is deprecated. Use the Config module instead
  config/config.exs:1

Compiling 5 files (.ex)
warning: "else" shouldn't be used as the only clause in "defp", use "case" instead
  lib/unzip.ex:173

Error isn't given on bad zip file

I've got a bad zip file where, when I try to read it off disk, Unzip.new returns a binary value instead of {:ok, _}. Looking at the code, I think this with statement may be at fault:

https://github.com/akash-akya/unzip/blob/master/lib/unzip.ex#L227

If the with can't match that binary pattern exactly, then the non-matching binary is returned. Maybe that with clause should instead call a function to parse the binary which returns {:ok, _} / {:error, _}.
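
The fall-through behaviour of `with` is easy to reproduce in isolation (a generic sketch, not Unzip's actual code):

```elixir
# `with` returns the non-matching value as-is when a clause
# fails and there is no matching `else` branch
result =
  with {:ok, data} <- <<"raw binary">> do
    {:ok, data}
  end

# result is the raw binary itself, not an {:error, _} tuple
```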

I can't share the zip file, but I can say that it's a KMZ. The reason it's broken (the real problem, which I'm trying to fix separately) is that I save the file via the code below and it somehow ends up one byte short (it goes from 52827 bytes to 52826 bytes):

    dir = System.tmp_dir!()

    temp_path =
      "#{dir}#{Ecto.UUID.generate()}.kmz"
      |> IO.inspect(label: :temp_path)

    file_stream
    |> Enum.into(File.stream!(temp_path, []))

    {:ok, unzip} =
      Unzip.LocalFile.open(temp_path)
      |> IO.inspect(label: :a)
      |> Unzip.new()
      |> IO.inspect(label: :b)

"Erlang error: :data_error" on an apparently valid zip file

Hi there!

I stumbled upon a zip file which seems (I believe) readable in general, and is reported as valid by unzip -t, but which raises an error during parsing with Unzip.

I am not sure yet where the culprit is (in my code, in unzip, or in the underlying Erlang libraries).

Here is a reproduction with data files.

The "bogus file" can be found at:

Mix.install([
  {:req, "~> 0.4.0"},
  {:unzip, "~> 0.11.0"},
  {:nimble_csv, "~> 1.2"}
])

ExUnit.start(trace: true)

defmodule Downloader do
  def local_file_name(url), do: Path.basename(url) <> ".zip"

  def cached_download!(url, local_file) do
    unless File.exists?(local_file) do
      %{status: 200} = Req.get!(url, into: File.stream!(local_file), raw: true)
    end
  end
end

defmodule UnzippingParser do
  def open_zip_file(file) do
    zip_file = Unzip.LocalFile.open(file)
    {:ok, unzip} = Unzip.new(zip_file)
    unzip
  end

  def list_entries(unzip) do
    unzip
    |> Unzip.list_entries()
  end

  def parse_and_get_headers(unzip, "stops.txt" = entry_name) do
    unzip
    |> Unzip.file_stream!(entry_name)
    |> Stream.map(&IO.iodata_to_binary/1)
    |> NimbleCSV.RFC4180.to_line_stream()
    |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
    |> Stream.take(1)
    |> Enum.at(0)
  end
end

defmodule AssertionTest do
  use ExUnit.Case

  def run_test(url) do
    local_file = Downloader.local_file_name(url)
    Downloader.cached_download!(url, local_file)

    headers = UnzippingParser.parse_and_get_headers(UnzippingParser.open_zip_file(local_file), "stops.txt")
    assert "stop_lat" in headers
  end

  test "parses first file" do
    # "GTFS" file for https://www.data.gouv.fr/en/datasets/reseau-cars-region-isere-38/
    run_test("https://www.data.gouv.fr/fr/datasets/r/40ee9d6c-3bb9-409e-b670-986212de63f2")
  end

  test "parses second file" do
    # https://www.data.gouv.fr/fr/datasets/offre-de-transport-cc-jalles-eau-bourde/
    run_test("https://www.data.gouv.fr/fr/datasets/r/6fa85772-cc88-43ae-88b9-290e2f9345ff")
  end
end

I will try to have a closer look at some point, but do not have the time right now, so just capturing the issue for now.

Traces of files

Checksums

❯ cksum *.zip
4016894865 38060335 40287.20240401.061120.295743.zip
4016894865 38060335 40ee9d6c-3bb9-409e-b670-986212de63f2.zip
22202388 136685 6fa85772-cc88-43ae-88b9-290e2f9345ff.zip
❯ unzip -t 40ee9d6c-3bb9-409e-b670-986212de63f2.zip 
Archive:  40ee9d6c-3bb9-409e-b670-986212de63f2.zip
    testing: agency.txt               OK
    testing: calendar_dates.txt       OK
    testing: feed_info.txt            OK
    testing: routes.txt               OK
    testing: shapes.txt               OK
    testing: stops.txt                OK
    testing: stop_times.txt           OK
    testing: transfers.txt            OK
    testing: trips.txt                OK
No errors detected in compressed data of 40ee9d6c-3bb9-409e-b670-986212de63f2.zip.
❯ unzip -t 6fa85772-cc88-43ae-88b9-290e2f9345ff.zip 
Archive:  6fa85772-cc88-43ae-88b9-290e2f9345ff.zip
    testing: agency.txt               OK
    testing: trips.txt                OK
    testing: stops.txt                OK
    testing: stop_times.txt           OK
    testing: shapes.txt               OK
    testing: calendar_dates.txt       OK
    testing: calendar.txt             OK
    testing: routes.txt               OK
No errors detected in compressed data of 6fa85772-cc88-43ae-88b9-290e2f9345ff.zip.

Feature: Support for Gzip

It would be nice if there were support for gzip compression and decompression. Would such a feature be possible? I'm admittedly not super familiar with the file format.
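
Note that gzip is a single compressed stream rather than an archive, so it is arguably out of scope for Unzip; Erlang's built-in :zlib module already handles it. A minimal sketch:

```elixir
# gzip-compress and decompress a binary using the
# Erlang standard library, no extra dependency needed
compressed = :zlib.gzip("hello")
"hello" = :zlib.gunzip(compressed)
```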

Invalid zip file, missing EOCD record but unzip -t returns okay

Hello!

We're getting an error

iex> Unzip.new(Unzip.LocalFile.open("/tmp/gtfs.zip"))
{:error, "Invalid zip file, missing EOCD record"}

with the following file: gtfs.zip while unzip seems to report that the archive is okay.

$ unzip -t gtfs.zip                                                                                                                                                                                     
Archive:  gtfs.zip
    testing: agency.txt               OK
    testing: calendar.txt             OK
    testing: calendar_dates.txt       OK
    testing: routes.txt               OK
    testing: shapes.txt               OK
    testing: stop_times.txt           OK
    testing: trips.txt                OK
    testing: stops.txt                OK
No errors detected in compressed data of gtfs.zip.

Do you know what could be wrong?

Unzip from memory?

At times, I find it useful (although it won't work with huge files) to work with a binary in memory.

I wonder if it would make sense to implement a Unzip.FileAccess for a binary, so that things can work in RAM fully.

Unsure if this should come with the library, or just as a documentation snippet.

Maybe I'll provide an implementation if the case I ran into today shows up again in the future.
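
The README section above already notes that `Unzip.new/1` accepts an in-memory binary directly, so a minimal sketch (assuming `pics.zip` is a hypothetical archive small enough to hold in RAM) could be:

```elixir
binary = File.read!("pics.zip")
{:ok, unzip} = Unzip.new(binary)
entries = Unzip.list_entries(unzip)
```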
