
Data compression usage

Discussion in 'General Discussion' started by nemequ, Aug 27, 2016.

  1. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    I'm working on creating a new corpus for testing general-purpose data compression codecs and, since I know games are major users of compression, I'd like to make sure they are represented. However, as my knowledge in game design is limited (and I'm being generous there), I'm hoping for some advice in choosing data for the corpus.

    As far as I can tell, for Unity-based games data is stored in "asset bundles". Since (again, AFAICT) random access is important, it wouldn't really be appropriate to just include a single asset bundle as that would cause the entire bundle to be compressed at once. Instead, the corpus would have multiple pieces of data which are tested independently.

    So, what file formats are common in asset bundles? What sizes are typical? Is there anything outside of asset bundles which you regularly compress? What I really need right now is a summary of all the data which a game needs to compress. A table with the extension/content-type, count, and average size (see https://github.com/nemequ/squash-corpus/issues/6#issuecomment-242892981 for an example) would be perfect.

    If you don't feel comfortable posting this information publicly, feel free to PM me here or contact me some other way.

    Once I have a handle on the types of things which are common in asset bundles, I plan to start looking for data I could use which is licensed in a usable way. It doesn't have to be open-source, though that would be best; as long as it is redistributable for the purposes of benchmarking, codec development, etc., it's acceptable. I'm getting a bit ahead of myself, but if anyone has some data they could share I'd be very interested.

    Note: I'm also trying to contact people using other engines. If you're interested in the overall results, I intend to summarize everything at https://github.com/nemequ/squash-corpus/issues/6.
     
    Last edited: Aug 27, 2016
  2. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    Obviously I was hoping for more of a response. Perhaps I shouldn't have posted on a Friday evening, or perhaps it would help to explain why helping with this may be beneficial to you in the long term… TL;DR: by participating, future compression codecs will be better tuned for the type of data you are using.

    The main benefit is that people developing compression algorithms will be using this corpus to help tune their codecs. As soon as the corpus is ready I'll be switching the Squash Compression Benchmark over to it, and I know several codec developers have been using that benchmark to help tune their codecs. There are also other benchmarks, and of course codec developers download the data to run their own tests.

    It's important to understand that compression algorithms vary wildly based on the type of data they are processing in terms of compression ratio, speed, and memory usage. This isn't just the fact that plain text compresses much better than, for example, a JPEG (though that is true, of course). Some codecs are able to compress text very well and quickly but can't compress images nearly as well as other codecs or take vastly more time/memory to do so. Some codecs do a great job with images, but have very poor results for text. Some codecs work well for small files, some for large, etc.

    The current standard corpus, the Silesia Compression Corpus, was developed in 2003 and doesn't really reflect modern usage, but it's what people are still using to benchmark and tune codecs because there isn't anything else.

    So, if you can provide a brief summary of what types of content you are (or would like to be) compressing, there is an excellent chance that future codecs (and possibly future versions of existing codecs) will compress your data better, faster, and/or using less memory. If, OTOH, game developers don't help, the new corpus will probably not include any data relevant to games and people will optimize compression codecs for the data it does include, quite possibly at the expense of the data you want to compress.
     
  3. TonyLi

    TonyLi

    Joined:
    Apr 10, 2012
    Posts:
    12,533
    The bulk of data in a game is images and audio.
     
    Martin_H and Kiwasi like this.
  4. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    Thanks, but casual observations like that aren't particularly helpful for my purposes. What I really need is a summary including things like:
    • Number of files
    • Total size
    • Average file size
    For each format.

    I also need more specific information than "images" and "audio": the actual file format. Different file formats have very different characteristics, and compressors often perform very differently on each, sometimes even on different subsets of the same category (e.g., Opus and MP3 are both lossy audio formats, but some codecs handle them very differently). Off the top of my head, for images you can have
    • Lossy
      • JPEG
      • JPEG2000
      • Textures
        • DXTn
        • ASTC
        • ETC(2)
        • PVRTC
    • Lossless
      • PNG
      • FLIF
      • TARGA
      • JPEG2000
    • 3D image formats?
    Audio is in a similar situation. I can also think of a few niche formats which, AFAIK, are unlikely but possible.

    Furthermore, what about things like 3D models and map data? The data I got from Unvanquished shows that audio is smaller than map data for them. I have no idea whether or not Unvanquished is typical, which is why I'm here.

    If all you have is a directory listing or something I'm happy to parse that into a summary; whatever is easiest for you. Note that I'm really only interested in the distributed files, not the source files (i.e., if you generate an Opus from a WAV and don't distribute the WAV, you probably don't really care how big the WAV is).
     
  5. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,327
    I'm fairly sure that almost nobody is going to use jpeg compression in a game anymore.
    I'd expect either png (because it is still compressed despite being lossless), dds (which may be using dxt compression), or some sort of engine-specific "binary blob" (which may be storing textures using dxt compression).
    For audio I'd expect ogg or mp3.

    However, in a low-tech custom engine it wouldn't surprise me to see *.tga, *.bmp and *.wav files... or *.mod (music) files.
     
    Last edited: Aug 31, 2016
    Kiwasi likes this.
  6. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    I appreciate the information, but I really do need what I outlined in my previous posts:
    • Average size — file size has a huge impact on all aspects of performance; compression isn't generally just "X MiB/s", it's a lot more subtle than that.
      • Some codecs have a large initialization time (allocating memory, initializing it, etc.) but are fast once they get going, so they will perform much better for larger files than smaller ones.
      • Some codecs tend not to compress small files very well, but perform well on larger files. Conversely, some codecs are great for small chunks of data but aren't able to take advantage of redundancy which only occurs over longer distances.
    • Total size — this is important because it helps tell me what to prioritize. If you have a thousand 1 KiB files of one type and a single 10 MiB file of another, you probably want to focus on compressing the 10 MiB file. OTOH, if you have a thousand 512 KiB files of one type and one 10 MiB file of another type, squeezing a bit more out of those small files may well be more important.
    • Exact type (not just "textures", but the format you're really using, such as "dds", "astc", "etc2", etc.) — Different formats of the same type of data do compress a bit differently.
    Well, currently the only games I have data on are Unvanquished and Xonotic, which I suspect you would consider relatively low-tech. Unvanquished, for example, does use tga images, but they pack them in pk3 files which are renamed zips, which means they are raw image data compressed with deflate… which is basically what PNG is. If I can't get any good data from the game development community, I'm probably either going to be stuck with their data or simply not include any game data in the corpus.

    Your comments about PNG vs. TGA are interesting. If you're storing all your data in a compressed archive (like, I think, Unity's asset bundles are), it's usually better to have just one layer of compression; for example, if asset bundles use LZHAM, you may actually end up with a better compression ratio if you store the uncompressed image data in the archive directly instead of a PNG, as storing a PNG would mean trying to compress a deflate-compressed image with LZHAM. Furthermore, you only have to do the LZHAM decompression, which cuts processing time roughly in half (LZHAM decompression speed is competitive with deflate, but the ratio is much higher). So, if games really are storing a lot of data in PNGs which are then compressed again with a different codec (or, in the case of pk3 files, the same codec), perhaps it would make more sense to include a raw image format in the corpus and try to teach developers that PNGs in a compressed archive probably aren't the best use of resources.
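    If anyone wants to see this effect on their own data, here's a minimal sketch (the file name is a placeholder, and xz is just standing in for whatever general-purpose codec the archive actually uses, e.g. LZHAM): convert a PNG to a nearly-raw PPM with ImageMagick, compress both, and compare the sizes. Results vary a lot by image, so don't read too much into any single file.

    Code (Bash):
    # texture.png is a placeholder; use one of your own (RGB) images
    convert texture.png texture.ppm       # PPM is raw pixels plus a tiny header (ImageMagick)
    xz -9 -k texture.png                  # general-purpose pass over already-deflated PNG data
    xz -9 -k texture.ppm                  # the same pass over the raw pixel data
    ls -l texture.png.xz texture.ppm.xz   # compare; for some images the single raw+xz pass comes out ahead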

    OTOH, there are lossless formats (like FLIF, GFWX, BPG, WebP's lossless profile) which achieve much higher compression ratios than general-purpose algorithms like deflate, so perhaps including raw image data doesn't make much sense, either… I don't think many modern games will be using WAV for lossless audio instead of FLAC, but PNG or TGA instead of GFWX or WebP for lossless image data seems less certain.
     
    Last edited: Aug 31, 2016
  7. TonyLi

    TonyLi

    Joined:
    Apr 10, 2012
    Posts:
    12,533
    I suggest downloading Unity and some of the example project packages from the Asset Store, such as Survival Shooter, Space Shooter, and 2D Roguelike. The Stealth tutorial is for Unity 4.x, but it's a good example of a larger project. If you make a build, you can review the editor.log file to get a breakdown of the contents. It might not be as detailed as you want, but it should serve as a good start.
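    If you want to script it, something like this should pull the relevant section out of the log after a build; the paths and the section header below are from memory, so double-check them for your Unity version and platform.

    Code (Bash):
    # Typical Editor.log locations (these vary by OS and Unity version):
    #   macOS:   ~/Library/Logs/Unity/Editor.log
    #   Linux:   ~/.config/unity3d/Editor.log
    #   Windows: %LOCALAPPDATA%\Unity\Editor\Editor.log
    # After a build, print the per-asset size listing from the build report
    grep -A 200 "Used Assets and files from the Resources folder" ~/Library/Logs/Unity/Editor.log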
     
    Dustin-Horne, Kiwasi and Martin_H like this.
  8. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,433
    You meant lossless, right?

    Maybe just download Dota 2 and a few other popular f2p games, and see what they have. It doesn't get more "hands on" than actually released games, right? If I tell you what I have in my project currently, that's rather useless information to you, because next week it might be entirely different.


    To me the whole compression thing isn't that big a deal, to be honest. I don't have the resources to create the amount of data that would make me worry about final download size, and as for runtime performance, I'm pretty sure I'm not limited by unpacking speed because everything should have finished loading by the time the game is running.
    Reducing the time until the scene loads might be interesting, but I doubt you can do very much to change that if Unity Technologies themselves don't change the way the data is handled. AFAIK I don't have control over that apart from choosing different compression formats per asset, but I'm not sure; I never really researched this.
     
    Dustin-Horne and angrypenguin like this.
  9. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,516
    It depends on the use case. @TonyLi's suggestion of getting projects and doing builds is a good one, because the format on a build is different to the format in the Editor, and is also potentially different per platform.

    The use of PNGs is usually on the development side, for the reasons @neginfinity mentioned. Once you get it in the engine, though, you typically set it up as a texture, including things like resizing (inputs are often deliberately larger than the intended usage resolution) and selection of compression.

    Anyhow, the data that you're asking for is quite detailed and going to be time consuming for a developer to get. It's also going to vary a lot from game to game, so you're going to need a wide set of examples for your data not to be skewed. I'd suggest making a tool that examines a game's content (one for Unity, one for UE4, at least) and generates the report so that we can share it with you, because I'll be honest and say right now I'm not taking the time to go over my projects in the detail you're asking for when I could be working on them instead.
     
    Kiwasi, Ryiah, Martin_H and 1 other person like this.
  10. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,327
    Yes.

    That kind of detailed information would be too time consuming to gather.

    Modern games do not seem to be concerned with file sizes. The most recent Deus Ex (Mankind Divided) had an install size of 45 gigabytes, which is normal. The majority of files will most likely be textures, followed by voice, and in the case of textures the idea is not to reduce file size on disk, but to ensure that the texture gets loaded quickly, and preferably without any intermediate conversion step.

    The reason why a "low tech" game (meaning someone's pet project written from scratch) may use bmp or tga is that those formats can usually be loaded without a 3rd-party library, by writing your own reader in an OS-independent way. Another issue is that png loading tends to be slower, although the compression might be great, depending on the source material.

    It seems that the purpose of a game archive right now is usually not to compress data, but to obscure it - so people can't easily tear the game apart and rip the models/images/sounds.

    Another issue is that opening a file may be considered a slow operation, due to the possibility of antivirus software intervening, for example. So a good idea might be to make one solid file, grab a file handle to it ONCE, and then read the data from it when necessary (on a 64-bit system, you can just file-map it).

    I'd expect developers to understand that by default. However, I might be overly optimistic.

    Games are concerned with the speed of content loading and with ease of use of authoring tools. In the case of modern games, I'd expect to find fmod banks instead of wav or flac files.

    Readily available engines seem to load data from commonly used formats and then convert it into something internal which is used for the project build. For example, in Unity it is entirely possible to just create a mesh asset in a project without ever importing it. It'll be stored using Unity's internal asset file format and can be stored as text in the project (YAML-based). In the final build it'll be converted into something else, and I don't really have a reason to check into what, exactly.
     
    Martin_H, Ryiah and angrypenguin like this.
  11. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    You can find out from Unity's website what file formats Unity supports for import and what runtime formats it supports (e.g. texture formats, mesh data, etc.). You already listed some of them. Look at the Texture Importer, for example.

    I would think there are quite a lot of different types of data in a built game and it also depends on the platform. Are you trying to compress a project file or the resulting executable build file?

    I think like others said most memory consumption is in the area of texture data, audio, and mesh data. Not everyone uses asset bundles.

    I like the suggestion someone had of downloading the 'free' projects from the asset store and taking a look at them. You can open them up in Unity and see what kinds of files there are.

    Because there are a lot of different types of data, and because Unity already uses various forms of compression (e.g. compressed normals in mesh data), any compressor trying to tackle a whole project would have to support a wide variety of compression algorithms; otherwise it would just be a general-purpose benchmark, like trying to zip or lzma a Unity project folder.

    I also don't think many Unity users will know, or have a way to find out, the kind of detailed rundown of all file types and data sizes that you're asking for, and it probably varies hugely. I have projects that are under 10 megabytes; some people have massive projects in the gigabytes.

    It seems to make sense to have an updated corpus reflecting 'modern' files and their usage, but to go as deep as you need to, I think you'll generally have to do the hard work of that research yourself.
     
    Ryiah likes this.
  12. Mwsc

    Mwsc

    Joined:
    Aug 26, 2012
    Posts:
    189
    To the original poster,

    It might help if you share the motivation for your research.
    As far as I can tell, the use of data compression in game development is well established.
    The actual codecs used are rarely the best ones known; rather, they are a result of what gets the job done, what is practical, and what is available and widely used.

    If this were an academic research paper, you could discuss a hypothetical future world where every authoring tool, engine, driver, GPU, etc, would use the best possible method. Is that the goal? If you want immediately useful results, I'm not clear what could happen. You can't exactly expect the GPU manufacturers to switch to a different compressed texture format just because theirs isn't the best. You can't expect Unity to switch codecs, certainly not on your advice alone.
     
    Kiwasi likes this.
  13. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    This sounds good in theory, but my understanding is that this type of thing is typically bundled into archives. Older formats like pk3 aren't a problem since they're really just zip files, but AFAIK most newer formats are proprietary and may not have tools to extract the contents (like, for example, Unity's asset bundles). I'd be happy to be corrected on that, of course.
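    For what it's worth, the zip-based formats are easy to summarize without extracting anything; a listing like this (the file name is just a placeholder) is all I need:

    Code (Bash):
    # pk3 files are plain zip archives, so the contents can be listed in place
    unzip -l map-pack.pk3      # name, uncompressed size, and date of every member
    zipinfo -l map-pack.pk3    # same, but also shows each member's compressed size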
     
  14. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    Ah, okay that makes sense. As I mentioned, I'm really only interested in the distributed files since that will have a much greater impact (hopefully a few more people play a game than develop it).

    If that's common then that's what should be included in the corpus. I'm a bit surprised, though; IIRC PVRTC requires power-of-two sizes so lots of padding is necessary, but ETC only requires multiple-of-N (IIRC N=4, but I'm really not sure). I haven't looked at newer formats like ETC2 and ASTC.

    Really? After talking to the Unvanquished developers about what to include, I put something together in a couple minutes. IIRC I extracted the contents of the pk3 files, then did something like:

    Code (Bash):
    # one CSV row per file: name, extension, size in bytes
    echo "Name,Extension,Size" > ../files.csv
    find . -type f | while read -r filename; do size=$(wc -c "$filename" | awk '{ print $1 }'); echo "\"$filename\",${filename##*.},$size"; done >> ../files.csv
    # per-extension summary: total size, file count, and average size, sorted by total size
    echo -e ".mode csv\n.import ../files.csv files\nselect Extension,sum(Size) AS total,count(*),sum(Size)/count(*) from files group by Extension order by total desc;" | sqlite3
    It may not be pretty, but it works. Obviously this is with Bash; I have no idea how to accomplish it in PowerShell, but hopefully it's comparable. Anyways, as I said, I'm happy to do the processing if people want to just send me a list of files and sizes (e.g., a recursive directory listing).
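    If even that is too much hassle, a raw recursive listing with sizes is enough for me to work with; something like this does it (the -printf variant needs GNU find):

    Code (Bash):
    # one line per file: size in bytes, then the path (GNU find)
    find . -type f -printf '%s %p\n' > listing.txt
    # or, where -printf isn't available (e.g. BSD/macOS find):
    find . -type f -exec ls -l {} + > listing.txt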

    That's what I'm trying to do. Otherwise I'd just go by what I found in Unvanquished.

    I'm willing to do that, but AFAICT there is no way to inspect a Unity asset bundle. If someone knows of something I'd be willing to try to throw something like this together… Is there no log file generated during the build that would be usable for this?

    I'll look into what Unreal and CryEngine use. Maybe instead of talking to game developers I should be talking to modders; it seems like they may actually have a better idea of the true composition of games than the people creating them.
     
  15. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    It seems like there are some recurring themes in the responses overnight:

    TBH, I wasn't really expecting this to be an issue. As I mentioned in comment #14, it was very quick and easy for me to do this with Unvanquished once I had a basic idea of how the project was structured, which came from a quick discussion with the developers on IRC. It seems like Unity is designed in a way that makes this much more difficult.

    Seeing as how (AFAICT) Unity doesn't offer a way to peek inside asset bundles like I did for Unvanquished (which doesn't use Unity), I don't think looking at them is feasible. I'd hoped there would be some type of log file or even intermediary files left on disk during the build, but I guess either that's not the case or nobody is familiar enough with Unity's build process to get it; either way, it seems that isn't an option either. Unless I can get help from someone at Unity I'm not sure what else I can do.

    Again, proprietary formats ruin this idea. It may be an option for other game engines (and I'll certainly look into it), but for Unity this means peeking inside asset bundles, which seems to be a no-go.

    This seems helpful at first, but AFAICT all the content is going to be in a source format which will be transformed by Unity during the build. Optimizing compression for source content instead of the versions which will be distributed is vastly less interesting because the market is comparatively tiny; basically it's just compressed backups for developers. Optimizing for the compiled versions, on the other hand, helps everyone who downloads or distributes the game.

    Basically, without some sort of cooperation from Unity so I can peek into the asset bundles, I don't think I'm going to be able to get any usable information for Unity games.
     
  16. nemequ

    nemequ

    Joined:
    Aug 27, 2016
    Posts:
    8
    I tried to explain the purpose a bit in comment #2. People developing compression algorithms are currently optimizing them for very old data (Silesia was published in 2004) which, IMHO, wasn't even particularly useful then. That's great if you're trying to compress the collected works of Charles Dickens in plain-text format, but I don't think there is a huge market for that now.

    Furthermore, people who are benchmarking compression codecs are using that same data, which means it's quite likely that the benchmark results are lying to them. The results may be telling them a codec is their best option, but in reality they would be much better off with something else.

    What I'm trying to do is put together a new corpus which helps provide a more reasonable way to evaluate compression codecs. If you're curious about what that means in general, you can take a look at the README file in the Squash Corpus project. As it relates to this conversation, though, I'm here to try to figure out what type of data modern games need to compress for distribution. My only sample right now comes from a game based on the Quake 3 engine, and is hardly a paragon of modernity.

    Okay, I think I see where you (and possibly others) are confused. This is about general-purpose lossless compression algorithms, *not* image compression, texture compression, or audio compression. All those already have special-purpose lossy and lossless codecs, and if people wanted a corpus for them the data to include would be fairly obvious.

    However, AFAICT most games compress this data *again* and add them to archives (Unity asset bundles, *.pk3 files, whatever other game engines use), so it may be appropriate to look at which formats are actually used. However, just saying that people use (for example) DDS textures isn't sufficient. I need to know how to prioritize different file types so that I know what is important to include in the corpus. If a game has a gigabyte of dxtn-compressed textures and a megabyte of audio, the textures are obviously the priority.

    This has nothing to do with choosing (for example) ASTC over DXTn. It has to do with what algorithm is used to compress the ASTC or DXTn data again. Remember, compressed textures are far from optimal: each N×N pixel block in a compressed texture is typically the same size as every other block, because the compression only exploits redundancy within that block (for example, DXT1 encodes every 4×4 block in 8 bytes, so a 1024×1024 DXT1 texture is always 512 KiB regardless of content). These textures are then often compressed again (using a general-purpose lossless algorithm) when they are added to archives, to help reduce load time as well as on-disk size and transfer time/bandwidth.

    There is also a lot of other data which relies more heavily on general-purpose compression than audio, video, and images do. For example, 3D models and map data often benefit greatly from the general-purpose compression stage, and they are a significant chunk of data for many games (for Unvanquished, that data is behind only textures and images, and images are probably only so high because they use uncompressed images and rely on the general-purpose compression algorithm in their archives to achieve a reasonable size).
     
  17. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,516
    For one platform. Like I said, with built Unity games you're looking at potentially different outputs per platform.

    @TonyLi already mentioned the editor log, which outputs a list of assets and their (uncompressed?) sizes after completing a build. (Edit: Actually, that might be from their source formats rather than their output formats. I can't check at the moment.) It just doesn't include AssetBundle contents. Though I guess that it would actually do the job for many simple/casual games, since they often don't use AssetBundles anyway.

    Have you tried opening an AssetBundle in different ways? I haven't checked because I've never had to, but I expect that it's using a standard compression format.

    We have a pretty solid idea of the composition of specific projects, we're just unable to give generalised answers that are of any use because we know every project is different.

    Also keep in mind that you're finding us here specifically because we've chosen to use 3rd-party tools that handle this stuff for us. It might be an area of specific interest to you, but for us it's just one part of a tool chain, and a part that's already serviced quite well.

    You came here asking us to collate detailed tables of information for you, and then telling people that the broad answers that they were giving weren't useful. You're far more likely to get useful responses if you:
    • tell people specifically what to do (eg: "Do a build, find this part of your editor log, and send it to me");
    • automate the fiddly parts for us; and
    • are more positive about the fact that we're taking time to discuss this with you rather than ignoring it and moving on.
    Edit: Between editor logs and the new AssetBundle management system* I suspect you could write a tool that automates this whole thing from the Editor, including checking output formats for each platform and sending you the collated data. Then you could send us a plugin and say "please run this".

    * If you've got access to its source in the Editor, then the generated AssetBundle itself isn't relevant.
     
    Last edited: Sep 1, 2016
    Martin_H and Kiwasi like this.
  18. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,327
    If you're only interested in Unity...

    https://docs.unity3d.com/Manual/AssetBundleInternalStructure.html

    Unity abstracts away the underlying platform, and the engine is not open source. For example, as far as I know, Unity actually uses a non-standard way of storing a normal map in a texture. I have no real reason to care what that way is, though if I were very curious I could dig through the available shader source and check.

    A similar thing applies to other resources. I have no need to know how a resource is stored on each platform, and I have plenty of reasons to expect the storage mechanism to be different for every platform.

    You could check out tools designed to rip resources from Unity games; those could shed more light on the topic.

    ----

    To get data for all the other engines, you'd literally need to grab every game and attempt to investigate its data compression mechanisms, which is time consuming, as I said.

    ----

    For your thesis you could check Unreal Engine, because it allows you to read its engine/editor source. Keep in mind, though, that Unreal 4 is NOT open source, and access to the source code is governed by a license, meaning you can't just grab a hundred lines of code from there and put them into your thesis.
     
    Last edited: Sep 1, 2016
    angrypenguin likes this.
  19. Kiwasi

    Kiwasi

    Joined:
    Dec 5, 2013
    Posts:
    16,860
    Have you considered talking to the people doing the compression? Surely the owners of the various compression tools would have the sense to build in some analytics. Unity itself might have some internal metrics.
     
  20. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    So it sounds like you're doing general purpose compression like a zip or pk or whatever, and want some idea of a 'typical' project or the 'average case' or whatever?