首页 > 其他分享 >机器学习样本标记 示意代码

机器学习样本标记 示意代码

时间:2023-05-31 12:03:11浏览次数:28  
标签:-% domain 示意 标记 样本 break print path history

目标:根据各个字段数据的分布(例如srcIP和dstIP的top 10)以及其他特征来进行样本标注,最终将几类样本分别标注在black/white/ddos/mddos/cdn/unknown几类。

效果示意:

-------------choose one--------------
sub domain: DNSQueryName(N)
ip: srcip(S) or dstip(D)
length: DNSRequestLength(R1) or DNSReplyLength(R2)
length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)
port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)
code: DNSReplyCode(C2) or DNSRequestRRType(C1)
other: DNSRRClass(RR) or DNSReplyIPv4(V)
-------------label or quit------------
black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)
next(Q) or exit(E)?
***************************************
domain: workgroup. flow count: 206
***************************************
------------srcip-----------------
count                 206
unique                  9
top       162.105.129.122
freq                  150
Name: sourceIP, dtype: object
--------------destip---------------
count             206
unique             12
top       199.7.83.42
freq               82
Name: destIP, dtype: object

 

代码:

import sys
import json
import os
import pandas as pd
import tldextract
# import numpy as np


medata_field = '''
3 = sourceIP
4 = destIP
5 = sourcePort
6 = destPort
7 = protocol
12 = flowStartSeconds
13 = flowEndSecond
54 = DNSReplyCode
55 = DNSQueryName
56 = DNSRequestRRType
57 = DNSRRClass
58 = DNSDelay
59 = DNSReplyTTL
60 = DNSReplyIPv4
61 = DNSReplyIPv6
62 = DNSReplyRRType
77 = DNSReplyName
81 = payload
88 = DNSRequestLength
89 = DNSRequestErrLength
90 = DNSReplyLength
91 = DNSReplyErrLength
'''

medata_field_num = []
medata_field_info = []
for l in medata_field.split("\n"):
    if len(l) == 0: continue
    num, info = l.split(" = ")
    medata_field_num.append(int(num)-1)
    medata_field_info.append(info)
print medata_field_num
print medata_field_info


def extract_domain(domain):
    try:
        ext = tldextract.extract(domain)
        subdomain = ext.subdomain
        if ext.domain == "":
            mdomain = ext.suffix
        else:
            mdomain = ".".join(ext[1:])
        return mdomain
    except Exception,e:
        print "extract_domain error:", e
        return "unknown"


def parse_metadata(path):
    df = pd.read_csv(path, sep="^", header=None)
    dns_df = df.iloc[:, medata_field_num].copy()
    dns_df.columns = medata_field_info
    # print dns_df.tail()

    dns_df["mdomain"] = dns_df["DNSQueryName"].apply(extract_domain)
    # print dns_df.groupby('mdomain').describe()
    # print dns_df.groupby('mdomain').groups
    return dns_df.groupby('mdomain')


def get_data_dist(df, col="sourceIP"):
    # group count by ip dist
    grouped = df.groupby(col)
    # print grouped.head(10)[col]
    print type(grouped.size())
    size = grouped.size()
    print size
    print "-----------top 10-------------"
    print size.nlargest(10)


def get_ipv4_dist(df, col="DNSReplyLength"):
    # group count by ip dist
    df2 = df[df[col] > 0]
    print "filter before length:", len(df), "filter after length:", len(df2)
    grouped = df2.groupby(by="DNSReplyIPv4")
    # print grouped.head(10)[col]
    size = grouped.size()
    print size
    print "-----------top 10-------------"
    print size.nlargest(10)



def move_to(srcpath, domain, dst_path):
    with open(dst_path, "w") as w:
        with open(srcpath) as r:
            for line in r:
                if extract_domain(line.split("^")[55-1]) == domain:
                    w.write(line)


def main():
    history_op = {}
    if os.path.exists("history_op.json"):
        with open("history_op.json") as h:
            history_op = json.load(h)
            print history_op
    for day in range(24, 27):
        for hour in range(0, 24):
            path = "/home/bonelee/latest_metadata_sample/sampled/unknown_sample/debugdogcom-medata_wanted-2017-09-%d-%d.txt" % (day, hour)
            if not os.path.exists(path) or os.path.getsize(path) == 0:
                print path, "passed, file not exists or empty file."
                continue
            print path, "running..."
            try:
                domains_info = parse_metadata(path)
            except IOError, e:
                print e
                continue
            for domain, group in domains_info:
                print "***************************************"
                print "domain:", domain, "flow count:", len(group)
                print "***************************************"
                # print type(group) #<class 'pandas.core.frame.DataFrame'>
                print "------------srcip-----------------"
                print group["sourceIP"].describe()
                print "--------------destip---------------"
                print group["destIP"].describe()
                print "----------------------------------------"
                print "ipv4 address return dist:"
                get_ipv4_dist(group)
                print "----------------------------------------"

                has_judged = False
                need_break = False
                while True:
                    print "-------------choose one--------------"
                    print "sub domain: DNSQueryName(N)"
                    print "ip: srcip(S) or dstip(D)"
                    print "length: DNSRequestLength(R1) or DNSReplyLength(R2)"
                    print "length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)"
                    print "port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)"
                    print "code: DNSReplyCode(C2) or DNSRequestRRType(C1)"
                    print "other: DNSRRClass(RR) or DNSReplyIPv4(V)"
                    dist_dict = {"R1": "DNSRequestLength",
                     "R2": "DNSReplyLength",
                     "R3": "DNSRequestErrLength",
                     "R4": "DNSReplyErrLength",
                     "P1": "sourcePort",
                     "P2": "destPort",
                     "T": "DNSReplyTTL",
                     "C2": "DNSReplyCode",
                     "C1": "DNSRequestRRType",
                     "RR": "DNSRRClass",
                     "V": "DNSReplyIPv4",
                     "S": "sourceIP",
                     "D": "destIP",
                     "N": "DNSQueryName"
                     }

                    print "-------------label or quit------------"
                    print "black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)"
                    print "next(Q) or exit(E)?"
                    domain = domain.lower()
                    if "win" == domain[-len("win"):] or "site" == domain[-len("site"):] or "vip" == domain[-len("vip"):]:
                        check = "U"
                        need_break = True
                    elif "lan" in domain or "local" in domain or "dhcp" in domain or "workgroup" in domain or "home" in domain:
                        check = "DDOS"
                        need_break = True
                    elif "cdn" in domain:
                        check = "CDN"
                        need_break = True
                    else:
                        if domain in history_op and not has_judged:
                            print "found history op:", history_op[domain]
                            if not raw_input("OK(Enter for Y)?"):
                                check = history_op[domain]
                                need_break = True
                            else:
                                check = raw_input("Input:")
                        else:
                            check = raw_input("Input:")
                    has_judged = True
                    if check == "Q":
                        print path, "next OK!"
                        break
                    elif check == "E":
                        print path, "Exit!"
                        with open("history_op.json", "w") as f:
                            json.dump(history_op, f)
                            print "saved history_op.json"
                        sys.exit()
                    elif check == "B":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_black/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "B"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "W":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_white/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "W"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "L":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_white_like/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "L"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "CDN":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_cdn/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "CDN"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "DDOS":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_ddos/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "DDOS"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "M":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_mddos/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "M"
                        print "Saved OK!"
                        if need_break: break
                    elif check == "U":
                        move_to(path, domain, "/home/bonelee/latest_metadata_sample/labeled_unknown/2017-8-%d-%d-%s.txt" % (day, hour, domain))
                        history_op[domain] = "U"
                        print "Saved OK!"
                        if need_break: break
                    else:
                        if check in dist_dict:
                            get_data_dist(group, dist_dict[check])
                        else:
                            print "unknown input!Choose the following one:"
            print "*******************************"
            print path, "check over..."
            print "*******************************"


if __name__ == "__main__":
    main()

 

标签:-%,domain,示意,标记,样本,break,print,path,history
From: https://blog.51cto.com/u_11908275/6385867

相关文章

  • 零样本学习(Zero-shot Learning)
    零样本学习是一种机器学习的问题设置,其中模型可以对从未在训练过程中见过的类别的样本进行分类,使用一些形式的辅助信息来关联已见和未见的类别。例如,一个模型可以根据动物的文本描述来识别动物,即使它从未见过那些动物的图像。实现零样本学习有不同的方法,取决于辅助信息的类型和学......
  • python dijkstra 最短路算法示意代码
     defdijkstra(graph,from_node,to_node):q,seen=[(0,from_node,[])],set()whileq:cost,node,path=heappop(q)seen.add(node)path=path+[node]ifnode==to_node:returncost,pathfora......
  • 列级约束和标记约束
    1.列级约束在定义列语句中,例snointprimarykeyauto_increment,    --(学生表学号,int型,约束:主键(非空且唯一)+自增)sexchar(1)default'男'check(sex='男'orsex='女'),  --(性别默认为男,性别只能为男或女)scoredouble(3,1)notnull,    --(得分为浮点数......
  • [更新中] [论文阅读 & 实践] 小样本下的GAN探究记(下)
    GAN的进一步探究书接上回,因为本人在GAN才刚入门,尽管直接调用别人的ACGAN得到的效果已经很棒,但是我还是想要继续看一下不同参数对于生成图像的影响,帮助自己去更好理解.学习率调节为了平衡D和G的过程,我们可以适当更改学习率.这个小trick甚至有个专门的名字叫TwoTime-......
  • Doxygen Comment Tags 编程辅助工具注释标记 @brief @param @return
    以@brief@param@return等形式出现的注释标记被称为Doxygen注释标记(DoxygenCommentTags)或者简称为Doxygen标记。这玩意常用来作为接口文档说明或者接口源码注释,又比如EmmyLua用这个来作为lua函数的注释以提供智能提示。常见的doxygen注释标记及其作用:/***@biref简介*......
  • 查找xml cdata 节点并去掉 CDATA 节点标记
    内容如下<?xmlversion="1.0"encoding="UTF-8"?><transstatus><transname>C-QXZSPJ-QXZSPJ-015-1</transname><id>2ca8df47-3b2d-4d8f-8b70-be68448dd97d</id><status_desc>Finished</status......
  • 第十六篇——学会标记函数,简单实现通达信指标公式做标记(从零起步编写通达信指标公式系
    前面两篇文章介绍了通达信指标公式的画线函数,今天给大家介绍绘图函数的第二种类型——标记函数,讲解DRAWICON、DRAWTEXT、DRAWNUMBER的具体用法。标记函数可以给指标发出的信号做醒目的标记,方便我们查看信号。 一、DRAWICON函数 含义:绘制图标 使用方法: DRAWI......
  • org.apache.jasper.JasperException: /pages/role-list.jsp (行.: [145], 列: [8]) 根
    org.apache.jasper.JasperException:/pages/role-list.jsp(行.:[145],列:[8])根据标记文件中的TLD或attribute指令,attribute[items]不接受任何表达式 web.xml中版本号不兼容产生的问题;解决方法:<%@taglibprefix=“c”uri=“http://java.sun.com/jstl/core”%>改为<%@t......
  • 文本标记-补充
    文本标记问题-挖洞404-博客园(cnblogs.com),根据前面的阐述,进一步解决标记问题。1、两种场景一是基于命令行,可以通过直接给出各参数点的起止索引,可以给出参数名称进而标记对应的值,可以给出文本匹配进行标记,可以自动的根据策略进行标记。二是基于gui,除了以上四种方式,还可以......
  • 《JavaScript权威指南第七版》13.3.4实现细节,关于“ES2017解释器可以把函数体分割成一
    读到“ES2017解释器可以把函数体分割成一系列独立的子函数,每个子函数都被传给位于他前面以await标记的那个期约的then方法”这一部分是比较困惑,也没有代码示例,很抽象,不易理解。自己写了个例子来复述一下这段话:functiongetPosts(){returnnewPromise(function(resolve,......