Spark-mllib特征提取算法
Spark MLlib 提供三种文本特征提取方法,分别为TF-IDF、Word2Vec以及CountVectorizer,其原理与调用代码整理如下: ## TF-IDF ### 算法介绍: 词频-逆向文件频率(TF-IDF)是一种在文本挖掘中广泛使用的特征向量化技术。它将文本中的每个词转换为一个数字,该数字表示该词在文档集合中的重要性。 ### 调用: ```scala import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel} val df = spark.createDataFrame(Seq( (0, Array("a", "b", "c")), (1, Array("a", "b", "b", "c", "a")) )).toDF("id", "words") // fit a CountVectorizerModel from the corpus val cvModel: CountVectorizerModel = new CountVectorizer() .setInputCol("words") .setOutputCol("features") .setVocabSize(3) .setMinDF(2) .fit(df) // alternatively, define CountVectorizerModel with a-priori vocabulary val cvm = new CountVectorizerModel(Array("a", "b", "c")) .setInputCol("words") .setOutputCol("features") cvModel.transform(df).select("features").show() Word2Vec
算法介绍:
Word2Vec是一种用于将词语映射到高维向量空间的技术。通过训练一个神经网络模型,它可以学习到词语之间的相似性。
调用:
import org.apache.spark.ml.feature.Word2Vec // Input data: Each row is a bag of words from a sentence or document. val documentDF = spark.createDataFrame(Seq( "Hi I heard about Spark".split(" "), "I wish Java could use case classes".split(" "), "Logistic regression models are neat".split(" ") ).map(Tuple1.apply)).toDF("text") // Learn a mapping from words to Vectors. val word2Vec = new Word2Vec() .setInputCol("text") .setOutputCol("result") .setVectorSize(3) .setMinCount(0) val model = word2Vec.fit(documentDF) val result = model.transform(documentDF) result.select("result").take(3).foreach(println) Countvectorizer
算法介绍:
CountVectorizer和CountVectorizerModel旨在通过计数来将一个文档转换为向量。当不存在先验字典时,CountVectorizer可作为Estimator来提取词汇,并生成一个CountVectorizermodel。
调用:
import org.apache.spark.ml.feature.CountVectorizer val df = spark.createDataFrame(Seq( (0, Array("a", "b", "c")), (1, Array("a", "b", "b", "c", "a")) )).toDF("id", "words") // fit a CountVectorizerModel from the corpus val cvModel: CountVectorizerModel = new CountVectorizer() .setInputCol("words") .setOutputCol("features") .setVocabSize(3) .setMinDF(2) .fit(df) // alternatively, define CountVectorizerModel with a-priori vocabulary val cvm = new CountVectorizerModel(Array("a", "b", "c")) .setInputCol("words") .setOutputCol("features") cvModel.transform(df).select("features").show()